Why this quarter mattered

We hadn’t all gathered since our June kick-off meeting, so in mid-October we met (online) with all members of the diffUSE project to discuss the overall goals for each project team, the progress made over the first few months, and the goals for the next three months. We emphasized how the different pieces of the project integrate to build methods, data, models, and encodings so the community can routinely use diffuse scattering in basic biology and drug discovery.

What have we completed in the first few months?

Data collection:

  • We have collected ambient-temperature datasets from CHESS for lysozyme, macrodomain, NrdE, and DNA fibers.
  • We have collected data at ALS on Mac1 and Huwe1 using humidity boxes and watershed sleeves with controlled transmission and beam size.
  • We have experimented with temperature modulation to probe dose dependence and mosaicity effects, using samples shipped from UCSF to Cornell.
  • We implemented standardized background frames, uniform sleeve lengths, and precise humidity control to enhance map quality and cross-beamline comparability with an eye toward future multi-site data collection campaigns.
  • We have documented all collection procedures on the diffUSE website logbooks.

Data processing:

  • Developing xia2.multiplex for automated data merging, mdx2 for data extraction, and comprehensive data quality control (QC) workflows.
  • Building graphical user interfaces (GUIs) for mdx2 to improve usability and accessibility.
  • Identifying and resolving bugs that arise when multiple users concurrently process the same datasets.
  • Implementing differentiable refinement by treating molecular dynamics (MD) frame weights as trainable parameters in a Pearson correlation–based objective function.
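
To make the last item concrete, here is a minimal sketch of the idea, not the project's actual code. It assumes each MD frame contributes a precomputed intensity map, that the model map is a weighted average of those maps, and that the weights are kept positive and normalized with a softmax. The synthetic data, the function names, and the plain gradient-ascent loop are all illustrative assumptions; a real implementation would more likely rely on an automatic differentiation framework.

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened maps."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def softmax(theta):
    """Map unconstrained parameters to positive weights that sum to 1."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def cc_and_gradient(theta, frames, observed):
    """Return the CC of the weighted model map and its gradient w.r.t. theta."""
    w = softmax(theta)
    model = w @ frames                        # weighted average over frames
    mc = model - model.mean()
    oc = observed - observed.mean()
    mc_norm, oc_norm = np.linalg.norm(mc), np.linalg.norm(oc)
    cc = mc @ oc / (mc_norm * oc_norm)
    # Derivative of the CC with respect to each voxel of the model map.
    d_model = oc / (mc_norm * oc_norm) - cc * mc / mc_norm**2
    d_w = frames @ d_model                    # derivative w.r.t. each frame weight
    d_theta = w * (d_w - w @ d_w)             # chain rule through the softmax
    return float(cc), d_theta

def refine_weights(frames, observed, steps=500, lr=0.5):
    """Gradient ascent on the CC, starting from uniform frame weights."""
    theta = np.zeros(frames.shape[0])
    for _ in range(steps):
        _, grad = cc_and_gradient(theta, frames, observed)
        theta += lr * grad
    return softmax(theta)

# Synthetic stand-in data (hypothetical): 5 frames, 500 voxels each.
rng = np.random.default_rng(0)
frames = rng.random((5, 500))
true_w = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
observed = true_w @ frames + 0.01 * rng.standard_normal(500)

cc_initial = pearson_cc(np.full(5, 0.2) @ frames, observed)
weights = refine_weights(frames, observed)
cc_final = pearson_cc(weights @ frames, observed)
print("CC with uniform weights:", round(cc_initial, 3))
print("CC with refined weights:", round(cc_final, 3))
print("refined weights:", np.round(weights, 3))
```

The softmax keeps the weights interpretable as populations, and because the Pearson correlation ignores overall scale and offset, the objective only rewards getting the relative pattern of the map right.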

Machine Learning Modeling:

  • Developing pipeline scaffolds to integrate experimental structural data directly into generative model training and evaluation.
  • Creating quantitative metrics for assessing and benchmarking ensemble data.
  • Building a generative water model that learns to predict water molecule positions from protein structure, designed for future integration into broader generative modeling frameworks.

Simulations:

  • We simulated a Mac1 2×2×2 supercell with OPC3 waters and 279,004 atoms; the simulation reaches 150 ns per day on Voltage Park. With refined masking and resampling, the total CC is 0.96 and the anisotropic CC is 0.56 on the H8 dataset, which sets a clear target for larger supercells and ligand or mutant comparisons.
  • Taylor completed his rotation developing a simulator.
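
For readers new to the two correlation coefficients quoted above, the sketch below shows one common convention, assumed here rather than taken from our pipeline: the total CC compares the maps as they are, while the anisotropic CC compares them after subtracting the mean intensity in each resolution shell, so that the dominant isotropic ring no longer inflates the score. The synthetic maps, the shell count, and the function names are hypothetical, and the masking and resampling steps mentioned above are not modeled.

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened maps."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def subtract_shell_means(values, radii, n_shells=20):
    """Remove the isotropic component: subtract the mean of each radial shell."""
    edges = np.linspace(radii.min(), radii.max(), n_shells + 1)
    shell = np.clip(np.digitize(radii, edges) - 1, 0, n_shells - 1)
    sums = np.bincount(shell, weights=values, minlength=n_shells)
    counts = np.bincount(shell, minlength=n_shells)
    means = sums / np.maximum(counts, 1)      # empty shells keep a mean of 0
    return values - means[shell]

def total_and_anisotropic_cc(sim, obs, radii, n_shells=20):
    """Return (total CC, anisotropic CC) for two maps on the same grid."""
    cc_total = pearson_cc(sim, obs)
    cc_aniso = pearson_cc(
        subtract_shell_means(sim, radii, n_shells),
        subtract_shell_means(obs, radii, n_shells),
    )
    return cc_total, cc_aniso

# Synthetic stand-in maps (hypothetical): a strong isotropic falloff plus a
# weak anisotropic pattern, with noise added to the "observed" map only.
axis = np.linspace(-1.0, 1.0, 21)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
radii = np.sqrt(x**2 + y**2 + z**2).ravel()
isotropic = 10.0 * np.exp(-3.0 * radii**2)
anisotropic = 0.3 * (np.cos(6.0 * x) * np.cos(4.0 * y)).ravel()
rng = np.random.default_rng(0)
sim = isotropic + anisotropic
obs = isotropic + anisotropic + 0.3 * rng.standard_normal(radii.size)

cc_total, cc_aniso = total_and_anisotropic_cc(sim, obs, radii)
print("total CC:", round(cc_total, 3))
print("anisotropic CC:", round(cc_aniso, 3))
```

In this toy case the total CC is high because both maps share the same strong isotropic falloff, while the anisotropic CC is lower because it depends only on the weaker directional features. That is the same qualitative gap as between the two numbers reported above.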

Encoding:

  • We continue to advocate for conformational and compositional heterogeneity-encoding strategies.
  • We have developed a script to translate between encodings for multiconformer and ensemble representations.
  • We are working on developing a standalone script and a COOT integration script for conformational heterogeneity.

Infrastructure and Open Science:

  • We have access to Voltage Park compute and S3 storage via the command line, which will make sharing maps and models easier.
  • 16 blog posts and 6 beamtime logbooks to date!

What is up for the next 3 months?

Data Collection: Catalog all existing data, fill gaps, complete background series at ALS, finalize hardened collection procedures that travel across LBL and CHESS, and post collection reports on the site. Develop shared checklists to coordinate and standardize future data-collection cycles.

Data Processing: Converge on a single, documented workflow, generate preliminary maps for all ALS and CHESS datasets, produce fine maps for GOODVIBES and DISCOBALL, stand up a CHESS 2026-1 pipeline, and publish processing reports on the site.

Modeling: Ship an initial pipeline that accepts maps for guided sampling.

Encoding: Land final working group approval, publish the schema and examples, and connect the web app to our catalog so processed maps, models, and metadata are searchable and shareable.

Infrastructure and sharing science: More blog posts!

We will meet in the Bay Area in January and report back more after that.
