<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://diffuse.science/feed.xml" rel="self" type="application/atom+xml" /><link href="https://diffuse.science/" rel="alternate" type="text/html" /><updated>2026-05-12T23:50:06+00:00</updated><id>https://diffuse.science/feed.xml</id><title type="html">The DiffUSE Project</title><subtitle>The DiffUSE Project</subtitle><entry><title type="html">The Tortured Proteins Department, Episode 14</title><link href="https://diffuse.science/posts/TTPD-14/" rel="alternate" type="text/html" title="The Tortured Proteins Department, Episode 14" /><published>2026-05-08T00:00:00+00:00</published><updated>2026-05-08T00:00:00+00:00</updated><id>https://diffuse.science/posts/TTPD-14</id><content type="html" xml:base="https://diffuse.science/posts/TTPD-14/"><![CDATA[<h1 id="episode-14-mastermind">Episode 14: Mastermind</h1>

<p>We chat about NIH strategic plan comments, AI lab guidelines, preprints on cryo-EM vitrification and automated image processing, and training scientists for the future.</p>

<p>NIH Strategic Plan Comments (Due 5/26):</p>
<ul>
  <li><a href="https://grants.nih.gov/news-events/nih-extramural-nexus-news/2026/03/nih-seeks-input-on-framework-for-next-nih-wide-strategic-plan">NIH seeks input on framework for next NIH-wide strategic plan</a></li>
</ul>

<p>Preprints:</p>
<ul>
  <li><a href="https://www.biorxiv.org/content/10.64898/2026.04.21.720011v1">Cooling fast and slow: Characterising the effects of vitrification in cryo-EM and the subsequent recovery of equilibrium populations</a></li>
  <li><a href="https://www.biorxiv.org/content/10.64898/2026.04.16.718662v1">cryoAgent: An agentic workflow for robust and adaptive end-to-end cryo-EM image processing</a></li>
</ul>

<p>Other Links:</p>
<ul>
  <li><a href="https://blekhman.substack.com/p/you-need-to-make-ai-guidelines-for?utm_medium=email">You need to make AI guidelines for your lab</a></li>
  <li>
    <p><a href="https://sciencepolitics.org/2026/03/18/were-training-scientists-for-a-world-that-no-longer-exists/">We’re Training Scientists for a World That No Longer Exists</a></p>
  </li>
  <li><a href="https://open.spotify.com/episode/4HVoEhB7XATEDWiVImfiJn?si=Hwf_7wvOSoqSfvcDCc1xCA">Spotify</a></li>
  <li><a href="https://podcasts.apple.com/us/podcast/mastermind/id1802420696?i=1000766862718">Apple Podcasts</a></li>
</ul>

<iframe data-testid="embed-iframe" style="border-radius:12px" src="https://open.spotify.com/embed/episode/4HVoEhB7XATEDWiVImfiJn?utm_source=generator" width="100%" height="352" frameborder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie.wankowicz@astera.org</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[The Tortured Proteins Department Podcast, Episode 14]]></summary></entry><entry><title type="html">Encoding Heterogeneity in mmCIF Files</title><link href="https://diffuse.science/post/encoding-proposal/" rel="alternate" type="text/html" title="Encoding Heterogeneity in mmCIF Files" /><published>2026-04-29T00:00:00+00:00</published><updated>2026-04-29T00:00:00+00:00</updated><id>https://diffuse.science/post/encoding-proposal</id><content type="html" xml:base="https://diffuse.science/post/encoding-proposal/"><![CDATA[<h2 id="executive-summary">Executive Summary</h2>
<p>PDBx/mmCIF models do not adequately encode the conformational and compositional heterogeneity intrinsic to structural biology data. This problem becomes more pressing for new classes of experiments that frequently or expressly induce such heterogeneity, including fragment screening and time-resolved techniques. Leveraging mmCIF’s expressive nature, we propose a new category that enables users and software to describe heterogeneity by explicitly encoding the relationships between different pieces of heterogeneity.</p>

<h2 id="the-problem">The Problem</h2>
<p>Most PDB depositions result from X-ray crystallography or single-particle cryo-electron microscopy (cryo-EM), which collect information from thousands to trillions of copies of the macromolecular complex. Each copy of the macromolecular complex can vary in the relative positions of atoms within it (conformational heterogeneity) or in the presence/absence of part of the complex (compositional heterogeneity). Conformational heterogeneity can occur across length scales from a single atom (e.g. a terminal hydroxyl in serine partially occupying two locations) to entire domain movements (e.g. the ribosome stalk partially occupying two locations)1–6. Compositional heterogeneity can also occur across length scales from an ion to a protein subunit7–11. Models intended to describe this experimental data must be able to effectively capture this inherent conformational and compositional heterogeneity; however, the current encoding falls short.</p>

<p>Currently, in PDBx/mmCIF files, harmonic conformational heterogeneity is captured by B-factors or translation-libration-screw (TLS) parameters. In contrast, non-harmonic conformational or compositional heterogeneity is captured through the alternative location indicator (altloc). Altlocs indicate the mutually exclusive positions any given atom can occupy, providing coordinates, occupancy, and B-factors for each position. Multiple atoms are often grouped into the same altloc, indicating that, on a local level, they are part of the same state.</p>

<p>This convention falls well short when it becomes necessary to describe how altlocs are related to each other12. It has long been recognized that additional information on the relationships between altlocs is necessary to adequately fit these data13,14. However, this information is provided only within refinement programs (via grouped occupancy) and is not explicitly stated in deposited models.</p>

<p>For example, conditional on the presence or absence of a ligand, two ethylene glycol (EDO) molecules may be present, one of which can occupy two mutually exclusive states (Figure 1A). While there are ways to apply restraints and constraints in refinement, this conditionality cannot be expressed in PDBx/mmCIF altlocs. This limitation hurts our ability to read a model and understand the downstream</p>

<h2 id="biological-applications">Biological Applications</h2>
<p>While the interplay between conformational and compositional heterogeneity is present in all structural experiments, accurate encoding has particularly large implications for several classes of experiments that frequently or explicitly induce such heterogeneities, including fragment screening and time-resolved techniques.</p>

<p>Prominent use cases include:</p>

<p>X-ray crystallography fragment screening: This increasingly popular drug discovery technique involves collecting hundreds to thousands of crystal structures of small-molecule fragments bound to proteins. Collectively, these structures provide building blocks for ligands by revealing diverse small-molecule starting points15. The challenge arises because the fragment or ligand is bound to only a subset of macromolecules, inducing partial occupancy changes across states and resulting in experimental data with high compositional and conformational heterogeneity10,16,17. Such screens make up an increasing fraction of depositions in the PDB and are of significant importance to the pharmaceutical industry. A clear understanding of the hierarchical heterogeneity is essential for future chemical modifications and for using these data in automated methods or machine learning.</p>

<p>Time-resolved experiments: Time-resolved X-ray crystallography or cryo-EM involves collecting multiple datasets over time as the sample undergoes a dynamic process. To achieve this, the sample (typically protein crystals) is rapidly subjected to a specific stimulus (e.g. a change in temperature, light, or an electric field). The stimulus induces structural rearrangements in the protein, and diffraction data are collected at multiple time points, enabling real-time tracking of these changes. These studies are critical to understanding protein function in terms of time-domain structural heterogeneity, such as catalysis, ligand binding, or macromolecular machines18–22. While experiments often yield multiple depositions linked to different maps, within a single deposition the ground state contains a mixture of conformers and the excited state also contains a mixture of conformers, although we expect different occupancies. The inability to express the dependencies among occupancy groups in PDBx/mmCIF means that the structure can be understood only with the accompanying literature, which is problematic for automated methods and machine learning.</p>

<p>Other experiments purposefully seek to enrich compositional heterogeneity. In these cases, a single collection reflects data from many biochemically distinct samples. For example, a collection of small molecules in a “cocktail” can be soaked into the same protein crystal, resulting in a single protein bound to many ligands, potentially in the same binding site23. In another example, native complexes with multiple binding partners can be purified for cryo-EM24 or imaged by cryo-electron tomography (cryo-ET)6,25.</p>

<h2 id="proposed-encoding-of-heterogeneity-in-pdbxmmcif">Proposed encoding of heterogeneity in PDBx/mmCIF</h2>
<p>Thanks to the more expressive data model of PDBx/mmCIF, it is now possible to explicitly encode heterogeneity-related relationships that are implicit or ambiguous in current representations. To enable this, we propose a new category, made up of two tables, that directly encodes (i) the hierarchical relationships among compositional and conformational states and (ii) the coexistence or mutual exclusivity of those states. Together, these two tables provide a formal mechanism for representing both the dependency structure and logical compatibility among heterogeneous states in multiconformer or ensemble models.</p>

<p>The core table, termed the heterogeneity hierarchy, is a mandatory loop that links heterogeneity identifiers associated with atoms in the _atom_site category (and, when applicable, model IDs) to one another through explicit parent–child relationships. This hierarchy encodes how individual heterogeneous states (such as alternative ligand-binding events, conformational substates, or compositional variants) are related, allowing users and software to reconstruct a structured view of heterogeneity rather than relying on proximity-based altloc conventions alone. The hierarchy table is complemented by a second table, the state coexistence table, which explicitly specifies which heterogeneity states may or may not co-occur.</p>

<p>Within the _atom_site category, we introduce a heterogeneity ID. This identifier is optional for PDB entries that do not require explicit heterogeneity encoding, but becomes mandatory for atoms that participate in the heterogeneity hierarchy. In some cases, heterogeneity IDs may correspond directly to existing altloc identifiers; however, unlike altlocs, each heterogeneity ID must represent a distinct biochemical or structural state. For example, two spatially distant atoms may both be labeled with altloc “A” under current conventions, even though there is no evidence that they represent the same underlying state. In such cases, distinct heterogeneity IDs would be assigned, allowing their relationships (or lack thereof) to be represented unambiguously within the hierarchy.</p>

<p>The heterogeneity hierarchy category uses four fields: name, id, parent, and details. The name field is mandatory and provides a short, human-readable label for the state. The id field is the category key and serves as the formal link between this table and the _atom_site category. The optional details field allows descriptive text (quoted if it contains spaces). The parent field references another heterogeneity ID and encodes the hierarchical structure: each heterogeneity ID has at most one parent, whereas any parent may have multiple children. This constraint enforces a tree-like hierarchy that captures how states refine or branch from one another. By traversing parent–child relationships, software can explicitly reconstruct the hierarchy of heterogeneity encoded across the atomic model.</p>

<p>Unlike traditional altloc usage, where identical altloc identifiers implicitly define a single mutually exclusive state within a local clash radius, the heterogeneity hierarchy provides a global, explicit description of state relationships. When conformational or compositional states are known to be linked biochemically, physically, or mechanistically, they may share the same altloc and heterogeneity ID. Conversely, when no such linkage is known, distinct heterogeneity IDs are assigned, even if the altloc labels coincide. This ensures that altloc usage remains compatible with existing practice while heterogeneity IDs provide the formal semantic layer needed to interpret relationships correctly.</p>

<p>Mutual compatibility among states is encoded separately in the state coexistence table. This table operates on heterogeneity IDs defined in the hierarchy and specifies logical rules (e.g., OR or NOT) that determine whether states may co-occur. Hierarchical relationships inherently encode AND relationships (parent and child states are coexistent by definition), whereas the coexistence table captures non-hierarchical logical constraints. Importantly, if no coexistence rule is specified between two heterogeneity IDs, they are assumed not to be mutually exclusive. As with altlocs, individual heterogeneity IDs are not expected to represent complete models in isolation; rather, the full hierarchy and coexistence definitions together should account for all atoms and fully explain the experimental data.</p>
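
<p>The tree constraint on the hierarchy table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the proposal: only the four fields (name, id, parent, details) come from the text, while the HeterogeneityState class, the index_states, lineage, and children helpers, and the example rows are hypothetical.</p>

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HeterogeneityState:
    """One row of the hierarchy table (hypothetical in-memory form)."""
    id: str                        # category key, linked from _atom_site
    name: str                      # short human-readable label
    parent: Optional[str] = None   # at most one parent per state
    details: Optional[str] = None  # optional free text


def lineage(by_id: Dict[str, HeterogeneityState], state_id: str) -> List[str]:
    """Walk from a state up to its root, child first; raise on a cycle."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = state_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at {current}")
        seen.add(current)
        path.append(current)
        current = by_id[current].parent
    return path


def index_states(rows: List[HeterogeneityState]) -> Dict[str, HeterogeneityState]:
    """Index rows by id, rejecting duplicate ids, unknown parents, and cycles."""
    by_id: Dict[str, HeterogeneityState] = {}
    for row in rows:
        if row.id in by_id:
            raise ValueError(f"duplicate heterogeneity id: {row.id}")
        by_id[row.id] = row
    for row in rows:
        if row.parent is not None and row.parent not in by_id:
            raise ValueError(f"{row.id} references unknown parent {row.parent}")
    for row in rows:
        lineage(by_id, row.id)
    return by_id


def children(by_id: Dict[str, HeterogeneityState], state_id: str) -> List[str]:
    """Return the ids of all direct children of a state, sorted."""
    return sorted(s.id for s in by_id.values() if s.parent == state_id)


# Hypothetical example: a ligand-bound state with two loop substates.
rows = [
    HeterogeneityState(id="1", name="ligand_bound", details="fragment present"),
    HeterogeneityState(id="2", name="loop_open", parent="1"),
    HeterogeneityState(id="3", name="loop_closed", parent="1"),
]
states = index_states(rows)
print(lineage(states, "2"))
print(children(states, "1"))
```

<p>Because each row stores a single parent value, the "at most one parent" rule holds by construction; the remaining checks only need to catch dangling references and cycles.</p>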

<h2 id="examples">Examples</h2>
<p>A few examples are visually laid out here. Example mmCIF models and other updated information can be found in the <a href="https://github.com/diff-use/mmcif_encoding">Encoding GitHub Repository</a>.</p>

<p>Consider a common situation in fragment screen data with extensive compositional heterogeneity. This structure has four ligands: three EDO molecules and one small fragment (Figure 1A). Based on occupancy, biochemical information, and overlap, it is known that when the fragment or EDO1 is bound, either EDO2 or EDO3 can also be bound. Conversely, when EDO2 or EDO3 is bound, either EDO1 or the fragment is bound (Figure 1B). This example is shown in the proposed heterogeneity table in Figure 1C, and Figure 1D displays the atom table (which shows only one atom per ligand for simplicity).</p>
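
<p>One way to reason about this example is to enumerate which ligand states can coexist. The sketch below is an interpretation of the prose only, not a transcription of Figure 1C: it assumes that the fragment and EDO1 exclude each other and that EDO2 and EDO3 exclude each other, and it uses hypothetical state labels. Pairs without an explicit rule are treated as compatible, following the default described for the coexistence table.</p>

```python
from itertools import combinations

# Hypothetical state labels for the four ligands in the example.
states = ["fragment", "EDO1", "EDO2", "EDO3"]

# Assumed mutual-exclusion (NOT) rules, read from the prose description.
exclusive_pairs = {
    frozenset({"fragment", "EDO1"}),
    frozenset({"EDO2", "EDO3"}),
}


def is_allowed(combo):
    """A combination is allowed if no pair within it is mutually exclusive."""
    return not any(
        frozenset(pair) in exclusive_pairs for pair in combinations(combo, 2)
    )


# Pairs with no rule between them default to "may coexist".
allowed_pairs = [c for c in combinations(states, 2) if is_allowed(c)]
allowed_triples = [c for c in combinations(states, 3) if is_allowed(c)]

for pair in allowed_pairs:
    print(pair)
```

<p>Under these assumptions, four two-ligand combinations are allowed and no three-ligand combination is, since any triple must contain one of the excluded pairs.</p>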

<h2 id="situations-not-addressed-by-the-heterogeneity-category">Situations not addressed by the heterogeneity category</h2>
<p>We deliberately chose not to encode the following scenarios in our current model.</p>
<ul>
  <li>Many-to-many relationships. A parent can have multiple children, but a child cannot be associated with multiple parents. Allowing multiple parents does not strike us as scientifically sensible: it would hardcode the commonality between conformations, whereas rationalising such correlations is a downstream task, namely the analysis of what the models mean, and is highly dependent on the questions asked.</li>
  <li>Multiple maps. We did not include the association of multiple maps, such as those used in time-resolved or classification methods in cryo-EM. It is conceivable that a single model, supported by multiple maps, could be created within this hierarchical framework in the future.</li>
  <li>Timescales. The current encoding of hierarchical compositional and conformational heterogeneity is based solely on thermodynamic ensembles without accounting for timescales. However, the framework is flexible enough that timescale information could be incorporated into the table, for example for time-resolved techniques.</li>
</ul>

<h2 id="working-with-existing-software">Working with existing software</h2>
<p>As the proposed loop does not change the atom_site category, it should not break any existing software. However, to fully realize the potential of this encoding, it should be supported by existing structural biology software, including model-building programs10,27–30, refinement programs31–33, and visualization software30,34. Overall, these programs need to account for the heterogeneity loop when present. Additionally, software should account for the possibility that models using the new category may interpret altlocs slightly differently, with each altloc representing a distinct state.</p>

<h3 id="model-building">Model Building</h3>
<p>Model-building software should, to the best of its ability, output the loop and/or a grouped occupancy file describing how altlocs are related. While this loop or grouped occupancy file may need to be manually adjusted for the local heterogeneity hierarchy, it can be automated when clashes determine labeling, as is currently done in qFit and Coot28,30. Determining hierarchical relationships between distal sites cannot be done using information from atom_site and will require external methods and individual input from modelers. Additionally, model-building software should err on the side of assigning new altlocs (_atom_site.label_alt_id) unless there is evidence that two pieces of heterogeneity are linked. In Coot, we imagine a user-defined flag (e.g. use hierarchy category) for assigning altlocs. Coot would also need to re-assign altlocs as altlocs are deleted from the model during manual building. If the hierarchy category exists in the input model, Coot must amend the table as needed. Currently, this would be based solely on user input and/or conflicting information regarding the conformations.</p>

<p>We also encourage the development of tools that enable modelers to specify the hierarchical relationships between altlocs through a grouped occupancy file or by manually creating the hierarchy loop, with the software automatically correcting the labeling of altlocs in the atom table.</p>

<h3 id="refinement">Refinement</h3>
<p>We propose that refinement software include a flag to enable the use of information from the hierarchy category. If the flag is enabled, the software should expect a grouped occupancy file and/or a hierarchy category within the PDBx/mmCIF file, and should then generate constraints and/or restraints from whichever is provided. If a grouped occupancy file is present, it overrides the existing hierarchy category in the PDBx/mmCIF file.</p>

<p>Refinement should read the hierarchy loop from children to parents and/or use a grouped occupancy file to determine which hierarchy levels should have a total occupancy equal to or less than one. Similar to Refmac’s current “occupancy group alts complete” parameter, we envision a modified version of this command that would create a restraint ensuring that the occupancies of specific IDs sum to that of a specified parent, reflecting their hierarchical relationship. This would be in addition to the restraint that atoms with the same ID (pdbx_heterogeneity_hierarchy.id) have the same occupancy. In Phenix and SHELX31,35, constraints on mutually exclusive groups can be created using grouped occupancy or PART parameters, but it is currently impossible to encode the hierarchy between these groups. When flagged, refinement software must output the category in the PDBx/mmCIF file, specifying how heterogeneity hierarchy is encoded and refined within the model.</p>

<h3 id="visualization">Visualization</h3>
<p>Visualization software should read the hierarchical heterogeneity loop to display hierarchical heterogeneity. Currently, PyMOL supports the use of altlocs in selection criteria and visualization. We aim for PyMOL to support the new heterogeneity loop in selection and visualization, allowing users to select a state and display all its children and/or parents.</p>
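
<p>The occupancy relationship described in the refinement section (the occupancies of child states summing to that of their parent) can be checked after refinement with a short script. The sketch below is only an illustration: the function name, the dictionary layout, and the tolerance are assumptions, and it does not reflect how Refmac, Phenix, or SHELX implement occupancy restraints.</p>

```python
import math
from collections import defaultdict


def parents_with_bad_sums(parent_of, occupancy, tol=1e-3):
    """Return ids of parents whose children's occupancies do not sum to the parent's.

    parent_of maps each heterogeneity id to its parent id (None for a root).
    occupancy maps each heterogeneity id to its single refined occupancy.
    Both layouts are hypothetical, chosen for this illustration.
    """
    child_sum = defaultdict(float)
    for child, parent in parent_of.items():
        if parent is not None:
            child_sum[parent] += occupancy[child]
    return sorted(
        parent
        for parent, total in child_sum.items()
        if not math.isclose(total, occupancy[parent], abs_tol=tol)
    )


# Hypothetical example: a bound state split into two loop conformations.
parent_of = {"bound": None, "loop_a": "bound", "loop_b": "bound"}
consistent = {"bound": 0.6, "loop_a": 0.4, "loop_b": 0.2}
inconsistent = {"bound": 0.6, "loop_a": 0.4, "loop_b": 0.3}

print(parents_with_bad_sums(parent_of, consistent))
print(parents_with_bad_sums(parent_of, inconsistent))
```

<p>The check works on one occupancy per heterogeneity ID, which presumes that atoms sharing an ID have already been constrained to the same occupancy.</p>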

<h2 id="future-indications">Future Indications</h2>
<p>This proposal is primarily motivated by X-ray crystallography, which enables the precise detection of individual states owing to its high resolution and recent technological advances. While cryo-EM currently captures most of its compositional and conformational heterogeneity through 3D classification of maps36, we are beginning to routinely obtain structures at sufficiently high resolution to capture heterogeneity through both map classification and the encoding of multiplicity within structural models.</p>

<p>Furthermore, the machine-readability and human interpretability of the inherent heterogeneity within models are critical for advancing the use and prediction of conformational ensembles. We anticipate that implementing this model will yield new tools and analyses to examine heterogeneity within and across structures, significantly expanding our understanding of the inherent heterogeneity in experimental structural biology data.</p>

<h2 id="references">References</h2>
<ol>
  <li>Baxevanis, A. D. &amp; Francis Ouellette, B. F. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. (John Wiley &amp; Sons, 2004).</li>
  <li>DePristo, M. A., de Bakker, P. I. W. &amp; Blundell, T. L. Heterogeneity and inaccuracy in protein structures solved by X-ray crystallography. Structure 12, 831–838 (2004).</li>
  <li>Leschziner, A. E. &amp; Nogales, E. Visualizing flexibility at molecular resolution: analysis of heterogeneity in single-particle electron microscopy reconstructions. Annu Rev Biophys Biomol Struct 36, 43–62 (2007).</li>
  <li>Smith, J. L., Hendrickson, W. A., Honzatko, R. B. &amp; Sheriff, S. Structural heterogeneity in protein crystals. Biochemistry 25, 5018–5027 (1986).</li>
  <li>Zhong, E. D., Bepler, T., Berger, B. &amp; Davis, J. H. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nat Methods 18, 176–185 (2021).</li>
  <li>Powell, B. M. &amp; Davis, J. H. Learning structural heterogeneity from cryo-electron sub-tomograms with tomoDRGN. Nat Methods 21, 1525–1536 (2024).</li>
  <li>Forsberg, B. O., Shah, P. N. M. &amp; Burt, A. A robust normalized local filter to estimate compositional heterogeneity directly from cryo-EM maps. Nat Commun 14, 5802 (2023).</li>
  <li>Rabuck-Gibbons, J. N., Lyumkis, D. &amp; Williamson, J. R. Quantitative mining of compositional heterogeneity in cryo-EM datasets of ribosome assembly intermediates. Structure 30, 498–509.e4 (2022).</li>
  <li>Du, S. et al. Refinement of multiconformer ensemble models from multi-temperature X-ray diffraction data. Methods Enzymol 688, 223–254 (2023).</li>
  <li>Pearce, N. M. et al. A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron density. Nat Commun 8, 15123 (2017).</li>
  <li>Pearce, N. M., Krojer, T. &amp; von Delft, F. Proper modelling of ligand binding requires an ensemble of bound and unbound states. Acta Crystallogr D Struct Biol 73, 256–266 (2017).</li>
  <li>Wankowicz, S. A. &amp; Fraser, J. S. Comprehensive encoding of conformational and compositional protein structural ensembles through the mmCIF data structure. IUCrJ 11, 494–501 (2024).</li>
  <li>Hendrickson, W. A. Stereochemically restrained refinement of macromolecular structures. Methods Enzymol 115, 252–270 (1985).</li>
  <li>Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr D Biol Crystallogr 68, 352–367 (2012).</li>
  <li>Correy, G. J. et al. Extensive exploration of structure activity relationships for the SARS-CoV-2 macrodomain from shape-based fragment merging and active learning. bioRxiv (2024) doi:10.1101/2024.08.25.609621.</li>
  <li>Douangamath, A. et al. Achieving Efficient Fragment Screening at XChem Facility at Diamond Light Source. J Vis Exp (2021) doi:10.3791/62414.</li>
  <li>Erlanson, D. et al. Where to house big data on small fragments? ChemRxiv (2025) doi:10.26434/chemrxiv-2025-hjjnj.</li>
  <li>Šrajer, V. &amp; Schmidt, M. Watching Proteins Function with Time-resolved X-ray Crystallography. J Phys D Appl Phys 50, (2017).</li>
  <li>De Zitter, E., Coquelle, N., Oeser, P., Barends, T. R. M. &amp; Colletier, J.-P. Xtrapol8 enables automatic elucidation of low-occupancy intermediate-states in crystallographic studies. Commun Biol 5, 640 (2022).</li>
  <li>Greisman, J. B. et al. Resolving conformational changes that mediate a two-step catalytic mechanism in a model enzyme. bioRxiv (2023) doi:10.1101/2023.06.02.543507.</li>
  <li>Wolff, A. M. et al. Mapping protein dynamics at high spatial resolution with temperature-jump X-ray crystallography. Nat Chem 15, 1549–1558 (2023).</li>
  <li>Thompson, M. C. et al. Temperature-jump solution X-ray scattering reveals distinct motions in a dynamic enzyme. Nat Chem 11, 1058–1066 (2019).</li>
  <li>Verlinde, C. L. M. J. et al. Fragment-based cocktail crystallography by the medical structural genomics of pathogenic protozoa consortium. Curr Top Med Chem 9, 1678–1687 (2009).</li>
  <li>Chen, Z. et al. EMC chaperone-Ca structure reveals an ion channel assembly intermediate. Nature 619, 410–419 (2023).</li>
  <li>Baker, L. A., Grange, M. &amp; Grünewald, K. Electron cryo-tomography captures macromolecular complexes in native environments. Curr Opin Struct Biol 46, 149–156 (2017).</li>
  <li>Westbrook, J. D. et al. PDBx/mmCIF Ecosystem: Foundational Semantic Tools for Structural Biology. J Mol Biol 434, 167599 (2022).</li>
  <li>Flowers, J. et al. Expanding Automated Multiconformer Ligand Modeling to Macrocycles and Fragments. bioRxiv (2024) doi:10.1101/2024.09.20.613996.</li>
  <li>Wankowicz, S. A. et al. Automated multiconformer model building for X-ray crystallography and cryo-EM. Elife 12, (2024).</li>
  <li>Stachowski, T. R. &amp; Fischer, M. FLEXR: automated multi-conformer model building using electron-density map sampling. Acta Crystallogr D Struct Biol 79, 354–367 (2023).</li>
  <li>Emsley, P., Lohkamp, B., Scott, W. G. &amp; Cowtan, K. Features and development of Coot. Acta Crystallogr D Biol Crystallogr 66, 486–501 (2010).</li>
  <li>Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr 66, 213–221 (2010).</li>
  <li>Smart, O. S. et al. Exploiting structure similarity in refinement: automated NCS and target-structure restraints in BUSTER. Acta Crystallogr D Biol Crystallogr 68, 368–380 (2012).</li>
  <li>Murshudov, G. N. et al. REFMAC5 for the refinement of macromolecular crystal structures. Acta Crystallogr D Biol Crystallogr 67, 355–367 (2011).</li>
  <li>Lill, M. A. &amp; Danielson, M. L. Computer-aided drug design platform using PyMOL. J Comput Aided Mol Des 25, 13–19 (2011).</li>
  <li>Schneider, T. R. &amp; Sheldrick, G. M. Substructure solution with SHELXD. Acta Crystallogr D Biol Crystallogr 58, 1772–1779 (2002).</li>
  <li>Kimanius, D., Dong, L., Sharov, G., Nakane, T. &amp; Scheres, S. H. W. New tools for automated cryo-EM single-particle analysis in RELION-4.0. Biochem J 478, 4169–4185 (2021).</li>
</ol>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie.wankowicz@astera.org</email></author><category term="post" /><category term="modeling" /><category term="meta" /><category term="encoding" /><summary type="html"><![CDATA[Explicitly encoding hierarchical heterogeneity in mmCIF files]]></summary></entry><entry><title type="html">Water analysis</title><link href="https://diffuse.science/post/water-analysis/" rel="alternate" type="text/html" title="Water analysis" /><published>2026-04-24T00:00:00+00:00</published><updated>2026-04-24T00:00:00+00:00</updated><id>https://diffuse.science/post/water-analysis</id><content type="html" xml:base="https://diffuse.science/post/water-analysis/"><![CDATA[<p>A cartoon representation of a protein structure in an empty background easily makes one forget about the physical presence of numerous water molecules around the protein and their significance. In fact, a major driving force of protein folding comes from entropic changes from the water molecules surrounding the protein. When taking measurements of protein structures, solvent inevitably contributes, and it can contribute substantially to the experimental data. Consequently, protein structural refinement refines both the protein and its experimental context, including the solvent.</p>

<p>On the DiffUSE project, we have been thinking about solvent modeling and major hurdles that need to be overcome as we model protein ensembles (see our previous <a href="/posts/allhands/">all hands post</a> for more context). In this post, I hope to initiate discussions on why water prediction is hard with some preliminary analysis. In particular, is there something about the deposited data that makes them difficult to learn by data-driven approaches?</p>

<h2 id="how-is-water-modeled">How is water modeled?</h2>

<p>In crystallography, solvent is modeled in a simplified way as either ordered or bulk. Bulk solvent represents disordered water molecules with a flat density. Ordered solvent molecules form hydrogen bonds with the protein and/or other solvent and are represented with atomic coordinates. They have attracted much research interest, as some ordered water molecules can be conserved across protein families and participate in or influence chemical reactions. Several machine learning approaches, such as <a href="https://doi.org/10.1038/s42004-025-01789-4">SuperWater</a>, <a href="https://doi.org/10.1021/acs.jcim.3c01559?urlappend=%3Fref%3DPDF&amp;jav=VoR&amp;rel=cite-as">HydroProt</a>, and <a href="https://doi.org/10.1021/acs.jcim.2c00306?urlappend=%3Fref%3DPDF&amp;jav=VoR&amp;rel=cite-as">GalaxyWater-CNN</a>, attempt to learn directly from the PDB how to place ordered water around proteins. While protein structure predictors trained on the PDB have achieved impressive results, existing water predictors are still not at the same level of out-of-the-box performance.</p>

<p>Unlike for proteins, validation procedures are not enforced for the deposition of ordered water in the PDB. <a href="https://doi.org/10.1107/S2052252524009928">Wlodawer et al.</a> analyzed the expected ratio of the number of ordered waters to the number of residues in the PDB as a function of resolution (check out their <a href="https://bioreproducibility.org/figures/water_paper/fig1B/">interactive figure</a>) and found many structures with an unusual number of water molecules modeled, including some completely waterless structures, which the authors examined in depth. Realistically, especially in high-resolution structures, only the small subset of ordered waters near an active site may be interesting or important enough for the depositor to inspect carefully.</p>

<p>For a scalable approach to modeling ordered water, established refinement programs have built automated protocols. For example, <a href="https://doi.org/10.1107/S2052252514009324">PDB-REDO</a> re-refines structures at scale and removes water molecules that are unsupported by the density. Density-based automated protocols, however, are not perfect, as they are subject to phase errors. Other heuristics are often used as auxiliary criteria to model and evaluate individual ordered water molecules. Hydrogen-bond distances and the number of hydrogen bonds suggest how physically plausible an ordered water is, while the B-factor and density-based metrics such as EDIA show how well the water molecule fits the experimental data.</p>

<h2 id="what-do-the-water-data-look-like">What do the water data look like?</h2>

<p>It is unclear how effective the auto-refinement and popular quality metrics are in providing “clean” data from the perspective of statistical learning. Consider each individual PDB file as containing some water molecules that are true positives and some that are false positives, while missing others. How good is a PDB file as an instance of prediction? One way to explore this question is to examine conserved waters as a surrogate for the ground truth against which files can be reliably evaluated. Our motivation is that the water molecules that consistently show up are the ones we want to predict and are probably more learnable.</p>

<p>For the consensus analysis, I gather isomorphous structures from PDB-REDO for 3 protein cases – 66 structures of hen egg white lysozyme (HEWL), 18 structures of beta-lactamase (TEM1), and 60 structures of protein tyrosine phosphatase 1B (PTP1B) – and cluster the ordered waters with a 1 Å distance cutoff after alignment. Water molecules that show up in at least half of the structures (i.e. cluster occupancy &gt;= 0.5) are considered conserved.</p>
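
<p>To make the clustering step concrete, here is a minimal sketch of one way it could be done. The greedy centroid-based clustering, the function names, and the toy coordinates are all assumptions for illustration, since the post does not specify the clustering algorithm; only the 1 Å cutoff and the occupancy threshold of 0.5 come from the text.</p>

```python
import math

def cluster_waters(structures, cutoff=1.0):
    """Greedily cluster aligned water coordinates (illustrative, not the post's method).

    structures maps a structure id to a list of (x, y, z) water positions that
    are already superposed in a common frame.
    """
    clusters = []
    for sid, waters in structures.items():
        for xyz in waters:
            # Find the nearest existing cluster centroid within the cutoff.
            best, best_d = None, cutoff
            for c in clusters:
                d = math.dist(xyz, c["center"])
                if d <= best_d:
                    best, best_d = c, d
            if best is None:
                clusters.append({"center": tuple(xyz), "n": 1, "members": {sid}})
            else:
                # Update the running mean of the centroid and record the structure.
                n = best["n"]
                best["center"] = tuple(
                    (ci * n + xi) / (n + 1) for ci, xi in zip(best["center"], xyz)
                )
                best["n"] = n + 1
                best["members"].add(sid)
    return clusters

def conserved_clusters(structures, cutoff=1.0, min_occupancy=0.5):
    """Return (center, occupancy) for clusters seen in enough structures."""
    total = len(structures)
    result = []
    for c in cluster_waters(structures, cutoff):
        occupancy = len(c["members"]) / total
        if occupancy >= min_occupancy:
            result.append((c["center"], occupancy))
    return result

# Hypothetical toy data: four aligned structures with a handful of waters.
toy = {
    "A": [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    "B": [(0.2, 0.0, 0.0)],
    "C": [(0.1, 0.1, 0.0), (9.0, 9.0, 9.0)],
    "D": [(5.3, 0.0, 0.0)],
}
conserved = conserved_clusters(toy)
```

<p>In this toy case the water near the origin appears in three of four structures and the one near x = 5 in two of four, so both pass the threshold, while the isolated water in structure C does not. A greedy scheme like this depends on input order; a hierarchical or density-based method would be a reasonable alternative.</p>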

<h3 id="precision-coverage-tradeoff-as-a-function-of-the-number-of-water">Precision-coverage tradeoff as a function of the number of waters</h3>

<figure class="third ">
  
    
      <a href="/assets/images/posts/2026-04-24/precision_coverage_TEM1.png" title="TEM1">
          <img src="/assets/images/posts/2026-04-24/precision_coverage_TEM1.png" alt="" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-04-24/precision_coverage_HEWL.png" title="HEWL">
          <img src="/assets/images/posts/2026-04-24/precision_coverage_HEWL.png" alt="" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-04-24/precision_coverage_PTP1B.png" title="PTP1B">
          <img src="/assets/images/posts/2026-04-24/precision_coverage_PTP1B.png" alt="" />
      </a>
    
  
  
    <figcaption>Each scatter point represents a PDB file and is colored by the number of ordered water molecules it models.
</figcaption>
  
</figure>

<p>The figures above show a spectrum in how well each PDB file models the set of conserved waters when a PDB file is treated as an instance of prediction. In all 3 cases, there is a precision-coverage tradeoff with the number of water molecules modeled, which is highly correlated with the resolution of the structure.</p>

<p>High-resolution structures resolve more ordered water molecules and have good coverage of the conserved waters. Low-resolution structures, however, also contain useful information, as they can be quite precise in capturing the conserved waters. In other words, even though some structures have fewer ordered waters resolved, the ones that can still be modeled are presumably modeled with high confidence. The frontier of the precision-coverage tradeoff, however, varies case by case.</p>
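
<p>As a rough illustration of the tradeoff, the sketch below scores a single structure against a set of conserved water sites. The definitions used here are assumptions, since the post does not spell them out: precision is taken as the fraction of modeled waters that fall within a cutoff of any conserved site, and coverage as the fraction of conserved sites that have a modeled water within that cutoff. All coordinates are made up.</p>

```python
import math

def precision_coverage(waters, conserved_sites, cutoff=1.0):
    """Score one structure's waters against conserved sites (assumed definitions)."""
    if not waters or not conserved_sites:
        return 0.0, 0.0
    # A modeled water is a hit if it lies within the cutoff of any conserved site.
    hits = sum(
        1 for w in waters
        if min(math.dist(w, s) for s in conserved_sites) <= cutoff
    )
    # A conserved site is covered if any modeled water lies within the cutoff.
    covered = sum(
        1 for s in conserved_sites
        if min(math.dist(w, s) for w in waters) <= cutoff
    )
    return hits / len(waters), covered / len(conserved_sites)

# Hypothetical conserved sites and two structures with different water counts.
sites = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
few_waters = [(0.1, 0.0, 0.0), (5.2, 0.0, 0.0)]
many_waters = few_waters + [
    (10.1, 0.0, 0.0), (0.0, 8.3, 0.0),
    (20.0, 20.0, 20.0), (30.0, 30.0, 30.0), (3.0, 3.0, 3.0), (-7.0, 2.0, 1.0),
]

few_scores = precision_coverage(few_waters, sites)
many_scores = precision_coverage(many_waters, sites)
```

<p>Here the sparse model is fully precise but covers only half of the conserved sites, while the dense model covers all of them at half the precision, which mirrors the shape of the tradeoff in the figures.</p>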

<h3 id="consensus-water-is-a-subset-of-the-high-quality-water">Consensus water is a subset of the high-quality water</h3>

<p>Given the high coverage of conserved water in high-resolution structures, one might wonder whether some filtering could improve the precision of water placement. To this end, I made a scatter plot of water conservedness against quality metrics. Conservedness is defined as cluster occupancy, which is the fraction of PDB files that have a water molecule near the cluster center.</p>

<figure class="third ">
  
    
      <a href="/assets/images/posts/2026-04-24/edia_vs_cluster_occupancy_TEM1.png" title="TEM1">
          <img src="/assets/images/posts/2026-04-24/edia_vs_cluster_occupancy_TEM1.png" alt="" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-04-24/edia_vs_cluster_occupancy_HEWL.png" title="HEWL">
          <img src="/assets/images/posts/2026-04-24/edia_vs_cluster_occupancy_HEWL.png" alt="" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-04-24/edia_vs_cluster_occupancy_PTP1B.png" title="PTP1B">
          <img src="/assets/images/posts/2026-04-24/edia_vs_cluster_occupancy_PTP1B.png" alt="" />
      </a>
    
  
  
    <figcaption>Each scatter point represents a cluster center of water across PDB files. The occupancy of a cluster is plotted against the median EDIA score of the water molecules that are members of that cluster. Each cluster is also colored by the median B-factor of the member water molecules.
</figcaption>
  
</figure>

<p>While there is a positive relationship between high EDIA (or low B-factor) and cluster occupancy, clusters in the upper left corner suggest that there are some water molecules that are high-quality but not conserved. They might be unique waters specific to certain experimental contexts (crystallization condition, presence of ligands, protonation states, etc.). Alternatively, waters at those cluster positions might be missing in some structures due to limited resolution, experimental errors, and modeling errors of water as well as protein (e.g. alternative conformations). This makes interpreting the precision evaluation metric especially challenging.</p>

<h2 id="what-should-we-do-about-water">What should we do about water?</h2>

<p>The consensus water analysis here gives a glimpse into the complexity of the data challenges we face in water prediction. Between the true positives and overfitted noise, there are ordered waters in the PDB that are hard to validate, let alone learn. While it is common practice to align and compare related protein structures, similar inspection of water is much rarer and limited to a few case studies. We think it is worth expanding the consensus analysis at scale for more generalizable insights, including identifying potential sources of bias such as the initial model used and depositor preferences.</p>

<p>We understand that consensus water is not equal to the ground truth water, and that physical factors on an apparently small or local scale, like an alternative conformation of a protein side chain, can change the structure of water networks. Predicting water positions from just the protein structure is, strictly speaking, an ill-posed task. That said, one could still try to learn a statistical prior to guide experimental modeling, as in Bayesian refinement and in a similar spirit to <a href="/posts/sampleworks/">Sampleworks</a>. We are building a water predictor for this purpose, so stay tuned for Vratin’s post about this!</p>

<p>Echoing discussion points in <a href="https://doi.org/10.1107/S2052252524009928">Wlodawer et al.</a>, we hope the structural community can work together on the water modeling challenge – whether it’s increasing awareness at the individual depositor level, exploration of new computational modeling methods and evaluation metrics, or infrastructure-level support in the PDB.</p>]]></content><author><name>Doris Mai</name><email>doris.mai@astera.org</email></author><category term="post" /><category term="modeling" /><category term="meta" /><summary type="html"><![CDATA[Case analysis of consensus water in PDB]]></summary></entry><entry><title type="html">The Tortured Proteins Department, Episode 13</title><link href="https://diffuse.science/posts/TTPD-13/" rel="alternate" type="text/html" title="The Tortured Proteins Department, Episode 13" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://diffuse.science/posts/TTPD-13</id><content type="html" xml:base="https://diffuse.science/posts/TTPD-13/"><![CDATA[<h1 id="episode-13-long-live">Episode 13: Long Live!</h1>

<p>We chat about preprints, AI grad students, the diffUSE project, an April Fools blog, NIH funding news, and the Fraser Lab “broken monitor” April Fools’ Day prank — caught live during recording!</p>

<p><img src="https://github.com/user-attachments/assets/6ab7f026-2549-49dc-a33c-2ec204215385" alt="Fraser Lab broken monitor April Fools' Day prank - James Fraser photographing his cracked monitor screen during the recording session with Stephanie Wankowicz" class="img-fluid" /></p>

<p>Preprints:</p>
<ul>
  <li><a href="https://www.biorxiv.org/content/10.64898/2026.03.08.710389v1">Intrinsic dataset features drive mutational effect prediction by protein language models</a></li>
  <li><a href="https://www.biorxiv.org/content/10.64898/2026.03.08.710403v1">Ribosome Molecular Aging Shapes Translation Dynamics</a></li>
</ul>

<p>AI Grad Student Articles:</p>
<ul>
  <li><a href="https://www.science.org/content/article/why-i-may-hire-ai-instead-graduate-student">Why I may ‘hire’ AI instead of a graduate student</a></li>
  <li><a href="https://www.anthropic.com/research/vibe-physics">Vibe physics: The AI grad student</a></li>
</ul>

<p>Other Links:</p>
<ul>
  <li><a href="https://diffuse.science/">diffUSE Project</a></li>
  <li><a href="https://jgreener64.github.io/posts/technical_report/">April Fools Blog</a></li>
</ul>

<p>NIH News:</p>
<ul>
  <li><a href="https://grant-witness.us/funding_curves.html">Funding Curve</a></li>
  <li>
    <p><a href="https://www.pnas.org/doi/10.1073/pnas.2527755123">How the 2025 NIH grant terminations varied by researchers’ demographic groups</a></p>
  </li>
  <li><a href="https://open.spotify.com/episode/7jyXj4QUI4zmZgnfWfaY3x?si=ucQCu_WgShel6V0X4cbzUQ">Spotify</a></li>
  <li><a href="https://podcasts.apple.com/us/podcast/long-live/id1802420696?i=1000759838149">Apple Podcasts</a></li>
</ul>

<iframe data-testid="embed-iframe" style="border-radius:12px" src="https://open.spotify.com/embed/episode/7jyXj4QUI4zmZgnfWfaY3x?utm_source=generator" width="100%" height="352" frameborder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie.wankowicz@astera.org</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[The Tortured Proteins Department Podcast, Episode 13]]></summary></entry><entry><title type="html">Expanding The DiffUSE Project</title><link href="https://diffuse.science/posts/expanding_diffuse/" rel="alternate" type="text/html" title="Expanding The DiffUSE Project" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://diffuse.science/posts/expanding_diffuse</id><content type="html" xml:base="https://diffuse.science/posts/expanding_diffuse/"><![CDATA[<p>Biology is motion. From ecosystems to molecules, nothing is ever still. Life relies on this constant motion to function. Yet we tend to study biology through static snapshots. In structural biology, we often interpret macromolecules as single, static structures. These static structures have been transformative, enabling decades of progress in drug design, disease understanding, and protein engineering. But we are reaching a plateau in the questions we can answer with them. Because proteins do not function as single structures. They function through their conformational ensemble, the range of states a macromolecule can sample. And yet, we have almost no data on these ensembles. The tools to collect and model it at scale do not exist. The infrastructure to share and interpret it has not been built.</p>

<p><strong>What could we discover about biology if we had protein ensembles at the scale of the Protein Data Bank?</strong></p>

<p><em>Could we understand how drug efficacy and resistance arise? Could we better explain why mutations cause disease, or what makes a designed enzyme functional? Could we deliberately engineer dynamics to bias cell signaling?</em></p>

<p>Today, we are excited to announce that the DiffUSE Project is expanding. We are building the infrastructure to collect and analyze protein ensembles at the scale of the Protein Data Bank. Because many of the most important biological questions can only be answered through ensembles.</p>

<p>The DiffUSE Project, by <a href="https://radial.org/">Radial, and funded by the Astera Institute</a>, began by asking how we could rethink the pipeline for extracting the faint yet information-rich diffuse scattering signal in the “background” of X-ray crystallography experiments. Diffuse scattering may provide key, but often untapped, information on the heterogeneity of macromolecules in a crystal. With funding and operational support from the <a href="https://astera.org/">Astera Institute</a>, we assembled a purpose-built team spanning expertise across Astera and multiple institutions and began to re-engineer the components needed to fully exploit this signal and reveal structural ensembles.</p>

<p>But diffuse scattering is just one window into protein conformational ensembles. Different techniques uncover different facets of the same underlying ensemble reality. Cryo-EM captures heterogeneity across length scales. NMR resolves timescale-specific motions. SAXS reports on global shape fluctuations. HDX-MS maps solvent accessibility changes. No single method can reveal the full dynamic landscape. But we do not yet know which techniques, or which combination, will best uncover these ensembles. So we start with a not-so-simple question: which data are most fruitful for understanding biology?</p>

<p>But collecting data is only part of the challenge. Every stage of structural biology today reinforces a static worldview. Algorithms model single conformations. Representations emphasize static structures. Metrics evaluate fit to a single model. Biological interpretations are based on static structures. Addressing any piece in isolation will fall short. The entire pipeline must be redesigned.</p>

<p>The DiffUSE Project is completely rethinking how we collect, process, model, encode, disseminate, and interpret dynamic structural data. We are designing tools to enable downstream applications that meaningfully impact biology.</p>

<p><img width="1920" height="1080" alt="Diffuse Project Areas" src="https://github.com/user-attachments/assets/ecb768aa-789b-4b19-8efc-721e72a28cfe" /></p>

<p>At every step, we are guided by the following questions:</p>
<ul>
  <li>What structural ensemble data are we missing, and how do we build the technologies to capture it?</li>
  <li>Where will structural ensemble data best help reveal biological function?</li>
  <li>How do we create tools that empower everyone, not just specialists, to work with conformational ensembles?</li>
  <li>How do we openly disseminate what we build to maximize scientific impact?</li>
</ul>

<p>We are taking this approach because we cannot get there alone. AlphaFold succeeded because decades of experimental structure determination built the data foundation it needed. No equivalent foundation exists for ensembles. Building that foundation and building the prediction methods must happen together — at a scale no single group can achieve. That is why we are not just generating data ourselves. We are building the tools, methods, and infrastructure to enable the entire community to collect, curate, and benchmark dynamic structural data.</p>

<p>This entire project is also a metascience experiment that asks what it truly takes to tackle a large, systematic, and deeply integrated scientific problem. Some of what lies ahead is engineering. Some of it is science. Some of it is community building. We are pursuing this work both internally at Radial and through a growing network of partnerships and collaborations. Experimenting not only with the science itself but with how best to do it.</p>

<p><strong>What are we building?</strong></p>

<p>The DiffUSE Project will focus on four areas, each advancing a different part of the dynamic structural pipeline simultaneously. Technique-focused spokes will transform specialized dynamics techniques into routine, accessible methods. Algorithms will uncover hidden conformational states and integrate signals across techniques to help reveal the full dynamic landscape and provide key integration points for ensemble prediction. Infrastructure will evolve the PDB from a static archive into a living database capable of storing, validating, and querying dynamic structural data. And Biological Impact will ground our technical advances by asking: how does understanding dynamics actually change what we can understand, predict, design, and discover in biology?</p>

<p><strong>Technique-focused spokes:</strong> Every technique for capturing protein ensembles faces its own barriers to making the data truly usable. We are aiming to experiment with how to overcome these barriers and make these techniques routine and accessible for the broader community. Current methods often discard the dynamic signal, and none were designed to scale to the thousands of ensembles needed to uncover general principles. Each spoke tackles these challenges for a specific technique, producing open algorithms, protocols, models, and metrics, all designed from the start to be adopted and built upon by others. We do not yet know which techniques will generate truly additive data, which approaches will scale, or which will result in the best biological insights. We plan to run parallel experiments, and we will follow the evidence, expanding what works, rethinking what doesn’t, and sharing what we learn along the way.</p>

<p><strong>Algorithms:</strong> No single technique captures the full picture of how a protein moves. Each method sees a different slice of the conformational landscape. To tackle this, we are developing algorithms along three fronts: guidance frameworks that steer ensemble models using experimental data, machine learning approaches that learn directly from experimental measurements rather than structures, and agentic modeling pipelines that autonomously build, evaluate, and refine ensemble models. Everything we build is open, designed for the community to use, extend, and improve. Our first major algorithm is <a href="https://github.com/diff-use/sampleworks">Sampleworks</a>, a platform for testing new guidance frameworks, stress-testing ensemble predictions, understanding where they fail, and establishing the datasets needed to train the next generation of machine learning models on conformational ensembles rather than static snapshots.</p>

<p><strong>Infrastructure:</strong> The PDB transformed structural biology by providing the community with a shared, standardized, and trusted platform for depositing and retrieving structures. But the PDB was designed for static structures. Without accessible ensemble data, researchers default to inconsistent practices: ignoring the inherent heterogeneity in the data and training models on static structures that don’t reflect how proteins actually behave. These limitations don’t stay contained; they propagate through the entire discovery workflow, leading to biased biological conclusions.</p>

<p>We are working on evolving the PDB into a living database that supports dynamic structural data through three advances. First, ensemble-aware validation, because today, when an ensemble model is deposited, only part of it is validated against experimental data. Validation metrics are essential for model deposition and for benchmarking new machine learning models. Second, heterogeneity-aware encoding enables queries that let users do things currently impossible: retrieve only ligand-bound states across the database, find entries where a specific loop adopts multiple conformations, or compare ensemble representations across related proteins. Third, a unified, continually updated repository where experimental data serve as ground truth and the models explaining them improve over time as algorithms advance. The endgame is to enable a virtuous cycle where labs deposit dynamic data, the community builds better ensemble models, those models generate new biological insights, and the standardized data fuels structural ensemble prediction.</p>

<p><strong>Biological Impact:</strong> We want our efforts to be grounded in a simple but demanding question: how does understanding dynamics change what we can predict, design, and discover? To answer that, we are building ensemble-aware representations to quantify what has long been hard to quantify (conformational and solvent entropy, flexibility, and long-range dynamic coupling) and validating them against real biological problems in binding, enzyme design, and allosteric regulation.
But metrics alone aren’t enough. We aim to help understand how dynamics propagate and change protein function. How does a single mutation redistribute an entire conformational ensemble and alter functional output? How do changes in hydration networks reshape binding? Can we trace these effects well enough to use dynamics not just as a descriptor, but as a design principle, such as designing allosteric switches by tuning the dynamic landscape rather than the static structure?</p>

<p><strong>What’s next?</strong></p>

<p>I am thrilled to be joining the DiffUSE team full-time as Scientific Program Director. In this role, I will design and direct the roadmap to unlock protein ensembles. I will be directly leading the efforts in algorithms, infrastructure, and biology, and guiding technical deep dives. My passion has always been building tools and working with open data. I see this role and this project as an incredible opportunity to deliver on this vision of unlocking protein ensembles at scale. The biology questions I care most about, the role that ensembles and entropy play in protein function, and how we leverage that understanding for design, can only be answered once we commit to an ensemble point of view. That is exactly what the DiffUSE Project intends to do.
We’re also growing the <a href="https://diffuse.science/work-with-us/">team</a> across algorithms, infrastructure, and biology, and <a href="https://diffuse.science/work-with-us/">engaging with those looking to build open tools to make this shift happen</a>. And we are doing it all in the <a href="https://diffuse.science/publications/">open</a>.</p>

<p>Our point is this - <em>break’s over</em>.</p>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie.wankowicz@astera.org</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[Expanding The DiffUSE Project]]></summary></entry><entry><title type="html">To ensembles and beyond!</title><link href="https://diffuse.science/posts/sampleworks/" rel="alternate" type="text/html" title="To ensembles and beyond!" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://diffuse.science/posts/sampleworks</id><content type="html" xml:base="https://diffuse.science/posts/sampleworks/"><![CDATA[<p>Most structural biology experimental observables, including those from X-ray crystallography and cryo-electron microscopy (cryo-EM), reflect time and ensemble averages over millions of conformations. Thus, only conformational ensemble models accurately represent the experimental measurements <a href="https://pubmed.ncbi.nlm.nih.gov/36639584/">(Lane 2021)</a>. However, due to limitations in modeling algorithms, most structural modeling approaches tend to model only the ground state, discarding the rich heterogeneity in the raw data that reflects the broader conformational ensemble (<a href="https://pubmed.ncbi.nlm.nih.gov/33020277/">Bozovic et al. 2020</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/24504120/">Kuzmanic et al. 2014</a>). While new algorithms have pushed the field beyond a singular state, they only capture a fraction of the accessible states and are often limited by resolution (<a href="https://pubmed.ncbi.nlm.nih.gov/38904665/">Wankowicz et al. 2024</a>, <a href="https://doi.org/10.7554/eLife.103797.3">Flowers et al. 2025</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/34726164/">Ploscariu et al. 2021</a>).</p>

<p>Generative biomolecular structure predictors like AlphaFold 3 (<a href="https://www.nature.com/articles/s41586-024-07487-w">Abramson et al., 2024</a>), Boltz-1/2 (<a href="https://www.biorxiv.org/content/10.1101/2025.06.14.659707v1">Passaro et al., 2025</a>, <a href="https://www.biorxiv.org/content/10.1101/2024.11.19.624167v3">Wohlwend et al., 2025</a>), RoseTTAFold 3 (<a href="https://www.biorxiv.org/content/10.1101/2025.08.14.670328v2">Corley et al., 2025</a>), and Protenix (<a href="https://www.biorxiv.org/content/10.64898/2026.02.05.703733v3">Protenix Team, 2026</a>) produce remarkably accurate single-state structures. These structure predictors sample from a distribution of conformations learned from data–but largely from the single-state structures they were trained on, implicitly assuming a single protein sequence maps to just one conformation. We have only begun to understand the extent to which these models learn the actual conformational heterogeneity of each protein, or simply memorize the individual protein structures (<a href="https://www.nature.com/articles/s41467-024-51801-z">Chakravarty et al., 2024</a>; <a href="https://elifesciences.org/articles/75751">del Alamo et al., 2022</a>).</p>

<p>“Diffusion”-based generative structure predictors work by predicting a series of updates that map from noise to a final protein structure. This creates a “trajectory” that can be steered by modifying the updates to better match some criterion, such as how well the model fits experimental data. This “guidance” approach was recently used (<a href="https://www.biorxiv.org/content/10.64898/2026.02.27.708490v1">Maddipatla et al., 2026</a>) to steer Protenix to identify previously unmodeled states in existing PDB entries. CryoBoltz (<a href="http://arxiv.org/abs/2506.04490">Raghu et al. 2025</a>) similarly guides Boltz-1 to better model cryo-EM density maps when multiple conformations are found during classification of the particle stack. But to date, there have been no systematic explorations of how well different structure predictors respond to guidance, or of the trade-offs involved in applying this guidance to sample ensembles consistent with the experimental data. To simplify and generalize experiments that generate ensemble predictions and use ensemble data, we introduce Sampleworks.</p>
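
<p>The idea of steering a trajectory by modifying its updates can be shown with a deliberately tiny, one-dimensional toy. This is only a conceptual sketch: it is not the sampler of any predictor mentioned here, nor the Sampleworks API. The names <code>run_trajectory</code>, <code>denoise</code>, <code>data_grad</code>, and <code>weight</code> are hypothetical, the “model” is a stand-in that always predicts one learned state, and the noise normally injected at each step is left out so the result is deterministic.</p>

```python
def run_trajectory(x_start, denoise, data_grad=None, weight=0.0,
                   steps=200, step_size=0.1):
    """Iterate toy denoising updates, optionally steered by a data-fit gradient."""
    x = x_start
    for _ in range(steps):
        # The model's own update: move toward its predicted clean state.
        update = denoise(x) - x
        if data_grad is not None:
            # Guidance: modify the update so that it also reduces the data misfit.
            update -= weight * data_grad(x)
        x += step_size * update
    return x

# Hypothetical toy setup: the "model" only knows a state at 0.0,
# while the "data" is best explained by a coordinate of 2.0.
learned_state = 0.0
observed = 2.0

def denoise(x):
    return learned_state

def data_grad(x):
    # Gradient of the misfit 0.5 * (x - observed) ** 2.
    return x - observed

unguided = run_trajectory(5.0, denoise)
guided = run_trajectory(5.0, denoise, data_grad=data_grad, weight=1.0)
```

<p>Without guidance the trajectory ends at the state the toy model already knows. With guidance it settles between that state and the value favored by the data, and the <code>weight</code> parameter controls where that balance falls.</p>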

<h2 id="what-is-sampleworks"><strong>What is Sampleworks?</strong></h2>

<p>Sampleworks is a framework for integrating structural biology observables, structure predictors, and guidance to improve the modeling of conformational ensembles. Sampleworks does this by defining a common interface for structure predictors and linking that interface to methods for computing experimental observables, like an X-ray electron density map. Sampleworks then applies existing methods to guide generative models during inference, steering sample generation towards an ensemble that reflects the data.</p>

<p>To date, we’ve implemented:</p>

<ul>
  <li>Two guidance methods: diffusion posterior sampling (<a href="https://arxiv.org/abs/2209.14687">Chung et al., 2024</a>; <a href="http://arxiv.org/abs/2406.04239">Levy et al., 2024</a>) and Feynman-Kac steering (<a href="http://arxiv.org/abs/2501.06848">Singhal et al., 2025</a>).</li>
  <li>Wrappers for <a href="https://github.com/jwohlwend/boltz">Boltz-1 and -2</a>; <a href="https://github.com/bytedance/Protenix">Protenix</a>; and <a href="https://github.com/RosettaCommons/foundry">RosettaFold3</a>.</li>
  <li>Guidance based on real-space electron density maps.</li>
</ul>

<p>We will continue to expand the guidance methods, structure predictors, and the kinds of experimental data used for guidance, and we welcome contributions from this community. You can use Sampleworks, follow our progress, and contribute <a href="https://github.com/diff-use/sampleworks">here</a>!</p>

<h2 id="where-were-at"><strong>Where we’re at</strong></h2>

<p>Sampleworks’s first goal is to benchmark the extent to which diffusion-based structure predictors can be guided to generate accurate structural ensembles. Specifically, we ask whether these models can produce out-of-distribution conformations and, when they do, how well those conformations balance between fit to experimental data and geometric quality.</p>

<p>Since these structure predictors were trained on single-state structures, this is a natural test of their ability to generalize. In this first test, we guide the predictors using synthetic electron density maps that mix two conformations. We evaluate how well each predictor recovers the synthetic ensemble while still satisfying protein-geometry constraints.</p>

<p><img src="/assets/images/posts/6b8x_w_wo.png" alt="6B8X With and Without Guidance" /></p>

<p><em>(Left) Without guidance, Boltz-2 does not predict the two conformations (grey) in 6B8X when run with multiple random seeds. (Right) Guidance using synthetic density representing the two conformations enables Boltz-2 to predict both.</em></p>

<p>We’ve now tested our initial structure predictors with guidance for 40 different structures from the PDB that have known alternate conformations in the original depositions. In the example above, for the protein phosphatase PTP1B (PDB code 6B8X), the predictor, without guidance, falls back on its training data and can only generate structures in one of the two known conformations. Encouragingly, when incorporating guidance with electron density representing both conformations, we observe the structure predictors now predicting two states. However, this is not universally true across different proteins and models. Because Sampleworks is modular and easily accommodates different structure predictors, we can quickly test them all at different guidance levels across a wide range of proteins.</p>

<div style="display: flex; gap: 10px;">
  <img src="/assets/images/posts/17_tradeoff_conceptual_weak_zoomed.png" alt="Tradeoff Weak" style="width: 50%;" />
  <img src="/assets/images/posts/18_tradeoff_conceptual_strong.png" alt="Tradeoff Strong" style="width: 50%;" />
</div>

<p><em>Geometry penalty vs. fit to data for each model’s ensemble prediction on 40 examples of multiple conformations when guided with synthetic density. (Left) Weak guidance, zoomed close to show differences between models. (Right) Strong guidance, resulting in much larger geometry penalties.</em></p>

<p>We also assessed the trade-offs between fit to electron-density data and geometry. To determine fit to electron density, we use the Real Space Correlation Coefficient (RSCC). We also compute a conglomerate geometry score, which becomes larger with increased clashes, bad backbone dihedrals, or unrealistic bond lengths. With relatively weak guidance, we can see that the models are all able to improve RSCC without incurring too much cost to protein geometry. Paradoxically, when we use stronger guidance, most structure predictors fall apart completely, and the RSCC gets worse! This probably means that strong guidance pushes the predictors’ trajectories outside what they’ve been trained on, far enough that they cannot recover. Despite their very similar architectures and training data, these structure predictors differ in how they respond to guidance.</p>
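
<p>For readers unfamiliar with RSCC, the sketch below computes it as a plain Pearson correlation between two density maps sampled on the same grid, using a one-dimensional toy “map” built from two Gaussian peaks. This is a simplification under stated assumptions: in practice RSCC is usually evaluated over grid points near the atoms of interest, the equal 50/50 mixture is an arbitrary choice, and none of the names below come from Sampleworks.</p>

```python
import math

def rscc(rho_obs, rho_calc):
    """Pearson correlation between two density maps sampled on the same grid."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / math.sqrt(var_o * var_c)

def gaussian_density(center, grid, width=0.5):
    """Toy density: a single Gaussian peak on a 1D grid."""
    return [math.exp(-((g - center) ** 2) / (2 * width ** 2)) for g in grid]

def mix(rho_a, rho_b, frac_a=0.5):
    """Occupancy-weighted sum of two densities."""
    return [frac_a * a + (1 - frac_a) * b for a, b in zip(rho_a, rho_b)]

grid = [i * 0.1 for i in range(100)]   # 1D grid in arbitrary units
rho_a = gaussian_density(3.0, grid)    # conformation A
rho_b = gaussian_density(6.0, grid)    # conformation B
target = mix(rho_a, rho_b)             # synthetic two-state map

single_state = rscc(target, rho_a)             # model with only conformation A
two_state = rscc(target, mix(rho_a, rho_b))    # model with both conformations
```

<p>In this toy, the two-state model reproduces the synthetic map exactly and so has an RSCC of 1 up to rounding, while the single-state model scores noticeably lower because it cannot account for the second peak.</p>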

<p>We are just beginning to learn how best to guide each model to better model biologically relevant ensembles. We are already seeing signs that the models are limited in their current forms, and we will be working to explore and overcome those limitations. In the coming months, we’ll have more to share about the first results shown above and about the experiments in progress now. We’ll be looking to expand the range of predictors and experimental data we can work with, and to engage with the community to make protein conformational ensembles more accessible for everyone.</p>]]></content><author><name>Karson Chrispens, Marcus Collins, PhD, and Stephanie Wankowicz, PhD</name></author><category term="posts" /><category term="meta" /><category term="modeling" /><summary type="html"><![CDATA[Sampleworks is a framework for integrating structural biology observables, structure predictors, and guidance to improve the modeling of conformational ensembles.]]></summary></entry><entry><title type="html">Shake It Up!</title><link href="https://diffuse.science/post/shake-it-up/" rel="alternate" type="text/html" title="Shake It Up!" /><published>2026-02-22T00:00:00+00:00</published><updated>2026-02-22T00:00:00+00:00</updated><id>https://diffuse.science/post/shake-it-up</id><content type="html" xml:base="https://diffuse.science/post/shake-it-up/"><![CDATA[<h2 id="md-simulations-of-changes-in-diffuse-scattering-depending-on-ligand-binding"><em>MD simulations of changes in diffuse scattering depending on ligand binding</em></h2>

<figure class="third ">
  
    
      <a href="/assets/images/posts/2026-02-23/Mac1_NoADPr.png" title="MD diffuse with CHES (buffer molecule)">
          <img src="/assets/images/posts/2026-02-23/Mac1_NoADPr.png" alt="MD diffuse with CHES" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-02-23/Mac1_WithADPr.png" title="MD diffuse with ADPr">
          <img src="/assets/images/posts/2026-02-23/Mac1_WithADPr.png" alt="MD diffuse with ADPr" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-02-23/Mac1_DeltaADPr.png" title="Difference ADPr - CHES">
          <img src="/assets/images/posts/2026-02-23/Mac1_DeltaADPrAniso.png" alt="Difference ADPr - CHES" />
      </a>
    
  
  
    <figcaption>MD simulations of Mac1 diffuse scattering change depending on ligand binding. <em>Left</em>. Mac1 with CHES (buffer molecule). <em>Center</em>. Mac1 with ADPr. <em>Right</em>. Difference.
</figcaption>
  
</figure>

<h2 id="what-did-we-find">What did we find?</h2>

<p>In a previous <a href="/post/in_the_cloud/">post</a> I talked about our plan to perform MD simulations of Mac1 crystals under different conditions. We had just started an MD simulation of Mac1 in complex with ADP-ribose (ADPr), and were waiting for it to finish. We weren’t sure how different it would be from the <a href="/post/lets-dance/">initial simulations</a> of Mac1 without ADPr, in which a CHES buffer occupies the ADPr binding site.</p>

<p>Now we know more. Visual comparison of the images above shows that the rich anisotropic diffuse features in the maps with CHES buffer (left) and with ADPr (center) in the binding pocket have similar patterns of peaks and troughs that differ in detail. Subtracting them reveals the differences more clearly (right). Quantitative analysis shows that the variations in the difference map are comparable in strength to the intensities in either individual map. This result is encouraging as it indicates that such differences might be observed in experiments.</p>

<h2 id="its-a-trap">It’s a trap!</h2>

<p>Along the way we encountered a common pitfall in making these kinds of comparisons: due to an indexing ambiguity in the P43 space group, the structures of Mac1 used for simulations with and without ADPr were solved using different definitions of the lattice vectors, with the <em>h</em> and <em>k</em> axes swapped and the <em>l</em> axis reversed (compare PDB IDs <a href="https://www.rcsb.org/structure/7TX0">7TX0</a> and <a href="https://www.rcsb.org/structure/7TX3">7TX3</a>). The diffUSE modeling team worked out how to make the simulated diffuse maps consistent at our recent <a href="/posts/allhands/">all hands meeting</a>, enabling us to perform a controlled comparison of the simulations.</p>
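<p>To make the reindexing concrete, here is a minimal sketch of how a map on one convention could be brought onto the other. It assumes (this is not spelled out in the post) that the diffuse map is stored as a NumPy array indexed as <code class="language-plaintext highlighter-rouge">[h, k, l]</code> on a grid that is symmetric about the origin, so that reversing an axis sends <em>l</em> to −<em>l</em>. The function name and the toy map are hypothetical.</p>

```python
import numpy as np

def reindex_swap_hk_flip_l(diffuse_map):
    """Return the map with h and k swapped and l reversed.

    Assumption: diffuse_map[h, k, l] is sampled on a grid that is
    symmetric about the origin along every axis.
    """
    swapped = np.transpose(diffuse_map, (1, 0, 2))  # new[h, k, l] = old[k, h, l]
    return np.flip(swapped, axis=2)                 # new[h, k, l] = old[k, h, -l]

# Toy map on h, k, l in -2..2 whose value encodes its own Miller indices.
idx = np.arange(-2, 3)
h, k, l = np.meshgrid(idx, idx, idx, indexing="ij")
toy_map = 100.0 * h + 10.0 * k + l

reindexed = reindex_swap_hk_flip_l(toy_map)
```

<p>Swapping two axes and reversing the third is its own inverse, so the same function converts in either direction. Once both simulated maps share one convention, they can be subtracted voxel by voxel.</p>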

<h2 id="what-next">What next?</h2>

<p>The next step is to compare both of these simulations with data recently collected at CHESS (see <a href="https://diffuse.science/logbook/beamtime/20251105-chess/">logbook</a>), in one of a series of diffUSE beam times that are expected to yield a large number of datasets. These runs have already revealed that diffuse data are <a href="/posts/allhands/">reproducible between CHESS and ALS beamlines</a>. Data from Mac1 +/- ADPr are now in the processing pipeline; we’re eager to see how Mac1 diffuse scattering changes upon ligand binding, and whether MD simulations can help explain what we see.</p>

<hr />

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name>Michael Wall</name><email>mewall00@gmail.com</email></author><category term="post" /><category term="diffuse scattering" /><category term="molecular dynamics" /><category term="modeling" /><category term="open science" /><category term="meta" /><summary type="html"><![CDATA[MD simulations of changes in diffuse scattering depending on ligand binding]]></summary></entry><entry><title type="html">DiffUSE January 2026 Retreat: From Coast to Coast, Diffuse Scattering Reproduces</title><link href="https://diffuse.science/posts/allhands/" rel="alternate" type="text/html" title="DiffUSE January 2026 Retreat: From Coast to Coast, Diffuse Scattering Reproduces" /><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><id>https://diffuse.science/posts/allhands</id><content type="html" xml:base="https://diffuse.science/posts/allhands/"><![CDATA[<div class="notice" style="font-style: italic;">
DiffUSE is a Radial Project by <a href="https://astera.org">Astera</a>. This initiative aims to make diffuse X-ray scattering a routine tool for understanding protein dynamics in basic biology and drug discovery.
</div>

<h2 id="why-this-retreat-mattered"><strong>Why This Retreat Mattered</strong></h2>

<p>In late January, the DiffUSE Project team gathered in person for our first progress meeting at Astera’s headquarters in Emeryville, California. Since our October online meeting, every team has made substantial progress.</p>

<p>The retreat brought together team members working on data collection, data processing, molecular dynamics simulations, machine learning modeling, infrastructure, and open science to assess progress against our six-month goals and chart the path forward.</p>

<p>Perhaps the most exciting development is a deceptively simple one: diffuse scattering data collected at CHESS (Cornell) and ALS (Berkeley) are reproducible. This cross-country validation marks a critical step toward making diffuse scattering a routine tool for structural biology.</p>

<h2 id="what-have-we-accomplished-since-october"><strong>What Have We Accomplished Since October?</strong></h2>

<h3 id="data-collection"><strong>Data Collection</strong></h3>

<p>Kara Zielinski (Fraser Lab, UCSF) reported on an intensive fall data collection campaign:</p>

<ul>
  <li><strong>9 beamtimes</strong> since project inception across two synchrotrons (CHESS and ALS)</li>
  <li><strong>18 participants</strong> contributed to data collection</li>
  <li><strong>7 protein systems</strong>: Mac1, NrdE, Lysozyme, DNA fibers, ATCase, Insulin, and Huwe1</li>
  <li><strong>129 “good” datasets</strong> collected (no data collection errors)</li>
</ul>

<p>The team systematically explored experimental perturbations:</p>

<ul>
  <li><strong>Temperature</strong>: Data collected at 100K (cryo), 220-275K (intermediate), and 310-315K (elevated), though sample handling for intermediate temperatures requires further optimization</li>
  <li><strong>Ligands</strong>: Mac1 + ADPr (11 datasets from CHESS and ALS combined) and Mac1 + small molecule “opener” (6 datasets from ALS)</li>
  <li><strong>Radiation damage mitigation</strong>: Vector scans implemented at CHESS to spread dose across radiation-sensitive samples like NrdE and ATCase</li>
</ul>

<p>Beamline-specific improvements included:</p>

<ul>
  <li><strong>ALS</strong>: Explored dose dependence, wavelength effects, and exposure time optimization; addressed collimator ring scatter issues at 14 keV</li>
  <li><strong>CHESS</strong>: Continued X-ray aperture optimization for background reduction</li>
</ul>

<p><img src="/assets/images/posts/2026-02-02/2026_crystals_spm_allhands.png" alt="2026_crystals_spm_allhands" title="Initial protein systems tested by the diffUSE team" />
<em>Slide shared at the DiffUSE project’s retreat showcasing crystals of initially tested protein systems.</em></p>

<h3 id="data-processing"><strong>Data Processing</strong></h3>

<p>Steve Meisburger (Cornell/CHESS) presented major advances in data processing tools and a landmark reproducibility result.</p>

<p><strong><a href="https://github.com/diff-use/mdx2">mdx2</a></strong> is an open-source software package for processing and analyzing diffuse X-ray scattering data. Development has accelerated with a new team (Steve Meisburger, Justin Biel, Joseph Lee) and modern development practices, including version control, issue tracking, and code review. Version 10.3 was released in December 2025 with:</p>

<ul>
  <li>Containerized deployment via <code class="language-plaintext highlighter-rouge">conda install -c conda-forge mdx2</code></li>
  <li>Jupyter Lab environment integration</li>
  <li>Live processing capability on Voltage Park during beam times</li>
</ul>

<p><strong>Reference datasets from CHESS</strong> now span multiple systems (Mac1, NrdE, DNA, ATCase, Insulin) with systematic tracking through integration, merging, and fine map generation stages.</p>

<p><strong>The headline result</strong>: Diffuse scattering is reproducible between CHESS and ALS. Side-by-side comparisons of Mac1 diffuse maps from both beamlines show consistent features, validating that the signal is robust across different detector systems, beam profiles, and facilities. This East-meets-West reproducibility is foundational for any future multi-site data collection campaigns.</p>

<p><img src="/assets/images/posts/2026-02-02/reproducible_ds_allhands.png" alt="Scattering comparison across beamlines" title="DiffUSE Scattering is reproducible across coasts!" />
<em>Slides shared at the DiffUSE project’s retreat showcasing reproducibility across beamlines.</em></p>

<p>Additional findings from DNA crystal analysis revealed that correlated disorder differs between room temperature and 100K conditions, even when the static structures appear similar—and that diffuse signal extends beyond the Bragg resolution limit, suggesting untapped information content.</p>

<p>A <strong>Galaxy platform prototype</strong> was demonstrated, pointing toward a vision of “Cryosparc for diffuse,” making diffuse data processing accessible through a GUI with integrated workflows and interactive visualizations.</p>

<h3 id="molecular-dynamics-simulations"><strong>Molecular Dynamics Simulations</strong></h3>

<p>Mike Wall presented substantial progress on crystallographic MD simulations.</p>

<p><strong>Apo Mac1 baseline results</strong> show exceptional agreement between simulation and experiment:</p>

<ul>
  <li>Total correlation coefficient: CC = 0.96</li>
  <li>Anisotropic correlation coefficient: CC = 0.56</li>
  <li>Simulation: 2×2×2 supercell with OPC3 waters (279,004 atoms), neutron crystal structure 7TX3, 1100 ns unrestrained trajectory</li>
</ul>
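<p>For readers unfamiliar with the two numbers above, here is a minimal sketch of how a total and an anisotropic correlation coefficient can be computed. The post does not say exactly how the anisotropic component is defined; the sketch assumes one common convention, subtracting the mean intensity of each resolution shell before correlating. The function names, shell count, and toy data are all hypothetical.</p>

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson correlation coefficient between two equally shaped arrays."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def anisotropic_part(intensity, radius, n_shells=20):
    """Subtract the mean intensity of each radial shell (the isotropic part)."""
    edges = np.linspace(radius.min(), radius.max(), n_shells + 1)
    shell = np.clip(np.digitize(radius, edges) - 1, 0, n_shells - 1)
    sums = np.bincount(shell.ravel(), weights=intensity.ravel(), minlength=n_shells)
    counts = np.bincount(shell.ravel(), minlength=n_shells)
    shell_mean = sums / np.maximum(counts, 1)  # empty shells keep a mean of 0
    return intensity - shell_mean[shell]

# Toy maps: a shared isotropic falloff plus shared features and private noise.
rng = np.random.default_rng(0)
idx = np.arange(-8, 9)
h, k, l = np.meshgrid(idx, idx, idx, indexing="ij")
radius = np.sqrt(h**2 + k**2 + l**2)
isotropic = 2.0 * np.exp(-radius / 4.0)
features = 0.05 * rng.standard_normal(radius.shape)
experiment = isotropic + features
simulation = isotropic + features + 0.05 * rng.standard_normal(radius.shape)

total_cc = pearson_cc(experiment, simulation)
aniso_cc = pearson_cc(anisotropic_part(experiment, radius),
                      anisotropic_part(simulation, radius))
```

<p>In this toy case the total CC is dominated by the smooth radial falloff that both maps share, which is why the anisotropic CC is the more demanding test of a model.</p>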

<p><strong>MD optimization methods</strong> are advancing on two fronts:</p>

<ul>
  <li><strong>Enrichment</strong>: Selectively removing MD frames to increase diffuse correlation</li>
  <li><strong>Reweighting</strong>: Using JAX to optimize frame weights via differentiable Pearson CC maximization. Initial test on experimental diffuse data achieved CC = 0.97 with 47,150 reflections to 3.5 Å resolution (<a href="https://github.com/diff-use/sampleworks">work</a> by Karson Chrispens, documented in a <a href="https://diffuse.science/posts/jax_refine/">DiffUSE blog post</a>)</li>
</ul>
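<p>The reweighting idea can be sketched in a few lines. The actual work uses JAX and automatic differentiation; the stand-in below uses plain NumPy with a hand-derived gradient so that it has no extra dependencies. The softmax parameterization of the weights, the backtracking step rule, and the toy data are assumptions made for illustration, not details taken from the project.</p>

```python
import numpy as np

def pearson_cc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def cc_and_grad(theta, frames, target):
    """CC between the weighted-average map and the target, and d(CC)/d(theta)."""
    w = softmax(theta)                 # weights are positive and sum to 1
    x = w @ frames                     # weighted-average map, shape (n_voxels,)
    x = x - x.mean()
    y = target - target.mean()
    xx, yy = x @ x, y @ y
    cc = (x @ y) / np.sqrt(xx * yy)
    g_map = y / np.sqrt(xx * yy) - cc * x / xx  # d(CC)/d(map voxel)
    g_w = frames @ g_map                        # d(CC)/d(weight)
    g_theta = w * (g_w - w @ g_w)               # chain rule through softmax
    return float(cc), g_theta

def reweight(frames, target, n_steps=200, step=1.0):
    """Gradient ascent on CC that accepts only improving steps."""
    theta = np.zeros(frames.shape[0])  # start from uniform weights
    cc, grad = cc_and_grad(theta, frames, target)
    for _ in range(n_steps):
        trial = theta + step * grad
        trial_cc, trial_grad = cc_and_grad(trial, frames, target)
        if trial_cc > cc:
            theta, cc, grad = trial, trial_cc, trial_grad
        else:
            step *= 0.5                # shrink the step and try again
    return softmax(theta), cc

# Toy problem: the "experiment" is a mixture of three of six random frames.
rng = np.random.default_rng(0)
frames = rng.random((6, 500))
target = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0]) @ frames

cc_uniform = pearson_cc(frames.mean(axis=0), target)
weights, cc_opt = reweight(frames, target)
```

<p>Enrichment can be seen as the special case in which each weight is forced to be either zero or equal to the others. With JAX, <code class="language-plaintext highlighter-rouge">cc_and_grad</code> would not need a hand-written gradient.</p>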

<p><strong>Ligand perturbations</strong> are now being simulated: Mac1 + ADPr shows distinct diffuse patterns compared to baseline Mac1, with protonation state variations (ASP157 → ASH157) under investigation.</p>

<p><strong>Second system</strong>: Dihydrofolate reductase (DHFR, PDB: 7FPV) is being developed as a generalization target, expanding beyond the Mac1 test case.</p>

<p><strong>Simulated diffraction</strong> capabilities using nanoBragg (James Holton) enable validation of data processing pipelines—simulated diffuse intensity can be extracted using mdx2, closing the loop between simulation and experiment.</p>

<p><img src="/assets/images/posts/2026-02-02/DHFR_allhands.png" alt="Simulated diffuse intensity of PDB 7FPV" title="A new protein system is being tested for simulated diffuse intensity, PDB 7FPV" />
<em>Slides shared at the DiffUSE project’s retreat showcasing simulation results of a new protein system, Dihydrofolate reductase (DHFR, PDB: 7FPV).</em></p>

<h3 id="machine-learning-modeling"><strong>Machine Learning Modeling</strong></h3>

<p>Marcus Collins presented the ML modeling roadmap focused on using experimental data to reveal hidden protein conformations.</p>

<p><strong>Key insight</strong>: Current ML structure predictors (AlphaFold3-like models, including Boltz-2, Protenix, RF3) do not reliably predict alternate conformations (altlocs) even with multiple random seeds, indicating they have not learned about underlying ensembles. This gap motivates developing density-guided ensemble generation (Sampleworks).</p>

<p><strong>Density guidance approach</strong>: The team is implementing training-free guidance from experimental density maps (2Fo-Fc), using the difference between experimental and calculated maps to steer diffusion model sampling toward conformations consistent with crystallographic data. Early results are promising but mixed: Boltz-2 with density guidance can capture both altlocs in some test cases like PTP1B (6B8X), though performance varies across systems.</p>

<p><strong>Sampleworks pipeline</strong> is being built as a plug-and-play guidance framework that can combine different structure prediction models, experimental data, and guidance strategies.</p>

<ul>
  <li>Model wrappers implemented for RF3, Protenix, Boltz-1, and Boltz-2 (MD and X-ray modes)</li>
  <li>Initial test set of ~50 structures from PDB prepared with altlocs; electron density maps being generated</li>
  <li>Evaluation metrics: RSCC, LDDT, clash scores, backbone and sidechain geometry</li>
</ul>

<p><strong>Water modeling</strong> emerges as a critical challenge for advancing to reciprocal space. Our first attempt is to improve the modeling of explicit solvent. Current models achieve ~0.3 precision/recall at 0.5 Å—insufficient for improving Rwork/Rfree. The team is exploring flow-matching approaches and evaluating whether a single unified model or separate protein/water models will be more effective. Ordered waters coupled to protein altlocs are particularly important targets.</p>
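<p>As an illustration of what “precision/recall at 0.5 Å” can mean, here is a minimal sketch that scores predicted water positions against reference waters. The post does not describe the matching scheme, so the greedy one-to-one matching used here is an assumption, and the function name and coordinates are hypothetical.</p>

```python
import numpy as np

def water_precision_recall(predicted, reference, cutoff=0.5):
    """Greedy one-to-one matching of predicted to reference waters.

    A predicted water counts as correct if it can be paired with an
    otherwise unmatched reference water within `cutoff` (in Å).
    """
    predicted = np.asarray(predicted, dtype=float).reshape(-1, 3)
    reference = np.asarray(reference, dtype=float).reshape(-1, 3)
    if len(predicted) == 0 or len(reference) == 0:
        return 0.0, 0.0
    # Pairwise distances, shape (n_predicted, n_reference).
    dist = np.linalg.norm(predicted[:, None, :] - reference[None, :, :], axis=-1)
    used_pred, used_ref, matches = set(), set(), 0
    # Visit candidate pairs from closest to farthest.
    for flat in np.argsort(dist, axis=None):
        i, j = (int(v) for v in np.unravel_index(flat, dist.shape))
        if dist[i, j] > cutoff:
            break
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        matches += 1
    return matches / len(predicted), matches / len(reference)

reference = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [6.0, 6.0, 6.0]]
predicted = [[0.2, 0.0, 0.0], [3.0, 0.4, 0.0], [0.0, 3.9, 0.0]]
precision, recall = water_precision_recall(predicted, reference)
```

<p>In this toy case two of the three predicted waters fall within 0.5 Å of a reference water, and two of the four reference waters are recovered.</p>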

<h3 id="infrastructure-and-publishing"><strong>Infrastructure and Publishing</strong></h3>

<p>Justin Biel presented the computational infrastructure supporting DiffUSE, built around a three-pillar model: Data, Compute, and Publishing.</p>

<p><strong>Compute Infrastructure</strong> uses Voltage Park as the backbone:</p>

<ul>
  <li>H100 SXM5 GPUs available via bare metal (8× GPU configurations)</li>
  <li>Two usage patterns supported:
    <ul>
      <li><strong>Workspaces</strong>: Interactive environments for experimental work, debugging, and visualization</li>
      <li><strong>Workflows</strong>: Hardened, scalable pipelines for production analysis</li>
    </ul>
  </li>
  <li>The DiffUSE web app now provides resource checkout, visibility into running resources, and SSH/Jupyter access</li>
  <li>Custom container management enables workspace pausing and environment customization</li>
  <li>Workflow orchestration via Prefect and Docker</li>
</ul>

<p><strong>Data Infrastructure</strong> centers on the DiffUSE web app:</p>

<ul>
  <li><strong>Storage</strong>: Core Backblaze storage (S3-compatible) with OSN bucket integration for beamline data</li>
  <li><strong>Access</strong>: Automatic mounting to Voltage Park resources, plus web app download, CLI, Python SDK, and API</li>
  <li><strong>Metadata</strong>: Experiments have artifacts, optional markdown content (like logbook entries), relationships to other experiments, and tags</li>
  <li><strong>Automation</strong>: Beam-trip data automatically triggers experiment registration; dataset files populate metadata fields</li>
  <li><strong>Governance</strong>: Standards compliance checking, staging-to-public workflows, DOI attachment decisions</li>
</ul>

<p><strong>Publishing workflow</strong> discussions focused on:</p>

<ul>
  <li>When to stage data privately vs. make everything open immediately</li>
  <li>When to attach DOIs (content should be largely immutable)</li>
  <li>External database destinations: SBGrid Databank, PDB, Zenodo</li>
</ul>

<h3 id="open-science"><strong>Open Science</strong></h3>

<p>Prachee Avasthi (Head of Open Science, Astera) led a discussion on publishing expectations and open science practices. <strong>Discussion</strong> explored barriers to sharing, evidence of downstream reuse, orphan artifacts without ideal homes, and prioritization of unaddressed data sharing issues.</p>

<h2 id="reflections-on-our-distributed-model"><strong>Reflections on Our Distributed Model</strong></h2>

<p>This retreat underscored how DiffUSE’s distributed structure works. By embedding team members across institutions (Cornell, UCSF, Berkeley Lab, and beyond), we maintain direct access to beamlines, computational expertise, and scientific communities that would be impossible to replicate in a single location. The “Diffuse East ≈ Diffuse West” result is itself a product of this model: data collected by different teams at facilities 2,500 miles apart, processed with shared tools, yielding consistent results. Our infrastructure investments (the DiffUSE web app, Voltage Park compute, standardized containerized environments) bridge the geographic gaps, allowing a scientist at Cornell to spin up the same analysis environment as a colleague in California.</p>

<p>The in-person retreat revealed how much asynchronous collaboration had already accomplished: sessions focused on integration and next steps rather than catching people up. Open science practices (shared logbooks, blog posts, open repositories) keep everyone aligned between meetings. The challenge ahead is scaling this approach: as we add systems, datasets, and collaborators, maintaining the coherence that makes distributed work effective will require continued investment in documentation, automation, and the human connections that make a dispersed team feel like one group working toward a shared goal.</p>

<p>The science described here represents the output of a significant and coordinated resource investment. Since DiffUSE’s start in July, Astera has committed <span>$3.2M</span> to stand up the project: <span>$2.63M</span> in research grants distributed directly to our partner labs at Fraser Lab/LBL, Ando Lab/CHESS, and Wankowicz Lab, <span>$567K</span> in Astera personnel and contractor support, and <span>$30K</span> in computational infrastructure. On top of this, CHESS contributed an estimated <span>$700K</span> in beamtime, bringing the total resource investment to roughly <span>$3.9M</span>. Looking ahead, an additional <span>$2.4M</span> is projected for 2026 as the project scales toward its core scientific goals.</p>

<hr />

<p><img src="/assets/images/posts/2026-02-02/diffuse_demo_datamanagement.png" alt="The DiffUSE App is currently under development" title="the DiffUSE App, currently under development" />
<em>A screenshot from our data management infrastructure, demonstrated at the retreat. This is in active development with Prophet Town and Voltage Park.</em></p>

<p><img src="/assets/images/posts/2026-02-02/2026_allhands_pres.png" alt="Mike Wall presents progress on MD optimization of diffuse scattering" title="Mike Wall presents progress on MD optimization of diffuse scattering" />
<em>Mike Wall presents progress on MD optimization of diffuse scattering to a full house at the Astera Institute.</em></p>

<hr />

<h2 id="whats-next"><strong>What’s Next?</strong></h2>

<h3 id="data-collection-3-month-goals"><strong>Data Collection (3-month goals)</strong></h3>

<ul>
  <li>Collect data on additional systems; collect lysozyme at ALS</li>
  <li>Optimize sample handling for intermediate temperatures (oil-based approaches)</li>
  <li>Explore serial crystallography approaches (chip types, small wedges, crystal size variation)</li>
  <li>Continue investigating cryo options (traditional, NANUQ, high-pressure cryocooling)</li>
</ul>

<h3 id="data-processing-2026-goals"><strong>Data Processing (2026 goals)</strong></h3>

<ul>
  <li>Improve mdx2 performance (~2× speedup)</li>
  <li>Implement GOODVIBES and DISCOBALL in Python (JAX)</li>
  <li>Fully explore serial crystallography processing</li>
  <li>Deploy on Ando lab Galaxy server; add mdx2 tools</li>
  <li>Develop “Cryosparc for diffuse” project roadmap</li>
</ul>

<h3 id="md-simulations"><strong>MD Simulations</strong></h3>

<ul>
  <li>Continue model/data comparisons and refine MD models (protonation states, parameterization)</li>
  <li>Expand to new systems and additional ligand/mutation perturbations</li>
  <li>Explore how MD optimizations can support other DiffUSE activities (ML modeling, diffraction image simulation, data processing validation)</li>
</ul>

<h3 id="ml-modeling"><strong>ML Modeling</strong></h3>

<ul>
  <li>Scale up Sampleworks evaluation across initial test set</li>
  <li>Improve water prediction models (retrain SuperWater with better data, explore flow matching vs. diffusion)</li>
  <li>Quantify water model precision requirements by systematically perturbing well-supported waters</li>
  <li>Progress toward reciprocal space/Bragg peak guidance, ultimately targeting diffuse data guidance</li>
</ul>

<h3 id="infrastructure"><strong>Infrastructure</strong></h3>

<ul>
  <li>Finalize containerized workspace management with pause/resume capability</li>
  <li>Expand workflow orchestration options</li>
  <li>Refine data governance workflows for staging → public → external database publication</li>
</ul>

<h3 id="open-science-1"><strong>Open Science</strong></h3>

<ul>
  <li>Address identified barriers to sharing</li>
  <li>Establish timelines for DOI attachment and external database deposition</li>
  <li>Continue documentation through blog posts and logbooks</li>
</ul>

<p>Special thanks to Astera for hosting the retreat in Emeryville.</p>

<hr />

<h2 id="glossary"><strong>Glossary</strong></h2>

<table>
  <tr>
   <td><strong>Acronym</strong>
   </td>
   <td><strong>Definition</strong>
   </td>
  </tr>
  <tr>
   <td>ADPr
   </td>
   <td>Adenosine diphosphate ribose (a ligand)
   </td>
  </tr>
  <tr>
   <td>ALS
   </td>
   <td>Advanced Light Source (synchrotron at Lawrence Berkeley National Laboratory)
   </td>
  </tr>
  <tr>
   <td>API
   </td>
   <td>Application Programming Interface
   </td>
  </tr>
  <tr>
   <td>ASH
   </td>
   <td>Protonated aspartic acid residue
   </td>
  </tr>
  <tr>
   <td>ASP
   </td>
   <td>Aspartic acid residue
   </td>
  </tr>
  <tr>
   <td>ATCase
   </td>
   <td>Aspartate Transcarbamylase (enzyme)
   </td>
  </tr>
  <tr>
   <td>CC
   </td>
   <td>Correlation Coefficient
   </td>
  </tr>
  <tr>
   <td>CHESS
   </td>
   <td>Cornell High Energy Synchrotron Source
   </td>
  </tr>
  <tr>
   <td>CLI
   </td>
   <td>Command Line Interface
   </td>
  </tr>
  <tr>
   <td>DHFR
   </td>
   <td>Dihydrofolate Reductase (enzyme)
   </td>
  </tr>
  <tr>
   <td>DOI
   </td>
   <td>Digital Object Identifier
   </td>
  </tr>
  <tr>
   <td>GPU
   </td>
   <td>Graphics Processing Unit
   </td>
  </tr>
  <tr>
   <td>GUI
   </td>
   <td>Graphical User Interface
   </td>
  </tr>
  <tr>
   <td>JAX
   </td>
   <td>Just After eXecution (Google's autodiff/ML library for Python)
   </td>
  </tr>
  <tr>
   <td>keV
   </td>
   <td>Kiloelectronvolt (unit of X-ray energy)
   </td>
  </tr>
  <tr>
   <td>LDDT
   </td>
   <td>Local Distance Difference Test (structure quality metric)
   </td>
  </tr>
  <tr>
   <td>Mac1
   </td>
   <td>Macrodomain 1 (SARS-CoV-2 nonstructural protein 3)
   </td>
  </tr>
  <tr>
   <td>MD
   </td>
   <td>Molecular Dynamics
   </td>
  </tr>
  <tr>
   <td>ML
   </td>
   <td>Machine Learning
   </td>
  </tr>
  <tr>
   <td>NrdE
   </td>
   <td>Ribonucleotide Reductase class Ib alpha subunit (enzyme)
   </td>
  </tr>
  <tr>
   <td>ns
   </td>
   <td>Nanoseconds
   </td>
  </tr>
  <tr>
   <td>OPC3
   </td>
   <td>Optimal Point Charge 3-point water model
   </td>
  </tr>
  <tr>
   <td>OSN
   </td>
   <td>Open Storage Network
   </td>
  </tr>
  <tr>
   <td>PDB
   </td>
   <td>Protein Data Bank
   </td>
  </tr>
  <tr>
   <td>PTP1B
   </td>
   <td>Protein Tyrosine Phosphatase 1B (enzyme)
   </td>
  </tr>
  <tr>
   <td>RF3
   </td>
   <td>RoseTTAFold 3 (structure prediction model)
   </td>
  </tr>
  <tr>
   <td>Rfree
   </td>
   <td>Free R-factor (crystallographic validation metric)
   </td>
  </tr>
  <tr>
   <td>Rwork
   </td>
   <td>Working R-factor (crystallographic refinement metric)
   </td>
  </tr>
  <tr>
   <td>RSCC
   </td>
   <td>Real Space Correlation Coefficient
   </td>
  </tr>
  <tr>
   <td>S3
   </td>
   <td>Simple Storage Service (cloud storage protocol)
   </td>
  </tr>
  <tr>
   <td>SBGrid
   </td>
   <td>Structural Biology Software Grid (consortium)
   </td>
  </tr>
  <tr>
   <td>SDK
   </td>
   <td>Software Development Kit
   </td>
  </tr>
  <tr>
   <td>SSH
   </td>
   <td>Secure Shell (network protocol)
   </td>
  </tr>
  <tr>
   <td>UCSF
   </td>
   <td>University of California, San Francisco
   </td>
  </tr>
</table>

<hr />

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name></name></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[A report on our January 2026 all-hands meeting]]></summary></entry><entry><title type="html">In the Cloud</title><link href="https://diffuse.science/post/in_the_cloud/" rel="alternate" type="text/html" title="In the Cloud" /><published>2025-11-15T00:00:00+00:00</published><updated>2025-11-15T00:00:00+00:00</updated><id>https://diffuse.science/post/in_the_cloud</id><content type="html" xml:base="https://diffuse.science/post/in_the_cloud/"><![CDATA[<figure class="half ">
  
    
      <a href="/assets/images/posts/Clouds.jpg" title="Cloudy diffuse features in the sky">
          <img src="/assets/images/posts/Clouds.jpg" alt="Cloudy diffuse features in the sky" />
      </a>
    
  
    
      <a href="/assets/images/posts/DiffuseClouds.png" title="MD simulation of cloudy diffuse features">
          <img src="/assets/images/posts/DiffuseClouds.png" alt="MD simulation of cloudy diffuse features" />
      </a>
    
  
  
    <figcaption>(Left) Cloudy diffuse features in the sky. (Right) MD simulation of cloudy diffuse features.
</figcaption>
  
</figure>

<h2 id="diffuse-scattering-in-the-cloud">Diffuse Scattering in the Cloud</h2>

<p>While out on a walk, as I looked up at the sky, a certain cloud formation (above left) reminded me of the <em>l</em> = 0 slice through the MD simulation of Mac1 diffuse scattering (above right). That got me thinking about the next steps for the diffUSE MD simulations (which are, of course, being performed using <a href="https://www.voltagepark.com">cloud computing resources</a>).</p>

<p>As described in the <a href="/posts/allhands/">Quarterly All Hands Meeting</a> post, we recently shared our short-term plans for the various components of the diffUSE project. We’ve already performed baseline comparisons of crystalline MD simulations of Nsp3 macrodomain (Mac1) to diffuse scattering data (see the <a href="/post/3-2-1-contact/">3-2-1 Contact</a> post). Now we want to improve the models and see what happens in the simulations when we make changes. With this in mind, we’re planning to: (1) improve the current MD model of Mac1; (2) simulate Mac1 crystals under different conditions; and (3) develop a model of a new system.</p>

<p>Thinking about (2), I contacted James Fraser to chat about what to do for the next MD simulations of Mac1. We decided to look at Mac1 in complex with ADP-ribose. This choice is timely, as Kara Zielinski just collected diffUSE diffraction data from crystals of this complex at the Cornell High-Energy Synchrotron Source (CHESS), during a recent trip from UCSF to Nozomi Ando’s lab at Cornell.</p>

<p>What will the MD simulation of diffuse scattering from crystals of Mac1 in complex with ADPr look like? Probably a lot like the ones we’ve done already, with some small changes. We’re planning to analyze the differences and find out what happens to the dynamics when different ligands bind. But we don’t really know yet what we’ll see. These moments of suspense are very common in science, but they’re absent from the stories we usually tell in the literature. The open science model we’re using on the diffUSE project enables us to document these periods of uncertainty as a part of the public narrative of the project. It feels kind of liberating.</p>

<hr />

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name>Michael Wall</name><email>mewall00@gmail.com</email></author><category term="post" /><category term="diffuse scattering" /><category term="molecular dynamics" /><category term="clouds" /><category term="planning" /><category term="open science" /><category term="meta" /><summary type="html"><![CDATA[Next steps in diffUSE MD simulations]]></summary></entry><entry><title type="html">Quarterly All-Hands Meeting Summary</title><link href="https://diffuse.science/posts/allhands/" rel="alternate" type="text/html" title="Quarterly All-Hands Meeting Summary" /><published>2025-11-10T00:00:00+00:00</published><updated>2025-11-10T00:00:00+00:00</updated><id>https://diffuse.science/posts/allhands</id><content type="html" xml:base="https://diffuse.science/posts/allhands/"><![CDATA[<h2 id="why-this-quarter-mattered"><strong>Why this quarter mattered</strong></h2>

<p>We hadn’t all gathered since our June kick-off meeting, so in mid-October we met (online) with all members of the diffUSE project to discuss the overall goals of each project team, progress made over the first few months, and goals for the next three months. We emphasized how the different pieces of the project integrate to build methods, data, models, and encodings so the community can routinely use diffuse scattering in basic biology and drug discovery.</p>

<h2 id="what-have-we-completed-in-the-first-few-months"><strong>What have we completed in the first few months?</strong></h2>

<h3 id="data-collection"><strong>Data collection:</strong></h3>

<ul>
  <li>We have collected ambient-temperature datasets from CHESS for <a href="https://diffuse.science/logbook/beamtime/20251008-chess/">lysozyme</a>, <a href="https://diffuse.science/logbook/beamtime/20251015-chess/">macrodomain</a>, <a href="https://diffuse.science/logbook/beamtime/20250924-chess/">NrdE</a>, and <a href="https://diffuse.science/logbook/beamtime/20251015-chess/">DNA fibers</a>.</li>
  <li>We have collected data at ALS on <a href="https://diffuse.science/logbook/beamtime/20250701-als/">Mac1</a> and <a href="https://diffuse.science/logbook/beamtime/20251015-17-als/">Huwe1</a> using humidity boxes and watershed sleeves with controlled transmission and beam size.</li>
  <li>We have played around with temperature modulation to probe dose dependence and mosaicity effects, using <a href="https://diffuse.science/diffuse-shipping/">samples shipped</a> from UCSF to Cornell.</li>
  <li>We implemented standardized background frames, uniform sleeve lengths, and precise humidity control to enhance map quality and cross-beamline comparability with an eye toward <a href="https://diffuse.science/posts/windows/">future</a> multi-site data collection campaigns.</li>
  <li>We have documented all collection procedures on the <a href="https://diffuse.science/logbook/beamtime/">diffUSE website logbooks</a>.</li>
</ul>

<h3 id="data-processing"><strong>Data processing:</strong></h3>

<ul>
  <li>
    <p>Developing xia2.multiplex for automated data merging, <em>mdx2</em> for data extraction, and comprehensive data quality control (QC) workflows.</p>
  </li>
  <li>
    <p>Building graphical user interfaces (GUIs) for <em>mdx2</em> to improve usability and accessibility.</p>
  </li>
  <li>
    <p><a href="https://diffuse.science/next-steps-macrodomain/">Identifying and resolving</a> bugs that arise when multiple users concurrently process the same datasets.</p>
  </li>
  <li>
    <p><a href="https://diffuse.science/posts/jax_refine/">Implementing differentiable refinement</a> by treating molecular dynamics (MD) frame weights as trainable parameters in a Pearson correlation–based objective function.</p>
  </li>
</ul>
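
<p>To make the differentiable-refinement idea concrete, here is a minimal JAX sketch on synthetic data. It is not the project's implementation: the function names, the softmax parameterization of the frame weights, the plain gradient-descent loop, and the random "frame maps" are all assumptions made for illustration. Each MD frame contributes one flattened map, the model is the weighted average of those maps, and the weights are trained to maximize the Pearson correlation with a target map.</p>

```python
import jax
import jax.numpy as jnp


def pearson(a, b):
    """Pearson correlation between two flattened intensity arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return jnp.sum(a * b) / jnp.sqrt(jnp.sum(a * a) * jnp.sum(b * b))


def loss(logits, frame_maps, target):
    """Negative correlation between the weighted frame average and the target."""
    weights = jax.nn.softmax(logits)   # assumption: weights are positive and sum to 1
    model = weights @ frame_maps       # (n_frames,) @ (n_frames, n_voxels)
    return -pearson(model, target)


@jax.jit
def step(logits, frame_maps, target, lr):
    """One gradient-descent update of the frame-weight logits."""
    value, grad = jax.value_and_grad(loss)(logits, frame_maps, target)
    return logits - lr * grad, value


def refine(frame_maps, target, n_steps=1000, lr=1.0):
    """Start from uniform weights and return the refined frame weights."""
    logits = jnp.zeros(frame_maps.shape[0])
    for _ in range(n_steps):
        logits, _ = step(logits, frame_maps, target, lr)
    return jax.nn.softmax(logits)


# Synthetic demo (hypothetical data): four random "frame maps" and a target
# built from known weights, so we can check that refinement moves toward them.
frame_maps = jax.random.normal(jax.random.PRNGKey(0), (4, 256))
true_weights = jnp.array([0.7, 0.1, 0.1, 0.1])
target = true_weights @ frame_maps

weights = refine(frame_maps, target)
print("refined weights:", weights)
print("final correlation:", pearson(weights @ frame_maps, target))
```

<p>Because the Pearson correlation ignores overall scale and offset, this kind of objective only constrains the relative contributions of the frames. The softmax is one simple way to keep those contributions interpretable as populations.</p>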

<h3 id="machine-learning-modeling"><strong>Machine Learning Modeling:</strong></h3>

<ul>
  <li>Developing <a href="https://diffuse.science/posts/modeling/">pipeline scaffolds</a> to integrate experimental structural data directly into generative model training and evaluation.</li>
  <li>Creating quantitative metrics for assessing and benchmarking ensemble data.</li>
  <li>Building a generative water model that learns to predict water molecule positions from protein structure, designed for future integration into broader generative modeling frameworks.</li>
</ul>

<h3 id="simulations"><strong>Simulations:</strong></h3>

<ul>
  <li>We <a href="https://diffuse.science/post/3-2-1-contact/">simulated</a> a Mac1 2×2×2 supercell with OPC3 waters and 279,004 atoms, which reaches 150 ns per day on Voltage Park. With refined masking and resampling, the total CC is 0.96 and the anisotropic CC is 0.56 on the H8 dataset, which sets a clear target for larger supercells and ligand or mutant comparisons.</li>
  <li>Taylor completed his rotation developing a <a href="https://diffuse.science/posts/diffuse_rotation/">simulator</a>.</li>
</ul>
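
<p>For readers wondering why the total CC and the anisotropic CC reported above can differ so much, here is a small JAX sketch on synthetic data. It assumes one common convention, in which the anisotropic signal is what remains after subtracting the mean intensity of each resolution shell. The summary above does not state how the CCs were computed, so the function names, shell binning, and data below are hypothetical.</p>

```python
import jax
import jax.numpy as jnp


def pearson(a, b):
    """Pearson correlation between two flattened intensity arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return jnp.sum(a * b) / jnp.sqrt(jnp.sum(a * a) * jnp.sum(b * b))


def anisotropic_part(intensity, shell_ids, n_shells):
    """Subtract the mean intensity of each resolution shell from its voxels."""
    shell_sum = jax.ops.segment_sum(intensity, shell_ids, num_segments=n_shells)
    shell_count = jax.ops.segment_sum(
        jnp.ones_like(intensity), shell_ids, num_segments=n_shells
    )
    shell_mean = shell_sum / jnp.maximum(shell_count, 1.0)
    return intensity - shell_mean[shell_ids]


def total_and_anisotropic_cc(sim, obs, shell_ids, n_shells):
    """Return (total CC, anisotropic CC) between simulated and observed maps."""
    cc_total = pearson(sim, obs)
    cc_aniso = pearson(
        anisotropic_part(sim, shell_ids, n_shells),
        anisotropic_part(obs, shell_ids, n_shells),
    )
    return cc_total, cc_aniso


# Synthetic demo (hypothetical data): both maps share a strong isotropic
# falloff with resolution, but only part of the anisotropic signal.
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
n_shells = 8
shell_ids = jnp.repeat(jnp.arange(n_shells), 50)
isotropic = 100.0 * jnp.exp(-0.5 * jnp.arange(n_shells))
shared = jax.random.normal(key_a, (400,))
sim = isotropic[shell_ids] + shared
obs = isotropic[shell_ids] + 0.5 * shared + jax.random.normal(key_b, (400,))

cc_total, cc_aniso = total_and_anisotropic_cc(sim, obs, shell_ids, n_shells)
print("total CC:", cc_total, "anisotropic CC:", cc_aniso)
```

<p>In this toy case the shared isotropic falloff dominates the total CC, while the anisotropic CC reflects only the agreement in the shell-subtracted features, which is why it is the harder number to improve.</p>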

<h3 id="encoding"><strong>Encoding:</strong></h3>

<ul>
  <li>We continue to <a href="https://diffuse.science/posts/encoding/">advocate</a> for conformational and compositional heterogeneity-encoding strategies.</li>
  <li>We have developed a <a href="https://diffuse.science/posts/multi_to_ens/">script</a> to translate between encodings for multiconformer and ensemble representations.</li>
  <li>We are working on developing a standalone script and a COOT integration script for conformational heterogeneity.</li>
</ul>

<h3 id="infrastructure-and-open-science"><strong>Infrastructure and Open Science:</strong></h3>

<ul>
  <li>We have access to Voltage Park compute and S3 storage via the command line, which will make sharing maps and models easier.</li>
  <li><a href="https://diffuse.science/posts/">16 blog posts</a> and 6 beamtime <a href="https://diffuse.science/logbook/">logbooks</a> to date!</li>
</ul>

<h2 id="what-is-up-for-the-next-3-months"><strong>What is up for the next 3 months?</strong></h2>

<p><strong>Data Collection:</strong> Catalog all existing data, fill gaps, complete background series at ALS, finalize hardened collection procedures that travel across LBL and CHESS, and post collection reports on the site. Develop shared checklists to coordinate and standardize future data-collection cycles.</p>

<p><strong>Data Processing:</strong> Converge on a single, documented workflow, generate preliminary maps for all ALS and CHESS datasets, produce fine maps for GOODVIBES and DISCOBALL, stand up a CHESS 2026-1 pipeline, and publish processing reports on the site.</p>

<p><strong>Modeling:</strong> Ship an initial pipeline that accepts maps for guided sampling.</p>

<p><strong>Encoding:</strong> Land final working group approval, publish the schema and examples, and connect the web app to our catalog so processed maps, models, and metadata are searchable and shareable.</p>

<p><strong>Infrastructure and sharing science:</strong> More blog posts!</p>

<p>We will meet in the Bay Area in January and report back more after that.</p>]]></content><author><name>James Fraser</name><email>jfraser@fraserlab.com</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[A report on our October 2025 all-hands meeting]]></summary></entry></feed>