<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://diffuse.science/feed.xml" rel="self" type="application/atom+xml" /><link href="https://diffuse.science/" rel="alternate" type="text/html" /><updated>2026-04-17T16:38:07+00:00</updated><id>https://diffuse.science/feed.xml</id><title type="html">The DiffUSE Project</title><subtitle>The DiffUSE Project</subtitle><entry><title type="html">The Tortured Proteins Department, Episode 13</title><link href="https://diffuse.science/posts/TTPD-13/" rel="alternate" type="text/html" title="The Tortured Proteins Department, Episode 13" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://diffuse.science/posts/TTPD-13</id><content type="html" xml:base="https://diffuse.science/posts/TTPD-13/"><![CDATA[<h1 id="episode-13-long-live">Episode 13: Long Live!</h1>

<p>We chat about preprints, AI grad students, the diffUSE project, an April Fools blog, NIH funding news, and the Fraser Lab “broken monitor” April Fools’ Day prank — caught live during recording!</p>

<p><img src="https://github.com/user-attachments/assets/6ab7f026-2549-49dc-a33c-2ec204215385" alt="Fraser Lab broken monitor April Fools' Day prank - James Fraser photographing his cracked monitor screen during the recording session with Stephanie Wankowicz" class="img-fluid" /></p>

<p>Preprints:</p>
<ul>
  <li><a href="https://www.biorxiv.org/content/10.64898/2026.03.08.710389v1">Intrinsic dataset features drive mutational effect prediction by protein language models</a></li>
  <li><a href="https://www.biorxiv.org/content/10.64898/2026.03.08.710403v1">Ribosome Molecular Aging Shapes Translation Dynamics</a></li>
</ul>

<p>AI Grad Student Articles:</p>
<ul>
  <li><a href="https://www.science.org/content/article/why-i-may-hire-ai-instead-graduate-student">Why I may ‘hire’ AI instead of a graduate student</a></li>
  <li><a href="https://www.anthropic.com/research/vibe-physics">Vibe physics: The AI grad student</a></li>
</ul>

<p>Other Links:</p>
<ul>
  <li><a href="https://diffuse.science/">diffUSE Project</a></li>
  <li><a href="https://jgreener64.github.io/posts/technical_report/">April Fools Blog</a></li>
</ul>

<p>NIH News:</p>
<ul>
  <li><a href="https://grant-witness.us/funding_curves.html">Funding Curve</a></li>
  <li>
    <p><a href="https://www.pnas.org/doi/10.1073/pnas.2527755123">How the 2025 NIH grant terminations varied by researchers’ demographic groups</a></p>
  </li>
  <li><a href="https://open.spotify.com/episode/7jyXj4QUI4zmZgnfWfaY3x?si=ucQCu_WgShel6V0X4cbzUQ">Spotify</a></li>
  <li><a href="https://podcasts.apple.com/us/podcast/long-live/id1802420696?i=1000759838149">Apple Podcasts</a></li>
</ul>

<iframe data-testid="embed-iframe" style="border-radius:12px" src="https://open.spotify.com/embed/episode/7jyXj4QUI4zmZgnfWfaY3x?utm_source=generator" width="100%" height="352" frameborder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie@wankowiczlab.com</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[The Tortured Proteins Department Podcast, Episode 13]]></summary></entry><entry><title type="html">Expanding The DiffUSE Project</title><link href="https://diffuse.science/posts/expanding_diffuse/" rel="alternate" type="text/html" title="Expanding The DiffUSE Project" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://diffuse.science/posts/expanding_diffuse</id><content type="html" xml:base="https://diffuse.science/posts/expanding_diffuse/"><![CDATA[<p>Biology is motion. From ecosystems to molecules, nothing is ever still. Life relies on this constant motion to function. Yet we tend to study biology through static snapshots. In structural biology, we often interpret macromolecules as single, static structures. These static structures have been transformative, enabling decades of progress in drug design, disease understanding, and protein engineering. But we are reaching a plateau in the questions we can answer with them. Because proteins do not function as single structures. They function through their conformational ensemble, the range of states a macromolecule can sample. And yet, we have almost no data on these ensembles. The tools to collect and model it at scale do not exist. The infrastructure to share and interpret it has not been built.</p>

<p><strong>What could we discover about biology if we had protein ensembles at the scale of the Protein Data Bank?</strong></p>

<p><em>Could we understand how drug efficacy and resistance arise? Could we better explain why mutations cause disease, or what makes a designed enzyme functional? Could we deliberately engineer dynamics to bias cell signaling?</em></p>

<p>Today, we are excited to announce that the DiffUSE Project is expanding. We are building the infrastructure to collect and analyze protein ensembles at the scale of the Protein Data Bank. Because many of the most important biological questions can only be answered through ensembles.</p>

<p>The DiffUSE Project, a <a href="https://radial.org/">Radial project funded by the Astera Institute</a>, began by asking how we could rethink the pipeline for extracting the faint yet information-rich diffuse scattering signal in the “background” of X-ray crystallography experiments. Diffuse scattering may provide key, but often untapped, information on the heterogeneity of macromolecules in a crystal. With funding and operational support from the <a href="https://astera.org/">Astera Institute</a>, we assembled a purpose-built team spanning expertise across Astera and multiple institutions and began to re-engineer each component of that pipeline so that diffuse scattering can reveal structural ensembles.</p>

<p>But diffuse scattering is just one window into protein conformational ensembles. Different techniques uncover different facets of the same underlying ensemble reality. Cryo-EM captures heterogeneity across length scales. NMR resolves timescale-specific motions. SAXS reports on global shape fluctuations. HDX-MS maps solvent accessibility changes. No single method can reveal the full dynamic landscape. But we do not yet know which techniques, or which combination, will best uncover these ensembles. So we start with a not-so-simple question: which data are most fruitful for understanding biology?</p>

<p>But collecting data is only part of the challenge. Every stage of structural biology today reinforces a static worldview. Algorithms model single conformations. Representations emphasize static structures. Metrics evaluate fit to a single model. Biological interpretations are based on static structures. Addressing any piece in isolation will fall short. The entire pipeline must be redesigned.</p>

<p>The DiffUSE Project is completely rethinking how we collect, process, model, encode, disseminate, and interpret dynamic structural data. We are designing tools to enable downstream applications that meaningfully impact biology.</p>

<p><img width="1920" height="1080" alt="Diffuse Project Areas" src="https://github.com/user-attachments/assets/ecb768aa-789b-4b19-8efc-721e72a28cfe" /></p>

<p>At every step, we are guided by the following questions:</p>
<ul>
  <li>What structural ensemble data are we missing, and how do we build the technologies to capture it?</li>
  <li>Where will structural ensemble data best help reveal biological function?</li>
  <li>How do we create tools that empower everyone, not just specialists, to work with conformational ensembles?</li>
  <li>How do we openly disseminate what we build to maximize scientific impact?</li>
</ul>

<p>We are taking this approach because we cannot get there alone. AlphaFold succeeded because decades of experimental structure determination built the data foundation it needed. No equivalent foundation exists for ensembles. Building that foundation and building the prediction methods must happen together — at a scale no single group can achieve. That is why we are not just generating data ourselves. We are building the tools, methods, and infrastructure to enable the entire community to collect, curate, and benchmark dynamic structural data.</p>

<p>This entire project is also a metascience experiment that asks what it truly takes to tackle a large, systematic, and deeply integrated scientific problem. Some of what lies ahead is engineering. Some of it is science. Some of it is community building. We are pursuing this work both internally at Radial and through a growing network of partnerships and collaborations. Experimenting not only with the science itself but with how best to do it.</p>

<p><strong>What are we building?</strong></p>

<p>The DiffUSE Project will focus on four areas, each advancing a different part of the dynamic structural pipeline simultaneously. Technique-focused spokes will transform specialized dynamics techniques into routine, accessible methods. Algorithms will uncover hidden conformational states and integrate signals across techniques to help reveal the full dynamic landscape and provide key integration points for ensemble prediction. Infrastructure will evolve the PDB from a static archive into a living database capable of storing, validating, and querying dynamic structural data. And Biological Impact will ground our technical advances by asking: how does understanding dynamics actually change what we can understand, predict, design, and discover in biology?</p>

<p><strong>Technique-focused spokes:</strong> Every technique for capturing protein ensembles faces its own barriers to making the data truly usable. We aim to experiment with ways to overcome these barriers and make these techniques routine and accessible for the broader community. Current methods often discard the dynamic signal, and none were designed to scale to the thousands of ensembles needed to uncover general principles. Each spoke tackles these challenges for a specific technique, producing open algorithms, protocols, models, and metrics, all designed from the start to be adopted and built upon by others. We do not yet know which techniques will generate truly additive data, which approaches will scale, or which will result in the best biological insights. We plan to run parallel experiments, and we will follow the evidence, expanding what works, rethinking what doesn’t, and sharing what we learn along the way.</p>

<p><strong>Algorithms:</strong> No single technique captures the full picture of how a protein moves. Each method sees a different slice of the conformational landscape. To tackle this, we are developing algorithms along three fronts: guidance frameworks that steer ensemble models using experimental data, machine learning approaches that learn directly from experimental measurements rather than structures, and agentic modeling pipelines that autonomously build, evaluate, and refine ensemble models. Everything we build is open, designed for the community to use, extend, and improve. Our first major algorithm is <a href="https://github.com/diff-use/sampleworks">Sampleworks</a>, a platform for testing new guidance frameworks, stress-testing ensemble predictions, understanding where they fail, and establishing the datasets needed to train the next generation of machine learning models on conformational ensembles rather than static snapshots.</p>

<p><strong>Infrastructure:</strong> The PDB transformed structural biology by providing the community with a shared, standardized, and trusted platform for depositing and retrieving structures. But the PDB was designed for static structures. Without accessible ensemble data, researchers default to inconsistent practices: ignoring the inherent heterogeneity in the data and training models on static structures that don’t reflect how proteins actually behave. These limitations don’t stay contained; they propagate through the entire discovery workflow, leading to biased biological conclusions.</p>

<p>We are working on evolving the PDB into a living database that supports dynamic structural data through three advances. First, ensemble-aware validation, because today, when an ensemble model is deposited, only part of it is validated against experimental data. Validation metrics are essential for model deposition and for benchmarking new machine learning models. Second, heterogeneity-aware encoding enables queries that let users do things currently impossible: retrieve only ligand-bound states across the database, find entries where a specific loop adopts multiple conformations, or compare ensemble representations across related proteins. Third, a unified, continually updated repository where experimental data serve as ground truth and the models explaining them improve over time as algorithms advance. The endgame is to enable a virtuous cycle where labs deposit dynamic data, the community builds better ensemble models, those models generate new biological insights, and the standardized data fuels structural ensemble prediction.</p>

<p><strong>Biological Impact:</strong> We want our efforts to be grounded in a simple but demanding question: how does understanding dynamics change what we can predict, design, and discover? To answer that, we are building ensemble-aware representations that quantify what has long been hard to measure (conformational and solvent entropy, flexibility, and long-range dynamic coupling) and validating them against real biological problems in binding, enzyme design, and allosteric regulation.
But metrics alone aren’t enough. We aim to help understand how dynamics propagate and change protein function. How does a single mutation redistribute an entire conformational ensemble and alter functional output? How do changes in hydration networks reshape binding? Can we trace these effects well enough to use dynamics not just as a descriptor, but as a design principle, such as designing allosteric switches by tuning the dynamic landscape rather than the static structure?</p>

<p><strong>What’s next?</strong></p>

<p>I am thrilled to be joining the DiffUSE team full-time as Scientific Program Director. In this role, I will design and direct the roadmap to unlock protein ensembles. I will be directly leading the efforts in algorithms, infrastructure, and biology, and guiding technical deep dives. My passion has always been building tools and working with open data. I see this role and this project as an incredible opportunity to deliver on this vision of unlocking protein ensembles at scale. The biology questions I care most about, the role that ensembles and entropy play in protein function, and how we leverage that understanding for design, can only be answered once we commit to an ensemble point of view. That is exactly what the DiffUSE Project intends to do.
We’re also growing the <a href="https://diffuse.science/work-with-us/">team</a> across algorithms, infrastructure, and biology, and <a href="https://diffuse.science/work-with-us/">engaging with those looking to build open tools to make this shift happen</a>. And we are doing it all in the <a href="https://diffuse.science/publications/">open</a>.</p>

<p>Our point is this - <em>break’s over</em>.</p>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie@wankowiczlab.com</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[Expanding The DiffUSE Project]]></summary></entry><entry><title type="html">To ensembles and beyond!</title><link href="https://diffuse.science/posts/sampleworks/" rel="alternate" type="text/html" title="To ensembles and beyond!" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://diffuse.science/posts/sampleworks</id><content type="html" xml:base="https://diffuse.science/posts/sampleworks/"><![CDATA[<p>Most structural biology experimental observables, including those from X-ray crystallography and cryo-electron microscopy (cryo-EM), reflect time and ensemble averages over millions of conformations. Thus, only conformational ensemble models accurately represent the experimental measurements <a href="https://pubmed.ncbi.nlm.nih.gov/36639584/">(Lane 2021)</a>. However, due to limitations in modeling algorithms, most structural modeling approaches tend to model only the ground state, discarding the rich heterogeneity in the raw data that reflects the broader conformational ensemble (<a href="https://pubmed.ncbi.nlm.nih.gov/33020277/">Bozovic et al. 2020</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/24504120/">Kuzmanic et al. 2014</a>). While new algorithms have pushed the field beyond a singular state, they only capture a fraction of the accessible states and are often limited by resolution (<a href="https://pubmed.ncbi.nlm.nih.gov/38904665/">Wankowicz et al. 2024</a>, <a href="https://doi.org/10.7554/eLife.103797.3">Flowers et al. 2025</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/34726164/">Ploscariu et al. 2021</a>).</p>

<p>Generative biomolecular structure predictors like AlphaFold 3 (<a href="https://www.nature.com/articles/s41586-024-07487-w">Abramson et al., 2024</a>), Boltz-1/2 (<a href="https://www.biorxiv.org/content/10.1101/2025.06.14.659707v1">Passaro et al., 2025</a>, <a href="https://www.biorxiv.org/content/10.1101/2024.11.19.624167v3">Wohlwend et al., 2025</a>), RoseTTAFold 3 (<a href="https://www.biorxiv.org/content/10.1101/2025.08.14.670328v2">Corley et al., 2025</a>), and Protenix (<a href="https://www.biorxiv.org/content/10.64898/2026.02.05.703733v3">Protenix Team, 2026</a>) produce remarkably accurate single-state structures. These structure predictors sample from a distribution of conformations learned from data–but largely from the single-state structures they were trained on, implicitly assuming a single protein sequence maps to just one conformation. We have only begun to understand the extent to which these models learn the actual conformational heterogeneity of each protein, or simply memorize the individual protein structures (<a href="https://www.nature.com/articles/s41467-024-51801-z">Chakravarty et al., 2024</a>; <a href="https://elifesciences.org/articles/75751">del Alamo et al., 2022</a>).</p>

<p>“Diffusion”-based generative structure predictors work by predicting a series of updates that map from noise to a final protein structure. This creates a “trajectory” that can be steered by modifying the updates to better match some criterion, such as how well the model fits experimental data. This “guidance” approach was recently used (<a href="https://www.biorxiv.org/content/10.64898/2026.02.27.708490v1">Maddipatla et al., 2026</a>) to steer Protenix to identify previously unmodeled states in existing PDB entries. CryoBoltz (<a href="http://arxiv.org/abs/2506.04490">Raghu et al. 2025</a>) similarly guides Boltz-1 to better model cryo-EM density maps when multiple conformations are found during classification of the particle stack. But to date, there have been no systematic explorations of how well different structure predictors respond to guidance, or of the trade-offs involved in applying this guidance to sample ensembles consistent with the experimental data. To simplify and generalize experiments that generate ensemble predictions and use ensemble data, we introduce Sampleworks.</p>

<h2 id="what-is-sampleworks"><strong>What is Sampleworks?</strong></h2>

<p>Sampleworks is a framework for integrating structural biology observables, structure predictors, and guidance to improve the modeling of conformational ensembles. Sampleworks does this by defining a common interface for structure predictors and linking that interface to methods for computing experimental observables, like an X-ray electron density map. Sampleworks then applies existing methods to guide generative models during inference, steering sample generation towards an ensemble that reflects the data.</p>
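<p>To make the idea of a common predictor interface plus guidance concrete, here is a minimal toy sketch. It is not the Sampleworks API: the names <code>StructurePredictor</code>, <code>ToyPredictor</code>, <code>data_gradient</code>, and <code>guided_sample</code> are hypothetical, the "predictor" simply relaxes toward one memorized structure, and the data term is a squared distance to a target rather than a real density fit.</p>

```python
from typing import Protocol

import numpy as np


class StructurePredictor(Protocol):
    """Hypothetical common interface: one denoising update per call."""

    def step(self, coords: np.ndarray, t: float) -> np.ndarray: ...


class ToyPredictor:
    """Stand-in predictor that relaxes coordinates toward one memorized state."""

    def __init__(self, reference: np.ndarray) -> None:
        self.reference = np.asarray(reference, dtype=float)

    def step(self, coords: np.ndarray, t: float) -> np.ndarray:
        # Move 20% of the way toward the memorized structure; t is unused here.
        return coords + 0.2 * (self.reference - coords)


def data_gradient(coords: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Gradient of -0.5 * ||coords - target||^2, a placeholder for a density-fit term."""
    return target - coords


def guided_sample(predictor, start, target, weight, n_steps=50):
    """Alternate predictor updates with a guidance nudge of the given weight."""
    coords = np.asarray(start, dtype=float)
    for i in range(n_steps):
        t = 1.0 - i / n_steps  # noise level running from 1 toward 0
        coords = predictor.step(coords, t)
        coords = coords + weight * data_gradient(coords, target)
    return coords


reference = np.zeros((4, 3))   # the single state the toy predictor "knows"
target = np.ones((4, 3))       # the state favored by the (toy) data
start = np.full((4, 3), 5.0)   # arbitrary starting coordinates

unguided = guided_sample(ToyPredictor(reference), start, target, weight=0.0)
guided = guided_sample(ToyPredictor(reference), start, target, weight=0.2)
```

<p>In this toy, the unguided run collapses onto the memorized state, while the guided run settles between that state and the data-favored one. Any object with a compatible <code>step</code> method could be passed to the same loop, which is the point of a shared interface.</p>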

<p>To date, we’ve implemented:</p>

<ul>
  <li>Two guidance methods: diffusion posterior sampling (<a href="https://arxiv.org/abs/2209.14687">Chung et al., 2024</a>; <a href="http://arxiv.org/abs/2406.04239">Levy et al., 2024</a>) and Feynman-Kac steering (<a href="http://arxiv.org/abs/2501.06848">Singhal et al., 2025</a>)</li>
  <li>Wrappers for <a href="https://github.com/jwohlwend/boltz">Boltz-1 and -2</a>; <a href="https://github.com/bytedance/Protenix">Protenix</a>; and <a href="https://github.com/RosettaCommons/foundry">RoseTTAFold 3</a>.</li>
  <li>Guidance based on real-space electron density maps.</li>
</ul>

<p>We will continue to expand the guidance methods, structure predictors, and the kinds of experimental data used for guidance, and we welcome contributions from this community. You can use Sampleworks, follow our progress, and contribute <a href="https://github.com/diff-use/sampleworks">here</a>!</p>

<h2 id="where-were-at"><strong>Where we’re at</strong></h2>

<p>Sampleworks’s first goal is to benchmark the extent to which diffusion-based structure predictors can be guided to generate accurate structural ensembles. Specifically, we ask whether these models can produce out-of-distribution conformations and, when they do, how well those conformations balance fit to experimental data against geometric quality.</p>

<p>Since these structure predictors were trained on single-state structures, this is a natural test of their ability to generalize. In this first test, we guide the predictors using synthetic electron density maps that mix two conformations. We evaluate how well each predictor recovers the synthetic ensemble while still satisfying protein-geometry constraints.</p>
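<p>As a rough illustration of what a synthetic map that mixes two conformations can look like, the sketch below builds an occupancy-weighted sum of two single-state maps. The function names, the one-Gaussian-per-atom density model, the grid, and the 0.6/0.4 occupancies are all assumptions for illustration; real maps would use atomic form factors, B-factors, and the crystal's unit cell.</p>

```python
import numpy as np


def gaussian_density(coords, axis, sigma=0.7):
    """Sum one isotropic Gaussian per atom on a cubic grid (no real form factors)."""
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    rho = np.zeros(gx.shape)
    for x, y, z in coords:
        r2 = (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2
        rho += np.exp(-r2 / (2.0 * sigma**2))
    return rho


def mixed_density(conf_a, conf_b, occ_a, axis):
    """Occupancy-weighted sum of the two single-state maps."""
    rho_a = gaussian_density(conf_a, axis)
    rho_b = gaussian_density(conf_b, axis)
    return occ_a * rho_a + (1.0 - occ_a) * rho_b


axis = np.linspace(-4.0, 4.0, 17)                      # grid coordinates in angstroms
conf_a = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
conf_b = np.array([[0.0, 0.0, 0.0], [0.0, 1.5, 0.0]])  # second atom moved
rho_mix = mixed_density(conf_a, conf_b, occ_a=0.6, axis=axis)
```

<p>The atom shared by both states keeps full weight in the mixed map, while the atom that moves is split into two weaker peaks, which is the signal a guided predictor would need to reproduce with an ensemble.</p>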

<p><img src="/assets/images/posts/6b8x_w_wo.png" alt="6B8X With and Without Guidance" /></p>

<p><em>(Left) Without guidance, Boltz-2 does not predict the two conformations (grey) in 6B8X when run with multiple random seeds. (Right) Guidance using synthetic density representing the two conformations enables Boltz-2 to predict both.</em></p>

<p>We’ve now tested our initial structure predictors with guidance for 40 different structures from the PDB that have known alternate conformations in the original depositions. In the example above, for the protein phosphatase PTP1B (PDB code 6B8X), the predictor, without guidance, falls back on its training data and can only generate structures in one of the two known conformations. Encouragingly, when incorporating guidance with electron density representing both conformations, we observe the structure predictors now predicting two states. However, this is not universally true across different proteins and models. Because Sampleworks is modular and easily accommodates different structure predictors, we can quickly test them all at different guidance levels across a wide range of proteins.</p>

<div style="display: flex; gap: 10px;">
  <img src="/assets/images/posts/17_tradeoff_conceptual_weak_zoomed.png" alt="Tradeoff Weak" style="width: 50%;" />
  <img src="/assets/images/posts/18_tradeoff_conceptual_strong.png" alt="Tradeoff Strong" style="width: 50%;" />
</div>

<p><em>Geometry penalty vs. fit to data for each model’s ensemble prediction on 40 examples of multiple conformations when guided with synthetic density. (Left) Weak guidance, zoomed close to show differences between models. (Right) Strong guidance, resulting in much larger geometry penalties.</em></p>

<p>We also assessed the trade-offs between fit to electron-density data and geometry. To determine fit to electron density, we use the Real Space Correlation Coefficient (RSCC). We also compute a composite geometry score, which becomes larger with increased clashes, bad backbone dihedrals, or unrealistic bond lengths. With relatively weak guidance, we can see that the models are all able to improve RSCC without incurring too much cost to protein geometry. Paradoxically, when we use stronger guidance, most structure predictors fall apart completely, and the RSCC gets worse! This probably means that strong guidance pushes the predictors’ trajectories outside what they’ve been trained on, far enough that they cannot recover. Despite their very similar architectures and training data, these structure predictors differ in how they respond to guidance.</p>
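<p>For readers unfamiliar with RSCC, it is commonly computed as the Pearson correlation between observed and model-derived density values over a set of grid points. The sketch below shows that calculation under simple assumptions: the function name <code>rscc</code>, the optional boolean <code>mask</code>, and the random test maps are illustrative and not taken from Sampleworks.</p>

```python
import numpy as np


def rscc(rho_obs, rho_calc, mask=None):
    """Pearson correlation between two density maps, optionally within a mask."""
    obs = np.asarray(rho_obs, dtype=float)
    calc = np.asarray(rho_calc, dtype=float)
    if mask is not None:
        obs, calc = obs[mask], calc[mask]
    obs = obs.ravel() - obs.mean()
    calc = calc.ravel() - calc.mean()
    denom = np.sqrt(np.sum(obs**2) * np.sum(calc**2))
    if denom == 0.0:
        raise ValueError("RSCC is undefined for a constant map")
    return float(np.sum(obs * calc) / denom)


rng = np.random.default_rng(0)
observed = rng.random((8, 8, 8))
same_shape = 2.0 * observed + 3.0               # rescaled and offset copy
noisy = observed + 0.5 * rng.random((8, 8, 8))  # copy with added noise
```

<p>Because the correlation ignores overall scale and offset, <code>rscc(observed, same_shape)</code> is 1 to within rounding, while the noisy copy scores lower. In practice the mask would usually restrict the comparison to grid points near the atoms of interest.</p>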

<p>We are just beginning to learn how best to guide each model to better model biologically relevant ensembles. We are already seeing signs that the models are limited in their current forms, and we will be working to explore and overcome those limitations. In the coming months, we’ll have more to share about the first results shown above and about the experiments in progress now. We’ll be looking to expand the range of predictors and experimental data we can work with, and to engage with the community to make protein conformational ensembles more accessible for everyone.</p>]]></content><author><name>Karson Chrispens, Marcus Collins, PhD, and Stephanie Wankowicz, PhD</name></author><category term="posts" /><category term="meta" /><category term="modeling" /><summary type="html"><![CDATA[Sampleworks is a framework for integrating structural biology observables, structure predictors, and guidance to improve the modeling of conformational ensembles.]]></summary></entry><entry><title type="html">Shake It Up!</title><link href="https://diffuse.science/post/shake-it-up/" rel="alternate" type="text/html" title="Shake It Up!" /><published>2026-02-22T00:00:00+00:00</published><updated>2026-02-22T00:00:00+00:00</updated><id>https://diffuse.science/post/shake-it-up</id><content type="html" xml:base="https://diffuse.science/post/shake-it-up/"><![CDATA[<h2 id="md-simulations-of-changes-in-diffuse-scattering-depending-on-ligand-binding"><em>MD simulations of changes in diffuse scattering depending on ligand binding</em></h2>

<figure class="third ">
  
    
      <a href="/assets/images/posts/2026-02-23/Mac1_NoADPr.png" title="MD diffuse with CHES (buffer molecule)">
          <img src="/assets/images/posts/2026-02-23/Mac1_NoADPr.png" alt="MD diffuse with CHES" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-02-23/Mac1_WithADPr.png" title="MD diffuse with ADPr">
          <img src="/assets/images/posts/2026-02-23/Mac1_WithADPr.png" alt="MD diffuse with ADPr" />
      </a>
    
  
    
      <a href="/assets/images/posts/2026-02-23/Mac1_DeltaADPr.png" title="Difference ADPr - CHES">
          <img src="/assets/images/posts/2026-02-23/Mac1_DeltaADPrAniso.png" alt="Difference ADPr - CHES" />
      </a>
    
  
  
    <figcaption>MD simulations of Mac1 diffuse scattering change depending on ligand binding. <em>Left</em>. Mac1 with CHES (buffer molecule). <em>Center</em>. Mac1 with ADPr. <em>Right</em>. Difference.
</figcaption>
  
</figure>

<h2 id="what-did-we-find">What did we find?</h2>

<p>In a previous <a href="/post/in_the_cloud/">post</a> I talked about our plan to perform MD simulations of Mac1 crystals under different conditions. We had just started an MD simulation of Mac1 in complex with ADP-ribose (ADPr), and were waiting for it to finish. We weren’t sure how different it would be from the <a href="/post/lets-dance/">initial simulations</a> of Mac1 without ADPr, in which a CHES buffer occupies the ADPr binding site.</p>

<p>Now we know more. Visual comparisons (images above) show that the rich anisotropic diffuse features in the maps with CHES buffer (left) vs. ADPr (center) in the binding pocket have similar patterns of peaks and troughs that differ in detail. Subtracting them reveals the differences more clearly (right). Quantitative analysis shows that the variations in the difference map are comparable in strength to the intensities in either individual map. This result is encouraging as it indicates that such differences might be observed in experiments.</p>

<h2 id="its-a-trap">It’s a trap!</h2>

<p>Along the way we encountered a common pitfall in making these kinds of comparisons: due to an indexing ambiguity in the P4<sub>3</sub> space group, the structures of Mac1 used for simulations with and without ADPr were solved using different definitions of the lattice vectors, with the <em>h</em> and <em>k</em> axes swapped, and the <em>l</em> axis reversed (compare PDB IDs <a href="https://www.rcsb.org/structure/7TX0">7TX0</a> and <a href="https://www.rcsb.org/structure/7TX3">7TX3</a>). The diffUSE modeling team worked out how to make the simulated diffuse maps consistent at our recent <a href="/posts/allhands/">all hands meeting</a>, enabling us to perform a controlled comparison of the simulations.</p>
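<p>The reindexing described above can be sketched for a diffuse map stored as a 3D array on an origin-centered <em>hkl</em> grid. This is a minimal illustration and not the team's actual procedure: the function name, the array layout (axes ordered <em>h</em>, <em>k</em>, <em>l</em> with <em>l</em> = 0 at the center), and the choice of which map to reindex are all assumptions.</p>

```python
import numpy as np


def reindex_k_h_minus_l(intensity):
    """Return J with J[h, k, l] = I[k, h, -l] on an origin-centered hkl grid."""
    intensity = np.asarray(intensity)
    n_h, n_k, n_l = intensity.shape
    if n_h != n_k:
        raise ValueError("h and k axes must have the same length to be swapped")
    if n_l % 2 == 0:
        raise ValueError("l axis needs an odd length so that l = 0 is the center")
    swapped = np.transpose(intensity, (1, 0, 2))  # exchange the h and k axes
    return swapped[:, :, ::-1]                    # reverse l about l = 0


rng = np.random.default_rng(1)
map_a = rng.random((5, 5, 3))  # toy map, h and k in -2..2, l in -1..1
map_a_reindexed = reindex_k_h_minus_l(map_a)
```

<p>Applying the operation twice returns the original map. Once both maps share one indexing convention and the same grid sampling, a difference map is a voxel-by-voxel subtraction.</p>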

<h2 id="what-next">What next?</h2>

<p>The next step is to compare both of these simulations with data recently collected at CHESS (see <a href="https://diffuse.science/logbook/beamtime/20251105-chess/">logbook</a>), in one of a series of diffUSE beam times that are expected to yield a large number of datasets. These runs have already revealed that diffuse data are <a href="/posts/allhands/">reproducible between CHESS and ALS beamlines</a>. Data from Mac1 +/- ADPr are now in the processing pipeline; we’re eager to see how Mac1 diffuse scattering changes upon ligand binding, and whether MD simulations can help explain what we see.</p>

<hr />

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name>Michael Wall</name><email>mewall00@gmail.com</email></author><category term="post" /><category term="diffuse scattering" /><category term="molecular dynamics" /><category term="modeling" /><category term="open science" /><category term="meta" /><summary type="html"><![CDATA[MD simulations of changes in diffuse scattering depending on ligand binding]]></summary></entry><entry><title type="html">DiffUSE January 2026 Retreat: From Coast to Coast, Diffuse Scattering Reproduces</title><link href="https://diffuse.science/posts/allhands/" rel="alternate" type="text/html" title="DiffUSE January 2026 Retreat: From Coast to Coast, Diffuse Scattering Reproduces" /><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><id>https://diffuse.science/posts/allhands</id><content type="html" xml:base="https://diffuse.science/posts/allhands/"><![CDATA[<div class="notice" style="font-style: italic;">
DiffUSE is a Radial Project by <a href="https://astera.org">Astera</a>. This initiative aims to make diffuse X-ray scattering a routine tool for understanding protein dynamics in basic biology and drug discovery.
</div>

<h2 id="why-this-retreat-mattered"><strong>Why This Retreat Mattered</strong></h2>

<p>In late January, the DiffUSE Project team gathered in person for our first progress meeting at Astera’s headquarters in Emeryville, California. Since our October online meeting, every team has made substantial progress.</p>

<p>The retreat brought together team members working on data collection, data processing, molecular dynamics simulations, machine learning modeling, infrastructure, and open science to assess progress against our six-month goals and chart the path forward.</p>

<p>Perhaps the most exciting development is a deceptively simple one: diffuse scattering data collected at CHESS (Cornell) and ALS (Berkeley) are reproducible. This cross-country validation marks a critical step toward making diffuse scattering a routine tool for structural biology.</p>

<h2 id="what-have-we-accomplished-since-october"><strong>What Have We Accomplished Since October?</strong></h2>

<h3 id="data-collection"><strong>Data Collection</strong></h3>

<p>Kara Zielinski (Fraser Lab, UCSF) reported on an intensive fall data collection campaign:</p>

<ul>
  <li><strong>9 beamtimes</strong> since project inception across two synchrotrons (CHESS and ALS)</li>
  <li><strong>18 participants</strong> contributed to data collection</li>
  <li><strong>7 sample systems</strong>: Mac1, NrdE, Lysozyme, DNA fibers, ATCase, Insulin, and Huwe1</li>
  <li><strong>129 “good” datasets</strong> collected (no data collection errors)</li>
</ul>

<p>The team systematically explored experimental perturbations:</p>

<ul>
  <li><strong>Temperature</strong>: Data collected at 100K (cryo), 220-275K (intermediate), and 310-315K (elevated), though sample handling for intermediate temperatures requires further optimization</li>
  <li><strong>Ligands</strong>: Mac1 + ADPr (11 datasets from CHESS and ALS combined) and Mac1 + small molecule “opener” (6 datasets from ALS)</li>
  <li><strong>Radiation damage mitigation</strong>: Vector scans implemented at CHESS to spread dose across radiation-sensitive samples like NrdE and ATCase</li>
</ul>

<p>Beamline-specific improvements included:</p>

<ul>
  <li><strong>ALS</strong>: Explored dose dependence, wavelength effects, and exposure time optimization; addressed collimator ring scatter issues at 14 keV</li>
  <li><strong>CHESS</strong>: Continued X-ray aperture optimization for background reduction</li>
</ul>

<p><img src="/assets/images/posts/2026-02-02/2026_crystals_spm_allhands.png" alt="Crystals of the initial protein systems tested by the DiffUSE team" title="Initial protein systems tested by the diffUSE team" />
<em>Slide shared at the DiffUSE project’s retreat showcasing crystals of initially tested protein systems.</em></p>

<h3 id="data-processing"><strong>Data Processing</strong></h3>

<p>Steve Meisburger (Cornell/CHESS) presented major advances in data processing tools and a landmark reproducibility result.</p>

<p><strong><a href="https://github.com/diff-use/mdx2">mdx2</a></strong> is an open-source software package for processing and analyzing diffuse X-ray scattering data. Development has accelerated with a new team (Steve Meisburger, Justin Biel, Joseph Lee) and modern development practices, including version control, issue tracking, and code review. Version 10.3 was released in December 2025 with:</p>

<ul>
  <li>Containerized deployment via <code class="language-plaintext highlighter-rouge">conda install -c conda-forge mdx2</code></li>
  <li>Jupyter Lab environment integration</li>
  <li>Live processing capability on Voltage Park during beamtimes</li>
</ul>

<p><strong>Reference datasets from CHESS</strong> now span multiple systems (Mac1, NrdE, DNA, ATCase, Insulin) with systematic tracking through integration, merging, and fine map generation stages.</p>

<p><strong>The headline result</strong>: Diffuse scattering is reproducible between CHESS and ALS. Side-by-side comparisons of Mac1 diffuse maps from both beamlines show consistent features, validating that the signal is robust across different detector systems, beam profiles, and facilities. This East-meets-West reproducibility is foundational for any future multi-site data collection campaigns.</p>

<p><img src="/assets/images/posts/2026-02-02/reproducible_ds_allhands.png" alt="Scattering comparison across beamlines" title="DiffUSE Scattering is reproducible across coasts!" />
<em>Slides shared at the DiffUSE project’s retreat showcasing reproducibility across beamlines.</em></p>

<p>Additional findings from DNA crystal analysis revealed that correlated disorder differs between room temperature and 100K conditions, even when the static structures appear similar—and that diffuse signal extends beyond the Bragg resolution limit, suggesting untapped information content.</p>

<p>A <strong>Galaxy platform prototype</strong> was demonstrated, pointing toward a vision of “Cryosparc for diffuse,” making diffuse data processing accessible through a GUI with integrated workflows and interactive visualizations.</p>

<h3 id="molecular-dynamics-simulations"><strong>Molecular Dynamics Simulations</strong></h3>

<p>Mike Wall presented substantial progress on crystallographic MD simulations.</p>

<p><strong>Apo Mac1 baseline results</strong> show exceptional agreement between simulation and experiment:</p>

<ul>
  <li>Total correlation coefficient: CC = 0.96</li>
  <li>Anisotropic correlation coefficient: CC = 0.56</li>
  <li>Simulation: 2×2×2 supercell with OPC3 waters (279,004 atoms), neutron crystal structure 7TX3, 1100 ns unrestrained trajectory</li>
</ul>

<p><strong>MD optimization methods</strong> are advancing on two fronts:</p>

<ul>
  <li><strong>Enrichment</strong>: Selectively removing MD frames to increase diffuse correlation</li>
  <li><strong>Reweighting</strong>: Using JAX to optimize frame weights via differentiable Pearson CC maximization. Initial test on experimental diffuse data achieved CC = 0.97 with 47,150 reflections to 3.5 Å resolution (<a href="https://github.com/diff-use/sampleworks">work</a> by Karson Chrispens, documented in a <a href="https://diffuse.science/posts/jax_refine/">DiffUSE blog post</a>)</li>
</ul>
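
<p>To make the reweighting idea concrete, here is a minimal sketch of optimizing frame weights against a Pearson CC objective. The work described above uses JAX automatic differentiation; this dependency-free version swaps in finite-difference gradients so that it runs in plain Python. The function names, the softmax parameterization of the weights, and the toy intensities are all assumptions made for illustration and are not taken from the project code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def softmax(theta):
    """Map unconstrained parameters to positive weights that sum to 1."""
    top = max(theta)
    e = [math.exp(t - top) for t in theta]
    total = sum(e)
    return [v / total for v in e]


def weighted_intensity(frames, weights):
    """Weighted average of per-frame intensities, one value per grid point."""
    n_points = len(frames[0])
    return [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(n_points)]


def reweight(frames, observed, steps=200, lr=0.5, eps=1e-5):
    """Gradient ascent on the CC with central finite differences (toy scale only)."""
    theta = [0.0] * len(frames)  # start from uniform weights

    def objective(params):
        return pearson_cc(weighted_intensity(frames, softmax(params)), observed)

    for _ in range(steps):
        grad = []
        for k in range(len(theta)):
            up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
            down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
            grad.append((objective(up) - objective(down)) / (2 * eps))
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


# Hypothetical toy data: three "frames" sampled at five grid points.
frames = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
    [2.0, 1.0, 5.0, 3.0, 4.0],
]
# Synthetic "observed" intensities built from frames 0 and 2 only.
observed = [0.7 * a + 0.3 * c for a, c in zip(frames[0], frames[2])]

uniform = [1.0 / len(frames)] * len(frames)
cc_before = pearson_cc(weighted_intensity(frames, uniform), observed)
weights = reweight(frames, observed)
cc_after = pearson_cc(weighted_intensity(frames, weights), observed)
print(f"CC before: {cc_before:.3f}, after: {cc_after:.3f}")
</code></pre></div></div>

<p>The softmax keeps every weight positive and normalized without explicit constraints, which is one reason this kind of parameterization pairs well with gradient-based optimizers. At realistic scale (tens of thousands of reflections and many frames), finite differences would be far too slow, which is where automatic differentiation becomes essential.</p>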

<p><strong>Ligand perturbations</strong> are now being simulated: Mac1 + ADPr shows distinct diffuse patterns compared to baseline Mac1, with protonation state variations (ASP157 → ASH157) under investigation.</p>

<p><strong>Second system</strong>: Dihydrofolate reductase (DHFR, PDB: 7FPV) is being developed as a generalization target, expanding beyond the Mac1 test case.</p>

<p><strong>Simulated diffraction</strong> capabilities using nanoBragg (James Holton) enable validation of data processing pipelines—simulated diffuse intensity can be extracted using mdx2, closing the loop between simulation and experiment.</p>

<p><img src="/assets/images/posts/2026-02-02/DHFR_allhands.png" alt="Simulated diffuse intensity of PDB 7FPV" title="A new protein system is being tested for simulated diffuse intensity, PDB 7FPV" />
<em>Slides shared at the DiffUSE project’s retreat showcasing simulation results of a new protein system, Dihydrofolate reductase (DHFR, PDB: 7FPV).</em></p>

<h3 id="machine-learning-modeling"><strong>Machine Learning Modeling</strong></h3>

<p>Marcus Collins presented the ML modeling roadmap focused on using experimental data to reveal hidden protein conformations.</p>

<p><strong>Key insight</strong>: Current ML structure predictors (AlphaFold3-like models, including Boltz-2, Protenix, RF3) do not reliably predict alternate conformations (altlocs) even with multiple random seeds, indicating they have not learned about underlying ensembles. This gap motivates developing density-guided ensemble generation (Sampleworks).</p>

<p><strong>Density guidance approach</strong>: The team is implementing training-free guidance from experimental density maps (2Fo-Fc), using the difference between experimental and calculated maps to steer diffusion model sampling toward conformations consistent with crystallographic data. Early results are promising but mixed: Boltz-2 with density guidance can capture both altlocs in some test cases like PTP1B (6B8X), though performance varies across systems.</p>

<p><strong>Sampleworks pipeline</strong> is being built as a plug-and-play guidance framework to use different structure prediction models, experimental data, and guidance strategies.</p>

<ul>
  <li>Model wrappers implemented for RF3, Protenix, Boltz-1, and Boltz-2 (MD and X-ray modes)</li>
  <li>Initial test set of ~50 structures from PDB prepared with altlocs; electron density maps being generated</li>
  <li>Evaluation metrics: RSCC, LDDT, clash scores, backbone and sidechain geometry</li>
</ul>

<p><strong>Water modeling</strong> emerges as a critical challenge for advancing to reciprocal space. Our first attempt is to improve the modeling of explicit solvent. Current models achieve ~0.3 precision/recall at 0.5 Å—insufficient for improving Rwork/Rfree. The team is exploring flow-matching approaches and evaluating whether a single unified model or separate protein/water models will be more effective. Ordered waters coupled to protein altlocs are particularly important targets.</p>
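
<p>For readers unfamiliar with how precision and recall at a distance cutoff can be scored for waters, the sketch below shows one plausible scheme: greedy one-to-one matching of predicted water positions to reference positions, closest pairs first. The matching rule, the function name, and the coordinates are assumptions for illustration; the team’s actual evaluation may differ.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import math


def water_precision_recall(predicted, reference, cutoff=0.5):
    """Score predicted water positions against reference positions.

    Coordinates are (x, y, z) tuples in angstroms. Pairs are matched
    greedily, closest first, and each water is used at most once.
    """
    pairs = sorted(
        (math.dist(p, r), i, j)
        for i, p in enumerate(predicted)
        for j, r in enumerate(reference)
    )
    # Keep only the pairs that are no farther apart than the cutoff.
    n_close = bisect.bisect_right([d for d, _, _ in pairs], cutoff)

    used_pred, used_ref = set(), set()
    for _, i, j in pairs[:n_close]:
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)

    matched = len(used_pred)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical coordinates: two of four predictions fall within 0.5 angstroms
# of a reference water, and one reference water is missed.
reference = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
predicted = [(0.2, 0.0, 0.0), (3.0, 0.6, 0.0), (9.0, 9.0, 9.0), (0.0, 4.1, 0.0)]
precision, recall = water_precision_recall(predicted, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
</code></pre></div></div>

<p>A tight cutoff such as 0.5 Å makes this a demanding metric: a water placed 0.6 Å from its true site counts as both a false positive and a missed reference water.</p>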

<h3 id="infrastructure-and-publishing"><strong>Infrastructure and Publishing</strong></h3>

<p>Justin Biel presented the computational infrastructure supporting DiffUSE, built around a three-pillar model: Data, Compute, and Publishing.</p>

<p><strong>Compute Infrastructure</strong> uses Voltage Park as the backbone:</p>

<ul>
  <li>H100 SXM5 GPUs available via bare metal (8× GPU configurations)</li>
  <li>Two usage patterns supported:
    <ul>
      <li><strong>Workspaces</strong>: Interactive environments for experimental work, debugging, and visualization</li>
      <li><strong>Workflows</strong>: Hardened, scalable pipelines for production analysis</li>
    </ul>
  </li>
  <li>The DiffUSE web app now provides resource checkout, visibility into running resources, and SSH/Jupyter access</li>
  <li>Custom container management enables workspace pausing and environment customization</li>
  <li>Workflow orchestration via Prefect and Docker</li>
</ul>

<p><strong>Data Infrastructure</strong> centers on the DiffUSE web app:</p>

<ul>
  <li><strong>Storage</strong>: Core Backblaze storage (S3-compatible) with OSN bucket integration for beamline data</li>
  <li><strong>Access</strong>: Automatic mounting to Voltage Park resources, plus web app download, CLI, Python SDK, and API</li>
  <li><strong>Metadata</strong>: Experiments have artifacts, optional markdown content (like logbook entries), relationships to other experiments, and tags</li>
  <li><strong>Automation</strong>: Beam-trip data automatically triggers experiment registration; dataset files populate metadata fields</li>
  <li><strong>Governance</strong>: Standards compliance checking, staging-to-public workflows, DOI attachment decisions</li>
</ul>

<p><strong>Publishing workflow</strong> discussions focused on:</p>

<ul>
  <li>When to stage data privately vs. make everything open immediately</li>
  <li>When to attach DOIs (content should be largely immutable)</li>
  <li>External database destinations: SBGrid Databank, PDB, Zenodo</li>
</ul>

<h3 id="open-science"><strong>Open Science</strong></h3>

<p>Prachee Avasthi (Head of Open Science, Astera) led a discussion on publishing expectations and open science practices. <strong>Discussion</strong> explored barriers to sharing, evidence of downstream reuse, orphan artifacts without ideal homes, and prioritization of unaddressed data sharing issues.</p>

<h2 id="reflections-on-our-distributed-model"><strong>Reflections on Our Distributed Model</strong></h2>

<p>This retreat underscored how DiffUSE’s distributed structure works. By embedding team members across institutions (Cornell, UCSF, Berkeley Lab, and beyond), we maintain direct access to beamlines, computational expertise, and scientific communities that would be impossible to replicate in a single location. The “Diffuse East ≈ Diffuse West” result is itself a product of this model: data collected by different teams at facilities 2,500 miles apart, processed with shared tools, yielding consistent results. Our infrastructure investments (the DiffUSE web app, Voltage Park compute, standardized containerized environments) bridge the geographic gaps, allowing a scientist at Cornell to spin up the same analysis environment as a colleague in California.</p>

<p>The in-person retreat revealed how much asynchronous collaboration had already accomplished: sessions focused on integration and next steps rather than on catching people up. Open science practices (shared logbooks, blog posts, open repositories) keep everyone aligned between meetings. The challenge ahead is scaling this approach: as we add systems, datasets, and collaborators, maintaining the coherence that makes distributed work effective will require continued investment in documentation, automation, and the human connections that make a dispersed team feel like one group working toward a shared goal.</p>

<p>The science described here represents the output of a significant and coordinated resource investment. Since DiffUSE’s start in July, Astera has committed <span>$3.2M</span> to stand up the project: <span>$2.63M</span> in research grants distributed directly to our partner labs at Fraser Lab/LBL, Ando Lab/CHESS, and Wankowicz Lab, <span>$567K</span> in Astera personnel and contractor support, and <span>$30K</span> in computational infrastructure. On top of this, CHESS contributed an estimated <span>$700K</span> in beamtime, bringing the total resource investment to roughly <span>$3.9M</span>. Looking ahead, an additional <span>$2.4M</span> is projected for 2026 as the project scales toward its core scientific goals.</p>

<hr />

<p><img src="/assets/images/posts/2026-02-02/diffuse_demo_datamanagement.png" alt="The DiffUSE App is currently under development" title="the DiffUSE App, currently under development" />
<em>A screenshot from our data management infrastructure, demonstrated at the retreat. This is in active development with Prophet Town and Voltage Park.</em></p>

<p><img src="/assets/images/posts/2026-02-02/2026_allhands_pres.png" alt="Mike Wall presents progress on MD optimization of diffuse scattering" title="Mike Wall presents progress on MD optimization of diffuse scattering" />
<em>Mike Wall presents progress on MD optimization of diffuse scattering to a full house at the Astera Institute.</em></p>

<hr />

<h2 id="whats-next"><strong>What’s Next?</strong></h2>

<h3 id="data-collection-3-month-goals"><strong>Data Collection (3-month goals)</strong></h3>

<ul>
  <li>Collect data on additional systems; collect lysozyme at ALS</li>
  <li>Optimize sample handling for intermediate temperatures (oil-based approaches)</li>
  <li>Explore serial crystallography approaches (chip types, small wedges, crystal size variation)</li>
  <li>Continue investigating cryo options (traditional, NANUQ, high-pressure cryocooling)</li>
</ul>

<h3 id="data-processing-2026-goals"><strong>Data Processing (2026 goals)</strong></h3>

<ul>
  <li>Improve mdx2 performance (~2× speedup)</li>
  <li>Implement GOODVIBES and DISCOBALL in Python (JAX)</li>
  <li>Fully explore serial crystallography processing</li>
  <li>Deploy on Ando lab Galaxy server; add mdx2 tools</li>
  <li>Develop “Cryosparc for diffuse” project roadmap</li>
</ul>

<h3 id="md-simulations"><strong>MD Simulations</strong></h3>

<ul>
  <li>Continue model/data comparisons and refine MD models (protonation states, parameterization)</li>
  <li>Expand to new systems and additional ligand/mutation perturbations</li>
  <li>Explore how MD optimizations can support other DiffUSE activities (ML modeling, diffraction image simulation, data processing validation)</li>
</ul>

<h3 id="ml-modeling"><strong>ML Modeling</strong></h3>

<ul>
  <li>Scale up Sampleworks evaluation across initial test set</li>
  <li>Improve water prediction models (retrain SuperWater with better data, explore flow matching vs. diffusion)</li>
  <li>Quantify water model precision requirements by systematically perturbing well-supported waters</li>
  <li>Progress toward reciprocal space/Bragg peak guidance, ultimately targeting diffuse data guidance</li>
</ul>

<h3 id="infrastructure"><strong>Infrastructure</strong></h3>

<ul>
  <li>Finalize containerized workspace management with pause/resume capability</li>
  <li>Expand workflow orchestration options</li>
  <li>Refine data governance workflows for staging → public → external database publication</li>
</ul>

<h3 id="open-science-1"><strong>Open Science</strong></h3>

<ul>
  <li>Address identified barriers to sharing</li>
  <li>Establish timelines for DOI attachment and external database deposition</li>
  <li>Continue documentation through blog posts and logbooks</li>
</ul>

<p>Special thanks to Astera for hosting the retreat in Emeryville.</p>

<hr />

<h2 id="glossary"><strong>Glossary</strong></h2>

<table>
  <tr>
   <td><strong>Acronym</strong>
   </td>
   <td><strong>Definition</strong>
   </td>
  </tr>
  <tr>
   <td>ADPr
   </td>
   <td>Adenosine diphosphate ribose (a ligand)
   </td>
  </tr>
  <tr>
   <td>ALS
   </td>
   <td>Advanced Light Source (synchrotron at Lawrence Berkeley National Laboratory)
   </td>
  </tr>
  <tr>
   <td>API
   </td>
   <td>Application Programming Interface
   </td>
  </tr>
  <tr>
   <td>ASH
   </td>
   <td>Protonated aspartic acid residue
   </td>
  </tr>
  <tr>
   <td>ASP
   </td>
   <td>Aspartic acid residue
   </td>
  </tr>
  <tr>
   <td>ATCase
   </td>
   <td>Aspartate Transcarbamylase (enzyme)
   </td>
  </tr>
  <tr>
   <td>CC
   </td>
   <td>Correlation Coefficient
   </td>
  </tr>
  <tr>
   <td>CHESS
   </td>
   <td>Cornell High Energy Synchrotron Source
   </td>
  </tr>
  <tr>
   <td>CLI
   </td>
   <td>Command Line Interface
   </td>
  </tr>
  <tr>
   <td>DHFR
   </td>
   <td>Dihydrofolate Reductase (enzyme)
   </td>
  </tr>
  <tr>
   <td>DOI
   </td>
   <td>Digital Object Identifier
   </td>
  </tr>
  <tr>
   <td>GPU
   </td>
   <td>Graphics Processing Unit
   </td>
  </tr>
  <tr>
   <td>GUI
   </td>
   <td>Graphical User Interface
   </td>
  </tr>
  <tr>
   <td>JAX
   </td>
   <td>Just After eXecution (Google's autodiff/ML library for Python)
   </td>
  </tr>
  <tr>
   <td>keV
   </td>
   <td>Kiloelectronvolt (unit of X-ray energy)
   </td>
  </tr>
  <tr>
   <td>LDDT
   </td>
   <td>Local Distance Difference Test (structure quality metric)
   </td>
  </tr>
  <tr>
   <td>Mac1
   </td>
   <td>Macrodomain 1 (SARS-CoV-2 nonstructural protein 3)
   </td>
  </tr>
  <tr>
   <td>MD
   </td>
   <td>Molecular Dynamics
   </td>
  </tr>
  <tr>
   <td>ML
   </td>
   <td>Machine Learning
   </td>
  </tr>
  <tr>
   <td>NrdE
   </td>
   <td>Ribonucleotide Reductase class Ib alpha subunit (enzyme)
   </td>
  </tr>
  <tr>
   <td>ns
   </td>
   <td>Nanoseconds
   </td>
  </tr>
  <tr>
   <td>OPC3
   </td>
   <td>Optimal Point Charge 3-point water model
   </td>
  </tr>
  <tr>
   <td>OSN
   </td>
   <td>Open Storage Network
   </td>
  </tr>
  <tr>
   <td>PDB
   </td>
   <td>Protein Data Bank
   </td>
  </tr>
  <tr>
   <td>PTP1B
   </td>
   <td>Protein Tyrosine Phosphatase 1B (enzyme)
   </td>
  </tr>
  <tr>
   <td>RF3
   </td>
   <td>RoseTTAFold 3 (structure prediction model)
   </td>
  </tr>
  <tr>
   <td>Rfree
   </td>
   <td>Free R-factor (crystallographic validation metric)
   </td>
  </tr>
  <tr>
   <td>Rwork
   </td>
   <td>Working R-factor (crystallographic refinement metric)
   </td>
  </tr>
  <tr>
   <td>RSCC
   </td>
   <td>Real Space Correlation Coefficient
   </td>
  </tr>
  <tr>
   <td>S3
   </td>
   <td>Simple Storage Service (cloud storage protocol)
   </td>
  </tr>
  <tr>
   <td>SBGrid
   </td>
   <td>Structural Biology Software Grid (consortium)
   </td>
  </tr>
  <tr>
   <td>SDK
   </td>
   <td>Software Development Kit
   </td>
  </tr>
  <tr>
   <td>SSH
   </td>
   <td>Secure Shell (network protocol)
   </td>
  </tr>
  <tr>
   <td>UCSF
   </td>
   <td>University of California, San Francisco
   </td>
  </tr>
</table>

<hr />

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name></name></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[A report on our January 2026 all-hands meeting]]></summary></entry><entry><title type="html">In the Cloud</title><link href="https://diffuse.science/post/in_the_cloud/" rel="alternate" type="text/html" title="In the Cloud" /><published>2025-11-15T00:00:00+00:00</published><updated>2025-11-15T00:00:00+00:00</updated><id>https://diffuse.science/post/in_the_cloud</id><content type="html" xml:base="https://diffuse.science/post/in_the_cloud/"><![CDATA[<figure class="half ">
  
    
      <a href="/assets/images/posts/Clouds.jpg" title="Cloudy diffuse features in the sky">
          <img src="/assets/images/posts/Clouds.jpg" alt="Cloudy diffuse features in the sky" />
      </a>
    
  
    
      <a href="/assets/images/posts/DiffuseClouds.png" title="MD simulation of cloudy diffuse features">
          <img src="/assets/images/posts/DiffuseClouds.png" alt="MD simulation of cloudy diffuse features" />
      </a>
    
  
  
    <figcaption>(Left) Cloudy diffuse features in the sky. (Right) MD simulation of cloudy diffuse features.
</figcaption>
  
</figure>

<h2 id="diffuse-scattering-in-the-cloud">Diffuse Scattering in the Cloud</h2>

<p>While out on a walk, as I looked up at the sky, a certain cloud formation (above left) reminded me of the <em>l</em> = 0 slice through the MD simulation of Mac1 diffuse scattering (above right). That got me thinking about the next steps for the diffUSE MD simulations (which are, of course, being performed using <a href="https://www.voltagepark.com">cloud computing resources</a>).</p>

<p>As described in the <a href="/posts/allhands/">Quarterly All Hands Meeting</a> post, we recently shared our short-term plans for the various components of the diffUSE project. We’ve already performed baseline comparisons of crystalline MD simulations of Nsp3 macrodomain (Mac1) to diffuse scattering data (see the <a href="/post/3-2-1-contact/">3-2-1 Contact</a> post). Now we want to improve the models and see what happens in the simulations when we make changes. With this in mind, we’re planning to: (1) improve the current MD model of Mac1; (2) simulate Mac1 crystals under different conditions; and (3) develop a model of a new system.</p>

<p>Thinking about (2), I contacted James Fraser to chat about what to do for the next MD simulations of Mac1. We decided to look at Mac1 in complex with ADP-ribose. This choice is timely, as Kara Zielinski just collected diffUSE diffraction data from crystals of this complex at the Cornell High-Energy Synchrotron Source (CHESS), during a recent trip from UCSF to Nozomi Ando’s lab at Cornell.</p>

<p>What will the MD simulation of diffuse scattering from crystals of Mac1 in complex with ADPr look like? Probably a lot like the ones we’ve done already, with some small changes. We’re planning to analyze the differences and find out what happens to the dynamics when different ligands bind. But we don’t really know yet what we’ll see. These moments of suspense are very common in science, but they’re absent from the stories we usually tell in the literature. The open science model we’re using on the diffUSE project enables us to document these periods of uncertainty as a part of the public narrative of the project. It feels kind of liberating.</p>

<hr />

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name>Michael Wall</name><email>mewall00@gmail.com</email></author><category term="post" /><category term="diffuse scattering" /><category term="molecular dynamics" /><category term="clouds" /><category term="planning" /><category term="open science" /><category term="meta" /><summary type="html"><![CDATA[Next steps in diffUSE MD simulations]]></summary></entry><entry><title type="html">Quarterly All-Hands Meeting Summary</title><link href="https://diffuse.science/posts/allhands/" rel="alternate" type="text/html" title="Quarterly All-Hands Meeting Summary" /><published>2025-11-10T00:00:00+00:00</published><updated>2025-11-10T00:00:00+00:00</updated><id>https://diffuse.science/posts/allhands</id><content type="html" xml:base="https://diffuse.science/posts/allhands/"><![CDATA[<h2 id="why-this-quarter-mattered"><strong>Why this quarter mattered</strong></h2>

<p>We hadn’t all gathered since our June kick-off meeting, so in mid-October we met online with all members of the diffUSE project to discuss the overall goals for each project team, progress made over the first few months, and goals for the next three months. We emphasized how the different pieces of the project integrate to build methods, data, models, and encodings so the community can routinely use diffuse scattering in basic biology and drug discovery.</p>

<h2 id="what-have-we-completed-in-the-first-few-months"><strong>What have we completed in the first few months?</strong></h2>

<h3 id="data-collection"><strong>Data collection:</strong></h3>

<ul>
  <li>We have collected ambient-temperature datasets from CHESS for <a href="https://diffuse.science/logbook/beamtime/20251008-chess/">lysozyme</a>, <a href="https://diffuse.science/logbook/beamtime/20251015-chess/">macrodomain</a>, <a href="https://diffuse.science/logbook/beamtime/20250924-chess/">NrdE</a>, and <a href="https://diffuse.science/logbook/beamtime/20251015-chess/">DNA fibers</a>.</li>
  <li>We have collected data at ALS on <a href="https://diffuse.science/logbook/beamtime/20250701-als/">Mac1</a> and <a href="https://diffuse.science/logbook/beamtime/20251015-17-als/">Huwe1</a> using humidity boxes and watershed sleeves with controlled transmission and beam size.</li>
  <li>We have experimented with temperature modulation to probe dose dependence and mosaicity effects, using <a href="https://diffuse.science/diffuse-shipping/">samples shipped</a> from UCSF to Cornell.</li>
  <li>We implemented standardized background frames, uniform sleeve lengths, and precise humidity control to enhance map quality and cross-beamline comparability with an eye toward <a href="https://diffuse.science/posts/windows/">future</a> multi-site data collection campaigns.</li>
  <li>We have documented all collection procedures on the <a href="https://diffuse.science/logbook/beamtime/">diffUSE website logbooks</a>.</li>
</ul>

<h3 id="data-processing"><strong>Data processing:</strong></h3>

<ul>
  <li>
    <p>Developing xia2.multiplex for automated data merging, <em>mdx2</em> for data extraction, and comprehensive data quality control (QC) workflows.</p>
  </li>
  <li>
    <p>Building graphical user interfaces (GUIs) for <em>mdx2</em> to improve usability and accessibility.</p>
  </li>
  <li>
    <p><a href="https://diffuse.science/next-steps-macrodomain/">Identifying and resolving</a> bugs that arise when multiple users concurrently process the same datasets.</p>
  </li>
  <li>
    <p><a href="https://diffuse.science/posts/jax_refine/">Implementing differentiable refinement</a> by treating molecular dynamics (MD) frame weights as trainable parameters in a Pearson correlation–based objective function.</p>
  </li>
</ul>

<h3 id="machine-learning-modeling"><strong>Machine Learning Modeling:</strong></h3>

<ul>
  <li>Developing <a href="https://diffuse.science/posts/modeling/">pipeline scaffolds</a> to integrate experimental structural data directly into generative model training and evaluation.</li>
  <li>Creating quantitative metrics for assessing and benchmarking ensemble data.</li>
  <li>Building a generative water model that learns to predict water molecule positions from protein structure, designed for future integration into broader generative modeling frameworks.</li>
</ul>

<h3 id="simulations"><strong>Simulations:</strong></h3>

<ul>
  <li>We <a href="https://diffuse.science/post/3-2-1-contact/">simulated</a> a Mac1 2×2×2 supercell with OPC3 waters and 279,004 atoms, which reaches 150 ns per day on Voltage Park. With refined masking and resampling, total CC is 0.96 and anisotropic CC is 0.56 on the H8 dataset, which sets a clear target for larger supercells and ligand or mutant comparisons.</li>
  <li>Taylor completed his rotation developing a <a href="https://diffuse.science/posts/diffuse_rotation/">simulator</a>.</li>
</ul>

<h3 id="encoding"><strong>Encoding:</strong></h3>

<ul>
  <li>We continue to <a href="https://diffuse.science/posts/encoding/">advocate</a> for conformational and compositional heterogeneity-encoding strategies.</li>
  <li>We have developed a <a href="https://diffuse.science/posts/multi_to_ens/">script</a> to translate between encodings for multiconformer and ensemble representations.</li>
  <li>We are working on developing a standalone script and a COOT integration script for conformational heterogeneity.</li>
</ul>

<h3 id="infrastructure-and-open-science"><strong>Infrastructure and Open Science:</strong></h3>

<ul>
  <li>We have access to Voltage Park compute and S3 storage via the command line, which will make sharing maps and models easier.</li>
  <li><a href="https://diffuse.science/posts/">16 blog posts</a> and 6 beamtime <a href="https://diffuse.science/logbook/">logbooks</a> to date!</li>
</ul>

<h2 id="what-is-up-for-the-next-3-months"><strong>What is up for the next 3 months?</strong></h2>

<p><strong>Data Collection:</strong> Catalog all existing data, fill gaps, complete background series at ALS, finalize hardened collection procedures that travel across LBL and CHESS, and post collection reports on the site. Develop shared checklists to coordinate and standardize future data-collection cycles.</p>

<p><strong>Data Processing:</strong> Converge on a single, documented workflow, generate preliminary maps for all ALS and CHESS datasets, produce fine maps for GOODVIBES and DISCOBALL, stand up a CHESS 2026-1 pipeline, and publish processing reports on the site.</p>

<p><strong>Modeling:</strong> Ship an initial pipeline that accepts maps for guided sampling.</p>

<p><strong>Encoding:</strong> Land final working group approval, publish the schema and examples, and connect the web app to our catalog so processed maps, models, and metadata are searchable and shareable.</p>

<p><strong>Infrastructure and sharing science</strong>: More blog posts!</p>

<p>We will meet in the Bay Area in January and report back more after that.</p>]]></content><author><name>James Fraser</name><email>jfraser@fraserlab.com</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[A report on our October 2025 all-hands meeting]]></summary></entry><entry><title type="html">3-2-1 Contact: Comparing MD simulations with diffuse data</title><link href="https://diffuse.science/post/3-2-1-contact/" rel="alternate" type="text/html" title="3-2-1 Contact: Comparing MD simulations with diffuse data" /><published>2025-10-20T00:00:00+00:00</published><updated>2025-10-20T00:00:00+00:00</updated><id>https://diffuse.science/post/3-2-1-contact</id><content type="html" xml:base="https://diffuse.science/post/3-2-1-contact/"><![CDATA[<h2 id="overview">Overview</h2>

<figure class="third ">
  
    
      <a href="/assets/images/posts/h8_md_view.png" title="Slice through the MD simulated diffuse map">
          <img src="/assets/images/posts/h8_md_view.png" alt="Slice through the MD simulated diffuse map" />
      </a>
    
  
    
      <a href="/assets/images/posts/h8_initial_processing.png" title="Slice through the diffuse data with initial processing">
          <img src="/assets/images/posts/h8_initial_processing.png" alt="Slice through the diffuse data with initial processing" />
      </a>
    
  
    
      <a href="/assets/images/posts/h8_final_processing.png" title="Slice through the diffuse data with final processing">
          <img src="/assets/images/posts/h8_final_processing.png" alt="Slice through the diffuse data with final processing" />
      </a>
    
  
  
    <figcaption>First diffUSE project comparisons of MD simulations with diffuse data. Slices through the anisotropic diffuse map are visualized in the <em>l</em> = 0 plane. (Left) MD simulation. (Center) Data with initial processing. (Right) Data with refined processing.
</figcaption>
  
</figure>


<p>This post describes the <strong>first systematic comparisons</strong> between molecular dynamics (MD)–derived diffuse scattering and experimental measurements on the diffUSE project. 
These early analyses establish a baseline for evaluating how well MD simulations reproduce the observed isotropic and anisotropic features of the data. They also suggest possible improvements in the way we compare MD simulations to diffuse data.</p>

<hr />

<h2 id="initial-comparisons">Initial Comparisons</h2>

<p>We used data from the first diffUSE experiments for the comparisons (see the <a href="/posts/wetfeet/">Getting our feet wet</a> post). Specifically, we used the <strong>H8 dataset</strong>, which was collected with a medium radiation dose (see the <a href="https://diffuse.science/logbook/20250624-als831-macrodomain-analysis/">preliminary analysis of the diffuse scattering</a>).</p>

<p>The H8 dataset was originally sampled on a grid using <strong>2×2×4</strong> points per integer <em>hkl</em> value. To align with the output of the 2×2×2 supercell MD simulation, it was resampled onto a <strong>2×2×2</strong> grid for direct comparison.</p>

<p>This initial processing yielded a correlation with the MD simulation of 0.88 for the total diffuse intensity, and a correlation of 0.32 for just the anisotropic component.</p>
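
<p>The post does not spell out how the anisotropic component is obtained. The sketch below assumes a common convention: the isotropic part is the mean intensity within each spherical shell of reciprocal space, and the anisotropic part is what remains after subtracting it. The function names, the dictionary layout, and the shell binning on Miller indices (rather than true resolution) are illustrative assumptions, not the project's code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import math
from collections import defaultdict


def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = math.fsum(x) / n
    mean_y = math.fsum(y) / n
    cov = math.fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = math.fsum((a - mean_x) ** 2 for a in x)
    var_y = math.fsum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def shell_index(point, shell_width):
    """Bin a fractional (h, k, l) point by its distance from the origin.

    Hypothetical simplification: a real analysis would bin on resolution,
    which depends on the unit cell.
    """
    h, k, l = point
    return int(math.sqrt(h * h + k * k + l * l) / shell_width)


def anisotropic_part(grid, shell_width=1.0):
    """Subtract each shell's mean intensity from the points in that shell."""
    shells = defaultdict(list)
    for point, value in grid.items():
        shells[shell_index(point, shell_width)].append(value)
    means = {s: math.fsum(v) / len(v) for s, v in shells.items()}
    return {p: v - means[shell_index(p, shell_width)] for p, v in grid.items()}


def compare(data, model, shell_width=1.0):
    """Return (total CC, anisotropic CC) over points present in both maps."""
    common = sorted(set(data).intersection(model))
    data_total = [data[p] for p in common]
    model_total = [model[p] for p in common]
    data_aniso = anisotropic_part({p: data[p] for p in common}, shell_width)
    model_aniso = anisotropic_part({p: model[p] for p in common}, shell_width)
    total_cc = pearson_cc(data_total, model_total)
    aniso_cc = pearson_cc(
        [data_aniso[p] for p in common], [model_aniso[p] for p in common]
    )
    return total_cc, aniso_cc
</code></pre></div></div>

<p>Because the isotropic falloff with resolution dominates the total intensity, two maps can agree well in total correlation while agreeing much less in the anisotropic residual, which is the pattern reported above.</p>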

<hr />

<h2 id="refinement-and-improvements">Refinement and Improvements</h2>

<p>The MD simulations were performed using a 2×2×2 supercell, which is too small to include contributions from long-range correlations in the lattice. We therefore wondered whether the agreement of the MD with the diffuse data might be diminished in the immediate neighborhood of the Bragg peaks, which is associated with lattice vibrations (<a href="https://doi.org/10.1038/s41467-023-36734-3">GOODVIBES</a> is specifically designed to accurately model this part of the signal). We also noticed certain outlier intensity values in the data (specifically, negative values), and wished to avoid using those in comparisons.</p>

<p>We performed another round of data processing with these ideas in mind. First, Steve re-processed the diffraction images, sampling more finely onto a <strong>4×4×4</strong> grid. The intensities at integer <em>hkl</em>, corresponding to the immediate neighborhood of the Bragg peaks, were then masked out prior to downsampling to 2×2×2. Negative intensities were also masked out.</p>
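
<p>A minimal sketch of the masking and downsampling step, under stated assumptions: the fine map is held as a dictionary keyed by integer grid indices, downsampling is a plain average over each 2×2×2 block of fine samples, and masked samples are simply left out of that average. The actual processing may bin or weight the samples differently; <code class="language-plaintext highlighter-rouge">mask_and_downsample</code> is a hypothetical name.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import math
from collections import defaultdict


def mask_and_downsample(fine, fine_per_hkl=4, coarse_per_hkl=2):
    """Mask Bragg-adjacent and negative samples, then average down.

    fine maps integer grid indices (i, j, k) to intensities, where
    i / fine_per_hkl is the fractional Miller index h (likewise k and l).
    Returns a dict on the coarser grid; fully masked blocks are omitted.
    """
    factor = fine_per_hkl // coarse_per_hkl
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (i, j, k), value in fine.items():
        # Samples at integer hkl sit on the Bragg peaks.
        at_bragg = (
            i % fine_per_hkl == 0
            and j % fine_per_hkl == 0
            and k % fine_per_hkl == 0
        )
        if at_bragg or math.isnan(value) or 0 > value:
            continue  # masked: excluded from the block average
        block = (i // factor, j // factor, k // factor)
        sums[block] += value
        counts[block] += 1
    return {block: sums[block] / counts[block] for block in sums}
</code></pre></div></div>

<p>Masking before averaging matters here: a single Bragg-adjacent or negative sample would otherwise contaminate the whole coarse voxel it falls in.</p>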

<p>The revised processing improved the agreement with the MD substantially:</p>

<ul>
  <li><strong>Total correlation:</strong> 0.96</li>
  <li><strong>Anisotropic correlation:</strong> 0.56</li>
</ul>

<hr />

<h2 id="outlook">Outlook</h2>

<p>We now have our first assessment of the agreement of MD simulations with diffuse data collected on the diffUSE project. This is our baseline for improving the MD models. We also identified specific regions of the diffuse map – close to the Bragg peaks – where the MD might be currently lacking. These regions of the model might be improved by using a larger supercell for the MD.</p>

<hr />

<p><em>This post was initially drafted in ChatGPT based on a Slack exchange between Steve Meisburger and Michael Wall, and was rewritten and posted by Michael Wall on October 20, 2025.</em></p>

<script src="https://giscus.app/client.js" data-repo="diff-use/diff-use.github.io" data-repo-id="R_kgDOPO07gg" data-category="General" data-category-id="DIC_kwDOPO07gs4CtV5I" data-mapping="title" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async="">
</script>

<noscript>Please enable JavaScript to view comments.</noscript>]]></content><author><name>Michael Wall</name><email>mewall00@gmail.com</email></author><category term="post" /><category term="diffuse scattering" /><category term="molecular dynamics" /><category term="data processing" /><category term="mdx2" /><category term="meta" /><summary type="html"><![CDATA[Insights from the first comparisons of MD simulations to diffuse data]]></summary></entry><entry><title type="html">Ensemble &amp;lt;-&amp;gt; Multiconformer Model Conversion</title><link href="https://diffuse.science/posts/multi_to_ens/" rel="alternate" type="text/html" title="Ensemble &amp;lt;-&amp;gt; Multiconformer Model Conversion" /><published>2025-10-17T00:00:00+00:00</published><updated>2025-10-17T00:00:00+00:00</updated><id>https://diffuse.science/posts/multi_to_ens</id><content type="html" xml:base="https://diffuse.science/posts/multi_to_ens/"><![CDATA[<h2 id="representing-conformational-heterogeneity">Representing conformational heterogeneity</h2>
<p>Our structural biology techniques capture an enormous amount of conformational heterogeneity that is often lost in the transition from experimental data to deposited models. Part of this loss stems from a lack of sufficiently sophisticated algorithmic methods, which is an active area of development in this project and elsewhere. Still, an equally important factor is how we choose to encode structural heterogeneity in the models themselves.</p>

<p>In the majority of structures deposited in the Protein Data Bank (PDB), conformational heterogeneity is represented only in a harmonic sense, through atomic displacement parameters (B-factors) or translation–libration–screw (TLS) parameters. These parameters can be incorporated into a single structural model, describing the amplitude and anisotropy of atomic fluctuations around a mean position. However, they do not encode anharmonic or discrete conformational variability. Capturing non-harmonic conformational heterogeneity requires models that explicitly include multiple atomic coordinate sets. Two dominant strategies have emerged from X-ray crystallography and cryo-EM: multiconformer models and multi-model (ensemble) models[1].</p>

<p>A multiconformer model represents conformational diversity locally, without duplicating the entire macromolecule. When a region of the electron density is well described by a single conformation, only one set of coordinates with appropriate B-factors is modeled. When the density indicates discrete alternative conformations, such as side chain rotamers or backbone flips, the relevant atoms are copied and assigned alternate location (ALTLOC) identifiers in the PDB file. We have previously demonstrated that this modeling approach can yield substantial improvements in fitting to experimental data and also reduce geometric distortions and eliminate many rotamer outliers[2].</p>

<p>Multi-model approaches model heterogeneity by encoding multiple complete copies of the system, which can sometimes more effectively capture structural motions like backbone shifts. However, ensembles containing tens to hundreds of models can lead to a high parameter-to-data ratio. The most common ensemble models used in Bragg peak analysis today are time-averaged ensembles, generated by molecular dynamics simulations. These ensembles are restrained by time-averaged X-ray structure factors to produce a large number of models, often hundreds, each representing a snapshot from a single trajectory [3]. Further, crystalline MD simulations are currently the best model to describe the diffuse data[4].</p>

<p>Converting between the two representations is non-trivial, because the two model types are generated differently and encode different information. Still, there are many situations where converting between them is needed. For example, the primary method we use to model diffuse data is molecular dynamics, which generates ensemble models. As described by members of this project and others, these MD ensembles correspond poorly to the Bragg peak data, and there are currently no approaches to further refine an MD model against Bragg peak data. One way to represent and refine such a model against Bragg peaks is as a multiconformer model, which is compatible with traditional refinement software and allows manual manipulation.</p>

<h2 id="converting-multiconformer-to-ensemble-models">Converting multiconformer to ensemble models</h2>
<p>In general, the number of ways to combine single-residue conformations grows exponentially with the number of residues that have alternative conformations. Enumerating all combinations therefore becomes infeasible once a structure contains more than a few residues with AltLoc conformations. Previously, Gutermuth et al. proposed an algorithm to convert multiconformer models into ensemble models (AltLocEnumerator)[5]. This fast branch-and-bound algorithm generates valid alternative protein structure conformations from the AltLoc annotations. It searches for compatible residue conformations, maximizing the probabilities of conformational states by scoring the AltLoc occupancy values.
We aimed to convert a qFit multiconformer model into an ensemble structure (PDB: 5iu1). We tried multiple options, including enumerating all combinations, optimizing the occupancy score, and changing the number of models, but all of them produced models that were almost entirely overlapping (distribution of all-heavy-atom RMSD shown below).</p>

<p><img width="900" height="675" alt="rmsd_protein" src="https://github.com/user-attachments/assets/4e8e1c18-d566-49d7-90a0-bcc5e1df5f28" /></p>

<p>We then decided to make five different models (as qFit models up to five altlocs per residue) using the option ‘altlocid’, which gave us models separated by more realistic RMSDs (distribution of all-heavy-atom RMSD shown below).</p>

<p><code class="language-plaintext highlighter-rouge">AltLocEnumerator --file 5i1u_final_qFit.pdb --altlocid A</code></p>

<p><img width="900" height="675" alt="5ilu_altloc_rmsd_protein" src="https://github.com/user-attachments/assets/fb2242ea-c4dc-421b-adb6-4ad2bc127e69" /></p>

<p>This command needs to be repeated for each altloc ID, and the resulting models should then be concatenated using MODEL/ENDMDL records. Of note, this algorithm is not open source but is available to academics for free.</p>
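
<p>The concatenation step can be sketched as follows. This is a hypothetical helper, not part of AltLocEnumerator or qFit. It assumes each per-altloc output is a single-model PDB file already read into a string, copies only the CRYST1 header from the first file, and wraps each file's coordinate records in a MODEL/ENDMDL pair.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def merge_into_ensemble(pdb_texts):
    """Combine single-model PDB texts into one multi-model PDB string.

    The CRYST1 record is taken from the first input only; the coordinate
    records of each input become one MODEL ... ENDMDL block.
    """
    coordinate_records = ("ATOM", "HETATM", "ANISOU", "TER")
    out = []
    first_lines = pdb_texts[0].splitlines()
    out.extend(line for line in first_lines if line.startswith("CRYST1"))
    for number, text in enumerate(pdb_texts, start=1):
        # MODEL serial numbers are right-justified in columns 11-14.
        out.append("MODEL     {:4d}".format(number))
        out.extend(
            line for line in text.splitlines()
            if line.startswith(coordinate_records)
        )
        out.append("ENDMDL")
    out.append("END")
    return "\n".join(out) + "\n"
</code></pre></div></div>

<p>Other header records (for example REMARK or SCALE lines) are dropped in this sketch and would need to be carried over if downstream tools depend on them.</p>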

<h2 id="converting-ensemble-models-to-multiconformer">Converting ensemble models to multiconformer</h2>

<p>There was no existing tool to convert an ensemble model to a multiconformer model, prompting us to design one. We did this using <a href="https://github.com/ExcitedStates/qfit-3.0">qFit</a>. Our approach systematically collapses an ensemble by iterating over each residue across all models and clustering equivalent residues. Taking the first model as the reference, we assign each subsequent residue to an existing cluster if its RMSD to the cluster centroid is within 1 Å (default parameter). If the RMSD exceeds this threshold, a new conformation is created.  We then used the relabel function in qFit, which uses simulated annealing (SA) optimization of a Lennard-Jones potential to reassign altloc labels, ensuring that conformers of different residues/ligands have consistent altloc labels. Note that while we collapsed many conformers, many residues still have 26+ conformers, meaning these cannot be represented with the historic PDB format.</p>
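
<p>The per-residue clustering described above can be sketched as a greedy pass over the models. This is an illustration under assumptions, not the qFit implementation: conformers are plain lists of (x, y, z) tuples with atoms in the same order, no superposition is applied, and a conformer joins the nearest cluster rather than the first one within the cutoff.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    total = math.fsum(
        (p - q) ** 2
        for atom_a, atom_b in zip(a, b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(total / len(a))


def centroid(members):
    """Per-atom mean coordinates over the conformers in one cluster."""
    n = len(members)
    return [
        tuple(math.fsum(m[i][axis] for m in members) / n for axis in range(3))
        for i in range(len(members[0]))
    ]


def cluster_residue(conformers, cutoff=1.0):
    """Greedily cluster one residue's conformers, in model order.

    A conformer joins the nearest existing cluster if its RMSD to that
    cluster's centroid is within cutoff (in angstroms); otherwise it
    starts a new cluster. Returns a list of clusters (lists of conformers).
    """
    clusters = [[conformers[0]]]  # the first model seeds the first cluster
    for conf in conformers[1:]:
        distances = [rmsd(conf, centroid(c)) for c in clusters]
        nearest = min(range(len(clusters)), key=distances.__getitem__)
        if cutoff >= distances[nearest]:
            clusters[nearest].append(conf)
        else:
            clusters.append([conf])
    return clusters
</code></pre></div></div>

<p>Each resulting cluster would become one alternate conformation for that residue. Because the pass is greedy, the result can depend on model order, which is one reason a separate relabeling step is useful afterwards.</p>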

<p>While we can create a multiconformer model, a few issues remain: (1) we do not have a correct occupancy for any residue (all residues are currently assigned an occupancy of 0.50), and (2) the backbone geometry may be distorted because conformations are removed residue by residue. The conversion is also currently very slow (~45 minutes for 400 residues with 70 models). While imperfect, the resulting multiconformer model can be fed into other algorithms, such as refinement or qFit.</p>

<p>This tool is available in the <a href="https://github.com/ExcitedStates/qfit-3.0">qFit repository</a> as <code class="language-plaintext highlighter-rouge">multimodel_2_multiconformer.py</code>, which requires only an input PDB file.</p>

<h2 id="references">References</h2>

<ol>
  <li>Woldeyes RA, Sivak DA, Fraser JS. E pluribus unum, no more: from one crystal, many conformations. Curr Opin Struct Biol. 2014;28: 56–62.</li>
  <li>Wankowicz SA, Ravikumar A, Sharma S, Riley BT, Raju A, Flowers J, et al. Uncovering Protein Ensembles: Automated Multiconformer Model Building for X-ray Crystallography and Cryo-EM. bioRxiv. 2024. doi:10.1101/2023.06.28.546963</li>
  <li>Burnley BT, Afonine PV, Adams PD, Gros P. Modelling dynamics in protein crystal structures by ensemble refinement. Elife. 2012;1: e00311.</li>
  <li>Wall ME. Internal protein motions in molecular-dynamics simulations of Bragg and diffuse X-ray scattering. IUCrJ. 2018;5: 172–181.</li>
  <li>Gutermuth T, Sieg J, Stohn T, Rarey M. Modeling with Alternate Locations in X-ray Protein Structures. J Chem Inf Model. 2023;63: 2573–2585.</li>
</ol>]]></content><author><name>Stephanie Wankowicz</name><email>stephanie@wankowiczlab.com</email></author><category term="posts" /><category term="meta" /><summary type="html"><![CDATA[Converting between ensemble and multiconformer models]]></summary></entry><entry><title type="html">Optimizing Molecular Dynamics Weights with Machine Learning Tools</title><link href="https://diffuse.science/posts/jax_refine/" rel="alternate" type="text/html" title="Optimizing Molecular Dynamics Weights with Machine Learning Tools" /><published>2025-10-16T00:00:00+00:00</published><updated>2025-10-16T00:00:00+00:00</updated><id>https://diffuse.science/posts/jax_refine</id><content type="html" xml:base="https://diffuse.science/posts/jax_refine/"><![CDATA[<p>In our latest round of diffuse scattering experiments, we ran into an intriguing optimization problem that feels a lot like training a neural network.</p>

<hr />

<h2 id="the-scientific-setup">The Scientific Setup</h2>

<p>For each 3D pixel in reciprocal space (indexed by <strong>h</strong>), we have:</p>

<ul>
  <li><strong>Observed data</strong>, $y(h)$ from experiment</li>
  <li><strong>Predicted data</strong>, $x(h)$ computed from molecular dynamics (MD) trajectories</li>
</ul>

<p>We evaluate agreement using the <strong>Pearson correlation coefficient</strong>:</p>

\[\mathrm{CC} =
\frac{\langle x \cdot y \rangle - \langle x \rangle \langle y \rangle}
{\sqrt{(\langle x^2 \rangle - \langle x \rangle^2)(\langle y^2 \rangle - \langle y \rangle^2)}}\]

<p>Each prediction $x(h)$ is derived from <strong>structure factors</strong> $F(h, t)$ across time points in the MD simulation:</p>

\[x(h) = \langle F(h,t)^2 \rangle_t - \langle F(h,t) \rangle_t^2\]

<p>The goal is to assign a <strong>weight</strong> $w_t$ to each time point $t$ so that the reweighted prediction $x'(h)$ maximizes $\mathrm{CC}$:</p>

\[x'(h) = \sum_t w_t F(h,t)^2 - \left(\sum_t w_t F(h,t)\right)^2\]

<p>If we can find optimal weights, we can identify which regions of the trajectory best match experimental reality — potentially distinguishing “good” frames from those that detract from agreement.</p>
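
<p>As a concrete reading of the formula for $x'(h)$, here is a small sketch in plain Python. It makes two assumptions that the post does not state: the structure factors are complex, so $F^2$ is read as $|F|^2$, and the weights are normalized to sum to one, so that $x'(h)$ is a weighted variance. The name <code class="language-plaintext highlighter-rouge">weighted_diffuse</code> is hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import math


def weighted_diffuse(frames, weights):
    """Weighted variance of structure factors at each reciprocal-space point.

    frames[t][i] is the complex structure factor F(h_i, t) and weights[t]
    is w_t. Weights are normalized to sum to one (an assumption).
    """
    total = math.fsum(weights)
    w = [value / total for value in weights]
    diffuse = []
    for i in range(len(frames[0])):
        # First term: weighted mean of |F|^2 over frames.
        mean_sq = math.fsum(
            w_t * abs(frame[i]) ** 2 for w_t, frame in zip(w, frames)
        )
        # Second term: squared modulus of the weighted mean of F.
        mean_f = sum(w_t * frame[i] for w_t, frame in zip(w, frames))
        diffuse.append(mean_sq - abs(mean_f) ** 2)
    return diffuse


# Two frames, two reciprocal-space points, equal weights.
frames = [[1 + 0j, 2 + 0j], [3 + 0j, 2 + 0j]]
example = weighted_diffuse(frames, [1.0, 1.0])
</code></pre></div></div>

<p>With equal weights this reduces to the unweighted expression for $x(h)$. Putting all the weight on a single frame gives zero diffuse intensity everywhere, since one frame has no variance.</p>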

<hr />

<h2 id="community-brainstorming">Community Brainstorming</h2>

<p><strong>Steve</strong> suggested asking whether $\mathrm{CC}$ is the right target — perhaps a likelihood might better capture the physics.</p>

<p><strong>Karson Chrispens</strong> proposed leveraging machine learning frameworks like <strong>JAX</strong> or <strong>PyTorch</strong> to treat the weights as trainable parameters.<br />
By backpropagating through the Pearson correlation, an optimizer like Adam could efficiently learn the optimal weights.</p>

<p><strong>James Holton</strong> suspected this approach could outperform traditional non-linear least-squares optimization and shared example MTZ datasets for testing.</p>

<p><strong>Steve</strong> also mentioned using a <strong>genetic algorithm</strong> if the weights were binary ($0$ or $1$), though he acknowledged the continuous formulation might not have a unique minimum.</p>

<hr />

<h2 id="prototyping-the-optimizer">Prototyping the Optimizer</h2>

<p>Karson quickly implemented a JAX-based prototype using <strong>reciprocalspaceship</strong> for MTZ I/O and <strong>optax</strong> for optimization.<br />
The loss function was simply $-\mathrm{CC}$, and weights were constrained to $(0, 1)$ via a sigmoid transform.</p>
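
<p>The idea can be illustrated without any dependencies. The sketch below is not Karson's implementation: it swaps JAX autodiff and the Adam optimizer for central finite-difference gradients with a backtracking step, and it normalizes the sigmoid outputs so the weights sum to one (an assumption, though the example outputs in this post do sum to one). All function names are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import math


def sigmoid(z):
    """Map an unconstrained logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = math.fsum(x) / n, math.fsum(y) / n
    cov = math.fsum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.fsum((a - mx) ** 2 for a in x)
    vy = math.fsum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def predicted(frames, logits):
    """Reweighted diffuse intensity, with w_t = sigmoid(logit_t) / sum."""
    raw = [sigmoid(z) for z in logits]
    total = math.fsum(raw)
    w = [r / total for r in raw]
    out = []
    for i in range(len(frames[0])):
        mean_sq = math.fsum(wt * abs(f[i]) ** 2 for wt, f in zip(w, frames))
        mean_f = sum(wt * f[i] for wt, f in zip(w, frames))
        out.append(mean_sq - abs(mean_f) ** 2)
    return out


def optimize(frames, observed, steps=200, rate=1.0, eps=1e-5):
    """Maximize CC over per-frame logits; returns (weights, final CC)."""
    logits = [0.0] * len(frames)  # start from equal weights
    best = pearson_cc(predicted(frames, logits), observed)
    for _ in range(steps):
        grad = []
        for t in range(len(logits)):
            up = logits[:t] + [logits[t] + eps] + logits[t + 1:]
            down = logits[:t] + [logits[t] - eps] + logits[t + 1:]
            cc_up = pearson_cc(predicted(frames, up), observed)
            cc_down = pearson_cc(predicted(frames, down), observed)
            grad.append((cc_up - cc_down) / (2 * eps))
        trial = [z + rate * g for z, g in zip(logits, grad)]
        score = pearson_cc(predicted(frames, trial), observed)
        if score > best:
            logits, best = trial, score  # accept only improving steps
        else:
            rate *= 0.5  # backtrack: the step overshot, so shorten it
    raw = [sigmoid(z) for z in logits]
    total = math.fsum(raw)
    return [r / total for r in raw], best
</code></pre></div></div>

<p>Finite differences need two evaluations of the objective per weight per step, so this only makes sense for a handful of frames. Backpropagation gives the full gradient at roughly the cost of one evaluation, which is what makes the JAX approach attractive for long trajectories. Note also that with only two frames and normalized weights, the reweighted intensity is a fixed pattern scaled by $w_1 w_2$, so $\mathrm{CC}$ does not depend on the weights at all in this particular formulation.</p>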

<p>When tested on toy datasets and real MTZ files, the optimizer:</p>

<ul>
  <li>Successfully recovered <strong>50:50</strong> weights for mixtures of two “ground-truth” structures</li>
  <li>Produced sensible intermediate values when one or both inputs were “wrong”</li>
  <li>Converged robustly from different initializations</li>
</ul>

<p>Example output for a ground-truth mixture:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Final weights: [0.46, 0.54]
Final CC: 1.0000

</code></pre></div></div>

<p>And for mismatched data:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Final weights: [0.76, 0.24]
Final CC: 0.78

</code></pre></div></div>

<hr />

<h2 id="discussion">Discussion</h2>

<p><strong>Marcus Collins</strong> noted that this approach resembles computing <strong>Boltzmann-like factors</strong> for each configuration and suggested PyTorch could be an equally good (and more common) platform.<br />
He also cautioned that Pearson $\mathrm{CC}$ may not be the optimal objective function.</p>

<p>Karson confirmed that JAX runs efficiently on GPUs and planned to scale the approach to larger datasets by stacking multiple MTZ files.</p>

<hr />

<h2 id="where-this-might-go-next">Where This Might Go Next</h2>

<p>This prototype demonstrates that <strong>gradient-based optimization</strong> can efficiently identify the contribution of different MD frames to observed diffuse scattering patterns.<br />
Future directions include:</p>

<ul>
  <li>Expanding to full MD trajectories with thousands of frames</li>
  <li>Experimenting with alternate objectives (e.g., likelihood, cross-entropy)</li>
  <li>Incorporating <strong>crystal symmetry</strong> and <strong>resolution weighting</strong></li>
  <li>Exploring physical interpretations of the resulting weights</li>
</ul>

<hr />

<h2 id="code-and-data">Code and Data</h2>

<p>Karson’s implementation, <code class="language-plaintext highlighter-rouge">pearson_target.py</code>, is available <a href="https://github.com/k-chrispens/simulation_timeseries_optim">here</a>, and the test MTZ data can be downloaded from<br />
<a href="http://bl831.als.lbl.gov/~jamesh/pickup/diffUSE_CC_opt_test.tgz">here</a>.</p>

<hr />

<p><strong>TL;DR:</strong><br />
By treating MD frame weights as trainable parameters in a differentiable Pearson correlation objective, we can use ML optimizers like Adam to rapidly identify which parts of a trajectory best explain experimental diffuse scattering — turning a brute-force search into a smooth, data-driven optimization problem.</p>]]></content><author><name>James Holton, with contributions from Karson Chrispens, Steve, and Marcus Collins</name></author><category term="posts" /><category term="diffuse scattering" /><category term="molecular dynamics" /><category term="optimization" /><category term="machine learning" /><summary type="html"><![CDATA[Using gradient-based optimization to identify the most physically relevant portions of MD trajectories by maximizing agreement with diffuse scattering data.]]></summary></entry></feed>