Encoding data for a protein-dynamics-centered AlphaFold
The breakthroughs of AlphaFold and its successors galvanized both the biology and machine learning communities. What enabled this revolution? High-quality, standardized data, openly organized for all to use.
For decades, structural biology has relied on the Protein Data Bank (PDB), a remarkable archive of macromolecular structures. Its standardization made large-scale machine learning possible, creating the conditions that allowed AlphaFold to thrive. But while AlphaFold changed our expectations for static structure prediction, the next frontier is dynamics, and here, our data infrastructure is still stuck in the past.
Excitingly, much of this information on dynamics is already encoded in the raw experimental data we’ve been collecting for decades, and more is being gleaned from expanding data collection in diffuse scattering. Yet our current data formats are built for single, static models, making them ill-suited to capture loop motions, alternate side-chain conformations, ligand-induced flexibility, or compositional variation.
As a result, much of this rich information is lost, limiting both the biological insights we can draw and the conclusions AI algorithms can reach. We are working to change that. If AlphaFold’s success was fueled by standardized, static structures, imagine the possibilities if we could deliver standardized, machine-readable, and human-interpretable representations of dynamic structures.
As we have written before, we envision a hierarchical, ensemble-aware encoding framework designed to capture the full complexity of macromolecular dynamics. This includes distinguishing between different sources of heterogeneity, such as conformational changes versus compositional variation, and representing them in a nested structure that reflects the true physical states. Beyond encoding a single model, this would enable searches based on dynamic properties, for example identifying all proteins where a particular loop adopts multiple conformations or where ligand binding alters flexibility in a neighboring site. To enable discovery co-driven by AI, these representations must serve both human reasoning and machine learning, which means encoding at the individual model level and making this data practical for researchers to access, adapt, and integrate into their pipelines. Such a system would not only help scientists interpret complex structures but also establish community benchmarks, uncover systematic errors, and accelerate method development.
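To make the idea of a nested, searchable encoding concrete, here is a minimal Python sketch. It is purely illustrative: the class names (Conformer, Region, CompositionalState, Entry), the query functions, and the demo data are all hypothetical, and none of this is an existing file format or the schema we are proposing. It assumes compositional variation (for example apo versus ligand-bound) sits at the outer level of the hierarchy, with conformational alternatives nested inside each region.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Conformer:
    """One conformational alternative for a region (hypothetical type)."""
    label: str        # e.g. an alternate-location style label such as "A"
    occupancy: float  # fractional population, between 0 and 1


@dataclass
class Region:
    """A named part of the structure, e.g. a loop or a side chain."""
    name: str
    conformers: List[Conformer] = field(default_factory=list)

    def is_heterogeneous(self) -> bool:
        # More than one conformer means conformational heterogeneity.
        return len(self.conformers) > 1


@dataclass
class CompositionalState:
    """One composition of the system, e.g. "apo" or "ligand-bound"."""
    name: str
    regions: List[Region] = field(default_factory=list)


@dataclass
class Entry:
    """A deposited structure: compositional states containing regions."""
    entry_id: str
    states: List[CompositionalState] = field(default_factory=list)


def entries_with_flexible_region(entries: List[Entry], region_name: str) -> List[str]:
    """IDs of entries where the named region has multiple conformers in any state."""
    return [
        e.entry_id
        for e in entries
        if any(
            r.name == region_name and r.is_heterogeneous()
            for s in e.states
            for r in s.regions
        )
    ]


def conformer_counts_by_state(entry: Entry, region_name: str) -> Dict[str, int]:
    """Number of conformers of the named region in each compositional state."""
    return {
        s.name: len(r.conformers)
        for s in entry.states
        for r in s.regions
        if r.name == region_name
    }


# Made-up demo data, not taken from any real structure.
demo_entries = [
    Entry("demo1", [
        CompositionalState("apo", [
            Region("loop", [Conformer("A", 0.6), Conformer("B", 0.4)]),
        ]),
        CompositionalState("ligand-bound", [
            Region("loop", [Conformer("A", 1.0)]),
        ]),
    ]),
    Entry("demo2", [
        CompositionalState("apo", [Region("loop", [Conformer("A", 1.0)])]),
    ]),
]

flexible = entries_with_flexible_region(demo_entries, "loop")
counts = conformer_counts_by_state(demo_entries[0], "loop")
```

Here the first query stands in for "find all proteins where a particular loop adopts multiple conformations", and the per-state counts stand in for "ligand binding alters flexibility": in the made-up demo1 entry the loop has two conformers in the apo state and one when the ligand is bound. A real encoding would need far richer information (coordinates, correlations between regions, links to the experimental data), but the nesting of states, regions, and conformers is the core idea.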
Our goal is to re-engineer the encoding and infrastructure of structural biology to embrace dynamics and set the stage for the next revolution, one where machine learning models not only predict what a protein looks like but also how it moves and functions.