Encoding Heterogeneity in mmCIF Files

Executive Summary

PDBx/mmCIF models do not adequately encode the conformational and compositional heterogeneity intrinsic to structural biology data. This problem becomes more pressing when examining new classes of experiments that frequently or expressly induce such heterogeneities, including fragment screening and time-resolved techniques. Leveraging mmCIF’s expressive nature, we propose a new category that enables users and software to describe heterogeneity by explicitly encoding the relationships between different pieces of heterogeneity.

The Problem

Most PDB depositions result from X-ray crystallography or single-particle cryo-electron microscopy (cryo-EM), which collect information from thousands to trillions of copies of the macromolecular complex. Each copy of the macromolecular complex can vary in the relative positions of atoms within it (conformational heterogeneity) or in the presence/absence of part of the complex (compositional heterogeneity). Conformational heterogeneity can occur across length scales from a single atom (e.g. a terminal hydroxyl in serine partially occupying two locations) to entire domain movements (e.g. the ribosome stalk partially occupying two locations)1–6. Compositional heterogeneity can also occur across length scales from an ion to a protein subunit7–11. Models intended to describe this experimental data must be able to effectively capture this inherent conformational and compositional heterogeneity; however, the current encoding falls short.

Currently, in PDBx/mmCIFs, harmonic conformational heterogeneity is captured by B-factors or translation-libration-screw [TLS] parameters. In contrast, non-harmonic conformational or compositional heterogeneity is captured through the alternative location indicator (altloc). Altlocs indicate the mutually exclusive positions any given atom can occupy, providing information on coordinates, occupancy, and B-factors for each position. Multiple atoms are often grouped into the same altloc, indicating that, on a local level, they are part of the same state.
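To make the altloc convention concrete, here is a minimal Python sketch that groups atoms by site and checks that the occupancies of their mutually exclusive altlocs do not exceed one. The records, the tuple layout, and the function names are hypothetical and chosen for illustration only; a real file would be read with an mmCIF parser.

```python
from collections import defaultdict

# Hypothetical, simplified atom records:
# (chain, residue number, atom name, altloc, occupancy).
# "." marks an atom with no alternative location, as in mmCIF.
atoms = [
    ("A", 42, "CB", ".", 1.00),
    ("A", 42, "OG", "A", 0.60),
    ("A", 42, "OG", "B", 0.40),
    ("A", 77, "CG", "A", 0.70),
    ("A", 77, "CG", "B", 0.50),  # deliberately inconsistent for the demo
]


def altloc_occupancy_sums(records):
    """Sum occupancy over all altlocs of each (chain, residue, atom name) site."""
    sums = defaultdict(float)
    for chain, resnum, name, altloc, occ in records:
        if altloc != ".":
            sums[(chain, resnum, name)] += occ
    return dict(sums)


def overfull_sites(records, tolerance=1e-6):
    """Return sites whose altloc occupancies sum to more than one."""
    totals = altloc_occupancy_sums(records)
    return sorted(site for site, total in totals.items() if total > 1.0 + tolerance)


sums = altloc_occupancy_sums(atoms)
print(overfull_sites(atoms))
```

Note that this check is purely local: it says nothing about whether altloc A at residue 42 and altloc A at residue 77 belong to the same underlying state, which is the gap discussed next.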

This convention falls well short when it becomes necessary to describe how altlocs are related to each other12. It has long been recognized that additional information on the relationships between altlocs is necessary to adequately fit this data13,14. However, this information is only provided in refinement programs (via grouped occupancy) and is not explicitly stated in deposited models.

For example, conditional on the presence or absence of a ligand, two ethylene glycol (EDO) molecules may be present, one of which can occupy two mutually exclusive states (Figure 1a). While there are ways to apply restraints and constraints in refinement, this conditionality cannot be expressed in PDBx/mmCIF altlocs. This limitation hurts our ability to read a model and understand the downstream

Biological Applications

While the interplay between conformational and compositional heterogeneity is present in all structural experiments, accurate encoding has particularly large implications for several classes of experiments that frequently or explicitly induce such heterogeneities, including fragment screening and time-resolved techniques.

Prominent use cases include:

X-ray crystallography fragment screening: X-ray crystallography fragment screening is an increasingly popular drug discovery technique that involves collecting hundreds to thousands of crystal structures of small-molecule fragments bound to proteins. Collectively, these structures provide building blocks for ligands by observing and modeling diverse small molecule starting points15. The challenge arises because the fragment or ligand is bound to only a subset of macromolecules, inducing partial occupancy changes across states, resulting in experimental data with high compositional and conformational heterogeneity 10,16,17. Such screens make up an increasing fraction of depositions in the PDB and are of significant importance to the pharmaceutical industry. A clear understanding of the hierarchical heterogeneity is essential for future chemical modifications and for using these data in automated methods or machine learning.

Time-resolved experiments: Time-resolved X-ray crystallography or cryo-EM involves collecting multiple datasets over time as the sample undergoes a dynamic process. To achieve this, the sample (typically protein crystals) is rapidly subjected to a specific stimulus (e.g. with temperature, light, or electric field). The changes induce structural rearrangements in the protein, and diffraction data are collected at multiple time points, enabling real-time tracking of these changes. These studies are critical to understanding protein function in terms of time-domain structural heterogeneity, such as catalysis, ligand binding, or macromolecular machines 18–22. While experiments often yield multiple depositions linked to different maps, within a single deposition, the ground state contains a mixture of conformers, and the excited state also contains a mixture of conformers, although we expect different occupancies. The inability to express the dependencies among occupancy groups in the mmCIF/PDBx means that the structure can be understood only with the accompanying literature, which is problematic for automated methods and machine learning.

Other experiments purposefully seek to enrich compositional heterogeneity. In these cases, a single collection reflects data from many biochemically distinct samples. For example, a collection of small molecules in a “cocktail” can be soaked into the same protein crystal, resulting in a single protein bound to many ligands, potentially in the same binding site 23. In another example, native complexes with multiple binding partners can be purified for cryo-EM 24 or imaged by cryo-electron tomography (cryo-ET) 6,25.

Proposed encoding of heterogeneity in PDBx/mmCIF

Thanks to the more expressive data model of PDBx/mmCIF, it is now possible to explicitly encode heterogeneity-related relationships that are implicit or ambiguous in current representations. To enable this, we propose a new category, comprising two tables, that directly encodes (i) the hierarchical relationships among compositional and conformational states and (ii) the coexistence or mutual exclusivity of those states. Together, these two tables provide a formal mechanism for representing both the dependency structure and logical compatibility among heterogeneous states in multiconformer or ensemble models.

The core table, termed the heterogeneity hierarchy, is a mandatory loop that links heterogeneity identifiers associated with atoms in the _atom_site category (and, when applicable, model IDs) to one another through explicit parent–child relationships. This hierarchy encodes how individual heterogeneous states are related (such as alternative ligand-binding events, conformational substates, or compositional variants), allowing users and software to reconstruct a structured view of heterogeneity rather than relying on proximity-based altloc conventions alone. The hierarchy table is complemented by a second table, the state coexistence table, which explicitly specifies which heterogeneity states may or may not co-occur.

Within the _atom_site category, we introduce a heterogeneity ID. This identifier is optional for PDB entries that do not require explicit heterogeneity encoding, but becomes mandatory for atoms that participate in the heterogeneity hierarchy. In some cases, heterogeneity IDs may correspond directly to existing altloc identifiers; however, unlike altlocs, each heterogeneity ID must represent a distinct biochemical or structural state. For example, two spatially distant atoms may both be labeled with altloc “A” under current conventions, even though there is no evidence that they represent the same underlying state. In such cases, distinct heterogeneity IDs would be assigned, allowing their relationships (or lack thereof) to be represented unambiguously within the hierarchy.

The heterogeneity hierarchy category uses four fields: name, id, parent, and details. The name field is mandatory and provides a short, human-readable label for the state. The id field is the category key and serves as the formal link between this table and the _atom_site category. The optional details field allows descriptive text (quoted if it contains spaces). The parent field references another heterogeneity ID and encodes the hierarchical structure: each heterogeneity ID has at most one parent, whereas any parent may have multiple children. This constraint enforces a tree-like hierarchy that captures how states refine or branch from one another. By traversing parent–child relationships, software can explicitly reconstruct the hierarchy of heterogeneity encoded across the atomic model.

Unlike traditional altloc usage, where identical altloc identifiers implicitly define a single mutually exclusive state within a local clash radius, the heterogeneity hierarchy provides a global, explicit description of state relationships. When conformational or compositional states are known to be linked biochemically, physically, or mechanistically, they may share the same altloc and heterogeneity ID. Conversely, when no such linkage is known, distinct heterogeneity IDs are assigned, even if the altloc labels coincide. This ensures that altloc usage remains compatible with existing practice while heterogeneity IDs provide the formal semantic layer needed to interpret relationships correctly.

Mutual compatibility among states is encoded separately in the state coexistence table. This table operates on heterogeneity IDs defined in the hierarchy and specifies logical rules (e.g., OR or NOT) that determine whether states may co-occur. Hierarchical relationships inherently encode AND relationships (parent and child states are coexistent by definition), whereas the coexistence table captures non-hierarchical logical constraints. Importantly, if no coexistence rule is specified between two heterogeneity IDs, they are assumed not to be mutually exclusive. As with altlocs, individual heterogeneity IDs are not expected to represent complete models in isolation; rather, the full hierarchy and coexistence definitions together should account for all atoms and fully explain the experimental data.
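As an illustration of how software might traverse the hierarchy, the Python sketch below parses a small loop and walks parent–child links in both directions. The four field names follow the description above and the category name follows the pdbx_heterogeneity_hierarchy.id item mentioned later in this post; the rows, the use of "." for "no parent", the parser, and the function names are assumptions made for this example, not part of a finalized dictionary.

```python
import shlex

# Hypothetical loop. Rows and the "." (no parent) convention are illustrative.
HIERARCHY_LOOP = """\
loop_
_pdbx_heterogeneity_hierarchy.id
_pdbx_heterogeneity_hierarchy.name
_pdbx_heterogeneity_hierarchy.parent
_pdbx_heterogeneity_hierarchy.details
1 apo . 'no ligand bound'
2 bound . 'ligand bound'
3 loop_open 2 'loop conformation 1 with ligand'
4 loop_closed 2 'loop conformation 2 with ligand'
5 water_in 4 .
"""


def parse_simple_loop(text):
    """Parse one simple mmCIF loop (single-line rows only) into a list of dicts."""
    columns, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            columns.append(line.split(".", 1)[1])  # keep the item name only
        else:
            rows.append(dict(zip(columns, shlex.split(line))))
    return rows


def parent_map(rows):
    """Map each heterogeneity ID to its parent ID (None for top-level states)."""
    return {r["id"]: (None if r["parent"] in (".", "?") else r["parent"]) for r in rows}


def ancestors(state_id, parents):
    """Walk from a state up to the root, nearest ancestor first."""
    chain, current = [], parents[state_id]
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


def descendants(state_id, parents):
    """Collect every state below the given one, at any depth."""
    found, frontier = [], [state_id]
    while frontier:
        node = frontier.pop()
        kids = [child for child, parent in parents.items() if parent == node]
        found.extend(kids)
        frontier.extend(kids)
    return sorted(found)


rows = parse_simple_loop(HIERARCHY_LOOP)
parents = parent_map(rows)
print(ancestors("5", parents))
print(descendants("2", parents))
```

Because each ID has at most one parent, the upward walk is a simple loop; it assumes the table contains no cycles, which a validator should check before traversal.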

Examples

A few examples are visually laid out here. Example mmCIF models and other updated information can be found in the Encoding GitHub Repository.

Consider a common situation in fragment screen data with extensive compositional heterogeneity. This structure has four ligands: three EDO molecules and a small fragment bound (Figure 1A). Based on occupancy, biochemical information, and overlap, it is known that when the fragment or EDO1 is bound, either EDO2 or EDO3 can then be bound. When EDO2 or EDO3 is bound, either EDO1 or the fragment is bound (Figure 1B). This example is shown in the proposed heterogeneity table in Figure 1C, and Figure 1D displays the atom table (which shows only one atom per ligand for simplicity).
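The logic of this example can be sketched in Python by enumerating which combinations of ligands may be present together. The rule set below is one possible reading of the description above (fragment and EDO1 exclude each other, EDO2 and EDO3 exclude each other, and EDO2 or EDO3 appear only alongside the fragment or EDO1); it is not taken from the figure, and the state names and data structures are hypothetical.

```python
from itertools import combinations

STATES = ["FRAG", "EDO1", "EDO2", "EDO3"]

# Assumed NOT rules: members of each pair never occur in the same copy.
MUTUALLY_EXCLUSIVE = [{"FRAG", "EDO1"}, {"EDO2", "EDO3"}]

# Assumed dependency: the key may be present only if one of the values is.
REQUIRES_ONE_OF = {
    "EDO2": {"FRAG", "EDO1"},
    "EDO3": {"FRAG", "EDO1"},
}


def is_allowed(present):
    """Check one combination of states against the assumed rules."""
    present = set(present)
    for pair in MUTUALLY_EXCLUSIVE:
        if pair <= present:  # both members of an exclusive pair are present
            return False
    for state, needed in REQUIRES_ONE_OF.items():
        if state in present and not (needed & present):
            return False
    return True


# Enumerate every non-empty combination and keep the compatible ones.
allowed = [
    combo
    for size in range(1, len(STATES) + 1)
    for combo in combinations(STATES, size)
    if is_allowed(combo)
]

for combo in allowed:
    print(" + ".join(combo))
```

Under these assumed rules, six combinations survive; a different reading of the biochemistry would change the rule tables but not the enumeration logic.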

Situations not addressed by the heterogeneity category

We deliberately chose not to encode the following scenarios in our current model.

We excluded a many-to-many relationship, specifying that a parent could have multiple children, but a child could not be associated with multiple parents. This does not strike us as a scientifically sensible thing to do: it would hardcode the commonality between conformations, whereas rationalising such correlations is decidedly a downstream task, namely the analysis of what the models mean, and highly dependent on the questions asked.

We did not include the association of multiple maps, such as those used in time-resolved or classification methods in cryo-EM. It is conceivable that a single model, supported by multiple maps, could be created within this hierarchical framework in the future.

The current encoding of hierarchical compositional and conformational heterogeneity is based solely on thermodynamic ensembles without accounting for timescales. However, the framework is flexible enough that timescale information could be incorporated into the table, such as with time-resolved techniques.
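The single-parent rule lends itself to a simple validity check. The following Python sketch, with a hypothetical function name and an assumed (id, parent) row layout, reports IDs that appear more than once (which would imply multiple parents), parents that are never defined, and cycles.

```python
def validate_hierarchy(rows):
    """Return a list of problems in (id, parent) rows; an empty list means a valid forest."""
    problems = []
    parents = {}
    for state_id, parent in rows:
        if state_id in parents:
            # id is the category key, so a repeat would imply a second parent.
            problems.append(f"{state_id}: listed more than once")
            continue
        parents[state_id] = parent

    for state_id, parent in parents.items():
        if parent is not None and parent not in parents:
            problems.append(f"{state_id}: parent {parent} is not defined")

    for start in parents:
        seen = {start}
        current = parents[start]
        while current is not None and current in parents:
            if current in seen:
                problems.append(f"{start}: reaches a cycle")
                break
            seen.add(current)
            current = parents[current]
    return problems


good = [("1", None), ("2", "1"), ("3", "1")]
bad = [("1", None), ("2", "1"), ("2", "3"), ("4", "9"), ("5", "6"), ("6", "5")]

print(validate_hierarchy(good))
for problem in validate_hierarchy(bad):
    print(problem)
```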

Working with existing software

As the proposed loop does not change the atom_site category, it should not break any existing software. However, to fully realize the potential of this encoding, it should be used by existing structural biology software, including model building10,27–30, refinement programs31–33, and visualization software30,34. Overall, these programs need to account for the heterogeneity loop when present. Additionally, software should account for the possibility that models using the new category may interpret altlocs slightly differently, with each altloc representing a distinct state.

Model Building

Model-building software should, to the best of its ability, output the loop and/or a grouped occupancy file describing how altlocs are related. While this loop/grouped occupancy may need to be manually adjusted for local heterogeneity hierarchy, it can be automated when clashes determine labeling, as is currently done in qFit and Coot28,30. Determining distal hierarchical relationships cannot be done using information from atom_site and will require external methods and individual input from modelers. Additionally, model-building software should err on the side of assigning new altlocs (_atom_site.label_alt_id), unless there is evidence that two pieces of heterogeneity are linked. In Coot, we imagine a user-defined flag (e.g. use hierarchy category) for assigning altlocs. Coot would also need to re-assign altlocs as altlocs are deleted from the model during manual building. If the hierarchy category exists in the input model, Coot must amend the table as needed. Currently, this would be based solely on user input and/or conflicting information regarding the conformations.
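One conservative default consistent with "assign new labels unless linkage is known" is to split atoms that share an altloc label into spatially connected clusters and give each cluster its own ID. The Python sketch below does this with a small union-find; the coordinates, the 4.0 Å cutoff, and the function name are assumptions for illustration, and this is not how qFit or Coot assign labels. Proximity is only a proxy here: a modeler with evidence of a distal link would merge clusters by hand.

```python
import math

# Hypothetical atoms: (label, altloc, x, y, z), coordinates in angstroms.
atoms = [
    ("SER42:OG", "A", 10.0, 10.0, 10.0),
    ("SER42:CB", "A", 11.2, 10.3, 10.1),
    ("LEU90:CD1", "A", 40.0, 35.0, 22.0),  # same label, far from Ser42
    ("SER42:OG", "B", 10.9, 11.4, 9.2),
]


def assign_heterogeneity_ids(records, cutoff=4.0):
    """Give each spatially connected cluster of same-altloc atoms its own integer ID."""
    parent = list(range(len(records)))

    def find(i):
        # Follow links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            same_altloc = records[i][1] == records[j][1]
            close = math.dist(records[i][2:], records[j][2:]) <= cutoff
            if same_altloc and close:
                parent[find(j)] = find(i)  # merge the two clusters

    ids, labels = {}, []
    for i in range(len(records)):
        root = find(i)
        ids.setdefault(root, len(ids) + 1)
        labels.append(ids[root])
    return labels


het_ids = assign_heterogeneity_ids(atoms)
for (label, altloc, *_), het_id in zip(atoms, het_ids):
    print(label, altloc, het_id)
```

The pairwise loop is quadratic in the number of atoms, which is fine for a demonstration; a real implementation would use a spatial index.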

We also encourage the development of tools that enable modelers to specify the hierarchical relationships between altlocs through a grouped occupancy file or by manually creating the hierarchy loop, with the software automatically correcting the labeling of altlocs in the atom table.

Refinement

We propose that refinement software include a flag to enable the use of information from the hierarchy category. If the flag is enabled, the software should expect a grouped-occupancy file and/or a hierarchy category within the PDBx/mmCIF file. Refinement software should then generate constraints and/or restraints based on the grouped occupancy file or the hierarchy category. If a grouped occupancy file is present, it will overwrite the existing hierarchy category in the PDBx/mmCIF file. Refinement should read the hierarchy loop from children to parents and/or use a grouped occupancy file to determine which hierarchy levels should sum to an occupancy of one or less. We envision a modified version of Refmac’s current “occupancy group alts complete” parameter that would create a restraint ensuring that the occupancy of specific IDs sums to that of a specified parent, reflecting their hierarchical relationship. This would be in addition to the restraint that atoms with the same ID (pdbx_heterogeneity_hierarchy.id) have the same occupancy. In Phenix and SHELX31,35, while you can create constraints of mutually exclusive groups using grouped occupancy or PART parameters, it is currently impossible to encode the hierarchy between these groups. When the flag is enabled, refinement software must output the category in the PDBx/mmCIF file, specifying how heterogeneity hierarchy is encoded and refined within the model.

Visualization

Visualization software should read the hierarchical heterogeneity loop to display hierarchical heterogeneity. Currently, PyMOL supports the use of altlocs in selection criteria and visualization. We aim for PyMOL to support the new heterogeneity loop in selection and visualization, allowing users to select a state and display all its children and/or parents.
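To make the refinement rule described in this section concrete (occupancies of child IDs summing to that of their parent, and top-level states summing to one or less), here is a minimal Python check. The parent map, the occupancy values, the tolerance, and the function name are hypothetical, and the sketch assumes that siblings are mutually exclusive and exhaustive, which a coexistence rule could relax. It verifies a refined model rather than implementing any refinement program's restraints.

```python
from collections import defaultdict

# Hypothetical hierarchy (id -> parent id) and refined occupancy per heterogeneity ID.
parents = {"1": None, "2": None, "3": "2", "4": "2"}
occupancy = {"1": 0.40, "2": 0.60, "3": 0.35, "4": 0.25}


def occupancy_violations(parents, occupancy, tolerance=0.01):
    """List (parent, child occupancy sum) pairs that break the assumed rules."""
    children = defaultdict(list)
    for state_id, parent in parents.items():
        children[parent].append(state_id)

    violations = []

    # Top-level states are alternatives, so together they cannot exceed one.
    top_total = sum(occupancy[s] for s in children[None])
    if top_total > 1.0 + tolerance:
        violations.append(("top level", round(top_total, 3)))

    # Children of a state should account for exactly that state's occupancy.
    for parent, kids in children.items():
        if parent is None:
            continue
        kid_total = sum(occupancy[k] for k in kids)
        if abs(kid_total - occupancy[parent]) > tolerance:
            violations.append((parent, round(kid_total, 3)))
    return violations


print(occupancy_violations(parents, occupancy))

# Raising one child's occupancy breaks the sum for parent "2".
inconsistent = dict(occupancy, **{"4": 0.40})
print(occupancy_violations(parents, inconsistent))
```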

Future Indications

This proposal is primarily motivated by X-ray crystallography, which enables the precise detection of individual states owing to its high resolution and recent technological advances. While cryo-EM currently captures most of its compositional and conformational heterogeneity through 3D classification of maps36, we are beginning to routinely obtain structures at sufficiently high resolution to capture heterogeneity through both map classification and the encoding of multiplicity within structural models.

Furthermore, the machine-readability and human interpretability of the inherent heterogeneity within models are critical for advancing the use and prediction of conformational ensembles. We anticipate that implementing this model will yield new tools and analyses to examine heterogeneity within and across structures, significantly expanding our understanding of the inherent heterogeneity in experimental structural biology data.

References

1. Baxevanis, A. D. & Francis Ouellette, B. F. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. (John Wiley & Sons, 2004).
2. DePristo, M. A., de Bakker, P. I. W. & Blundell, T. L. Heterogeneity and inaccuracy in protein structures solved by X-ray crystallography. Structure 12, 831–838 (2004).
3. Leschziner, A. E. & Nogales, E. Visualizing flexibility at molecular resolution: analysis of heterogeneity in single-particle electron microscopy reconstructions. Annu Rev Biophys Biomol Struct 36, 43–62 (2007).
4. Smith, J. L., Hendrickson, W. A., Honzatko, R. B. & Sheriff, S. Structural heterogeneity in protein crystals. Biochemistry 25, 5018–5027 (1986).
5. Zhong, E. D., Bepler, T., Berger, B. & Davis, J. H. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nat Methods 18, 176–185 (2021).
6. Powell, B. M. & Davis, J. H. Learning structural heterogeneity from cryo-electron sub-tomograms with tomoDRGN. Nat Methods 21, 1525–1536 (2024).
7. Forsberg, B. O., Shah, P. N. M. & Burt, A. A robust normalized local filter to estimate compositional heterogeneity directly from cryo-EM maps. Nat Commun 14, 5802 (2023).
8. Rabuck-Gibbons, J. N., Lyumkis, D. & Williamson, J. R. Quantitative mining of compositional heterogeneity in cryo-EM datasets of ribosome assembly intermediates. Structure 30, 498–509.e4 (2022).
9. Du, S. et al. Refinement of multiconformer ensemble models from multi-temperature X-ray diffraction data. Methods Enzymol 688, 223–254 (2023).
10. Pearce, N. M. et al. A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron density. Nat Commun 8, 15123 (2017).
11. Pearce, N. M., Krojer, T. & von Delft, F. Proper modelling of ligand binding requires an ensemble of bound and unbound states. Acta Crystallogr D Struct Biol 73, 256–266 (2017).
12. Wankowicz, S. A. & Fraser, J. S. Comprehensive encoding of conformational and compositional protein structural ensembles through the mmCIF data structure. IUCrJ 11, 494–501 (2024).
13. Hendrickson, W. A. Stereochemically restrained refinement of macromolecular structures. Methods Enzymol 115, 252–270 (1985).
14. Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr D Biol Crystallogr 68, 352–367 (2012).
15. Correy, G. J. et al. Extensive exploration of structure activity relationships for the SARS-CoV-2 macrodomain from shape-based fragment merging and active learning. bioRxiv (2024) doi:10.1101/2024.08.25.609621.
16. Douangamath, A. et al. Achieving Efficient Fragment Screening at XChem Facility at Diamond Light Source. J Vis Exp (2021) doi:10.3791/62414.
17. Erlanson, D. et al. Where to house big data on small fragments? ChemRxiv (2025) doi:10.26434/chemrxiv-2025-hjjnj.
18. Šrajer, V. & Schmidt, M. Watching Proteins Function with Time-resolved X-ray Crystallography. J Phys D Appl Phys 50, (2017).
19. De Zitter, E., Coquelle, N., Oeser, P., Barends, T. R. M. & Colletier, J.-P. Xtrapol8 enables automatic elucidation of low-occupancy intermediate-states in crystallographic studies. Commun Biol 5, 640 (2022).
20. Greisman, J. B. et al. Resolving conformational changes that mediate a two-step catalytic mechanism in a model enzyme. bioRxiv (2023) doi:10.1101/2023.06.02.543507.
21. Wolff, A. M. et al. Mapping protein dynamics at high spatial resolution with temperature-jump X-ray crystallography. Nat Chem 15, 1549–1558 (2023).
22. Thompson, M. C. et al. Temperature-jump solution X-ray scattering reveals distinct motions in a dynamic enzyme. Nat Chem 11, 1058–1066 (2019).
23. Verlinde, C. L. M. J. et al. Fragment-based cocktail crystallography by the medical structural genomics of pathogenic protozoa consortium. Curr Top Med Chem 9, 1678–1687 (2009).
24. Chen, Z. et al. EMC chaperone-Ca structure reveals an ion channel assembly intermediate. Nature 619, 410–419 (2023).
25. Baker, L. A., Grange, M. & Grünewald, K. Electron cryo-tomography captures macromolecular complexes in native environments. Curr Opin Struct Biol 46, 149–156 (2017).
26. Westbrook, J. D. et al. PDBx/mmCIF Ecosystem: Foundational Semantic Tools for Structural Biology. J Mol Biol 434, 167599 (2022).
27. Flowers, J. et al. Expanding Automated Multiconformer Ligand Modeling to Macrocycles and Fragments. bioRxiv (2024) doi:10.1101/2024.09.20.613996.
28. Wankowicz, S. A. et al. Automated multiconformer model building for X-ray crystallography and cryo-EM. Elife 12, (2024).
29. Stachowski, T. R. & Fischer, M. FLEXR: automated multi-conformer model building using electron-density map sampling. Acta Crystallogr D Struct Biol 79, 354–367 (2023).
30. Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr D Biol Crystallogr 66, 486–501 (2010).
31. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr 66, 213–221 (2010).
32. Smart, O. S. et al. Exploiting structure similarity in refinement: automated NCS and target-structure restraints in BUSTER. Acta Crystallogr D Biol Crystallogr 68, 368–380 (2012).
33. Murshudov, G. N. et al. REFMAC5 for the refinement of macromolecular crystal structures. Acta Crystallogr D Biol Crystallogr 67, 355–367 (2011).
34. Lill, M. A. & Danielson, M. L. Computer-aided drug design platform using PyMOL. J Comput Aided Mol Des 25, 13–19 (2011).
35. Schneider, T. R. & Sheldrick, G. M. Substructure solution with SHELXD. Acta Crystallogr D Biol Crystallogr 58, 1772–1779 (2002).
36. Kimanius, D., Dong, L., Sharov, G., Nakane, T. & Scheres, S. H. W. New tools for automated cryo-EM single-particle analysis in RELION-4.0. Biochem J 478, 4169–4185 (2021).


Stephanie Wankowicz
