
Representing conformational heterogeneity

Our structural biology techniques capture an enormous amount of conformational heterogeneity that is often lost in the transition from experimental data to deposited models. Part of this loss stems from a lack of sufficiently sophisticated algorithmic methods, which is an active area of development in this project and elsewhere. Still, an equally important factor is how we choose to encode structural heterogeneity in the models themselves.

In the majority of structures deposited in the Protein Data Bank (PDB), conformational heterogeneity is represented only in a harmonic sense, through atomic displacement parameters (B-factors) or translation–libration–screw (TLS) parameters. These parameters can be incorporated into a single structural model, describing the amplitude and anisotropy of atomic fluctuations around a mean position. However, they do not encode anharmonic or discrete conformational variability. To capture non-harmonic conformational heterogeneity, models have been developed that explicitly include multiple sets of atomic coordinates. Two dominant strategies have emerged from X-ray crystallography and cryo-EM: multiconformer models and multi-model (ensemble) models[1].

A multiconformer model represents conformational diversity locally, without duplicating the entire macromolecule. When a region of the electron density is well described by a single conformation, only one set of coordinates with appropriate B-factors is modeled. When the density indicates discrete alternative conformations, such as side-chain rotamers or backbone flips, the relevant atoms are copied and assigned alternate location (ALTLOC) identifiers in the PDB file. We have previously demonstrated that this modeling approach can substantially improve the fit to experimental data, reduce geometric distortions, and eliminate many rotamer outliers[2].
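As a minimal sketch of how alternate locations appear in a legacy PDB file, the snippet below lists the altloc identifiers found on each residue. It assumes fixed-column ATOM/HETATM records (altloc in column 17, chain in column 22, residue number in columns 23 to 26); the function name `altlocs_per_residue` and the example coordinates are hypothetical and are not part of qFit.

```python
from collections import defaultdict


def altlocs_per_residue(pdb_text):
    """Map (chain, resseq, icode) -> sorted altloc IDs seen on that residue."""
    found = defaultdict(set)
    for line in pdb_text.splitlines():
        # Only coordinate records carry altloc identifiers.
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
            continue
        altloc = line[16]  # column 17 in the fixed-width PDB format
        if altloc != " ":
            key = (line[21], int(line[22:26]), line[26].strip())
            found[key].add(altloc)
    return {key: sorted(ids) for key, ids in found.items()}


# Hypothetical three-atom example: Ser10 has two conformers, Gly11 has one.
example = "\n".join([
    "ATOM      1  CB ASER A  10      11.000  12.000  13.000  0.60 20.00",
    "ATOM      2  CB BSER A  10      11.500  12.300  13.200  0.40 20.00",
    "ATOM      3  CA  GLY A  11      14.000  15.000  16.000  1.00 15.00",
])
print(altlocs_per_residue(example))  # {('A', 10, ''): ['A', 'B']}
```

Residues modeled with a single conformation have a blank altloc field and therefore do not appear in the result.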

Multi-model approaches represent heterogeneity by encoding multiple complete copies of the system, which can sometimes capture structural motions such as backbone shifts more effectively. However, ensembles containing tens to hundreds of models can lead to a high parameter-to-data ratio. The most common ensemble models used in Bragg peak analysis today are time-averaged ensembles generated by molecular dynamics (MD) simulations. These ensembles are restrained by time-averaged X-ray structure factors to produce a large number of models, often hundreds, each representing a snapshot from a single trajectory[3]. Further, crystalline MD simulations are currently the best available models for describing diffuse scattering data[4].

Converting between the two representations is non-trivial, because the two model types are generated differently and contain different information. Still, conversion is often needed. For example, the primary method we use to model and represent diffuse data is molecular dynamics, which generates ensemble models. As described by members of this project and others, these MD ensembles correspond poorly with the Bragg peak data, and there is currently no approach for refining an MD ensemble further against Bragg peaks. One way to represent and refine such a model against Bragg peaks is as a multiconformer model, which is compatible with traditional refinement software and allows manual manipulation.

Converting multiconformer to ensemble models

In general, the number of ways to combine single-residue conformations grows exponentially with the number of residues that have alternate conformations. Enumerating all combinations therefore becomes infeasible as soon as a structure contains more than a handful of residues with AltLoc conformations. Previously, Gutermuth et al. proposed an algorithm for converting multiconformer models into ensemble models (AltLocEnumerator)[5]. This fast branch-and-bound algorithm generates valid alternative protein structure conformations from AltLoc annotations. It searches for compatible residue conformations and maximizes the probability of the conformational states by scoring the AltLoc occupancy values. We aimed to convert a qFit multiconformer PDB into an ensemble structure (PDB: 5iu1). We tried multiple options, including enumerating all combinations, optimizing the occupancy score, and changing the number of models, but all of them produced models that almost always overlapped (distribution of all heavy atom RMSD shown below).

[Figure: rmsd_protein, distribution of all heavy atom RMSD across the generated models]
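To make the combinatorial growth behind full enumeration concrete, the sketch below counts the unconstrained combinations of residue conformers. It is only arithmetic under the assumption that every combination is allowed; AltLocEnumerator itself prunes incompatible combinations, and the function name `n_combinations` is hypothetical.

```python
import math


def n_combinations(conformers_per_residue):
    """Number of whole-structure combinations if every residue-level
    conformer could be paired freely with every other one."""
    return math.prod(conformers_per_residue)


# 30 residues with two conformers each already exceed a billion combinations.
print(n_combinations([2] * 30))  # 1073741824

# Residues with a single conformer do not add combinations.
print(n_combinations([1, 1, 3, 2]))  # 6
```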

We then generated five separate models, one per altloc ID (qFit models up to five altlocs per residue), using the ‘altlocid’ option. This gave us models separated by more realistic RMSD values (distribution of all heavy atom RMSD shown below).

AltLocEnumerator --file 5i1u_final_qFit.pdb --altlocid A

[Figure: 5ilu_altloc_rmsd_protein, distribution of all heavy atom RMSD across the per-altloc models]

This command needs to be repeated for each altloc ID, and the resulting models should then be concatenated, each wrapped in MODEL/ENDMODEL lines. Of note, this algorithm is not open source, but it is available to academics free of charge.
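The concatenation step can be sketched as follows. This is a minimal illustration, assuming each per-altloc run has already been read into a string of legacy-format PDB text; the function name `concatenate_models` is hypothetical, and how AltLocEnumerator names its output files is not assumed here.

```python
def concatenate_models(pdb_texts):
    """Wrap each single-model PDB text in MODEL/ENDMODEL records and
    join them into one multi-model PDB text."""
    coordinate_records = ("ATOM", "HETATM", "ANISOU", "TER")
    out = []
    # Keep the unit cell record once, taken from the first input (if present).
    for line in pdb_texts[0].splitlines():
        if line.startswith("CRYST1"):
            out.append(line)
            break
    for number, text in enumerate(pdb_texts, start=1):
        # The model serial number is right-justified in columns 11 to 14.
        out.append(f"MODEL     {number:>4}")
        out.extend(
            line for line in text.splitlines()
            if line.startswith(coordinate_records)
        )
        out.append("ENDMODEL")
    out.append("END")
    return "\n".join(out) + "\n"
```

Passing the text of the A, B, C, D, and E outputs in that order would give a five-model file; any header records other than CRYST1 are dropped in this sketch.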

Converting ensemble models to multiconformer

There was no existing tool to convert an ensemble model to a multiconformer model, prompting us to design one using qFit. Our approach systematically collapses an ensemble by iterating over each residue across all models and clustering equivalent residues. Taking the first model as the reference, we assign the residue from each subsequent model to an existing cluster if its RMSD to the cluster centroid is within 1 Å (default parameter). If the RMSD exceeds this threshold, a new conformation is created. We then use the relabel function in qFit, which reassigns altloc labels through simulated annealing (SA) optimization of a Lennard-Jones potential, ensuring that conformers of different residues/ligands have consistent altloc labels. Note that although this collapses many conformers, many residues still have 26+ conformers, which cannot be represented in the historic PDB format because its altloc identifier is a single character.
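The per-residue clustering described above can be sketched roughly as follows. This is not the qFit implementation: it assumes all models share one coordinate frame (no superposition), that each residue has the same atoms in the same order in every model, that the centroid is the running mean of the cluster members, and that a residue joins the closest qualifying cluster. The function name `cluster_residue` is hypothetical.

```python
import numpy as np


def cluster_residue(coords, threshold=1.0):
    """Greedy RMSD clustering of one residue across ensemble models.

    coords: array-like of shape (n_models, n_atoms, 3).
    Returns (labels, centroids), with one cluster label per model.
    """
    coords = np.asarray(coords, dtype=float)
    centroids = []  # mean coordinates of each cluster
    members = []    # model indices belonging to each cluster
    labels = []
    for i, xyz in enumerate(coords):
        best, best_rmsd = None, None
        for c, centroid in enumerate(centroids):
            rmsd = np.sqrt(np.mean(np.sum((xyz - centroid) ** 2, axis=1)))
            if rmsd <= threshold and (best is None or rmsd < best_rmsd):
                best, best_rmsd = c, rmsd
        if best is None:
            # No cluster within the threshold: start a new conformation.
            centroids.append(xyz.copy())
            members.append([i])
            labels.append(len(centroids) - 1)
        else:
            members[best].append(i)
            centroids[best] = coords[members[best]].mean(axis=0)
            labels.append(best)
    return labels, centroids
```

Each resulting cluster would correspond to one candidate alternate conformation for that residue; because the first model seeds the first cluster, the outcome of a greedy scheme like this can depend on model order.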

While we can create a multiconformer model, a few issues remain. First, we do not have a correct occupancy for any residue: all residues are currently assigned an occupancy of 0.50. Second, there may be problems with backbone geometry, because conformations are removed at the residue level. The conversion is also currently very slow (~45 minutes for 400 residues with 70 models). While imperfect, the resulting multiconformer models can be fed into other algorithms, such as refinement or qFit.

This tool is available in the qFit repository as multimodel_2_multiconformer.py and requires only an input PDB.

References

  1. Woldeyes RA, Sivak DA, Fraser JS. E pluribus unum, no more: from one crystal, many conformations. Curr Opin Struct Biol. 2014;28: 56–62.
  2. Wankowicz SA, Ravikumar A, Sharma S, Riley BT, Raju A, Flowers J, et al. Uncovering Protein Ensembles: Automated Multiconformer Model Building for X-ray Crystallography and Cryo-EM. bioRxiv. 2024. doi:10.1101/2023.06.28.546963
  3. Burnley BT, Afonine PV, Adams PD, Gros P. Modelling dynamics in protein crystal structures by ensemble refinement. eLife. 2012;1: e00311.
  4. Wall ME. Internal protein motions in molecular-dynamics simulations of Bragg and diffuse X-ray scattering. IUCrJ. 2018;5: 172–181.
  5. Gutermuth T, Sieg J, Stohn T, Rarey M. Modeling with Alternate Locations in X-ray Protein Structures. J Chem Inf Model. 2023;63: 2573–2585.
