Sep 20, 2011

One of the goals of computational biology is to predict the complete higher-order structure of a protein from its amino acid sequence. Often reasonably good structures can be produced by modeling a new protein on an already-known structure of a homologous protein, one with a similar sequence and presumably a similar structure. However, these models can be inaccurate, and obviously this method will not work if no homologous structure is known.

Foldit is an online game developed by the research team of Dr. David Baker that attempts to address this problem by combining an automated structure prediction program called ROSETTA with input from human players who manually remodel structures to improve them. Even though most of the players have little or no advanced biochemical knowledge, Foldit has already had some striking results improving on computational models. An upcoming paper in Nature Structural & Molecular Biology (1) (PDF also available directly from the Baker lab) details some interesting new successes from the Foldit players.

Contrary to some reports, the Foldit players did not solve any mystery directly related to HIV, although their work may prove helpful in developing new drugs for AIDS. What the Foldit players actually did was to outperform many protein structure prediction algorithms in the CASP9 contest, and to play a key role in helping solve the structure of an unusual protease from a simian retrovirus.

M-PMV Protease

If you don’t recognize Mason-Pfizer Monkey Virus (M-PMV) as a cause of AIDS in humans, that’s because it isn’t. It causes acquired immune deficiency in macaques, however, and it has an unusual protease that may tell us useful things.

Crystal structure of inactive HIV-1 protease mutant in complex with substrate.

A crystal structure of an inactive mutant of HIV-1 protease in complex with its substrate. The protease monomers are in dark green and cyan, the substrate is represented as purple bonds.

Retroviruses like HIV often produce proteins in a fused form rather than as individual folded units. In order to be functional, the various proteins must be snipped out of these long polyprotein strands, so the virus includes a protease (protein-cutting enzyme) to do this. In most retroviruses, this protease is dimeric: it is composed of two protein molecules with identical sequences and similar, symmetric structures. The long-known structure of HIV protease, seen on the right (learn more about HIV protease or explore this structure at the Protein Data Bank) is an example of this architecture.

People infected with HIV often take protease inhibitors to interfere with viral replication. These drugs attack the active site, where the chemical reaction that cuts the protein strand takes place, but it has been theorized that viral proteases could also be attacked by splitting up the dimers into single proteins, or monomers. The problem is, the free monomer structures aren’t known.

This is where the M-PMV protease comes in. Although it is homologous to the dimeric proteases, M-PMV protease is a monomer in the absence of its cutting target. If we knew this protein’s structure, we could perhaps design drugs that would stabilize other proteases in their monomer form, rendering them inactive. An attempt to determine the structure using nuclear magnetic resonance (NMR) data produced models that seemed poorly folded and had bad ROSETTA energy scores. And, although the protein formed crystals, X-ray crystallography could not solve its structure either, despite a decade of effort.

An X-ray diffraction pattern.

The reason for this has to do with how X-ray crystallography works. If you fire a beam of X-rays at a crystal of a protein, some of the rays will be deflected by electrons within it and you will observe a pattern of diffracted dots similar to the one at left, kindly provided by my colleague Young-Jin Cho. The intensities and locations of these dots depend on the structure and arrangement of the molecules within the crystal. X-ray crystallographers can use the diffraction patterns to calculate the electron density of the protein and fit the molecular bonds into it (below, also courtesy of Young-Jin). However, the electron density cannot be calculated from the diffraction pattern unless the phases of the diffracted X-rays are also known. Unfortunately there is no way to calculate the phases from the dots.

An electron density map

An electron density model (wireframe) with the chemical bonds of the peptide backbone (heavy lines) fitted into it.
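To see why losing the phases is so crippling, consider a toy one-dimensional "crystal". This is a minimal numpy sketch with made-up densities (real crystallography works in three dimensions with measured structure-factor amplitudes), but it shows that spot intensities alone cannot distinguish different density maps:

```python
import numpy as np

# Toy 1D "electron density": two Gaussian atoms in a periodic unit cell.
x = np.linspace(0, 1, 256, endpoint=False)
density_a = np.exp(-((x - 0.3) / 0.02) ** 2) + np.exp(-((x - 0.7) / 0.02) ** 2)

# The same molecule translated within the cell (a circular shift).
density_b = np.roll(density_a, 51)

# A diffraction experiment records only the magnitudes of the Fourier
# coefficients (the spot intensities); the phases are thrown away.
amps_a = np.abs(np.fft.fft(density_a))
amps_b = np.abs(np.fft.fft(density_b))

# The magnitudes are identical, so the spots alone cannot tell these two
# maps apart -- recovering the density requires phase information.
print(np.allclose(amps_a, amps_b))  # True
```

This is also why a good starting model matters so much for molecular replacement: the trial phases come from the model, and the measured amplitudes cannot supply them.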

There are many ways to solve this problem, but not all of them work in every system. One widely-applicable approach is called “molecular replacement”. In this method, a protein with a structure similar to that of the one being studied is used to guess the phases. If this guess is close enough, the structure factors can be refined from there. In the case of M-PMV protease, however, the dimeric homologues could not be used for replacement, and an attempt to use the NMR structure to calculate the phases also failed.

Then the Foldit players went to work. Starting from the NMR structure, Foldit players made a variety of refinements. A player called spvincent made some improvements using the new alignment tool, which a player called grabhorn improved further by rearranging the amino acid side chains in the core of the molecule. A player named mimi added the final touch by rearranging a critical loop.

Going from mimi’s structure (several others also proved suitable), the crystallographers were able to solve the phase problem by molecular replacement and finally determine the protease’s structure. None of the Foldit results were exactly right, so it’s inaccurate to say that the players solved the structure. However, their models were very close to the right answer, and provided the critical data that allowed the crystal structure to be solved. Once the paper is published, you’ll be able to find that structure at the PDB under the accession code 3SQF.

We can’t know right now whether this structure will enable the design of new drugs, but the Foldit players were the key to giving us a better chance of using it for this purpose. What may be even more exciting is the possibility that Foldit could be used in other structural studies to come up with improved starting models for molecular replacement. As with any method of predicting protein structures, however, the gold standard is CASP, so the Foldit teams participated in CASP9.


The Critical Assessment of protein Structure Prediction is a long-running biennial test of computer algorithms to calculate a protein’s structure from its sequence. This experiment in prediction has a fairly simple setup.

1) Structural biologists give unpublished structures to the CASP organizers.

2) The sequences belonging to these structures are given to computational biologists.

3) After a set period, the computational predictions are compared to the known structural results.

The Baker group generated starting structures using ROSETTA, then handed the five lowest-energy results off to the Foldit players. For proteins that had known homologues, the results were disappointing: the Foldit players produced reasonable models, but they overused Foldit’s ROSETTA-based minimization routine, which tended to distort conserved loops.

An energy landscape showing an incorrect move towards a false minimum and a correct, more difficult move towards a true minimum.

The nature of this problem became even clearer when the Baker group handed the Foldit players ROSETTA results for proteins that had no known homologues. In that case they noticed that players were using the minimization routine to “tunnel” to nearby, incorrect minima. You can get a feel for what that means by looking at the figure to the left.

In this energy landscape diagram, the blue line represents every possible structure of a pretend protein laid out in a line, with similar structures near each other and the higher-energy (worse) structures placed higher on the Y axis. From a relatively high-energy initial structure, Foldit players tended to use minimization to draw it ever-downward towards the nearest minimum-energy structure (red arrow). Overuse of the computer algorithm discouraged them from pulling the structure past a disfavored state that would then start to collapse towards the true, global minimum energy (green arrow).
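A minimal sketch of that trap, using a hypothetical polynomial landscape (nothing here comes from Foldit or ROSETTA): pure downhill minimization from the starting structure slides into the nearest basin, while reaching the global minimum would first require moving uphill past the barrier.

```python
# Hypothetical 1D energy landscape: a tilted double well with a shallow
# local minimum near x = 1.93 and a deeper global minimum near x = -2.06,
# separated by a barrier near x = 0.13.
def energy(x):
    return 0.25 * x**4 - 2 * x**2 + 0.5 * x

def minimize(x, steps=5000, lr=0.01):
    """Greedy downhill (gradient descent) -- the 'red arrow' move."""
    for _ in range(steps):
        x -= lr * (x**3 - 4 * x + 0.5)  # dE/dx
    return x

trapped = minimize(1.0)   # starts in the right-hand basin: stuck locally
best = minimize(-0.5)     # starts just past the barrier: reaches global min

print(round(trapped, 2), round(best, 2))
print(energy(trapped) > energy(best))  # True: greedy descent found the worse basin
```

Starting on the wrong side of the barrier, no amount of extra minimization helps; something (here, a different starting point; in Foldit, a daring manual move) has to carry the structure over the hump first.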

The Foldit players still had some successes — for instance, they were able to recognize one structure ROSETTA didn’t like very much as a near-native structure. The Void Crushers team successfully optimized this structure, producing the best score for that particular target, and one of the highest scores of the CASP test. If the initial ROSETTA structures had too low a starting energy, though, the players wouldn’t perturb them enough to get over humps in the landscape.

Thus, Baker’s group tried a new strategy. Taking the parts of one prediction that they knew (from the CASP organizers) were correct, they kept the sequence aligned to those regions and then took a hammer to the rest, pushing loops and structural elements out of alignment. This encouraged the players to be more daring in their remodeling of regions where the predictions had been poor, while preserving the good features of the structure. Again, the Void Crushers won special mention, producing the best-scoring structure of target TR624 in the whole competition.

Man over machine?

Does this prove that gamers know more about folding proteins than computers do? Some of them might, but Foldit doesn’t really use human expertise. Rather, the game uses human intelligence to identify when the ROSETTA program has gone down the wrong path and figure out how to push it over the hump. When the human intelligences aren’t daring enough, or trust the system too much, as in the case of the CASP results, Foldit doesn’t do any better than completely automated structural methods. When the human players are encouraged to challenge the computational results, however, the results can be striking. As Baker’s group are clearly aware, further development of the program needs to be oriented towards encouraging players to go further afield from the initial ROSETTA predictions. This will likely mean many more failed attempts by players, but also more significant successes like these.

Disclaimer: I am currently collaborating with David Baker’s group on a research project involving ROSETTA (but not Foldit).

1) Khatib, F., DiMaio, F., Cooper, S., Kazmierczyk, M., Gilski, M., Krzywda, S., Zabranska, H., Pichova, I., Thompson, J., Popović, Z., Jaskolski, M., & Baker, D. (2011). Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology DOI: 10.1038/nsmb.2119

Mar 23, 2010
A protein has several different levels of structure. The primary structure is the arrangement of atoms and bonds, and it is formed in the ribosome by the assembly of amino acids as directed by an RNA template. The secondary structure is the local topology, the helices and strands, and this forms mostly because of the release of energy through the formation of hydrogen bonds. The tertiary structure is the actual fold of the protein, the way helices, strands, and loops are arranged in space. The fold forms primarily because of the favorable entropy of burying the protein’s hydrophobic groups where water cannot access them, analogous to the formation of an oil droplet in water. This suggests that, in addition to the well-known phenomenon of proteins denaturing, or losing their higher-order structure, under conditions of high heat, proteins might also denature when they get too cold.

As you might remember from your chemistry classes, the change in free energy due to a reaction under conditions of constant pressure is given by:
ΔG = ΔH – T ΔS
Where ΔH is the change in enthalpy (i.e. the heat released or absorbed by a reaction), ΔS is the change in entropy, and T is the temperature of the system in Kelvin. Here, the change we are talking about is the transition from the folded state to some unfolded state. Simplistically, since the entropic contribution is scaled by the temperature, one can imagine that for a reaction with favorable entropy and unfavorable enthalpy, lowering the temperature could cause the reaction to reverse. Protein folding is only marginally favorable at biological temperatures, so one could easily imagine that lowering the temperature enough could cause a protein to prefer the unfolded state.

Of course, this is an oversimplification: the entropy and enthalpy of a particular protein state do not remain constant over all temperatures. Rather, they vary in a way determined by the heat capacity (Cp), such that ΔG as a function of temperature is (1):

ΔG(T) = ΔH(Tr) + ΔCp(T-Tr) – T [ΔS(Tr) + ΔCp ln(T/Tr)]
Where Tr is some reference state at which the thermodynamic parameters have been determined, and ΔCp is defined with respect to the native (folded) state. Because the various states of the protein have different Cp (unfolded chains typically have higher Cp), at certain temperatures above and below the biological optimum we can expect proteins to lose their higher levels of structure. Even this is still an oversimplification, of course, because it does not directly account for changes in water structure and cosolute properties with temperature. These features may cause ΔCp itself to vary with temperature rather than remain constant.
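With illustrative parameters (again my own, not fitted to any real protein), the ΔCp term bends the stability curve into an inverted parabola that crosses zero twice — once on heating and once, below freezing, on cooling:

```python
import numpy as np

# Illustrative thermodynamic parameters for unfolding of a small protein,
# with Tr = 298 K as the reference temperature. Invented numbers.
Tr  = 298.0   # K
dH  = 50.0    # kJ/mol, unfolding enthalpy at Tr
dS  = 0.1     # kJ/mol/K, unfolding entropy at Tr
dCp = 5.0     # kJ/mol/K, heat-capacity change on unfolding (> 0)

def dG_unfold(T):
    """dG(T) = dH(Tr) + dCp*(T - Tr) - T*[dS(Tr) + dCp*ln(T/Tr)]"""
    return dH + dCp * (T - Tr) - T * (dS + dCp * np.log(T / Tr))

T = np.linspace(230, 360, 1000)
stable = dG_unfold(T) > 0  # folded state favored where dG_unfold > 0

# The stability window crosses zero twice: cold denaturation near ~244 K
# (below freezing, as is typical) and heat denaturation near ~342 K.
print(T[stable][0], T[stable][-1])
```

Note how the cold-denaturation midpoint lands below 273 K even for this modestly stable model protein, which is exactly why the phenomenon is so hard to study in ordinary aqueous samples.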

Unfortunately, for most proteins the temperature that favors unfolding lies below the freezing point of water, which makes this phenomenon difficult to study unless you do something unusual to your system. In 2004, Babu et al. (1) reported results from experiments that used reverse micelles to study the denaturation of ubiquitin at temperatures below freezing. By encapsulating a protein-water droplet in inverted micelles dissolved in pentane, it was possible to reduce the temperature to 243 K without causing freezing. These micelles also had the convenient property of tumbling quickly in the pentane, which allowed for reasonable NMR spectra even at these low temperatures. The appearance of the spectra they obtained indicated that the protein underwent a slow unfolding process with many different unfolded states, and also that the protein did not unfold in a cooperative fashion. Rather, it appeared that one contiguous region of the protein unfolded while the rest remained folded (the main helix was particularly stable).

This wasn’t expected, because ubiquitin apparently unfolds in a completely two-state manner when overheated. This being the case, what’s expected is for the protein to either be all folded or all unfolded, not some mixture of the two. However, cold does not affect all intramolecular contacts the same way. Lowering the temperature is expected to make hydrophobic interactions less favorable while not significantly affecting polar interactions like hydrogen bonds. This being the case, one might expect an α-helix to persist through a cold-denaturation transition, as happens in this case.

Something similar is observed in an upcoming paper in JACS from the Raleigh and Eliezer Labs (2), which approaches cold denaturation using a mutant form of the C-terminal domain of ribosomal protein L9. An isoleucine to alanine mutation at residue 98 of this domain doesn’t appear to significantly alter the structure, but it causes the protein to cold-denature somewhere in the high teens (°C). At 12 °C the unfolded state is about 80% of the visible population, and this is where Shan et al. performed their NMR experiments. They assigned the unfolded state using standard techniques and then decided to see what they could learn from the chemical shifts.

As I’ve mentioned before, the chemical shift of a nucleus depends on the probability distribution of the surrounding electrons, and therefore is sensitive to the strength, composition, and angles of the atom’s chemical bonds. Because the dihedral angles of the protein backbone are a good proxy for the secondary structure, one can use the chemical shifts of particular atoms to determine whether a given residue is in a helix or strand. When they performed this analysis, Shan et al. noticed two major differences between the native and cold-denatured states of the protein. The first was that the helix and strand propensities of the denatured protein were much lower than the folded form, as expected. In addition, however, they noticed that one loop of the protein had gained α-helical character. That is, it seemed like an α-helix had actually gotten longer as a result of the unfolding.
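The core of that analysis is a simple subtraction: the secondary chemical shift, Δδ = δ(observed) − δ(random coil). Here is a minimal sketch of the idea; the shift values and coil references below are invented for illustration, and real analyses use tabulated residue-specific random-coil shifts and purpose-built programs:

```python
# Hypothetical Ca chemical shifts (ppm) for a seven-residue stretch, with
# made-up random-coil reference values. All numbers are illustrative.
observed    = [58.1, 57.9, 58.4, 58.0, 54.2, 54.0, 53.8]
random_coil = [55.0, 55.1, 55.3, 55.0, 55.9, 55.8, 55.6]

# Secondary shift: deviation from the random-coil expectation.
secondary = [o - rc for o, rc in zip(observed, random_coil)]

# A run of positive Ca secondary shifts (~ +3 ppm) indicates helix;
# a run of negative shifts (~ -2 ppm) indicates strand.
for i, d in enumerate(secondary):
    label = "helix" if d > 0.7 else "strand" if d < -0.7 else "coil"
    print(i, round(d, 1), label)
```

In a fully folded protein these deviations are large and persistent; in a denatured state like the one studied here, smaller magnitudes signal structure that forms only part of the time.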

This doesn’t mean that denaturing the protein added secondary structure. The low values in the output from the algorithm Shan et al. used suggest that the secondary structure in this denatured state forms only transiently. However, the chemical shifts suggest, and other structural data appear to confirm, that this region of the protein has an increased propensity to inhabit a helical structure as a consequence of the unfolding.

These results emphasize the fact that the “unfolded state” isn’t as simple as it’s often described. Residual structure persists in unfolded states of many proteins, and unfolded ensembles of one protein generated through different means (heat, cold, pH, cosolutes) may not resemble each other. Unlike unfolding at high temperature, cold denaturation of ubiquitin appears to be non-cooperative. In both ubiquitin and L9, it appears that helices are robust to the unfolding process, persisting and even propagating as the protein denatures. While some of these features may be held in common between different kinds of denatured states, others may be unique to particular denaturation conditions. The lingering question is which of these unfolded ensembles best resembles the denatured state that exists under biological conditions, giving rise to misfolded states and their associated diseases.

(1) Babu, C., Hilser, V., & Wand, A. (2004). Direct access to the cooperative substructure of proteins and the protein ensemble via cold denaturation Nature Structural & Molecular Biology, 11 (4), 352-357 DOI: 10.1038/nsmb739

(2) Shan, B., McClendon, S., Rospigliosi, C., Eliezer, D., & Raleigh, D. (2010). The Cold Denatured State of the C-terminal Domain of Protein L9 Is Compact and Contains Both Native and Non-native Structure Journal of the American Chemical Society DOI: 10.1021/ja908104s

Aug 18, 2009
This post continues my series about selected articles from the dynamics-focused topical issue of JBNMR.

It is helpful, in examining some NMR articles, to understand that NMR spectroscopists have a long and resilient tradition of giving their pulse sequences silly names. You can think of it as the biophysical equivalent of fly geneticist behavior. From the basic COSY and NOESY experiments (pronounced “cozy” and “nosy”) to the INEPT spin-echo train, to more complicated pulse trains such as AMNESIA and DIPSI (which, I am not making this up, is used in an experiment sometimes called the HOHAHA), the field is just littered with ludicrous acronyms (look upon our words, ye mighty, and despair). A team from Josh Wand’s lab now joins this club by developing a multidimensional optimization of radially enhanced NMR-sampled hydrogen exchange (AMORE-HX). The name is ridiculous, but the experiment fills an important role and illustrates a very active area of technical development in NMR.

The experiment they developed is intended to measure the rate of hydrogen/deuterium exchange at amide groups on the backbone of the protein. This sort of exchange reaction proceeds pretty quickly for most residue types, and can be either acid- or base-catalyzed. For it to happen, however, two things must be true: the amide proton must not already be in a hydrogen bond, and the site of the reaction must be accessible to water. These requirements should indicate to you that HX measures the rate of local unfolding and can therefore be interpreted as a measure of fold stability at each NH group on the backbone. This data is of obvious interest to researchers studying protein folding. In addition, because some structural transitions are proposed to involve an unfolded state, this may have explanatory power for protein interactions and regulation.

A typical HX experiment involves taking your protein, switching it rapidly into >75% D2O buffer, then placing it in the magnet and taking a series of HSQC or HMQC spectra that separate signals from backbone NH groups by the proton and nitrogen chemical shift. These spectra can be taken with very high time resolution (<2 min each), and the rate of exchange can then be measured by the decay of peak intensity as hydrogen is replaced by deuterium. Assuming that the chemical step occurs significantly faster than the rate of local unfolding and refolding, this decay can be directly interpreted as a local unfolding rate. This works quite well, but as proteins get larger there is a significant likelihood of signal overlap. It would be nice, with these large proteins, to separate the hydrogen signals using an additional chemical shift — say, that of the adjacent carbonyl. Unfortunately, taking these decay curves using 3-dimensional spectra like the HNCO turns out to be impossible because of the way these experiments are collected. Multidimensional NMR spectra rely on a series of internal delays during which a coherence acquires the frequency characteristics of a particular nucleus. In a typical experiment, the delays are multiples of a set dwell time, the length of which is determined by the frequency range one wishes to examine. Typically the collection proceeds linearly through the array, so for m increments of dwell time y in one indirect dimension and n increments of dwell time z in the other, you would collect 1D spectra with the delays:

0,0 0,y 0,2y 0,3y … 0,my


z,0 z,y z,2y z,3y … z,my

and so on until

nz,0 nz,y nz,2y nz,3y … nz,my

This is called Cartesian sampling, and it has some advantages. The numerous data points typically do a good job of specifying resonance frequencies, and processing this data is a fairly straightforward proposition. The glaringly obvious disadvantage is time, of which a great deal is required. Completely sampling either one of these dimensions separately can take less than 30 minutes, but sampling both can push a triple-resonance experiment into the 60 hour range. Most annoyingly, because triple-resonance spectra can be really rather sparse, this extremely long experiment often over-specifies the resonance frequencies. That is, much of this time is spent collecting data you don’t need.
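A back-of-the-envelope comparison shows where the time goes. All the numbers here (increment counts, transients, recycle time) are invented for illustration, not taken from the paper:

```python
# Cost comparison between full Cartesian sampling and radial sampling of
# the two indirect dimensions. Illustrative numbers only.
m, n = 64, 64              # increments in the two indirect dimensions
sec_per_increment = 8 * 4  # e.g. 8 transients at ~4 s each

cartesian_points = m * n         # every (t1, t2) combination is sampled
radial_points = 4 * max(m, n)    # 4 angle "spokes", one diagonal each

print(cartesian_points * sec_per_increment / 3600)  # ~36 hours
print(radial_points * sec_per_increment / 3600)     # ~2.3 hours
```

The sixteen-fold saving in this toy example is what turns a multi-day triple-resonance experiment into something fast enough to follow exchange in real time.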

Because spectrometer availability and sample stability are not infinite, there is considerable interest in making this process more efficient. One of the methods for doing so is called radial sampling. In this approach, the spectrum is built up from a series of “diagonal” spectra that lie along a certain defined angle with respect to the two time domains (imagine the above array as a rectangle with sides of length m·y and n·z to get a rough idea of what this means). If these angles are judiciously chosen, the spectrum can then be rebuilt from just a few of them with only modest losses in resolution. Gledhill et al. apply this approach as a means of addressing their time-resolution problem. Guided by a selection algorithm, they use just four angles (at 500 MHz) to resolve more than 90% of the peaks possible in myelin basic protein. As a result, they were able to collect HNCO-based HX data with 15-minute resolution. This isn’t enough to catch the fastest-exchanging peaks, but it’s more than sufficient to catch core residues.

Gledhill et al. used some additional tricks to gain extra speed in the experiment, however. Using band-selective excitation, they cut down the experiment’s relaxation delay to 0.6 s, which is important because this delay is a considerable portion of the duration of each transient. Having done this, they started to get really clever. Because this experiment is being used to measure the intensities of known frequencies, it is possible to significantly reduce the amount of processing required by employing the 2D-FT only for those regions that contained actual peak intensity. Moreover, they could extract peak intensities from each individual angle plane. Because they did not interleave the collection, this enabled them to substantially increase the time-resolution when necessary.

For peaks that exchanged quickly Gledhill et al. took relaxation data from the individual angle spectra, to maximize the time-sensitivity of the data. For slowly-exchanging peaks, they averaged the data from the angle spectra to maximize the signal-to-noise ratio. The resulting intensity curve seems a bit noisy, but this is an acceptable price to access new peaks. More importantly, the precision of the overall rate (as opposed to the instantaneous intensity) appears to be on par with simpler methods of measuring HX.
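The rate extraction itself is just an exponential fit to the peak-intensity decay. Here is a minimal sketch on synthetic data; the rate, noise level, and time points are invented, with one point per 15-minute angle spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic HX decay: peak intensity falling as amide hydrogens exchange
# for deuterium. Illustrative numbers, not data from the paper.
k_true = 0.05                 # exchange rate, 1/min
t = np.arange(0, 120, 15.0)   # one point per 15-minute angle spectrum
I = np.exp(-k_true * t) * (1 + 0.05 * rng.standard_normal(t.size))

# Log-linear least-squares fit: ln I = ln I0 - k*t.
slope, intercept = np.polyfit(t, np.log(I), 1)
k_fit = -slope
print(round(k_fit, 3))
```

Even with 5% intensity noise (roughly the penalty for using single angle planes), the fitted rate lands close to the true value, which matches the authors’ observation that the overall rate is far more precise than any individual intensity.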

Successful use of the AMORE-HX experiment will depend on a wise selection of acquisition angles, a process that may benefit from further optimization. Because the HNCO has relatively good dispersion, the pulse sequence should enable HX measurements for just about any protein that is suitable for NMR. This would allow for a direct assessment of large enzymes and complexes, as well as a measurement of local stabilities in domain-domain interfaces.

Gledhill, J., Walters, B., & Wand, A. (2009). AMORE-HX: a multidimensional optimization of radial enhanced NMR-sampled hydrogen exchange Journal of Biomolecular NMR DOI: 10.1007/s10858-009-9357-4

Mar 24, 2009
The ribosome produces proteins by matching tRNA that has been correctly loaded with an amino acid to a codon (a triplet of nucleotide bases) in the mRNA that contains the gene sequence. The triplet code allows 64 combinations of nucleotide bases, but proteins are made from only 20 amino acids (plus a “stop” signal). This means that most amino acids are coded by multiple codons, and hence have multiple tRNAs. Not all codons are created equal, however; in bacteria some codons are found much less frequently than others that represent the same amino acid. The tRNA associated with these “rare codons” is also less abundant than other tRNA, and this means that when a ribosome hits a rare codon, it often has to pause while it waits to encounter a loaded tRNA. To structural biologists like myself, who do their work by overexpressing proteins in bacteria, rare codons can be a nuisance because they slow down protein production, or even prevent it entirely. In a recent paper in Nature Structural & Molecular Biology, however, researchers from Germany suggest that the slowdown due to rare codons may have a functional advantage in vivo.

As a first step, Zhang et al. used a bioinformatics approach to survey the sequences of bacterial genes so that they could identify patches that would be slow to translate (the Methods section appears to contain an error in the description of this technique). They found that for proteins longer than about 300 amino acid residues, nearly every transcript contained at least one cluster of slow-translating codons. When the authors used a cell-free E. coli expression system to make some of these proteins and allowed only one round of translation initiation per ribosome, they saw a pattern of translation intermediates that matched the sizes predicted by the location of slow-translating patches.
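The kind of scan involved can be sketched in a few lines: score each codon by its usage frequency and flag windows whose mean frequency falls below a cutoff. The usage table, window size, and cutoff below are toy values I made up, not real E. coli codon-usage data or the authors’ actual method:

```python
# Toy sliding-window scan for slow-translating (rare-codon) patches.
# The frequency table is a tiny invented subset, for illustration only.
usage = {
    "CTG": 0.50, "CTT": 0.10, "CTA": 0.04,   # leucine codons
    "CGT": 0.38, "CGA": 0.06,                # arginine codons
    "AAA": 0.74, "GAA": 0.69,                # lysine, glutamate
}

def slow_patches(codons, window=3, cutoff=0.15):
    """Return start indices of windows with low average codon usage."""
    hits = []
    for i in range(len(codons) - window + 1):
        mean = sum(usage[c] for c in codons[i:i + window]) / window
        if mean < cutoff:
            hits.append(i)
    return hits

seq = ["CTG", "AAA", "CTA", "CGA", "CTT", "GAA", "CTG"]
print(slow_patches(seq))  # → [2]: the CTA-CGA-CTT stretch is a slow patch
```

A run of rare codons, not any single one, is what produces a ribosomal pause long enough to matter, which is why the scan averages over a window.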

In order to find out whether these translation intermediates had any significance, the authors examined the multi-domain protein SufI. In their prediction of the translation speed, which is on top in this figure that I have shamelessly stolen, there are four slow spots. Aside from the first one, these appear to correspond to the boundaries of different structural domains in the protein (lower part of the figure). Experiments with proteases suggested that these domains actually folded during the pauses, as the ribosome-bound translation intermediates were resistant to proteolysis.

Interestingly, when two rare leucine codons were replaced by more common ones (the authors call this SufIΔ25-28), the whole protein became vulnerable to degradation. Similarly, when extra tRNA for these rare codons was added to the cell-free expression system, the full-length protein became protease-sensitive. This suggests that the slow patches are actually necessary for proper folding of the protein. It’s often the case that lowering the incubation temperature can improve the expression of certain proteins in E. coli. The authors of this study find that is also true for SufI, as the protease resistance of SufIΔ25-28 can be restored by lowering the temperature, and thus the overall translation rate. When analogous experiments with SufIΔ25-28 and tRNA supplementation were carried out in living E. coli, the translocation of SufI into the periplasmic space was reduced by a factor of 10 even though the overall protein concentration was not affected, indicating that the co-translational folding allowed by the rare codons is necessary for proper functioning of the protein in vivo.

Of course this is a single case study, and it would be premature to conclude that every patch of rare codons corresponds to an important co-translational folding event. Indeed, that doesn’t even appear to be true of SufI, which folds properly when one of its other slow patches is removed. However, at certain key locations these stretches of rare codons may be an important part of the folding machinery in multidomain proteins. In addition, the more frequent appearance of rare codons in β-strands (as opposed to α-helices) may also be related to folding due to the slower kinetics of β-sheet formation. As the authors note, the intrinsic kinetics aren’t everything — pauses in the translation process may also buy time for the complex to encounter essential chaperones or cofactors. Regardless of the mechanism, it appears that rare codons, in at least some instances, provide a way for the folding process to catch up with the translation process.

Zhang, G., Hubalewska, M., & Ignatova, Z. (2009). Transient ribosomal attenuation coordinates protein synthesis and co-translational folding Nature Structural & Molecular Biology, 16 (3), 274-280 DOI: 10.1038/nsmb.1554
