One of the goals of computational biology is to predict the complete high-order structure of a protein from its amino acid sequence. Often reasonably good structures can be produced by modeling a new protein according to an already-known structure of a homologous protein, one with a similar sequence and presumably a similar structure. However, these structures can be inaccurate, and obviously this method will not work if no homologous structure is known.
Foldit is an online game developed by the research team of Dr. David Baker that attempts to address this problem by combining an automated structure prediction program called ROSETTA with input from human players who manually remodel structures to improve them. Even though most of the players have little or no advanced biochemical knowledge, Foldit has already had some striking results improving on computational models. An upcoming paper in Nature Structural & Molecular Biology (1) (PDF also available directly from the Baker lab) details some interesting new successes from the Foldit players.
Contrary to some reports, the Foldit players did not solve any mystery directly related to HIV, although their work may prove helpful in developing new drugs for AIDS. What the Foldit players actually did was to outperform many protein structure prediction algorithms in the CASP9 contest, and to play a key role in helping solve the structure of an unusual protease from a simian retrovirus.
If you don’t recognize Mason-Pfizer Monkey Virus (M-PMV) as a cause of AIDS in humans, that’s because it isn’t. It causes acquired immune deficiency in macaques, however, and it has an unusual protease that may tell us useful things.
Retroviruses like HIV often produce proteins in a fused form rather than as individual folded units. In order to be functional, the various proteins must be snipped out of these long polyprotein strands, so the virus includes a protease (protein-cutting enzyme) to do this. In most retroviruses, this protease is dimeric: it is composed of two protein molecules with identical sequences and similar, symmetric structures. The long-known structure of HIV protease, seen on the right (learn more about HIV protease or explore this structure at the Protein Data Bank) is an example of this architecture.
People infected with HIV often take protease inhibitors to interfere with viral replication. These drugs attack the active site, where the chemical reaction that cuts the protein strand takes place, but it has been theorized that viral proteases could also be attacked by splitting up the dimers into single proteins, or monomers. The problem is, the free monomer structures aren’t known.
This is where the M-PMV protease comes in. Although it is homologous to the dimeric proteases, M-PMV protease is a monomer in the absence of its cutting target. If we knew this protein’s structure, we could perhaps design drugs that would stabilize other proteases in their monomer form, rendering them inactive. An attempt to determine the structure using nuclear magnetic resonance (NMR) data produced models that seemed poorly folded and had bad ROSETTA energy scores. And, although the protein formed crystals, X-ray crystallography could not solve its structure either, despite a decade of effort.
The reason for this has to do with how X-ray crystallography works. If you fire a beam of X-rays at a crystal of a protein, some of the rays will be deflected by electrons within it and you will observe a pattern of diffracted dots similar to the one at left, kindly provided by my colleague Young-Jin Cho. The intensities and locations of these dots depend on the structure and arrangement of the molecules within the crystal. X-ray crystallographers can use the diffraction patterns to calculate the electron density of the protein and fit the molecular bonds into it (below, also courtesy of Young-Jin). However, the electron density cannot be calculated from the diffraction pattern unless the phases of the diffracted X-rays are also known. Unfortunately, the detector records only the intensities of the dots, so the phases cannot be calculated from them directly.
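To get a feel for why the missing phases matter so much, here is a minimal sketch in Python. It is a toy, not real crystallography: the one-dimensional "crystal", its three atoms, and the helper names (dft, inverse_dft, max_error) are all assumptions made up for illustration. The density is Fourier-transformed into structure factors, and then rebuilt twice, once with the true phases and once with the phases thrown away.

```python
import cmath
import math

def dft(values):
    """Discrete Fourier transform: density samples -> complex structure factors."""
    n = len(values)
    return [
        sum(values[x] * cmath.exp(-2j * math.pi * h * x / n) for x in range(n))
        for h in range(n)
    ]

def inverse_dft(coeffs):
    """Inverse transform: complex structure factors -> real density samples."""
    n = len(coeffs)
    return [
        sum(coeffs[h] * cmath.exp(2j * math.pi * h * x / n) for h in range(n)).real / n
        for x in range(n)
    ]

def max_error(a, b):
    """Largest pointwise difference between two density maps."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical 1D "crystal": three atoms in a unit cell of 16 grid points.
density = [0.0] * 16
density[3], density[4], density[10] = 8.0, 2.0, 6.0

structure_factors = dft(density)
amplitudes = [abs(f) for f in structure_factors]      # recoverable from dot intensities
phases = [cmath.phase(f) for f in structure_factors]  # not recorded by the detector

# Rebuild the density with the true phases, then with every phase set to zero.
with_phases = inverse_dft([cmath.rect(a, p) for a, p in zip(amplitudes, phases)])
without_phases = inverse_dft([cmath.rect(a, 0.0) for a in amplitudes])

print("error with true phases:", max_error(with_phases, density))
print("error with phases set to zero:", max_error(without_phases, density))
```

With the true phases the original density comes back essentially exactly. With the phases zeroed, the same amplitudes produce a map whose peaks sit in the wrong places, which is the situation a crystallographer faces with nothing but the dots.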
There are many ways to solve this problem, but not all of them work in every system. One widely applicable approach is called “molecular replacement”. In this method, a protein with a structure similar to that of the one being studied is used to guess the phases. If this guess is close enough, the structure factors can be refined from there. In the case of M-PMV protease, however, the dimeric homologues could not be used for replacement, and an attempt to use the NMR structure to calculate the phases also failed.
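The idea behind molecular replacement can be sketched with a toy model in Python. Everything here is an illustrative assumption: the one-dimensional unit cell, the atom positions and weights, and the helper names (cell, replacement_map, correlation). Real molecular replacement also has to search for the orientation and position of the model, which this sketch skips. The measured amplitudes come from the "true" structure, the phases are borrowed from a search model, and the resulting map is compared with the truth.

```python
import cmath
import math

def dft(values):
    """Discrete Fourier transform: density samples -> complex structure factors."""
    n = len(values)
    return [
        sum(values[x] * cmath.exp(-2j * math.pi * h * x / n) for x in range(n))
        for h in range(n)
    ]

def inverse_dft(coeffs):
    """Inverse transform: complex structure factors -> real density samples."""
    n = len(coeffs)
    return [
        sum(coeffs[h] * cmath.exp(2j * math.pi * h * x / n) for h in range(n)).real / n
        for x in range(n)
    ]

def cell(atoms, size=16):
    """Build a toy 1D unit cell from a {grid position: atom weight} dict."""
    density = [0.0] * size
    for position, weight in atoms.items():
        density[position] = weight
    return density

def correlation(a, b):
    """Pearson correlation between two density maps."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical structures: the unknown truth and two candidate search models.
true_density = cell({3: 8.0, 4: 2.0, 10: 6.0})
close_model = cell({3: 7.0, 10: 6.0})            # nearly right, one small atom missing
poor_model = cell({11: 8.0, 12: 2.0, 2: 6.0})    # same atoms, misplaced by half a cell

# The experiment gives amplitudes only.
measured_amplitudes = [abs(f) for f in dft(true_density)]

def replacement_map(model):
    """Combine the measured amplitudes with phases guessed from a search model."""
    guessed_phases = [cmath.phase(f) for f in dft(model)]
    return inverse_dft(
        [cmath.rect(a, p) for a, p in zip(measured_amplitudes, guessed_phases)]
    )

corr_close = correlation(replacement_map(close_model), true_density)
corr_poor = correlation(replacement_map(poor_model), true_density)
print("map correlation, close model:", corr_close)
print("map correlation, poor model:", corr_poor)
```

In this toy, the close model yields a map that correlates strongly with the true density, while the misplaced model yields one that does not. That is the sense in which a search model has to be "close enough" before refinement can take over.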
Then the Foldit players went to work. Starting from the NMR structure, Foldit players made a variety of refinements. A player called spvincent made some improvements using the new alignment tool, which a player called grabhorn improved further by rearranging the amino acid side chains in the core of the molecule. A player named mimi added the final touch by rearranging a critical loop.
Going from mimi’s structure (several others also proved suitable), the crystallographers were able to solve the phase problem by molecular replacement and finally determine the protease’s structure. None of the Foldit results were exactly right, so it’s inaccurate to say that the players solved the structure. However, their models were very close to the right answer, and provided the critical data that allowed the crystal structure to be solved. Once the paper is published, you’ll be able to find that structure at the PDB under the accession code 3SQF.
We can’t know right now whether this structure will enable the design of new drugs, but the Foldit players were the key to giving us a better chance of using it for this purpose. What may be even more exciting is the possibility that Foldit could be used in other structural studies to come up with improved starting models for molecular replacement. As with any method of predicting protein structures, however, the gold standard is CASP, so the Foldit teams participated in CASP9.
The Critical Assessment of protein Structure Prediction (CASP) is a long-running biennial test of computer algorithms that calculate a protein’s structure from its sequence. This experiment in prediction has a fairly simple setup.
1) Structural biologists give unpublished structures to the CASP organizers.
2) The sequences belonging to these structures are given to computational biologists.
3) After a set period, the computational predictions are compared to the known structural results.
The Baker group generated starting structures using ROSETTA, then handed the five lowest-energy results off to the Foldit players. For proteins that had known homologues, the results were disappointing. Foldit players did well, but they overused Foldit’s ROSETTA-based minimization routine, which tended to distort conserved loops.
The nature of this problem became even more clear when the Baker group handed the Foldit players ROSETTA results for proteins that had no known homologues. In that case they noticed that players were using the minimization routine to “tunnel” to nearby, incorrect minima. You can get a feel for what that means by looking at the figure to the left.
In this energy landscape diagram, the blue line represents every possible structure of a pretend protein laid out in a line, with similar structures near each other and the higher-energy (worse) structures placed higher on the Y axis. From a relatively high-energy initial structure, Foldit players tended to use minimization to draw it ever-downward towards the nearest minimum-energy structure (red arrow). Overuse of the computer algorithm discouraged them from pulling the structure past a disfavored state that would then start to collapse towards the true, global minimum energy (green arrow).
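The same picture can be sketched in a few lines of Python. This is a toy under stated assumptions: the one-dimensional energy function, the step size, and the "pulled" starting point are invented for illustration and have nothing to do with ROSETTA’s actual energy function or minimizer. Plain downhill minimization stands in for the red arrow, and a deliberate move to a worse state on the far side of the barrier stands in for the green one.

```python
def energy(x):
    """Toy energy landscape with a local minimum near x = 1.35
    and a deeper, global minimum near x = -1.47."""
    return x**4 - 4 * x**2 + x

def gradient(x):
    """Derivative of the toy energy."""
    return 4 * x**3 - 8 * x + 1

def minimize(x, step=0.01, iterations=5000):
    """Plain gradient descent: always moves downhill, never over a barrier."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x

start = 2.0                # relatively high-energy initial structure
stuck = minimize(start)    # red arrow: slides into the nearest minimum

pulled = -0.2              # hypothetical manual move past the barrier near x = 0.13
escaped = minimize(pulled) # green arrow: now collapses towards the global minimum

print("minimization alone:", stuck, energy(stuck))
print("after the pull (worse at first):", pulled, energy(pulled))
print("pull, then minimization:", escaped, energy(escaped))
```

The pulled state has a higher energy than the stuck one, so a player chasing the score would see the move as a step backwards. Only after minimizing again does it pay off with a lower final energy.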
The Foldit players still had some successes: for instance, they were able to recognize one structure ROSETTA didn’t like very much as a near-native structure. The Void Crushers team successfully optimized this structure, producing the best score for that particular target, and one of the highest scores of the CASP test. If the initial ROSETTA structures had too low a starting energy, though, the players wouldn’t perturb them enough to get over humps in the landscape.
Thus, Baker’s group tried a new strategy. Taking the parts of one structure that they knew (from the CASP organizers) to be correct, they aligned the sequence with those parts and then took a hammer to the rest, pushing loops and structural elements out of alignment. This encouraged the players to be more daring in their remodeling of regions where the predictions had been poor, while preserving the good features of the structure. Again, the Void Crushers won special mention, producing the best-scoring structure of target TR624 in the whole competition.
Man over machine?
Does this prove that gamers know more about folding proteins than computers do? Some of them might, but Foldit doesn’t really use human expertise. Rather, the game uses human intelligence to identify when the ROSETTA program has gone down the wrong path and figure out how to push it over the hump. When the human intelligences aren’t daring enough, or trust the system too much, as in the case of the CASP results, Foldit doesn’t do any better than completely automated structural methods. When the human players are encouraged to challenge the computational results, however, the results can be striking. As Baker’s group are clearly aware, further development of the program needs to be oriented towards encouraging players to go further afield from the initial ROSETTA predictions. This will likely mean many more failed attempts by players, but also more significant successes like these.
Disclaimer: I am currently collaborating with David Baker’s group on a research project involving ROSETTA (but not Foldit).
1) Khatib, F., DiMaio, F., Cooper, S., Kazmierczyk, M., Gilski, M., Krzywda, S., Zabranska, H., Pichova, I., Thompson, J., Popović, Z., Jaskolski, M., & Baker, D. (2011). Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology DOI: 10.1038/nsmb.2119