AlphaFold & The Structure of Proteins
Most people only encounter protein in the nutritional sense, as a major component of one's diet alongside carbohydrates and fats. A so-called "macronutrient", protein is broken down during digestion, providing the same amount of energy as carbohydrates (4 kcal per gram) and the essential building blocks, called amino acids, required for new protein synthesis.
Depending on one's dietary needs and consumption philosophies, protein can be obtained from many sources, though most people get the bulk of their protein intake from animal products. While dairy products and meat provide the most dense sources of protein, vegetarians and vegans can absolutely obtain enough protein through plant-based sources.
Most organisms use combinations of the twenty amino acids shown above to synthesize the many proteins necessary for life. In humans, nine of the above amino acids (those with the dotted circles) cannot be made in the body and must be obtained through food, thus earning the title of "essential amino acids". For this reason, an entire supplement industry has arisen around providing these necessary nutrients in pill form (see the following image).
The idea of essential amino acids was a major plot line in Michael Crichton's Jurassic Park, where dinosaurs were genetically modified to prevent internal lysine synthesis. The dinosaurs would die if not given enough lysine in their food, which would theoretically prevent their escape and spread from the island. While it seems a wild, sci-fi plot, amino acid deficiency is a real threat if the pathway to synthesize an amino acid is disrupted or one of the essential amino acids is not ingested at the required level. In humans, an amino acid deficiency can cause a wide range of symptoms and health impacts because amino acids are the basic building materials for critical cellular proteins.
So what's the point?
The easiest analogy for the role of amino acids is that of the alphabet. Amino acids are the letters used by the body to create the words, sentences, and paragraphs necessary to keep a body functioning and healthy. Just as typing a novel (or defending one's job) without the letter E would be nearly impossible, creating the many proteins needed for life would be severely impacted without the full complement of amino acids.
This brings us to today's main topic, AlphaFold, and the ability to use the processing power of computers to accurately predict the structure of proteins from their amino acid sequence.
This review is based on the paper "Highly accurate protein structure prediction with AlphaFold" and the information available at alphafold.com.
First, a brief overview of the complicated structure of proteins.
Primary Protein Structure: While the sequence of amino acids is vitally important to the structure and function of a given protein, this is only the first level of determining a protein's structure. Proteins exist in complex 3D structures based on the interactions of each amino acid and their place in the sequence.
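To make the alphabet analogy concrete, here is a minimal Python sketch that treats a protein's primary structure as a string of one-letter amino acid codes. The peptide used is hypothetical and invented for this example, and the function names (summarize_sequence, possible_sequences) are my own rather than part of any library.

```python
from collections import Counter

# The 20 standard amino acids, written as one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# The nine amino acids humans must obtain from food:
# His, Ile, Leu, Lys, Met, Phe, Thr, Trp, Val.
ESSENTIAL_AMINO_ACIDS = set("HILKMFTWV")


def summarize_sequence(sequence: str) -> dict:
    """Count residues and report the fraction that are essential amino acids."""
    sequence = sequence.upper()
    unknown = set(sequence) - STANDARD_AMINO_ACIDS
    if unknown:
        raise ValueError(f"Non-standard residue codes: {sorted(unknown)}")
    counts = Counter(sequence)
    essential = sum(n for aa, n in counts.items() if aa in ESSENTIAL_AMINO_ACIDS)
    return {
        "length": len(sequence),
        "counts": counts,
        "essential_fraction": essential / len(sequence) if sequence else 0.0,
    }


def possible_sequences(length: int) -> int:
    """Number of distinct chains of a given length built from 20 letters."""
    return len(STANDARD_AMINO_ACIDS) ** length


# Hypothetical peptide, invented purely for illustration.
peptide = "MKTLLVAGW"
summary = summarize_sequence(peptide)
print(summary["length"], summary["essential_fraction"])
print(possible_sequences(len(peptide)))
```

Even for a chain this short, the number of possible sequences is 20 raised to the power of its length, which hints at why the sequence alone is only the first level of a protein's structure.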
Secondary Protein Structure: The process of transforming a chain of amino acids into a fully-formed protein is known as folding. In this first step, stretches of the chain fold locally, most commonly into one of two structures: an alpha-helix or a beta-pleated sheet.
Tertiary Protein Structure: After the local folding of the secondary step, the tertiary step involves the full 3D folding of the entire amino acid chain. This is influenced by the side chains of each amino acid, their location on the chain, and the secondary alpha-helices and beta-pleated sheets. Due to the different interactions that can occur, two proteins with similar amino acid chains can have very different structures. Types of interactions that drive protein folding include: charge, affinity for water (i.e. hydrophobic or hydrophilic), disulfide bonds between the sulfur atoms of cysteine residues, and both ionic and hydrogen bonds.
Quaternary Protein Structure: Many proteins don't contain a single amino acid chain but are actually made up of multiple chains. The interactions between these individual chains and the resultant structure is the final level of folding before a protein is functional.
As you can see, this process is extremely complex and is happening in every cell of every organism at all times. In addition, the number of proteins in each organism is staggering (with the number in humans alone estimated to be as high as two million) and they vary widely in length (ranging from a couple dozen amino acids to tens of thousands in the largest known proteins).
Because protein folding is so complex and amino acid sequences vary so much in composition and length, guessing the structure of a protein and its purpose is extremely challenging. Scientists have overcome this by using crystallized protein and x-rays in conjunction with computer processing power to deduce protein structure. This method is very time-consuming and not practical for all applications. In brief, the process is as follows:
1. Scientists generate or collect enough protein for a useable sample
2. This sample is subjected to various methods to attempt to generate a crystal (each protein is vastly different and this step is more of an art form than a regimented scientific endeavor)
3. The sample is loaded onto the equipment and subjected to bombardment by x-rays
4. The x-rays scatter off the crystallized protein and the resulting diffraction pattern is collected by a computer
5. Specialized software then analyzes the data to generate a structural representation of the protein
Unfortunately, this method requires researchers to work backwards: first identifying a biologically active and interesting protein and then decoding its structure. Given all the difficulties with predicting the structure of proteins from their amino acid sequences, one would guess that identifying the potential tasks or purposes of proteins would also be challenging. However, this isn't quite true. Most proteins fall into groupings with similarly structured proteins that serve similar functions, either within a single cell or organism or across species, so identifying structure would allow researchers to group novel proteins together.
For example, at a previous job I led an experimental group looking to find potential pore-forming proteins that could out-compete our current best candidate. Because our purpose was unique and the usage of the protein was not something being done in many laboratories, I could not simply scan papers for other proteins that had the same function as the one I was currently using. Instead, I needed to look for proteins that had a similar structure. This led to an exhaustive review of protein structure databases and the selection of nearly a dozen potential replacements. None ended up out-competing our initial protein, but they did all function as expected.
Having the capacity to generate structural data on many proteins via computer modeling could reduce the time to decode structure, allow for analysis of proteins that are difficult to work with (for example, those that refuse to crystallize), and give scientists a method to scan proteins without having to work backwards. As of now, despite decades of research, there may be billions of proteins with unknown structure and function that could benefit humankind.
So how does DeepMind intend to solve this problem using their AlphaFold technique? Well, first we need to take a detour to understand that modern scientific research is different from that of previous generations. A lot of work is outsourced to labs or companies with unique skills or equipment in an effort to improve accuracy or turnaround time, or because of other in-house constraints. For this reason, resources like protein structures found by AlphaFold are critical to a lab's success, the quality of its output, and the health of the industry. Gone are the days when a single lab could work on decoding a single protein for months or years and then figure out its function. This work needs to be sped up and produce results that are impactful and repeatable. Utilizing an AlphaFold repository removes the discovery step and allows a lab to start with a number of candidate proteins for their experiment.
To develop AlphaFold, DeepMind identified two major pathways to protein structure deduction through computational methods: evolutionary history and physical interactions. Evolutionary history attempts to figure out structure based on similar proteins with similar sequences, while physical interactions looks at the types of interactions shown in an earlier image above to determine structure. Both methods have had success but are slow and often still outperformed by the x-ray diffraction method listed above.
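As a rough illustration of what "similar sequences" means in the evolutionary history approach, the sketch below computes the percent identity between two sequences that are assumed to be already aligned, with "-" marking gaps. This is a toy measure of my own choosing and not AlphaFold's method. Real pipelines rely on dedicated alignment tools and many related sequences at once, and both example sequences here are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, non-gap positions where both sequences match.

    Assumes the two sequences were aligned beforehand and therefore
    have equal length; '-' marks a gap and is skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # a gap in either sequence is not a comparable position
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments, invented for this example.
known_protein = "MKTLLVAG"
novel_protein = "MKTLIVAG"
print(percent_identity(known_protein, novel_protein))
```

A high identity to a protein of known structure suggests, but does not guarantee, a similar fold, which is one reason sequence similarity alone has not been enough to solve the problem.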
AlphaFold is a neural network trained to incorporate both the evolutionary history and physical interaction models to generate highly accurate 3D models of proteins in a much shorter time frame. While the underlying science for the model is intricate and beyond my scope to explain in this blog, the essence is that the neural network takes amino acid chains of unknown proteins and compares them to both similar proteins already in the database and locations in proteins with similar sequences to determine structure.
In the above image from the AlphaFold paper linked previously in this post, we can see how effective their predictions are. Each of the above images is a protein or segment of a protein and shows the experimentally obtained structure in green and the structure predicted by AlphaFold in blue. In each case, the structure is very similar though not perfectly identical. In each case, however, the predictions from AlphaFold were superior to other computational methods on the market and substantially faster than obtaining the same data through experimental means. You may also notice something called the "r.m.s.d.95" score above. This is the root mean square deviation of the backbone atoms calculated at 95% residue coverage, meaning over the 95% of amino acid residues that fit best. It represents how far the predicted model deviates, along the backbone, from the experimentally obtained structure. For example, in image A the score is 0.8 Å, meaning the predicted backbone sits roughly 0.8 angstroms away from the experimental one on average. (An angstrom is 1 x 10^-7 of a millimeter, an extremely small distance.)
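For readers who want to see the arithmetic behind a score like this, here is a simplified Python sketch of a root mean square deviation with a coverage cutoff. It assumes the two structures have already been superimposed and that each residue is represented by a single backbone coordinate. The paper's actual procedure also involves aligning the structures, which is not shown here, and the function name and example coordinates are hypothetical.

```python
import math


def rmsd(predicted, experimental, coverage=1.0):
    """Root mean square deviation between matched 3D points, in input units.

    Assumes both structures are already superimposed. With coverage < 1,
    only that fraction of residues with the smallest deviation is kept,
    e.g. coverage=0.95 ignores the worst-fitting 5%.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("Need two non-empty coordinate lists of equal length")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in the interval (0, 1]")
    # Squared distance between each predicted and experimental point.
    squared = sorted(
        sum((p - e) ** 2 for p, e in zip(point_p, point_e))
        for point_p, point_e in zip(predicted, experimental)
    )
    keep = max(1, round(coverage * len(squared)))
    return math.sqrt(sum(squared[:keep]) / keep)


# Hypothetical coordinates in angstroms: every predicted atom is
# offset by 0.8 angstroms from its experimental counterpart.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.8), (3.8, 0.0, 0.8), (7.6, 0.0, 0.8)]
print(rmsd(predicted, experimental))
```

Because the deviations are squared before averaging, a few badly placed residues can dominate the score, which is why a coverage cutoff such as 95% is useful.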
So what does this all mean? AlphaFold is a neural network that has been developed to quickly and accurately determine protein structure using a predictive modeling method that 'learns' over time based on new experimental data. This method allows for rapid deduction of protein structure and provides a repository for researchers to search for interesting or similar proteins necessary for their research. This process allows for a wider range of proteins to be available to and testable by researchers, potentially leading to novel medical treatments or research applications. This is accomplished without the time and resource investment often associated with protein identification and structural analysis. In the end, harnessing the power of computers to reduce the workload and improve the productivity of labs is a boon for all humankind.