[Articles] Highly accurate protein structure prediction with AlphaFold

About this Article

Authors: John Jumper, Richard Evans, Alexander Pritzel et al.
Journal: Nature
Year: 2021
Official Citation: Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021).

AlphaFold addresses the grand challenge of protein structure prediction using a deep neural network.

AlphaFold combines three themes to predict a protein structure.
- Neural network architectures
- Training procedures based on evolutionary information
- Physical and geometric constraints

There are seven key technologies in AlphaFold.
- Jointly embed Multiple Sequence Alignment (MSA, information about evolution) and pairwise features (information about amino acid pairs).
- A new output representation and associated loss.
- A new equivariant attention architecture (equivariant: a transformation of the input, e.g. a rotation, is applied to the output in the same way, without additional computation).
- Intermediate loss to achieve iterative refinement.
- Masked MSA loss.
- Learning from unlabelled protein sequences.
- Self-estimation of accuracy.
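
The equivariance property mentioned in the list above can be checked numerically on a toy function. The sketch below is not AlphaFold's attention layer; `offsets_from_centroid` is a hypothetical stand-in chosen only because it is simple and rotation-equivariant. It verifies that rotating the input and then applying the function gives the same result as applying the function and then rotating the output.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def offsets_from_centroid(points):
    """Toy 'layer' (illustrative only): displacement of each point from the centroid."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return [tuple(p[i] - centroid[i] for i in range(3)) for p in points]

def is_rotation_equivariant(points, angle, tol=1e-9):
    """Check f(R x) == R f(x) for the toy layer, up to floating-point tolerance."""
    rotate_then_apply = offsets_from_centroid([rotate_z(p, angle) for p in points])
    apply_then_rotate = [rotate_z(v, angle) for v in offsets_from_centroid(points)]
    return all(
        math.isclose(a, b, abs_tol=tol)
        for u, v in zip(rotate_then_apply, apply_then_rotate)
        for a, b in zip(u, v)
    )

points = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0), (-1.5, 0.5, 3.0)]
print(is_rotation_equivariant(points, math.pi / 3))  # True
```

An equivariant layer needs no data augmentation over rotations, because the correct behaviour under rotation is built into the function itself.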

The following figure shows the full architecture of AlphaFold. AlphaFold consists of two parts: Evoformer → Structure Module.

The Evoformer combines the MSA and pair representations to reason about spatial and evolutionary relationships.
Steps: The following three steps are applied repeatedly rather than once.
- (1) Update the pair representation using MSA information (element-wise outer product, summed over the MSA sequence dimension).
- (2) Within the pair representation, update each edge using triangles of edges that involve three different nodes. Triangles make it possible to apply constraints (e.g. the triangle inequality) to infer unknown information about an edge.
- (3) Feed the updated pair representation back into the MSA representation.
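
Steps (1) and (2) can be illustrated with a small pure-Python sketch. It rests on several assumptions: the learned linear projections that surround this operation in the real network are omitted, the outer product is averaged over sequences (a sum differs only by a constant factor), and `triangle_bounds` only shows the geometric intuition behind triangle reasoning, not the triangle update layers themselves. The names `outer_product_mean` and `triangle_bounds` are illustrative.

```python
def outer_product_mean(msa):
    """Build a pair representation from an MSA representation.

    msa[s][i] is a list of c features for sequence s at residue i.
    Returns pair[i][j], a c-by-c matrix: the outer product of the features
    at residues i and j, averaged over all sequences.
    """
    n_seq, n_res, c = len(msa), len(msa[0]), len(msa[0][0])
    pair = [[[[0.0] * c for _ in range(c)] for _ in range(n_res)] for _ in range(n_res)]
    for seq in msa:
        for i in range(n_res):
            for j in range(n_res):
                for a in range(c):
                    for b in range(c):
                        pair[i][j][a][b] += seq[i][a] * seq[j][b] / n_seq
    return pair

def triangle_bounds(d_ik, d_kj):
    """Range of the distance d_ij allowed by the triangle inequality,
    given the two other edges of the triangle (i, k, j)."""
    return abs(d_ik - d_kj), d_ik + d_kj

# Two sequences, two residues, two features per position.
msa = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
pair = outer_product_mean(msa)
print(pair[0][1])                 # [[0.5, 0.5], [0.0, 0.0]]
print(triangle_bounds(3.0, 5.0))  # (2.0, 8.0)
```

The pair entry for residues (0, 1) records how often feature combinations co-occur across sequences, which is how co-evolution signal from the MSA reaches the pair representation.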

(a): The y-axis shows the error between the experimental and predicted structures (median Cα r.m.s.d.95). AlphaFold shows the lowest error.
(b): Structure comparison between the AlphaFold prediction (blue) and the experimental structure (green).
(c): Side-chain comparison between the AlphaFold prediction (blue) and the experimental structure (green).
(d): Large protein structure comparison between the AlphaFold prediction (blue) and the experimental structure (green).
(e): Model architecture of AlphaFold.
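
The error metric in panel (a), Cα r.m.s.d. at 95% residue coverage, can be approximated with the sketch below. It is a simplification under stated assumptions: the two structures are assumed to be already superposed, whereas a full evaluation would also fit the superposition, and the function name `ca_rmsd` and its `coverage` parameter are illustrative.

```python
import math

def ca_rmsd(pred, ref, coverage=1.0):
    """RMSD over the best-fitting fraction `coverage` of C-alpha atoms.

    Assumes `pred` and `ref` are equal-length lists of (x, y, z) tuples
    that have ALREADY been superposed; no alignment is done here.
    """
    squared = sorted(
        sum((p - r) ** 2 for p, r in zip(a, b)) for a, b in zip(pred, ref)
    )
    keep = max(1, round(coverage * len(squared)))  # residues with lowest error
    return math.sqrt(sum(squared[:keep]) / keep)

# 20 residues on a line; the prediction is exact except for one 10 A outlier.
ref = [(float(i), 0.0, 0.0) for i in range(20)]
pred = list(ref)
pred[-1] = (19.0, 10.0, 0.0)

print(ca_rmsd(pred, ref))        # sqrt(5), about 2.236
print(ca_rmsd(pred, ref, 0.95))  # 0.0, the single outlier is excluded
```

Restricting the metric to 95% of residues keeps a few badly placed residues, such as flexible tails, from dominating the score.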

The authors extended the evaluation beyond the CASP14 dataset to a large set of PDB structures.
(a): Distribution (histogram) of the error.
(b): Correlation between backbone accuracy and side-chain accuracy. When the backbone prediction is accurate, the side-chain prediction also becomes accurate.
(c): Correlation between the actual structural accuracy of the prediction (lDDT-Cα) and the accuracy that AlphaFold estimates for itself (pLDDT). The high correlation shows that AlphaFold's own accuracy score is reliable.
(d): Correlation between the actual fold accuracy of the prediction (TM-score) and the accuracy that AlphaFold estimates for itself (pTM). The high correlation shows that AlphaFold's own accuracy score is reliable.
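
For readers unfamiliar with lDDT-Cα from panel (c), the following is a minimal sketch of the metric. It assumes the commonly used settings (a 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2 and 4 Å) and averages per-residue scores into one number; the function name `lddt_ca` is illustrative. The score is superposition-free because it compares only inter-residue distances.

```python
import math

def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified lDDT on C-alpha atoms, returned as a fraction in [0, 1].

    For each residue, look at all other residues within `cutoff` in the
    reference and count how many of those distances are preserved in the
    prediction within each tolerance threshold.
    """
    n = len(ref)
    per_residue = []
    for i in range(n):
        preserved, total = 0, 0
        for j in range(n):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # pair is not part of the local neighbourhood
            diff = abs(d_ref - math.dist(pred[i], pred[j]))
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
        if total:
            per_residue.append(preserved / total)
    return sum(per_residue) / len(per_residue) if per_residue else 0.0

ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in ref]  # rigid translation
distorted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.6, 0.0, 0.0)]

print(lddt_ca(ref, ref))      # 1.0
print(lddt_ca(shifted, ref))  # 1.0, rigid motion does not change the score
print(lddt_ca(distorted, ref))
```

pLDDT is the network's prediction of this quantity per residue, reported on a 0 to 100 scale rather than 0 to 1.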

(a): Results of the ablation study for each component of AlphaFold. Self-distillation (a student-teacher setup) improved the accuracy of AlphaFold. The version without invariant point attention (IPA) performed worst, demonstrating the importance of IPA. Although not shown in the graph, BERT-style masking also removes the need to hard-code a particular correlation statistic.
(b): AlphaFold incrementally improves the protein structure as it progresses through each step of the network. For difficult targets such as T1064, AlphaFold searches over and rearranges secondary structure elements.

If the MSA contains fewer than about 30 sequences, accuracy drops dramatically. The MSA is needed to find the correct structure in the early stages. However, for MSAs with more than about 100 sequences, the improvement in accuracy becomes marginal.
Accuracy is low when the number of heterotypic contacts exceeds the number of intra-chain or homotypic contacts. Because AlphaFold takes only a single chain as input, proteins whose structure is determined by interactions with other chains cannot be predicted accurately. However, there are exceptions (Figure 5b).
AlphaFold can also be applied at a larger scale, for example to the proteins encoded by the human genome.