Algorithms Behind AlphaFold

Introduction

Recently, AlphaFold has become a hot topic due to its extraordinary ability to predict protein structures, a giant leap for structural biology. Since the method shows such strength in predicting protein conformations, I decided to write this blog to give a comprehensive understanding of the algorithms behind AlphaFold. Up to now, only a description of the first version of AlphaFold has been published, so this blog introduces that first version; see this link.

AlphaFold, Part I

1. Basics of AlphaFold

AlphaFold uses the distances between $C_\beta$ atoms to predict the protein structure. These distances matter because they capture the spatial relationships between amino acid residues in a protein. The PDB database holds many known protein structures, which provide the training dataset for AlphaFold's deep learning model. One part of AlphaFold is a convolutional neural network that maps amino acid sequences to spatial distances $d_{ij}$ and torsion angles $\left(\phi_i, \psi_i\right)$.

Suppose the known amino acid sequence is $S$ with length $L$. AlphaFold tackles the problem of predicting a $64 \times 64$ region sampled from the $L \times L$ distance matrix (the $d_{ij}$ between $C_\beta$ atoms mentioned above). This can be converted into a classification problem, provided the distance is represented as a discrete distribution (i.e. a histogram, where each bin represents a range of distances). AlphaFold then trains a model that maximizes the probability $P(d_{ij} \mid S)$.
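To make the conversion from regression to classification concrete, here is a minimal sketch of the discretization step, assuming the 2–22 Å range and 64 bins described later in this post. The function name is mine, not from AlphaFold's code:

```python
import numpy as np

def distances_to_bins(dist, d_min=2.0, d_max=22.0, n_bins=64):
    """Discretize pairwise C-beta distances (in Angstroms) into bin
    indices, turning distance regression into 64-way classification.
    Distances below d_min map to bin 0; beyond d_max, to the last bin."""
    edges = np.linspace(d_min, d_max, n_bins + 1)  # 65 edges -> 64 bins
    # np.digitize against the interior edges yields indices 0..n_bins-1
    return np.digitize(dist, edges[1:-1])

# Toy 3-residue example: a symmetric distance matrix in Angstroms
d = np.array([[0.0, 5.0, 12.0],
              [5.0, 0.0, 30.0],
              [12.0, 30.0, 0.0]])
bins = distances_to_bins(d)
```

The model's softmax output over these 64 bins is then a per-pair histogram, and training minimizes the cross entropy against the true bin index.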

The following figure, taken from the original AlphaFold paper, illustrates how AlphaFold works. From (a), we can see that the amino acid sequence and MSA features are fed into a deep neural network, which predicts the distance and torsion distributions. The predictions are then passed to gradient descent on a protein-specific potential to produce an optimal protein conformation.

From the appendix, we find that the deep learning model uses the following features.

  • Number of HHblits alignments (scalar)
  • Sequence-length features: 1-hot amino acid type (21 features); other useful profile-based features
  • Sequence-length-squared features

Moreover, the appendix describes the neural network structure as follows. The total loss is a weighted sum of the distance-bin cross entropy and auxiliary losses for secondary structure and accessible surface area; together these components ensure a good prediction of the inter-residue distances as well as the macroscopic structure.

  • 7 groups of 4 blocks with 256 channels, cycling through dilations
    1,2,4,8.
  • 48 groups of 4 blocks with 128 channels, cycling through dilations
    1,2,4,8.
  • ELU is used as the activation function.
  • Auxiliary loss weights: secondary structure: 0.005; accessible sur-
    face area: 0.001. These auxiliary losses were cut by a factor 10 after
    100 000 steps.
  • The target is a discretization of the distance between the $C_\beta$ atoms of the residues (or $C_\alpha$ atoms for glycine). The range from 2 to 22 Å is divided equally into 64 bins, so the problem becomes minimizing the cross entropy for distance-bin classification. The input features are 2-D: the concatenation of embeddings of the 1-D features of each residue with the 2-D features between residues.
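The block/dilation schedule and the auxiliary-loss weighting above can be sketched as follows; the function names are illustrative, not taken from the paper's code:

```python
def dilation_schedule(groups, blocks_per_group=4, cycle=(1, 2, 4, 8)):
    """Each group of 4 residual blocks cycles through dilations 1,2,4,8."""
    return [cycle[b % len(cycle)]
            for _ in range(groups)
            for b in range(blocks_per_group)]

# 7 groups of 4 blocks at 256 channels, then 48 groups at 128 channels
dilations_256 = dilation_schedule(7)    # 28 blocks
dilations_128 = dilation_schedule(48)   # 192 blocks

def aux_loss_weights(step):
    """Secondary-structure and accessible-surface-area loss weights,
    cut by a factor of 10 after 100,000 training steps."""
    ss, asa = 0.005, 0.001
    if step > 100_000:
        ss, asa = ss / 10, asa / 10
    return ss, asa
```

Cycling the dilation rates lets the receptive field grow quickly with depth while keeping the per-block cost constant.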

The above features are then fed into the neural network to make distogram predictions. The network is a 2-D dilated convolutional residual network. Amino acids are encoded into hidden vectors using 1-D embedding layers; these are combined into 2-D feature maps and fed into the 2-D convolutional network to predict the pairwise distances.
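A common way to turn per-residue (1-D) embeddings into a pairwise (2-D) feature map is to concatenate the features of residues $i$ and $j$ at position $(i, j)$. The sketch below shows this tiling; the function is mine and simplifies away the profile features:

```python
import numpy as np

def to_pairwise(features_1d):
    """Tile per-residue embeddings of shape (L, C) into a pairwise map
    of shape (L, L, 2C) by concatenating the features of residues i and j."""
    L, C = features_1d.shape
    fi = np.repeat(features_1d[:, None, :], L, axis=1)  # (L, L, C), row i
    fj = np.repeat(features_1d[None, :, :], L, axis=0)  # (L, L, C), row j
    return np.concatenate([fi, fj], axis=-1)            # (L, L, 2C)

emb = np.arange(12, dtype=float).reshape(4, 3)  # 4 residues, 3 channels
pair = to_pairwise(emb)
```

The resulting $(L, L, 2C)$ tensor is what the 2-D convolutional layers consume, alongside any features that are natively pairwise (e.g. MSA couplings).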

To predict an accurate protein structure, we need not only the distances between amino acid residues but also the backbone torsion (dihedral) angles of each residue. This requires modelling the probability distribution $P(\phi_i, \psi_i \mid S, \mathrm{MSA}(S))$. This distribution is also discretized into bins (one every $10^\circ$, giving $36 \times 36 = 1296$ in total), and it can serve as a potential that is minimized to obtain a more accurate protein structure.
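The $(\phi, \psi)$ discretization can be sketched as a simple index computation, assuming angles in degrees in $[-180, 180)$; the helper name is mine:

```python
def torsion_bin(phi, psi, bin_deg=10):
    """Map a (phi, psi) pair in degrees, each in [-180, 180), to one of
    36 x 36 = 1296 discrete bins (row-major over the Ramachandran grid)."""
    i = int((phi + 180) // bin_deg) % 36  # phi bin, 0..35
    j = int((psi + 180) // bin_deg) % 36  # psi bin, 0..35
    return i * 36 + j
```

The network's softmax over these 1296 bins is a discretized Ramachandran distribution per residue, which can be interpolated into a smooth torsion potential for the later gradient-descent stage.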

To reduce memory consumption, the AlphaFold model uses a sampling strategy: it samples $64 \times 64$ crops of the full $L \times L$ pairwise map and predicts the distances within each crop.
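A minimal sketch of that cropping strategy, assuming an $(L, L, C)$ pairwise feature tensor (the function and its seeding are illustrative):

```python
import numpy as np

def sample_crop(pair_features, crop=64, rng=None):
    """Sample a random crop x crop region from an (L, L, C) pairwise
    feature map, so per-step memory is independent of sequence length L."""
    if rng is None:
        rng = np.random.default_rng(0)
    L = pair_features.shape[0]
    i = int(rng.integers(0, L - crop + 1))  # top-left row offset
    j = int(rng.integers(0, L - crop + 1))  # top-left column offset
    return pair_features[i:i + crop, j:j + crop], (i, j)

x = np.zeros((100, 100, 8))
window, (i, j) = sample_crop(x)
```

At inference time the crops are tiled to cover the whole $L \times L$ matrix, and the per-crop distograms are stitched back together.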

AlphaFold, Part II