AlphaFold, developed by DeepMind, predicts protein structures from amino acid sequences at or near experimental resolution, solving the 50-year-old protein folding challenge and making it possible to turn large-scale genomics data into protein structures. AlphaFold will also shift scientific research from a low-throughput to a high-throughput model. The overall AlphaFold prediction process consists of two stages: 1) MSA construction on CPUs and 2) model inference on GPUs. The first stage uses CPUs only and can take hours to construct the MSA for a single protein because of the large database sizes and I/O bottlenecks. GPUs remain idle during this stage, which lowers GPU utilization and limits the capacity for large-scale structure prediction. We therefore propose “ParaFold”, an open-source parallel version of AlphaFold for high-throughput protein structure prediction. ParaFold separates the CPU and GPU parts to enable large-scale structure prediction and to improve GPU utilization. ParaFold also reduces both CPU and GPU runtime through two optimizations that do not compromise prediction quality: multi-threaded parallelism on CPUs and optimized JAX compilation on GPUs. We evaluated ParaFold on three datasets of different protein lengths. We demonstrated its large-scale prediction capability by running model 1 inference on ∼20,000 small proteins in 5.4 hours on one NVIDIA DGX-2. With the CPU/GPU separation and the JAX compilation optimization, the total GPU runtime was reduced to 5.4 hours, compared with 1,352.6 hours when using AlphaFold, achieving a 99.7% GPU runtime reduction. In this case (∼20,000 proteins, each 50 residues long), ParaFold greatly increased the number of structures a GPU can predict per day, achieving a 250× speedup over AlphaFold.
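The CPU/GPU separation described above can be illustrated with a minimal Python sketch. It is not ParaFold's actual code: `build_features`, `run_inference`, and `predict_all` are hypothetical stand-ins, the "MSA" step only records the sequence length, and a thread pool stands in for whatever CPU-side scheduling a real deployment would use. The only point it makes is structural: every feature job finishes on CPU workers before the GPU stage starts, so the GPU is never reserved while waiting for an MSA search.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def build_features(sequence: str) -> Dict[str, object]:
    """Stage 1 stand-in (CPU only). A real pipeline would run the MSA search here."""
    return {"sequence": sequence, "length": len(sequence)}


def run_inference(features: Dict[str, object]) -> str:
    """Stage 2 stand-in (GPU). A real pipeline would run model inference here."""
    return f"structure[{features['length']} residues]"


def predict_all(sequences: List[str], cpu_workers: int = 4) -> List[str]:
    # Stage 1: run all feature jobs on CPU workers; no GPU is held during this step.
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        features = list(pool.map(build_features, sequences))  # input order is preserved

    # Stage 2: the GPU stage consumes precomputed features back to back.
    return [run_inference(f) for f in features]


if __name__ == "__main__":
    for result in predict_all(["MKV", "ACDEFG"]):
        print(result)
```

In a cluster setting, the two stages would typically be submitted as separate jobs to CPU-only and GPU nodes, with the feature files on shared storage acting as the hand-off between them; that arrangement is an assumption of this sketch rather than a detail stated in the text.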
ParaFold offers a rapid and effective approach to high-throughput structure prediction, harnessing AlphaFold's predictive power on supercomputers in less time and at lower cost. The development of ParaFold will greatly speed up high-throughput studies and make protein “structure-omics” feasible.