Introduction
Protein Language Models (PLMs) have become a powerful tool for protein structure and function prediction and for protein design. At the International Conference on Machine Learning (ICML) 2023, MILA and Intel Labs jointly presented ProtST, a multimodal model that can design proteins from textual prompts. Since then, ProtST has been well received by the research community, accumulating more than 40 citations in less than a year and underscoring the impact of this work.
One of the most common PLM tasks is predicting the subcellular location of an amino acid sequence: the user feeds an amino acid sequence to the model, and the model outputs a label indicating the subcellular location where the sequence resides. The paper shows that the zero-shot subcellular localization performance of ProtST-ESM-1b outperforms state-of-the-art few-shot classifiers (see the figure below).
To democratize ProtST, Intel and MILA have rewritten the model and made it available to everyone through the Hugging Face Hub, where you can download both the model and the datasets.
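For orientation, the sketch below shows how the released assets can be pulled from the Hub with the standard transformers and datasets APIs. The "mila-intel/..." repository ids are assumptions based on the names used later in this post; check the Hub organization for the exact ids.

```python
# Minimal sketch of fetching the ProtST checkpoint and dataset from the Hugging Face Hub.
# The repository ids below are assumptions; check the Hub organization for the exact names.
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

model_id = "mila-intel/protst-esm1b-for-sequential-classification"   # assumed checkpoint id
dataset_id = "mila-intel/ProtST-SubcellularLocalization"             # assumed dataset id

# ProtST ships custom modeling code, so trust_remote_code=True is required.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load only the test split used for the benchmarks later in this post.
test_set = load_dataset(dataset_id, split="test")
print(test_set)
```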
This article shows how to efficiently run ProtST inference and fine-tuning with the Intel Gaudi 2 accelerator card and the optimum-habana open-source library. Intel Gaudi 2 is the second-generation AI accelerator designed by Intel. Interested readers can check out our previous blog posts for an in-depth look at this accelerator card and how to use it through the Intel Developer Cloud. Thanks to optimum-habana, users can port their transformers-based code to Gaudi 2 with only a few code changes.
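To give a feel for what "a few code changes" means in practice, here is a minimal sketch of the usual porting pattern with optimum-habana: the stock Trainer and TrainingArguments are swapped for their Gaudi counterparts and a GaudiConfig is supplied. The model and dataset objects below are placeholders for your own code.

```python
# Sketch of the typical transformers -> Gaudi porting pattern with optimum-habana.
# `model`, `train_dataset` and `eval_dataset` are placeholders for your own objects.
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

# GaudiTrainingArguments replaces transformers.TrainingArguments;
# the extra flags enable HPU execution and Habana's lazy mode.
training_args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,
    use_lazy_mode=True,
    bf16=True,
)

# A GaudiConfig carries Habana-specific settings (e.g. mixed-precision operator lists).
gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")

# GaudiTrainer replaces transformers.Trainer; the rest of the training code is unchanged.
trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```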
ProtST Inference
Common subcellular locations include the nucleus, the cell membrane, the cytoplasm, and the mitochondria; you can find the full list of locations in this dataset.
We use the test split of the ProtST-SubcellularLocalization dataset to compare ProtST inference performance on an NVIDIA A100 80GB PCIe card and a Gaudi 2 accelerator card. The test set contains 2,772 amino acid sequences, with sequence lengths ranging from 79 to 1999.
You can reproduce our experiments with this script. We run the model in bfloat16 precision with a batch size of 1. On NVIDIA A100 and Intel Gaudi 2 we obtain the same accuracy (0.44), but Gaudi 2 inference is 1.76 times faster than A100. The runtimes for a single A100 and a single Gaudi 2 are shown below.
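For readers who want a feel for the setup before opening the script, the sketch below runs batch-size-1 bfloat16 inference on a Gaudi 2 device. The checkpoint and dataset ids, the column names, and the choice of auto class are assumptions for illustration; the linked script is the authoritative reference.

```python
# Hedged sketch of bfloat16, batch-size-1 inference on a Gaudi 2 device.
# Checkpoint/dataset ids and the "aa_seq"/"label" column names are assumptions;
# the exact auto class for ProtST's custom code may also differ.
import torch
import habana_frameworks.torch  # Habana PyTorch bridge; registers the "hpu" device
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "mila-intel/protst-esm1b-for-sequential-classification"  # assumed id
test_set = load_dataset("mila-intel/ProtST-SubcellularLocalization", split="test")

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("hpu").eval()

correct = 0
for example in test_set:
    inputs = tokenizer(example["aa_seq"], return_tensors="pt").to("hpu")
    with torch.no_grad():
        logits = model(**inputs).logits
    correct += int(logits.argmax(dim=-1).item() == example["label"])
print(f"accuracy: {correct / len(test_set):.2f}")
```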
Fine-tuning ProtST
Fine-tuning the ProtST model on downstream tasks is a simple and well-established way to improve model accuracy. In this experiment we focus on fine-tuning for binary localization, a simpler version of subcellular localization that uses a binary label to indicate whether a protein is membrane-bound or soluble.
You can reproduce our experiments with this script. In it, we fine-tune ProtST-ESM1b-for-sequential-classification in bfloat16 precision on the ProtST-BinaryLocalization dataset. The table below shows model accuracy on the test split under different hardware configurations; all of them closely match the accuracy reported in the paper (~92.5%).
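As a rough sketch of what the fine-tuning setup looks like (the linked script remains the reference implementation), the snippet below tokenizes the binary localization data and hands it to GaudiTrainer in bfloat16. Repository ids, column names, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of bfloat16 fine-tuning on the binary localization task with optimum-habana.
# Dataset/checkpoint ids, column names and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

model_id = "mila-intel/protst-esm1b-for-sequential-classification"  # assumed id
raw = load_dataset("mila-intel/ProtST-BinaryLocalization")          # assumed id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

def tokenize(batch):
    # "aa_seq" is an assumed column name holding the amino acid sequence.
    return tokenizer(batch["aa_seq"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True)

# Two labels: membrane-bound vs. soluble.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

args = GaudiTrainingArguments(
    output_dir="./protst-binary-localization",
    use_habana=True,
    use_lazy_mode=True,
    bf16=True,                       # bfloat16 training, as in the experiments above
    per_device_train_batch_size=32,  # illustrative value
    num_train_epochs=5,              # illustrative value
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=GaudiConfig.from_pretrained("Habana/bert-base-uncased"),
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()
```

For the multi-card runs discussed below, the optimum-habana repository provides a gaudi_spawn.py launcher that starts one training process per Gaudi 2 card.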
The figure below shows fine-tuning time. A single Gaudi 2 is 2.92 times faster than a single A100. It also shows that near-linear scaling is achieved with distributed training on 4 or 8 Gaudi 2 accelerator cards.
Summary
In this article, we have shown how to easily deploy ProtST inference and fine-tuning on Gaudi 2 with optimum-habana. Our results also show that Gaudi 2 performs competitively against A100 on these tasks: 1.76 times faster for inference and 2.92 times faster for fine-tuning.
If you want to start your modeling journey on Intel Gaudi 2 accelerator cards, the following resources will help:
- optimum-habana repository
- Intel Gaudi documentation
Thank you for reading! We look forward to seeing what Intel Gaudi 2-accelerated ProtST can do to help you innovate.
English original: /blog/intel-protein-language-model-protst
Original authors: Julien Simon, Jiqing Feng, Santiago Miret, Xinyu Yuan, Yi Wang, Matrix Yao, Minghao Xu, Ke Ding
Translator: Matrix Yao (Yao Weifeng), Deep Learning Engineer at Intel, working on the application of transformer-family models to data of various modalities and on training and inference of large-scale models.