1. He J, Wu W, Wang X. DIProT: A deep learning based interactive toolkit for efficient and effective Protein design. Synth Syst Biotechnol 2024; 9:217-222. [PMID: 38385151 PMCID: PMC10876589 DOI: 10.1016/j.synbio.2024.01.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 01/02/2024] [Accepted: 01/30/2024] [Indexed: 02/23/2024] Open
Abstract
The protein inverse folding problem, designing amino acid sequences that fold into desired protein structures, is a critical challenge in biological sciences. Despite numerous data-driven and knowledge-driven methods, there remains a need for a user-friendly toolkit that effectively integrates these approaches for in-silico protein design. In this paper, we present DIProT, an interactive protein design toolkit. DIProT leverages a non-autoregressive deep generative model to solve the inverse folding problem, combined with a protein structure prediction model. This integration allows users to incorporate prior knowledge into the design process, evaluate designs in silico, and form a virtual design loop with human feedback. Our inverse folding model demonstrates competitive performance in terms of effectiveness and efficiency on TS50 and CATH4.2 datasets, with promising sequence recovery and inference time. Case studies further illustrate how DIProT can facilitate user-guided protein design.
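Sequence recovery, the metric quoted above, is usually taken to be the fraction of positions at which a designed sequence reproduces the native residue. The sketch below illustrates that definition only; the function name and the two example sequences are made up for illustration and are not taken from DIProT.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        # Inverse folding keeps the backbone fixed, so lengths must agree.
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical 10-residue example: positions 3 and 6 differ.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
print(f"recovery = {sequence_recovery(native, designed):.2f}")  # recovery = 0.80
```

Benchmarks typically average this per-protein value over a test set, so reported numbers can depend on whether the mean or median is used.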
Affiliation(s)
- Xiaowo Wang
  - Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
2. Wang H, Liu D, Zhao K, Wang Y, Zhang G. SPDesign: protein sequence designer based on structural sequence profile using ultrafast shape recognition. Brief Bioinform 2024; 25:bbae146. [PMID: 38600663 PMCID: PMC11006797 DOI: 10.1093/bib/bbae146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 03/02/2024] [Accepted: 03/15/2024] [Indexed: 04/12/2024] Open
Abstract
Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in protein structure prediction, we explored whether the representation of structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. The sequence profile, together with structural pre-trained knowledge and geometric features, is then fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms state-of-the-art methods such as ProteinMPNN, PiFold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on the CATH 4.2 benchmark, respectively. Encouraging results have also been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted with the PDBench tool suggests that SPDesign performs well in subdivided structures. More interestingly, we found that SPDesign can reconstruct well the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.
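Ultrafast shape recognition (USR) summarises a 3D point set as 12 numbers: three moments of the distance distribution measured from each of four reference points (the centroid, the point closest to it, the point farthest from it, and the point farthest from that one). The sketch below shows one common variant of the descriptor (mean, standard deviation, signed cube root of the third central moment) and the usual inverse-Manhattan similarity. It is not SPDesign's implementation; the function names and the toy coordinates are assumptions for illustration.

```python
import math


def _moments(values):
    """Mean, standard deviation and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, math.sqrt(var), skew]


def usr_descriptor(coords):
    """12-number USR descriptor for a list of (x, y, z) points."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # closest point to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # farthest point from the centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # farthest point from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor.extend(_moments([math.dist(c, ref) for c in coords]))
    return descriptor


def usr_similarity(a, b):
    """Similarity in (0, 1]; 1 means identical descriptors."""
    manhattan = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + manhattan / len(a))


# Toy backbone trace (hypothetical coordinates) and a translated copy.
trace = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0), (4.0, 2.0, 1.0), (5.0, 3.5, 1.5)]
moved = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in trace]
print(usr_similarity(usr_descriptor(trace), usr_descriptor(moved)))
```

Because the descriptor depends only on internal distances, it is unchanged by translation and rotation, which is what lets a database be screened by comparing short vectors before any structure alignment is run.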
Affiliation(s)
- Yajun Wang
  - Corresponding author. College of Pharmaceutical Science, Zhejiang University of Technology, Hangzhou 310014, China
- Guijun Zhang
  - Corresponding author. College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
3. Castorina LV, Ünal SM, Subr K, Wood CW. TIMED-Design: flexible and accessible protein sequence design with convolutional neural networks. Protein Eng Des Sel 2024; 37:gzae002. [PMID: 38288671 PMCID: PMC10939383 DOI: 10.1093/protein/gzae002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2023] [Revised: 12/12/2023] [Accepted: 01/12/2024] [Indexed: 02/18/2024] Open
Abstract
Sequence design is a crucial step in the process of designing or engineering proteins. Traditionally, physics-based methods have been used to solve for optimal sequences, with the main disadvantage being that they are computationally intensive for the end user. Deep learning-based methods offer an attractive alternative, outperforming physics-based methods at a significantly lower computational cost. In this paper, we explore the application of Convolutional Neural Networks (CNNs) for sequence design. We describe the development and benchmarking of a range of networks, as well as reimplementations of previously described CNNs. We demonstrate the flexibility of representing proteins in a three-dimensional voxel grid by encoding additional design constraints into the input data. Finally, we describe TIMED-Design, a web application and command line tool for exploring and applying the models described in this paper. The user interface will be available at the URL: https://pragmaticproteindesign.bio.ed.ac.uk/timed. The source code for TIMED-Design is available at https://github.com/wells-wood-research/timed-design.
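To make the voxel-grid representation concrete, the sketch below bins atoms around a residue into a cubic grid with one channel per element. It is a generic illustration, not the TIMED-Design encoder: the grid size, voxel size, channel list and example atoms are all hypothetical. An extra design constraint could be encoded in the same way by appending another channel.

```python
def voxelize(atoms, grid_size, voxel_size, channels=("C", "N", "O")):
    """Map (element, x, y, z) atoms, centred on the origin, to occupied voxels.

    Returns a sparse dict keyed by (i, j, k, channel_index); atoms outside
    the cube or with an element not listed in `channels` are ignored.
    """
    half = grid_size * voxel_size / 2.0
    grid = {}
    for element, x, y, z in atoms:
        if element not in channels:
            continue
        idx = tuple(int((c + half) // voxel_size) for c in (x, y, z))
        if all(0 <= i < grid_size for i in idx):
            grid[idx + (channels.index(element),)] = 1.0
    return grid


# Hypothetical local environment: four atoms, 4 x 4 x 4 grid of 1 Angstrom voxels.
atoms = [
    ("N", -1.5, 0.2, 0.0),
    ("C", 0.0, 0.0, 0.0),
    ("O", 5.0, 0.0, 0.0),   # outside the cube, dropped
    ("S", 0.0, 0.0, 0.0),   # element without a channel, dropped
]
print(voxelize(atoms, grid_size=4, voxel_size=1.0))
```

A dense tensor for a 3D CNN can be filled from this sparse dict; the sparse form is used here only to keep the example dependency-free.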
Affiliation(s)
- Leonardo V Castorina
  - School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
- Suleyman Mert Ünal
  - School of Biological Sciences, University of Edinburgh, Roger Land Building, Edinburgh EH9 3FF, United Kingdom
- Kartic Subr
  - School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
- Christopher W Wood
  - School of Biological Sciences, University of Edinburgh, Roger Land Building, Edinburgh EH9 3FF, United Kingdom
4. Zhang X, Yin H, Ling F, Zhan J, Zhou Y. SPIN-CGNN: Improved fixed backbone protein design with contact map-based graph construction and contact graph neural network. PLoS Comput Biol 2023; 19:e1011330. [PMID: 38060617 PMCID: PMC10729952 DOI: 10.1371/journal.pcbi.1011330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 12/19/2023] [Accepted: 11/27/2023] [Indexed: 12/20/2023] Open
Abstract
Recent advances in deep learning have significantly improved the ability to infer protein sequences directly from protein structures for fixed-backbone design. The methods have evolved from the early use of multi-layer perceptrons to convolutional neural networks, transformers, and graph neural networks (GNN). However, the conventional approach of constructing a K-nearest-neighbors (KNN) graph for GNN has limited the utilization of edge information, which plays a critical role in network performance. Here we introduce SPIN-CGNN, which is based on protein contact maps for nearest neighbors. Together with auxiliary edge updates and selective kernels, we found that SPIN-CGNN provided a comparable performance in refolding ability by AlphaFold2 to the current state-of-the-art techniques but a significant improvement over them in terms of sequence recovery, perplexity, deviation from amino-acid compositions of native sequences, conservation of hydrophobic positions, and low complexity regions, according to tests on unseen structures, "hallucinated" structures and diffusion models. Results suggest that low complexity regions in the sequences designed by deep learning, for generated structures in particular, remain to be improved when compared to the native sequences.
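The difference between the two graph constructions can be seen on a toy example. A KNN graph gives every residue exactly k neighbours regardless of distance, whereas a contact-map graph connects only residues within a distance cutoff, so the number of edges varies with local packing. The sketch below is a minimal illustration; the use of one point per residue, the 8 Å cutoff and the coordinates are assumptions, not the settings used by SPIN-CGNN.

```python
import math


def pairwise_distances(coords):
    """Full distance matrix for a list of (x, y, z) points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def knn_edges(coords, k):
    """Directed edges from each residue to its k nearest neighbours."""
    dist = pairwise_distances(coords)
    edges = set()
    for i, row in enumerate(dist):
        others = sorted((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        edges.update((i, j) for j in others[:k])
    return edges


def contact_edges(coords, cutoff):
    """Directed edges between all residue pairs closer than `cutoff`."""
    dist = pairwise_distances(coords)
    n = len(coords)
    return {(i, j) for i in range(n) for j in range(n) if i != j and dist[i][j] <= cutoff}


# Three residues in a row plus one far away (hypothetical coordinates, in Angstrom).
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(sorted(knn_edges(ca, k=2)))
print(sorted(contact_edges(ca, cutoff=8.0)))
```

In this example the isolated fourth residue still receives two KNN edges to distant residues, while the contact graph leaves it unconnected.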
Affiliation(s)
- Xing Zhang
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, People’s Republic of China
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, People’s Republic of China
- Hongmei Yin
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, People’s Republic of China
- Fei Ling
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, People’s Republic of China
- Jian Zhan
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, People’s Republic of China
- Yaoqi Zhou
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, People’s Republic of China
5. Zhou X, Chen G, Ye J, Wang E, Zhang J, Mao C, Li Z, Hao J, Huang X, Tang J, Heng PA. ProRefiner: an entropy-based refining strategy for inverse protein folding with global graph attention. Nat Commun 2023; 14:7434. [PMID: 37973874 PMCID: PMC10654420 DOI: 10.1038/s41467-023-43166-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 11/02/2023] [Indexed: 11/19/2023] Open
Abstract
Inverse Protein Folding (IPF) is an important task of protein design, which aims to design sequences compatible with a given backbone structure. Despite the prosperous development of algorithms for this task, existing methods tend to rely on noisy predicted residues located in the local neighborhood when generating sequences. To address this limitation, we propose an entropy-based residue selection method to remove noise in the input residue context. Additionally, we introduce ProRefiner, a memory-efficient global graph attention model to fully utilize the denoised context. Our proposed method achieves state-of-the-art performance on multiple sequence design benchmarks in different design settings. Furthermore, we demonstrate the applicability of ProRefiner in redesigning Transposon-associated transposase B, where six out of the 20 variants we propose exhibit improved gene editing activity.
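One way to read "entropy-based residue selection" is: compute the Shannon entropy of each residue's predicted amino-acid distribution and keep only low-entropy (confident) predictions as context. The sketch below illustrates that idea under stated assumptions; the threshold, the three-letter toy alphabet and the function names are hypothetical, and the actual selection rule in ProRefiner may differ.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_residues(predicted, max_entropy):
    """Indices of positions whose predicted distribution has entropy <= max_entropy."""
    return [i for i, probs in enumerate(predicted) if shannon_entropy(probs) <= max_entropy]


# Toy predictions over a 3-letter alphabet (a real model would use 20 amino acids).
predicted = [
    [0.98, 0.01, 0.01],        # confident
    [1 / 3, 1 / 3, 1 / 3],     # maximally uncertain
    [0.70, 0.20, 0.10],        # in between
]
print(select_confident_residues(predicted, max_entropy=0.5))
```

Positions that are filtered out would then be treated as unknown when the refining model re-predicts them from the remaining context.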
Affiliation(s)
- Xinyi Zhou
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Central Ave, Hong Kong, China
- Junjie Ye
  - Noah's Ark Lab, Huawei, Shenzhen, China
- Ercheng Wang
  - Zhejiang Lab, Kechuang Avenue, Hangzhou, China
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Jun Zhang
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Cong Mao
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Zhanwei Li
  - Zhejiang Lab, Kechuang Avenue, Hangzhou, China
- Jin Tang
  - Zhejiang Lab, Kechuang Avenue, Hangzhou, China
- Pheng Ann Heng
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Central Ave, Hong Kong, China
  - Zhejiang Lab, Kechuang Avenue, Hangzhou, China
6. Lategan FA, Schreiber C, Patterton HG. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 2023; 24:373. [PMID: 37789284 PMCID: PMC10546711 DOI: 10.1186/s12859-023-05498-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 09/25/2023] [Indexed: 10/05/2023] Open
Abstract
BACKGROUND The relationship between the sequence of a protein, its structure, and the resulting connection between its structure and function, is a foundational principle in biological science. Only recently has the computational prediction of protein structure based only on protein sequence been addressed effectively by AlphaFold, a neural network approach that can predict the majority of protein structures with X-ray crystallographic accuracy. A question that is now of acute relevance is the "inverse protein folding problem": predicting the sequence of a protein that folds into a specified structure. This will be of immense value in protein engineering and biotechnology, and will allow the design and expression of recombinant proteins that can, for instance, fold into specified structures as a scaffold for the attachment of recombinant antigens, or enzymes with modified or novel catalytic activities. Here we describe the development of SeqPredNN, a feed-forward neural network trained with X-ray crystallographic structures from the RCSB Protein Data Bank to predict the identity of amino acids in a protein structure using only the relative positions, orientations, and backbone dihedral angles of nearby residues. RESULTS We predict the sequence of a protein expected to fold into a specified structure and assess the accuracy of the prediction using both AlphaFold and RoseTTAFold to computationally generate the fold of the derived sequence. We show that the sequences predicted by SeqPredNN fold into a structure with a median TM-score of 0.638 when compared to the crystal structure according to AlphaFold predictions, yet these sequences are unique and only 28.4% identical to the sequence of the crystallized protein. CONCLUSIONS We propose that SeqPredNN will be a valuable tool to generate proteins of defined structure for the design of novel biomaterials, pharmaceuticals, catalysts, and reporter systems. The low sequence identity of its predictions compared to the native sequence could prove useful for developing proteins with modified physical properties, such as water solubility and thermal stability. The speed and ease of use of SeqPredNN offers a significant advantage over physics-based protein design methods.
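For readers unfamiliar with the TM-score quoted above: it averages 1 / (1 + (d_i / d0)^2) over the residues of the target, where d_i is the distance between aligned residue pairs and d0 = 1.24 (L - 15)^(1/3) - 1.8 depends on the target length L. The sketch below evaluates that sum for a single, already given superposition and alignment; the full score takes the maximum over superpositions, which is not attempted here. Flooring d0 at 0.5 for short chains is an assumption made to keep the formula well defined, and the function names are illustrative.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (floored at 0.5 as an assumption)."""
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_for_superposition(distances, target_length: int) -> float:
    """TM-score sum for one fixed superposition.

    `distances` holds the distance (in Angstrom) for each aligned residue pair;
    unaligned target residues contribute zero, so the sum is divided by the
    full target length rather than by len(distances).
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# A 100-residue target: perfect match versus every pair exactly d0 apart.
L = 100
print(tm_score_for_superposition([0.0] * L, L))        # 1.0
print(tm_score_for_superposition([tm_d0(L)] * L, L))   # 0.5
```

The length-dependent d0 is what makes scores comparable between proteins of different sizes.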
Affiliation(s)
- F Adriaan Lategan
  - Center for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, 7600, South Africa
- Caroline Schreiber
  - Center for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, 7600, South Africa
- Hugh G Patterton
  - Center for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, 7600, South Africa
7. Xu T, Xu Q, Li J. Toward the appropriate interpretation of Alphafold2. Front Artif Intell 2023; 6:1149748. [PMID: 37664078 PMCID: PMC10469483 DOI: 10.3389/frai.2023.1149748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2023] [Accepted: 07/24/2023] [Indexed: 09/05/2023] Open
Abstract
In life science, protein is an essential building block for life forms and a crucial catalyst for metabolic reactions in organisms. The structures of protein depend on an infinity of amino acid residues' complex combinations determined by gene expression. Predicting protein folding structures has been a tedious problem in the past seven decades but, due to robust development of artificial intelligence, astonishing progress has been made. Alphafold2, whose key component is Evoformer, is a typical and successful example of such progress. This article attempts to not only isolate and dissect every detail of Evoformer, but also raise some ideas for potential improvement.
Affiliation(s)
- Tian Xu
  - Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
- Qin Xu
  - Department of Mathematics, The University of Arizona, Tucson, AZ, United States
- Jianyong Li
  - Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
8. Yan J, Li S, Zhang Y, Hao A, Zhao Q. ZetaDesign: an end-to-end deep learning method for protein sequence design and side-chain packing. Brief Bioinform 2023; 24:bbad257. [PMID: 37429578 DOI: 10.1093/bib/bbad257] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 03/28/2023] [Revised: 06/05/2023] [Accepted: 06/21/2023] [Indexed: 07/12/2023] Open
Abstract
In the last few years, computational protein design has proven to be the most powerful tool for protein design and repacking tasks. In practice, these two tasks are strongly related but often treated separately. Besides, state-of-the-art deep-learning-based methods cannot provide interpretability from an energy perspective, affecting the accuracy of the design. Here we propose a new systematic approach, including both a posterior probability part and a joint probability part, to solve the two essential questions once and for all. This approach takes the physicochemical property of amino acids into consideration and uses the joint probability model to ensure the convergence between structure and amino acid type. Our results demonstrated that this method could generate feasible, high-confidence sequences with low-energy side-chain conformations. The designed sequences can fold into target structures with high confidence and maintain relatively stable biochemical properties. The side chain conformation has a significantly lower energy landscape without delegating to a rotamer library or performing the expensive conformational searches. Overall, we propose an end-to-end method that combines the advantages of both deep learning and energy-based methods. The design results of this model demonstrate high efficiency and precision, as well as a low energy state and good interpretability.
Affiliation(s)
- Junyu Yan
  - State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
- Shuai Li
  - State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
- Ying Zhang
  - The Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
- Aimin Hao
  - State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
- Qinping Zhao
  - State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
9. Kazmirchuk TDD, Bradbury-Jost C, Withey TA, Gessese T, Azad T, Samanfar B, Dehne F, Golshani A. Peptides of a Feather: How Computation Is Taking Peptide Therapeutics under Its Wing. Genes (Basel) 2023; 14:1194. [PMID: 37372372 DOI: 10.3390/genes14061194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Revised: 05/24/2023] [Accepted: 05/26/2023] [Indexed: 06/29/2023] Open
Abstract
Leveraging computation in the development of peptide therapeutics has garnered increasing recognition as a valuable tool to generate novel therapeutics for disease-related targets. To this end, computation has transformed the field of peptide design through identifying novel therapeutics that exhibit enhanced pharmacokinetic properties and reduced toxicity. The process of in-silico peptide design involves the application of molecular docking, molecular dynamics simulations, and machine learning algorithms. Three primary approaches for peptide therapeutic design including structural-based, protein mimicry, and short motif design have been predominantly adopted. Despite the ongoing progress made in this field, there are still significant challenges pertaining to peptide design including: enhancing the accuracy of computational methods; improving the success rate of preclinical and clinical trials; and developing better strategies to predict pharmacokinetics and toxicity. In this review, we discuss past and present research pertaining to the design and development of in-silico peptide therapeutics in addition to highlighting the potential of computation and artificial intelligence in the future of disease therapeutics.
Affiliation(s)
- Thomas David Daniel Kazmirchuk
  - Department of Biology, and the Ottawa Institute of Systems Biology (OISB), Carleton University, Ottawa, ON K1S 5B6, Canada
- Calvin Bradbury-Jost
  - Department of Biology, and the Ottawa Institute of Systems Biology (OISB), Carleton University, Ottawa, ON K1S 5B6, Canada
- Taylor Ann Withey
  - Department of Biology, and the Ottawa Institute of Systems Biology (OISB), Carleton University, Ottawa, ON K1S 5B6, Canada
- Tadesse Gessese
  - Department of Biology, and the Ottawa Institute of Systems Biology (OISB), Carleton University, Ottawa, ON K1S 5B6, Canada
- Taha Azad
  - Department of Microbiology and Infectious Diseases, Université de Sherbrooke, Sherbrooke, QC J1E 4K8, Canada
  - Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke (CHUS), Sherbrooke, QC J1H 5N4, Canada
- Bahram Samanfar
  - Department of Biology, and the Ottawa Institute of Systems Biology (OISB), Carleton University, Ottawa, ON K1S 5B6, Canada
  - Agriculture and Agri-Food Canada, Ottawa Research and Development Centre (ORDC), Ottawa, ON K1A 0C6, Canada
- Frank Dehne
  - School of Computer Science, Carleton University, Ottawa, ON K1S 5B6, Canada
- Ashkan Golshani
  - Department of Biology, and the Ottawa Institute of Systems Biology (OISB), Carleton University, Ottawa, ON K1S 5B6, Canada
10. Huang B, Fan T, Wang K, Zhang H, Yu C, Nie S, Qi Y, Zheng WM, Han J, Fan Z, Sun S, Ye S, Yang H, Bu D. Accurate and efficient protein sequence design through learning concise local environment of residues. Bioinformatics 2023; 39:7077134. [PMID: 36916746 PMCID: PMC10027430 DOI: 10.1093/bioinformatics/btad122] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/01/2022] [Revised: 01/30/2023] [Accepted: 02/19/2023] [Indexed: 03/15/2023]
Abstract
MOTIVATION Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired. RESULTS Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein. AVAILABILITY AND IMPLEMENTATION The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.
Affiliation(s)
- Bin Huang
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
- Tingwen Fan
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Kaiyue Wang
  - Beijing Advanced Innovation Center for Big Data-based Precision Medicine, School of Engineering Medicine, Beihang University, Beijing 100083, China
  - Key Laboratory of Big Data-based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of the People's Republic of China, Beijing 100083, China
- Haicang Zhang
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
- Chungong Yu
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
- Shuyu Nie
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
  - School of Life Sciences, Hebei University, Baoding, Hebei 071002, China
- Yangshuo Qi
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
  - School of Life Sciences, Hebei University, Baoding, Hebei 071002, China
- Wei-Mou Zheng
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Jian Han
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Zheng Fan
  - Institutional Center for Shared Technologies and Facilities, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Shiwei Sun
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
- Sheng Ye
  - Beijing Advanced Innovation Center for Big Data-based Precision Medicine, School of Engineering Medicine, Beihang University, Beijing 100083, China
  - Key Laboratory of Big Data-based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of the People's Republic of China, Beijing 100083, China
- Huaiyi Yang
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Dongbo Bu
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
11. Castorina LV, Petrenas R, Subr K, Wood CW. PDBench: evaluating computational methods for protein-sequence design. Bioinformatics 2023; 39:btad027. [PMID: 36637198 PMCID: PMC9869650 DOI: 10.1093/bioinformatics/btad027] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/30/2022] [Revised: 11/14/2022] [Accepted: 01/12/2023] [Indexed: 01/14/2023] Open
Abstract
SUMMARY Ever increasing amounts of protein structure data, combined with advances in machine learning, have led to the rapid proliferation of methods available for protein-sequence design. In order to utilize a design method effectively, it is important to understand the nuances of its performance and how it varies by design target. Here, we present PDBench, a set of proteins and a number of standard tests for assessing the performance of sequence-design methods. PDBench aims to maximize the structural diversity of the benchmark, compared with previous benchmarking sets, in order to provide useful biological insight into the behaviour of sequence-design methods, which is essential for evaluating their performance and practical utility. We believe that these tools are useful for guiding the development of novel sequence design algorithms and will enable users to choose a method that best suits their design target. AVAILABILITY AND IMPLEMENTATION https://github.com/wells-wood-research/PDBench. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leonardo V Castorina
  - School of Informatics, University of Edinburgh, 10 Crichton Street, Newington, Edinburgh EH8 9AB, UK
- Rokas Petrenas
  - School of Biological Sciences, University of Edinburgh, Roger Land Building, Edinburgh EH9 3FF, UK
- Kartic Subr
  - School of Informatics, University of Edinburgh, 10 Crichton Street, Newington, Edinburgh EH8 9AB, UK
- Christopher W Wood
  - School of Biological Sciences, University of Edinburgh, Roger Land Building, Edinburgh EH9 3FF, UK
12. Liu J, Zhang C, Lai L. GeoPacker: A novel deep learning framework for protein side-chain modeling. Protein Sci 2022; 31:e4484. [PMID: 36309961 PMCID: PMC9667900 DOI: 10.1002/pro.4484] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/11/2022] [Revised: 10/23/2022] [Accepted: 10/26/2022] [Indexed: 12/13/2022]
Abstract
Atomic interactions play essential roles in protein folding, structure stabilization, and function performance. Recent advances in deep learning-based methods have achieved impressive success not only in protein structure prediction, but also in protein sequence design. However, highly efficient and accurate protein side-chain prediction methods that can give detailed atomic interactions are still lacking. In the present study, we developed a deep learning-based method, GeoPacker, that uses geometric deep learning coupled with ResNet for protein side-chain modeling. GeoPacker explicitly represents atomic interactions with rotational and translational invariance for information extraction of relative locations. GeoPacker outperformed the state-of-the-art energy function-based methods in side-chain structure prediction accuracy and runs about 10 and 700 times faster than the deep learning-based methods DLPacker and OPUS-rota4, respectively, with comparable prediction accuracy. The performance of GeoPacker does not depend on the secondary structures that the residues belong to. GeoPacker gives highly accurate predictions for buried residues in the protein core as well as at protein-protein interfaces, making it a useful tool for protein structure modeling, protein design, and interaction design.
Affiliation(s)
- Jiale Liu
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Changsheng Zhang
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
| | - Luhua Lai
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
|
13
|
Ferruz N, Heinzinger M, Akdel M, Goncearenco A, Naef L, Dallago C. From sequence to function through structure: Deep learning for protein design. Comput Struct Biotechnol J 2022; 21:238-250. [PMID: 36544476 PMCID: PMC9755234 DOI: 10.1016/j.csbj.2022.11.014] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 08/31/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/20/2022] Open
Abstract
The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware, to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL)-assisted protein design from the last three years, 2) present a practical pipeline that allows one to go from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.
Key Words
- ADMM, Alternating Direction Method of Multipliers
- CNN, Convolutional Neural Network
- DL, Deep learning
- Deep learning
- Drug discovery
- FNN, fully-connected neural network
- GAN, Generative Adversarial Network
- GCN, Graph Convolutional Network
- GNN, Graph Neural Network
- GO, Gene Ontology
- GVP, Geometric Vector Perceptron
- LSTM, Long-Short Term Memory
- MLP, Multilayer Perceptron
- MSA, Multiple Sequence Alignment
- NLP, Natural Language Processing
- NSR, Natural Sequence Recovery
- Protein design
- Protein language models
- Protein prediction
- VAE, Variational Autoencoder
- pLM, protein Language Model
Affiliation(s)
- Noelia Ferruz
- Institute of Informatics and Applications, University of Girona, Girona, Spain
- Department of Biochemistry, University of Bayreuth, Bayreuth, Germany
| | - Michael Heinzinger
- Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
| | - Mehmet Akdel
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
| | | | - Luca Naef
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
| | - Christian Dallago
- Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
- NVIDIA DE GmbH, Einsteinstraße 172, 81677 München, Germany
|
14
|
Dauparas J, Anishchenko I, Bennett N, Bai H, Ragotte RJ, Milles LF, Wicky BIM, Courbet A, de Haas RJ, Bethel N, Leung PJY, Huddy TF, Pellock S, Tischer D, Chan F, Koepnick B, Nguyen H, Kang A, Sankaran B, Bera AK, King NP, Baker D. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022; 378:49-56. [PMID: 36108050 PMCID: PMC9997061 DOI: 10.1126/science.add2187] [Citation(s) in RCA: 285] [Impact Index Per Article: 142.5] [Indexed: 01/05/2023]
Abstract
Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning-based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo-electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.
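The sequence recovery figures quoted in this abstract are commonly understood as the fraction of positions at which a designed sequence reproduces the native residue. The following is a minimal sketch of that metric only, assuming two pre-aligned sequences of equal length; the function name and the example sequences are illustrative and are not taken from the paper or from the ProteinMPNN code.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumes both sequences are already aligned and of equal length
    (an assumption of this sketch, not a statement about any tool).
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Illustrative, made-up sequences: they differ at two of ten positions.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
print(f"{sequence_recovery(native, designed):.1%}")
```

Benchmark figures such as those above are typically averages of this per-structure value over a test set of backbones.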
Affiliation(s)
- J. Dauparas
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - I. Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - N. Bennett
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Molecular Engineering Graduate Program, University of Washington, Seattle, WA, USA
| | - H. Bai
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - R. J. Ragotte
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - L. F. Milles
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - B. I. M. Wicky
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - A. Courbet
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - R. J. de Haas
- Department of Physical Chemistry and Soft Matter, Wageningen University and Research, Wageningen, The Netherlands
| | - N. Bethel
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - P. J. Y. Leung
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Molecular Engineering Graduate Program, University of Washington, Seattle, WA, USA
| | - T. F. Huddy
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - S. Pellock
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - D. Tischer
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - F. Chan
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - B. Koepnick
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - H. Nguyen
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - A. Kang
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - B. Sankaran
- Berkeley Center for Structural Biology, Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - A. K. Bera
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - N. P. King
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - D. Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
|
15
|
Wang C, Chen Y, Zhang Y, Li K, Lin M, Pan F, Wu W, Zhang J. A reinforcement learning approach for protein-ligand binding pose prediction. BMC Bioinformatics 2022; 23:368. [PMID: 36076158 PMCID: PMC9454149 DOI: 10.1186/s12859-022-04912-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2022] [Accepted: 08/25/2022] [Indexed: 11/10/2022] Open
Abstract
Protein-ligand docking is an indispensable tool for computational prediction of protein functions and screening drug candidates. Despite significant progress over the past two decades, it is still a challenging problem, characterized by the still limited understanding of the energetics between proteins and ligands, and the vast conformational space that has to be searched to find a satisfactory solution. In this project, we developed a novel reinforcement learning (RL) approach, the asynchronous advantage actor-critic model (A3C), to address the protein-ligand docking problem. The overall framework consists of two models. During the search process, the agent takes an action selected by the actor model based on the current location. The critic model then evaluates this action and predicts the distance between the current location and the true binding site. Experimental results showed that in both single- and multi-atom cases, our model improves binding site prediction substantially compared to a naïve model. For the single-atom ligand, copper ion (Cu²⁺), the model-predicted binding sites have a median root-mean-square deviation (RMSD) of 2.39 Å to the true binding sites when starting from random starting locations. For the multi-atom ligand, sulfate ion (SO₄²⁻), the predicted binding sites have a median RMSD of 3.82 Å to the true binding sites. The ligand-specific models built in this study can be used in solvent mapping studies and the RL framework can be readily scaled up to larger and more diverse sets of ligands.
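The median RMSD values reported in this abstract summarize how far predicted ligand positions lie from the true binding sites. A minimal sketch of that evaluation follows, assuming predicted and reference atoms are given in the same coordinate frame and in matching order, so no superposition step is applied. The function name and all coordinates are made up for illustration and do not come from the paper.

```python
import math
from statistics import median


def rmsd(pred, true):
    """Root-mean-square deviation between matched atom coordinates (in Å).

    `pred` and `true` are equal-length sequences of (x, y, z) tuples.
    No alignment is performed; both are assumed to be in the receptor frame.
    """
    if len(pred) != len(true) or not pred:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (p - t) ** 2
        for atom_p, atom_t in zip(pred, true)
        for p, t in zip(atom_p, atom_t)
    )
    return math.sqrt(squared / len(pred))


# Illustrative runs: one single-atom case and two two-atom cases.
runs = [
    ([(1.0, 2.0, 2.0)], [(0.0, 0.0, 0.0)]),
    ([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]),
    ([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]),
]
values = [rmsd(p, t) for p, t in runs]
print(values, median(values))
```

For a single-atom ligand the RMSD reduces to the plain Euclidean distance between the predicted and true positions.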
Affiliation(s)
- Chenran Wang
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA
| | - Yang Chen
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA
| | - Yuan Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA
| | - Keqiao Li
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA
| | - Menghan Lin
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA
| | - Feng Pan
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA
| | - Wei Wu
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA.
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, 32306-4330, USA.
|
16
|
Liu Y, Zhang L, Wang W, Zhu M, Wang C, Li F, Zhang J, Li H, Chen Q, Liu H. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci 2022; 2:451-462. [PMID: 38177863 DOI: 10.1038/s43588-022-00273-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 12/27/2021] [Accepted: 06/07/2022] [Indexed: 01/06/2024]
Abstract
Several previously proposed deep learning methods to design amino acid sequences that autonomously fold into a given protein backbone yielded promising results in computational tests but did not outperform conventional energy function-based methods in wet experiments. Here we present the ABACUS-R method, which uses an encoder-decoder network trained using a multitask learning strategy to predict the sidechain type of a central residue from its three-dimensional local environment, which includes, besides other features, the types but not the conformations of the surrounding sidechains. This eliminates the need to reconstruct and optimize sidechain structures, and drastically simplifies the sequence design process. Thus iteratively applying the encoder-decoder to different central residues is able to produce self-consistent overall sequences for a target backbone. Results of wet experiments, including five structures solved by X-ray crystallography, show that ABACUS-R outperforms state-of-the-art energy function-based methods in success rate and design precision.
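The idea of iterating a per-residue predictor until the whole sequence is self-consistent can be sketched as a simple fixed-point loop. The code below is only a toy illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for a trained network, the sweep order is sequential, and none of this reproduces the ABACUS-R implementation or its update schedule.

```python
from typing import Callable, List


def iterate_to_self_consistency(
    seq: List[str],
    predict: Callable[[List[str], int], str],
    max_sweeps: int = 50,
) -> List[str]:
    """Re-predict each residue from its current context until nothing changes."""
    seq = list(seq)  # work on a copy
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(seq)):
            new_type = predict(seq, i)
            if new_type != seq[i]:
                seq[i] = new_type
                changed = True
        if not changed:  # every residue agrees with its own prediction
            break
    return seq


def toy_predict(seq: List[str], i: int) -> str:
    # Hypothetical rule standing in for a learned model:
    # position 0 is always "A"; every other position copies its left neighbour.
    return "A" if i == 0 else seq[i - 1]


result = iterate_to_self_consistency(list("GLVK"), toy_predict)
print("".join(result))
```

In a real setting the predictor would take the backbone geometry and neighbouring residue types as input, and convergence is not guaranteed, which is why the loop is capped by `max_sweeps`.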
Affiliation(s)
- Yufeng Liu
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
| | - Lu Zhang
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
| | - Weilun Wang
- CAS Key Laboratory of GIPAS, School of Information Science and Technology, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, China
| | - Min Zhu
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
| | - Chenchen Wang
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
| | - Fudong Li
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
| | - Jiahai Zhang
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
| | - Houqiang Li
- CAS Key Laboratory of GIPAS, School of Information Science and Technology, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, China.
| | - Quan Chen
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China.
- Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China.
| | - Haiyan Liu
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China.
- Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China.
- School of Data Science, University of Science and Technology of China, Hefei, Anhui, China.
|
17
|
Behjati A, Zare-Mirakabad F, Arab SS, Nowzari-Dalini A. Protein sequence profile prediction using ProtAlbert transformer. Comput Biol Chem 2022; 99:107717. [DOI: 10.1016/j.compbiolchem.2022.107717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Revised: 06/03/2022] [Accepted: 06/21/2022] [Indexed: 11/03/2022]
|
18
|
Talluri S. Algorithms for protein design. Adv Protein Chem Struct Biol 2022; 130:1-38. [PMID: 35534105 DOI: 10.1016/bs.apcsb.2022.01.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Computational Protein Design has the potential to contribute to major advances in enzyme technology, vaccine design, receptor-ligand engineering, biomaterials, nanosensors, and synthetic biology. Although Protein Design is a challenging problem, proteins can be designed by experts in Protein Design, as well as by non-experts whose primary interests are in the applications of Protein Design. The increased accessibility of Protein Design technology is attributable to the accumulated knowledge and experience with Protein Design as well as to the availability of software and online resources. The objective of this review is to serve as a guide to the relevant literature with a focus on the novel methods and algorithms that have been developed or applied for Protein Design, and to assist in the selection of algorithms for Protein Design. Novel algorithms and models that have been introduced to utilize the enormous amount of experimental data and novel computational hardware have the potential for producing substantial increases in the accuracy, reliability and range of applications of designed proteins.
Affiliation(s)
- Sekhar Talluri
- Department of Biotechnology, GITAM, Visakhapatnam, India.
|
19
|
Tenorio CA, Parker JB, Blaber M. Functionalization of a symmetric protein scaffold: Redundant folding nuclei and alternative oligomeric folding pathways. Protein Sci 2022; 31:e4301. [PMID: 35481645 PMCID: PMC8996475 DOI: 10.1002/pro.4301] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/06/2021] [Revised: 03/12/2022] [Accepted: 03/15/2022] [Indexed: 02/02/2023]
Abstract
Successful de novo protein design ideally targets specific folding kinetics, stability thermodynamics, and biochemical functionality, and the simultaneous achievement of all these criteria in a single-step design is challenging. Protein design is potentially simplified by separating the problem into two steps: (a) an initial design of a protein "scaffold" having appropriate folding kinetics and stability thermodynamics, followed by (b) appropriate functional mutation, possibly involving insertion of a peptide functional "cassette." This stepwise approach can also separate the orthogonal effects of the "stability/function" and "foldability/function" tradeoffs commonly observed in protein design. If the scaffold is a protein architecture having an exact rotational symmetry, then there is the potential for redundant folding nuclei and multiple equivalent sites of functionalization, thereby enabling broader functional adaptation. We describe such a "scaffold" and functional "cassette" design strategy applied to a β-trefoil threefold symmetric architecture and a heparin ligand functionality. The results support the availability of redundant folding nuclei within this symmetric architecture, and also identify a minimal peptide cassette conferring heparin affinity. The results also identify an energy barrier of destabilization that switches the protein folding pathway from monomeric to trimeric, thereby identifying another potential advantage of symmetric protein architecture in de novo design.
Affiliation(s)
- Connie A. Tenorio
- Department of Biomedical Sciences, Florida State University, Tallahassee, Florida, USA
| | - Joseph B. Parker
- Department of Biomedical Sciences, Florida State University, Tallahassee, Florida, USA
| | - Michael Blaber
- Department of Biomedical Sciences, Florida State University, Tallahassee, Florida, USA
|
20
|
Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022; 23:6554124. [PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 12/16/2021] [Revised: 02/26/2022] [Accepted: 03/01/2022] [Indexed: 12/11/2022] Open
Abstract
Proteins with desired functions and properties are important in fields like nanotechnology and biomedicine. De novo protein design enables the production of previously unseen proteins from the ground up and is regarded as key to addressing real societal challenges. The recent introduction of deep learning into design methods has had a transformative influence and is expected to represent a promising and exciting future direction. In this review, we look back over the major aspects of current advances in deep-learning-based design procedures and illustrate their novelty in comparison with conventional knowledge-based approaches through notable cases. We not only describe deep learning developments in structure-based protein design and direct sequence design, but also highlight recent applications of deep reinforcement learning in protein design. Future perspectives on design goals, challenges and opportunities are also comprehensively discussed.
Affiliation(s)
- Wenze Ding
- School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
- School of Future Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
| | - Kenta Nakai
- Institute of Medical Science, the University of Tokyo, Tokyo 1088639, Japan
| | - Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
|
21
|
Abstract
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design. Rational protein design to achieve a given protein backbone conformation is needed to engineer specific functions. Here Anand et al. describe a machine learning method using a learned neural network potential for fixed-backbone protein design.
|
22
|
Ovchinnikov S, Huang PS. Structure-based protein design with deep learning. Curr Opin Chem Biol 2021; 65:136-144. [PMID: 34547592 PMCID: PMC8671290 DOI: 10.1016/j.cbpa.2021.08.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 06/07/2021] [Accepted: 08/13/2021] [Indexed: 12/11/2022]
Abstract
Since the first revelation of proteins functioning as macromolecular machines through their three dimensional structures, researchers have been intrigued by the marvelous ways the biochemical processes are carried out by proteins. The aspiration to understand protein structures has fueled extensive efforts across different scientific disciplines. In recent years, it has been demonstrated that proteins with new functionality or shapes can be designed via structure-based modeling methods, and the design strategies have combined all available information - but largely piece-by-piece - from sequence derived statistics to the detailed atomic-level modeling of chemical interactions. Despite the significant progress, incorporating data-derived approaches through the use of deep learning methods can be a game changer. In this review, we summarize current progress, compare the arc of developing the deep learning approaches with the conventional methods, and describe the motivation and concepts behind current strategies that may lead to potential future opportunities.
Affiliation(s)
- Sergey Ovchinnikov
- John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, 02138, USA.
| | - Po-Ssu Huang
- Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
|
23
|
Defresne M, Barbe S, Schiex T. Protein Design with Deep Learning. Int J Mol Sci 2021; 22:11741. [PMID: 34769173 PMCID: PMC8584038 DOI: 10.3390/ijms222111741] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/30/2021] [Revised: 10/23/2021] [Accepted: 10/26/2021] [Indexed: 12/21/2022] Open
Abstract
Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture respectively 1D and 3D information. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.
Affiliation(s)
- Marianne Defresne
- Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
| | - Sophie Barbe
- Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
| | - Thomas Schiex
- Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
|
24
|
Maguire JB, Grattarola D, Mulligan VK, Klyshko E, Melo H. XENet: Using a new graph convolution to accelerate the timeline for protein design on quantum computers. PLoS Comput Biol 2021; 17:e1009037. [PMID: 34570773 PMCID: PMC8496835 DOI: 10.1371/journal.pcbi.1009037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/26/2021] [Revised: 10/07/2021] [Accepted: 09/14/2021] [Indexed: 11/30/2022] Open
Abstract
Graph representations are traditionally used to represent protein structures in sequence design protocols in which the protein backbone conformation is known. This infrequently extends to machine learning projects: existing graph convolution algorithms have shortcomings when representing protein environments. One reason for this is the lack of emphasis on edge attributes during message-passing operations. Another reason is the traditionally shallow nature of graph neural network architectures. Here we introduce an improved message-passing operation that is better equipped to model local kinematics problems such as protein design. Our approach, XENet, pays special attention to both incoming and outgoing edge attributes. We compare XENet against existing graph convolutions in an attempt to decrease rotamer sample counts in Rosetta's rotamer substitution protocol, used for protein side-chain optimization and sequence design. This use case is motivating because it both reduces the size of the search space for classical side-chain optimization algorithms, and allows larger protein design problems to be solved with quantum algorithms on near-term quantum computers with limited qubit counts. XENet outperformed competing models while also displaying a greater tolerance for deeper architectures. We found that XENet was able to decrease rotamer counts by 40% without loss in quality. This decreased the memory consumption for classical pre-computation of rotamer energies in our use case by more than a factor of 3, the qubit consumption for an existing sequence design quantum algorithm by 40%, and the size of the solution space by a factor of 165. Additionally, XENet displayed an ability to handle deeper architectures than competing convolutions.
Affiliation(s)
- Jack B. Maguire
- Menten AI, Inc., Palo Alto, California, United States of America
| | - Daniele Grattarola
- Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
| | - Vikram Khipple Mulligan
- Center for Computational Biology, Flatiron Institute, New York, New York, United States of America
| | - Eugene Klyshko
- Menten AI, Inc., Palo Alto, California, United States of America
- Department of Physics, University of Toronto, Toronto, Ontario, Canada
| | - Hans Melo
- Menten AI, Inc., Palo Alto, California, United States of America
|
25
|
Boniolo F, Dorigatti E, Ohnmacht AJ, Saur D, Schubert B, Menden MP. Artificial intelligence in early drug discovery enabling precision medicine. Expert Opin Drug Discov 2021; 16:991-1007. [PMID: 34075855 DOI: 10.1080/17460441.2021.1918096] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 02/08/2023]
Abstract
Introduction: Precision medicine is the concept of treating diseases based on environmental factors, lifestyles, and molecular profiles of patients. This approach has been found to increase success rates of clinical trials and accelerate drug approvals. However, current precision medicine applications in early drug discovery use only a handful of molecular biomarkers to make decisions, whilst clinics gear up to capture the full molecular landscape of patients in the near future. This deep multi-omics characterization demands new analysis strategies to identify appropriate treatment regimens, which we envision will be pioneered by artificial intelligence. Areas covered: In this review, the authors discuss the current state of drug discovery in precision medicine and present our vision of how artificial intelligence will impact biomarker discovery and drug design. Expert opinion: Precision medicine is expected to revolutionize modern medicine; however, its traditional form is focusing on a few biomarkers, thus not equipped to leverage the full power of molecular landscapes. For learning how the development of drugs can be tailored to the heterogeneity of patients across their molecular profiles, artificial intelligence algorithms are the next frontier in precision medicine and will enable a fully personalized approach in drug design, and thus ultimately impacting clinical practice.
Affiliation(s)
- Fabio Boniolo
  - Institute of Computational Biology, Helmholtz Zentrum München - German Research Centre for Environmental Health, Munich, Germany
  - School of Medicine, Chair of Translational Cancer Research and Institute for Experimental Cancer Therapy, Klinikum Rechts Der Isar, Technische Universität München, Munich, Germany
- Emilio Dorigatti
  - Institute of Computational Biology, Helmholtz Zentrum München - German Research Centre for Environmental Health, Munich, Germany
  - Statistical Learning and Data Science, Department of Statistics, Ludwig Maximilian Universität München, Munich, Germany
- Alexander J Ohnmacht
  - Institute of Computational Biology, Helmholtz Zentrum München - German Research Centre for Environmental Health, Munich, Germany
  - Department of Biology, Ludwig-Maximilians University Munich, Martinsried, Germany
- Dieter Saur
  - School of Medicine, Chair of Translational Cancer Research and Institute for Experimental Cancer Therapy, Klinikum Rechts Der Isar, Technische Universität München, Munich, Germany
- Benjamin Schubert
  - Institute of Computational Biology, Helmholtz Zentrum München - German Research Centre for Environmental Health, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Garching, Germany
- Michael P Menden
  - Institute of Computational Biology, Helmholtz Zentrum München - German Research Centre for Environmental Health, Munich, Germany
  - Department of Biology, Ludwig-Maximilians University Munich, Martinsried, Germany
  - German Centre for Diabetes Research (DZD e.V.), Neuherberg, Germany
|
26
|
Frappier V, Keating AE. Data-driven computational protein design. Curr Opin Struct Biol 2021; 69:63-69. [PMID: 33910104 DOI: 10.1016/j.sbi.2021.03.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 12/06/2020] [Revised: 03/18/2021] [Accepted: 03/19/2021] [Indexed: 01/28/2023]
Abstract
Computational protein design can generate proteins not found in nature that adopt desired structures and perform novel functions. Although proteins could, in theory, be designed with ab initio methods, practical success has come from using large amounts of data that describe the sequences, structures, and functions of existing proteins and their variants. We present recent creative uses of multiple-sequence alignments, protein structures, and high-throughput functional assays in computational protein design. Approaches range from enhancing structure-based design with experimental data to building regression models to training deep neural nets that generate novel sequences. Looking ahead, deep learning will be increasingly important for maximizing the value of data for protein design.
Affiliation(s)
- Vincent Frappier
  - Generate Biomedicines, 26 Landsdowne Street, Cambridge, MA, 02139, USA
- Amy E Keating
  - MIT Departments of Biology and Biological Engineering, 77 Massachusetts Ave., Cambridge, MA, 02139, USA
|
27
|
Gao W, Mahajan SP, Sulam J, Gray JJ. Deep Learning in Protein Structural Modeling and Design. Patterns (New York, N.Y.) 2020; 1:100142. [PMID: 33336200 PMCID: PMC7733882 DOI: 10.1016/j.patter.2020.100142] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Indexed: 12/13/2022]
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understand and engineer biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both computational biologists to gain familiarity with the deep learning methods applied in protein modeling, and computer scientists to gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Affiliation(s)
- Wenhao Gao
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Sai Pooja Mahajan
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeremias Sulam
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeffrey J. Gray
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
|
28
|
Qi Y, Zhang JZH. DenseCPD: Improving the Accuracy of Neural-Network-Based Computational Protein Sequence Design with DenseNet. J Chem Inf Model 2020; 60:1245-1252. [DOI: 10.1021/acs.jcim.0c00043] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 01/03/2023]
Affiliation(s)
- Yifei Qi
  - Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
  - NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
- John Z. H. Zhang
  - Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
  - NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
  - Department of Chemistry, New York University, New York, New York 10003, United States
|