1
Ferruz N, Heinzinger M, Akdel M, Goncearenco A, Naef L, Dallago C. From sequence to function through structure: Deep learning for protein design. Comput Struct Biotechnol J 2022; 21:238-250. [PMID: 36544476; PMCID: PMC9755234; DOI: 10.1016/j.csbj.2022.11.014] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 08/31/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/20/2022]
Abstract
The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that allows users to go from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.
Key Words
- ADMM, Alternating Direction Method of Multipliers
- CNN, Convolutional Neural Network
- DL, Deep learning
- Deep learning
- Drug discovery
- FNN, fully-connected neural network
- GAN, Generative Adversarial Network
- GCN, Graph Convolutional Network
- GNN, Graph Neural Network
- GO, Gene Ontology
- GVP, Geometric Vector Perceptron
- LSTM, Long Short-Term Memory
- MLP, Multilayer Perceptron
- MSA, Multiple Sequence Alignment
- NLP, Natural Language Processing
- NSR, Natural Sequence Recovery
- Protein design
- Protein language models
- Protein prediction
- VAE, Variational Autoencoder
- pLM, protein Language Model
Affiliation(s)
- Noelia Ferruz
  - Institute of Informatics and Applications, University of Girona, Girona, Spain
  - Department of Biochemistry, University of Bayreuth, Bayreuth, Germany
- Michael Heinzinger
  - Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
- Mehmet Akdel
  - VantAI, 151 W 42nd Street, New York, NY 10036, United States
- Luca Naef
  - VantAI, 151 W 42nd Street, New York, NY 10036, United States
- Christian Dallago
  - Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
  - VantAI, 151 W 42nd Street, New York, NY 10036, United States
  - NVIDIA DE GmbH, Einsteinstraße 172, 81677 München, Germany
2
Tamposis IA, Sarantopoulou D, Theodoropoulou MC, Stasi EA, Kontou PI, Tsirigos KD, Bagos PG. Hidden neural networks for transmembrane protein topology prediction. Comput Struct Biotechnol J 2021; 19:6090-6097. [PMID: 34849210; PMCID: PMC8606341; DOI: 10.1016/j.csbj.2021.11.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/28/2021] [Revised: 11/05/2021] [Accepted: 11/06/2021] [Indexed: 11/21/2022]
Abstract
Hidden Markov Models (HMMs) are amongst the most successful methods for predicting protein features in biological sequence analysis. However, there are biological problems where the Markovian assumption is not sufficient since the sequence context can provide useful information for prediction purposes. Several extensions of HMMs have appeared in the literature in order to overcome their limitations. We apply here a hybrid method that combines HMMs and Neural Networks (NNs), termed Hidden Neural Networks (HNNs), for biological sequence analysis in a straightforward manner. In this framework, the traditional HMM probability parameters are replaced by NN outputs. As a case study, we focus on the topology prediction of alpha-helical and beta-barrel membrane proteins. The HNNs show performance gains compared to standard HMMs and the respective predictors outperform the top-scoring methods in the field. The implementation of HNNs can be found in the package JUCHMME, downloadable from http://www.compgen.org/tools/juchmme and https://github.com/pbagos/juchmme. The updated PRED-TMBB2 and HMM-TM prediction servers can be accessed at www.compgen.org.
Key Words
- CHMM, Class Hidden Markov Models
- CML, Conditional Maximum Likelihood
- EM, Expectation-Maximization
- HMM, Hidden Markov Models
- HNN, Hidden Neural Networks
- Hidden Markov Models
- Hidden Neural Networks
- JUCHMME, Java Utility for Class Hidden Markov Models and Extensions
- MCC, Matthews Correlation Coefficient
- ML, Maximum Likelihood
- MSA, Multiple Sequence Alignment
- Membrane proteins
- NN, Neural Networks
- Neural Networks
- Protein structure prediction
- SOV, segment overlap
- Sequence analysis
Affiliation(s)
- Ioannis A. Tamposis
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
- Dimitra Sarantopoulou
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Present address: National Institute on Aging, National Institutes of Health, Baltimore, Maryland, USA
- Evangelia A. Stasi
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
- Panagiota I. Kontou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
- Pantelis G. Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
3
Joshi AG, Harini K, Meenakshi I, Shafi KM, Pasha SN, Mahita J, Sajeevan RS, Karpe SD, Ghosh P, Nitish S, Gandhimathi A, Mathew OK, Prasanna SH, Malini M, Mutt E, Naika M, Ravooru N, Rao RM, Shingate PN, Sukhwal A, Sunitha MS, Upadhyay AK, Vinekar RS, Sowdhamini R. A knowledge-driven protocol for prediction of proteins of interest with an emphasis on biosynthetic pathways. MethodsX 2020; 7:101053. [PMID: 33024710; PMCID: PMC7528181; DOI: 10.1016/j.mex.2020.101053] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/11/2020] [Accepted: 08/29/2020] [Indexed: 11/28/2022]
Abstract
This protocol describes a stepwise process to identify proteins of interest from a query proteome derived from NGS data. We implemented this protocol on the Moringa oleifera transcriptome to identify proteins involved in secondary metabolite and vitamin biosynthesis and ion transport. This knowledge-driven protocol identifies proteins using an integrated approach involving sensitive sequence search and evolutionary relationships. We make use of functionally important residues (FIR) specific for the query protein family, identified through its homologous sequences and literature. We screen protein hits based on their clustering with true homologues through phylogenetic tree reconstruction, complemented with the FIR mapping. The protocol was validated for the protein hits through qRT-PCR and transcriptome quantification. Our protocol demonstrated higher specificity than other methods, particularly in distinguishing cross-family hits. This protocol was effective in transcriptome data analysis of M. oleifera as described in Pasha et al.
- Knowledge-driven protocol to identify secondary metabolite synthesizing proteins in a highly specific manner.
- Use of functionally important residues for screening of true hits.
- Beneficial for metabolite pathway reconstruction in any (species, metagenomics) NGS data.
Affiliation(s)
- Adwait G Joshi
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- K Harini
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Iyer Meenakshi
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- K Mohamed Shafi
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
  - The University of Trans-Disciplinary Health Sciences and Technology (TDU), Yelahanka, Bangalore 560064, Karnataka, India
- Shaik Naseer Pasha
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Jarjapu Mahita
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Radha Sivarajan Sajeevan
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Snehal D Karpe
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Pritha Ghosh
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Sathyanarayanan Nitish
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
  - The University of Trans-Disciplinary Health Sciences and Technology (TDU), Yelahanka, Bangalore 560064, Karnataka, India
- A Gandhimathi
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Oommen K Mathew
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Subramanian Hari Prasanna
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Manoharan Malini
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Eshita Mutt
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Mahantesha Naika
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Nithin Ravooru
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Rajas M Rao
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Prashant N Shingate
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Anshul Sukhwal
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Margaret S Sunitha
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Atul K Upadhyay
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
  - Department of Biotechnology, Thapar Institute of Engineering and Technology, Patiala 147004, Punjab, India
- Rithvik S Vinekar
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
- Ramanathan Sowdhamini
  - National Centre for Biological Sciences (NCBS-TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India