101
Zheng R, Huang Z, Deng L. Large-scale predicting protein functions through heterogeneous feature fusion. Brief Bioinform 2023:bbad243. [PMID: 37401369 DOI: 10.1093/bib/bbad243] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 03/12/2023] [Revised: 05/18/2023] [Accepted: 06/12/2023] [Indexed: 07/05/2023] Open
Abstract
As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement of coverage is because the number of structures predicted by AlphaFold is greatly increased, and on the other hand, PredGO can extensively use non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.
Affiliation(s)
- Rongtao Zheng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Zhijian Huang
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Lei Deng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
102
Upadhyay V, Boorla VS, Maranas CD. Rank-ordering of known enzymes as starting points for re-engineering novel substrate activity using a convolutional neural network. Metab Eng 2023; 78:171-182. [PMID: 37301359 DOI: 10.1016/j.ymben.2023.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 05/19/2023] [Accepted: 06/02/2023] [Indexed: 06/12/2023]
Abstract
Retro-biosynthetic approaches have made significant advances in predicting synthesis routes of target biofuel, bio-renewable or bio-active molecules. The use of only cataloged enzymatic activities limits the discovery of new production routes. Recent retro-biosynthetic algorithms increasingly use novel conversions that require altering the substrate or cofactor specificities of existing enzymes while connecting pathways leading to a target metabolite. However, identifying and re-engineering enzymes for desired novel conversions are currently the bottlenecks in implementing such designed pathways. Herein, we present EnzRank, a convolutional neural network (CNN) based approach, to rank-order existing enzymes in terms of their suitability to undergo successful protein engineering through directed evolution or de novo design towards a desired specific substrate activity. We train the CNN model on 11,800 known active enzyme-substrate pairs from the BRENDA database as positive samples and data generated by scrambling these pairs as negative samples using substrate dissimilarity between an enzyme's native substrate and all other molecules present in the dataset using Tanimoto similarity score. EnzRank achieves an average recovery rate of 80.72% and 73.08% for positive and negative pairs on test data after using a 10-fold holdout method for training and cross-validation. We further developed a web-based user interface (available at https://huggingface.co/spaces/vuu10/EnzRank) to predict enzyme-substrate activity using SMILES strings of substrates and enzyme sequence as input to allow convenient and easy-to-use access to EnzRank. In summary, this effort can aid de novo pathway design tools to prioritize starting enzyme re-engineering candidates for novel reactions as well as in predicting the potential secondary activity of enzymes in cell metabolism.
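The negative-sampling idea described in this abstract can be illustrated with a minimal Python sketch. It is not EnzRank's code: the fingerprints are toy sets of bit indices, and the function names, molecule names and the 0.3 similarity cutoff are assumptions chosen only for illustration. The sketch computes the Tanimoto similarity between two fingerprints and pairs each enzyme with molecules that are dissimilar to its native substrate.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def negative_pairs(
    positives: List[Tuple[str, str]],
    fingerprints: Dict[str, FrozenSet[int]],
    max_similarity: float = 0.3,  # hypothetical cutoff, not from the paper
) -> List[Tuple[str, str]]:
    """Pair each enzyme with molecules dissimilar to its native substrate."""
    known = set(positives)
    negatives = []
    for enzyme, native in positives:
        for molecule, fp in fingerprints.items():
            if molecule == native or (enzyme, molecule) in known:
                continue  # never relabel a known active pair as negative
            if tanimoto(fingerprints[native], fp) <= max_similarity:
                negatives.append((enzyme, molecule))
    return negatives


# Toy data: bit indices are made up and carry no chemical meaning.
toy_fingerprints = {
    "glucose": frozenset({1, 2, 3, 4}),
    "fructose": frozenset({1, 2, 3, 5}),
    "octane": frozenset({7, 8, 9}),
}
toy_positives = [("hexokinase", "glucose")]
toy_negatives = negative_pairs(toy_positives, toy_fingerprints)
```

In practice the fingerprints would come from a cheminformatics toolkit applied to SMILES strings; the set-based version here only shows the arithmetic.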
Affiliation(s)
- Vikas Upadhyay
- Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Veda Sheersh Boorla
- Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Costas D Maranas
- Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA.
103
Biharie K, Michielsen L, Reinders MJT, Mahfouz A. Cell type matching across species using protein embeddings and transfer learning. Bioinformatics 2023; 39:i404-i412. [PMID: 37387141 PMCID: PMC10311290 DOI: 10.1093/bioinformatics/btad248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Knowing the relation between cell types is crucial for translating experimental results from mice to humans. Establishing cell type matches, however, is hindered by the biological differences between the species. A substantial amount of evolutionary information between genes that could be used to align the species is discarded by most of the current methods since they only use one-to-one orthologous genes. Some methods try to retain the information by explicitly including the relation between genes, though not without caveats. RESULTS: In this work, we present a model to transfer and align cell types in cross-species analysis (TACTiCS). First, TACTiCS uses a natural language processing model to match genes using their protein sequences. Next, TACTiCS employs a neural network to classify cell types within a species. Afterward, TACTiCS uses transfer learning to propagate cell type labels between species. We applied TACTiCS on scRNA-seq data of the primary motor cortex of human, mouse, and marmoset. Our model can accurately match and align cell types on these datasets. Moreover, our model outperforms Seurat and the state-of-the-art method SAMap. Finally, we show that our gene matching method results in better cell type matches than BLAST in our model. AVAILABILITY AND IMPLEMENTATION: The implementation is available on GitHub (https://github.com/kbiharie/TACTiCS). The preprocessed datasets and trained models can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.7582460).
Affiliation(s)
- Kirti Biharie
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Lieke Michielsen
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Marcel J T Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Ahmed Mahfouz
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
104
Mohseni Behbahani Y, Laine E, Carbone A. Deep Local Analysis deconstructs protein-protein interfaces and accurately estimates binding affinity changes upon mutation. Bioinformatics 2023; 39:i544-i552. [PMID: 37387162 DOI: 10.1093/bioinformatics/btad231] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: The spectacular recent advances in protein and protein complex structure prediction hold promise for reconstructing interactomes at large-scale and residue resolution. Beyond determining the 3D arrangement of interacting partners, modeling approaches should be able to unravel the impact of sequence variations on the strength of the association. RESULTS: In this work, we report on Deep Local Analysis, a novel and efficient deep learning framework that relies on a strikingly simple deconstruction of protein interfaces into small locally oriented residue-centered cubes and on 3D convolutions recognizing patterns within cubes. Merely based on the two cubes associated with the wild-type and the mutant residues, DLA accurately estimates the binding affinity change for the associated complexes. It achieves a Pearson correlation coefficient of 0.735 on about 400 mutations on unseen complexes. Its generalization capability on blind datasets of complexes is higher than the state-of-the-art methods. We show that taking into account the evolutionary constraints on residues contributes to predictions. We also discuss the influence of conformational variability on performance. Beyond the predictive power on the effects of mutations, DLA is a general framework for transferring the knowledge gained from the available non-redundant set of complex protein structures to various tasks. For instance, given a single partially masked cube, it recovers the identity and physicochemical class of the central residue. Given an ensemble of cubes representing an interface, it predicts the function of the complex. AVAILABILITY AND IMPLEMENTATION: Source code and models are available at http://gitlab.lcqb.upmc.fr/DLA/DLA.git.
Affiliation(s)
- Yasser Mohseni Behbahani
- Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France
- Elodie Laine
- Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France
- Alessandra Carbone
- Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France
105
Lobo F, González MS, Boto A, Pérez de la Lastra JM. Prediction of Antifungal Activity of Antimicrobial Peptides by Transfer Learning from Protein Pretrained Models. Int J Mol Sci 2023; 24:10270. [PMID: 37373415 DOI: 10.3390/ijms241210270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Revised: 06/12/2023] [Accepted: 06/14/2023] [Indexed: 06/29/2023] Open
Abstract
Peptides with antifungal activity have gained significant attention due to their potential therapeutic applications. In this study, we explore the use of pretrained protein models as feature extractors to develop predictive models for antifungal peptide activity. Various machine learning classifiers were trained and evaluated. Our AFP predictor achieved comparable performance to current state-of-the-art methods. Overall, our study demonstrates the effectiveness of pretrained models for peptide analysis and provides a valuable tool for predicting antifungal peptide activity and potentially other peptide properties.
Affiliation(s)
- Fernando Lobo
- Programa Agustín de Betancourt, Universidad de La Laguna, 38206 La Laguna, Tenerife, Spain
- Maily Selena González
- Instituto de Productos Naturales y Agrobiología del CSIC, Avda. Astrofísico Fco. Sánchez, 3, 38206 La Laguna, Tenerife, Spain
- Alicia Boto
- Instituto de Productos Naturales y Agrobiología del CSIC, Avda. Astrofísico Fco. Sánchez, 3, 38206 La Laguna, Tenerife, Spain
- José Manuel Pérez de la Lastra
- Instituto de Productos Naturales y Agrobiología del CSIC, Avda. Astrofísico Fco. Sánchez, 3, 38206 La Laguna, Tenerife, Spain
106
Murad T, Ali S, Patterson M. Exploring the Potential of GANs in Biological Sequence Analysis. Biology 2023; 12:854. [PMID: 37372139 DOI: 10.3390/biology12060854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/03/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023]
Abstract
Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become global pandemics. New tools for biological sequence analysis are provided by machine learning (ML) technologies to effectively analyze the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, generally associated with biological sequence datasets, which hinders their performance. Although various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resembles real data; thus, these generated data can be employed to enhance the ML models' performance by eradicating the class imbalance problem for biological sequence analysis. We perform four distinct classification tasks by using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host) and our results illustrate that GANs can improve the overall classification performance.
Affiliation(s)
- Taslim Murad
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Sarwan Ali
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
107
Singh R, Sledzieski S, Bryson B, Cowen L, Berger B. Contrastive learning in protein language space predicts interactions between drugs and protein targets. Proc Natl Acad Sci U S A 2023; 120:e2220778120. [PMID: 37289807 PMCID: PMC10268324 DOI: 10.1073/pnas.2220778120] [Citation(s) in RCA: 24] [Impact Index Per Article: 24.0] [Received: 12/06/2022] [Accepted: 04/10/2023] [Indexed: 06/10/2023] Open
Abstract
Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, ConPLex embeddings are interpretable, which enables us to visualize the drug-target embedding space and use embeddings to characterize the function of human cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico drug screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.
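The two ideas named in this abstract, scoring binding by the distance between learned representations and training with a protein-anchored contrastive objective, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not ConPLex's implementation: the choice of cosine distance, the triplet form of the loss, the 0.5 margin, and all function names and vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(
    protein: Sequence[float],
    binder: Sequence[float],
    decoy: Sequence[float],
    margin: float = 0.5,  # hypothetical margin
) -> float:
    """Protein-anchored triplet loss: pull the binder close, push the decoy away."""
    gap = cosine_distance(protein, binder) - cosine_distance(protein, decoy)
    return max(0.0, gap + margin)


# Toy co-embeddings in a shared 2-D space (values are made up).
target = [1.0, 0.0]
true_drug = [1.0, 0.0]
decoy_drug = [0.0, 1.0]

well_separated = triplet_loss(target, true_drug, decoy_drug)
badly_separated = triplet_loss(target, decoy_drug, true_drug)
```

Because a prediction reduces to one distance computation per protein-drug pair once the embeddings exist, screening a large compound library amounts to a nearest-neighbour search in the shared space.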
Affiliation(s)
- Rohit Singh
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Samuel Sledzieski
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Bryan Bryson
- Ragon Institute of MGH, MIT and Harvard, Cambridge, MA 02139
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
- Lenore Cowen
- Department of Computer Science, Tufts University, Medford, MA 02155
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139
108
Nicholas Chua B, Mei Guo W, Teng Wong H, Siak-Wei Ow D, Leng Ho P, Koh W, Koay A, Tian Wong F. A sweeter future: Using protein language models for exploring sweeter brazzein homologs. Food Chem 2023; 426:136580. [PMID: 37331142 DOI: 10.1016/j.foodchem.2023.136580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 05/23/2023] [Accepted: 06/06/2023] [Indexed: 06/20/2023]
Abstract
With growing concerns over the health impact of sugar, brazzein offers a viable alternative due to its sweetness, thermostability, and low risk profile. Here, we demonstrated the ability of protein language models to design new brazzein homologs with improved thermostability and potentially higher sweetness, resulting in new diverse optimized amino acid sequences that improve structural and functional features beyond what conventional methods could achieve. This innovative approach resulted in the identification of unexpected mutations, thereby generating new possibilities for protein engineering. To facilitate the characterization of the brazzein mutants, a simplified procedure was developed for expressing and analyzing related proteins. This process involved an efficient purification method using Lactococcus lactis (L. lactis), a generally recognized as safe (GRAS) bacterium, as well as taste receptor assays to evaluate sweetness. The study successfully demonstrated the potential of computational design in producing a more heat-resistant and potentially more palatable brazzein variant, V23.
Affiliation(s)
- Bryan Nicholas Chua
- Molecular Engineering Laboratory, Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, #07-06, Proteos, Singapore 138673, Republic of Singapore
- Wei Mei Guo
- Singapore Institute of Food and Biotechnology Innovation (SIFBI), Agency for Science, Technology and Research (A*STAR), 31 Biopolis Way, #02-01, Nanos, Singapore 138669, Republic of Singapore
- Han Teng Wong
- Molecular Engineering Laboratory, Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, #07-06, Proteos, Singapore 138673, Republic of Singapore
- Dave Siak-Wei Ow
- Bioprocessing Technology Institute (BTI), Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01, Centros, Singapore 138668, Republic of Singapore
- Pooi Leng Ho
- Bioprocessing Technology Institute (BTI), Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01, Centros, Singapore 138668, Republic of Singapore
- Winston Koh
- Institute of Bioengineering and Bioimaging (IBB), Agency for Science, Technology and Research (A*STAR), 31 Biopolis Way, #07-01, Nanos, Singapore 138669, Republic of Singapore; Bioinformatics Institute (BII), Agency of Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Republic of Singapore.
- Ann Koay
- Singapore Institute of Food and Biotechnology Innovation (SIFBI), Agency for Science, Technology and Research (A*STAR), 31 Biopolis Way, #02-01, Nanos, Singapore 138669, Republic of Singapore.
- Fong Tian Wong
- Molecular Engineering Laboratory, Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, #07-06, Proteos, Singapore 138673, Republic of Singapore; Institute of Sustainability for Chemicals, Energy and Environment (ISCE(2)), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, Neuros, #07-01, Singapore 138665, Republic of Singapore.
109
Ouellet S, Ferguson L, Lau AZ, Lim TKY. CysPresso: a classification model utilizing deep learning protein representations to predict recombinant expression of cysteine-dense peptides. BMC Bioinformatics 2023; 24:200. [PMID: 37193950 PMCID: PMC10189939 DOI: 10.1186/s12859-023-05327-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 05/08/2023] [Indexed: 05/18/2023] Open
Abstract
BACKGROUND: Cysteine-dense peptides (CDPs) are an attractive pharmaceutical scaffold that displays extreme biochemical properties, low immunogenicity, and the ability to bind targets with high affinity and selectivity. While many CDPs have potential and confirmed therapeutic uses, synthesis of CDPs is a challenge. Recent advances have made the recombinant expression of CDPs a viable alternative to chemical synthesis. Moreover, identifying CDPs that can be expressed in mammalian cells is crucial in predicting their compatibility with gene therapy and mRNA therapy. Currently, we lack the ability to identify CDPs that will express recombinantly in mammalian cells without labour-intensive experimentation. To address this, we developed CysPresso, a novel machine learning model that predicts recombinant expression of CDPs based on primary sequence. RESULTS: We tested various protein representations generated by deep learning algorithms (SeqVec, proteInfer, AlphaFold2) for their suitability in predicting CDP expression and found that AlphaFold2 representations possessed the best predictive features. We then optimized the model by concatenation of AlphaFold2 representations, time series transformation with random convolutional kernels, and dataset partitioning. CONCLUSION: Our novel model, CysPresso, is the first to successfully predict recombinant CDP expression in mammalian cells and is particularly well suited for predicting recombinant expression of knottin peptides. When preprocessing the deep learning protein representation for supervised machine learning, we found that random convolutional kernel transformation preserves more pertinent information relevant for predicting expressibility than embedding averaging. Our study showcases the applicability of deep learning-based protein representations, such as those provided by AlphaFold2, in tasks beyond structure prediction.
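The contrast drawn in this abstract between embedding averaging and a random convolutional kernel transformation can be made concrete with a small Python sketch. It is a simplified illustration, not the CysPresso pipeline: it uses a single scalar value per residue instead of a multi-dimensional representation, and the kernel count, kernel length, seed, pooled statistics and function names are all assumptions. Two toy series with the same mean are indistinguishable after averaging but yield different kernel features, because the kernels respond to the order of the values.

```python
import random
from typing import List, Sequence, Tuple

Kernel = Tuple[List[float], float]  # (weights, bias)


def mean_pool(series: Sequence[float]) -> float:
    """Embedding averaging: collapses the residue axis to one number."""
    return sum(series) / len(series)


def make_kernels(count: int, length: int, seed: int) -> List[Kernel]:
    """Draw random convolution kernels once so every input sees the same ones."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(length)], rng.uniform(-1.0, 1.0))
        for _ in range(count)
    ]


def kernel_features(series: Sequence[float], kernels: List[Kernel]) -> List[float]:
    """For each kernel, slide it over the series and keep two pooled statistics."""
    features = []
    for weights, bias in kernels:
        width = len(weights)
        outputs = [
            sum(w * x for w, x in zip(weights, series[i:i + width])) + bias
            for i in range(len(series) - width + 1)
        ]
        features.append(max(outputs))  # strongest response
        features.append(sum(o > 0 for o in outputs) / len(outputs))  # share of positive responses
    return features


# Two toy per-residue signals with identical means but different ordering.
alternating = [0.0, 1.0, 0.0, 1.0]
blocked = [0.0, 0.0, 1.0, 1.0]
kernels = make_kernels(count=8, length=3, seed=0)
```

The resulting feature vectors would then be passed to an ordinary supervised classifier; that step is omitted here.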
Affiliation(s)
- Larissa Ferguson
- Neurobiology Division, MRC Laboratory of Molecular Biology, Cambridge, UK
- Angus Z Lau
- Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Physical Sciences Platform, Sunnybrook Research Institute, Toronto, ON, Canada
- Tony K Y Lim
- , Vancouver, Canada.
- Department of Pharmacology, University of Cambridge, Cambridge, UK.
110
Yoshimori A, Bajorath J. Motif2Mol: Prediction of New Active Compounds Based on Sequence Motifs of Ligand Binding Sites in Proteins Using a Biochemical Language Model. Biomolecules 2023; 13:biom13050833. [PMID: 37238703 DOI: 10.3390/biom13050833] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/16/2023] [Revised: 05/05/2023] [Accepted: 05/12/2023] [Indexed: 05/28/2023] Open
Abstract
In drug design, the prediction of new active compounds from protein sequence data has only been attempted in a few studies thus far. This prediction task is principally challenging because global protein sequence similarity has strong evolutionary and structural implications, but is often only vaguely related to ligand binding. Deep language models adapted from natural language processing offer new opportunities to attempt such predictions via machine translation by directly relating amino acid sequences and chemical structures to each other based on textual molecular representations. Herein, we introduce a biochemical language model with transformer architecture for the prediction of new active compounds from sequence motifs of ligand binding sites. In a proof-of-concept application on inhibitors of more than 200 human kinases, the Motif2Mol model revealed promising learning characteristics and an unprecedented ability to consistently reproduce known inhibitors of different kinases.
Affiliation(s)
- Atsushi Yoshimori
- Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-Chome, Fujisawa 251-0012, Japan
- Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
111
Kim Y, Kwon J. AttSec: protein secondary structure prediction by capturing local patterns from attention map. BMC Bioinformatics 2023; 24:183. [PMID: 37142993 PMCID: PMC10161504 DOI: 10.1186/s12859-023-05310-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2023] [Accepted: 04/27/2023] [Indexed: 05/06/2023] Open
Abstract
BACKGROUND: Protein secondary structures that link simple 1D sequences to complex 3D structures can be used as good features for describing the local properties of protein, but also can serve as key features for predicting the complex 3D structures of protein. Thus, it is very important to accurately predict the secondary structure of the protein, which contains a local structural property assigned by the pattern of hydrogen bonds formed between amino acids. In this study, we accurately predict protein secondary structure by capturing the local patterns of protein. For this objective, we present a novel prediction model, AttSec, based on transformer architecture. In particular, AttSec extracts self-attention maps corresponding to pairwise features between amino acid embeddings and passes them through 2D convolution blocks to capture local patterns. In addition, instead of using additional evolutionary information, it uses protein embedding as an input, which is generated by a language model. RESULTS: For the ProteinNet DSSP8 dataset, our model showed 11.8% better performance on the entire evaluation datasets compared with other no-evolutionary-information-based models. For the NetSurfP-2.0 DSSP8 dataset, it showed 1.2% better performance on average. There was an average performance improvement of 9.0% for the ProteinNet DSSP3 dataset and an average of 0.7% for the NetSurfP-2.0 DSSP3 dataset. CONCLUSION: We accurately predict protein secondary structure by capturing the local patterns of protein. For this objective, we present a novel prediction model, AttSec, based on transformer architecture. Although there was no dramatic accuracy improvement compared with other models, the improvement on DSSP8 was greater than that on DSSP3. This result implies that using our proposed pairwise feature could have a remarkable effect for several challenging tasks that require finely subdivided classification. The GitHub package URL is https://github.com/youjin-DDAI/AttSec.
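The pairwise feature described in this abstract, a self-attention map over amino acid embeddings, can be sketched as follows. This is a minimal illustration and not the AttSec code: it uses the raw embeddings as both queries and keys, whereas a transformer would first apply learned projections, and the function name and toy vectors are assumptions. The output is an L x L matrix, the kind of 2D input that convolution blocks could then scan for local patterns.

```python
import math
from typing import List, Sequence


def attention_map(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scaled dot-product attention weights between every pair of residues."""
    dim = len(embeddings[0])
    rows = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Toy per-residue embeddings for a 3-residue peptide (values are made up).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_map = attention_map(toy_embeddings)
```

Each row of the map sums to one, so row i can be read as how strongly residue i attends to every other residue.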
Affiliation(s)
- Youjin Kim
- Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea
- LG AI Research, Seoul, Republic of Korea
- Junseok Kwon
- Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea.
112
Flamholz ZN, Biller SJ, Kelly L. Large language models improve annotation of viral proteins. Research Square 2023:rs.3.rs-2852098. [PMID: 37205395 PMCID: PMC10187409 DOI: 10.21203/rs.3.rs-2852098/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.
Affiliation(s)
- Zachary N. Flamholz
- Department of Systems and Computational Biology, Albert Einstein College of Medicine; Bronx, NY, USA
- Steve J. Biller
- Department of Biological Sciences, Wellesley College; Wellesley, MA USA
- Libusha Kelly
- Department of Systems and Computational Biology, Albert Einstein College of Medicine; Bronx, NY, USA
- Department of Microbiology and Immunology, Albert Einstein College of Medicine; Bronx, NY, USA
113
Soylu NN, Sefer E. BERT2OME: Prediction of 2'-O-Methylation Modifications From RNA Sequence by Transformer Architecture Based on BERT. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2177-2189. [PMID: 37819796 DOI: 10.1109/tcbb.2023.3237769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/13/2023]
Abstract
Recent work on language models has resulted in state-of-the-art performance on various language tasks. Among these, Bidirectional Encoder Representations from Transformers (BERT) has focused on contextualizing word embeddings to extract context and semantics of the words. On the other hand, post-transcriptional 2'-O-methylation (Nm) RNA modification is important in various cellular tasks and related to a number of diseases. Existing high-throughput experimental techniques are slow to detect these modifications and costly for exploring these functional processes. Here, to understand the associated biological processes faster, we propose an efficient method, Bert2Ome, to infer 2'-O-methylation RNA modification sites from RNA sequences. Bert2Ome combines a BERT-based model with convolutional neural networks (CNN) to infer the relationship between the modification sites and RNA sequence content. Unlike the methods proposed so far, Bert2Ome treats each given RNA sequence as text and focuses on improving the modification prediction performance by integrating the pretrained deep learning-based language model BERT. Additionally, our transformer-based approach could infer modification sites across multiple species. According to 5-fold cross-validation, human and mouse accuracies were 99.15% and 94.35%, respectively. Similarly, ROC AUC scores were 0.99 and 0.94 for the same species. Detailed results show that Bert2Ome reduces the time consumed in biological experiments and outperforms the existing approaches across different datasets and species over multiple metrics. Additionally, deep learning approaches such as 2D CNNs are more promising in learning BERT attributes than more conventional machine learning methods.
|
114
|
Mardikoraem M, Woldring D. Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods. Pharmaceutics 2023; 15:1337. [PMID: 37242577 PMCID: PMC10224321 DOI: 10.3390/pharmaceutics15051337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 04/19/2023] [Accepted: 04/21/2023] [Indexed: 05/28/2023] Open
Abstract
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physiochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Elaboration on performance is provided over protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. 
Moreover, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).
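Illustrative note: the ranking step named in this abstract (TOPSIS with entropy weighting) can be sketched as follows. This is a minimal textbook version, not the authors' implementation. It assumes every criterion is a benefit criterion (higher is better) with strictly positive values, and the metric values in the example matrix are made up for illustration. The function names `entropy_weights` and `topsis_scores` are hypothetical.

```python
import math


def entropy_weights(matrix):
    """Entropy weighting: criteria whose values vary more across alternatives get more weight."""
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        # Clamp tiny negative values caused by floating-point rounding.
        divergences.append(max(0.0, 1.0 - entropy))
    total_div = sum(divergences)
    if total_div == 0:
        # All columns are constant: fall back to uniform weights (an assumption).
        return [1.0 / n] * n
    return [d / total_div for d in divergences]


def topsis_scores(matrix, weights):
    """Closeness of each alternative to the ideal solution (1 = best, 0 = worst)."""
    n = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_plus = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(n)))
        d_minus = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(n)))
        denom = d_plus + d_minus
        scores.append(d_minus / denom if denom > 0 else 0.5)
    return scores


# Rows: three hypothetical methods; columns: three hypothetical metrics (made-up values).
metrics = [
    [0.97, 0.95, 0.96],
    [0.90, 0.80, 0.85],
    [0.85, 0.70, 0.75],
]
weights = entropy_weights(metrics)
scores = topsis_scores(metrics, weights)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Here `ranking` lists the method indices from best to worst. A cost criterion (lower is better) would need its ideal and anti-ideal values swapped, which this sketch does not handle.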
Affiliation(s)
- Mehrsa Mardikoraem
- Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA
- Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Daniel Woldring
- Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA
- Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
|
115
|
Jha K, Karmakar S, Saha S. Graph-BERT and language model-based framework for protein-protein interaction identification. Sci Rep 2023; 13:5663. [PMID: 37024543 PMCID: PMC10079975 DOI: 10.1038/s41598-023-31612-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/28/2022] [Accepted: 03/14/2023] [Indexed: 04/08/2023] Open
Abstract
Identification of protein-protein interactions (PPI) is among the critical problems in the domain of bioinformatics. Previous studies have utilized different AI-based models for PPI classification with advances in artificial intelligence (AI) techniques. The input to these models is the features extracted from different sources of protein information, mainly sequence-derived features. In this work, we present an AI-based PPI identification model utilizing a PPI network and protein sequences. The PPI network is represented as a graph where each node is a protein pair, and an edge is defined between two nodes if there exists a common protein between these nodes. Each node in a graph has a feature vector. In this work, we have used the language model to extract feature vectors directly from protein sequences. The feature vectors for protein in pairs are concatenated and used as a node feature vector of a PPI network graph. Finally, we have used the Graph-BERT model to encode the PPI network graph with sequence-based features and learn the hidden representation of the feature vector for each node. The next step involves feeding the learned representations of nodes to the fully connected layer, the output of which is fed into the softmax layer to classify the protein interactions. To assess the efficacy of the proposed PPI model, we have performed experiments on several PPI datasets. The experimental results demonstrate that the proposed approach surpasses the existing PPI works and designed baselines in classifying PPI.
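Illustrative note: the graph construction described in this abstract (each node is a protein pair, two nodes are linked when they share a protein, and node features are the concatenated per-protein vectors) can be sketched as below. This is a minimal sketch, not the authors' code. The function name `build_pair_graph`, the protein identifiers, and the two-dimensional toy vectors standing in for language-model embeddings are all hypothetical.

```python
from itertools import combinations


def build_pair_graph(pairs, embeddings):
    """Build a graph whose nodes are protein pairs.

    pairs: list of (protein_a, protein_b) tuples.
    embeddings: dict mapping protein id -> list of floats.
    Returns (nodes, features, edges), where edges holds index pairs (i, j) with i < j.
    """
    nodes = [tuple(pair) for pair in pairs]
    # Node feature = embedding of the first protein followed by that of the second.
    features = [list(embeddings[a]) + list(embeddings[b]) for a, b in nodes]
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        # Connect two pair-nodes when they have at least one protein in common.
        if set(nodes[i]) & set(nodes[j]):
            edges.add((i, j))
    return nodes, features, edges


# Toy data: identifiers and vectors are placeholders, not real proteins or embeddings.
toy_embeddings = {
    "A": [0.1, 0.2],
    "B": [0.3, 0.4],
    "C": [0.5, 0.6],
    "D": [0.7, 0.8],
    "E": [0.9, 1.0],
}
toy_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
nodes, features, edges = build_pair_graph(toy_pairs, toy_embeddings)
```

In this toy case only the first two nodes share a protein ("B"), so the graph has a single edge. The resulting node features and edges would then be passed to a graph encoder; that step is outside the scope of this sketch.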
Affiliation(s)
- Kanchan Jha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India.
| | - Sourav Karmakar
- Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, West Bengal, 713209, India
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
|
116
|
Bryant P. Deep learning for protein complex structure prediction. Curr Opin Struct Biol 2023; 79:102529. [PMID: 36731337 DOI: 10.1016/j.sbi.2023.102529] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 08/15/2022] [Revised: 12/10/2022] [Accepted: 12/20/2022] [Indexed: 02/04/2023]
Abstract
Recent developments in the structure prediction of protein complexes have resulted in accuracies rivalling experimental methods in many cases. The high accuracy is mainly observed in dimeric complexes and other problems such as protein disorder and predicting the structure of host-pathogen interactions remain. This review highlights the foundation for current accurate structure prediction of protein complexes and possible ways to address the remaining limitations.
Affiliation(s)
- Patrick Bryant
- Science for Life Laboratory, 172 21 Solna, Sweden; Department of Biochemistry and Biophysics, Stockholm University, 106 91 Stockholm, Sweden.
|
117
|
Bordin N, Dallago C, Heinzinger M, Kim S, Littmann M, Rauer C, Steinegger M, Rost B, Orengo C. Novel machine learning approaches revolutionize protein knowledge. Trends Biochem Sci 2023; 48:345-359. [PMID: 36504138 PMCID: PMC10570143 DOI: 10.1016/j.tibs.2022.11.001] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 07/14/2022] [Revised: 10/24/2022] [Accepted: 11/17/2022] [Indexed: 12/10/2022]
Abstract
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
Affiliation(s)
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
| | - Christian Dallago
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; VantAI, 151 W 42nd Street, New York, NY 10036, USA
| | - Michael Heinzinger
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
| | - Stephanie Kim
- School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Maria Littmann
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Burkhard Rost
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK.
|
118
|
Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
|
119
|
Wang X, Ding Z, Wang R, Lin X. Deepro-Glu: combination of convolutional neural network and Bi-LSTM models using ProtBert and handcrafted features to identify lysine glutarylation sites. Brief Bioinform 2023; 24:6991122. [PMID: 36653898 DOI: 10.1093/bib/bbac631] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/23/2022] [Revised: 12/11/2022] [Accepted: 12/28/2022] [Indexed: 01/20/2023] Open
Abstract
Lysine glutarylation (Kglu) is a newly discovered post-translational modification of proteins with important roles in mitochondrial functions, oxidative damage, etc. The established biological experimental methods to identify glutarylation sites are often time-consuming and costly. Therefore, there is an urgent need to develop computational methods for efficient and accurate identification of glutarylation sites. Most of the existing computational methods only utilize handcrafted features to construct the prediction model and do not consider the positive impact of the pre-trained protein language model on the prediction performance. Based on this, we develop an ensemble deep-learning predictor, Deepro-Glu, that combines a convolutional neural network and a bidirectional long short-term memory network using the deep learning features and traditional handcrafted features to predict lysine glutarylation sites. The deep learning features are generated from the pre-trained protein language model called ProtBert, and the handcrafted features consist of sequence-based features, physicochemical property-based features and evolution information-based features. Furthermore, the attention mechanism is used to efficiently integrate the deep learning features and the handcrafted features by learning the appropriate attention weights. 10-fold cross-validation and independent tests demonstrate that Deepro-Glu achieves performance competitive with or superior to that of the state-of-the-art methods. The source codes and data are publicly available at https://github.com/xwanggroup/Deepro-Glu.
Affiliation(s)
- Xiao Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Zhaoyuan Ding
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Rong Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Xi Lin
- Institute of Artificial Intelligence, Xiamen University, No.4221, Xiang'an South Road, 361000, Xiamen, China
|
120
|
Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, Dos Santos Costa A, Fazel-Zarandi M, Sercu T, Candido S, Rives A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023; 379:1123-1130. [PMID: 36927031 DOI: 10.1126/science.ade2574] [Citation(s) in RCA: 723] [Impact Index Per Article: 723.0] [Indexed: 03/18/2023]
Abstract
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
Affiliation(s)
- Zeming Lin
- FAIR, Meta AI, New York, NY, USA
- New York University, New York, NY, USA
| | | | | | - Brian Hie
- FAIR, Meta AI, New York, NY, USA
- Stanford University, Palo Alto, CA, USA
| | | | | | | | | | | | | | | | | | | | | | - Alexander Rives
- FAIR, Meta AI, New York, NY, USA
- New York University, New York, NY, USA
|
121
|
Benchmarking machine learning robustness in Covid-19 genome sequence classification. Sci Rep 2023; 13:4154. [PMID: 36914815 PMCID: PMC10010240 DOI: 10.1038/s41598-023-31368-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 03/10/2023] [Indexed: 03/16/2023] Open
Abstract
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
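Illustrative note: the idea of perturbing a sequence under a fixed "perturbation budget" can be sketched with the simplest possible error model, uniform random substitutions. This is an assumption made for illustration only; it is not the paper's simulation and does not reproduce real Illumina or PacBio error profiles (which also include insertions and deletions). The function name `perturb_sequence` and its parameters are hypothetical.

```python
import random

BASES = "ACGT"


def perturb_sequence(sequence, fraction, seed=0):
    """Substitute a fixed fraction of positions with a different base.

    fraction: share of positions to alter (the perturbation budget), between 0 and 1.
    seed: seed for a private random generator so results are reproducible.
    """
    rng = random.Random(seed)
    n_errors = round(len(sequence) * fraction)
    positions = rng.sample(range(len(sequence)), n_errors)
    perturbed = list(sequence)
    for pos in positions:
        # Choose a replacement that differs from the original base at this position.
        alternatives = [b for b in BASES if b != sequence[pos]]
        perturbed[pos] = rng.choice(alternatives)
    return "".join(perturbed)


original = "ACGT" * 25  # toy 100-base sequence, not a real genome fragment
noisy = perturb_sequence(original, fraction=0.05, seed=42)
n_changed = sum(1 for a, b in zip(original, noisy) if a != b)
```

With a 100-base input and a budget of 0.05, exactly five positions differ between `original` and `noisy`. A robustness benchmark would then compare a classifier's accuracy on the original and perturbed inputs across several budgets.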
|
122
|
Wang C, Yuan C, Wang Y, Chen R, Shi Y, Patti GJ, Hou Q. Genome-scale enzymatic reaction prediction by variational graph autoencoders. bioRxiv 2023:2023.03.08.531729. [PMID: 36945484 PMCID: PMC10028866 DOI: 10.1101/2023.03.08.531729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2023]
Abstract
Background: Enzymatic reaction networks are crucial for exploring the mechanistic function of metabolites and proteins in biological systems and for understanding the etiology of diseases and potential targets for drug discovery. The increasing number of metabolic reactions allows the development of deep learning-based methods to discover new enzymatic reactions, which will expand the landscape of existing enzymatic reaction networks to investigate the disrupted metabolisms in diseases.
Results: In this study, we propose the MPI-VGAE framework to predict metabolite-protein interactions (MPI) in a genome-scale heterogeneous enzymatic reaction network across ten organisms with thousands of enzymatic reactions. We improved the Variational Graph Autoencoders (VGAE) model to incorporate both molecular features of metabolites and proteins as well as neighboring features to achieve the best predictive performance of MPI. The MPI-VGAE framework showed robust performance in the reconstruction of hundreds of metabolic pathways and five functional enzymatic reaction networks. The MPI-VGAE framework was also applied to a homogenous metabolic reaction network and achieved performance as high as that of other state-of-the-art methods. Furthermore, the MPI-VGAE framework could be implemented to reconstruct the disease-specific MPI network based on hundreds of disrupted metabolites and proteins in Alzheimer's disease and colorectal cancer, respectively. A substantial number of new potential enzymatic reactions were predicted and validated by molecular docking. These results highlight the potential of the MPI-VGAE framework for the discovery of novel disease-related enzymatic reactions and drug targets in real-world applications.
Data availability and implementation: The MPI-VGAE framework and datasets are publicly accessible on GitHub at https://github.com/mmetalab/mpi-vgae.
Author Biographies: Cheng Wang received his Ph.D. in Chemistry from The Ohio State University, USA. He is currently an Assistant Professor in School of Public Health at Shandong University, China. His research interests include bioinformatics, machine learning-based approach with applications to biomedical networks. Chuang Yuan is a research assistant at Shandong University. He obtained the MS degree in Biology at the University of Science and Technology of China. His research interests include biochemistry & molecular biology, cell biology, biomedicine, bioinformatics, and computational biology. Yahui Wang is a PhD student in Department of Chemistry at Washington University in St. Louis. Her research interests include biochemistry, mass spectrometry-based metabolomics, and cancer metabolism. Ranran Chen is a master's student in School of Public Health at University of Shandong, China. Yuying Shi is a master's student in School of Public Health at University of Shandong, China. Gary J. Patti is the Michael and Tana Powell Professor at Washington University in St. Louis, where he holds appointments in the Department of Chemistry and the Department of Medicine. He is also the Senior Director of the Center for Metabolomics and Isotope Tracing at Washington University. His research interests include metabolomics, bioinformatics, high-throughput mass spectrometry, environmental health, cancer, and aging. Leyi Wei received his Ph.D. in Computer Science from Xiamen University, China. He is currently a Professor in School of Software at Shandong University, China. His research interests include machine learning and its applications to bioinformatics. Qingzhen Hou received his Ph.D. in the Centre for Integrative Bioinformatics VU (IBIVU) from Vrije Universiteit Amsterdam, the Netherlands. Since 2020, he has served as the head of Bioinformatics Center in National Institute of Health Data Science of China and Assistant Professor in School of Public Health, Shandong University, China. His areas of research are bioinformatics and computational biophysics.
Key points:
- Genome-scale heterogeneous networks of metabolite-protein interaction (MPI) based on thousands of enzymatic reactions across ten organisms were constructed semi-automatically.
- An enzymatic reaction prediction method called Metabolite-Protein Interaction Variational Graph Autoencoders (MPI-VGAE) was developed and optimized to achieve higher performance compared with existing machine learning methods by using both molecular features of metabolites and proteins.
- MPI-VGAE is broadly useful for applications involving the reconstruction of metabolic pathways, functional enzymatic reaction networks, and homogenous networks (e.g., metabolic reaction networks).
- By applying MPI-VGAE to Alzheimer's disease and colorectal cancer, we obtained several novel disease-related protein-metabolite reactions with biological meanings. Moreover, we further investigated the reasonable binding details of protein-metabolite interactions using molecular docking approaches, which provided useful information for disease mechanism and drug design.
|
123
|
Huang B, Fan T, Wang K, Zhang H, Yu C, Nie S, Qi Y, Zheng WM, Han J, Fan Z, Sun S, Ye S, Yang H, Bu D. Accurate and efficient protein sequence design through learning concise local environment of residues. Bioinformatics 2023; 39:btad122. [PMID: 36916746 PMCID: PMC10027430 DOI: 10.1093/bioinformatics/btad122] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/01/2022] [Revised: 01/30/2023] [Accepted: 02/19/2023] [Indexed: 03/15/2023] Open
Abstract
MOTIVATION Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired. RESULTS Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein. AVAILABILITY AND IMPLEMENTATION The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.
Affiliation(s)
- Bin Huang
- Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100110, China
| | - Tingwen Fan
- Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
| | - Kaiyue Wang
- Beijing Advanced Innovation Center for Big Data-based Precision Medicine, School of Engineering Medicine, Beihang University, Beijing 100083, China
- Key Laboratory of Big Data-based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of the People’s Republic of China, Beijing 100083, China
| | - Haicang Zhang
- Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100110, China
- Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
| | - Chungong Yu
- Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100110, China
- Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
| | - Shuyu Nie
- Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- School of Life Sciences, Hebei University, Baoding, Hebei 071002, China
| | - Yangshuo Qi
- Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- School of Life Sciences, Hebei University, Baoding, Hebei 071002, China
| | - Wei-Mou Zheng
- University of Chinese Academy of Sciences, Beijing 100110, China
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
| | - Jian Han
- Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
| | - Zheng Fan
- Institutional Center for Shared Technologies and Facilities, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
| | - Shiwei Sun
- Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100110, China
- Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
| | - Sheng Ye
- Beijing Advanced Innovation Center for Big Data-based Precision Medicine, School of Engineering Medicine, Beihang University, Beijing 100083, China
- Key Laboratory of Big Data-based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of the People’s Republic of China, Beijing 100083, China
| | - Huaiyi Yang
- University of Chinese Academy of Sciences, Beijing 100110, China
- Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
| | - Dongbo Bu
- Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100110, China
- Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
|
124
|
Tran C, Khadkikar S, Porollo A. Survey of Protein Sequence Embedding Models. Int J Mol Sci 2023; 24:3775. [PMID: 36835188 PMCID: PMC9963412 DOI: 10.3390/ijms24043775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2023] [Revised: 01/23/2023] [Accepted: 02/09/2023] [Indexed: 02/16/2023] Open
Abstract
Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
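Illustrative note: the comparison of variant and reference proteins by cosine similarity of their embeddings, mentioned in this abstract, reduces to the computation below. This is a minimal sketch with made-up three-dimensional vectors standing in for real fixed-size protein embeddings; the function name `cosine_similarity` and the variable names are hypothetical, and how the survey pooled per-residue outputs into one vector is not stated here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder embeddings (made-up numbers, not model output).
reference = [0.20, 0.50, 0.10]
variant_close = [0.21, 0.49, 0.11]   # small change relative to the reference
variant_far = [0.50, -0.20, 0.40]    # larger change relative to the reference

sim_close = cosine_similarity(reference, variant_close)
sim_far = cosine_similarity(reference, variant_far)
```

A study of this kind would collect such similarity scores separately for benign and pathogenic variants and then compare the two distributions with a statistical test.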
Affiliation(s)
- Chau Tran
- Department of Computer Science, University of Cincinnati, Cincinnati, OH 45219, USA
| | - Siddharth Khadkikar
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
| | - Aleksey Porollo
- Center for Autoimmune Genomics and Etiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Pediatrics, University of Cincinnati, Cincinnati, OH 45267, USA
| |
|
125
|
Atas Guvenilir H, Doğan T. How to approach machine learning-based prediction of drug/compound-target interactions. J Cheminform 2023; 15:16. [PMID: 36747300 PMCID: PMC9901167 DOI: 10.1186/s13321-023-00689-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/02/2022] [Accepted: 01/30/2023] [Indexed: 02/08/2023] Open
Abstract
The identification of drug/compound-target interactions (DTIs) constitutes the basis of drug discovery, for which computational predictive approaches have been developed. As a relatively new data-driven paradigm, proteochemometric (PCM) modeling utilizes both protein and compound properties as a pair at the input level and processes them via statistical/machine learning. The representation of input samples (i.e., proteins and their ligands) in the form of quantitative feature vectors is crucial for the extraction of interaction-related properties during the artificial learning and subsequent prediction of DTIs. Lately, the representation learning approach, in which input samples are automatically featurized via training and applying a machine/deep learning model, has been utilized in biomedical sciences. In this study, we performed a comprehensive investigation of different computational approaches/techniques for protein featurization (including both conventional approaches and the novel learned embeddings), data preparation and exploration, machine learning-based modeling, and performance evaluation with the aim of achieving better data representations and more successful learning in DTI prediction. For this, we first constructed realistic and challenging benchmark datasets on small, medium, and large scales to be used as reliable gold standards for specific DTI modeling tasks. We developed and applied a network analysis-based splitting strategy to divide datasets into structurally different training and test folds. Using these datasets together with various featurization methods, we trained and tested DTI prediction models and evaluated their performance from different angles. 
Our main findings can be summarized under three items: (i) random splitting of datasets into train and test folds leads to near-complete data memorization and produces highly over-optimistic results, and should therefore be avoided; (ii) learned protein sequence embeddings work well in DTI prediction and offer high potential, even though interaction-related properties (e.g., structures) of proteins are unused during their self-supervised model training; and (iii) during the learning process, PCM models tend to rely heavily on compound features while partially ignoring protein features, primarily due to the inherent bias in DTI data, indicating the requirement for new and unbiased datasets. We hope this study will aid researchers in designing robust and high-performing data-driven DTI prediction systems that have real-world translational value in drug discovery.
Affiliation(s)
- Heval Atas Guvenilir
- Biological Data Science Laboratory, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
- Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
| | - Tunca Doğan
- Biological Data Science Laboratory, Department of Computer Engineering, Hacettepe University, Ankara, Turkey.
- Institute of Informatics, Hacettepe University, Ankara, Turkey.
- Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey.
| |
|
126
|
Albu AI, Bocicor MI, Czibula G. MM-StackEns: A new deep multimodal stacked generalization approach for protein-protein interaction prediction. Comput Biol Med 2023; 153:106526. [PMID: 36623437 DOI: 10.1016/j.compbiomed.2022.106526] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/27/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/05/2023]
Abstract
Accurate in-silico identification of protein-protein interactions (PPIs) is a long-standing problem in biology, with important implications in protein function prediction and drug design. Current computational approaches predominantly use a single data modality for describing protein pairs, which may not fully capture the characteristics relevant for identifying PPIs. Another limitation of existing methods is their poor generalization to proteins outside the training graph. In this paper, we aim to address these shortcomings by proposing a new ensemble approach for PPI prediction, which learns information from two modalities, corresponding to pairs of sequences and to the graph formed by the training proteins and their interactions. Our approach uses a siamese neural network to process sequence information, while graph attention networks are employed for the network view. For capturing the relationships between the proteins in a pair, we design a new feature fusion module, based on computing the distance between the distributions corresponding to the two proteins. The prediction is made using a stacked generalization procedure, in which the final classifier is represented by a Logistic Regression model trained on the scores predicted by the sequence and graph models. Additionally, we show that protein sequence embeddings obtained using pretrained language models can significantly improve the generalization of PPI methods. The experimental results demonstrate the good performance of our approach, which surpasses all the related work on two Yeast data sets, while outperforming the majority of literature approaches on two Human data sets and on independent multi-species data sets.
Affiliation(s)
- Alexandra-Ioana Albu
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
| | - Maria-Iuliana Bocicor
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
| | - Gabriela Czibula
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
| |
|
127
|
Lin P, Yan Y, Huang SY. DeepHomo2.0: improved protein-protein contact prediction of homodimers by transformer-enhanced deep learning. Brief Bioinform 2023; 24:6849483. [PMID: 36440949 DOI: 10.1093/bib/bbac499] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 06/30/2022] [Revised: 10/08/2022] [Accepted: 10/21/2022] [Indexed: 11/30/2022] Open
Abstract
Protein-protein interactions play an important role in many biological processes. However, although structure prediction for monomer proteins has achieved great progress with the advent of advanced deep learning algorithms like AlphaFold, the structure prediction for protein-protein complexes remains an open question. Taking advantage of the Transformer model of ESM-MSA, we have developed a deep learning-based model, named DeepHomo2.0, to predict protein-protein interactions of homodimeric complexes by leveraging the direct-coupling analysis (DCA) and Transformer features of sequences and the structure features of monomers. DeepHomo2.0 was extensively evaluated on diverse test sets and compared with eight state-of-the-art methods including protein language model-based, DCA-based and machine learning-based methods. It was shown that DeepHomo2.0 achieved a high precision of >70% with experimental monomer structures and >60% with predicted monomer structures for the top 10 predicted contacts on the test sets and outperformed the other eight methods. Moreover, even the version without using structure information, named DeepHomoSeq, still achieved a good precision of >55% for the top 10 predicted contacts. Integrating the predicted contacts into protein docking significantly improved the structure prediction of realistic Critical Assessment of Protein Structure Prediction homodimeric complexes. DeepHomo2.0 and DeepHomoSeq are available at http://huanglab.phys.hust.edu.cn/DeepHomo2/.
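The "precision for the top 10 predicted contacts" reported in this abstract can be read as the fraction of the ten highest-scoring predicted residue pairs that are true contacts. The sketch below shows one way to compute such a value. It is an assumption-laden illustration rather than the DeepHomo2.0 evaluation code: the function name `top_k_precision`, the `(pair, score)` input layout, and the toy data are all hypothetical.

```python
def top_k_precision(scored_pairs, true_contacts, k=10):
    """Fraction of the k highest-scoring predicted residue pairs that are true contacts.

    scored_pairs: iterable of ((residue_i, residue_j), score) tuples.
    true_contacts: set of (residue_i, residue_j) tuples taken as ground truth.
    If fewer than k pairs are given, the precision is computed over those available.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _score in ranked if pair in true_contacts)
    return hits / len(ranked)


# Hypothetical toy predictions and ground-truth contacts.
predictions = [((1, 40), 0.95), ((2, 41), 0.90), ((5, 60), 0.40), ((7, 12), 0.85)]
native = {(1, 40), (7, 12)}
precision_top3 = top_k_precision(predictions, native, k=3)
```

With k=3, the three top-ranked pairs are (1, 40), (2, 41) and (7, 12); two of them are in the ground-truth set, so the precision is 2/3.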
Affiliation(s)
- Peicong Lin
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| | - Yumeng Yan
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| | - Sheng-You Huang
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| |
|
128
|
Hou Z, Yang Y, Ma Z, Wong KC, Li X. Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning. Commun Biol 2023; 6:73. [PMID: 36653447 PMCID: PMC9849350 DOI: 10.1038/s42003-023-04462-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 06/20/2022] [Accepted: 01/11/2023] [Indexed: 01/20/2023] Open
Abstract
Protein-protein interactions (PPIs) govern cellular pathways and processes by significantly influencing the functional expression of proteins. Therefore, accurate identification of protein-protein interaction binding sites has become a key step in the functional analysis of proteins. However, since most computational methods are designed based on biological features, there are no available protein language models to directly encode amino acid sequences into distributed vector representations to model their characteristics for protein-protein binding events. Moreover, the number of experimentally detected protein interaction sites is much smaller than that of protein-protein interactions or protein sites in protein complexes, resulting in unbalanced data sets that leave room for improvement in the performance of existing methods. To address these problems, we develop an ensemble deep learning model (EDLM)-based protein-protein interaction (PPI) site identification method (EDLMPPI). Evaluation results show that EDLMPPI outperforms state-of-the-art techniques, including several PPI site prediction models, on three widely used benchmark datasets (Dset_448, Dset_72, and Dset_164), exceeding those PPI site prediction models by nearly 10% in terms of average precision. In addition, the biological and interpretable analyses provide new insights into protein binding site identification and characterization mechanisms from different perspectives. The EDLMPPI webserver is available at http://www.edlmppi.top:5002/.
Affiliation(s)
- Zilong Hou
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yuning Yang
- Information Science and Technology, Northeast Normal University, Jilin, China
| | - Zhiqiang Ma
- Information Science and Technology, Northeast Normal University, Jilin, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China.
| |
|
129
|
Lim PK, Julca I, Mutwil M. Redesigning plant specialized metabolism with supervised machine learning using publicly available reactome data. Comput Struct Biotechnol J 2023; 21:1639-1650. [PMID: 36874159 PMCID: PMC9976193 DOI: 10.1016/j.csbj.2023.01.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/27/2022] [Revised: 01/12/2023] [Accepted: 01/12/2023] [Indexed: 01/19/2023] Open
Abstract
The immense structural diversity of products and intermediates of plant specialized metabolism (specialized metabolites) makes them rich sources of therapeutic medicine, nutrients, and other useful materials. With the rapid accumulation of reactome data accessible in biological and chemical databases, along with recent advances in machine learning, this review outlines how supervised machine learning can be used to design new compounds and pathways by exploiting the wealth of these data. We first examine the various sources from which reactome data can be obtained, and then explain the different machine learning encoding methods for reactome data. We then discuss current supervised machine learning developments that can be employed in various aspects to help redesign plant specialized metabolism.
Affiliation(s)
- Peng Ken Lim
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Irene Julca
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| |
|
130
|
Anteghini M, Martins Dos Santos VAP. Computational Approaches for Peroxisomal Protein Localization. Methods Mol Biol 2023; 2643:405-411. [PMID: 36952202 DOI: 10.1007/978-1-0716-3048-8_29] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/24/2023]
Abstract
Computational approaches are practical when investigating putative peroxisomal proteins and for sub-peroxisomal protein localization in unknown protein sequences. Nowadays, advancements in computational methods and Machine Learning (ML) can be used to hasten the discovery of novel peroxisomal proteins and can be combined with more established computational methodologies. Here, we explain and list some of the most used tools and methodologies for novel peroxisomal protein detection and localization.
Affiliation(s)
- Marco Anteghini
- Lifeglimmer GmbH, Berlin, Germany.
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, WE, The Netherlands.
- Zuse Institut Berlin, Visual and Data-Centric Computing, Berlin, Germany.
| | - Vitor A P Martins Dos Santos
- Lifeglimmer GmbH, Berlin, Germany
- BioProcess Engineering, Wageningen University & Research, Wageningen, WE, The Netherlands
| |
|
131
|
Nambiar A, Liu S, Heflin M, Forsyth JM, Maslov S, Hopkins M, Ritz A. Transformer Neural Networks for Protein Family and Interaction Prediction Tasks. J Comput Biol 2023; 30:95-111. [PMID: 35950958 DOI: 10.1089/cmb.2022.0132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2023] Open
Abstract
The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network that pre-trains task-agnostic sequence representations. This model is fine-tuned to solve two different protein prediction tasks: protein family classification and protein interaction prediction. Our method is comparable to existing state-of-the-art approaches for protein family classification while being much more general than other architectures. Further, our method outperforms other approaches for protein interaction prediction for two out of three different scenarios that we generated. These results offer a promising framework for fine-tuning the pre-trained sequence representations for other protein prediction tasks.
Affiliation(s)
- Ananthan Nambiar
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
| | - Simon Liu
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
| | - Maeve Heflin
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
| | - John Malcolm Forsyth
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
| | - Sergei Maslov
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
| | - Mark Hopkins
- Department of Computer Science, Reed College, Portland, Oregon, USA
| | - Anna Ritz
- Department of Biology, Reed College, Portland, Oregon, USA
| |
|
132
|
Olenyi T, Marquet C, Heinzinger M, Kröger B, Nikolova T, Bernhofer M, Sändig P, Schütze K, Littmann M, Mirdita M, Steinegger M, Dallago C, Rost B. LambdaPP: Fast and accessible protein-specific phenotype predictions. Protein Sci 2023; 32:e4524. [PMID: 36454227 PMCID: PMC9793974 DOI: 10.1002/pro.4524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Revised: 11/09/2022] [Accepted: 11/21/2022] [Indexed: 12/04/2022]
Abstract
The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP, which leverages ColabFold and is computed in minutes, is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use under embed.predictprotein.org; the interactive results for the case study can be found under https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.
Affiliation(s)
- Tobias Olenyi
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Céline Marquet
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Michael Heinzinger
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Benjamin Kröger
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Tiha Nikolova
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Michael Bernhofer
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Philip Sändig
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Konstantin Schütze
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Maria Littmann
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Milot Mirdita
- School of Biological Sciences, Seoul National University, Seoul, South Korea
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea
- Korea Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Korea Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
| | - Christian Dallago
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- VantAI, New York, USA
| | - Burkhard Rost
- TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
| |
|
133
|
Kimothi D, Biyani P, Hogan JM, Davis MJ. Sequence Representations and Their Utility for Predicting Protein-Protein Interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:646-657. [PMID: 34941517 DOI: 10.1109/tcbb.2021.3137325] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/14/2023]
Abstract
Protein-Protein Interactions (PPIs) are a crucial mechanism underpinning the function of the cell. So far, a wide range of machine-learning based methods have been proposed for predicting these relationships. Their success is heavily dependent on the construction of the underlying feature vectors, with most using a set of physico-chemical properties derived from the sequence. Few work directly with the sequence itself. In this paper, we explore the utility of sequence embeddings for predicting protein-protein interactions. We construct a protein pair feature vector by concatenating the embeddings of their constituent sequences. These feature vectors are then used as input to a binary classifier to make predictions. To learn sequence embeddings, we use two established Word2Vec based methods - Seq2Vec and BioVec - and we also introduce a novel feature construction method called SuperVecNW. The embeddings generated through SuperVecNW capture some network information in addition to the contextual information present in the sequences. We test the efficacy of our proposed approach on human and yeast PPI datasets and on three well-known networks: CD9, the Ras-Raf-Mek-Erk-Elk-Srf pathway, and a Wnt-related network. We demonstrate that low dimensional sequence embeddings provide better results than most alternative representations based on physico-chemical properties while offering a far simpler approach to feature vector construction.
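The pair representation described in this abstract, concatenating the two sequence embeddings into one feature vector, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper name `pair_feature_vector` and the 3-dimensional toy embeddings are hypothetical, and the downstream binary classifier used by the authors is not reproduced here.

```python
def pair_feature_vector(embedding_a, embedding_b):
    """Concatenate two per-protein embeddings into a single pair feature vector."""
    return list(embedding_a) + list(embedding_b)


# Hypothetical toy embeddings for two proteins.
emb_p1 = [0.1, 0.3, -0.2]
emb_p2 = [0.5, -0.4, 0.0]
features = pair_feature_vector(emb_p1, emb_p2)
```

Note that plain concatenation is order-dependent: the vector for (A, B) differs from the one for (B, A). Whether and how the original work handles this symmetry is not stated in the abstract.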
|
134
|
ISPRED-SEQ: Deep neural networks and embeddings for predicting interaction sites in protein sequences. J Mol Biol 2023. [DOI: 10.1016/j.jmb.2023.167963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
|
135
|
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681 DOI: 10.1515/sagmb-2022-0057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Accepted: 04/20/2023] [Indexed: 09/03/2023]
Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU - Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. On the human dataset, the proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO by 1.9 % for the cellular component prediction task, 3.8 % for the molecular function prediction task and 0.6 % for the biological process prediction task; on the yeast dataset, the corresponding improvements are 2.4 %, 5.2 % and 1.2 %.
Affiliation(s)
- Lavkush Sharma
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Akshay Deepak
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Ashish Ranjan
- Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India
| | | |
|
136
|
Durairaj J, de Ridder D, van Dijk AD. Beyond sequence: Structure-based machine learning. Comput Struct Biotechnol J 2022; 21:630-643. [PMID: 36659927 PMCID: PMC9826903 DOI: 10.1016/j.csbj.2022.12.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Revised: 12/21/2022] [Accepted: 12/21/2022] [Indexed: 12/31/2022] Open
Abstract
Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
Affiliation(s)
- Janani Durairaj
- Biozentrum, University of Basel, Basel, Switzerland
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Dick de Ridder
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Aalt D.J. van Dijk
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| |
|
137
|
Anteghini M, Haja A, Martins dos Santos VA, Schomaker L, Saccenti E. OrganelX web server for sub-peroxisomal and sub-mitochondrial protein localization and peroxisomal target signal detection. Comput Struct Biotechnol J 2022; 21:128-133. [PMID: 36544474 PMCID: PMC9747352 DOI: 10.1016/j.csbj.2022.11.058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 11/28/2022] [Accepted: 11/28/2022] [Indexed: 12/12/2022] Open
Abstract
We present the OrganelX e-Science Web Server that provides a user-friendly implementation of the In-Pero and In-Mito classifiers for sub-peroxisomal and sub-mitochondrial localization of peroxisomal and mitochondrial proteins and the Is-PTS1 algorithm for detecting and validating potential peroxisomal proteins carrying a PTS1 signal sequence. The OrganelX e-Science Web Server is available at https://organelx.hpc.rug.nl/fasta/.
Affiliation(s)
- Marco Anteghini
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands
- LifeGlimmer GmbH, Berlin, Germany
| | - Asmaa Haja
- Bernoulli Institute, University of Groningen, Groningen, The Netherlands
| | - Vitor A.P. Martins dos Santos
- LifeGlimmer GmbH, Berlin, Germany
- Bioprocess Engineering, Wageningen University & Research, Wageningen, The Netherlands
| | - Lambert Schomaker
- Bernoulli Institute, University of Groningen, Groningen, The Netherlands
| | - Edoardo Saccenti
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands
| |
|
138
|
Hou Q, Waury K, Gogishvili D, Feenstra KA. Ten quick tips for sequence-based prediction of protein properties using machine learning. PLoS Comput Biol 2022; 18:e1010669. [PMID: 36454728 PMCID: PMC9714715 DOI: 10.1371/journal.pcbi.1010669] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 12/03/2022] Open
Abstract
The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
Affiliation(s)
- Qingzhen Hou
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong, P. R. China
- National Institute of Health Data Science of China, Shandong University, Shandong, P. R. China
| | - Katharina Waury
- Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
| | - Dea Gogishvili
- Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
| | - K. Anton Feenstra
- Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
| |
|
139
|
Li G, Buric F, Zrimec J, Viknander S, Nielsen J, Zelezniak A, Engqvist MKM. Learning deep representations of enzyme thermal adaptation. Protein Sci 2022; 31:e4480. [PMID: 36261883 PMCID: PMC9679980 DOI: 10.1002/pro.4480] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/28/2022] [Revised: 09/02/2022] [Accepted: 10/15/2022] [Indexed: 12/14/2022]
Abstract
Temperature is a fundamental environmental factor that shapes the evolution of organisms. Learning thermal determinants of protein sequences in evolution thus has profound significance for basic biology, drug discovery, and protein engineering. Here, we use a data set of over 3 million BRENDA enzymes labeled with optimal growth temperatures (OGTs) of their source organisms to train a deep neural network model (DeepET). The protein-temperature representations learned by DeepET provide a temperature-related statistical summary of protein sequences and capture structural properties that affect thermal stability. For prediction of enzyme optimal catalytic temperatures and protein melting temperatures via a transfer learning approach, our DeepET model outperforms classical regression models trained on rationally designed features and other deep-learning-based representations. DeepET thus holds promise for understanding enzyme thermal adaptation and guiding the engineering of thermostable enzymes.
Affiliation(s)
- Gang Li
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
| | - Sandra Viknander
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Jens Nielsen
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- BioInnovation Institute, Copenhagen N, Denmark
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Life Sciences Centre, Institute of Biotechnology, Vilnius University, Vilnius, Lithuania
- Randall Centre for Cell & Molecular Biophysics, King's College London, New Hunt's House, Guy's Campus, SE1 1UL, London, UK
| | - Martin K. M. Engqvist
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Enginzyme AB, Stockholm, Sweden
| |
|
140
|
Context-aware sentiment analysis with attention-enhanced features from bidirectional transformers. SOCIAL NETWORK ANALYSIS AND MINING 2022. [DOI: 10.1007/s13278-022-00910-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
|
141
|
Manfredi M, Savojardo C, Martelli PL, Casadio R. E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants. Bioinformatics 2022; 38:5168-5174. [PMID: 36227117 PMCID: PMC9710551 DOI: 10.1093/bioinformatics/btac678] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 05/16/2022] [Revised: 09/14/2022] [Accepted: 10/10/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage the new encoding schemes for an efficient annotation of protein variants. RESULTS E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, can predict whether the variation is related to diseases or not. The proposed method adopts an input encoding completely based on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101 146 human protein single amino acid variants in 13 661 proteins, derived from public resources. When tested on a blind set comprising 10 266 variants, our method well compares to recent approaches released in literature for the same task, reaching a Matthews Correlation Coefficient score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets. AVAILABILITY AND IMPLEMENTATION The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
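The Matthews Correlation Coefficient quoted in this abstract is a standard binary-classification metric computed from the four confusion-matrix counts. The sketch below shows the textbook formula only; treating disease-related variants as the positive class and returning 0.0 when the denominator vanishes are assumptions made for illustration, and the counts used in any call are hypothetical rather than results from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/fn: disease-related variants predicted as disease-related / neutral.
    tn/fp: neutral variants predicted as neutral / disease-related.
    (Which class counts as positive is an assumption of this sketch.)
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention used here: undefined cases (an empty row or column) score 0.
        return 0.0
    return numerator / denominator
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and unlike plain accuracy it is not inflated by a large majority class.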
Affiliation(s)
- Matteo Manfredi
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
| | - Castrense Savojardo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
| |
|
142
|
Wu L, Yin C, Zhu J, Wu Z, He L, Xia Y, Xie S, Qin T, Liu TY. SPRoBERTa: protein embedding learning with local fragment modeling. Brief Bioinform 2022; 23:6711410. [PMID: 36136367 DOI: 10.1093/bib/bbac401] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Revised: 07/18/2022] [Accepted: 08/18/2022] [Indexed: 12/14/2022] Open
Abstract
A good understanding of protein function and structure in computational biology contributes to the understanding of human beings. Because only a limited number of proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented by individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics existing in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for deep pre-training models is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.
Affiliation(s)
- Lijun Wu
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Chengcan Yin
- National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
| | - Jinhua Zhu
- CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China, No.96, JinZhai Road Baohe District, 230026, Hefei, Anhui Province, China
| | - Zhen Wu
- National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
| | - Liang He
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Yingce Xia
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Shufang Xie
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tao Qin
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tie-Yan Liu
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| |
|
143
|
Ferruz N, Heinzinger M, Akdel M, Goncearenco A, Naef L, Dallago C. From sequence to function through structure: Deep learning for protein design. Comput Struct Biotechnol J 2022; 21:238-250. [PMID: 36544476 PMCID: PMC9755234 DOI: 10.1016/j.csbj.2022.11.014] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 08/31/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/20/2022] Open
Abstract
The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware, to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications of the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that allows going from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.
Key Words
- ADMM, Alternating Direction Method of Multipliers
- CNN, Convolutional Neural Network
- DL, Deep learning
- Deep learning
- Drug discovery
- FNN, fully-connected neural network
- GAN, Generative Adversarial Network
- GCN, Graph Convolutional Network
- GNN, Graph Neural Network
- GO, Gene Ontology
- GVP, Geometric Vector Perceptron
- LSTM, Long-Short Term Memory
- MLP, Multilayer Perceptron
- MSA, Multiple Sequence Alignment
- NLP, Natural Language Processing
- NSR, Natural Sequence Recovery
- Protein design
- Protein language models
- Protein prediction
- VAE, Variational Autoencoder
- pLM, protein Language Model
Affiliation(s)
- Noelia Ferruz
- Institute of Informatics and Applications, University of Girona, Girona, Spain
- Department of Biochemistry, University of Bayreuth, Bayreuth, Germany
| | - Michael Heinzinger
- Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
| | - Mehmet Akdel
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
| | | | - Luca Naef
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
| | - Christian Dallago
- Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
- NVIDIA DE GmbH, Einsteinstraße 172, 81677 München, Germany
| |
|
144
|
Kabir A, Shehu A. GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction. Biomolecules 2022; 12:1709. [PMID: 36421723 PMCID: PMC9687818 DOI: 10.3390/biom12111709] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/24/2022] [Revised: 11/14/2022] [Accepted: 11/15/2022] [Indexed: 09/19/2023] Open
Abstract
Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO terms representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown superior over recent representative GO prediction methods. The second major contribution in this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state-of-the-art.
Affiliation(s)
- Anowarul Kabir
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Center for Advancing Human-Machine Partnerships, George Mason University, Fairfax, VA 22030, USA
- Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
| |
|
145
|
Schütze K, Heinzinger M, Steinegger M, Rost B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. FRONTIERS IN BIOINFORMATICS 2022; 2:1033775. [PMID: 36466147 PMCID: PMC9714024 DOI: 10.3389/fbinf.2022.1033775] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 08/31/2022] [Accepted: 10/31/2022] [Indexed: 11/29/2023] Open
Abstract
Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges in applying it to multi-domain proteins and presented some ideas on how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
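The idea of a nearest neighbor search in embedding space can be sketched with a brute-force lookup. This is an illustrative sketch only: it assumes each protein has already been reduced to one fixed-length vector, it uses cosine similarity as the comparison (the abstract does not say which distance the authors use), and the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, database, k=1):
    """Return the k database entries most similar to the query embedding.

    database: dict mapping a protein identifier to its embedding vector.
    Result: list of (identifier, similarity), most similar first.
    """
    scored = [(cosine_similarity(query, emb), name) for name, emb in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy example with made-up 2-dimensional "embeddings".
toy_db = {"protA": [1.0, 0.0], "protB": [0.0, 1.0], "protC": [1.0, 1.0]}
hits = nearest_neighbors([0.9, 0.1], toy_db, k=2)
```

A brute-force scan like this costs one comparison per database entry, so searches over large databases would normally rely on an approximate nearest neighbor index instead.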
Affiliation(s)
- Konstantin Schütze
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology—i12, Munich, Germany
| | - Michael Heinzinger
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology—i12, Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Burkhard Rost
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology—i12, Munich, Germany
- Institute for Advanced Study (TUM-IAS), Germany & TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
| |
|
146
|
Ismi DP, Pulungan R, Afiahayati. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold. Comput Struct Biotechnol J 2022; 20:6271-6286. [PMID: 36420164 PMCID: PMC9678802 DOI: 10.1016/j.csbj.2022.11.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/06/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/13/2022] Open
Abstract
This paper aims to provide a comprehensive review of the trends and challenges of deep neural networks for protein secondary structure prediction (PSSP). In recent years, deep neural networks have become the primary method for protein secondary structure prediction. Previous studies showed that deep neural networks have raised the accuracy of three-state secondary structure prediction to more than 80%. Favored deep learning methods, such as convolutional neural networks, recurrent neural networks, inception networks, and graph neural networks, have been implemented in protein secondary structure prediction. Methods adapted from natural language processing (NLP) and computer vision are also employed, including attention mechanisms, ResNet, and U-shape networks. In the post-AlphaFold era, PSSP studies focus on different objectives, such as enhancing the quality of evolutionary information and exploiting protein language models as the PSSP input. The recent trend to utilize pre-trained language models as input features for secondary structure prediction provides a new direction for PSSP studies. Moreover, the state-of-the-art accuracy achieved by previous PSSP models is still below its theoretical limit. There is still room for improvement in the field.
Affiliation(s)
- Dewi Pramudi Ismi
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Informatics, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
| | - Reza Pulungan
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| | - Afiahayati
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| |
|
147
|
Nourani E, Asgari E, McHardy AC, Mofrad MRK. TripletProt: Deep Representation Learning of Proteins Based On Siamese Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3744-3753. [PMID: 34460382 DOI: 10.1109/tcbb.2021.3108718] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Pretrained representations have recently gained attention in various machine learning applications. Nonetheless, the high computational costs associated with training these models have motivated alternative approaches for representation learning. Herein we introduce TripletProt, a new approach for protein representation learning based on Siamese neural networks. Representation learning of biological entities which capture essential features can alleviate many of the challenges associated with supervised learning in bioinformatics. The most important distinction of our proposed method is relying on the protein-protein interaction (PPI) network. The computational cost of the generated representations for any potential application is significantly lower than comparable methods since the length of the representations is significantly smaller than that in other approaches. TripletProt offers great potential for protein informatics tasks and can be widely applied to similar tasks. We evaluate TripletProt comprehensively in protein functional annotation tasks including sub-cellular localization (14 categories) and gene ontology prediction (more than 2000 classes), which are both challenging multi-class, multi-label classification machine learning problems. We compare the performance of TripletProt with the state-of-the-art approaches including a recurrent language model-based approach (i.e., UniRep), as well as a protein-protein interaction (PPI) network and sequence-based method (i.e., DeepGO). Our TripletProt showed an overall improvement of F1 score in the above-mentioned comprehensive functional annotation tasks, solely relying on the PPI network. Availability: The source code and datasets are available at https://github.com/EsmaeilNourani/TripletProt.
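Siamese networks of the kind named in this abstract are commonly trained with a triplet objective, which pulls an anchor towards a related example and pushes it away from an unrelated one. The sketch below shows the generic margin-based triplet loss on plain vectors. It is not the authors' implementation: the abstract does not spell out the loss, and taking the positive to be an interaction partner in the PPI network and the negative to be a non-interacting protein is an assumption made for illustration, as are the function names and the margin value.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss for one (anchor, positive, negative) triple.

    Assumed roles in this sketch: `positive` embeds a protein that interacts
    with the anchor in the PPI network, `negative` embeds one that does not.
    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In training, the three embeddings would come from the same network with shared weights, and the loss would be averaged over many sampled triples.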
|
148
|
Yan J, Cai J, Zhang B, Wang Y, Wong DF, Siu SWI. Recent Progress in the Discovery and Design of Antimicrobial Peptides Using Traditional Machine Learning and Deep Learning. Antibiotics (Basel) 2022; 11:1451. [PMID: 36290108 PMCID: PMC9598685 DOI: 10.3390/antibiotics11101451] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 09/20/2022] [Revised: 10/11/2022] [Accepted: 10/13/2022] [Indexed: 11/16/2022] Open
Abstract
Antimicrobial resistance has become a critical global health problem due to the abuse of conventional antibiotics and the rise of multi-drug-resistant microbes. Antimicrobial peptides (AMPs) are a group of natural peptides that show promise as next-generation antibiotics due to their low toxicity to the host, broad spectrum of biological activity, including antibacterial, antifungal, antiviral, and anti-parasitic activities, and great therapeutic potential, such as anticancer, anti-inflammatory, etc. Most importantly, AMPs kill bacteria by damaging cell membranes using multiple mechanisms of action rather than targeting a single molecule or pathway, making it difficult for bacterial drug resistance to develop. However, experimental approaches used to discover and design new AMPs are very expensive and time-consuming. In recent years, there has been considerable interest in using in silico methods, including traditional machine learning (ML) and deep learning (DL) approaches, for drug discovery. While there are a few papers summarizing computational AMP prediction methods, none of them focused on DL methods. In this review, we aim to survey the latest AMP prediction methods achieved by DL approaches. First, the biological background of AMPs is introduced, then various feature encoding methods used to represent the features of peptide sequences are presented. We explain the most popular DL techniques and highlight the recent works based on them to classify AMPs and design novel peptide sequences. Finally, we discuss the limitations and challenges of AMP prediction.
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Jianxiu Cai
- Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
- Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macau, China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Yapeng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
| | - Derek F. Wong
- NLP2CT Lab, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Shirley W. I. Siu
- Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macau, China
- School of Pharmaceutical Sciences, Universiti Sains Malaysia, Pulau Pinang 11800, Malaysia
| |
|
149
|
Tsimenidis S, Vrochidou E, Papakostas GA. Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. Int J Mol Sci 2022; 23:12272. [PMID: 36293133 PMCID: PMC9603455 DOI: 10.3390/ijms232012272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Revised: 10/03/2022] [Accepted: 10/12/2022] [Indexed: 11/25/2022] Open
Abstract
Medical discoveries mainly depend on the capability to process and analyze biological datasets, which inundate the scientific community and are still expanding as the cost of next-generation sequencing technologies decreases. Deep learning (DL) is a viable method to exploit this massive data stream since it has advanced quickly through successive innovations. However, an obstacle to scientific progress emerges: the difficulty of applying DL to biology. Both fields are evolving at a breakneck pace, making it hard for an individual to stay at the forefront of both. This paper aims to bridge the gap and help computer scientists bring their valuable expertise into the life sciences. This work provides an overview of the most common types of biological data and data representations that are used to train DL models, with additional information on the models themselves and the various tasks that are being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. Alternatively, this study could also be useful to researchers in biology to understand and utilize the power of DL to gain better insights into and extract important information from the omics data.
Affiliation(s)
| | | | - George A. Papakostas
- MLV Research Group, Department of Computer Science, International Hellenic University, 65404 Kavala, Greece
| |
|
150
|
Ilzhöfer D, Heinzinger M, Rost B. SETH predicts nuances of residue disorder from protein embeddings. FRONTIERS IN BIOINFORMATICS 2022; 2:1019597. [PMID: 36304335 PMCID: PMC9580958 DOI: 10.3389/fbinf.2022.1019597] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/15/2022] [Accepted: 09/20/2022] [Indexed: 11/07/2022] Open
Abstract
Predictions for millions of protein three-dimensional structures are only a few clicks away since the release of AlphaFold2 results for UniProt. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer's Disease. We showed that three recent disorder measures of AlphaFold2 predictions (pLDDT, "experimentally resolved" prediction and "relative solvent accessibility") correlated to some extent with IDRs. However, expert methods predict IDRs more reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments (MSAs). MSAs are not always available, especially for IDRs, and are computationally expensive to generate, limiting the scalability of the associated tools. Here, we present the novel method SETH that predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which explicitly only uses single sequences as input. Thereby, our method, relying on a relatively shallow convolutional neural network, outperformed much more complex solutions while being much faster, allowing predictions for the human proteome to be created in about 1 hour on a consumer-grade PC with one NVIDIA GeForce RTX 3060. Trained on a continuous disorder scale (CheZOD scores), our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of most methods. High performance paired with speed revealed that SETH's nuanced disorder predictions for entire proteomes capture aspects of the evolution of organisms. Additionally, SETH could also be used to filter out regions or proteins with probable low-quality AlphaFold2 3D structures to prioritize running the compute-intensive predictions for large data sets. SETH is freely publicly available at: https://github.com/Rostlab/SETH.
Affiliation(s)
- Dagmar Ilzhöfer
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
| | - Michael Heinzinger
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), TUM Graduate School, Garching, Germany
| | - Burkhard Rost
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Institute for Advanced Study (TUM-IAS), TUM (Technical University of Munich), Garching, Germany
- TUM School of Life Sciences Weihenstephan (WZW), TUM (Technical University of Munich), Freising, Germany
| |
|