1
|
Feng Z, Huang W, Li H, Zhu H, Kang Y, Li Z. DGCPPISP: a PPI site prediction model based on dynamic graph convolutional network and two-stage transfer learning. BMC Bioinformatics 2024; 25:252. [PMID: 39085781 PMCID: PMC11293074 DOI: 10.1186/s12859-024-05864-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2024] [Accepted: 07/10/2024] [Indexed: 08/02/2024] Open
Abstract
BACKGROUND Proteins play a pivotal role in the diverse array of biological processes, making the precise prediction of protein-protein interaction (PPI) sites critical to numerous disciplines including biology, medicine and pharmacy. While deep learning methods have progressively been implemented for the prediction of PPI sites within proteins, the task of enhancing their predictive performance remains an arduous challenge. RESULTS In this paper, we propose a novel PPI site prediction model (DGCPPISP) based on a dynamic graph convolutional neural network and a two-stage transfer learning strategy. Initially, we implement the transfer learning from dual perspectives, namely feature input and model training that serve to supply efficacious prior knowledge for our model. Subsequently, we construct a network designed for the second stage of training, which is built on the foundation of dynamic graph convolution. CONCLUSIONS To evaluate its effectiveness, the performance of the DGCPPISP model is scrutinized using two benchmark datasets. The ensuing results demonstrate that DGCPPISP outshines competing methods in terms of performance. Specifically, DGCPPISP surpasses the second-best method, EGRET, by margins of 5.9%, 10.1%, and 13.3% for F1-measure, AUPRC, and MCC metrics respectively on Dset_186_72_PDB164. Similarly, on Dset_331, it eclipses the performance of the runner-up method, HN-PPISP, by 14.5%, 19.8%, and 29.9% respectively.
Collapse
Affiliation(s)
- Zijian Feng
- Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou, 313000, Zhejiang, China
- College of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, Zhejiang, China
| | - Weihong Huang
- Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou, 313000, Zhejiang, China
- College of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, Zhejiang, China
| | - Haohao Li
- College of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, Zhejiang, China
| | - Hancan Zhu
- School of Mathematics, Physics and Information, Shaoxing University, Shaoxing, 312000, Zhejiang, China
| | - Yanlei Kang
- Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou, 313000, Zhejiang, China
| | - Zhong Li
- Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou, 313000, Zhejiang, China.
- College of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, Zhejiang, China.
| |
Collapse
|
2
|
Volzhenin K, Bittner L, Carbone A. SENSE-PPI reconstructs interactomes within, across, and between species at the genome scale. iScience 2024; 27:110371. [PMID: 39055916 PMCID: PMC11269938 DOI: 10.1016/j.isci.2024.110371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 05/04/2024] [Accepted: 06/21/2024] [Indexed: 07/28/2024] Open
Abstract
Ab initio computational reconstructions of protein-protein interaction (PPI) networks will provide invaluable insights into cellular systems, enabling the discovery of novel molecular interactions and elucidating biological mechanisms within and between organisms. Leveraging the latest generation protein language models and recurrent neural networks, we present SENSE-PPI, a sequence-based deep learning model that efficiently reconstructs ab initio PPIs, distinguishing partners among tens of thousands of proteins and identifying specific interactions within functionally similar proteins. SENSE-PPI demonstrates high accuracy, limited training requirements, and versatility in cross-species predictions, even with non-model organisms and human-virus interactions. Its performance decreases for phylogenetically more distant model and non-model organisms, but signal alteration is very slow. In this regard, it demonstrates the important role of parameters in protein language models. SENSE-PPI is very fast and can test 10,000 proteins against themselves in a matter of hours, enabling the reconstruction of genome-wide proteomes.
Collapse
Affiliation(s)
- Konstantin Volzhenin
- Sorbonne Université, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
| | - Lucie Bittner
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, Paris, France
- Institut Universitaire de France, Paris, France
| | - Alessandra Carbone
- Sorbonne Université, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
- Institut Universitaire de France, Paris, France
| |
Collapse
|
3
|
Luong KD, Singh A. Application of Transformers in Cheminformatics. J Chem Inf Model 2024; 64:4392-4409. [PMID: 38815246 PMCID: PMC11167597 DOI: 10.1021/acs.jcim.3c02070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2023] [Revised: 04/05/2024] [Accepted: 05/06/2024] [Indexed: 06/01/2024]
Abstract
By accelerating time-consuming processes with high efficiency, computing has become an essential part of many modern chemical pipelines. Machine learning is a class of computing methods that can discover patterns within chemical data and utilize this knowledge for a wide variety of downstream tasks, such as property prediction or substance generation. The complex and diverse chemical space requires complex machine learning architectures with great learning power. Recently, learning models based on transformer architectures have revolutionized multiple domains of machine learning, including natural language processing and computer vision. Naturally, there have been ongoing endeavors in adopting these techniques to the chemical domain, resulting in a surge of publications within a short period. The diversity of chemical structures, use cases, and learning models necessitate a comprehensive summarization of existing works. In this paper, we review recent innovations in adapting transformers to solve learning problems in chemistry. Because chemical data is diverse and complex, we structure our discussion based on chemical representations. Specifically, we highlight the strengths and weaknesses of each representation, the current progress of adapting transformer architectures, and future directions.
Collapse
Affiliation(s)
- Kha-Dinh Luong
- Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
| | - Ambuj Singh
- Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
| |
Collapse
|
4
|
Schmid EW, Walter JC. Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.09.588596. [PMID: 38645019 PMCID: PMC11030396 DOI: 10.1101/2024.04.09.588596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/23/2024]
Abstract
Protein-protein interactions (PPIs) are ubiquitous in biology, yet a comprehensive structural characterization of the PPIs underlying biochemical processes is lacking. Although AlphaFold-Multimer (AF-M) has the potential to fill this knowledge gap, standard AF-M confidence metrics do not reliably separate relevant PPIs from an abundance of false positive predictions. To address this limitation, we used machine learning on well curated datasets to train a Structure Prediction and Omics informed Classifier called SPOC that shows excellent performance in separating true and false PPIs, including in proteome-wide screens. We applied SPOC to an all-by-all matrix of nearly 300 human genome maintenance proteins, generating ~40,000 predictions that can be viewed at predictomes.org, where users can also score their own predictions with SPOC. High confidence PPIs discovered using our approach suggest novel hypotheses in genome maintenance. Our results provide a framework for interpreting large scale AF-M screens and help lay the foundation for a proteome-wide structural interactome.
Collapse
Affiliation(s)
- Ernst W. Schmid
- Department of Biological Chemistry & Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
| | - Johannes C. Walter
- Department of Biological Chemistry & Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Howard Hughes Medical Institute, Boston, MA 02115, USA
| |
Collapse
|
5
|
Wang J, Chen S, Yuan Q, Chen J, Li D, Wang L, Yang Y. Predicting the effects of mutations on protein solubility using graph convolution network and protein language model representation. J Comput Chem 2024; 45:436-445. [PMID: 37933773 DOI: 10.1002/jcc.27249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2023] [Revised: 10/11/2023] [Accepted: 10/21/2023] [Indexed: 11/08/2023]
Abstract
Solubility is one of the most important properties of protein. Protein solubility can be greatly changed by single amino acid mutations and the reduced protein solubility could lead to diseases. Since experimental methods to determine solubility are time-consuming and expensive, in-silico methods have been developed to predict the protein solubility changes caused by mutations mostly through protein evolution information. However, these methods are slow since it takes long time to obtain evolution information through multiple sequence alignment. In addition, these methods are of low performance because they do not fully utilize protein 3D structures due to a lack of experimental structures for most proteins. Here, we proposed a sequence-based method DeepMutSol to predict solubility change from residual mutations based on the Graph Convolutional Neural Network (GCN), where the protein graph was initiated according to predicted protein structure from Alphafold2, and the nodes (residues) were represented by protein language embeddings. To circumvent the small data of solubility changes, we further pretrained the model over absolute protein solubility. DeepMutSol was shown to outperform state-of-the-art methods in benchmark tests. In addition, we applied the method to clinically relevant genes from the ClinVar database and the predicted solubility changes were shown able to separate pathogenic mutations. All of the data sets and the source code are available at https://github.com/biomed-AI/DeepMutSol.
Collapse
Affiliation(s)
- Jing Wang
- Guangzhou institute of technology, Xidian University, Guangzhou, China
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
| | - Sheng Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
| | - Qianmu Yuan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
| | - Jianwen Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
| | - Danping Li
- School of Telecommunications Engineering, Xidian University, Xi'an, China
| | - Lei Wang
- School of Electronic Engineering, Xidian University, Xi'an, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
| |
Collapse
|
6
|
Dong Q, Wang S, Miao Y, Luo H, Weng Z, Yu L. Novel antimicrobial peptides against Cutibacterium acnes designed by deep learning. Sci Rep 2024; 14:4529. [PMID: 38402320 PMCID: PMC10894229 DOI: 10.1038/s41598-024-55205-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2023] [Accepted: 02/21/2024] [Indexed: 02/26/2024] Open
Abstract
The increasing prevalence of antibiotic resistance in Cutibacterium acnes (C. acnes) requires the search for alternative therapeutic strategies. Antimicrobial peptides (AMPs) offer a promising avenue for the development of new treatments targeting C. acnes. In this study, to design peptides with the specific inhibitory activity against C. acnes, we employed a deep learning pipeline with generators and classifiers, using transfer learning and pretrained protein embeddings, trained on publicly available data. To enhance the training data specific to C. acnes inhibition, we constructed a phylogenetic tree. A panel of 42 novel generated linear peptides was then synthesized and experimentally evaluated for their antimicrobial selectivity and activity. Five of them demonstrated their high potency and selectivity against C. acnes with MIC of 2-4 µg/mL. Our findings highlight the potential of these designed peptides as promising candidates for anti-acne therapeutics and demonstrate the power of computational approaches for the rational design of targeted antimicrobial peptides.
Collapse
Affiliation(s)
- Qichang Dong
- Shanghai MetaNovas Biotech Co., Ltd, Shanghai, 200120, China
| | - Shaohua Wang
- Shanghai MetaNovas Biotech Co., Ltd, Shanghai, 200120, China
| | - Ying Miao
- College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350108, China
| | - Heng Luo
- Shanghai MetaNovas Biotech Co., Ltd, Shanghai, 200120, China
| | - Zuquan Weng
- College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350108, China
| | - Lun Yu
- Metanovas Biotech Inc., Foster City, 94404, USA.
| |
Collapse
|
7
|
Lee J, Jun DW, Song I, Kim Y. DLM-DTI: a dual language model for the prediction of drug-target interaction with hint-based learning. J Cheminform 2024; 16:14. [PMID: 38297330 PMCID: PMC10832108 DOI: 10.1186/s13321-024-00808-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2023] [Accepted: 01/22/2024] [Indexed: 02/02/2024] Open
Abstract
The drug discovery process is demanding and time-consuming, and machine learning-based research is increasingly proposed to enhance efficiency. A significant challenge in this field is predicting whether a drug molecule's structure will interact with a target protein. A recent study attempted to address this challenge by utilizing an encoder that leverages prior knowledge of molecular and protein structures, resulting in notable improvements in the prediction performance of the drug-target interactions task. Nonetheless, the target encoders employed in previous studies exhibit computational complexity that increases quadratically with the input length, thereby limiting their practical utility. To overcome this challenge, we adopt a hint-based learning strategy to develop a compact and efficient target encoder. With the adaptation parameter, our model can blend general knowledge and target-oriented knowledge to build features of the protein sequences. This approach yielded considerable performance enhancements and improved learning efficiency on three benchmark datasets: BIOSNAP, DAVIS, and Binding DB. Furthermore, our methodology boasts the merit of necessitating only a minimal Video RAM (VRAM) allocation, specifically 7.7GB, during the training phase (16.24% of the previous state-of-the-art model). This ensures the feasibility of training and inference even with constrained computational resources.
Collapse
Affiliation(s)
- Jonghyun Lee
- Department of Medical and Digital Engineering, Hanyang University College of Engineering, 222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
| | - Dae Won Jun
- Department of Medical and Digital Engineering, Hanyang University College of Engineering, 222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
- Department of Internal Medicine, Hanyang University College of Medicine, 222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
| | - Ildae Song
- Department of Pharmaceutical Science and Technology, Kyungsung University, 309, Suyeong-ro, Nam-gu, Busan, 48434, Korea
| | - Yun Kim
- College of Pharmacy, Deagu Catholic University, 13-13, Hayang-ro, Hayang-eup, Gyeongsan-si, 38430, Gyeongsangbuk-do, Korea.
| |
Collapse
|
8
|
Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, Rollins N, Shaw A, Weitzman R, Frazer J, Dias M, Franceschi D, Orenbuch R, Gal Y, Marks DS. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.07.570727. [PMID: 38106144 PMCID: PMC10723403 DOI: 10.1101/2023.12.07.570727] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (eg., alignment-based, inverse folding) into a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures, model predictions and develop a user-friendly website that facilitates data access and analysis.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | - Ada Shaw
- Applied Mathematics, Harvard University
| | | | | | - Mafalda Dias
- Centre for Genomic Regulation, Universitat Pompeu Fabra
| | | | | | - Yarin Gal
- Computer Science, University of Oxford
| | | |
Collapse
|
9
|
Notin P, Marks DS, Weitzman R, Gal Y. ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.06.570473. [PMID: 38106034 PMCID: PMC10723423 DOI: 10.1101/2023.12.06.570473] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
Collapse
Affiliation(s)
| | | | | | - Yarin Gal
- Computer Science, University of Oxford
| |
Collapse
|
10
|
Chen J, Gu Z, Lai L, Pei J. In silico protein function prediction: the rise of machine learning-based approaches. MEDICAL REVIEW (2021) 2023; 3:487-510. [PMID: 38282798 PMCID: PMC10808870 DOI: 10.1515/mr-2023-0038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Accepted: 10/11/2023] [Indexed: 01/30/2024]
Abstract
Proteins function as integral actors in essential life processes, rendering the realm of protein research a fundamental domain that possesses the potential to propel advancements in pharmaceuticals and disease investigation. Within the context of protein research, an imperious demand arises to uncover protein functionalities and untangle intricate mechanistic underpinnings. Due to the exorbitant costs and limited throughput inherent in experimental investigations, computational models offer a promising alternative to accelerate protein function annotation. In recent years, protein pre-training models have exhibited noteworthy advancement across multiple prediction tasks. This advancement highlights a notable prospect for effectively tackling the intricate downstream task associated with protein function prediction. In this review, we elucidate the historical evolution and research paradigms of computational methods for predicting protein function. Subsequently, we summarize the progress in protein and molecule representation as well as feature extraction techniques. Furthermore, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, thereby offering a comprehensive perspective on the progress within this field.
Collapse
Affiliation(s)
- Jiaxiao Chen
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Zhonghui Gu
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Luhua Lai
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
- Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
| | - Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
| |
Collapse
|
11
|
Esmaili F, Pourmirzaei M, Ramazi S, Shojaeilangari S, Yavari E. A Review of Machine Learning and Algorithmic Methods for Protein Phosphorylation Site Prediction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:1266-1285. [PMID: 37863385 PMCID: PMC11082408 DOI: 10.1016/j.gpb.2023.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Revised: 01/16/2023] [Accepted: 03/23/2023] [Indexed: 10/22/2023]
Abstract
Post-translational modifications (PTMs) have key roles in extending the functional diversity of proteins and, as a result, regulating diverse cellular processes in prokaryotic and eukaryotic organisms. Phosphorylation modification is a vital PTM that occurs in most proteins and plays a significant role in many biological processes. Disorders in the phosphorylation process lead to multiple diseases, including neurological disorders and cancers. The purpose of this review is to organize this body of knowledge associated with phosphorylation site (p-site) prediction to facilitate future research in this field. At first, we comprehensively review all related databases and introduce all steps regarding dataset creation, data preprocessing, and method evaluation in p-site prediction. Next, we investigate p-site prediction methods, which are divided into two computational groups: algorithmic and machine learning (ML). Additionally, it is shown that there are basically two main approaches for p-site prediction by ML: conventional and end-to-end deep learning methods, both of which are given an overview. Moreover, this review introduces the most important feature extraction techniques, which have mostly been used in p-site prediction. Finally, we create three test sets from new proteins related to the released version of the database of protein post-translational modifications (dbPTM) in 2022 based on general and human species. Evaluating online p-site prediction tools on newly added proteins introduced in the dbPTM 2022 release, distinct from those in the dbPTM 2019 release, reveals their limitations. In other words, the actual performance of these online p-site prediction tools on unseen proteins is notably lower than the results reported in their respective research papers.
Collapse
Affiliation(s)
- Farzaneh Esmaili
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Mahdi Pourmirzaei
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Shahin Ramazi
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran 14115-111, Iran.
| | - Seyedehsamaneh Shojaeilangari
- Biomedical Engineering Group, Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology (IROST), Tehran 33535-111, Iran
| | - Elham Yavari
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| |
Collapse
|
12
|
Urhan A, Cosma BM, Earl AM, Manson AL, Abeel T. SAP: Synteny-aware gene function prediction for bacteria using protein embeddings. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.02.539034. [PMID: 37205418 PMCID: PMC10187222 DOI: 10.1101/2023.05.02.539034] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/21/2023]
Abstract
Motivation Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for prokaryotes. Recently, transformer-based language models - adopted from the natural language processing field - have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes. Results To predict gene functions in bacteria, we developed SAP, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAP also leverages the unique operon structure of bacteria through conserved synteny. SAP outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAP to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.
Collapse
Affiliation(s)
- Aysun Urhan
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Broekmanweg 6, 2628 XE, Delft, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
| | - Bianca-Maria Cosma
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Broekmanweg 6, 2628 XE, Delft, The Netherlands
| | - Ashlee M. Earl
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
| | - Abigail L. Manson
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
| | - Thomas Abeel
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Broekmanweg 6, 2628 XE, Delft, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
| |
Collapse
|
13
|
Avraham O, Tsaban T, Ben-Aharon Z, Tsaban L, Schueler-Furman O. Protein language models can capture protein quaternary state. BMC Bioinformatics 2023; 24:433. [PMID: 37964216 PMCID: PMC10647083 DOI: 10.1186/s12859-023-05549-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 10/27/2023] [Indexed: 11/16/2023] Open
Abstract
BACKGROUND Determining a protein's quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction. RESULTS We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings. CONCLUSIONS QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab. RESEARCH google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb .
Collapse
Affiliation(s)
- Orly Avraham
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Tomer Tsaban
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Ziv Ben-Aharon
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Linoy Tsaban
- Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
| | - Ora Schueler-Furman
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel.
| |
Collapse
|
14
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Collapse
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA;
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA;
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA;
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China;
| |
Collapse
|
15
|
Riedl M, Mukherjee S, Gauthier M. Descriptor-Free Deep Learning QSAR Model for the Fraction Unbound in Human Plasma. Mol Pharm 2023; 20:4984-4993. [PMID: 37656906 DOI: 10.1021/acs.molpharmaceut.3c00129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/03/2023]
Abstract
Chemical-specific parameters are either measured in vitro or estimated using quantitative structure-activity relationship (QSAR) models. The existing body of QSAR work relies on extracting a set of descriptors or fingerprints, subset selection, and training a machine learning model. In this work, we used a state-of-the-art natural language processing model, Bidirectional Encoder Representations from Transformers, which allowed us to circumvent the need for calculation of these chemical descriptors. In this approach, simplified molecular-input line-entry system (SMILES) strings were embedded in a high-dimensional space using a two-stage training approach. The model was first pre-trained on a masked SMILES token task and then fine-tuned on a QSAR prediction task. The pre-training task learned meaningful high-dimensional embeddings based upon the relationships between the chemical tokens in the SMILES strings derived from the "in-stock" portion of the ZINC 15 dataset─a large dataset of commercially available chemicals. The fine-tuning task then perturbed the pre-trained embeddings to facilitate prediction of a specific QSAR endpoint of interest. The power of this model stems from the ability to reuse the pre-trained model for multiple different fine-tuning tasks, reducing the computational burden of developing multiple models for different endpoints. We used our framework to develop a predictive model for fraction unbound in human plasma (fu,p). This approach is flexible, requires minimum domain expertise, and can be generalized for other parameters of interest for rapid and accurate estimation of absorption, distribution, metabolism, excretion, and toxicity.
Collapse
|
16
|
Huang Y, Huang HY, Chen Y, Lin YCD, Yao L, Lin T, Leng J, Chang Y, Zhang Y, Zhu Z, Ma K, Cheng YN, Lee TY, Huang HD. A Robust Drug-Target Interaction Prediction Framework with Capsule Network and Transfer Learning. Int J Mol Sci 2023; 24:14061. [PMID: 37762364 PMCID: PMC10531393 DOI: 10.3390/ijms241814061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Revised: 08/27/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023] Open
Abstract
Drug-target interactions (DTIs) are considered a crucial component of drug design and drug discovery. To date, many computational methods were developed for drug-target interactions, but they are insufficiently informative for accurately predicting DTIs due to the lack of experimentally verified negative datasets, inaccurate molecular feature representation, and ineffective DTI classifiers. Therefore, we address the limitations of randomly selecting negative DTI data from unknown drug-target pairs by establishing two experimentally validated datasets and propose a capsule network-based framework called CapBM-DTI to capture hierarchical relationships of drugs and targets, which adopts pre-trained bidirectional encoder representations from transformers (BERT) for contextual sequence feature extraction from target proteins through transfer learning and the message-passing neural network (MPNN) for the 2-D graph feature extraction of compounds to accurately and robustly identify drug-target interactions. We compared the performance of CapBM-DTI with state-of-the-art methods using four experimentally validated DTI datasets of different sizes, including human (Homo sapiens) and worm (Caenorhabditis elegans) species datasets, as well as three subsets (new compounds, new proteins, and new pairs). Our results demonstrate that the proposed model achieved robust performance and powerful generalization ability in all experiments. The case study on treating COVID-19 demonstrates the applicability of the model in virtual screening.
Collapse
Affiliation(s)
- Yixian Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Hsi-Yuan Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yigang Chen
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yang-Chi-Dung Lin
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Lantian Yao
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Tianxiu Lin
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Junlin Leng
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yuan Chang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yuntian Zhang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Zihao Zhu
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Kun Ma
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yeong-Nan Cheng
- Institute of Bioinformatics and Systems Biology, Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (Y.-N.C.)
| | - Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (Y.-N.C.)
| | - Hsien-Da Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| |
Collapse
|
17
|
Lee CH, Huh J, Buckley PR, Jang M, Pinho MP, Fernandes RA, Antanaviciute A, Simmons A, Koohy H. A robust deep learning workflow to predict CD8 + T-cell epitopes. Genome Med 2023; 15:70. [PMID: 37705109 PMCID: PMC10498576 DOI: 10.1186/s13073-023-01225-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2023] [Accepted: 08/30/2023] [Indexed: 09/15/2023] Open
Abstract
BACKGROUND T-cells play a crucial role in the adaptive immune system by triggering responses against cancer cells and pathogens, while maintaining tolerance against self-antigens, which has sparked interest in the development of various T-cell-focused immunotherapies. However, the identification of antigens recognised by T-cells is low-throughput and laborious. To overcome some of these limitations, computational methods for predicting CD8 + T-cell epitopes have emerged. Despite recent developments, most immunogenicity algorithms struggle to learn features of peptide immunogenicity from small datasets, suffer from HLA bias and are unable to reliably predict pathology-specific CD8 + T-cell epitopes. METHODS We developed TRAP (T-cell recognition potential of HLA-I presented peptides), a robust deep learning workflow for predicting CD8 + T-cell epitopes from MHC-I presented pathogenic and self-peptides. TRAP uses transfer learning, deep learning architecture and MHC binding information to make context-specific predictions of CD8 + T-cell epitopes. TRAP also detects low-confidence predictions for peptides that differ significantly from those in the training datasets to abstain from making incorrect predictions. To estimate the immunogenicity of pathogenic peptides with low-confidence predictions, we further developed a novel metric, RSAT (relative similarity to autoantigens and tumour-associated antigens), as a complementary to 'dissimilarity to self' from cancer studies. RESULTS TRAP was used to identify epitopes from glioblastoma patients as well as SARS-CoV-2 peptides, and it outperformed other algorithms in both cancer and pathogenic settings. TRAP was especially effective at extracting immunogenicity-associated properties from restricted data of emerging pathogens and translating them onto related species, as well as minimising the loss of likely epitopes in imbalanced datasets. We also demonstrated that the novel metric termed RSAT was able to estimate immunogenic of pathogenic peptides of various lengths and species. TRAP implementation is available at: https://github.com/ChloeHJ/TRAP . CONCLUSIONS This study presents a novel computational workflow for accurately predicting CD8 + T-cell epitopes to foster a better understanding of antigen-specific T-cell response and the development of effective clinical therapeutics.
Collapse
Affiliation(s)
- Chloe H Lee
- MRC Human Immunology Unit, Medical Research Council (MRC) Weatherall Institute of Molecular Medicine (WIMM), John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
| | - Jaesung Huh
- Visual Geometry Group, Department of Engineering Science, University of Oxford, Oxford, OX2 6NN, UK
| | - Paul R Buckley
- MRC Human Immunology Unit, Medical Research Council (MRC) Weatherall Institute of Molecular Medicine (WIMM), John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
| | - Myeongjun Jang
- Intelligent Systems Lab, Department of Computer Science, University of Oxford, Oxford, OX1 3QG, UK
| | - Mariana Pereira Pinho
- MRC Human Immunology Unit, Medical Research Council (MRC) Weatherall Institute of Molecular Medicine (WIMM), John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
| | - Ricardo A Fernandes
- Chinese Academy of Medical Sciences (CAMS) Oxford Institute (COI), University of Oxford, Oxford, OX3 7BN, UK
| | - Agne Antanaviciute
- MRC Human Immunology Unit, Medical Research Council (MRC) Weatherall Institute of Molecular Medicine (WIMM), John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
| | - Alison Simmons
- MRC Human Immunology Unit, Medical Research Council (MRC) Weatherall Institute of Molecular Medicine (WIMM), John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK
- Translational Gastroenterology Unit, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Hashem Koohy
- MRC Human Immunology Unit, Medical Research Council (MRC) Weatherall Institute of Molecular Medicine (WIMM), John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK.
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DS, UK.
- Alan Turning Fellow in Health and Medicine, The Alan Turing Institute, London, UK.
| |
Collapse
|
18
|
Brandes N, Goldman G, Wang CH, Ye CJ, Ntranos V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet 2023; 55:1512-1522. [PMID: 37563329 PMCID: PMC10484790 DOI: 10.1038/s41588-023-01465-0] [Citation(s) in RCA: 62] [Impact Index Per Article: 62.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2022] [Accepted: 07/05/2023] [Indexed: 08/12/2023]
Abstract
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
Collapse
Affiliation(s)
- Nadav Brandes
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
| | - Grant Goldman
- Biological and Medical Informatics Graduate Program, University of California, San Francisco, San Francisco, CA, USA
| | - Charlotte H Wang
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA, USA
| | - Chun Jimmie Ye
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
- Parker Institute for Cancer Immunotherapy, University of California, San Francisco, San Francisco, CA, USA.
- Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA, USA.
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA.
- Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA.
| | - Vasilis Ntranos
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
- Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA.
- Diabetes Center, University of California, San Francisco, San Francisco, CA, USA.
| |
Collapse
|
19
|
Zhao Y, He B, Xu F, Li C, Xu Z, Su X, He H, Huang Y, Rossjohn J, Song J, Yao J. DeepAIR: A deep learning framework for effective integration of sequence and 3D structure to enable adaptive immune receptor analysis. SCIENCE ADVANCES 2023; 9:eabo5128. [PMID: 37556545 PMCID: PMC10411891 DOI: 10.1126/sciadv.abo5128] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Accepted: 07/06/2023] [Indexed: 08/11/2023]
Abstract
Structural docking between the adaptive immune receptors (AIRs), including T cell receptors (TCRs) and B cell receptors (BCRs), and their cognate antigens are one of the most fundamental processes in adaptive immunity. However, current methods for predicting AIR-antigen binding largely rely on sequence-derived features of AIRs, omitting the structure features that are essential for binding affinity. In this study, we present a deep learning framework, termed DeepAIR, for the accurate prediction of AIR-antigen binding by integrating both sequence and structure features of AIRs. DeepAIR achieves a Pearson's correlation of 0.813 in predicting the binding affinity of TCR, and a median area under the receiver-operating characteristic curve (AUC) of 0.904 and 0.942 in predicting the binding reactivity of TCR and BCR, respectively. Meanwhile, using TCR and BCR repertoire, DeepAIR correctly identifies every patient with nasopharyngeal carcinoma and inflammatory bowel disease in test data. Thus, DeepAIR improves the AIR-antigen binding prediction that facilitates the study of adaptive immunity.
Collapse
Affiliation(s)
- Yu Zhao
- AI Lab, Tencent, Shenzhen, China
| | - Bing He
- AI Lab, Tencent, Shenzhen, China
| | - Fan Xu
- AI Lab, Tencent, Shenzhen, China
| | - Chen Li
- Biomedicine Discovery Institute and Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | | | | | | | | | - Jamie Rossjohn
- Infection and Immunity Program and Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
- Institute of Infection and Immunity, Cardiff University School of Medicine, Heath Park, Cardiff, UK
| | - Jiangning Song
- AI Lab, Tencent, Shenzhen, China
- Biomedicine Discovery Institute and Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | | |
Collapse
|
20
|
Ghazikhani H, Butler G. Enhanced identification of membrane transport proteins: a hybrid approach combining ProtBERT-BFD and convolutional neural networks. J Integr Bioinform 2023; 0:jib-2022-0055. [PMID: 37497772 PMCID: PMC10389051 DOI: 10.1515/jib-2022-0055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2022] [Accepted: 06/21/2023] [Indexed: 07/28/2023] Open
Abstract
Transmembrane transport proteins (transporters) play a crucial role in the fundamental cellular processes of all organisms by facilitating the transport of hydrophilic substrates across hydrophobic membranes. Despite the availability of numerous membrane protein sequences, their structures and functions remain largely elusive. Recently, natural language processing (NLP) techniques have shown promise in the analysis of protein sequences. Bidirectional Encoder Representations from Transformers (BERT) is an NLP technique adapted for proteins to learn contextual embeddings of individual amino acids within a protein sequence. Our previous strategy, TooT-BERT-T, differentiated transporters from non-transporters by employing a logistic regression classifier with fine-tuned representations from ProtBERT-BFD. In this study, we expand upon this approach by utilizing representations from ProtBERT, ProtBERT-BFD, and MembraneBERT in combination with classical classifiers. Additionally, we introduce TooT-BERT-CNN-T, a novel method that fine-tunes ProtBERT-BFD and discriminates transporters using a Convolutional Neural Network (CNN). Our experimental results reveal that CNN surpasses traditional classifiers in discriminating transporters from non-transporters, achieving an MCC of 0.89 and an accuracy of 95.1 % on the independent test set. This represents an improvement of 0.03 and 1.11 percentage points compared to TooT-BERT-T, respectively.
Collapse
Affiliation(s)
- Hamed Ghazikhani
- Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
| | - Gregory Butler
- Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
| |
Collapse
|
21
|
Li Y, Yao Y, Xia Y, Tang M. Searching for protein variants with desired properties using deep generative models. BMC Bioinformatics 2023; 24:297. [PMID: 37480001 PMCID: PMC10362698 DOI: 10.1186/s12859-023-05415-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2022] [Accepted: 07/17/2023] [Indexed: 07/23/2023] Open
Abstract
BACKGROUND Protein engineering aims to improve the functional properties of existing proteins to meet people's needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences in the homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly from the vicinity of better-performing varieties. RESULTS To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence. CONCLUSION Compared to other models, the person correlation coefficient between the predicted values of protein fitness obtained by T-VAE and the truth values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model's generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model.
Collapse
Affiliation(s)
- Yan Li
- School of Information, Yunnan Normal University, Kunming, China
| | - Yinying Yao
- National Key Laboratory of Crop Genetic Improvement and National Centre of Plant Gene Research, Huazhong Agricultural University, Wuhan, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Yu Xia
- School of Information, Yunnan Normal University, Kunming, China
| | - Mingjing Tang
- Engineering Research Center of Sustainable Development and Utilization of Biomass Energy, Ministry of Education, Yunnan Normal University, Kunming, China.
- School of Life Science, Yunnan Normal University, Kunming, China.
| |
Collapse
|
22
|
Oliveira GB, Pedrini H, Dias Z. TEMPROT: protein function annotation using transformers embeddings and homology search. BMC Bioinformatics 2023; 24:242. [PMID: 37291492 DOI: 10.1186/s12859-023-05375-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Accepted: 06/02/2023] [Indexed: 06/10/2023] Open
Abstract
BACKGROUND Although the development of sequencing technologies has provided a large number of protein sequences, the analysis of functions that each one plays is still difficult due to the efforts of laboratorial methods, making necessary the usage of computational methods to decrease this gap. As the main source of information available about proteins is their sequences, approaches that can use this information, such as classification based on the patterns of the amino acids and the inference based on sequence similarity using alignment tools, are able to predict a large collection of proteins. The methods available in the literature that use this type of feature can achieve good results, however, they present restrictions of protein length as input to their models. In this work, we present a new method, called TEMPROT, based on the fine-tuning and extraction of embeddings from an available architecture pre-trained on protein sequences. We also describe TEMPROT+, an ensemble between TEMPROT and BLASTp, a local alignment tool that analyzes sequence similarity, which improves the results of our former approach. RESULTS The evaluation of our proposed classifiers with the literature approaches has been conducted on our dataset, which was derived from CAFA3 challenge database. Both TEMPROT and TEMPROT+ achieved competitive results on [Formula: see text], [Formula: see text], AuPRC and IAuPRC metrics on Biological Process (BP), Cellular Component (CC) and Molecular Function (MF) ontologies compared to state-of-the-art models, with the main results equal to 0.581, 0.692 and 0.662 of [Formula: see text] on BP, CC and MF, respectively. CONCLUSIONS The comparison with the literature showed that our model presented competitive results compared the state-of-the-art approaches considering the amino acid sequence pattern recognition and homology analysis. Our model also presented improvements related to the input size that the model can use to train compared to the literature methods.
Collapse
Affiliation(s)
| | - Helio Pedrini
- Institute of Computing, University of Campinas, Campinas, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
23
|
Rios-Martinez C, Bhattacharya N, Amini AP, Crawford L, Yang KK. Deep self-supervised learning for biosynthetic gene cluster detection and product classification. PLoS Comput Biol 2023; 19:e1011162. [PMID: 37220151 DOI: 10.1371/journal.pcbi.1011162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2022] [Accepted: 05/07/2023] [Indexed: 05/25/2023] Open
Abstract
Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.
Collapse
Affiliation(s)
- Carolina Rios-Martinez
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
- Department of Bioengineering, Stanford University, Stanford, California, United States of America
| | - Nicholas Bhattacharya
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
- Department of Mathematics, University of California, Berkeley, Berkeley, California, United States of America
| | - Ava P Amini
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
| | - Lorin Crawford
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
| | - Kevin K Yang
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
| |
Collapse
|
24
|
Kim Y, Kwon J. AttSec: protein secondary structure prediction by capturing local patterns from attention map. BMC Bioinformatics 2023; 24:183. [PMID: 37142993 PMCID: PMC10161504 DOI: 10.1186/s12859-023-05310-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2023] [Accepted: 04/27/2023] [Indexed: 05/06/2023] Open
Abstract
BACKGROUND Protein secondary structures that link simple 1D sequences to complex 3D structures can be used as good features for describing the local properties of protein, but also can serve as key features for predicting the complex 3D structures of protein. Thus, it is very important to accurately predict the secondary structure of the protein, which contains a local structural property assigned by the pattern of hydrogen bonds formed between amino acids. In this study, we accurately predict protein secondary structure by capturing the local patterns of protein. For this objective, we present a novel prediction model, AttSec, based on transformer architecture. In particular, AttSec extracts self-attention maps corresponding to pairwise features between amino acid embeddings and passes them through 2D convolution blocks to capture local patterns. In addition, instead of using additional evolutionary information, it uses protein embedding as an input, which is generated by a language model. RESULTS For the ProteinNet DSSP8 dataset, our model showed 11.8% better performance on the entire evaluation datasets compared with other no-evolutionary-information-based models. For the NetSurfP-2.0 DSSP8 dataset, it showed 1.2% better performance on average. There was an average performance improvement of 9.0% for the ProteinNet DSSP3 dataset and an average of 0.7% for the NetSurfP-2.0 DSSP3 dataset. CONCLUSION We accurately predict protein secondary structure by capturing the local patterns of protein. For this objective, we present a novel prediction model, AttSec, based on transformer architecture. Although there was no dramatic accuracy improvement compared with other models, the improvement on DSSP8 was greater than that on DSSP3. This result implies that using our proposed pairwise feature could have a remarkable effect for several challenging tasks that require finely subdivided classification. Github package URL is https://github.com/youjin-DDAI/AttSec .
Collapse
Affiliation(s)
- Youjin Kim
- Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea
- LG AI Research, Seoul, Republic of Korea
| | - Junseok Kwon
- Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea.
| |
Collapse
|
25
|
Jha K, Karmakar S, Saha S. Graph-BERT and language model-based framework for protein-protein interaction identification. Sci Rep 2023; 13:5663. [PMID: 37024543 PMCID: PMC10079975 DOI: 10.1038/s41598-023-31612-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2022] [Accepted: 03/14/2023] [Indexed: 04/08/2023] Open
Abstract
Identification of protein-protein interactions (PPI) is among the critical problems in the domain of bioinformatics. Previous studies have utilized different AI-based models for PPI classification with advances in artificial intelligence (AI) techniques. The input to these models is the features extracted from different sources of protein information, mainly sequence-derived features. In this work, we present an AI-based PPI identification model utilizing a PPI network and protein sequences. The PPI network is represented as a graph where each node is a protein pair, and an edge is defined between two nodes if there exists a common protein between these nodes. Each node in a graph has a feature vector. In this work, we have used the language model to extract feature vectors directly from protein sequences. The feature vectors for protein in pairs are concatenated and used as a node feature vector of a PPI network graph. Finally, we have used the Graph-BERT model to encode the PPI network graph with sequence-based features and learn the hidden representation of the feature vector for each node. The next step involves feeding the learned representations of nodes to the fully connected layer, the output of which is fed into the softmax layer to classify the protein interactions. To assess the efficacy of the proposed PPI model, we have performed experiments on several PPI datasets. The experimental results demonstrate that the proposed approach surpasses the existing PPI works and designed baselines in classifying PPI.
Collapse
Affiliation(s)
- Kanchan Jha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India.
| | - Sourav Karmakar
- Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, West Bengal, 713209, India
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
| |
Collapse
|
26
|
Wang S, You R, Liu Y, Xiong Y, Zhu S. NetGO 3.0: Protein Language Model Improves Large-scale Functional Annotations. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:349-358. [PMID: 37075830 PMCID: PMC10626176 DOI: 10.1016/j.gpb.2023.04.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/17/2022] [Revised: 02/24/2023] [Accepted: 04/07/2023] [Indexed: 04/21/2023]
Abstract
As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve the performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved comparable performance with the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.
Collapse
Affiliation(s)
- Shaojun Wang
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
| | - Ronghui You
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
| | - Yunjia Liu
- School of Life Sciences, Fudan University, Shanghai 200433, China
| | - Yi Xiong
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
| | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China; Shanghai Qi Zhi Institute, Shanghai 200030, China; MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Shanghai Key Laboratory of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai 200433, China; Zhangjiang Fudan International Innovation Center, Shanghai 200433, China.
| |
Collapse
|
27
|
Deep learning drives efficient discovery of novel antihypertensive peptides from soybean protein isolate. Food Chem 2023; 404:134690. [DOI: 10.1016/j.foodchem.2022.134690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2022] [Revised: 09/29/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022]
|
28
|
Boadu F, Cao H, Cheng J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.17.524477. [PMID: 36711471 PMCID: PMC9882282 DOI: 10.1101/2023.01.17.524477] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
Abstract
Motivation Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, much fewer methods leverage protein structures in protein function prediction because there was lack of accurate protein structures for most proteins until recently. Results We developed TransFun - a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy. Availability The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun. Contact chengji@missouri.edu.
Collapse
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Hongyuan Cao
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA,Contact: To whom correspondence should be addressed.
| |
Collapse
|
29
|
Chandra A, Tünnermann L, Löfstedt T, Gratz R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife 2023; 12:82819. [PMID: 36651724 PMCID: PMC9848389 DOI: 10.7554/elife.82819] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2022] [Accepted: 01/06/2023] [Indexed: 01/19/2023] Open
Abstract
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on a particular model-the Transformer model. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have quickly proven to be a very promising way to unravel information hidden in the sequences of amino acids.
Collapse
Affiliation(s)
- Abel Chandra
- Department of Computing Science, Umeå UniversityUmeåSweden
| | - Laura Tünnermann
- Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural SciencesUmeåSweden
| | - Tommy Löfstedt
- Department of Computing Science, Umeå UniversityUmeåSweden
| | - Regina Gratz
- Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural SciencesUmeåSweden
- Department of Forest Ecology and Management, Swedish University of Agricultural SciencesUmeåSweden
| |
Collapse
|
30
|
Konno N, Iwasaki W. Machine learning enables prediction of metabolic system evolution in bacteria. SCIENCE ADVANCES 2023; 9:eadc9130. [PMID: 36630500 PMCID: PMC9833677 DOI: 10.1126/sciadv.adc9130] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2023]
Abstract
Evolution prediction is a long-standing goal in evolutionary biology, with potential impacts on strategic pathogen control, genome engineering, and synthetic biology. While laboratory evolution studies have shown the predictability of short-term and sequence-level evolution, that of long-term and system-level evolution has not been systematically examined. Here, we show that the gene content evolution of metabolic systems is generally predictable by applying ancestral gene content reconstruction and machine learning techniques to ~3000 bacterial genomes. Our framework, Evodictor, successfully predicted gene gain and loss evolution at the branches of the reference phylogenetic tree, suggesting that evolutionary pressures and constraints on metabolic systems are universally shared. Investigation of pathway architectures and meta-analysis of metagenomic datasets confirmed that these evolutionary patterns have physiological and ecological bases as functional dependencies among metabolic reactions and bacterial habitat changes. Last, pan-genomic analysis of intraspecies gene content variations proved that even "ongoing" evolution in extant bacterial species is predictable in our framework.
Collapse
Affiliation(s)
- Naoki Konno
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Bunkyo-ku, Tokyo 113-0032, Japan
- Corresponding author. (N.K.); (W.I.)
| | - Wataru Iwasaki
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Bunkyo-ku, Tokyo 113-0032, Japan
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-0882, Japan
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-0882, Japan
- Atmosphere and Ocean Research Institute, The University of Tokyo, Kashiwa, Chiba 277-0882, Japan
- Institute for Quantitative Biosciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-0032, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-0032, Japan
- Corresponding author. (N.K.); (W.I.)
| |
Collapse
|
31
|
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681 DOI: 10.1515/sagmb-2022-0057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2022] [Accepted: 04/20/2023] [Indexed: 09/03/2023]
Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein Language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years processing the protein sequences. In this paper, we propose a hybrid CNN + BiGRU - Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. The proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO for the cellular component prediction task by 1.9 %, for the molecular function prediction task by 3.8 % and for the biological process prediction task by 0.6 % for human dataset and for yeast dataset the cellular component prediction task by 2.4 %, for the molecular function prediction task by 5.2 % and for the biological process prediction task by 1.2 %.
Collapse
Affiliation(s)
- Lavkush Sharma
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Akshay Deepak
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Ashish Ranjan
- Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India
| | | |
Collapse
|
32
|
Lim H, No KT. Prediction of polyreactive and nonspecific single-chain fragment variables through structural biochemical features and protein language-based descriptors. BMC Bioinformatics 2022; 23:520. [PMID: 36471239 PMCID: PMC9720949 DOI: 10.1186/s12859-022-05010-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2022] [Accepted: 10/26/2022] [Indexed: 12/09/2022] Open
Abstract
BACKGROUND Monoclonal antibodies (mAbs) have been used as therapeutic agents, which must overcome many developability issues after the discovery from in vitro display libraries. Especially, polyreactive mAbs can strongly bind to a specific target and weakly bind to off-target proteins, which leads to poor antibody pharmacokinetics in clinical development. Although early assessment of polyreactive mAbs is important in the early discovery stage, experimental assessments are usually time-consuming and expensive. Therefore, computational approaches for predicting the polyreactivity of single-chain fragment variables (scFvs) in the early discovery stage would be promising for reducing experimental efforts. RESULTS Here, we made prediction models for the polyreactivity of scFvs with the known polyreactive antibody features and natural language model descriptors. We predicted 19,426 protein structures of scFvs with trRosetta to calculate the polyreactive antibody features and investigated the classifying performance of each factor for polyreactivity. In the known polyreactive features, the net charge of the CDR2 loop, the tryptophan and glycine residues in CDR-H3, and the lengths of the CDR1 and CDR2 loops, importantly contributed to the performance of the models. Additionally, the hydrodynamic features, such as partial specific volume, gyration radius, and isoelectric points of CDR loops and scFvs, were newly added to improve model performance. Finally, we made the prediction model with a robust performance ([Formula: see text]) with an ensemble learning of the top 3 best models. CONCLUSION The prediction models for polyreactivity would help assess polyreactive scFvs in the early discovery stage and our approaches would be promising to develop machine learning models with quantitative data from high throughput assays for antibody screening.
Collapse
Affiliation(s)
- Hocheol Lim
- grid.15444.300000 0004 0470 5454The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon, 21983 Republic of Korea ,Bioinformatics and Molecular Design Research Center (BMDRC), Incheon, 21983 Republic of Korea
| | - Kyoung Tai No
- grid.15444.300000 0004 0470 5454The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon, 21983 Republic of Korea ,Bioinformatics and Molecular Design Research Center (BMDRC), Incheon, 21983 Republic of Korea ,Baobab AiBIO Co., Ltd., Incheon, 21983 Republic of Korea
| |
Collapse
|
33
|
Anteghini M, Haja A, Martins dos Santos VA, Schomaker L, Saccenti E. OrganelX web server for sub-peroxisomal and sub-mitochondrial protein localization and peroxisomal target signal detection. Comput Struct Biotechnol J 2022; 21:128-133. [PMID: 36544474 PMCID: PMC9747352 DOI: 10.1016/j.csbj.2022.11.058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Revised: 11/28/2022] [Accepted: 11/28/2022] [Indexed: 12/12/2022] Open
Abstract
We present the OrganelX e-Science Web Server that provides a user-friendly implementation of the In-Pero and In-Mito classifiers for sub-peroxisomal and sub-mitochondrial localization of peroxisomal and mitochondrial proteins and the Is-PTS1 algorithm for detecting and validating potential peroxisomal proteins carrying a PTS1 signal sequence. The OrganelX e-Science Web Server is available at https://organelx.hpc.rug.nl/fasta/.
Collapse
Affiliation(s)
- Marco Anteghini
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands,LifeGlimmer GmbH, Berlin, Germany,Corresponding author at: Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands (M. Anteghini).
| | - Asmaa Haja
- Bernoulli Institute, University of Groningen, Groningen, The Netherlands
| | - Vitor A.P. Martins dos Santos
- LifeGlimmer GmbH, Berlin, Germany,Bioprocess Engineering, Wageningen University & Research, Wageningen, The Netherlands
| | - Lambert Schomaker
- Bernoulli Institute, University of Groningen, Groningen, The Netherlands
| | - Edoardo Saccenti
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands,Corresponding author at: Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands (M. Anteghini).
| |
Collapse
|
34
|
Ismi DP, Pulungan R, Afiahayati. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold. Comput Struct Biotechnol J 2022; 20:6271-6286. [PMID: 36420164 PMCID: PMC9678802 DOI: 10.1016/j.csbj.2022.11.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/13/2022] Open
Abstract
This paper aims to provide a comprehensive review of the trends and challenges of deep neural networks for protein secondary structure prediction (PSSP). In recent years, deep neural networks have become the primary method for protein secondary structure prediction. Previous studies showed that deep neural networks had uplifted the accuracy of three-state secondary structure prediction to more than 80%. Favored deep learning methods, such as convolutional neural networks, recurrent neural networks, inception networks, and graph neural networks, have been implemented in protein secondary structure prediction. Methods adapted from natural language processing (NLP) and computer vision are also employed, including attention mechanism, ResNet, and U-shape networks. In the post-AlphaFold era, PSSP studies focus on different objectives, such as enhancing the quality of evolutionary information and exploiting protein language models as the PSSP input. The recent trend to utilize pre-trained language models as input features for secondary structure prediction provides a new direction for PSSP studies. Moreover, the state-of-the-art accuracy achieved by previous PSSP models is still below its theoretical limit. There are still rooms for improvement to be made in the field.
Collapse
Affiliation(s)
- Dewi Pramudi Ismi
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Infomatics, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
| | - Reza Pulungan
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| | - Afiahayati
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| |
Collapse
|
35
|
An J, Weng X. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics 2022; 23:467. [DOI: 10.1186/s12859-022-05031-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Accepted: 10/31/2022] [Indexed: 11/10/2022] Open
Abstract
AbstractPre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
Collapse
|
36
|
Lupo U, Sgarbossa D, Bitbol AF. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat Commun 2022; 13:6298. [PMID: 36273003 PMCID: PMC9588007 DOI: 10.1038/s41467-022-34032-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2022] [Accepted: 10/07/2022] [Indexed: 12/25/2022] Open
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
Collapse
Affiliation(s)
- Umberto Lupo
- grid.5333.60000000121839049Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland ,grid.419765.80000 0001 2223 3006SIB Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
| | - Damiano Sgarbossa
- grid.5333.60000000121839049Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland ,grid.419765.80000 0001 2223 3006SIB Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
| | - Anne-Florence Bitbol
- grid.5333.60000000121839049Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland ,grid.419765.80000 0001 2223 3006SIB Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
| |
Collapse
|
37
|
Salem M, Keshavarzi Arshadi A, Yuan JS. AMPDeep: hemolytic activity prediction of antimicrobial peptides using transfer learning. BMC Bioinformatics 2022; 23:389. [PMID: 36163001 PMCID: PMC9511757 DOI: 10.1186/s12859-022-04952-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2022] [Accepted: 09/20/2022] [Indexed: 12/02/2022] Open
Abstract
Background Deep learning’s automatic feature extraction has proven to give superior performance in many sequence classification tasks. However, deep learning models generally require a massive amount of data to train, which in the case of Hemolytic Activity Prediction of Antimicrobial Peptides creates a challenge due to the small amount of available data. Results Three different datasets for hemolysis activity prediction of therapeutic and antimicrobial peptides are gathered and the AMPDeep pipeline is implemented for each. The result demonstrate that AMPDeep outperforms the previous works on all three datasets, including works that use physicochemical features to represent the peptides or those who solely rely on the sequence and use deep learning to learn representation for the peptides. Moreover, a combined dataset is introduced for hemolytic activity prediction to address the problem of sequence similarity in this domain. AMPDeep fine-tunes a large transformer based model on a small amount of peptides and successfully leverages the patterns learned from other protein and peptide databases to assist hemolysis activity prediction modeling. Conclusions In this work transfer learning is leveraged to overcome the challenge of small data and a deep learning based model is successfully adopted for hemolysis activity classification of antimicrobial peptides. This model is first initialized as a protein language model which is pre-trained on masked amino acid prediction on many unlabeled protein sequences in a self-supervised manner. Having done so, the model is fine-tuned on an aggregated dataset of labeled peptides in a supervised manner to predict secretion. Through transfer learning, hyper-parameter optimization and selective fine-tuning, AMPDeep is able to achieve state-of-the-art performance on three hemolysis datasets using only the sequence of the peptides. This work assists the adoption of large sequence-based models for peptide classification and modeling tasks in a practical manner.
Collapse
Affiliation(s)
- Milad Salem
- Electrical and Computer Engineering Department, University of Central Florida, Orlando, FL, USA.
| | | | - Jiann Shiun Yuan
- Electrical and Computer Engineering Department, University of Central Florida, Orlando, FL, USA
| |
Collapse
|
38
|
Bernhofer M, Rost B. TMbed: transmembrane proteins predicted through language model embeddings. BMC Bioinformatics 2022; 23:326. [PMID: 35941534 PMCID: PMC9358067 DOI: 10.1186/s12859-022-04873-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2022] [Accepted: 08/03/2022] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions. RESULTS Here, we present TMbed, a novel method inputting embeddings from protein Language Models (pLMs, here ProtT5), to predict for each residue one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine at performance levels similar or better than methods, which are using evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060). CONCLUSIONS Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.
Collapse
Affiliation(s)
- Michael Bernhofer
- Department of Informatics, Bioinformatics and Computational Biology ‑ i12, Technical University of Munich (TUM), Boltzmannstr. 3, 85748, Garching, Germany. .,TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.
| | - Burkhard Rost
- Department of Informatics, Bioinformatics and Computational Biology ‑ i12, Technical University of Munich (TUM), Boltzmannstr. 3, 85748, Garching, Germany.,Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching, Germany.,TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| |
Collapse
|
39
|
Tubiana J, Schneidman-Duhovny D, Wolfson HJ. ScanNet: A web server for structure-based prediction of protein binding sites with geometric deep learning. J Mol Biol 2022; 434:167758. [DOI: 10.1016/j.jmb.2022.167758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2022] [Revised: 07/18/2022] [Accepted: 07/19/2022] [Indexed: 11/28/2022]
|
40
|
Prediction of protein-protein interaction using graph neural networks. Sci Rep 2022; 12:8360. [PMID: 35589837 PMCID: PMC9120162 DOI: 10.1038/s41598-022-12201-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2022] [Accepted: 04/18/2022] [Indexed: 01/09/2023] Open
Abstract
Proteins are the essential biological macromolecules required to perform nearly all biological processes, and cellular functions. Proteins rarely carry out their tasks in isolation but interact with other proteins (known as protein–protein interaction) present in their surroundings to complete biological activities. The knowledge of protein–protein interactions (PPIs) unravels the cellular behavior and its functionality. The computational methods automate the prediction of PPI and are less expensive than experimental methods in terms of resources and time. So far, most of the works on PPI have mainly focused on sequence information. Here, we use graph convolutional network (GCN) and graph attention network (GAT) to predict the interaction between proteins by utilizing protein’s structural information and sequence features. We build the graphs of proteins from their PDB files, which contain 3D coordinates of atoms. The protein graph represents the amino acid network, also known as residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within the threshold distance. To extract the node/residue features, we use the protein language model. The input to the language model is the protein sequence, and the output is the feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. Obtained results demonstrate the effectiveness of the proposed approach as it outperforms the previous leading methods. The source code for training and data to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git.
Collapse
|
41
|
Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment. Sci Rep 2022; 12:7607. [PMID: 35534620 PMCID: PMC9085874 DOI: 10.1038/s41598-022-11684-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2022] [Accepted: 04/25/2022] [Indexed: 11/09/2022] Open
Abstract
Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction tasks such as biophysical, structural, and functional properties. Here we show that a method called SPOT-1D-LM combines traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) for the input and yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM and CASP14-FM). More significantly, it has a performance comparable to profile-based methods for those proteins with homologous sequences. For example, the accuracy for three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% by SPOT-1D-LM, compared to 74.3% and 73.4% by the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% by the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020) SS3 is 80.41% by SPOT-1D-LM which is 3.8% and 8.3% higher than SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its fast performance. Moreover, high-accuracy prediction of both secondary and tertiary structural properties such as backbone angles and solvent accessibility without sequence alignment suggests that highly accurate prediction of protein structures may be made without homologous sequences, the remaining obstacle in the post AlphaFold2 era.
Collapse
|
42
|
LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction. Sci Rep 2022; 12:6832. [PMID: 35477726 PMCID: PMC9046255 DOI: 10.1038/s41598-022-10775-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2022] [Accepted: 04/11/2022] [Indexed: 11/27/2022] Open
Abstract
Proteins perform many essential functions in biological systems and can be successfully developed as bio-therapeutics. It is invaluable to be able to predict their properties based on a proposed sequence and structure. In this study, we developed a novel generalizable deep learning framework, LM-GVP, composed of a protein Language Model (LM) and Graph Neural Network (GNN) to leverage information from both 1D amino acid sequences and 3D structures of proteins. Our approach outperformed the state-of-the-art protein LMs on a variety of property prediction tasks including fluorescence, protease stability, and protein functions from Gene Ontology (GO). We also illustrated insights into how a GNN prediction head can inform the fine-tuning of protein LMs to better leverage structural information. We envision that our deep learning framework will be generalizable to many protein property prediction problems to greatly accelerate protein engineering and drug development.
Collapse
|
43
|
|
44
|
Nambiar A, Rubel T, McCaull J, deVries J, Bedau M. Dropping diversity of products of large US firms: Models and measures. PLoS One 2022; 17:e0264330. [PMID: 35294464 PMCID: PMC8926282 DOI: 10.1371/journal.pone.0264330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2021] [Accepted: 02/08/2022] [Indexed: 11/21/2022] Open
Abstract
It is widely assumed that in our lifetimes the products available in the global economy have become more diverse. This assumption is difficult to investigate directly, however, because it is difficult to collect the necessary data about every product in an economy each year. We solve this problem by mining publicly available textual descriptions of the products of every large US firms each year from 1997 to 2017. Although many aspects of economic productivity have been steadily rising during this period, our text-based measurements show that the diversity of the products of at least large US firms has steadily declined. This downward trend is visible using a variety of product diversity metrics, including some that depend on a measurement of the similarity of the products of every single pair of firms. The current state of the art in comprehensive and detailed firm-similarity measurements is a Boolean word vector model due to Hoberg and Phillips. We measure diversity using firm-similarities from this Boolean model and two more sophisticated variants, and we consistently observe a significant dropping trend in product diversity. These results make it possible to frame and start to test specific hypotheses for explaining the dropping product diversity trend.
Collapse
Affiliation(s)
- Ananthan Nambiar
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Tobias Rubel
- Department of Philosophy, Reed College, Portland, Oregon, United States of America
- Department of Biology, Reed College, Portland, Oregon, United States of America
| | - James McCaull
- Department of Computer Science, Reed College, Portland, Oregon, United States of America
| | - Jon deVries
- Department of Biology, Reed College, Portland, Oregon, United States of America
| | - Mark Bedau
- Department of Philosophy, Reed College, Portland, Oregon, United States of America
- * E-mail:
| |
Collapse
|
45
|
Lee BD, Gitter A, Greene CS, Raschka S, Maguire F, Titus AJ, Kessler MD, Lee AJ, Chevrette MG, Stewart PA, Britto-Borges T, Cofer EM, Yu KH, Carmona JJ, Fertig EJ, Kalinin AA, Signal B, Lengerich BJ, Triche TJ, Boca SM. Ten quick tips for deep learning in biology. PLoS Comput Biol 2022; 18:e1009803. [PMID: 35324884 PMCID: PMC8946751 DOI: 10.1371/journal.pcbi.1009803] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open
Affiliation(s)
- Benjamin D. Lee
- In-Q-Tel Labs, Arlington, Virginia, United States of America
- School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America
- Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Morgridge Institute for Research, Madison, Wisconsin, United States of America
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Sebastian Raschka
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
| | - Finlay Maguire
- Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Alexander J. Titus
- University of New Hampshire, Manchester, New Hampshire, United States of America
- Bioeconomy.XYZ, Manchester, New Hampshire, United States of America
| | - Michael D. Kessler
- Department of Oncology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
| | - Alexandra J. Lee
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Marc G. Chevrette
- Wisconsin Institute for Discovery and Department of Plant Pathology, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
| | - Paul Allen Stewart
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, Florida, United States of America
| | - Thiago Britto-Borges
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Heidelberg, Germany
- Department of Internal Medicine III (Cardiology, Angiology, and Pneumology), University Hospital Heidelberg, Heidelberg, Germany
| | - Evan M. Cofer
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, New Jersey, United States of America
| | - Kun-Hsing Yu
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Department of Pathology, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
| | - Juan Jose Carmona
- Philips Healthcare, Cambridge, Massachusetts, United States of America
| | - Elana J. Fertig
- Department of Oncology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Biomedical Engineering, Department of Applied Mathematics and Statistics, Convergence Institute, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Alexandr A. Kalinin
- Medical Big Data Group, Shenzhen Research Institute of Big Data, Shenzhen, China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Brandon Signal
- School of Medicine, College of Health and Medicine, University of Tasmania, Hobart, Australia
| | - Benjamin J. Lengerich
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Timothy J. Triche
- Center for Epigenetics, Van Andel Research Institute, Grand Rapids, Michigan, United States of America
- Department of Pediatrics, College of Human Medicine, Michigan State University, East Lansing, Michigan, United States of America
- Department of Translational Genomics, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
| | - Simina M. Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, District of Columbia, United States of America
- Department of Oncology, Georgetown University Medical Center, Washington, DC, United States of America
- Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC, United States of America
- Cancer Prevention and Control Program, Lombardi Comprehensive Cancer Center, Washington, DC, United States of America
| |
Collapse
|
46
|
Lee I, Nam H. Sequence-based prediction of protein binding regions and drug-target interactions. J Cheminform 2022; 14:5. [PMID: 35135622 PMCID: PMC8822694 DOI: 10.1186/s13321-022-00584-w] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2021] [Accepted: 01/20/2022] [Indexed: 12/19/2022] Open
Abstract
Identifying drug–target interactions (DTIs) is important for drug discovery. However, searching all drug–target spaces poses a major bottleneck. Therefore, recently many deep learning models have been proposed to address this problem. However, the developers of these deep learning models have neglected interpretability in model construction, which is closely related to a model’s performance. We hypothesized that training a model to predict important regions on a protein sequence would increase DTI prediction performance and provide a more interpretable model. Consequently, we constructed a deep learning model, named Highlights on Target Sequences (HoTS), which predicts binding regions (BRs) between a protein sequence and a drug ligand, as well as DTIs between them. To train the model, we collected complexes of protein–ligand interactions and protein sequences of binding sites and pretrained the model to predict BRs for a given protein sequence–ligand pair via object detection employing transformers. After pretraining the BR prediction, we trained the model to predict DTIs from a compound token designed to assign attention to BRs. We confirmed that training the BRs prediction model indeed improved the DTI prediction performance. The proposed HoTS model showed good performance in BR prediction on independent test datasets even though it does not use 3D structure information in its prediction. Furthermore, the HoTS model achieved the best performance in DTI prediction on test datasets. Additional analysis confirmed the appropriate attention for BRs and the importance of transformers in BR and DTI prediction. The source code is available on GitHub (https://github.com/GIST-CSBL/HoTS).
Collapse
Affiliation(s)
- Ingoo Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-ku, Gwangju, 61005, Republic of Korea
| | - Hojung Nam
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-ku, Gwangju, 61005, Republic of Korea.
| |
Collapse
|
47
|
Thumuluri V, Martiny HM, Almagro Armenteros JJ, Salomon J, Nielsen H, Johansen AR. NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics 2022; 38:941-946. [PMID: 35088833 DOI: 10.1093/bioinformatics/btab801] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2021] [Revised: 10/13/2021] [Accepted: 11/23/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY AND IMPLEMENTATION The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Hannah-Marie Martiny
- Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Lyngby 2800, Denmark
| | - Jose J Almagro Armenteros
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark
| | | | - Henrik Nielsen
- Department of Health Technology, Technical University of Denmark, Lyngby 2800, Denmark
| | - Alexander Rosenberg Johansen
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| |
Collapse
|
48
|
Kalakoti Y, Yadav S, Sundar D. TransDTI: Transformer-Based Language Models for Estimating DTIs and Building a Drug Recommendation Workflow. ACS OMEGA 2022; 7:2706-2717. [PMID: 35097268 PMCID: PMC8792915 DOI: 10.1021/acsomega.1c05203] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/19/2021] [Accepted: 12/28/2021] [Indexed: 06/09/2023]
Abstract
The identification of novel drug-target interactions is a labor-intensive and low-throughput process. In silico alternatives have proved to be of immense importance in assisting the drug discovery process. Here, we present TransDTI, a multiclass classification and regression workflow employing transformer-based language models to segregate interactions between drug-target pairs as active, inactive, and intermediate. The models were trained with large-scale drug-target interaction (DTI) data sets, which reported an improvement in performance in terms of the area under receiver operating characteristic (auROC), the area under precision recall (auPR), Matthew's correlation coefficient (MCC), and R2 over baseline methods. The results showed that models based on transformer-based language models effectively predict novel drug-target interactions from sequence data. The proposed models significantly outperformed existing methods like DeepConvDTI, DeepDTA, and DeepDTI on a test data set. Further, the validity of novel interactions predicted by TransDTI was found to be backed by molecular docking and simulation analysis, where the model prediction had similar or better interaction potential for MAP2k and transforming growth factor-β (TGFβ) and their known inhibitors. Proposed approaches can have a significant impact on the development of personalized therapy and clinical decision making.
Collapse
Affiliation(s)
- Yogesh Kalakoti
- DAILAB,
Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India
| | - Shashank Yadav
- DAILAB,
Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India
| | - Durai Sundar
- DAILAB,
Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India
- School
of Artificial Intelligence, Indian Institute
of Technology (IIT) Delhi, New
Delhi 110016, India
| |
Collapse
|
49
|
Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins. Proc Natl Acad Sci U S A 2022; 119:2113348119. [PMID: 35074909 PMCID: PMC8795500 DOI: 10.1073/pnas.2113348119] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/07/2021] [Indexed: 12/12/2022] Open
Abstract
Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper.
Collapse
|
50
|
Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 2022; 40:1114-1122. [PMID: 35039677 DOI: 10.1038/s41587-021-01146-5] [Citation(s) in RCA: 58] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2021] [Accepted: 11/02/2021] [Indexed: 01/27/2023]
Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
Collapse
Affiliation(s)
- Chloe Hsu
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.
| | - Hunter Nisonoff
- Center for Computational Biology, University of California, Berkeley, USA
| | - Clara Fannjiang
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
| | - Jennifer Listgarten
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA. .,Center for Computational Biology, University of California, Berkeley, USA.
| |
Collapse
|