101
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A CNN-CBAM-BIGRU model for protein function prediction. Stat Appl Genet Mol Biol 2024; 23:sagmb-2024-0004. [PMID: 38943434 DOI: 10.1515/sagmb-2024-0004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 06/07/2024] [Indexed: 07/01/2024]
Abstract
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics. Traditionally, this challenge has proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool, achieving significant success in protein function prediction. The strength of deep learning models lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model: CNN-CBAM+BiGRU. It incorporates a Convolutional Block Attention Module (CBAM) alongside BiGRUs. CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data, leading to more accurate feature extraction. BiGRUs, a type of Recurrent Neural Network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model integrates the strengths of both CNN-CBAM and BiGRU. This study's findings, validated through experimentation, showcase the effectiveness of this combined approach. For the human dataset, the suggested method outperforms the CNN-BIGRU+ATT model by 1.0% for cellular components, 1.1% for molecular functions, and 0.5% for biological processes. For the yeast dataset, the suggested method outperforms the CNN-BIGRU+ATT model by 2.4% for cellular components, 1.2% for molecular functions, and 0.6% for biological processes.
Affiliation(s)
- Lavkush Sharma
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Akshay Deepak
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
  - Department of Computer Science and Engineering, C.V. Raman Global University, Bhubaneswar, Odisha, India

102
Nielsen H, Teufel F, Brunak S, von Heijne G. SignalP: The Evolution of a Web Server. Methods Mol Biol 2024; 2836:331-367. [PMID: 38995548 DOI: 10.1007/978-1-0716-4007-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
SignalP ( https://services.healthtech.dtu.dk/services/SignalP-6.0/ ) is a very popular prediction method for signal peptides, the intrinsic signals that make proteins secretory. The SignalP web server has existed since 1995 and is now in its sixth major version. In this historical account, we (three authors who have taken part in the entire journey plus the first author of the latest version) describe the differences between the versions and discuss the various decisions taken along the way.
Affiliation(s)
- Henrik Nielsen
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark.
- Felix Teufel
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
  - Digital Science & Innovation, Novo Nordisk A/S, Malov, Denmark
- Søren Brunak
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gunnar von Heijne
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
  - Science for Life Laboratory, Stockholm University, Solna, Sweden

103
Qiu W, Liang Q, Yu L, Xiao X, Qiu W, Lin W. LSTM-SAGDTA: Predicting Drug-target Binding Affinity with an Attention Graph Neural Network and LSTM Approach. Curr Pharm Des 2024; 30:468-476. [PMID: 38323613 PMCID: PMC11071654 DOI: 10.2174/0113816128282837240130102817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Revised: 01/14/2024] [Accepted: 01/19/2024] [Indexed: 02/08/2024]
Abstract
INTRODUCTION Drug development is a challenging and costly process, yet it plays a crucial role in improving healthcare outcomes. Drug development requires extensive research and testing to meet the demands for economic efficiency, cures, and pain relief. METHODS Drug development is a vital research area that necessitates innovation and collaboration to achieve significant breakthroughs. Computer-aided drug design provides a promising avenue for drug discovery and development by reducing costs and improving the efficiency of drug design and testing. RESULTS In this study, a novel model, namely LSTM-SAGDTA, capable of accurately predicting drug-target binding affinity, was developed. We employed SeqVec for characterizing the protein and utilized graph neural networks to capture information on drug molecules. By introducing self-attentive graph pooling, the model achieved greater accuracy and efficiency in predicting drug-target binding affinity. CONCLUSION Moreover, LSTM-SAGDTA obtained superior accuracy over current state-of-the-art methods while using less training time. The results of experiments suggest that this method represents a high-precision solution for DTA prediction.
Affiliation(s)
- Wenjing Qiu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Qianle Liang
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Liyi Yu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Xuan Xiao
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Wangren Qiu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Weizhong Lin
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China

104
Ali S, Chourasia P, Patterson M. When Protein Structure Embedding Meets Large Language Models. Genes (Basel) 2023; 15:25. [PMID: 38254915 PMCID: PMC10815811 DOI: 10.3390/genes15010025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/24/2024] Open
Abstract
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
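The contact-map idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' embedding method: the function name `contact_map`, the use of one representative atom per residue, and the 8 Å distance cutoff are all assumptions made for illustration only.

```python
import math

def contact_map(coordinates, threshold=8.0):
    """Return a binary residue-residue contact matrix.

    `coordinates` holds one (x, y, z) point per residue (assumed here to be
    a representative atom such as C-alpha). Two residues are marked as in
    contact when their Euclidean distance is at most `threshold`; the 8.0
    default is an illustrative assumption, not a value from the abstract.
    """
    n = len(coordinates)
    return [
        [1 if math.dist(coordinates[i], coordinates[j]) <= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]

# Three toy residues on a line: the first two are close, the third is far away.
toy_coordinates = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coordinates)
```

The resulting matrix is symmetric with ones on the diagonal, so a downstream embedding would typically use only its upper triangle or a summary of it.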
Affiliation(s)
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; (S.A.); (P.C.)

105
Weissenow K, Rost B. Rendering protein mutation movies with MutAmore. BMC Bioinformatics 2023; 24:469. [PMID: 38087198 PMCID: PMC10714560 DOI: 10.1186/s12859-023-05610-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Accepted: 12/08/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND The success of AlphaFold2 in reliable protein three-dimensional (3D) structure prediction assists the move of structural biology toward studies of protein dynamics and mutational impact on structure and function. This transition needs tools that qualitatively assess alternative 3D conformations. RESULTS We introduce MutAmore, a bioinformatics tool that renders individual images of protein 3D structures for, e.g., sequence mutations into a visually intuitive movie format. MutAmore streamlines a pipeline casting single amino-acid variations (SAVs) into a dynamic 3D mutation movie providing a qualitative perspective on the mutational landscape of a protein. By default, the tool first generates all possible variants of the sequence reachable through SAVs (L*19 for proteins with L residues). Next, it predicts the structural conformation for all L*19 variants using state-of-the-art models. Finally, it visualizes the mutation matrix and produces a color-coded 3D animation. Alternatively, users can input other types of variants, e.g., from experimental structures. CONCLUSION MutAmore samples alternative protein configurations to study the dynamical space accessible from SAVs in the post-AlphaFold2 era of structural biology. As the field shifts towards the exploration of alternative conformations of proteins, MutAmore aids in the understanding of the structural impact of mutations by providing a flexible pipeline for the generation of protein mutation movies using current and future structure prediction models.
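The L*19 count in this abstract follows from replacing each of the L residues with each of the 19 other standard amino acids. The sketch below shows that enumeration; it is not MutAmore's own code, and the helper name `single_amino_acid_variants` and the variant label format are hypothetical.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_amino_acid_variants(sequence):
    """Yield (label, mutated_sequence) for every single amino-acid variant.

    Assumes `sequence` contains only the 20 standard residues, so each
    position contributes exactly 19 substitutions.
    """
    for position, wild_type in enumerate(sequence):
        for substitute in AMINO_ACIDS:
            if substitute == wild_type:
                continue  # skip the unchanged residue
            label = f"{wild_type}{position + 1}{substitute}"  # 1-based, e.g. M1A
            mutated = sequence[:position] + substitute + sequence[position + 1:]
            yield label, mutated

variants = list(single_amino_acid_variants("MKT"))
print(len(variants))  # 3 residues * 19 substitutions = 57
```

In a pipeline of the kind described, each mutated sequence would then be passed to a structure predictor; that step is outside the scope of this sketch.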
Affiliation(s)
- Konstantin Weissenow
  - Department of Informatics, Bioinformatics and Computational Biology i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching, Munich, Germany.
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching, Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany

106
Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, Rollins N, Shaw A, Weitzman R, Frazer J, Dias M, Franceschi D, Orenbuch R, Gal Y, Marks DS. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. bioRxiv: The Preprint Server for Biology 2023:2023.12.07.570727. [PMID: 38106144 PMCID: PMC10723403 DOI: 10.1101/2023.12.07.570727] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 12/19/2023]
Abstract
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) in a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures, and model predictions, and develop a user-friendly website that facilitates data access and analysis.
Affiliation(s)
- Ada Shaw
  - Applied Mathematics, Harvard University
- Mafalda Dias
  - Centre for Genomic Regulation, Universitat Pompeu Fabra
- Yarin Gal
  - Computer Science, University of Oxford

107
Notin P, Marks DS, Weitzman R, Gal Y. ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. bioRxiv: The Preprint Server for Biology 2023:2023.12.06.570473. [PMID: 38106034 PMCID: PMC10723423 DOI: 10.1101/2023.12.06.570473] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/19/2023]
Abstract
Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
Affiliation(s)
- Yarin Gal
  - Computer Science, University of Oxford

108
Xu H, Hu R, Zhao Z. DegronMD: Leveraging Evolutionary and Structural Features for Deciphering Protein-Targeted Degradation, Mutations, and Drug Response to Degrons. Mol Biol Evol 2023; 40:msad253. [PMID: 37992195 PMCID: PMC10701100 DOI: 10.1093/molbev/msad253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 10/18/2023] [Accepted: 11/09/2023] [Indexed: 11/24/2023] Open
Abstract
Protein-targeted degradation is an emerging and promising therapeutic approach. The specificity of degradation and the maintenance of cellular homeostasis are determined by the interactions between E3 ubiquitin ligase and degradation signals, known as degrons. The human genome encodes over 600 E3 ligases; however, only a small number of targeted degron instances have been identified so far. In this study, we introduced DegronMD, an open knowledgebase designed for the investigation of degrons, their associated dysfunctional events, and drug responses. We revealed that degrons are evolutionarily conserved and tend to occur near the sites of protein translational modifications, particularly in the regions of disordered structure and higher solvent accessibility. Through pattern recognition and machine learning techniques, we constructed the degrome landscape across the human proteome, yielding over 18,000 new degrons for targeted protein degradation. Furthermore, dysfunction of degrons disrupts the degradation process and leads to the abnormal accumulation of proteins; this process is associated with various types of human cancers. Based on the estimated phenotypic changes induced by somatic mutations, we systematically quantified and assessed the impact of mutations on degron function in pan-cancers; these results helped to build a global mutational map on human degrome, including 89,318 actionable mutations that may induce the dysfunction of degrons and disrupt protein degradation pathways. Multiomics integrative analysis unveiled over 400 drug resistance events associated with the mutations in functional degrons. DegronMD, accessible at https://bioinfo.uth.edu/degronmd, is a useful resource to explore the biological mechanisms, infer protein degradation, and assist with drug discovery and design on degrons.
Affiliation(s)
- Haodong Xu
  - Department of Orthopaedics, The Second Xiangya Hospital, Central South University, Changsha, Hunan 410011, China
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Ruifeng Hu
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - Center for Advanced Parkinson Research, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
  - Genomics and Bioinformatics Hub, Department of Neurology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

109
Urhan A, Cosma BM, Earl AM, Manson AL, Abeel T. SAP: Synteny-aware gene function prediction for bacteria using protein embeddings. bioRxiv: The Preprint Server for Biology 2023:2023.05.02.539034. [PMID: 37205418 PMCID: PMC10187222 DOI: 10.1101/2023.05.02.539034] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/21/2023]
Abstract
Motivation Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for prokaryotes. Recently, transformer-based language models - adopted from the natural language processing field - have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes. Results To predict gene functions in bacteria, we developed SAP, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAP also leverages the unique operon structure of bacteria through conserved synteny. SAP outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAP to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.
Affiliation(s)
- Aysun Urhan
  - Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, The Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
- Bianca-Maria Cosma
  - Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, The Netherlands
- Ashlee M. Earl
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
- Abigail L. Manson
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US
- Thomas Abeel
  - Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, The Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, US

110
Khakzad H, Igashov I, Schneuing A, Goverde C, Bronstein M, Correia B. A new age in protein design empowered by deep learning. Cell Syst 2023; 14:925-939. [PMID: 37972559 DOI: 10.1016/j.cels.2023.10.006] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 03/02/2023] [Revised: 06/22/2023] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein structure prediction, leading to the availability of high-quality models for millions of proteins. Along with novel architectures for generative modeling and sequence analysis, they have revolutionized the protein design field in the past few years remarkably by improving the accuracy and ability to identify novel protein sequences and structures. Deep neural networks can now learn and extract the fundamental features of protein structures, predict how they interact with other biomolecules, and have the potential to create new effective drugs for treating disease. As their applicability in protein design is rapidly growing, we review the recent developments and technology in deep learning methods and provide examples of their performance to generate novel functional proteins.
Affiliation(s)
- Hamed Khakzad
  - Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France; École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Ilia Igashov
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Arne Schneuing
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Casper Goverde
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Bruno Correia
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.

111
Raza A, Uddin J, Almuhaimeed A, Akbar S, Zou Q, Ahmad A. AIPs-SnTCN: Predicting Anti-Inflammatory Peptides Using fastText and Transformer Encoder-Based Hybrid Word Embedding with Self-Normalized Temporal Convolutional Networks. J Chem Inf Model 2023; 63:6537-6554. [PMID: 37905969 DOI: 10.1021/acs.jcim.3c01563] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Indexed: 11/02/2023]
Abstract
Inflammation is a biologically resistant response to harmful stimuli, such as infection, damaged cells, toxic chemicals, or tissue injuries. Its purpose is to eradicate pathogenic micro-organisms or irritants and facilitate tissue repair. Prolonged inflammation can result in chronic inflammatory diseases. However, wet-laboratory-based treatments are costly and time-consuming and may have adverse side effects on normal cells. In the past decade, peptide therapeutics have gained significant attention due to their high specificity in targeting affected cells without affecting healthy cells. Motivated by the significance of peptide-based therapies, we developed a highly discriminative prediction model called AIPs-SnTCN to predict anti-inflammatory peptides accurately. The peptide samples are encoded using word embedding techniques such as skip-gram and attention-based bidirectional encoder representation using a transformer (BERT). The conjoint triad feature (CTF) also collects structure-based cluster profile features. The fused vector of word embedding and sequential features is formed to compensate for the limitations of single encoding methods. Support vector machine-based recursive feature elimination (SVM-RFE) is applied to choose the ranking-based optimal space. The optimized feature space is trained by using an improved self-normalized temporal convolutional network (SnTCN). The AIPs-SnTCN model achieved a predictive accuracy of 95.86% and an AUC of 0.97 by using training samples. In the case of the alternate training data set, our model obtained an accuracy of 92.04% and an AUC of 0.96. The proposed AIPs-SnTCN model outperformed existing models with an ∼19% higher accuracy and an ∼14% higher AUC value. The reliability and efficacy of our AIPs-SnTCN model make it a valuable tool for scientists and may play a beneficial role in pharmaceutical design and research academia.
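The conjoint triad feature (CTF) named in this abstract is commonly computed by mapping residues to seven physicochemical classes and counting every run of three consecutive classes, which gives a 7 x 7 x 7 = 343-dimensional vector. The sketch below follows that common formulation. The class grouping shown is the one usually cited for CTF and is assumed here; the abstract does not confirm which grouping or normalization the authors used, and the function name is hypothetical.

```python
from itertools import product

# Seven residue classes as commonly used for conjoint triads (an assumption;
# not confirmed by the abstract).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {
    residue: index for index, group in enumerate(CLASSES) for residue in group
}

def conjoint_triad_counts(sequence):
    """Return raw counts of the 343 class triads found in `sequence`.

    Residues outside the 20 standard amino acids are dropped before the
    triads are formed (a simplification made for this sketch). No
    normalization is applied.
    """
    counts = {triad: 0 for triad in product(range(7), repeat=3)}
    classes = [RESIDUE_TO_CLASS[r] for r in sequence if r in RESIDUE_TO_CLASS]
    for i in range(len(classes) - 2):
        counts[tuple(classes[i:i + 3])] += 1
    # Sort the keys so the vector layout is fixed: (0,0,0), (0,0,1), ...
    return [counts[triad] for triad in sorted(counts)]

vector = conjoint_triad_counts("AGVC")  # triads: AGV -> (0,0,0), GVC -> (0,0,6)
```

A vector like this would be concatenated with the word-embedding features before feature selection; the fusion and selection steps are not shown.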
Affiliation(s)
- Ali Raza
  - Department of Physical and Numerical Sciences, Qurtuba University of Science and Information Technology, Peshawar, Khyber Pakhtunkhwa 25124, Pakistan
  - Department of Computer Science, MY University, Islamabad 45750, Pakistan
- Jamal Uddin
  - Department of Physical and Numerical Sciences, Qurtuba University of Science and Information Technology, Peshawar, Khyber Pakhtunkhwa 25124, Pakistan
- Abdullah Almuhaimeed
  - Digital Health Institute, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia
- Shahid Akbar
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Khyber Pakhtunkhwa 23200, Pakistan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, PR China
- Ashfaq Ahmad
  - Department of Computer Science, MY University, Islamabad 45750, Pakistan

112
Qin X, Liu M, Liu G. ResCNNT-fold: Combining residual convolutional neural network and Transformer for protein fold recognition from language model embeddings. Comput Biol Med 2023; 166:107571. [PMID: 37864911 DOI: 10.1016/j.compbiomed.2023.107571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 09/30/2023] [Accepted: 10/11/2023] [Indexed: 10/23/2023]
Abstract
A comprehensive understanding of protein functions holds significant promise for disease research and drug development, and proteins with analogous tertiary structures tend to exhibit similar functions. Protein fold recognition stands as a classical approach in the realm of protein structure investigation. Despite significant advancements made by researchers in this field, the continuous updating of protein databases presents an ongoing challenge in accurately identifying protein fold types. In this study, we introduce a predictor, ResCNNT-fold, for protein fold recognition and employ the LE dataset for testing purpose. ResCNNT-fold leverages a pre-trained language model to obtain embedding representations for protein sequences, which are then processed by the ResCNNT feature extractor, a combination of residual convolutional neural network and Transformer, to derive fold-specific features. Subsequently, the query protein is paired with each protein whose structure is known in the template dataset. For each pair, the similarity score of their fold-specific features is calculated. Ultimately, the query protein is identified as the fold type of the template protein in the pair with the highest similarity score. To further validate the utility and efficacy of the proposed ResCNNT-fold predictor, we conduct a 2-fold cross-validation experiment on the fold level of the LE dataset. Remarkably, this rigorous evaluation yields an exceptional accuracy of 91.57%, which surpasses the best result among other state-of-the-art protein fold recognition methods by an approximate margin of 10%. The excellent performance unequivocally underscores the compelling advantages inherent to our proposed ResCNNT-fold predictor in the realm of protein fold recognition. The source code and data of ResCNNT-fold can be downloaded from https://github.com/Bioinformatics-Laboratory/ResCNNT-fold.
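The final assignment step described in this abstract (pair the query with every template, score each pair, take the fold of the best-scoring template) amounts to a nearest-template lookup. The sketch below illustrates it with cosine similarity, which is an assumption: the abstract does not say which similarity score is used. The function names and the toy feature vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_fold(query_features, templates):
    """Return the fold label of the template most similar to the query.

    `templates` is a list of (fold_label, feature_vector) pairs standing in
    for fold-specific features of proteins with known structure.
    """
    best_label, _ = max(
        templates,
        key=lambda template: cosine_similarity(query_features, template[1]),
    )
    return best_label

toy_templates = [("fold_A", [1.0, 0.0]), ("fold_B", [0.0, 1.0])]
predicted = assign_fold([0.9, 0.1], toy_templates)
```

With real embeddings the feature vectors would come from the trained extractor; only the ranking logic is shown here.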
Affiliation(s)
- Xinyi Qin
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- Min Liu
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- Guangzhong Liu
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.

113
Muniyappan S, Rayan AXA, Varrieth GT. EGeRepDR: An enhanced genetic-based representation learning for drug repurposing using multiple biomedical sources. J Biomed Inform 2023; 147:104528. [PMID: 37858852 DOI: 10.1016/j.jbi.2023.104528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 09/11/2023] [Accepted: 10/16/2023] [Indexed: 10/21/2023]
Abstract
MOTIVATION Drug repurposing (DR) is an imminent approach for identifying novel therapeutic indications for the available drugs and discovering novel drugs for previously untreatable diseases. Nowadays, DR has major attention in the pharmaceutical industry due to the high cost and time of launching new drugs to the market through traditional drug development. The DR task largely depends on genetic information since the drugs revert the modified Gene Expression (GE) of diseases to normal. Many of the existing studies have not considered the genetic importance of predicting the potential candidates. METHOD We proposed a novel multimodal framework that utilizes genetic aspects of drugs and diseases such as genes, pathways, gene signatures, or expression to enhance the performance of DR using various data sources. Firstly, the heterogeneous biological network (HBN) is constructed with three types of nodes, namely drug, disease, and gene, and four types of edges: similarities (drug, gene, and disease), drug-gene, gene-disease, and drug-disease. Next, a modified graph auto-encoder (GAE*) model is applied to learn the representation of drug and disease nodes using the topological structure and edge information. Secondly, the HBN is enhanced with the information extracted from biomedical literature and ontology using a novel semi-supervised pattern embedding-based bootstrapping model and novel DR perspective representation learning respectively to improve the prediction performance. Finally, our proposed system uses a neural network model to generate the probability score of drug-disease pairs. RESULTS We demonstrate the efficiency of the proposed model on various datasets and achieve outstanding performance in 5-fold cross-validation (AUC = 0.99, AUPR = 0.98). Further, we validated the top-ranked potential candidates using pathway analysis and proved that the known and predicted candidates share common genes in the pathways.
Affiliation(s)
- Saranya Muniyappan
  - Computer Science and Engineering, CEG Campus, Anna University, Chennai, Tamil Nadu, India.
114
Anteghini M, Santos VAMD, Saccenti E. PortPred: Exploiting deep learning embeddings of amino acid sequences for the identification of transporter proteins and their substrates. J Cell Biochem 2023; 124:1803-1824. [PMID: 37877557 DOI: 10.1002/jcb.30490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 10/26/2023]
Abstract
The physiology of every living cell is regulated at some level by transporter proteins, which constitute a relevant portion of membrane-bound proteins and are involved in the movement of ions, small molecules and macromolecules across bio-membranes. The importance of transporter proteins is unquestionable. The prediction and study of previously unknown transporters can lead to the discovery of new biological pathways, drugs and treatments. Here we present PortPred, a tool to accurately identify transporter proteins and their substrates starting from the protein amino acid sequence. PortPred successfully combines pre-trained deep learning-based protein embeddings and machine learning classification approaches and outperforms other state-of-the-art methods. In addition, we present a comparison of the most promising protein sequence embeddings (Unirep, SeqVec, ProteinBERT, ESM-1b) and their performances for this specific task.
Affiliation(s)
- Marco Anteghini
  - LifeGlimmer GmbH, Berlin, Germany
  - Department of Systems and Synthetic Biology, Wageningen University & Research, Wageningen WE, The Netherlands
  - Department of Visual and Data-Centric Computing, Zuse Institute Berlin, Berlin, Germany
- Vitor Ap Martins Dos Santos
  - LifeGlimmer GmbH, Berlin, Germany
  - Department of Bioprocess Engineering, Wageningen University & Research, Wageningen WE, The Netherlands
- Edoardo Saccenti
  - Department of Systems and Synthetic Biology, Wageningen University & Research, Wageningen WE, The Netherlands
115
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP allows protein function prediction using function-aware domain embedding representations. Commun Biol 2023; 6:1103. [PMID: 37907681 PMCID: PMC10618451 DOI: 10.1038/s42003-023-05476-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 10/17/2023] [Indexed: 11/02/2023] Open
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA.
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
116
Jiao P, Wang B, Wang X, Liu B, Wang Y, Li J. Struct2GO: protein function prediction based on graph pooling algorithm and AlphaFold2 structure information. Bioinformatics 2023; 39:btad637. [PMID: 37847755 PMCID: PMC10612405 DOI: 10.1093/bioinformatics/btad637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 10/05/2023] [Accepted: 10/16/2023] [Indexed: 10/19/2023]
Abstract
MOTIVATION In recent years, there has been a breakthrough in protein structure prediction, and the AlphaFold2 model of the DeepMind team has improved the accuracy of protein structure prediction to the atomic level. Currently, deep learning-based protein function prediction models usually extract features from protein sequences and combine them with protein-protein interaction networks to achieve good results. However, for newly sequenced proteins that are not in the protein-protein interaction network, such models cannot make effective predictions. To address this, this article proposes the Struct2GO model, which combines protein structure and sequence data to enhance the precision of protein function prediction and the generality of the model. RESULTS We obtain amino acid residue embeddings in protein structure through graph representation learning, utilize the graph pooling algorithm based on a self-attention mechanism to obtain the whole graph structure features, and fuse them with sequence features obtained from the protein language model. The results demonstrate that compared with the traditional protein sequence-based function prediction model, the Struct2GO model achieves better results. AVAILABILITY AND IMPLEMENTATION The data underlying this article are available at https://github.com/lyjps/Struct2GO.
Affiliation(s)
- Peishun Jiao
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guang Dong 518055, China
- Beibei Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guang Dong 518055, China
- Xuan Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guang Dong 518055, China
  - Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Bo Liu
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Yadong Wang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guang Dong 518055, China
  - Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
117
Gan Y, Huang X, Guo W, Yan C, Zou G. Predicting synergistic anticancer drug combination based on low-rank global attention mechanism and bilinear predictor. Bioinformatics 2023; 39:btad607. [PMID: 37812255 PMCID: PMC10598574 DOI: 10.1093/bioinformatics/btad607] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/10/2023] [Revised: 08/29/2023] [Accepted: 10/07/2023] [Indexed: 10/10/2023] Open
Abstract
MOTIVATION Drug combination therapy has exhibited remarkable therapeutic efficacy and has gradually become a promising clinical treatment strategy for complex diseases such as cancers. As the related databases keep expanding, computational methods based on deep learning models have become powerful tools to predict synergistic drug combinations. However, predicting effective synergistic drug combinations is still a challenge due to the high complexity of drug combinations, the lack of biological interpretability, and the large discrepancy between the responses to drug combinations in in vivo and in vitro biological systems. RESULTS Here, we propose DGSSynADR, a new deep learning method based on global structured features of drugs and targets for predicting synergistic anticancer drug combinations. DGSSynADR constructs a heterogeneous graph by integrating the drug-drug, drug-target, protein-protein interactions and multi-omics data, utilizes a low-rank global attention (LRGA) model to perform global weighted aggregation of graph nodes and learn the global structured features of drugs and targets, and then feeds the embedded features into a bilinear predictor to predict the synergy scores of drug combinations in different cancer cell lines. Specifically, the LRGA network brings better model generalization ability and effectively reduces the complexity of graph computation. The bilinear predictor facilitates the dimension transformation of the features and fuses the feature representation of the two drugs to improve the prediction performance. The Smooth L1 loss function effectively avoids gradient explosion, contributing to better model convergence. To validate the performance of DGSSynADR, we compare it with seven competitive methods. The comparison results demonstrate that DGSSynADR achieves better performance. Meanwhile, the prediction of DGSSynADR is validated by previous findings in case studies. Furthermore, detailed ablation studies indicate that the one-hot coding drug feature, LRGA model and bilinear predictor play a key role in improving the prediction performance. AVAILABILITY AND IMPLEMENTATION DGSSynADR is implemented in Python using the PyTorch machine-learning library, and it is freely available at https://github.com/DHUDBlab/DGSSynADR.
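The remark about the Smooth L1 loss can be illustrated with a minimal sketch. This is the standard textbook definition of the loss, not code taken from DGSSynADR; the function names and the `beta` default are assumptions chosen for illustration. The point it shows is that the gradient magnitude never exceeds 1, however large the prediction error becomes.

```python
def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    if d < beta:
        return 0.5 * d * d / beta
    return d - 0.5 * beta


def smooth_l1_grad(pred: float, target: float, beta: float = 1.0) -> float:
    """Derivative of the loss with respect to pred; bounded in [-1, 1]."""
    d = pred - target
    if abs(d) < beta:
        return d / beta
    return 1.0 if d > 0 else -1.0


# A squared-error gradient would grow with the error; this one saturates.
small_error_grad = smooth_l1_grad(0.5, 0.0)    # inside the quadratic region
large_error_grad = smooth_l1_grad(100.0, 0.0)  # clipped to 1.0
```

Because the linear branch has constant slope, a single outlying synergy score cannot produce an arbitrarily large parameter update, which is the usual argument for preferring this loss over plain squared error.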
Affiliation(s)
- Yanglan Gan
  - School of Computer Science and Technology, Donghua University, Shanghai 201600, China
- Xingyu Huang
  - School of Computer Science and Technology, Donghua University, Shanghai 201600, China
- Wenjing Guo
  - School of Computer Science and Technology, Donghua University, Shanghai 201600, China
- Cairong Yan
  - School of Computer Science and Technology, Donghua University, Shanghai 201600, China
- Guobing Zou
  - School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
118
Zhou Y, Huang Z, Li W, Wei J, Jiang Q, Yang W, Huang J. Deep learning in preclinical antibody drug discovery and development. Methods 2023; 218:57-71. [PMID: 37454742 DOI: 10.1016/j.ymeth.2023.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 11/22/2022] [Revised: 03/20/2023] [Accepted: 07/10/2023] [Indexed: 07/18/2023] Open
Abstract
Antibody drugs have become a key part of biotherapeutics. Patients suffering from various diseases have benefited from antibody therapies. However, their development process is rather long, expensive and risky. To speed up the process, reduce cost and improve the success rate, artificial intelligence, especially deep learning, has been widely used in all aspects of preclinical antibody drug development, from library generation to hit identification, developability screening, lead selection and optimization. In this review, we systematically summarize antibody encodings, deep learning architectures and models used in preclinical antibody drug discovery and development. We also critically discuss challenges and opportunities, problems and possible solutions, current applications and future directions of deep learning in antibody drug development.
Affiliation(s)
- Yuwei Zhou
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Ziru Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Wenzhen Li
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jinyi Wei
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Qianhu Jiang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Wei Yang
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jian Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
119
Nikam R, Yugandhar K, Gromiha MM. DeepBSRPred: deep learning-based binding site residue prediction for proteins. Amino Acids 2023; 55:1305-1316. [PMID: 36574037 DOI: 10.1007/s00726-022-03228-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 12/15/2022] [Indexed: 12/28/2022]
Abstract
MOTIVATION Protein-protein interactions (PPIs) govern several cellular activities. Amino acid residues located at the interface are known as binding sites, and information about binding sites helps in understanding the binding affinities and functions of protein-protein complexes. RESULTS We have developed a deep neural network-based method, DeepBSRPred, for predicting the binding sites using protein sequence information and predicted structures from AlphaFold2. Specific sequence and structure-based features include position-specific scoring matrix (PSSM), solvent accessible surface area, conservation score and amino acid properties, and residue depth, respectively. Our method predicted the binding sites with an average F1 score of 0.73 in a dataset of 1236 proteins. Further, we compared the performance with other existing methods in the literature using four benchmark datasets and our method outperformed those methods. AVAILABILITY AND IMPLEMENTATION The DeepBSRPred web server can be found at https://web.iitm.ac.in/bioinfo2/deepbsrpred/index.html, along with all datasets used in this study. The trained models, the DeepBSRPred standalone source code, and the feature computation pipeline are freely available at https://web.iitm.ac.in/bioinfo2/deepbsrpred/download.html.
Affiliation(s)
- Rahul Nikam
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Kumar Yugandhar
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
  - Department of Computational Biology, Cornell University, New York, NY, USA
- M Michael Gromiha
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India.
  - Department of Computer Science, Tokyo Institute of Technology, Yokohama, Japan.
120
Zhao Y, Yang Z, Hong Y, Yang Y, Wang L, Zhang Y, Lin H, Wang J. Protein Function Prediction With Functional and Topological Knowledge of Gene Ontology. IEEE Trans Nanobioscience 2023; 22:755-762. [PMID: 37204950 DOI: 10.1109/tnb.2023.3278033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
Gene Ontology (GO) is a widely used bioinformatics resource for describing biological processes, molecular functions, and cellular components of proteins. It covers more than 5000 terms hierarchically organized into a directed acyclic graph and known functional annotations. Automatically annotating protein functions by using GO-based computational models has been an area of active research for a long time. However, due to the limited functional annotation information and complex topological structures of GO, existing models cannot effectively capture the knowledge representation of GO. To solve this issue, we present a method that fuses the functional and topological knowledge of GO to guide protein function prediction. This method employs a multi-view GCN model to extract a variety of GO representations from functional information, topological structure, and their combinations. To dynamically learn the significance weights of these representations, it adopts an attention mechanism to learn the final knowledge representation of GO. Furthermore, it uses a pre-trained language model (i.e., ESM-1b) to efficiently learn biological features for each protein sequence. Finally, it obtains all predicted scores by calculating the dot product of sequence features and GO representation. Our method outperforms other state-of-the-art methods, as demonstrated by the experimental results on datasets from three different species, namely Yeast, Human and Arabidopsis. Our proposed method's code can be accessed at: https://github.com/Candyperfect/Master.
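The final scoring step described above (a dot product between sequence features and GO representations) can be sketched as follows. This is an illustrative toy, not the authors' code: the function names, the two-dimensional vectors, the placeholder term labels, and the sigmoid used to map raw scores into (0, 1) are all assumptions.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_go_terms(seq_features, go_representations):
    """Score every GO term for one protein.

    seq_features: feature vector of the protein (e.g. from a language model).
    go_representations: dict mapping a term label to its learned vector.
    The sigmoid is an assumption made here so scores read as probabilities.
    """
    return {
        term: 1.0 / (1.0 + math.exp(-dot(seq_features, vec)))
        for term, vec in go_representations.items()
    }


# Hypothetical 2-dimensional example; real embeddings are far larger.
protein = [1.0, 0.0]
terms = {"term_a": [2.0, 0.0], "term_b": [0.0, 5.0]}
scores = score_go_terms(protein, terms)
```

Here `term_b` is orthogonal to the protein vector, so its raw score is 0 and its sigmoid score is 0.5, while `term_a` scores higher because it points in the same direction as the protein features.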
121
Sui J, Chen J, Chen Y, Iwamori N, Sun J. Identification of plant vacuole proteins by using graph neural network and contact maps. BMC Bioinformatics 2023; 24:357. [PMID: 37740195 PMCID: PMC10517492 DOI: 10.1186/s12859-023-05475-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2023] [Accepted: 09/12/2023] [Indexed: 09/24/2023] Open
Abstract
Plant vacuoles are essential organelles in the growth and development of plants, and accurate identification of their proteins is crucial for understanding their biological properties. In this study, we developed a novel model called GraphIdn for the identification of plant vacuole proteins. The model uses SeqVec, a deep representation learning model, to initialize the amino acid sequence. We utilized the AlphaFold2 algorithm to obtain the structural information of corresponding plant vacuole proteins, and then fed the calculated contact maps into a graph convolutional neural network. GraphIdn achieved accuracy values of 88.51% and 89.93% in independent testing and fivefold cross-validation, respectively, outperforming previous state-of-the-art predictors. As far as we know, this is the first model to use predicted protein topology structure graphs to identify plant vacuole proteins. Furthermore, we assessed the effectiveness and generalization capability of our GraphIdn model by applying it to identify and locate peroxisomal proteins, which yielded promising outcomes. The source code and datasets can be accessed at https://github.com/SJNNNN/GraphIdn.
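As a rough illustration of what a contact map is, the sketch below builds a binary residue-residue adjacency matrix from 3D coordinates. It is not GraphIdn's implementation: the function name, the use of one representative atom per residue, and the 8 Å distance cutoff are assumptions (8 Å is a commonly used convention, but the abstract does not state the threshold).

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix marking residue pairs closer than threshold.

    coords: one (x, y, z) tuple per residue, in angstroms (assumed to be a
    single representative atom such as C-alpha).
    The diagonal is left at 0, i.e. no self-contacts.
    """
    n = len(coords)
    return [
        [
            1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three hypothetical residues: the first two are neighbours, the third is far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The resulting matrix can serve as the edge structure of a graph whose nodes are residues, which is the form a graph convolutional network consumes.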
Affiliation(s)
- Jianan Sui
  - School of Information Science and Engineering, University of Jinan, Jinan, China
- Jiazi Chen
  - Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-Shi, Fukuoka, Japan
- Yuehui Chen
  - School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, Jinan, China.
- Naoki Iwamori
  - Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-Shi, Fukuoka, Japan
- Jin Sun
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
122
Mardikoraem M, Wang Z, Pascual N, Woldring D. Generative models for protein sequence modeling: recent advances and future directions. Brief Bioinform 2023; 24:bbad358. [PMID: 37864295 PMCID: PMC10589401 DOI: 10.1093/bib/bbad358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 09/08/2023] [Accepted: 09/12/2023] [Indexed: 10/22/2023] Open
Abstract
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Affiliation(s)
- Mehrsa Mardikoraem
  - Michigan State University (MSU)’s Department of Chemical Engineering and Materials Science
- Zirui Wang
  - Regeneron Pharmaceuticals, Inc. Having received his B.S. in Chemical Engineering from MSU, he is currently pursuing an M.S. in Computer Science from Syracuse University
- Daniel Woldring
  - MSU’s Department of Chemical Engineering and Materials Science and a member of MSU’s Institute for Quantitative Health Sciences and Engineering
123
Zhang X, Guo H, Zhang F, Wang X, Wu K, Qiu S, Liu B, Wang Y, Hu Y, Li J. HNetGO: protein function prediction via heterogeneous network transformer. Brief Bioinform 2023; 24:bbab556. [PMID: 37861172 PMCID: PMC10588005 DOI: 10.1093/bib/bbab556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2021] [Revised: 11/18/2021] [Accepted: 12/04/2021] [Indexed: 10/21/2023] Open
Abstract
Protein function annotation is one of the most important research topics for revealing the essence of life at the molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, the heavy reliance on complex feature engineering and model integration methods limits the development of existing methods. Besides, models based on deep learning only use labeled data in a certain dataset to extract sequence features, thus ignoring a large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which uses a heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information and combines a pretrained model to extract the semantic features of the protein sequence. In addition, we design an attention-based graph neural network model, which can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on the cellular component and molecular function branches.
Affiliation(s)
- Xiaoshuai Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Huannan Guo
  - General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin 150086, China
- Fan Zhang
  - Center NHC Key Laboratory of Cell Transplantation, The First Affiliated Hospital of Harbin Medical University, Harbin 150086, China
- Xuan Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Kaitao Wu
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Shizheng Qiu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yang Hu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
124
Zhang Y, Yao S, Chen P. Prediction of hot spots towards drug discovery by protein sequence embedding with 1D convolutional neural network. PLoS One 2023; 18:e0290899. [PMID: 37721924 PMCID: PMC10506709 DOI: 10.1371/journal.pone.0290899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2023] [Accepted: 08/18/2023] [Indexed: 09/20/2023] Open
Abstract
Protein hotspot residues are key sites that mediate protein-protein interactions. Accurate identification of these residues is essential for understanding the mechanism from protein to function and for designing drug targets. Current research has mostly focused on using machine learning methods to predict hot spots from known interface residues; these methods manually extract features of amino acid residues from sequence, structure, evolution, energy, and other information to train and test machine learning models. The process is cumbersome, time-consuming and laborious. This paper proposes combining a pre-trained protein sequence embedding model with a one-dimensional convolutional neural network, called Embed-1dCNN, to predict protein hotspot residues. To obtain a large number of data samples, this work integrates and extracts data from the ASEdb, BID, SKEMPI and dbMPIKT datasets to generate a new dataset, and adopts the SMOTE algorithm to expand positive samples to form the training set. The experimental results show that the method achieves an F1 score of 0.82 on the test set. Compared with other hot spot prediction methods, our model achieved better prediction performance.
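The SMOTE step mentioned above can be sketched in a few lines. This is a simplified illustration of the general technique (interpolating between a minority sample and one of its nearest minority neighbours), not the pipeline used for Embed-1dCNN; the function name, the parameters `k` and `seed`, and the toy data are assumptions.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (lists of floats), at least two of them.
    k: how many nearest minority neighbours to choose from (assumed default).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen sample, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D positive samples on the corners of the unit square.
positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_oversample(positives, n_new=5)
```

Each synthetic point lies on the line segment between two real positive samples, so the added data stays inside the region the minority class already occupies. In practice a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written version.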
Affiliation(s)
- Youzhi Zhang
  - School of Computer and Information, Anqing Normal University, Anqing, China
  - University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing Normal University, Anqing, China
  - National Engineering Research Center for Agro-Ecological Big Data Analysis & Application, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology & School of Internet, Anhui University, Anhui, China
- Sijie Yao
  - National Engineering Research Center for Agro-Ecological Big Data Analysis & Application, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology & School of Internet, Anhui University, Anhui, China
- Peng Chen
  - School of Computer and Information, Anqing Normal University, Anqing, China
  - National Engineering Research Center for Agro-Ecological Big Data Analysis & Application, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology & School of Internet, Anhui University, Anhui, China
125
Jha K, Saha S, Karmakar S. Prediction of Protein-Protein Interactions Using Vision Transformer and Language Model. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:3215-3225. [PMID: 37027644 DOI: 10.1109/tcbb.2023.3248797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
The knowledge of protein-protein interaction (PPI) helps us to understand proteins' functions, the causes and growth of several diseases, and can aid in designing new drugs. The majority of existing PPI research has relied mainly on sequence-based approaches. With the availability of multi-omics datasets (sequence, 3D structure) and advancements in deep learning techniques, it is feasible to develop a deep multi-modal framework that fuses the features learned from different sources of information to predict PPI. In this work, we propose a multi-modal approach utilizing protein sequence and 3D structure. To extract features from the 3D structure of proteins, we use a pre-trained vision transformer model that has been fine-tuned on the structural representation of proteins. The protein sequence is encoded into a feature vector using a pre-trained language model. The feature vectors extracted from the two modalities are fused and then fed to the neural network classifier to predict the protein interactions. To showcase the effectiveness of the proposed methodology, we conduct experiments on two popular PPI datasets, namely, the human dataset and the S. cerevisiae dataset. Our approach outperforms the existing methodologies to predict PPI, including multi-modal approaches. We also evaluate the contributions of each modality by designing uni-modal baselines. We perform experiments with three modalities as well, having gene ontology as the third modality.
126
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP: Protein Function Prediction Using Function-Aware Domain Embedding Representations. bioRxiv 2023:2023.08.23.554486. [PMID: 37662252 PMCID: PMC10473699 DOI: 10.1101/2023.08.23.554486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, significantly outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
|
127
|
Koludarov I, Senoner T, Jackson TNW, Dashevsky D, Heinzinger M, Aird SD, Rost B. Domain loss enabled evolution of novel functions in the snake three-finger toxin gene superfamily. Nat Commun 2023; 14:4861. [PMID: 37567881 PMCID: PMC10421932 DOI: 10.1038/s41467-023-40550-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/15/2022] [Accepted: 07/28/2023] [Indexed: 08/13/2023] Open
Abstract
Three-finger toxins (3FTXs) are a functionally diverse family of toxins, apparently unique to venoms of caenophidian snakes. Although the ancestral function of 3FTXs is antagonism of nicotinic acetylcholine receptors, redundancy conferred by the accumulation of duplicate genes has facilitated extensive neofunctionalization, such that derived members of the family interact with a range of targets. 3FTXs are members of the LY6/UPAR family, but their non-toxin ancestor remains unknown. Combining traditional phylogenetic approaches, manual synteny analysis, and machine learning techniques (including AlphaFold2 and ProtT5), we have reconstructed a detailed evolutionary history of 3FTXs. We identify their immediate ancestor as a non-secretory LY6, unique to squamate reptiles, and propose that changes in molecular ecology resulting from loss of a membrane-anchoring domain and changes in gene expression, paved the way for the evolution of one of the most important families of snake toxins.
Affiliation(s)
- Ivan Koludarov
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Tobias Senoner
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Timothy N W Jackson
  - Australian Venom Research Unit, Department of Biochemistry and Pharmacology, University of Melbourne, Melbourne, VIC, Australia
- Daniel Dashevsky
  - Australian National Insect Collection, Commonwealth Scientific & Industrial Research Organisation, Canberra, ACT, Australia
- Michael Heinzinger
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Steven D Aird
  - 7744-23 Hotaka Ariake, 399-8301, Azumino-shi, Nagano-ken, Japan
- Burkhard Rost
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
|
128
|
Ahdritz G, Bouatta N, Kadyan S, Jarosch L, Berenberg D, Fisk I, Watkins AM, Ra S, Bonneau R, AlQuraishi M. OpenProteinSet: Training data for structural biology at scale. arXiv 2023:arXiv:2308.05326v1. [PMID: 37608940 PMCID: PMC10441447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Affiliation(s)
- Nazim Bouatta
  - Laboratory of Systems Pharmacology, Harvard Medical School
- Daniel Berenberg
  - Prescient Design, Genentech & Department of Computer Science, New York University
|
129
|
Sun Y, Shen Y. Structure-Informed Protein Language Models are Robust Predictors for Variant Effects. Research Square 2023:rs.3.rs-3219092. [PMID: 37577664 PMCID: PMC10418537 DOI: 10.21203/rs.3.rs-3219092/v1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 08/15/2023]
Abstract
Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) emerge as zero-shot predictors without the need of effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that their improvements often correlate with better variant effect prediction but their tradeoff can present a barrier as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) to inject protein structural contexts purposely and controllably, by extending masked sequence denoising in conventional pLMs to cross-modality denoising. Our SI-pLMs are applicable to revising any sequence-only pLMs through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and only use structures as context provider and model regularizer during training. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods including other pLMs, regardless of the target protein family's evolutionary information content or the tendency to overfitting / over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal major contributing factors and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git.
Affiliation(s)
- Yuanfei Sun
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
  - Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA
  - Institute of Biosciences and Technology and Department of Translational Medical Sciences, Texas A&M University, Houston, 77030, Texas, USA
|
130
|
Savojardo C, Martelli PL, Casadio R. Finding functional motifs in protein sequences with deep learning and natural language models. Curr Opin Struct Biol 2023; 81:102641. [PMID: 37385080 DOI: 10.1016/j.sbi.2023.102641] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/17/2023] [Revised: 04/17/2023] [Accepted: 05/24/2023] [Indexed: 07/01/2023]
Abstract
Recently, prediction of structural/functional motifs in protein sequences takes advantage of powerful machine learning based approaches. Protein encoding adopts protein language models overpassing standard procedures. Different combinations of machine learning and encoding schemas are available for predicting different structural/functional motifs. Particularly interesting is the adoption of protein language models to encode proteins in addition to evolution information and physicochemical parameters. A thorough analysis of recent predictors developed for annotating transmembrane regions, sorting signals, lipidation and phosphorylation sites allows to investigate the state-of-the-art focusing on the relevance of protein language models for the different tasks. This highlights that more experimental data are necessary to exploit available powerful machine learning methods.
Affiliation(s)
- Castrense Savojardo
  - Biocomputing Group, Dept. of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
- Pier Luigi Martelli
  - Biocomputing Group, Dept. of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
- Rita Casadio
  - Biocomputing Group, Dept. of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
|
131
|
Kandathil SM, Lau AM, Jones DT. Machine learning methods for predicting protein structure from single sequences. Curr Opin Struct Biol 2023; 81:102627. [PMID: 37320955 DOI: 10.1016/j.sbi.2023.102627] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/25/2023] [Revised: 05/17/2023] [Accepted: 05/17/2023] [Indexed: 06/17/2023]
Abstract
Recent breakthroughs in protein structure prediction have increasingly relied on the use of deep neural networks. These recent methods are notable in that they produce 3-D atomic coordinates as a direct output of the networks, a feature which presents many advantages. Although most techniques of this type make use of multiple sequence alignments as their primary input, a new wave of methods have attempted to use just single sequences as the input. We discuss the make-up and operating principles of these models, and highlight new developments in these areas, as well as areas for future development.
Affiliation(s)
- Shaun M Kandathil
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, United Kingdom
- Andy M Lau
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, United Kingdom
- David T Jones
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, United Kingdom
|
132
|
Saar KL, Qian D, Good LL, Morgunov AS, Collepardo-Guevara R, Best RB, Knowles TPJ. Theoretical and Data-Driven Approaches for Biomolecular Condensates. Chem Rev 2023; 123:8988-9009. [PMID: 37171907 PMCID: PMC10375482 DOI: 10.1021/acs.chemrev.2c00586] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/23/2022] [Indexed: 05/14/2023]
Abstract
Biomolecular condensation processes are increasingly recognized as a fundamental mechanism that living cells use to organize biomolecules in time and space. These processes can lead to the formation of membraneless organelles that enable cells to perform distinct biochemical processes in controlled local environments, thereby supplying them with an additional degree of spatial control relative to that achieved by membrane-bound organelles. This fundamental importance of biomolecular condensation has motivated a quest to discover and understand the molecular mechanisms and determinants that drive and control this process. Within this molecular viewpoint, computational methods can provide a unique angle to studying biomolecular condensation processes by contributing the resolution and scale that are challenging to reach with experimental techniques alone. In this Review, we focus on three types of dry-lab approaches: theoretical methods, physics-driven simulations and data-driven machine learning methods. We review recent progress in using these tools for probing biomolecular condensation across all three fields and outline the key advantages and limitations of each of the approaches. We further discuss some of the key outstanding challenges that we foresee the community addressing next in order to develop a more complete picture of the molecular driving forces behind biomolecular condensation processes and their biological roles in health and disease.
Affiliation(s)
- Kadi L. Saar
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
  - Transition Bio Ltd., Cambridge, United Kingdom
- Daoyuan Qian
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Lydia L. Good
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
  - Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, United States
- Alexey S. Morgunov
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Rosana Collepardo-Guevara
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
  - Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Robert B. Best
  - Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, United States
- Tuomas P. J. Knowles
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
  - Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, United Kingdom
|
133
|
Gonzales MEM, Ureta JC, Shrestha AMS. Protein embeddings improve phage-host interaction prediction. PLoS One 2023; 18:e0289030. [PMID: 37486915 PMCID: PMC10365317 DOI: 10.1371/journal.pone.0289030] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/21/2023] [Accepted: 07/07/2023] [Indexed: 07/26/2023] Open
Abstract
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
Affiliation(s)
- Mark Edward M Gonzales
  - Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
- Jennifer C Ureta
  - Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
- Anish M S Shrestha
  - Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines
  - Systems and Computational Biology Research Unit, Center for Natural Sciences and Environmental Research, De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
|
134
|
Wang C, Yuan C, Wang Y, Chen R, Shi Y, Zhang T, Xue F, Patti GJ, Wei L, Hou Q. MPI-VGAE: protein-metabolite enzymatic reaction link learning by variational graph autoencoders. Brief Bioinform 2023; 24:bbad189. [PMID: 37225420 PMCID: PMC10359079 DOI: 10.1093/bib/bbad189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 04/10/2023] [Accepted: 04/27/2023] [Indexed: 05/26/2023] Open
Abstract
Enzymatic reactions are crucial to explore the mechanistic function of metabolites and proteins in cellular processes and to understand the etiology of diseases. The increasing number of interconnected metabolic reactions allows the development of in silico deep learning-based methods to discover new enzymatic reaction links between metabolites and proteins to further expand the landscape of existing metabolite-protein interactome. Computational approaches to predict the enzymatic reaction link by metabolite-protein interaction (MPI) prediction are still very limited. In this study, we developed a Variational Graph Autoencoders (VGAE)-based framework to predict MPI in genome-scale heterogeneous enzymatic reaction networks across ten organisms. By incorporating molecular features of metabolites and proteins as well as neighboring information in the MPI networks, our MPI-VGAE predictor achieved the best predictive performance compared to other machine learning methods. Moreover, when applying the MPI-VGAE framework to reconstruct hundreds of metabolic pathways, functional enzymatic reaction networks and a metabolite-metabolite interaction network, our method showed the most robust performance among all scenarios. To the best of our knowledge, this is the first MPI predictor by VGAE for enzymatic reaction link prediction. Furthermore, we implemented the MPI-VGAE framework to reconstruct the disease-specific MPI network based on the disrupted metabolites and proteins in Alzheimer's disease and colorectal cancer, respectively. A substantial number of novel enzymatic reaction links were identified. We further validated and explored the interactions of these enzymatic reactions using molecular docking. These results highlight the potential of the MPI-VGAE framework for the discovery of novel disease-related enzymatic reactions and facilitate the study of the disrupted metabolisms in diseases.
Affiliation(s)
- Cheng Wang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
- Chuang Yuan
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
- Yahui Wang
  - Department of Chemistry, Washington University in St. Louis, St. Louis, MO, 63130, USA
  - Center for Metabolomics and Isotope Tracing, Washington University in St. Louis, St. Louis, MO, 63130, USA
- Ranran Chen
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
- Yuying Shi
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
- Tao Zhang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
- Fuzhong Xue
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
- Gary J Patti
  - Department of Chemistry, Washington University in St. Louis, St. Louis, MO, 63130, USA
  - Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63130, USA
  - Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO, 63130, USA
  - Center for Metabolomics and Isotope Tracing, Washington University in St. Louis, St. Louis, MO, 63130, USA
- Leyi Wei
  - School of Software, Shandong University, Jinan, 250100, China
- Qingzhen Hou
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250000, China
|
135
|
Zheng R, Huang Z, Deng L. Large-scale predicting protein functions through heterogeneous feature fusion. Brief Bioinform 2023:bbad243. [PMID: 37401369 DOI: 10.1093/bib/bbad243] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/12/2023] [Revised: 05/18/2023] [Accepted: 06/12/2023] [Indexed: 07/05/2023] Open
Abstract
As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement of coverage is because the number of structures predicted by AlphaFold is greatly increased, and on the other hand, PredGO can extensively use non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.
Affiliation(s)
- Rongtao Zheng
  - School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Zhijian Huang
  - School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, 410000 Changsha, China
|
136
|
Upadhyay V, Boorla VS, Maranas CD. Rank-ordering of known enzymes as starting points for re-engineering novel substrate activity using a convolutional neural network. Metab Eng 2023; 78:171-182. [PMID: 37301359 DOI: 10.1016/j.ymben.2023.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 05/19/2023] [Accepted: 06/02/2023] [Indexed: 06/12/2023]
Abstract
Retro-biosynthetic approaches have made significant advances in predicting synthesis routes of target biofuel, bio-renewable or bio-active molecules. The use of only cataloged enzymatic activities limits the discovery of new production routes. Recent retro-biosynthetic algorithms increasingly use novel conversions that require altering the substrate or cofactor specificities of existing enzymes while connecting pathways leading to a target metabolite. However, identifying and re-engineering enzymes for desired novel conversions are currently the bottlenecks in implementing such designed pathways. Herein, we present EnzRank, a convolutional neural network (CNN) based approach, to rank-order existing enzymes in terms of their suitability to undergo successful protein engineering through directed evolution or de novo design towards a desired specific substrate activity. We train the CNN model on 11,800 known active enzyme-substrate pairs from the BRENDA database as positive samples and data generated by scrambling these pairs as negative samples using substrate dissimilarity between an enzyme's native substrate and all other molecules present in the dataset using Tanimoto similarity score. EnzRank achieves an average recovery rate of 80.72% and 73.08% for positive and negative pairs on test data after using a 10-fold holdout method for training and cross-validation. We further developed a web-based user interface (available at https://huggingface.co/spaces/vuu10/EnzRank) to predict enzyme-substrate activity using SMILES strings of substrates and enzyme sequence as input to allow convenient and easy-to-use access to EnzRank. In summary, this effort can aid de novo pathway design tools to prioritize starting enzyme re-engineering candidates for novel reactions as well as in predicting the potential secondary activity of enzymes in cell metabolism.
Affiliation(s)
- Vikas Upadhyay
  - Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Veda Sheersh Boorla
  - Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Costas D Maranas
  - Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
|
137
|
Biharie K, Michielsen L, Reinders MJT, Mahfouz A. Cell type matching across species using protein embeddings and transfer learning. Bioinformatics 2023; 39:i404-i412. [PMID: 37387141 PMCID: PMC10311290 DOI: 10.1093/bioinformatics/btad248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Knowing the relation between cell types is crucial for translating experimental results from mice to humans. Establishing cell type matches, however, is hindered by the biological differences between the species. A substantial amount of evolutionary information between genes that could be used to align the species is discarded by most of the current methods since they only use one-to-one orthologous genes. Some methods try to retain the information by explicitly including the relation between genes, however, not without caveats.
RESULTS: In this work, we present a model to transfer and align cell types in cross-species analysis (TACTiCS). First, TACTiCS uses a natural language processing model to match genes using their protein sequences. Next, TACTiCS employs a neural network to classify cell types within a species. Afterward, TACTiCS uses transfer learning to propagate cell type labels between species. We applied TACTiCS on scRNA-seq data of the primary motor cortex of human, mouse, and marmoset. Our model can accurately match and align cell types on these datasets. Moreover, our model outperforms Seurat and the state-of-the-art method SAMap. Finally, we show that our gene matching method results in better cell type matches than BLAST in our model.
AVAILABILITY AND IMPLEMENTATION: The implementation is available on GitHub (https://github.com/kbiharie/TACTiCS). The preprocessed datasets and trained models can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.7582460).
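The gene-matching step can be pictured as a nearest-neighbour search over protein embeddings. The sketch below assumes cosine similarity and a top-k cutoff; the abstract does not state which similarity measure or matching rule TACTiCS uses, so this is one plausible reading, not the published implementation. The gene names and 2-dimensional embeddings are toy values.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_genes(species_a: Dict[str, Sequence[float]],
                species_b: Dict[str, Sequence[float]],
                top_k: int = 1) -> Dict[str, List[str]]:
    """For each gene in species A, return the top_k most similar genes in species B."""
    matches = {}
    for gene_a, emb_a in species_a.items():
        ranked = sorted(species_b,
                        key=lambda g: cosine(emb_a, species_b[g]),
                        reverse=True)
        matches[gene_a] = ranked[:top_k]
    return matches


# Toy embeddings; real ones would come from a protein language model.
mouse = {"Gad1": [1.0, 0.0], "Slc17a7": [0.0, 1.0]}
human = {"GAD1": [0.9, 0.1], "SLC17A7": [0.1, 0.9]}
gene_matches = match_genes(mouse, human)
```

Allowing `top_k` greater than one is what lets such a scheme go beyond one-to-one orthologs, which is the limitation the abstract raises.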
Affiliation(s)
- Kirti Biharie
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Lieke Michielsen
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Marcel J T Reinders
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Ahmed Mahfouz
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
  - Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
|
138
|
Mohseni Behbahani Y, Laine E, Carbone A. Deep Local Analysis deconstructs protein-protein interfaces and accurately estimates binding affinity changes upon mutation. Bioinformatics 2023; 39:i544-i552. [PMID: 37387162 DOI: 10.1093/bioinformatics/btad231] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: The spectacular recent advances in protein and protein complex structure prediction hold promise for reconstructing interactomes at large-scale and residue resolution. Beyond determining the 3D arrangement of interacting partners, modeling approaches should be able to unravel the impact of sequence variations on the strength of the association.
RESULTS: In this work, we report on Deep Local Analysis, a novel and efficient deep learning framework that relies on a strikingly simple deconstruction of protein interfaces into small locally oriented residue-centered cubes and on 3D convolutions recognizing patterns within cubes. Merely based on the two cubes associated with the wild-type and the mutant residues, DLA accurately estimates the binding affinity change for the associated complexes. It achieves a Pearson correlation coefficient of 0.735 on about 400 mutations on unseen complexes. Its generalization capability on blind datasets of complexes is higher than the state-of-the-art methods. We show that taking into account the evolutionary constraints on residues contributes to predictions. We also discuss the influence of conformational variability on performance. Beyond the predictive power on the effects of mutations, DLA is a general framework for transferring the knowledge gained from the available non-redundant set of complex protein structures to various tasks. For instance, given a single partially masked cube, it recovers the identity and physicochemical class of the central residue. Given an ensemble of cubes representing an interface, it predicts the function of the complex.
AVAILABILITY AND IMPLEMENTATION: Source code and models are available at http://gitlab.lcqb.upmc.fr/DLA/DLA.git.
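For readers unfamiliar with the metric quoted in this abstract, the Pearson correlation coefficient between predicted and measured binding affinity changes can be computed as in the sketch below. The function name and the example values are illustrative only and are not taken from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up predicted vs. measured affinity changes, for illustration only.
predicted = [0.5, 1.2, -0.3, 2.0]
measured = [0.4, 1.0, 0.1, 1.8]
r = pearson(predicted, measured)
```

A value near 1 means predictions rise and fall with the measurements; it does not by itself say the predictions are on the right absolute scale.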
Affiliation(s)
- Yasser Mohseni Behbahani
- Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France
| | - Elodie Laine
- Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France
| | - Alessandra Carbone
- Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France
| |
|
139
|
Lobo F, González MS, Boto A, Pérez de la Lastra JM. Prediction of Antifungal Activity of Antimicrobial Peptides by Transfer Learning from Protein Pretrained Models. Int J Mol Sci 2023; 24:10270. [PMID: 37373415 DOI: 10.3390/ijms241210270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2023] [Revised: 06/12/2023] [Accepted: 06/14/2023] [Indexed: 06/29/2023] Open
Abstract
Peptides with antifungal activity have gained significant attention due to their potential therapeutic applications. In this study, we explore the use of pretrained protein models as feature extractors to develop predictive models for antifungal peptide activity. Various machine learning classifiers were trained and evaluated. Our AFP predictor achieved comparable performance to current state-of-the-art methods. Overall, our study demonstrates the effectiveness of pretrained models for peptide analysis and provides a valuable tool for predicting antifungal peptide activity and potentially other peptide properties.
Affiliation(s)
- Fernando Lobo
- Programa Agustín de Betancourt, Universidad de La Laguna, 38206 La Laguna, Tenerife, Spain
| | - Maily Selena González
- Instituto de Productos Naturales y Agrobiología del CSIC, Avda. Astrofísico Fco. Sánchez, 3, 38206 La Laguna, Tenerife, Spain
| | - Alicia Boto
- Instituto de Productos Naturales y Agrobiología del CSIC, Avda. Astrofísico Fco. Sánchez, 3, 38206 La Laguna, Tenerife, Spain
| | - José Manuel Pérez de la Lastra
- Instituto de Productos Naturales y Agrobiología del CSIC, Avda. Astrofísico Fco. Sánchez, 3, 38206 La Laguna, Tenerife, Spain
| |
|
140
|
Murad T, Ali S, Patterson M. Exploring the Potential of GANs in Biological Sequence Analysis. BIOLOGY 2023; 12:854. [PMID: 37372139 DOI: 10.3390/biology12060854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/29/2023] [Revised: 06/03/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023]
Abstract
Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for biological sequence analysis that effectively analyze the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, which is generally associated with biological sequence datasets and hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, but they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resemble real data; these generated data can thus be employed to enhance the ML models' performance by eradicating the class imbalance problem for biological sequence analysis. We perform four distinct classification tasks by using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host) and our results illustrate that GANs can improve the overall classification performance.
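The remark that SMOTE relies on local information can be illustrated with a minimal sketch of SMOTE-style oversampling. This is not the authors' code and not a full SMOTE implementation; the function name `smote_like_sample`, the parameter `k`, and the toy points are assumptions made for illustration. A synthetic point is drawn on the line segment between a minority sample and one of its nearest minority neighbours, so it never leaves the local neighbourhood of existing samples.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point interpolated between a minority sample
    and one of its k nearest minority neighbours (hypothetical helper)."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base.
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbour = rng.choice(neighbours)
    t = rng.random()  # interpolation factor in [0, 1)
    return [a + t * (b - a) for a, b in zip(base, neighbour)]


# Toy, made-up minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
```

Because each synthetic point depends only on one sample and its neighbours, the overall class distribution is never modelled, which is the limitation the abstract contrasts with GAN-based generation.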
Affiliation(s)
- Taslim Murad
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Sarwan Ali
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| |
|
141
|
Singh R, Sledzieski S, Bryson B, Cowen L, Berger B. Contrastive learning in protein language space predicts interactions between drugs and protein targets. Proc Natl Acad Sci U S A 2023; 120:e2220778120. [PMID: 37289807 PMCID: PMC10268324 DOI: 10.1073/pnas.2220778120] [Citation(s) in RCA: 47] [Impact Index Per Article: 23.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2022] [Accepted: 04/10/2023] [Indexed: 06/10/2023] Open
Abstract
Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, ConPLex embeddings are interpretable, which enables us to visualize the drug-target embedding space and use embeddings to characterize the function of human cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico drug screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.
Affiliation(s)
- Rohit Singh
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Samuel Sledzieski
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Bryan Bryson
- Ragon Institute of MGH, MIT and Harvard, Cambridge, MA 02139
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Lenore Cowen
- Department of Computer Science, Tufts University, Medford, MA 02155
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139
| |
|
142
|
Nicholas Chua B, Mei Guo W, Teng Wong H, Siak-Wei Ow D, Leng Ho P, Koh W, Koay A, Tian Wong F. A sweeter future: Using protein language models for exploring sweeter brazzein homologs. Food Chem 2023; 426:136580. [PMID: 37331142 DOI: 10.1016/j.foodchem.2023.136580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2023] [Revised: 05/23/2023] [Accepted: 06/06/2023] [Indexed: 06/20/2023]
Abstract
With growing concerns over the health impact of sugar, brazzein offers a viable alternative due to its sweetness, thermostability, and low risk profile. Here, we demonstrated the ability of protein language models to design new brazzein homologs with improved thermostability and potentially higher sweetness, resulting in new diverse optimized amino acid sequences that improve structural and functional features beyond what conventional methods could achieve. This innovative approach resulted in the identification of unexpected mutations, thereby generating new possibilities for protein engineering. To facilitate the characterization of the brazzein mutants, a simplified procedure was developed for expressing and analyzing related proteins. This process involved an efficient purification method using Lactococcus lactis (L. lactis), a generally recognized as safe (GRAS) bacterium, as well as taste receptor assays to evaluate sweetness. The study successfully demonstrated the potential of computational design in producing a more heat-resistant and potentially more palatable brazzein variant, V23.
Affiliation(s)
- Bryan Nicholas Chua
- Molecular Engineering Laboratory, Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, #07-06, Proteos, Singapore 138673, Republic of Singapore
| | - Wei Mei Guo
- Singapore Institute of Food and Biotechnology Innovation (SIFBI), Agency for Science, Technology and Research (A*STAR), 31 Biopolis Way, #02-01, Nanos, Singapore 138669, Republic of Singapore
| | - Han Teng Wong
- Molecular Engineering Laboratory, Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, #07-06, Proteos, Singapore 138673, Republic of Singapore
| | - Dave Siak-Wei Ow
- Bioprocessing Technology Institute (BTI), Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01, Centros, Singapore 138668, Republic of Singapore
| | - Pooi Leng Ho
- Bioprocessing Technology Institute (BTI), Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01, Centros, Singapore 138668, Republic of Singapore
| | - Winston Koh
- Institute of Bioengineering and Bioimaging (IBB), Agency for Science, Technology and Research (A*STAR), 31 Biopolis Way, #07-01, Nanos, Singapore 138669, Republic of Singapore; Bioinformatics Institute (BII), Agency of Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Republic of Singapore.
| | - Ann Koay
- Singapore Institute of Food and Biotechnology Innovation (SIFBI), Agency for Science, Technology and Research (A*STAR), 31 Biopolis Way, #02-01, Nanos, Singapore 138669, Republic of Singapore.
| | - Fong Tian Wong
- Molecular Engineering Laboratory, Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, #07-06, Proteos, Singapore 138673, Republic of Singapore; Institute of Sustainability for Chemicals, Energy and Environment (ISCE(2)), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, Neuros, #07-01, Singapore 138665, Republic of Singapore.
| |
|
143
|
Ouellet S, Ferguson L, Lau AZ, Lim TKY. CysPresso: a classification model utilizing deep learning protein representations to predict recombinant expression of cysteine-dense peptides. BMC Bioinformatics 2023; 24:200. [PMID: 37193950 PMCID: PMC10189939 DOI: 10.1186/s12859-023-05327-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2023] [Accepted: 05/08/2023] [Indexed: 05/18/2023] Open
Abstract
BACKGROUND Cysteine-dense peptides (CDPs) are an attractive pharmaceutical scaffold that display extreme biochemical properties, low immunogenicity, and the ability to bind targets with high affinity and selectivity. While many CDPs have potential and confirmed therapeutic uses, synthesis of CDPs is a challenge. Recent advances have made the recombinant expression of CDPs a viable alternative to chemical synthesis. Moreover, identifying CDPs that can be expressed in mammalian cells is crucial in predicting their compatibility with gene therapy and mRNA therapy. Currently, we lack the ability to identify CDPs that will express recombinantly in mammalian cells without labour intensive experimentation. To address this, we developed CysPresso, a novel machine learning model that predicts recombinant expression of CDPs based on primary sequence. RESULTS We tested various protein representations generated by deep learning algorithms (SeqVec, proteInfer, AlphaFold2) for their suitability in predicting CDP expression and found that AlphaFold2 representations possessed the best predictive features. We then optimized the model by concatenation of AlphaFold2 representations, time series transformation with random convolutional kernels, and dataset partitioning. CONCLUSION Our novel model, CysPresso, is the first to successfully predict recombinant CDP expression in mammalian cells and is particularly well suited for predicting recombinant expression of knottin peptides. When preprocessing the deep learning protein representation for supervised machine learning, we found that random convolutional kernel transformation preserves more pertinent information relevant for predicting expressibility than embedding averaging. Our study showcases the applicability of deep learning-based protein representations, such as those provided by AlphaFold2, in tasks beyond structure prediction.
Affiliation(s)
| | - Larissa Ferguson
- Neurobiology Division, MRC Laboratory of Molecular Biology, Cambridge, UK
| | - Angus Z Lau
- Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Physical Sciences Platform, Sunnybrook Research Institute, Toronto, ON, Canada
| | - Tony K Y Lim
- , Vancouver, Canada.
- Department of Pharmacology, University of Cambridge, Cambridge, UK.
| |
|
144
|
Yoshimori A, Bajorath J. Motif2Mol: Prediction of New Active Compounds Based on Sequence Motifs of Ligand Binding Sites in Proteins Using a Biochemical Language Model. Biomolecules 2023; 13:biom13050833. [PMID: 37238703 DOI: 10.3390/biom13050833] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2023] [Revised: 05/05/2023] [Accepted: 05/12/2023] [Indexed: 05/28/2023] Open
Abstract
In drug design, the prediction of new active compounds from protein sequence data has only been attempted in a few studies thus far. This prediction task is principally challenging because global protein sequence similarity has strong evolutionary and structural implications, but is often only vaguely related to ligand binding. Deep language models adapted from natural language processing offer new opportunities to attempt such predictions via machine translation by directly relating amino acid sequences and chemical structures to each other based on textual molecular representations. Herein, we introduce a biochemical language model with transformer architecture for the prediction of new active compounds from sequence motifs of ligand binding sites. In a proof-of-concept application on inhibitors of more than 200 human kinases, the Motif2Mol model revealed promising learning characteristics and an unprecedented ability to consistently reproduce known inhibitors of different kinases.
Affiliation(s)
- Atsushi Yoshimori
- Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-Chome, Fujisawa 251-0012, Japan
| | - Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
| |
|
145
|
Kim Y, Kwon J. AttSec: protein secondary structure prediction by capturing local patterns from attention map. BMC Bioinformatics 2023; 24:183. [PMID: 37142993 PMCID: PMC10161504 DOI: 10.1186/s12859-023-05310-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2023] [Accepted: 04/27/2023] [Indexed: 05/06/2023] Open
Abstract
BACKGROUND Protein secondary structures, which link simple 1D sequences to complex 3D structures, can be used not only as good features for describing the local properties of proteins but also as key features for predicting the complex 3D structures of proteins. Thus, it is very important to accurately predict the secondary structure of a protein, which contains a local structural property assigned by the pattern of hydrogen bonds formed between amino acids. In this study, we accurately predict protein secondary structure by capturing the local patterns of proteins. For this objective, we present a novel prediction model, AttSec, based on transformer architecture. In particular, AttSec extracts self-attention maps corresponding to pairwise features between amino acid embeddings and passes them through 2D convolution blocks to capture local patterns. In addition, instead of using additional evolutionary information, it uses protein embedding as an input, which is generated by a language model. RESULTS For the ProteinNet DSSP8 dataset, our model showed 11.8% better performance on the entire evaluation datasets compared with other no-evolutionary-information-based models. For the NetSurfP-2.0 DSSP8 dataset, it showed 1.2% better performance on average. There was an average performance improvement of 9.0% for the ProteinNet DSSP3 dataset and an average of 0.7% for the NetSurfP-2.0 DSSP3 dataset. CONCLUSION We accurately predict protein secondary structure by capturing the local patterns of proteins. For this objective, we present a novel prediction model, AttSec, based on transformer architecture. Although there was no dramatic accuracy improvement compared with other models, the improvement on DSSP8 was greater than that on DSSP3. This result implies that using our proposed pairwise feature could have a remarkable effect for several challenging tasks that require finely subdivided classification. The GitHub package URL is https://github.com/youjin-DDAI/AttSec.
Affiliation(s)
- Youjin Kim
- Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea
- LG AI Research, Seoul, Republic of Korea
| | - Junseok Kwon
- Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea.
| |
|
146
|
Flamholz ZN, Biller SJ, Kelly L. Large language models improve annotation of viral proteins. RESEARCH SQUARE 2023:rs.3.rs-2852098. [PMID: 37205395 PMCID: PMC10187409 DOI: 10.21203/rs.3.rs-2852098/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/21/2023]
Abstract
Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.
Affiliation(s)
- Zachary N. Flamholz
- Department of Systems and Computational Biology, Albert Einstein College of Medicine; Bronx, NY, USA
| | - Steve J. Biller
- Department of Biological Sciences, Wellesley College; Wellesley, MA USA
| | - Libusha Kelly
- Department of Systems and Computational Biology, Albert Einstein College of Medicine; Bronx, NY, USA
- Department of Microbiology and Immunology, Albert Einstein College of Medicine; Bronx, NY, USA
| |
|
147
|
Soylu NN, Sefer E. BERT2OME: Prediction of 2'-O-Methylation Modifications From RNA Sequence by Transformer Architecture Based on BERT. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2177-2189. [PMID: 37819796 DOI: 10.1109/tcbb.2023.3237769] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
Recent work on language models has resulted in state-of-the-art performance on various language tasks. Among these, Bidirectional Encoder Representations from Transformers (BERT) has focused on contextualizing word embeddings to extract context and semantics of the words. On the other hand, post-transcriptional 2'-O-methylation (Nm) RNA modification is important in various cellular tasks and related to a number of diseases. The existing high-throughput experimental techniques take a long time to detect these modifications and are costly for exploring these functional processes. Here, to understand the associated biological processes more deeply and quickly, we propose an efficient method, Bert2Ome, to infer 2'-O-methylation RNA modification sites from RNA sequences. Bert2Ome combines a BERT-based model with convolutional neural networks (CNN) to infer the relationship between the modification sites and RNA sequence content. Unlike the methods proposed so far, Bert2Ome treats each given RNA sequence as text and focuses on improving the modification prediction performance by integrating the pretrained deep learning-based language model BERT. Additionally, our transformer-based approach could infer modification sites across multiple species. According to 5-fold cross-validation, human and mouse accuracies were 99.15% and 94.35%, respectively. Similarly, ROC AUC scores were 0.99 and 0.94 for the same species. Detailed results show that Bert2Ome reduces the time consumed in biological experiments and outperforms the existing approaches across different datasets and species over multiple metrics. Additionally, deep learning approaches such as 2D CNNs are more promising in learning BERT attributes than more conventional machine learning methods.
|
148
|
Mardikoraem M, Woldring D. Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods. Pharmaceutics 2023; 15:1337. [PMID: 37242577 PMCID: PMC10224321 DOI: 10.3390/pharmaceutics15051337] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2023] [Revised: 04/19/2023] [Accepted: 04/21/2023] [Indexed: 05/28/2023] Open
Abstract
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Elaboration on performance is provided over protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).
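The ranking step named in this abstract (TOPSIS with entropy weighting) can be sketched as follows. This is a textbook-style illustration, not the authors' pipeline; the function names and the metric table are hypothetical, all criteria are assumed to be "higher is better", and all scores are assumed positive with at least one criterion varying across methods.

```python
import math


def entropy_weights(matrix):
    """Entropy weighting: criteria whose scores differ more across
    alternatives carry more information and receive larger weights."""
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]


def topsis(matrix, weights):
    """Relative closeness of each alternative to the ideal solution."""
    n = len(matrix[0])
    # Vector-normalise each criterion, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical table: rows are candidate methods, columns are two evaluation metrics.
metrics = [[0.90, 0.80], [0.50, 0.40], [0.70, 0.60]]
weights = entropy_weights(metrics)
scores = topsis(metrics, weights)
ranking = sorted(range(len(metrics)), key=lambda i: scores[i], reverse=True)
```

In this toy table the first method dominates on both metrics and the second is dominated, so they receive the highest and lowest closeness scores; real metric tables rarely separate this cleanly, which is where the weighting matters.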
Affiliation(s)
- Mehrsa Mardikoraem
- Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA
- Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Daniel Woldring
- Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA
- Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
| |
|
149
|
Jha K, Karmakar S, Saha S. Graph-BERT and language model-based framework for protein-protein interaction identification. Sci Rep 2023; 13:5663. [PMID: 37024543 PMCID: PMC10079975 DOI: 10.1038/s41598-023-31612-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2022] [Accepted: 03/14/2023] [Indexed: 04/08/2023] Open
Abstract
Identification of protein-protein interactions (PPI) is among the critical problems in the domain of bioinformatics. With advances in artificial intelligence (AI) techniques, previous studies have utilized different AI-based models for PPI classification. The inputs to these models are features extracted from different sources of protein information, mainly sequence-derived features. In this work, we present an AI-based PPI identification model utilizing a PPI network and protein sequences. The PPI network is represented as a graph where each node is a protein pair, and an edge is defined between two nodes if there exists a common protein between these nodes. Each node in the graph has a feature vector. In this work, we have used a language model to extract feature vectors directly from protein sequences. The feature vectors of the two proteins in a pair are concatenated and used as the node feature vector in the PPI network graph. Finally, we have used the Graph-BERT model to encode the PPI network graph with sequence-based features and learn the hidden representation of the feature vector for each node. The next step involves feeding the learned representations of nodes to the fully connected layer, the output of which is fed into the softmax layer to classify the protein interactions. To assess the efficacy of the proposed PPI model, we have performed experiments on several PPI datasets. The experimental results demonstrate that the proposed approach surpasses the existing PPI works and designed baselines in classifying PPI.
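The graph construction described in this abstract (nodes are protein pairs, edges join pairs that share a protein, node features are concatenated per-protein vectors) can be sketched in a few lines. The function name `build_pair_graph`, the protein identifiers, and the two-dimensional embeddings are made up for illustration; a real language model would supply much longer vectors, and the Graph-BERT encoder itself is not shown.

```python
from itertools import combinations


def build_pair_graph(pairs, embeddings):
    """Build a graph whose nodes are protein pairs.

    Two nodes are connected when their pairs share a protein; each node's
    feature vector is the concatenation of its two protein embeddings.
    """
    features = {
        i: embeddings[a] + embeddings[b]  # list concatenation
        for i, (a, b) in enumerate(pairs)
    }
    edges = [
        (i, j)
        for i, j in combinations(range(len(pairs)), 2)
        if set(pairs[i]) & set(pairs[j])
    ]
    return features, edges


# Toy, made-up per-protein embeddings and candidate pairs.
embeddings = {
    "P1": [0.1, 0.2],
    "P2": [0.3, 0.4],
    "P3": [0.5, 0.6],
    "P4": [0.7, 0.8],
}
pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
features, edges = build_pair_graph(pairs, embeddings)
```

Here the first and second pairs share P2 and the second and third share P3, so the toy graph is a path of three nodes, each carrying a four-dimensional feature vector.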
Affiliation(s)
- Kanchan Jha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India.
| | - Sourav Karmakar
- Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, West Bengal, 713209, India
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
| |
|
150
|
Bryant P. Deep learning for protein complex structure prediction. Curr Opin Struct Biol 2023; 79:102529. [PMID: 36731337 DOI: 10.1016/j.sbi.2023.102529] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2022] [Revised: 12/10/2022] [Accepted: 12/20/2022] [Indexed: 02/04/2023]
Abstract
Recent developments in the structure prediction of protein complexes have resulted in accuracies rivalling experimental methods in many cases. The high accuracy is mainly observed in dimeric complexes and other problems such as protein disorder and predicting the structure of host-pathogen interactions remain. This review highlights the foundation for current accurate structure prediction of protein complexes and possible ways to address the remaining limitations.
Affiliation(s)
- Patrick Bryant
- Science for Life Laboratory, 172 21 Solna, Sweden; Department of Biochemistry and Biophysics, Stockholm University, 106 91 Stockholm, Sweden.
| |
|