151
Kondo HX, Iizuka H, Masumoto G, Kabaya Y, Kanematsu Y, Takano Y. Prediction of Protein Function from Tertiary Structure of the Active Site in Heme Proteins by Convolutional Neural Network. Biomolecules 2023; 13:biom13010137. [PMID: 36671521 PMCID: PMC9855806 DOI: 10.3390/biom13010137] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/27/2022] [Accepted: 01/07/2023] [Indexed: 01/11/2023]
Abstract
Structure-function relationships in proteins have been one of the crucial scientific topics in recent research. Heme proteins have diverse and pivotal biological functions. Therefore, clarifying their structure-function correlation is significant to understand their functional mechanism and is informative for various fields of science. In this study, we constructed convolutional neural network models for predicting protein functions from the tertiary structures of heme-binding sites (active sites) of heme proteins to examine the structure-function correlation. As a result, we succeeded in the classification of oxygen-binding protein (OB), oxidoreductase (OR), proteins with both functions (OB-OR), and electron transport protein (ET) with high accuracy. Although the misclassification rate for OR and ET was high, the rates between OB and ET and between OB and OR were almost zero, indicating that the prediction model works well between protein groups with quite different functions. However, predicting the function of proteins modified with amino acid mutation(s) remains a challenge. Our findings indicate a structure-function correlation in the active site of heme proteins. This study is expected to be applied to the prediction of more detailed protein functions such as catalytic reactions.
Affiliation(s)
- Hiroko X. Kondo
  - Faculty of Engineering, Kitami Institute of Technology, 165 Koen-cho, Kitami 090-8507, Japan
  - Graduate School of Information Sciences, Hiroshima City University, 3-4-1 Ozukahigashi Asaminamiku, Hiroshima 731-3194, Japan
  - Laboratory for Computational Molecular Design, RIKEN Center for Biosystems Dynamics Research, 6-2-3 Furuedai, Suita 565-0874, Japan
  - Correspondence: (H.X.K.); (Y.T.); Tel.: +81-157-26-9401 (H.X.K.); +81-82-830-1825 (Y.T.)
- Hiroyuki Iizuka
  - Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kitaku, Sapporo 060-0814, Japan
- Gen Masumoto
  - Information Systems Division, RIKEN Information R&D and Strategy Headquarters, 2-1 Hirosawa, Wako 351-0198, Japan
- Yuichi Kabaya
  - Faculty of Engineering, Kitami Institute of Technology, 165 Koen-cho, Kitami 090-8507, Japan
- Yusuke Kanematsu
  - Graduate School of Information Sciences, Hiroshima City University, 3-4-1 Ozukahigashi Asaminamiku, Hiroshima 731-3194, Japan
  - Graduate School of Advanced Science and Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, Japan
- Yu Takano
  - Graduate School of Information Sciences, Hiroshima City University, 3-4-1 Ozukahigashi Asaminamiku, Hiroshima 731-3194, Japan
  - Correspondence: (H.X.K.); (Y.T.); Tel.: +81-157-26-9401 (H.X.K.); +81-82-830-1825 (Y.T.)
152
George A, Kim DN, Moser T, Gildea IT, Evans JE, Cheung MS. Graph identification of proteins in tomograms (GRIP-Tomo). Protein Sci 2023; 32:e4538. [PMID: 36482866 PMCID: PMC9798246 DOI: 10.1002/pro.4538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Revised: 11/23/2022] [Accepted: 12/03/2022] [Indexed: 12/14/2022]
Abstract
In this study, we present a method of pattern mining based on network theory that enables the identification of protein structures or complexes from synthetic volume densities, without the knowledge of predefined templates or human biases for refinement. We hypothesized that the topological connectivity of protein structures is invariant, and they are distinctive for the purpose of protein identification from distorted data presented in volume densities. Three-dimensional densities of a protein or a complex from simulated tomographic volumes were transformed into mathematical graphs as observables. We systematically introduced data distortion or defects such as missing fullness of data, the tumbling effect, and the missing wedge effect into the simulated volumes, and varied the distance cutoffs in pixels to capture the varying connectivity between the density cluster centroids in the presence of defects. A similarity score between the graphs from the simulated volumes and the graphs transformed from the physical protein structures in point data was calculated by comparing their network theory order parameters including node degrees, betweenness centrality, and graph densities. By capturing the essential topological features defining the heterogeneous morphologies of a network, we were able to accurately identify proteins and homo-multimeric complexes from 10 topologically distinctive samples without realistic noise added. Our approach empowers future developments of tomogram processing by providing pattern mining with interpretability, to enable the classification of single-domain protein native topologies as well as distinct single-domain proteins from multimeric complexes within noisy volumes.
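The graph comparison described in this abstract can be illustrated with a small sketch. It is not the GRIP-Tomo implementation: the function names, the feature set (mean node degree and graph density only, with betweenness centrality left out for brevity) and the similarity formula are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def build_graph(points, cutoff):
    """Connect every pair of points no farther apart than `cutoff`.

    Returns an adjacency mapping {node index: set of neighbour indices}.
    """
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def graph_features(adj):
    """Summarise a graph by its mean node degree and its density."""
    n = len(adj)
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    mean_degree = 2 * n_edges / n if n else 0.0
    density = 2 * n_edges / (n * (n - 1)) if n > 1 else 0.0
    return {"mean_degree": mean_degree, "density": density}


def similarity(f1, f2):
    """Hypothetical score in (0, 1]; 1.0 means identical feature values."""
    diff = sum(abs(f1[k] - f2[k]) for k in f1)
    return 1.0 / (1.0 + diff)


# Four cluster centroids on a unit square; a translated copy has the same graph.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
moved = [(x + 5, y - 2, z + 1) for x, y, z in square]
reference = graph_features(build_graph(square, cutoff=1.1))
candidate = graph_features(build_graph(moved, cutoff=1.1))
print(similarity(reference, candidate))
```

Because the features depend only on connectivity, rigid motions of the point set leave the score unchanged, while a different distance cutoff changes the graph and therefore the score.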
Affiliation(s)
- August George
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, USA
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, USA
- Doo Nam Kim
  - Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington, USA
- Trevor Moser
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, USA
- Ian T Gildea
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, USA
- James E Evans
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, USA
  - School of Biological Sciences, Washington State University, Pullman, Washington, USA
- Margaret S Cheung
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, USA
  - Department of Physics, University of Washington, Seattle, Washington, USA
153
Lim PK, Julca I, Mutwil M. Redesigning plant specialized metabolism with supervised machine learning using publicly available reactome data. Comput Struct Biotechnol J 2023; 21:1639-1650. [PMID: 36874159 PMCID: PMC9976193 DOI: 10.1016/j.csbj.2023.01.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/27/2022] [Revised: 01/12/2023] [Accepted: 01/12/2023] [Indexed: 01/19/2023]
Abstract
The immense structural diversity of products and intermediates of plant specialized metabolism (specialized metabolites) makes them rich sources of therapeutic medicine, nutrients, and other useful materials. With the rapid accumulation of reactome data accessible in biological and chemical databases, along with recent advances in machine learning, this review sets out to outline how supervised machine learning can be used to design new compounds and pathways by exploiting the wealth of said data. We will first examine the various sources from which reactome data can be obtained, followed by explaining the different machine learning encoding methods for reactome data. We then discuss current supervised machine learning developments that can be employed in various aspects to help redesign plant specialized metabolism.
Affiliation(s)
- Peng Ken Lim
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Irene Julca
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Marek Mutwil
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
154
Dehnavi A, Nazem F, Ghasemi F, Fassihi A, Rasti R. A GU-Net-based architecture predicting ligand–protein-binding atoms. Journal of Medical Signals & Sensors 2023. [DOI: 10.4103/jmss.jmss_142_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/29/2023]
155
Zhang J, Lin X, Chen Y, Li T, Lee AC, Chow EY, Cho WC, Chan T. LAFITE Reveals the Complexity of Transcript Isoforms in Subcellular Fractions. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2023; 10:e2203480. [PMID: 36461702 PMCID: PMC9875686 DOI: 10.1002/advs.202203480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 10/28/2022] [Indexed: 06/17/2023]
Abstract
Characterization of the subcellular distribution of RNA is essential for understanding the molecular basis of biological processes. Here, the subcellular nanopore direct RNA-sequencing (DRS) of four lung cancer cell lines (A549, H1975, H358, and HCC4006) is performed, coupled with a computational pipeline, Low-abundance Aware Full-length Isoform clusTEr (LAFITE), to comprehensively analyze the full-length cytoplasmic and nuclear transcriptome. Using additional DRS and orthogonal data sets, it is shown that LAFITE outperforms current methods for detecting full-length transcripts, particularly for low-abundance isoforms that are usually overlooked due to poor read coverage. Experimental validation of six novel isoforms exclusively identified by LAFITE further confirms the reliability of this pipeline. By applying LAFITE to subcellular DRS data, the complexity of the nuclear transcriptome is revealed in terms of isoform diversity, 3'-UTR usage, m6A modification patterns, and intron retention. Overall, LAFITE provides enhanced full-length isoform identification and enables a high-resolution view of the RNA landscape at the isoform level.
Affiliation(s)
- Jizhou Zhang
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Xiao Lin
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Yuelong Chen
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Tsz-Ho Li
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Alan Chun-Kit Lee
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
156
Durairaj J, de Ridder D, van Dijk AD. Beyond sequence: Structure-based machine learning. Comput Struct Biotechnol J 2022; 21:630-643. [PMID: 36659927 PMCID: PMC9826903 DOI: 10.1016/j.csbj.2022.12.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Revised: 12/21/2022] [Accepted: 12/21/2022] [Indexed: 12/31/2022]
Abstract
Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
Affiliation(s)
- Janani Durairaj
  - Biozentrum, University of Basel, Basel, Switzerland
  - Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
- Dick de Ridder
  - Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
- Aalt D.J. van Dijk
  - Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
157
Fischer S, Gillis J. Defining the extent of gene function using ROC curvature. Bioinformatics 2022; 38:5390-5397. [PMID: 36271855 PMCID: PMC9750128 DOI: 10.1093/bioinformatics/btac692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2022] [Revised: 09/19/2022] [Accepted: 10/20/2022] [Indexed: 12/25/2022]
Abstract
MOTIVATION: Interactions between proteins help us understand how genes are functionally related and how they contribute to phenotypes. Experiments provide imperfect 'ground truth' information about a small subset of potential interactions in a specific biological context, which can then be extended to the whole genome across different contexts, such as conditions, tissues or species, through machine learning methods. However, evaluating the performance of these methods remains a critical challenge. Here, we propose to evaluate the generalizability of gene characterizations through the shape of performance curves.
RESULTS: We identify Functional Equivalence Classes (FECs), subsets of annotated and unannotated genes that jointly drive performance, by assessing the presence of straight lines in ROC curves built from gene-centric prediction tasks, such as function or interaction predictions. FECs are widespread across data types and methods; they can be used to evaluate the extent and context-specificity of functional annotations in a data-driven manner. For example, FECs suggest that B cell markers can be decomposed into shared primary markers (10-50 genes), and tissue-specific secondary markers (100-500 genes). In addition, FECs suggest the existence of functional modules that span a wide range of the genome, with marker sets spanning at most 5% of the genome and data-driven extensions of Gene Ontology sets spanning up to 40% of the genome. Simple to assess visually and statistically, the identification of FECs in performance curves paves the way for novel functional characterization and increased robustness in the definition of functional gene sets.
AVAILABILITY AND IMPLEMENTATION: Code for analyses and figures is available at https://github.com/yexilein/pyroc.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
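To make the idea of straight segments in an ROC curve concrete, the following minimal sketch builds an ROC curve from scores and binary labels. It is not the authors' pyroc code; the function names are hypothetical, and the only source of straight segments it demonstrates is a block of genes with tied scores, which the predictor cannot order.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), one per distinct score threshold.

    `labels` holds 1 for annotated (positive) genes and 0 otherwise.
    Genes with equal scores are added together, so a tied block that mixes
    positives and negatives becomes a single sloped segment.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def diagonal_segments(points):
    """Segments that rise in both FPR and TPR (neither horizontal nor vertical)."""
    return [(a, b) for a, b in zip(points, points[1:])
            if b[0] > a[0] and b[1] > a[1]]


# Two confidently ranked positives, then four genes the predictor cannot order.
scores = [0.9, 0.8, 0.5, 0.5, 0.5, 0.5, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(auroc(curve), diagonal_segments(curve))
```

In this toy case the four tied genes form one sloped segment of the curve, which is the kind of feature the abstract associates with a group of genes that jointly drive performance.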
Affiliation(s)
- Stephan Fischer
  - Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, Cold Spring Harbor, NY 11724, USA
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris F-75015, France
- Jesse Gillis
  - Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, Cold Spring Harbor, NY 11724, USA
  - Department of Physiology, University of Toronto, Toronto, ON, Canada
158
Jokinen E, Dumitrescu A, Huuhtanen J, Gligorijević V, Mustjoki S, Bonneau R, Heinonen M, Lähdesmäki H. TCRconv: predicting recognition between T cell receptors and epitopes using contextualized motifs. Bioinformatics 2022; 39:6881078. [PMID: 36477794 PMCID: PMC9825763 DOI: 10.1093/bioinformatics/btac788] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/21/2022] [Revised: 11/01/2022] [Accepted: 12/06/2022] [Indexed: 12/12/2022]
Abstract
MOTIVATION: T cells use T cell receptors (TCRs) to recognize small parts of antigens, called epitopes, presented by major histocompatibility complexes. Once an epitope is recognized, an immune response is initiated and T cell activation and proliferation by clonal expansion begin. Clonal populations of T cells with identical TCRs can remain in the body for years, thus forming immunological memory and potentially mappable immunological signatures, which could have implications in clinical applications including infectious diseases, autoimmunity and tumor immunology.
RESULTS: We introduce TCRconv, a deep learning model for predicting recognition between TCRs and epitopes. TCRconv uses a deep protein language model and convolutions to extract contextualized motifs and provides state-of-the-art TCR-epitope prediction accuracy. Using TCR repertoires from COVID-19 patients, we demonstrate that TCRconv can provide insight into T cell dynamics and phenotypes during the disease.
AVAILABILITY AND IMPLEMENTATION: TCRconv is available at https://github.com/emmijokinen/tcrconv.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexandru Dumitrescu
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
  - Helsinki Institute of Life Science, University of Helsinki, Helsinki 00014, Finland
- Jani Huuhtanen
  - Department of Clinical Chemistry and Hematology, Translational Immunology Research Program, University of Helsinki, Helsinki 00290, Finland
  - Hematology Research Unit Helsinki, Helsinki University Hospital Comprehensive Cancer Center, Helsinki 00290, Finland
- Vladimir Gligorijević
  - Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, NY 10010, USA
  - Prescient Design, Genentech, New York, NY, USA
- Satu Mustjoki
  - Department of Clinical Chemistry and Hematology, Translational Immunology Research Program, University of Helsinki, Helsinki 00290, Finland
  - Hematology Research Unit Helsinki, Helsinki University Hospital Comprehensive Cancer Center, Helsinki 00290, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, Helsinki, Finland
- Richard Bonneau
  - Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, NY 10010, USA
  - Prescient Design, Genentech, New York, NY, USA
  - Center for Data Science, New York University, New York, NY 10011, USA
  - Department of Computer Science, New York University, Courant Institute of Mathematical Sciences, New York, NY 10012, USA
- Markus Heinonen
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
159
Zacharias HU, Kaleta C, Cossais F, Schaeffer E, Berndt H, Best L, Dost T, Glüsing S, Groussin M, Poyet M, Heinzel S, Bang C, Siebert L, Demetrowitsch T, Leypoldt F, Adelung R, Bartsch T, Bosy-Westphal A, Schwarz K, Berg D. Microbiome and Metabolome Insights into the Role of the Gastrointestinal-Brain Axis in Parkinson's and Alzheimer's Disease: Unveiling Potential Therapeutic Targets. Metabolites 2022; 12:metabo12121222. [PMID: 36557259 PMCID: PMC9786685 DOI: 10.3390/metabo12121222] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/31/2022] [Revised: 11/25/2022] [Accepted: 11/28/2022] [Indexed: 12/12/2022]
Abstract
Neurodegenerative diseases such as Parkinson's (PD) and Alzheimer's disease (AD), the prevalence of which is rapidly rising due to an aging world population and westernization of lifestyles, are expected to put a strong socioeconomic burden on health systems worldwide. Clinical trials of therapies against PD and AD have only shown limited success so far. Therefore, research has extended its scope to a systems medicine point of view, with a particular focus on the gastrointestinal-brain axis as a potential main actor in disease development and progression. Microbiome and metabolome studies have already revealed important insights into disease mechanisms. Both the microbiome and metabolome can be easily manipulated by dietary and lifestyle interventions, and might thus offer novel, readily available therapeutic options to prevent the onset as well as the progression of PD and AD. This review summarizes our current knowledge on the interplay between microbiota, metabolites, and neurodegeneration along the gastrointestinal-brain axis. We further illustrate state-of-the-art methods of microbiome and metabolome research as well as metabolic modeling that facilitate the identification of disease pathomechanisms. We conclude with therapeutic options to modulate microbiome composition to prevent or delay neurodegeneration and illustrate potential future research directions to fight PD and AD.
Affiliation(s)
- Helena U. Zacharias
  - Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, 30625 Hannover, Germany
  - Department of Internal Medicine I, University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
  - Institute of Clinical Molecular Biology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
  - Correspondence: (H.U.Z.); (C.K.)
- Christoph Kaleta
  - Research Group Medical Systems Biology, Institute for Experimental Medicine, Kiel University, 24105 Kiel, Germany
  - Kiel Nano, Surface and Interface Science (KiNSIS), Kiel University, 24118 Kiel, Germany
  - Correspondence: (H.U.Z.); (C.K.)
- Eva Schaeffer
  - Department of Neurology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
- Henry Berndt
  - Research Group Comparative Immunobiology, Zoological Institute, Kiel University, 24118 Kiel, Germany
- Lena Best
  - Research Group Medical Systems Biology, Institute for Experimental Medicine, Kiel University, 24105 Kiel, Germany
- Thomas Dost
  - Research Group Medical Systems Biology, Institute for Experimental Medicine, Kiel University, 24105 Kiel, Germany
- Svea Glüsing
  - Institute of Human Nutrition and Food Science, Food Technology, Kiel University, 24118 Kiel, Germany
- Mathieu Groussin
  - Institute of Clinical Molecular Biology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
- Mathilde Poyet
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Sebastian Heinzel
  - Department of Neurology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
  - Institute of Medical Informatics and Statistics, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
- Corinna Bang
  - Institute of Clinical Molecular Biology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
- Leonard Siebert
  - Kiel Nano, Surface and Interface Science (KiNSIS), Kiel University, 24118 Kiel, Germany
  - Functional Nanomaterials, Department of Materials Science, Kiel University, 24143 Kiel, Germany
- Tobias Demetrowitsch
  - Institute of Human Nutrition and Food Science, Food Technology, Kiel University, 24118 Kiel, Germany
  - Kiel Network of Analytical Spectroscopy and Mass Spectrometry, Kiel University, 24118 Kiel, Germany
- Frank Leypoldt
  - Department of Neurology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
  - Neuroimmunology, Institute of Clinical Chemistry, University Medical Center Schleswig-Holstein, 24105 Kiel, Germany
- Rainer Adelung
  - Kiel Nano, Surface and Interface Science (KiNSIS), Kiel University, 24118 Kiel, Germany
  - Functional Nanomaterials, Department of Materials Science, Kiel University, 24143 Kiel, Germany
- Thorsten Bartsch
  - Kiel Nano, Surface and Interface Science (KiNSIS), Kiel University, 24118 Kiel, Germany
  - Department of Neurology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
- Anja Bosy-Westphal
  - Institute of Human Nutrition and Food Science, Kiel University, 24107 Kiel, Germany
- Karin Schwarz
  - Kiel Nano, Surface and Interface Science (KiNSIS), Kiel University, 24118 Kiel, Germany
  - Institute of Human Nutrition and Food Science, Food Technology, Kiel University, 24118 Kiel, Germany
  - Kiel Network of Analytical Spectroscopy and Mass Spectrometry, Kiel University, 24118 Kiel, Germany
- Daniela Berg
  - Kiel Nano, Surface and Interface Science (KiNSIS), Kiel University, 24118 Kiel, Germany
  - Department of Neurology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
160
Petrovsky DV, Rudnev VR, Nikolsky KS, Kulikova LI, Malsagova KM, Kopylov AT, Kaysheva AL. PSSNet-An Accurate Super-Secondary Structure for Protein Segmentation. Int J Mol Sci 2022; 23:ijms232314813. [PMID: 36499138 PMCID: PMC9740782 DOI: 10.3390/ijms232314813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 11/18/2022] [Accepted: 11/24/2022] [Indexed: 12/03/2022]
Abstract
A super-secondary structure (SSS) is a spatially unique ensemble of secondary structural elements that determine the three-dimensional shape of a protein and its function, rendering SSSs attractive as folding cores. Understanding known types of SSSs is important for developing a deeper understanding of the mechanisms of protein folding. Here, we propose a universal PSSNet machine-learning method for SSS recognition and segmentation. For various types of SSS segmentation, this method uses key characteristics of SSS geometry, including the lengths of secondary structural elements and the distances between them, torsion angles, spatial positions of Cα atoms, and primary sequences. Using four types of SSSs (βαβ-unit, α-hairpin, β-hairpin, αα-corner), we showed that extensive SSS sets could be reliably selected from the Protein Data Bank and AlphaFold 2.0 database of protein structures.
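Two of the geometric quantities named in this abstract, inter-atomic distances and torsion angles, can be computed from atomic coordinates as in the sketch below. This is a generic textbook calculation, not the PSSNet feature pipeline; the function names are hypothetical, and the sign of the returned angle depends on the convention chosen here.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Signed torsion (dihedral) angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    b2_len = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(x / b2_len for x in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))


def atom_distance(a, b):
    """Euclidean distance between two atoms, e.g. two C-alpha positions."""
    return math.dist(a, b)


# Four points in one plane, with the outer two on the same side (cis): angle 0.
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(atom_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

Applied to consecutive backbone atoms, the same function yields the familiar backbone torsions, provided the atoms are passed in chain order.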
161
Singh D, Roy J. A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs. Nucleic Acids Res 2022; 50:12094-12111. [PMID: 36420898 PMCID: PMC9757047 DOI: 10.1093/nar/gkac1092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2022] [Revised: 10/22/2022] [Accepted: 10/28/2022] [Indexed: 11/27/2022]
Abstract
Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the performance of 24 tools producing >55 models on the datasets covering a wide range of species. We have collected 135 small and large transcriptomic datasets from existing studies for comparison and identified the potential bottlenecks hampering the performance of current tools. The key insights of this study include lack of standardized training sets, reliance on homogeneous training data, gradual changes in annotated data, lack of augmentation with homology searches, the presence of false positives and negatives in datasets and the lower performance of end-to-end deep learning models. We also derived a new dataset, RNAChallenge, from the benchmark considering hard instances that may include potential false alarms. The best and least well performing models under- and overfit the dataset, respectively, thereby serving a dual purpose. For computational approaches, it will be valuable to develop accurate and unbiased models. The identification of false alarms will be of interest for genome annotators, and experimental study of hard RNAs will help to untangle the complexity of the RNA world.
Affiliation(s)
- Dalwinder Singh
  - To whom correspondence should be addressed. Tel: +91 172 5221206;
- Joy Roy
  - Correspondence may also be addressed to Joy Roy.
162
Marchetti L, Nifosì R, Martelli PL, Da Pozzo E, Cappello V, Banterle F, Trincavelli ML, Martini C, D’Elia M. Quantum computing algorithms: getting closer to critical problems in computational biology. Brief Bioinform 2022; 23:6758194. [PMID: 36220772 PMCID: PMC9677474 DOI: 10.1093/bib/bbac437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 08/15/2022] [Accepted: 09/08/2022] [Indexed: 12/14/2022]
Abstract
The recent biotechnological progress has allowed life scientists and physicians to access an unprecedented, massive amount of data at all levels (molecular, supramolecular, cellular and so on) of biological complexity. So far, mostly classical computational efforts have been dedicated to the simulation, prediction or de novo design of biomolecules, in order to improve the understanding of their function or to develop novel therapeutics. At a higher level of complexity, the progress of omics disciplines (genomics, transcriptomics, proteomics and metabolomics) has prompted researchers to develop informatics means to describe and annotate new biomolecules identified with a resolution down to the single cell, but also with a high-throughput speed. Machine learning approaches have been implemented to both the modelling studies and the handling of biomedical data. Quantum computing (QC) approaches hold the promise to resolve, speed up or refine the analysis of a wide range of these computational problems. Here, we review and comment on recently developed QC algorithms for biocomputing, with a particular focus on multi-scale modelling and genomic analyses. Indeed, differently from other computational approaches such as protein structure prediction, these problems have been shown to be adequately mapped onto quantum architectures, the main limit for their immediate use being the number of qubits and decoherence effects in the available quantum machines. Possible advantages over the classical counterparts are highlighted, along with a description of some hybrid classical/quantum approaches, which could be the closest to be realistically applied in biocomputation.
Collapse
Affiliation(s)
| | | | - Pier Luigi Martelli
- Corresponding authors: Pier Luigi Martelli. Tel.: +39 0512094005; Fax: +39 0512094005; E-mail: ; Claudia Martini. Tel.: +39 0502219522; Fax: +39 050 2210680; E-mail:
| | - Eleonora Da Pozzo
- University of Pisa, Department of Pharmacy, via Bonanno 6, 56126 Pisa, Italy
| | - Valentina Cappello
- Italian Institute of Technology, Center for Materials Interfaces, Viale Rinaldo Piaggio 34, 56025 Pontedera (PI), Italy
| | | | | | - Claudia Martini
- Corresponding authors: Pier Luigi Martelli. Tel.: +39 0512094005; Fax: +39 0512094005; E-mail: ; Claudia Martini. Tel.: +39 0502219522; Fax: +39 050 2210680; E-mail:
| | - Massimo D’Elia
- University of Pisa, Department of Physics, Largo Bruno Pontecorvo 3, 56127 Pisa, Italy
- INFN, Sezione di Pisa, Largo Bruno Pontecorvo 3, I-56127 Pisa, Italy
|
163
|
Wu L, Yin C, Zhu J, Wu Z, He L, Xia Y, Xie S, Qin T, Liu TY. SPRoBERTa: protein embedding learning with local fragment modeling. Brief Bioinform 2022; 23:6711410. [PMID: 36136367 DOI: 10.1093/bib/bbac401] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Revised: 07/18/2022] [Accepted: 08/18/2022] [Indexed: 12/14/2022] Open
Abstract
A good understanding of protein function and structure in computational biology contributes to the understanding of human beings. Because only a limited number of proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented by individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics existing in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for a deep pre-training model is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.
Affiliation(s)
- Lijun Wu
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Chengcan Yin
- National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
| | - Jinhua Zhu
- CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China, No.96, JinZhai Road Baohe District, 230026, Hefei, Anhui Province, China
| | - Zhen Wu
- National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
| | - Liang He
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Yingce Xia
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Shufang Xie
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tao Qin
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tie-Yan Liu
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
|
164
|
Yan K, Lv H, Guo Y, Peng W, Liu B. sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics 2022; 39:6808615. [PMID: 36342186 PMCID: PMC9805557 DOI: 10.1093/bioinformatics/btac715] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 05/05/2022] [Revised: 10/24/2022] [Accepted: 11/04/2022] [Indexed: 11/09/2022] Open
Abstract
MOTIVATION Antimicrobial peptides (AMPs) are essential components of therapeutic peptides for innate immunity. Researchers have developed several computational methods to predict the potential AMPs from many candidate peptides. With the development of artificial intelligent techniques, the protein structures can be accurately predicted, which are useful for protein sequence and function analysis. Unfortunately, the predicted peptide structure information has not been applied to the field of AMP prediction so as to improve the predictive performance. RESULTS In this study, we proposed a computational predictor called sAMPpred-GAT for AMP identification. To the best of our knowledge, sAMPpred-GAT is the first approach based on the predicted peptide structures for AMP prediction. The sAMPpred-GAT predictor constructs the graphs based on the predicted peptide structures, sequence information and evolutionary information. The Graph Attention Network (GAT) is then performed on the graphs to learn the discriminative features. Finally, the full connection networks are utilized as the output module to predict whether the peptides are AMP or not. Experimental results show that sAMPpred-GAT outperforms the other state-of-the-art methods in terms of AUC, and achieves better or highly comparable performance in terms of the other metrics on the eight independent test datasets, demonstrating that the predicted peptide structure information is important for AMP prediction. AVAILABILITY AND IMPLEMENTATION A user-friendly webserver of sAMPpred-GAT can be accessed at http://bliulab.net/sAMPpred-GAT and the source code is available at https://github.com/HongWuL/sAMPpred-GAT/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Hongwu Lv
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Yichen Guo
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Wei Peng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Bin Liu
- To whom correspondence should be addressed.
|
165
|
Li L, Peng S, Wang Z, Zhang T, Li H, Xiao Y, Li J, Liu Y, Yin H. Genome mining reveals abiotic stress resistance genes in plant genomes acquired from microbes via HGT. Front Plant Sci 2022; 13:1025122. [PMID: 36407614 PMCID: PMC9667741 DOI: 10.3389/fpls.2022.1025122] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/22/2022] [Accepted: 09/07/2022] [Indexed: 06/16/2023]
Abstract
Colonization by beneficial microbes can enhance plant tolerance to abiotic stresses. However, there are still many unknown fields regarding the beneficial plant-microbe interactions. In this study, we have assessed the amount or impact of horizontal gene transfer (HGT)-derived genes in plants that have potentials to confer abiotic stress resistance. We have identified a total of 235 gene entries in fourteen high-quality plant genomes belonging to phyla Chlorophyta and Streptophyta that confer resistance against a wide range of abiotic pressures acquired from microbes through independent HGTs. These genes encode proteins contributed to toxic metal resistance (e.g., ChrA, CopA, CorA), osmotic and drought stress resistance (e.g., Na+/proline symporter, potassium/proton antiporter), acid resistance (e.g., PcxA, ArcA, YhdG), heat and cold stress resistance (e.g., DnaJ, Hsp20, CspA), oxidative stress resistance (e.g., GST, PoxA, glutaredoxin), DNA damage resistance (e.g., Rad25, Rad51, UvrD), and organic pollutant resistance (e.g., CytP450, laccase, CbbY). Phylogenetic analyses have supported the HGT inferences as the plant lineages are all clustering closely with distant microbial lineages. Deep-learning-based protein structure prediction and analyses, in combination with expression assessment based on codon adaption index (CAI) further corroborated the functionality and expressivity of the HGT genes in plant genomes. A case-study applying fold comparison and molecular dynamics (MD) of the HGT-driven CytP450 gave a more detailed illustration on the resemblance and evolutionary linkage between the plant recipient and microbial donor sequences. Together, the microbe-originated HGT genes identified in plant genomes and their participation in abiotic pressures resistance indicate a more profound impact of HGT on the adaptive evolution of plants.
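The codon adaptation index (CAI) mentioned in this abstract is conventionally defined as the geometric mean of the relative adaptiveness of the codons in a gene, where a codon's relative adaptiveness is its frequency in a reference gene set divided by the frequency of the most used synonymous codon. The following is a minimal sketch of that standard definition, not the authors' pipeline; the function names, the single two-codon synonymous group, and the reference counts are toy assumptions chosen only for illustration.

```python
import math

def relative_adaptiveness(ref_counts, synonymous_groups):
    """Map each codon to count / max count within its synonymous group."""
    weights = {}
    for codons in synonymous_groups:
        top = max(ref_counts.get(c, 0) for c in codons)
        if top == 0:
            continue  # no reference data for this amino acid
        for c in codons:
            weights[c] = ref_counts.get(c, 0) / top
    return weights

def cai(sequence, weights):
    """Geometric mean of codon weights over the coding sequence."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    # Codons without a positive weight are skipped (a simplifying assumption).
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Toy reference usage for alanine codons (hypothetical counts).
toy_counts = {"GCT": 10, "GCC": 5}
toy_groups = [("GCT", "GCC")]
toy_weights = relative_adaptiveness(toy_counts, toy_groups)
toy_cai = cai("GCTGCC", toy_weights)
```

With these toy numbers the weights are 1.0 and 0.5, so the index for the two-codon sequence is the square root of 0.5. Real analyses use a full codon table and a reference set of highly expressed genes.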
Affiliation(s)
- Liangzhi Li
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
| | | | - Zhenhua Wang
- Zhangjiajie Tobacco Company of Hunan Province, Zhangjiajie, China
| | - Teng Zhang
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
- Hunan Urban and Rural Environmental Construction Co., Ltd, Changsha, China
| | - Hongguang Li
- Hunan Tobacco Science Institute, Changsha, China
| | - Yansong Xiao
- Chenzhou Tobacco Company of Hunan Province, Chenzhou, China
| | - Jingjun Li
- Chenzhou Tobacco Company of Hunan Province, Chenzhou, China
| | - Yongjun Liu
- Hunan Tobacco Science Institute, Changsha, China
| | - Huaqun Yin
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
|
166
|
Wu F, Jin S, Jiang Y, Jin X, Tang B, Niu Z, Liu X, Zhang Q, Zeng X, Li SZ. Pre-Training of Equivariant Graph Matching Networks with Conformation Flexibility for Drug Binding. Adv Sci (Weinh) 2022; 9:e2203796. [PMID: 36202759 PMCID: PMC9685463 DOI: 10.1002/advs.202203796] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/01/2022] [Revised: 09/07/2022] [Indexed: 05/16/2023]
Abstract
Recent biological findings indicate that the static "lock-and-key" theory is not generally applicable and that changes in atomic sites and binding pose can provide important information for understanding drug binding. However, computational cost limits the growth of protein trajectory-related studies, thus hindering supervised learning. A spatial-temporal pre-training method based on modified equivariant graph matching networks, dubbed ProtMD, is presented. It has two specially designed self-supervised learning tasks, an atom-level prompt-based denoising generative task and a conformation-level snapshot ordering task, which capture the flexibility information inside molecular dynamics (MD) trajectories with very fine temporal resolutions. ProtMD can grant the encoder network the capacity to capture the time-dependent geometric mobility of conformations along MD trajectories. Two downstream tasks are chosen to verify the effectiveness of ProtMD through linear detection and task-specific fine-tuning. A large improvement over current state-of-the-art methods is observed, with a decrease of 4.3% in root mean square error for the binding affinity problem and an average increase of 13.8% in the area under the receiver operating characteristic curve and the area under the precision-recall curve for the ligand efficacy problem. The results demonstrate a strong correlation between the magnitude of a conformation's motion in 3D space and the strength with which the ligand binds to its receptor.
Affiliation(s)
- Fang Wu
- School of Engineering, Westlake University, Hangzhou 310024, China
- MindRank AI Ltd., Hangzhou 310000, China
| | - Shuting Jin
- MindRank AI Ltd., Hangzhou 310000, China
- School of Informatics, Xiamen University, Xiamen 361005, China
| | | | | | | | | | - Xiangrong Liu
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Qiang Zhang
- ZJU‐Hangzhou Global Scientific and Technological Innovation Center, Hangzhou 311200, China
- College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Hunan 410082, China
| | - Stan Z. Li
- School of Engineering, Westlake University, Hangzhou 310024, China
|
167
|
Zhao S, Martin-Vicente A, Colabardini AC, Pereira Silva L, Rinker DC, Fortwendel JR, Goldman GH, Gibbons JG. Genomic and Molecular Identification of Genes Contributing to the Caspofungin Paradoxical Effect in Aspergillus fumigatus. Microbiol Spectr 2022; 10:e0051922. [PMID: 36094204 PMCID: PMC9603777 DOI: 10.1128/spectrum.00519-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Accepted: 08/17/2022] [Indexed: 11/25/2022] Open
Abstract
Aspergillus fumigatus is a deadly opportunistic fungal pathogen responsible for ~100,000 annual deaths. Azoles are the first line antifungal agent used against A. fumigatus, but azole resistance has rapidly evolved making treatment challenging. Caspofungin is an important second-line therapy against invasive pulmonary aspergillosis, a severe A. fumigatus infection. Caspofungin functions by inhibiting β-1,3-glucan synthesis, a primary and essential component of the fungal cell wall. A phenomenon termed the caspofungin paradoxical effect (CPE) has been observed in several fungal species where at higher concentrations of caspofungin, chitin replaces β-1,3-glucan, morphology returns to normal, and growth rate increases. CPE appears to occur in vivo, and it is therefore clinically important to better understand the genetic contributors to CPE. We applied genomewide association (GWA) analysis and molecular genetics to identify and validate candidate genes involved in CPE. We quantified CPE across 67 clinical isolates and conducted three independent GWA analyses to identify genetic variants associated with CPE. We identified 48 single nucleotide polymorphisms (SNPs) associated with CPE. We used a CRISPR/Cas9 approach to generate gene deletion mutants for seven genes harboring candidate SNPs. Two null mutants, ΔAfu3g13230 and ΔAfu4g07080 (dscP), resulted in reduced basal growth rate and a loss of CPE. We further characterized the dscP phosphatase-null mutant and observed a significant reduction in conidia production and extremely high sensitivity to caspofungin at both low and high concentrations. Collectively, our work reveals the contribution of Afu3g13230 and dscP in CPE and sheds new light on the complex genetic interactions governing this phenotype.
IMPORTANCE This is one of the first studies to apply genomewide association (GWA) analysis to identify genes involved in an Aspergillus fumigatus phenotype. A. fumigatus is an opportunistic fungal pathogen that causes hundreds of thousands of infections and ~100,000 deaths each year, and antifungal resistance has rapidly evolved in this species. A phenomenon called the caspofungin paradoxical effect (CPE) occurs in some isolates, where high concentrations of the drug lead to increased growth rate. There is clinical relevance in understanding the genetic basis of this phenotype, since caspofungin concentrations could lead to unintended adverse clinical outcomes in certain cases. Using GWA analysis, we identified several interesting candidate polymorphisms and genes and then generated gene deletion mutants to determine whether these genes were important for CPE. Two of these mutant strains (ΔAfu3g13230 and ΔAfu4g07080/ΔdscP) displayed a loss of the CPE. This study sheds light on the genes involved in clinically important phenotype CPE.
Affiliation(s)
- Shu Zhao
- Molecular and Cellular Biology Graduate Program, University of Massachusetts, Amherst, Massachusetts, USA
- Department of Food Science, University of Massachusetts, Amherst, Massachusetts, USA
| | - Adela Martin-Vicente
- Department of Clinical Pharmacy and Translational Science, University of Tennessee Health Science Center, Memphis, Tennessee, USA
| | - Ana Cristina Colabardini
- Faculdade de Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, São Paulo, Brazil
| | - Lilian Pereira Silva
- Faculdade de Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, São Paulo, Brazil
| | - David C. Rinker
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee, USA
| | - Jarrod R. Fortwendel
- Department of Clinical Pharmacy and Translational Science, University of Tennessee Health Science Center, Memphis, Tennessee, USA
| | - Gustavo Henrique Goldman
- Faculdade de Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, São Paulo, Brazil
| | - John G. Gibbons
- Molecular and Cellular Biology Graduate Program, University of Massachusetts, Amherst, Massachusetts, USA
- Department of Food Science, University of Massachusetts, Amherst, Massachusetts, USA
- Organismic and Evolutionary Biology Graduate Program, University of Massachusetts, Amherst, Massachusetts, USA
|
168
|
Abstract
Many enzymes possess high catalytic efficiency and selectivity that far surpass classical organic or organometallic catalysts. However, the initial starting enzyme for a given transformation does not always possess the right properties needed for broad utilization. Searching in genome/protein sequence libraries for homologs, aided with powerful bioinformatic tools developed in recent years, provides an avenue to identify superior biocatalysts. Herein, we highlight several case studies to illustrate the power of this concept. A brief discussion on its complementarity with contemporary approaches in protein engineering (such as directed evolution) and possible future developments is also provided.
Affiliation(s)
- Yanlong Jiang
- Department of Chemistry, BioScience Research Collaborative, Rice University, Houston, TX 77005, USA
| | - Hans Renata
- Department of Chemistry, BioScience Research Collaborative, Rice University, Houston, TX 77005, USA
- Lead Contact
|
169
|
Greenrod STE, Stoycheva M, Elphinstone J, Friman VP. Global diversity and distribution of prophages are lineage-specific within the Ralstonia solanacearum species complex. BMC Genomics 2022; 23:689. [PMID: 36199029 PMCID: PMC9535894 DOI: 10.1186/s12864-022-08909-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/18/2022] [Accepted: 09/23/2022] [Indexed: 11/17/2022] Open
Abstract
Background Ralstonia solanacearum species complex (RSSC) strains are destructive plant pathogenic bacteria and the causative agents of bacterial wilt disease, infecting over 200 plant species worldwide. In addition to chromosomal genes, their virulence is mediated by mobile genetic elements including integrated DNA of bacteriophages, i.e., prophages, which may carry fitness-associated auxiliary genes or modulate host gene expression. Although experimental studies have characterised several prophages that shape RSSC virulence, the global diversity, distribution, and wider functional gene content of RSSC prophages are unknown. In this study, prophages were identified in a diverse collection of 192 RSSC draft genome assemblies originating from six continents. Results Prophages were identified bioinformatically and their diversity investigated using genetic distance measures, gene content, GC, and total length. Prophage distributions were characterised using metadata on RSSC strain geographic origin and lineage classification (phylotypes), and their functional gene content was assessed by identifying putative prophage-encoded auxiliary genes. In total, 313 intact prophages were identified, forming ten genetically distinct clusters. These included six prophage clusters with similarity to the Inoviridae, Myoviridae, and Siphoviridae phage families, and four uncharacterised clusters, possibly representing novel, previously undescribed phages. The prophages had broad geographical distributions, being present across multiple continents. However, they were generally host phylogenetic lineage-specific, and overall, prophage diversity was proportional to the genetic diversity of their hosts. The prophages contained many auxiliary genes involved in metabolism and virulence of both phage and bacteria. 
Conclusions Our results show that while RSSC prophages are highly diverse globally, they make lineage-specific contributions to the RSSC accessory genome, which could have resulted from shared coevolutionary history. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08909-7.
Affiliation(s)
| | | | - John Elphinstone
- Fera Science Ltd, National Agri-Food Innovation Campus, Sand Hutton, York, UK
|
170
|
Structural host immune-microbiota interactions. Curr Opin Struct Biol 2022; 76:102445. [PMID: 36063760 DOI: 10.1016/j.sbi.2022.102445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
171
|
Zhu YH, Zhang C, Liu Y, Omenn GS, Freddolino PL, Yu DJ, Zhang Y. TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction. Genomics Proteomics Bioinformatics 2022; 20:1013-1027. [PMID: 35568117 PMCID: PMC10025770 DOI: 10.1016/j.gpb.2022.03.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/20/2021] [Revised: 03/02/2022] [Accepted: 04/16/2022] [Indexed: 01/13/2023]
Abstract
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Departments of Internal Medicine and Human Genetics, and School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA.
|
172
|
Sengupta K, Saha S, Halder AK, Chatterjee P, Nasipuri M, Basu S, Plewczynski D. PFP-GO: Integrating protein sequence, domain and protein-protein interaction information for protein function prediction using ranked GO terms. Front Genet 2022; 13:969915. [PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022] Open
Abstract
Protein function prediction is gradually emerging as an essential field in biological and computational studies. Although computational approaches have gained a significant foothold, it has been observed that computational information gathered from multiple sources has more influence than information derived from a single source. Considering this, a methodology, PFP-GO, is proposed in which heterogeneous sources (protein sequence, protein domain, and protein-protein interaction network) are processed separately to rank each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. While protein sequence contributes sequence-based information, protein domain and protein-protein interaction networks embed structural/functional and topological information, respectively, during GO ranking. Performance analysis of PFP-GO is based on Precision, Recall, and F-Score, and PFP-GO was found to perform reasonably better than other existing state-of-the-art methods. PFP-GO has achieved an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we check some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
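As a quick consistency check on the three figures reported in this abstract, the F-score is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard F1 definition; the abstract does not state which F-score variant was used, and the function name `f1_score` is illustrative rather than taken from the PFP-GO code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1; assumed definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)

# Precision and recall as quoted in the PFP-GO abstract.
reported_precision, reported_recall = 0.67, 0.58
check = round(f1_score(reported_precision, reported_recall), 2)
print(check)
```

Under this assumption the recomputed value rounds to 0.62, which agrees with the F-Score quoted in the abstract.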
Affiliation(s)
- Kaustav Sengupta
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
| | - Sovan Saha
- Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
| | - Anup Kumar Halder
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
| | - Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
| | - Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
| | - Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- *Correspondence: Subhadip Basu, Dariusz Plewczynski,
| | - Dariusz Plewczynski
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- *Correspondence: Subhadip Basu, Dariusz Plewczynski,
|
173
|
Two Conserved Amino Acids Characterized in the Island Domain Are Essential for the Biological Functions of Brassinolide Receptors. Int J Mol Sci 2022; 23:ijms231911454. [PMID: 36232750 PMCID: PMC9570414 DOI: 10.3390/ijms231911454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/16/2022] [Revised: 09/19/2022] [Accepted: 09/26/2022] [Indexed: 11/16/2022] Open
Abstract
Brassinosteroids (BRs) play important roles in plant growth and development, and BR perception is the pivotal process required to trigger BR signaling. In angiosperms, BR insensitive 1 (BRI1) is the essential BR receptor, because its mutants exhibit an extremely dwarf phenotype in Arabidopsis. Two other BR receptors, BRI1-like 1 (BRL1) and BRI1-like 3 (BRL3), are shown to be not indispensable. All BR receptors require an island domain (ID) responsible for BR perception. However, the biological functional significance of residues in the ID remains unknown. Based on the crystal structure and sequence alignments analysis of BR receptors, we identified two residues 597 and 599 of AtBRI1 that were highly conserved within a BR receptor but diversified among different BR receptors. Both of these residues are tyrosine in BRI1, while BRL1/BRL3 fixes two phenylalanines. The experimental findings revealed that, except BRI1Y597F and BRI1Y599F, substitutions of residues 597 and 599 with the remaining 18 amino acids differently impaired BR signaling and, surprisingly, BRI1Y599F showed a weaker phenotype than BRI1Y599 did, implying that these residues were the key sites to differentiate BR receptors from a non-BR receptor, and the essential BR receptor BRI1 from BRL1/3, which possibly results from positive selection via gain of function during evolution.
|
174
|
Hu JX, Yang Y, Xu YY, Shen HB. GraphLoc: a graph neural network model for predicting protein subcellular localization from immunohistochemistry images. Bioinformatics 2022; 38:4941-4948. [DOI: 10.1093/bioinformatics/btac634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 09/07/2022] [Accepted: 09/15/2022] [Indexed: 11/14/2022] Open
Abstract
Motivation
Recognition of protein subcellular distribution patterns and identification of location biomarker proteins in cancer tissues are important for understanding protein functions and related diseases. Immunohistochemical (IHC) images enable visualizing the distribution of proteins at the tissue level, providing an important resource for the protein localization studies. In the past decades, several image-based protein subcellular location prediction methods have been developed, but the prediction accuracies still have much space to improve due to the complexity of protein patterns resulting from multi-label proteins and variation of location patterns across cell types or states.
Results
Here, we propose a multi-label multi-instance model based on deep graph convolutional neural networks, GraphLoc, to recognize protein subcellular location patterns. GraphLoc builds a graph of multiple IHC images for one protein, learns protein-level representations by graph convolutions, and predicts multi-label information by a dynamic threshold method. Our results show that GraphLoc is a promising model for image-based protein subcellular location prediction with model interpretability. Furthermore, we apply GraphLoc to the identification of candidate location biomarkers and potential members of protein networks. A large portion of the predicted results have supporting evidence in the existing literature, and the new candidates also provide guidance for further experimental screening.
Availability
The dataset and code are available at: www.csbio.sjtu.edu.cn/bioinf/GraphLoc.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jin-Xian Hu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Yang Yang
- Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai 200240, China
- Ying-Ying Xu
- School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China
- Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
|
175
|
Szydlowski L, Ehlich J, Szczerbiak P, Shibata N, Goryanin I. Novel species identification and deep functional annotation of electrogenic biofilms, selectively enriched in a microbial fuel cell array. Front Microbiol 2022; 13:951044. [PMID: 36188001 PMCID: PMC9517587 DOI: 10.3389/fmicb.2022.951044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 08/17/2022] [Indexed: 11/13/2022] Open
Abstract
In this study, electrogenic microbial communities originating from a single source were multiplied using our custom-made, 96-well-plate-based microbial fuel cell (MFC) array. Developed communities operated under different pH conditions and produced currents up to 19.4 A/m³ (0.6 A/m²) within 2 days of inoculation. Microscopic observations [combined scanning electron microscopy (SEM) and energy dispersive spectroscopy (EDS)] revealed that some species present in the anodic biofilm adsorbed copper on their surface because of the bioleaching of the printed circuit board (PCB), yielding Cu2+ ions up to 600 mg/L. Beta-diversity indicates taxonomic divergence among all communities, but functional clustering is based on reactor pH. Annotated metagenomes showed a high presence of multicopper oxidases and Cu-resistance genes, as well as genes encoding aliphatic and aromatic hydrocarbon-degrading enzymes, corresponding to PCB bioleaching. Metagenome analysis revealed a high abundance of Dietzia spp., previously characterized in MFCs, which did not grow at pH 4. Binning metagenomes allowed us to identify novel species, one belonging to Actinotalea, not yet associated with electrogenicity and enriched only in the pH 7 anode. Furthermore, we identified 854 unique protein-coding genes in Actinotalea that lacked sequence homology with other metagenomes. The function of some genes was predicted with high accuracy through deep functional residue identification (DeepFRI), with several of these genes potentially related to electrogenic capacity. Our results demonstrate the feasibility of using MFC arrays for the enrichment of functional electrogenic microbial consortia and data mining for the comparative analysis of either consortia or their members.
Affiliation(s)
- Lukasz Szydlowski
- Biological Systems Unit, Okinawa Institute of Science and Technology, Onna, Japan
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- *Correspondence: Lukasz Szydlowski,
- Jiri Ehlich
- Faculty of Chemistry, Brno University of Technology, Brno, Czechia
- Pawel Szczerbiak
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- Noriko Shibata
- Biological Systems Unit, Okinawa Institute of Science and Technology, Onna, Japan
- Igor Goryanin
- Biological Systems Unit, Okinawa Institute of Science and Technology, Onna, Japan
- School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
- Tianjin Institute of Industrial Biotechnology, Tianjin, China
|
176
|
A pocket-based 3D molecule generative model fueled by experimental electron density. Sci Rep 2022; 12:15100. [PMID: 36068257 PMCID: PMC9448726 DOI: 10.1038/s41598-022-19363-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/17/2022] [Accepted: 08/29/2022] [Indexed: 11/08/2022] Open
Abstract
We report for the first time the use of experimental electron density (ED) as training data for the generation of drug-like three-dimensional molecules based on the structure of a target protein pocket. Similar to a structural biologist building molecules based on their ED, our model functions with two main components: a generative adversarial network (GAN) to generate the ligand ED in the input pocket and an ED interpretation module for molecule generation. The model was tested on three targets: a kinase (hematopoietic progenitor kinase 1), a protease (SARS-CoV-2 main protease), and a nuclear receptor (vitamin D receptor), and evaluated with a reference dataset composed of over 8000 compounds that have their activities reported in the literature. The evaluation considered the chemical validity, chemical space distribution-based diversity, and similarity with reference active compounds concerning the molecular structure and pocket-binding mode. Our model can generate molecules with similar structures to classical active compounds and novel compounds sharing similar binding modes with active compounds, making it a promising tool for library generation supporting high-throughput virtual screening. The ligand ED generated can also be used to support fragment-based drug design. Our model is available as an online service to academic users via https://edmg.stonewise.cn/#/create.
|
177
|
Han Y, Wennersten SA, Wright JM, Ludwig RW, Lau E, Lam MPY. Proteogenomics reveals sex-biased aging genes and coordinated splicing in cardiac aging. Am J Physiol Heart Circ Physiol 2022; 323:H538-H558. [PMID: 35930447 PMCID: PMC9448281 DOI: 10.1152/ajpheart.00244.2022] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/19/2022] [Revised: 07/20/2022] [Accepted: 07/31/2022] [Indexed: 01/24/2023]
Abstract
The risks of heart diseases are significantly modulated by age and sex, but how these factors influence baseline cardiac gene expression remains incompletely understood. Here, we used RNA sequencing and mass spectrometry to compare gene expression in female and male young adult (4 mo) and early aging (20 mo) mouse hearts, identifying thousands of age- and sex-dependent gene expression signatures. Sexually dimorphic cardiac genes are broadly distributed, functioning in mitochondrial metabolism, translation, and other processes. In parallel, we found over 800 genes with differential aging response between male and female, including genes in cAMP and PKA signaling. Analysis of the sex-adjusted aging cardiac transcriptome revealed a widespread remodeling of exon usage patterns that is largely independent of differential gene expression, concomitant with upstream changes in RNA-binding protein and splice factor transcripts. To evaluate the impact of the splicing events on cardiac proteoform composition, we applied an RNA-guided proteomics computational pipeline to analyze the mass spectrometry data and detected hundreds of putative splice variant proteins that have the potential to rewire the cardiac proteome. Taken together, the results here suggest that cardiac aging is associated with 1) widespread sex-biased aging genes and 2) a rewiring of RNA splicing programs, including sex- and age-dependent changes in exon usages and splice patterns that have the potential to influence cardiac protein structure and function. These changes contribute to the emerging evidence for considerable sexual dimorphism in the cardiac aging process that should be considered in the search for disease mechanisms.
NEW & NOTEWORTHY Han et al. used proteogenomics to compare male and female mouse hearts at 4 and 20 mo. Sex-biased cardiac genes function in mitochondrial metabolism, translation, autophagy, and other processes. Hundreds of cardiac genes show sex-by-age interactions, that is, sex-biased aging genes. Cardiac aging is accompanied by a remodeling of exon usage in functionally coordinated genes, concomitant with differential expression of RNA-binding proteins and splice factors. These features represent an underinvestigated aspect of cardiac aging that may be relevant to the search for disease mechanisms.
Grants
- R21-HL150456 HHS | NIH | National Heart, Lung, and Blood Institute (NHLBI)
- R00-HL144829 HHS | NIH | National Heart, Lung, and Blood Institute (NHLBI)
- R03-OD032666 HHS | NIH | NIH Office of the Director (OD)
- F32-HL149191 HHS | NIH | National Heart, Lung, and Blood Institute (NHLBI)
- R00-HL127302 HHS | NIH | National Heart, Lung, and Blood Institute (NHLBI)
- R01-HL141278 HHS | NIH | National Heart, Lung, and Blood Institute (NHLBI)
- University of Colorado
- University of Colorado School of Medicine, Anschutz Medical Campus
Affiliation(s)
- Yu Han
- Department of Medicine, Anschutz Medical Campus, University of Colorado School of Medicine, Aurora, Colorado
- Sara A Wennersten
- Department of Medicine, Anschutz Medical Campus, University of Colorado School of Medicine, Aurora, Colorado
- Julianna M Wright
- Department of Medicine, Anschutz Medical Campus, University of Colorado School of Medicine, Aurora, Colorado
- R W Ludwig
- Department of Medicine, Anschutz Medical Campus, University of Colorado School of Medicine, Aurora, Colorado
- Maggie P Y Lam
- Department of Medicine, Anschutz Medical Campus, University of Colorado School of Medicine, Aurora, Colorado
- Department of Biochemistry and Molecular Genetics, Anschutz Medical Campus, University of Colorado School of Medicine, Aurora, Colorado
|
178
|
Newaz K, Piland J, Clark PL, Emrich SJ, Li J, Milenković T. Multi-layer sequential network analysis improves protein 3D structural classification. Proteins 2022; 90:1721-1731. [PMID: 35441395 PMCID: PMC9356989 DOI: 10.1002/prot.26349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Revised: 03/04/2022] [Accepted: 03/30/2022] [Indexed: 11/08/2022]
Abstract
Protein structural classification (PSC) is a supervised problem of assigning proteins into pre-defined structural (e.g., CATH or SCOPe) classes based on the proteins' sequence or 3D structural features. We recently proposed PSC approaches that model protein 3D structures as protein structure networks (PSNs) and analyze PSN-based protein features, which performed better than or comparable to state-of-the-art sequence or other 3D structure-based PSC approaches. However, existing PSN-based PSC approaches model the whole 3D structure of a protein as a static (i.e., single-layer) PSN. Because folding of a protein is a dynamic process, where some parts (i.e., sub-structures) of a protein fold before others, modeling the 3D structure of a protein as a PSN that captures the sub-structures might further help improve the existing PSC performance. Here, we propose to model 3D structures of proteins as multi-layer sequential PSNs that approximate 3D sub-structures of proteins, with the hypothesis that this will improve upon the current state-of-the-art PSC approaches that are based on single-layer PSNs (and thus upon the existing state-of-the-art sequence and other 3D structural approaches). Indeed, we confirm this on 72 datasets spanning ~44 000 CATH and SCOPe protein domains.
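The idea of a multi-layer sequential PSN can be illustrated with a minimal sketch. It assumes, purely for illustration, that each layer is the residue contact network of a growing N-terminal prefix of the chain and that two residues are in contact when their representative atoms lie within a fixed distance cutoff. The function names, the 8 Å cutoff, and the equal-sized prefix steps are hypothetical choices, not details taken from the abstract.

```python
from math import dist


def contact_edges(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff`.

    `coords` is a list of (x, y, z) tuples, one per residue (for example
    C-alpha positions). The cutoff value is an illustrative assumption.
    """
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist(coords[i], coords[j]) <= cutoff
    }


def sequential_layers(coords, n_layers, cutoff=8.0):
    """Build one contact network per growing prefix of the chain.

    Layer k (1-based) covers roughly the first k/n_layers of the residues,
    so the last layer is the ordinary single-layer network of the whole
    structure.
    """
    n = len(coords)
    layers = []
    for k in range(1, n_layers + 1):
        end = (n * k) // n_layers  # number of residues included in layer k
        layers.append(contact_edges(coords[:end], cutoff))
    return layers


# Toy chain: four residues on a line, 5 Å apart.
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
toy_layers = sequential_layers(toy, n_layers=2)
```

In this toy chain the first layer contains only the contact between the first two residues, while the second layer contains all three neighbouring contacts. Features extracted per layer could then be passed to a classifier.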
Affiliation(s)
- Khalique Newaz
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA; Center for Data and Computing in Natural Sciences (CDCS), Institute for Computational Systems Biology, Universität Hamburg, Hamburg, 20146, Germany
- Jacob Piland
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
- Patricia L. Clark
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, IN 46556, USA
- Scott J. Emrich
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
- Jun Li
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA
- Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
|
179
|
Yuvaraj I, Chaudhary SK, Jeyakanthan J, Sekar K. Structure of the hypothetical protein TTHA1873 from Thermus thermophilus. Acta Crystallogr F Struct Biol Commun 2022; 78:338-346. [PMID: 36048084 PMCID: PMC9435673 DOI: 10.1107/s2053230x22008457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 08/23/2022] [Indexed: 11/10/2022] Open
Abstract
The crystal structure of an uncharacterized hypothetical protein, TTHA1873 from Thermus thermophilus, has been determined by X-ray crystallography to a resolution of 1.78 Å using the single-wavelength anomalous dispersion method. The protein crystallized as a dimer in two space groups: P4₃2₁2 and P6₁22. Structural analysis of the hypothetical protein revealed that the overall fold of TTHA1873 has a β-sandwich jelly-roll topology with nine β-strands. TTHA1873 is a dimeric metal-binding protein that binds to two Ca2+ ions per chain, with one on the surface and the other stabilizing the dimeric interface of the two chains. A structural homology search indicates that the protein has moderate structural similarity to one domain of cell-surface proteins or agglutinin receptor proteins. Red blood cells showed visible agglutination at high concentrations of the hypothetical protein.
Affiliation(s)
- I. Yuvaraj
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560 012, India
- Santosh Kumar Chaudhary
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560 012, India
- J. Jeyakanthan
- Structural Biology and Bio Computing Laboratory, Department of Bioinformatics, Alagappa University, Karaikudi 630 004, India
- K. Sekar
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560 012, India
|
180
|
Ma W, Zhang S, Li Z, Jiang M, Wang S, Lu W, Bi X, Jiang H, Zhang H, Wei Z. Enhancing Protein Function Prediction Performance by Utilizing AlphaFold-Predicted Protein Structures. J Chem Inf Model 2022; 62:4008-4017. [PMID: 36006049 DOI: 10.1021/acs.jcim.2c00885] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 11/28/2022]
Abstract
The structure of a protein is of great importance in determining its functionality, and this characteristic can be leveraged to train data-driven prediction models. However, the limited number of available protein structures severely limits the performance of these models. AlphaFold2 and its open-source data set of predicted protein structures have provided a promising solution to this problem, and these predicted structures are expected to benefit the model performance by increasing the number of training samples. In this work, we constructed a new data set that acted as a benchmark and implemented a state-of-the-art structure-based approach for determining whether the performance of the function prediction model can be improved by putting additional AlphaFold-predicted structures into the training set and further compared the performance differences between two models separately trained with real structures only and AlphaFold-predicted structures only. Experimental results indicated that structure-based protein function prediction models could benefit from virtual training data consisting of AlphaFold-predicted structures. First, model performances were improved in all three categories of Gene Ontology terms (GO terms) after adding predicted structures as training samples. Second, the model trained only on AlphaFold-predicted virtual samples achieved comparable performances to the model based on experimentally solved real structures, suggesting that predicted structures were almost equally effective in predicting protein functionality.
Affiliation(s)
- Wenjian Ma
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Shugang Zhang
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China; High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China
- Zhen Li
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Mingjian Jiang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266033, China
- Shuang Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
- Weigang Lu
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Xiangpeng Bi
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Huasen Jiang
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Henggui Zhang
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China; High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China; Biological Physics Group, School of Physics and Astronomy, University of Manchester, Manchester M13 9PL, U.K.
- Zhiqiang Wei
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China; High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China
|
181
|
ML helps predict enzyme turnover rates. Nat Catal 2022. [DOI: 10.1038/s41929-022-00827-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
|
182
|
Li W, Zhang H, Li M, Han M, Yin Y. MGEGFP: a multi-view graph embedding method for gene function prediction based on adaptive estimation with GCN. Brief Bioinform 2022; 23:6659744. [PMID: 35947989 DOI: 10.1093/bib/bbac333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 07/02/2022] [Accepted: 07/21/2022] [Indexed: 11/14/2022] Open
Abstract
In recent years, a number of computational approaches have been proposed to effectively integrate multiple heterogeneous biological networks, and have shown impressive performance for inferring gene function. However, the previous methods do not fully represent the critical neighborhood relationship between genes during the feature learning process. Furthermore, it is difficult to accurately estimate the contributions of different views for multi-view integration. In this paper, we propose MGEGFP, a multi-view graph embedding method based on adaptive estimation with Graph Convolutional Network (GCN), to learn high-quality gene representations among multiple interaction networks for function prediction. First, we design a dual-channel GCN encoder to disentangle the view-specific information and the consensus pattern across diverse networks. With the aid of the disentangled representations, we develop a multi-gate module to adaptively estimate the contributions of different views during each reconstruction process and make full use of the multiplexity advantages, where a diversity preservation constraint is designed to prevent over-fitting. To validate the effectiveness of our model, we conduct experiments on networks from the STRING database for both yeast and human datasets, and compare the performance with seven state-of-the-art methods on five evaluation metrics. Moreover, the ablation study demonstrates the important contribution of the designed dual-channel encoder, multi-gate module and diversity preservation constraint in MGEGFP. The experimental results confirm the superiority of our proposed method and suggest that MGEGFP can be a useful tool for gene function prediction.
Affiliation(s)
- Wei Li
- College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Han Zhang
- College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Minghe Li
- College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Mingjing Han
- College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Yanbin Yin
- Department of Food Science and Technology, University of Nebraska - Lincoln, 1400 R Street, 68588, Nebraska, USA
|
183
|
I-TASSER-MTD: a deep-learning-based platform for multi-domain protein structure and function prediction. Nat Protoc 2022; 17:2326-2353. [PMID: 35931779 DOI: 10.1038/s41596-022-00728-0] [Citation(s) in RCA: 104] [Impact Index Per Article: 52.0] [Received: 02/04/2022] [Accepted: 05/24/2022] [Indexed: 01/17/2023]
Abstract
Most proteins in cells are composed of multiple folding units (or domains) to perform complex functions in a cooperative manner. Relative to the rapid progress in single-domain structure prediction, there are few effective tools available for multi-domain protein structure assembly, mainly due to the complexity of modeling multi-domain proteins, which involves higher degrees of freedom in domain-orientation space and various levels of continuous and discontinuous domain assembly and linker refinement. To meet the challenge and the high demand of the community, we developed I-TASSER-MTD to model the structures and functions of multi-domain proteins through a progressive protocol that combines sequence-based domain parsing, single-domain structure folding, inter-domain structure assembly and structure-based function annotation in a fully automated pipeline. Advanced deep-learning models have been incorporated into each of the steps to enhance both the domain modeling and inter-domain assembly accuracy. The protocol allows for the incorporation of experimental cross-linking data and cryo-electron microscopy density maps to guide the multi-domain structure assembly simulations. I-TASSER-MTD is built on I-TASSER but substantially extends its ability and accuracy in modeling large multi-domain protein structures and provides meaningful functional insights for the targets at both the domain- and full-chain levels from the amino acid sequence alone.
|
184
|
Qiu XY, Wu H, Shao J. TALE-cmap: Protein function prediction based on a TALE-based architecture and the structure information from contact map. Comput Biol Med 2022; 149:105938. [DOI: 10.1016/j.compbiomed.2022.105938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2022] [Revised: 07/26/2022] [Accepted: 08/06/2022] [Indexed: 11/03/2022]
|
185
|
Du H, Jiang D, Gao J, Zhang X, Jiang L, Zeng Y, Wu Z, Shen C, Xu L, Cao D, Hou T, Pan P. Proteome-Wide Profiling of the Covalent-Druggable Cysteines with a Structure-Based Deep Graph Learning Network. Research (Wash D C) 2022; 2022:9873564. [PMID: 35958111 PMCID: PMC9343084 DOI: 10.34133/2022/9873564] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/06/2022] [Accepted: 06/27/2022] [Indexed: 11/06/2022] Open
Abstract
Covalent ligands have attracted increasing attention due to their unique advantages, such as long residence time, high selectivity, and strong binding affinity. They also show promise for targets where previous efforts to identify noncovalent small molecule inhibitors have failed. However, our limited knowledge of covalent binding sites has hindered the discovery of novel ligands. Therefore, developing in silico methods to identify covalent binding sites is highly desirable. Here, we propose DeepCoSI, the first structure-based deep graph learning model to identify ligandable covalent sites in a protein. By integrating the characterization of the binding pocket and the interactions between each cysteine and the surrounding environment, DeepCoSI achieves state-of-the-art predictive performance. Validation on two external test sets that mimic real application scenarios shows that DeepCoSI has a strong ability to distinguish ligandable sites from the others. Finally, we profiled the entire set of protein structures in the RCSB Protein Data Bank (PDB) with DeepCoSI to evaluate the ligandability of each cysteine for covalent ligand design, and made the predicted data publicly available on a website.
Affiliation(s)
- Hongyan Du
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- State Key Lab of CAD&CG, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- State Key Lab of CAD&CG, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Junbo Gao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Lingxiao Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Yundian Zeng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410004 Hunan, China
- Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- State Key Lab of CAD&CG, Zhejiang University, Hangzhou, 310058 Zhejiang, China
- Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, China
|
186
|
Xiao X, Jin Z, Wang S, Xu J, Peng Z, Wang R, Shao W, Hui Y. A dual-path dynamic directed graph convolutional network for air quality prediction. The Science of the Total Environment 2022; 827:154298. [PMID: 35271925 DOI: 10.1016/j.scitotenv.2022.154298] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 10/31/2021] [Revised: 02/28/2022] [Accepted: 02/28/2022] [Indexed: 06/14/2023]
Abstract
Accurate air quality prediction can help cope with air pollution and improve quality of life. With the growing deployment of low-cost air quality sensors, the increasing volume of air quality data offers an opportunity to develop more accurate prediction methods. Air quality is affected by many external factors, such as position, wind, and other meteorological conditions. These factors vary dynamically in space and time, and there are many dynamic contextual relationships between them. Many methods for air quality prediction do not consider these complex spatio-temporal correlations and dynamic contextual relationships. In this paper, we propose a dual-path dynamic directed graph convolutional network (DP-DDGCN) for air quality prediction. We first create a dual-path transposed dynamic directed graph according to the static distance relationships of stations and the dynamic relationships generated by wind speed and direction. Based on the dual-path dynamic directed graph, we can then capture the dynamic spatial dependencies more comprehensively. After that, we apply gated recurrent units (GRUs) and add the future meteorological features to extract the complex temporal dependencies of historical air quality data. Using dual-path dynamic directed graph blocks and the GRUs, we finally construct a dynamic spatio-temporal gated recurrent block to capture the dynamic spatio-temporal contextual correlations. Based on real-world datasets, which record a large amount of PM2.5 concentration data, we compare the proposed model with the benchmark models. The experimental results show that our proposed model has the best performance in predicting the PM2.5 concentrations.
Affiliation(s)
- Xiao Xiao
- School of Telecommunications Engineering, Xidian University, Xi'an 710071, Shaanxi, China
- Zhiling Jin
- School of Telecommunications Engineering, Xidian University, Xi'an 710071, Shaanxi, China
- Shuo Wang
- School of Systems Science, Beijing Normal University, Beijing, 100875, China
- Jing Xu
- School of Systems Science, Beijing Normal University, Beijing, 100875, China
- Ziyan Peng
- School of Telecommunications Engineering, Xidian University, Xi'an 710071, Shaanxi, China
- Rui Wang
- School of Electronic Information, Sichuan University, Chengdu 610065, Sichuan, China
- Wei Shao
- School of Computing Technologies, RMIT University, Melbourne, Victoria 3000, Australia
- Yilong Hui
- School of Telecommunications Engineering, Xidian University, Xi'an 710071, Shaanxi, China; The State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, Shaanxi, China
|
187
|
Sharma VS, Fossati A, Ciuffa R, Buljan M, Williams EG, Chen Z, Shao W, Pedrioli PGA, Purcell AW, Martínez MR, Song J, Manica M, Aebersold R, Li C. PCfun: a hybrid computational framework for systematic characterization of protein complex function. Brief Bioinform 2022; 23:6611913. [PMID: 35724564 PMCID: PMC9310514 DOI: 10.1093/bib/bbac239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 05/05/2022] [Accepted: 05/21/2022] [Indexed: 11/14/2022] Open
Abstract
In molecular biology, it is a general assumption that the ensemble of expressed molecules, their activities and interactions determine biological function, cellular states and phenotypes. Stable protein complexes—or macromolecular machines—are, in turn, the key functional entities mediating and modulating most biological processes. Although identifying protein complexes and their subunit composition can now be done inexpensively and at scale, determining their function remains challenging and labor intensive. This study describes Protein Complex Function predictor (PCfun), the first computational framework for the systematic annotation of protein complex functions using Gene Ontology (GO) terms. PCfun is built upon a word embedding using natural language processing techniques based on 1 million open access PubMed Central articles. Specifically, PCfun leverages two approaches for accurately identifying protein complex function, including: (i) an unsupervised approach that obtains the nearest neighbor (NN) GO term word vectors for a protein complex query vector and (ii) a supervised approach using Random Forest (RF) models trained specifically for recovering the GO terms of protein complex queries described in the CORUM protein complex database. PCfun consolidates both approaches by performing a hypergeometric statistical test to enrich the top NN GO terms within the child terms of the GO terms predicted by the RF models. The documentation and implementation of the PCfun package are available at https://github.com/sharmavaruns/PCfun. We anticipate that PCfun will serve as a useful tool and novel paradigm for the large-scale characterization of protein complex function.
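The consolidation step described in the abstract, a hypergeometric test asking whether the top nearest-neighbour GO terms are over-represented among the child terms of an RF-predicted GO term, can be illustrated with a minimal sketch. This is not the PCfun implementation: the function name `hypergeom_sf` and all counts below are hypothetical, chosen only to show how such an enrichment p-value is computed.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """Return P(X >= k) for X ~ Hypergeometric(M, n, N).

    M: population size, n: number of "success" items in the population,
    N: number of draws, k: observed number of successes among the draws.
    """
    upper = min(n, N)
    total = comb(M, N)
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical toy counts (assumptions, not values from the paper):
M = 1000  # GO terms in the embedding vocabulary
n = 40    # child terms of one RF-predicted GO term
N = 20    # top nearest-neighbour GO terms returned for the query complex
k = 6     # top NN terms that are also children of the RF-predicted term

p_value = hypergeom_sf(k, M, n, N)
print(f"enrichment p-value: {p_value:.3g}")
```

With an expected overlap of only 0.8 terms under the null (20 × 40 / 1000), observing 6 shared terms gives a very small p-value, which is the sense in which the nearest-neighbour terms are "enriched" within a predicted branch of the ontology.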
Affiliation(s)
- Varun S Sharma
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland.,CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
- Andrea Fossati
- Quantitative Biosciences Institute (QBI) and Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA 94158, USA.,J. David Gladstone Institutes, San Francisco, CA 94158, USA
- Rodolfo Ciuffa
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
- Marija Buljan
- Empa - Swiss Federal Laboratories for Materials Science and Technology, St. Gallen, Switzerland.,Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Evan G Williams
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette Luxembourg
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Wenguang Shao
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
- Patrick G A Pedrioli
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
- Anthony W Purcell
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Ruedi Aebersold
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland.,Faculty of Science, University of Zurich, Switzerland
- Chen Li
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland.,Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
188
Odrzywolek K, Karwowska Z, Majta J, Byrski A, Milanowska-Zabel K, Kosciolek T. Deep embeddings to comprehend and visualize microbiome protein space. Sci Rep 2022; 12:10332. [PMID: 35725732 PMCID: PMC9209496 DOI: 10.1038/s41598-022-14055-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/15/2022] [Accepted: 05/31/2022] [Indexed: 12/13/2022] Open
Abstract
Understanding the function of microbial proteins is essential to reveal the clinical potential of the microbiome. The application of high-throughput sequencing technologies allows for fast and increasingly cheaper acquisition of data from microbial communities. However, many of the inferred protein sequences are novel and not catalogued, hence the possibility of predicting their function through conventional homology-based approaches is limited, which indicates the need for further research on alignment-free methods. Here, we leverage a deep-learning-based representation of proteins to assess its utility in alignment-free analysis of microbial proteins. We trained a language model on the Unified Human Gastrointestinal Protein catalogue and validated the resulting protein representation on the bacterial part of the SwissProt database. Finally, we present a use case on proteins involved in SCFA metabolism. Results indicate that the deep learning model manages to accurately represent features related to protein structure and function, allowing for alignment-free protein analyses. Technologies that contextualize metagenomic data are a promising direction to deeply understand the microbiome.
Affiliation(s)
- Krzysztof Odrzywolek
- Ardigen, Podole 76, 30-394, Krakow, Poland
- Institute of Computer Science, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Mickiewicza 30, 30-059, Krakow, Poland
- Zuzanna Karwowska
- Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, 30-387, Krakow, Poland
- Jan Majta
- Ardigen, Podole 76, 30-394, Krakow, Poland
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387, Krakow, Poland
- Aleksander Byrski
- Institute of Computer Science, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Mickiewicza 30, 30-059, Krakow, Poland
- Tomasz Kosciolek
- Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, 30-387, Krakow, Poland.
189
Kagaya Y, Flannery ST, Jain A, Kihara D. ContactPFP: Protein Function Prediction Using Predicted Contact Information. FRONTIERS IN BIOINFORMATICS 2022; 2. [PMID: 35875419 PMCID: PMC9302406 DOI: 10.3389/fbinf.2022.896295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Computational function prediction is one of the most important problems in bioinformatics as elucidating the function of genes is a central task in molecular biology and genomics. Most of the existing function prediction methods use protein sequences as the primary source of input information because the sequence is the most available information for query proteins. There are attempts to consider other attributes of query proteins. Among these attributes, the three-dimensional (3D) structure of proteins is known to be very useful in identifying the evolutionary relationship of proteins, from which functional similarity can be inferred. Here, we report a novel protein function prediction method, ContactPFP, which uses predicted residue-residue contact maps as input structural features of query proteins. Although 3D structure information is known to be useful, it has not been routinely used in function prediction because the 3D structure is not experimentally determined for many proteins. In ContactPFP, we overcome this limitation by using residue-residue contact prediction, which has become increasingly accurate due to rapid development in the protein structure prediction field. ContactPFP takes a query protein sequence as input and uses predicted residue-residue contact as a proxy for the 3D protein structure. To characterize how predicted contacts contribute to function prediction accuracy, we compared the performance of ContactPFP with several well-established sequence-based function prediction methods. The comparative study revealed the advantages and weaknesses of ContactPFP compared to contemporary sequence-based methods. There were many cases where it showed higher prediction accuracy. We examined factors that affected the accuracy of ContactPFP using several illustrative cases that highlight the strength of our method.
Affiliation(s)
- Yuki Kagaya
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Sean T. Flannery
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- *Correspondence: Daisuke Kihara,
190
Ihalage A, Hao Y. Formula Graph Self-Attention Network for Representation-Domain Independent Materials Discovery. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022; 9:e2200164. [PMID: 35475548 PMCID: PMC9218748 DOI: 10.1002/advs.202200164] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/09/2022] [Revised: 03/05/2022] [Indexed: 06/14/2023]
Abstract
The success of machine learning (ML) in materials property prediction depends heavily on how the materials are represented for learning. Two dominant families of material descriptors exist, one that encodes crystal structure in the representation and the other that only uses stoichiometric information with the hope of discovering new materials. Graph neural networks (GNNs) in particular have excelled in predicting material properties within chemical accuracy. However, current GNNs are limited to only one of the above two avenues owing to the little overlap between respective material representations. Here, a new concept of formula graph which unifies stoichiometry-only and structure-based material descriptors is introduced. A self-attention integrated GNN that assimilates a formula graph is further developed and it is found that the proposed architecture produces material embeddings transferable between the two domains. The proposed model can outperform some previously reported structure-agnostic models and their structure-based counterparts while exhibiting better sample efficiency and faster convergence. Finally, the model is applied in a challenging exemplar to predict the complex dielectric function of materials and nominate new substances that potentially exhibit epsilon-near-zero phenomena.
Affiliation(s)
- Achintha Ihalage
- School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Rd, London E1 4NS, United Kingdom
- Yang Hao
- School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Rd, London E1 4NS, United Kingdom
191
Hu S, Zhang Z, Xiong H, Jiang M, Luo Y, Yan W, Zhao B. A tensor-based bi-random walks model for protein function prediction. BMC Bioinformatics 2022; 23:199. [PMID: 35637427 PMCID: PMC9150346 DOI: 10.1186/s12859-022-04747-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Accepted: 05/24/2022] [Indexed: 11/26/2022] Open
Abstract
Background: The accurate characterization of protein functions is critical to understanding life at the molecular level and has a huge impact on biomedicine and pharmaceuticals. Computational prediction of protein function has been studied for decades. Because protein–protein interaction (PPI) networks are plagued by noise and errors, researchers have in recent years focused on fusing multi-omics data. A data model that appropriately integrates network topologies with biological data and preserves their intrinsic characteristics is still a bottleneck and an aspirational goal for protein function prediction.
Results: In this paper, we propose the RWRT (Random Walks with Restart on Tensor) method, which predicts protein function by applying bi-random walks on a tensor. RWRT first constructs a functional similarity tensor by combining protein interaction networks with multi-omics data derived from domain annotation and protein complex information. RWRT then extends the bi-random walks algorithm from a two-dimensional matrix to the tensor for scoring functional similarity between proteins. Finally, RWRT filters out possible pretenders based on the concept of cohesiveness coefficient and annotates target proteins with the functions of the remaining functional partners. Experimental results indicate that RWRT performs significantly better than the state-of-the-art methods and improves the area under the receiver operating characteristic curve (AUROC) by no less than 18%.
Conclusions: The functional similarity tensor offers an alternative representation: it is a collection of networks sharing the same nodes, in which the edges belong to different categories or represent interactions of different nature. We demonstrate that the tensor-based random walk model can not only discover more partners with similar functions but also effectively overcome the constraints imposed by errors in protein interaction networks. We believe that the performance of function prediction depends greatly on whether we can extract and exploit proper functional similarity information on protein correlations.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04747-2.
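For readers unfamiliar with the building block that RWRT extends, the following is a minimal sketch of an ordinary random walk with restart on a single small network. It is the standard matrix version only, not the tensor-based bi-random walk of the paper; the function name, the toy adjacency matrix, and the restart probability of 0.3 are all assumptions made for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * e until convergence.

    adj: symmetric adjacency matrix as a list of lists.
    seed: index of the query node; e is the indicator vector of that node.
    W: column-normalised adjacency matrix (each column sums to 1).
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    e = [1.0 if i == seed else 0.0 for i in range(n)]
    p = e[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * e[i] for i in range(n)]
        converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p


# Hypothetical 4-protein interaction network (node 3 is only linked to node 2).
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adj, seed=0)
print([round(s, 4) for s in scores])
```

The resulting scores rank every node by its proximity to the seed protein; in a function prediction setting, highly ranked neighbours would be the candidate functional partners whose annotations are transferred to the query.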
Affiliation(s)
- Sai Hu
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
- Zhihong Zhang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China.,Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, Hunan, China
- Huijun Xiong
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
- Meiping Jiang
- Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410008, Hunan, China.,NHC Key Laboratory of Birth Defect for Research and Prevention, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410100, Hunan, China
- Yingchun Luo
- Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410008, Hunan, China.,NHC Key Laboratory of Birth Defect for Research and Prevention, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410100, Hunan, China
- Wei Yan
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
- Bihai Zhao
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China.,Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, Hunan, China.
192
Wang S, Wu R, Lu J, Jiang Y, Huang T, Cai YD. Protein-protein interaction networks as miners of biological discovery. Proteomics 2022; 22:e2100190. [PMID: 35567424 DOI: 10.1002/pmic.202100190] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 11/28/2021] [Revised: 03/28/2022] [Accepted: 04/29/2022] [Indexed: 11/12/2022]
Abstract
Protein-protein interactions (PPIs) form the basis of a myriad of biological pathways and mechanisms, such as the formation of protein complexes or the components of signaling cascades. Here, we reviewed experimental methods for identifying PPI pairs, including yeast two-hybrid, mass spectrometry, co-localization, and co-immunoprecipitation. Furthermore, a range of computational methods leveraging biochemical properties, evolutionary history, protein structures and more have enabled identification of additional PPIs. Given the wealth of known PPIs, we reviewed important network methods to construct and analyze networks of PPIs. These methods aid biological discovery by identifying hub genes and dynamic changes in the network, and have been applied extensively in various fields of biological research. Lastly, we discussed the challenges and future direction of research utilizing the power of PPI networks.
Affiliation(s)
- Steven Wang
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Runxin Wu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jiaqi Lu
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, IN, USA
- Yijia Jiang
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Tao Huang
- Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
193
Chen Z, Liu X, Zhao P, Li C, Wang Y, Li F, Akutsu T, Bain C, Gasser RB, Li J, Yang Z, Gao X, Kurgan L, Song J. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022; 50:W434-W447. [PMID: 35524557 PMCID: PMC9252729 DOI: 10.1093/nar/gkac351] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 03/21/2022] [Revised: 04/22/2022] [Accepted: 04/25/2022] [Indexed: 01/07/2023] Open
Abstract
The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China.,Center for Crop Genome Engineering, Henan Agricultural University, Zhengzhou 450046, China
- Xuhan Liu
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden 2333 CC, The Netherlands
- Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Yanan Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Chris Bain
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
- Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Zuoren Yang
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.,Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
194
Prediction of GPCR activity using Machine Learning. Comput Struct Biotechnol J 2022; 20:2564-2573. [PMID: 35685352 PMCID: PMC9163700 DOI: 10.1016/j.csbj.2022.05.016] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/10/2022] [Revised: 05/08/2022] [Accepted: 05/09/2022] [Indexed: 11/20/2022] Open
Abstract
GPCRs are the target for one-third of the FDA-approved drugs; however, the development of new drug molecules targeting GPCRs is limited by the lack of mechanistic understanding of the GPCR structure–activity-function relationship. To modulate GPCR activity with highly specific drugs and minimal side effects, it is necessary to quantitatively describe the important structural features of the GPCR and correlate them with its activation state. In this study, we developed 3 ML approaches to predict the conformation state of GPCR proteins. Additionally, we predict the activity level of GPCRs based on their structure. We leverage the unique advantages of each of the 3 ML approaches: the interpretability of XGBoost, the minimal feature engineering required by the 3D convolutional neural network, and the graph representation of protein structure used by the graph neural network. By using these ML approaches, we are able to predict the activation state of GPCRs with high accuracy (91%–95%) and also predict the activity level of GPCRs with low error (MAE of 7.15–10.58). Furthermore, the interpretation of the ML approaches allows us to determine the importance of each of the features in distinguishing between the GPCR conformations.
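The abstract reports two different kinds of figures: an accuracy, which scores a classifier over discrete conformation states, and a mean absolute error (MAE), which scores a regressor over a continuous activity value. The sketch below shows how each metric is computed. The state labels, the activity values, and their scale are hypothetical and are not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predicted and true values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical classification task: conformation state per structure.
states_true = ["active", "inactive", "inactive", "active", "intermediate"]
states_pred = ["active", "inactive", "active", "active", "intermediate"]

# Hypothetical regression task: continuous activity value per structure.
activity_true = [80.0, 10.0, 25.0, 95.0]
activity_pred = [72.0, 18.0, 30.0, 90.0]

acc = accuracy(states_true, states_pred)                  # 4 of 5 correct
mae = mean_absolute_error(activity_true, activity_pred)   # (8 + 8 + 5 + 5) / 4
print(f"accuracy = {acc:.2f}, MAE = {mae:.2f}")
```

An MAE is only interpretable relative to the range of the target variable, so a reported value such as 7.15 should be read against the scale on which the activity level is defined.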
195
Newton MAH, Rahman J, Zaman R, Sattar A. Enhancing Protein Contact Map Prediction Accuracy via Ensembles of Inter-Residue Distance Predictors. Comput Biol Chem 2022; 99:107700. [DOI: 10.1016/j.compbiolchem.2022.107700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Revised: 05/19/2022] [Accepted: 05/19/2022] [Indexed: 11/03/2022]
196
LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction. Sci Rep 2022; 12:6832. [PMID: 35477726 PMCID: PMC9046255 DOI: 10.1038/s41598-022-10775-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 01/13/2022] [Accepted: 04/11/2022] [Indexed: 11/27/2022] Open
Abstract
Proteins perform many essential functions in biological systems and can be successfully developed as bio-therapeutics. It is invaluable to be able to predict their properties based on a proposed sequence and structure. In this study, we developed a novel generalizable deep learning framework, LM-GVP, composed of a protein Language Model (LM) and Graph Neural Network (GNN) to leverage information from both 1D amino acid sequences and 3D structures of proteins. Our approach outperformed the state-of-the-art protein LMs on a variety of property prediction tasks including fluorescence, protease stability, and protein functions from Gene Ontology (GO). We also illustrated insights into how a GNN prediction head can inform the fine-tuning of protein LMs to better leverage structural information. We envision that our deep learning framework will be generalizable to many protein property prediction problems to greatly accelerate protein engineering and drug development.
197
Gu J, Zhang T, Wu C, Liang Y, Shi X. Refined Contact Map Prediction of Peptides Based on GCN and ResNet. Front Genet 2022; 13:859626. [PMID: 35571037 PMCID: PMC9092020 DOI: 10.3389/fgene.2022.859626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022] Open
Abstract
Predicting peptide inter-residue contact maps, which determine the topology of the peptide structure, plays an important role in computational biology. However, due to the limited number of known homologous structures, there is still much room for improvement in inter-residue contact map prediction. Current models do not capture the relationships between residues accurately enough, especially for residue pairs separated by a long range. In this article, we developed a novel deep neural network framework to refine the rough contact map produced by existing methods. The rough contact map is used to construct a residue graph that is processed by a graph convolutional neural network (GCN). The GCN can better capture global information and is therefore used to grasp long-range contact relationships. A residual convolutional neural network is also applied in the framework to learn local information. We conducted experiments on four different test datasets, and the accuracy of inter-residue long-range contact map prediction demonstrates the effectiveness of our proposed method.
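The idea of turning a rough contact map into a residue graph and passing information along its edges can be sketched as follows. This is a generic illustration of one graph-convolution propagation step with the commonly used symmetric normalisation D^{-1/2}(A + I)D^{-1/2}, without learned weights or a nonlinearity; it is not the architecture of the paper. The 0.5 threshold, the function names, and the toy contact probabilities are assumptions.

```python
from math import sqrt


def normalized_adjacency(contact_prob, threshold=0.5):
    """Binarise a rough contact map, add self-loops, and return
    the symmetrically normalised adjacency D^{-1/2} (A + I) D^{-1/2}."""
    n = len(contact_prob)
    a = [[1.0 if (i == j or contact_prob[i][j] >= threshold) else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(a_hat, features):
    """One message-passing step: each residue averages over its neighbours."""
    n, d = len(a_hat), len(features[0])
    return [[sum(a_hat[i][k] * features[k][f] for k in range(n))
             for f in range(d)] for i in range(n)]


# Hypothetical rough contact probabilities for a 4-residue peptide.
rough = [
    [1.0, 0.9, 0.2, 0.7],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.6],
    [0.7, 0.1, 0.6, 1.0],
]
a_hat = normalized_adjacency(rough)

# A one-dimensional feature that is non-zero only on residue 0.
features = [[1.0], [0.0], [0.0], [0.0]]
mixed = propagate(a_hat, features)
print(mixed)
```

After one step the signal from residue 0 reaches only its graph neighbours (residues 1 and 3); stacking several such steps is what lets a GCN relate residues that are far apart in the graph, which is the motivation given in the abstract for using it on long-range contacts.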
Affiliation(s)
- Jiawei Gu
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Tianhao Zhang
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Chunguo Wu
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- Yanchun Liang
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
- Xiaohu Shi
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
- *Correspondence: Xiaohu Shi,
198
Ghorbani M, Prasad S, Klauda J, Brooks B. GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules. J Chem Phys 2022; 156:184103. [PMID: 35568532 PMCID: PMC9094994 DOI: 10.1063/5.0085607] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 11/14/2022] Open
Abstract
Finding low dimensional representation of data from long-timescale trajectories of biomolecular processes such as protein-folding or ligand-receptor binding is of fundamental importance and kinetic models such as Markov modeling have proven useful in describing the kinetics of these systems. Recently, an unsupervised machine learning technique called VAMPNet was introduced to learn the low dimensional representation and linear dynamical model in an end-to-end manner. VAMPNet is based on variational approach to Markov processes (VAMP) and relies on neural networks to learn the coarse-grained dynamics. In this contribution, we combine VAMPNet and graph neural networks to generate an end-to-end framework to efficiently learn high-level dynamics and metastable states from the long-timescale molecular dynamics trajectories. This method bears the advantages of graph representation learning and uses graph message passing operations to generate an embedding for each datapoint which is used in the VAMPNet to generate a coarse-grained representation. This type of molecular representation results in a higher resolution and more interpretable Markov model than the standard VAMPNet enabling a more detailed kinetic study of the biomolecular processes. Our GraphVAMPNet approach is also enhanced with an attention mechanism to find the important residues for classification into different metastable states.
Affiliation(s)
- Mahdi Ghorbani
- University of Maryland at College Park, United States of America
- Samarjeet Prasad
- National Heart Lung and Blood Institute, United States of America
- Jeffery Klauda
- Chemical and Biomolecular Engineering, University of Maryland at College Park, United States of America
- Bernard Brooks
- Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, United States of America
|
199
|
Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nat Commun 2022; 13:1914. [PMID: 35395843 PMCID: PMC8993921 DOI: 10.1038/s41467-022-29443-w] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 11/25/2020] [Accepted: 03/15/2022] [Indexed: 01/27/2023]
Abstract
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This raises the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured. Representation learning plays an increasing role in protein sequence analysis. This paper seeks to clarify how to ensure that such representations are meaningful, proposing best practices both for the choice of methods and the subsequent analysis.
Affiliation(s)
- Søren Hauberg
- Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Wouter Boomsma
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
|
200
|
Pi J, Jiao P, Zhang Y, Li J. MDGNN: Microbial Drug Prediction Based on Heterogeneous Multi-Attention Graph Neural Network. Front Microbiol 2022; 13:819046. [PMID: 35464940 PMCID: PMC9021438 DOI: 10.3389/fmicb.2022.819046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2021] [Accepted: 03/07/2022] [Indexed: 11/14/2022]
Abstract
Human beings are now facing one of the largest public health crises in history with the outbreak of COVID-19. Traditional drug discovery cannot keep pace with newly discovered infectious diseases. The prediction of drug-virus associations not only provides insights into the mechanism of drug-virus interactions, but also guides the screening of potential antiviral drugs. We develop a deep learning algorithm based on graph convolutional networks (MDGNN) to predict potential antiviral drugs. MDGNN consists of new node-level and feature-level attention mechanisms and proves effective relative to the other algorithms compared. MDGNN integrates the global information of the graph in the process of information aggregation by introducing attention at the node and feature levels to graph convolution. Comparative experiments show that MDGNN achieves state-of-the-art performance with an area under the curve (AUC) of 0.9726 and an area under the PR curve (AUPR) of 0.9112. In a case study, two drugs related to SARS-CoV-2 were successfully predicted and verified by the relevant literature. The data and code are open source and can be accessed from https://github.com/Pijiangsheng/MDGNN.
Affiliation(s)
- Jiangsheng Pi
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Peishun Jiao
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Yang Zhang
- College of Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- *Correspondence: Yang Zhang,
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Junyi Li,
|