1
Ulusoy E, Doğan T. Mutual annotation-based prediction of protein domain functions with Domain2GO. Protein Sci 2024; 33:e4988. [PMID: 38757367 PMCID: PMC11099699 DOI: 10.1002/pro.4988] [Received: 11/07/2023] [Revised: 02/25/2024] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. 
Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.
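The co-annotation idea in this abstract can be illustrated with a minimal sketch. It is not the Domain2GO implementation: the abstract says associations are obtained with statistical resampling, whereas this toy version scores each domain-GO pair by a plain conditional frequency. The protein identifiers, domain names, the `min_score` cut-off, and both function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical toy annotations (assumption): protein -> domains, protein -> GO terms.
protein_domains = {
    "P1": {"D_kinase"},
    "P2": {"D_kinase", "D_sh2"},
    "P3": {"D_sh2"},
    "P4": {"D_kinase"},  # no GO annotation yet; this is the prediction target
}
protein_go = {
    "P1": {"GO:0004672"},
    "P2": {"GO:0004672", "GO:0005515"},
    "P3": {"GO:0005515"},
}


def co_annotation_scores(domains, go_terms):
    """Score each (domain, GO term) pair by the fraction of the domain's
    GO-annotated proteins that also carry the GO term."""
    pair_counts = Counter()
    domain_counts = Counter()
    for protein, doms in domains.items():
        terms = go_terms.get(protein)
        if not terms:
            continue  # proteins without GO annotations carry no evidence
        domain_counts.update(doms)
        pair_counts.update(product(doms, terms))
    return {pair: n / domain_counts[pair[0]] for pair, n in pair_counts.items()}


def propagate(domains_of_protein, scores, min_score=0.75):
    """Transfer GO terms from sufficiently well-supported domain associations."""
    return {
        term
        for (dom, term), score in scores.items()
        if dom in domains_of_protein and score >= min_score
    }


scores = co_annotation_scores(protein_domains, protein_go)
predicted_p4 = propagate(protein_domains["P4"], scores)
print(predicted_p4)
```

In this toy data, the kinase domain co-occurs with GO:0004672 in every annotated protein that carries it, so that term is transferred to P4, while the weaker association with GO:0005515 falls below the cut-off.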
Affiliation(s)
- Erva Ulusoy
  - Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
- Tunca Doğan
  - Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
2
Abulfaraj AA, Alshareef SA. Concordant Gene Expression and Alternative Splicing Regulation under Abiotic Stresses in Arabidopsis. Genes (Basel) 2024; 15:675. [PMID: 38927612 PMCID: PMC11202685 DOI: 10.3390/genes15060675] [Received: 04/16/2024] [Revised: 05/19/2024] [Accepted: 05/20/2024] [Indexed: 06/28/2024] Open
Abstract
The current investigation endeavors to identify differentially expressed alternatively spliced (DAS) genes that exhibit concordant expression with splicing factors (SFs) under diverse multifactorial abiotic stress combinations in Arabidopsis seedlings. SFs serve as the post-transcriptional mechanism governing the spatiotemporal dynamics of gene expression. The different stresses encompass variations in salt concentration, heat, intensive light, and their combinations. Clusters demonstrating consistent expression profiles were surveyed to pinpoint DAS/SF gene pairs exhibiting concordant expression. Through rigorous selection criteria, which incorporate alignment with documented gene functionalities and expression patterns observed in this study, four members of the serine/arginine-rich (SR) gene family were delineated as SFs concordantly expressed with six DAS genes. These regulated SF genes encompass cactin, SR1-like, SR30, and SC35-like. The identified concordantly expressed DAS genes encode diverse proteins such as the 26.5 kDa heat shock protein, chaperone protein DnaJ, potassium channel GORK, calcium-binding EF hand family protein, DEAD-box RNA helicase, and 1-aminocyclopropane-1-carboxylate synthase 6. Among the concordantly expressed DAS/SF gene pairs, SR30/DEAD-box RNA helicase, and SC35-like/1-aminocyclopropane-1-carboxylate synthase 6 emerge as promising candidates, necessitating further examinations to ascertain whether these SFs orchestrate splicing of the respective DAS genes. This study contributes to a deeper comprehension of the varied responses of the splicing machinery to abiotic stresses. Leveraging these DAS/SF associations shows promise for elucidating avenues for augmenting breeding programs aimed at fortifying cultivated plants against heat and intensive light stresses.
Affiliation(s)
- Aala A. Abulfaraj
  - Biological Sciences Department, College of Science & Arts, King Abdulaziz University, Rabigh 21911, Saudi Arabia
- Sahar A. Alshareef
  - Department of Biology, College of Science and Arts at Khulis, University of Jeddah, Jeddah 21921, Saudi Arabia
3
Ali S, Chourasia P, Patterson M. When Protein Structure Embedding Meets Large Language Models. Genes (Basel) 2023; 15:25. [PMID: 38254915 PMCID: PMC10815811 DOI: 10.3390/genes15010025] [Received: 11/06/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/24/2024] Open
Abstract
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
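As a rough illustration of the contact-map concept mentioned in this abstract, the sketch below derives a binary contact map from 3D coordinates and flattens it into a fixed-order vector. The 8 Å distance threshold, the idealized coordinates, and the function names are assumptions for illustration; the abstract does not describe how the authors build their embedding from the map.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map: 1 where two distinct residues lie within `threshold` (in Å)."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


def flatten_upper(cmap):
    """Flatten the strict upper triangle of a symmetric map into a feature vector."""
    n = len(cmap)
    return [cmap[i][j] for i in range(n) for j in range(i + 1, n)]


# Hypothetical residue coordinates: four points spaced 3.8 Å apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
embedding = flatten_upper(cmap)
print(embedding)
```

Only the first and last residues are farther apart than the threshold, so every entry of the vector is 1 except the one for that pair.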
Affiliation(s)
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; (S.A.); (P.C.)
4
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP allows protein function prediction using function-aware domain embedding representations. Commun Biol 2023; 6:1103. [PMID: 37907681 PMCID: PMC10618451 DOI: 10.1038/s42003-023-05476-9] [Received: 05/17/2023] [Accepted: 10/17/2023] [Indexed: 11/02/2023] Open
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
5
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP: Protein Function Prediction Using Function-Aware Domain Embedding Representations. bioRxiv 2023:2023.08.23.554486. [PMID: 37662252 PMCID: PMC10473699 DOI: 10.1101/2023.08.23.554486] [Indexed: 09/05/2023]
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, significantly outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
6
Iruegas R, Pfefferle K, Göttig S, Averhoff B, Ebersberger I. Feature architecture aware phylogenetic profiling indicates a functional diversification of type IVa pili in the nosocomial pathogen Acinetobacter baumannii. PLoS Genet 2023; 19:e1010646. [PMID: 37498819 PMCID: PMC10374093 DOI: 10.1371/journal.pgen.1010646] [Received: 01/31/2023] [Accepted: 06/06/2023] [Indexed: 07/29/2023] Open
Abstract
The Gram-negative bacterial pathogen Acinetobacter baumannii is a major cause of hospital-acquired opportunistic infections. The increasing spread of pan-drug resistant strains makes A. baumannii top-ranking among the ESKAPE pathogens for which novel routes of treatment are urgently needed. Comparative genomics approaches have successfully identified genetic changes coinciding with the emergence of pathogenicity in Acinetobacter. Genes that are prevalent in both pathogenic and apathogenic Acinetobacter species were not considered, ignoring that virulence factors may emerge by the modification of evolutionarily old and widespread proteins. Here, we increased the resolution of comparative genomics analyses to also include lineage-specific changes in protein feature architectures. Using type IVa pili (T4aP) as an example, we show that three pilus components, among them the pilus tip adhesin ComC, vary in their Pfam domain annotation within the genus Acinetobacter. In most pathogenic Acinetobacter isolates, ComC displays a von Willebrand Factor type A domain harboring a finger-like protrusion, and we provide experimental evidence that this finger conveys virulence-related functions in A. baumannii. All three genes are part of an evolutionary cassette, which has been replaced at least twice during A. baumannii diversification. The resulting strain-specific differences in T4aP layout suggest differences in how individual strains interact with their host. Our study underpins the hypothesis that A. baumannii uses T4aP for host infection, as was shown previously for other pathogens. It also indicates that many more functional complexes may exist whose precise functions have been adjusted by modifying individual components on the domain level.
Affiliation(s)
- Ruben Iruegas
  - Applied Bioinformatics Group, Inst of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt am Main, Germany
- Katharina Pfefferle
  - Molecular Microbiology & Bioenergetics, Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Stephan Göttig
  - Institute for Medical Microbiology and Infection Control, University Hospital, Goethe University, Frankfurt, Germany
- Beate Averhoff
  - Molecular Microbiology & Bioenergetics, Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Ingo Ebersberger
  - Applied Bioinformatics Group, Inst of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt am Main, Germany
  - Senckenberg Biodiversity and Climate Research Centre (S-BIK-F), Frankfurt am Main, Germany
  - LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
7
Dosch J, Bergmann H, Tran V, Ebersberger I. FAS: assessing the similarity between proteins using multi-layered feature architectures. Bioinformatics 2023; 39:btad226. [PMID: 37084276 PMCID: PMC10185405 DOI: 10.1093/bioinformatics/btad226] [Received: 11/15/2022] [Revised: 02/23/2023] [Accepted: 04/13/2023] [Indexed: 04/23/2023] Open
Abstract
Motivation: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations. Results: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications. Availability and implementation: FAS is available as a Python package: https://pypi.org/project/greedyFAS/.
Affiliation(s)
- Julian Dosch
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Holger Bergmann
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Vinh Tran
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Ingo Ebersberger
  - Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
  - Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, 60325, Germany
  - LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt, 60325, Germany
8
Meenakshi S I, Rao M, Mayor S, Sowdhamini R. A census of actin-associated proteins in humans. Front Cell Dev Biol 2023; 11:1168050. [PMID: 37187613 PMCID: PMC10175787 DOI: 10.3389/fcell.2023.1168050] [Received: 02/17/2023] [Accepted: 03/31/2023] [Indexed: 05/17/2023] Open
Abstract
Actin filaments help in maintaining the cell structure and coordinating cellular movements and cargo transport within the cell. Actin participates in the interaction with several proteins and also with itself to form the helical filamentous actin (F-actin). Actin-binding proteins (ABPs) and actin-associated proteins (AAPs) coordinate the actin filament assembly and processing, regulate the flux between globular G-actin and F-actin in the cell, and help maintain the cellular structure and integrity. We have used protein-protein interaction data available through multiple sources (STRING, BioGRID, mentha, and a few others), functional annotation, and classical actin-binding domains to identify actin-binding and actin-associated proteins in the human proteome. Here, we report 2482 AAPs and present an analysis of their structural and sequential domains, functions, evolutionary conservation, cellular localization, abundance, and tissue-specific expression patterns. This analysis provides a base for the characterization of proteins involved in actin dynamics and turnover in the cell.
Affiliation(s)
- Madan Rao
  - National Centre for Biological Sciences, TIFR, Bangalore, India
- Satyajit Mayor
  - National Centre for Biological Sciences, TIFR, Bangalore, India
- Ramanathan Sowdhamini
  - National Centre for Biological Sciences, TIFR, Bangalore, India
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
  - Institute of Bioinformatics and Applied Biotechnology, Bangalore, India
  - *Correspondence: Ramanathan Sowdhamini
9
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681 DOI: 10.1515/sagmb-2022-0057] [Received: 11/27/2022] [Accepted: 04/20/2023] [Indexed: 09/03/2023]
Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. On the human dataset, the proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO by 1.9 % for the cellular component prediction task, 3.8 % for the molecular function prediction task and 0.6 % for the biological process prediction task. On the yeast dataset, the corresponding improvements are 2.4 %, 5.2 % and 1.2 %.
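For readers unfamiliar with the Fmax metric reported here, the following sketch computes a protein-centric Fmax in the style commonly used in CAFA evaluations: precision is averaged over proteins with at least one prediction at a given threshold, recall over all proteins, and the best F-measure across thresholds is returned. The abstract does not spell out its exact evaluation protocol, so this is an assumption; the toy proteins, term labels, and scores are hypothetical.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: protein -> {term: score in [0, 1]}
    truth:       protein -> non-empty set of true terms
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            predicted = {term for term, score in predictions.get(protein, {}).items()
                         if score >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # precision only counts proteins that have a prediction at this threshold
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Hypothetical toy data.
truth = {"A": {"g1", "g2"}, "B": {"g3"}}
predictions = {
    "A": {"g1": 0.9, "g2": 0.25, "g4": 0.3},
    "B": {"g3": 0.8, "g1": 0.2},
}
result = fmax(predictions, truth)
print(round(result, 3))
```

With these toy scores the best trade-off occurs at thresholds just above 0.2, where recall is 1 and average precision is 5/6, giving an Fmax of 10/11.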
Affiliation(s)
- Lavkush Sharma
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Akshay Deepak
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
  - Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India
10
Sengupta K, Saha S, Halder AK, Chatterjee P, Nasipuri M, Basu S, Plewczynski D. PFP-GO: Integrating protein sequence, domain and protein-protein interaction information for protein function prediction using ranked GO terms. Front Genet 2022; 13:969915. [PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915] [Received: 06/15/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022] Open
Abstract
Protein function prediction is gradually emerging as an essential field in biological and computational studies. Although computational approaches have gained a significant footprint, it has been observed that information gathered from multiple sources has more influence than information derived from a single source. Considering this fact, a methodology, PFP-GO, is proposed where heterogeneous sources like Protein Sequence, Protein Domain, and Protein-Protein Interaction Network have been processed separately for ranking each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. While Protein Sequence enriches the sequence-based information, Protein Domain and Protein-Protein Interaction Networks embed structural/functional and topological information, respectively, during the phase of GO ranking. PFP-GO is evaluated in terms of Precision, Recall, and F-Score, and was found to perform reasonably better than existing state-of-the-art methods. PFP-GO has achieved an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we check some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
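The three reported figures are mutually consistent if the F-Score is taken to be the harmonic mean of Precision and Recall (the usual F1 definition, which is an assumption here since the abstract does not define it). A short check, with a hypothetical helper name:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract.
pfp_go_f = f_score(0.67, 0.58)
print(round(pfp_go_f, 2))
```

The unrounded value is about 0.622, which rounds to the reported 0.62.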
Affiliation(s)
- Kaustav Sengupta
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Sovan Saha
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
- Anup Kumar Halder
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  - *Correspondence: Subhadip Basu, Dariusz Plewczynski
- Dariusz Plewczynski
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  - *Correspondence: Subhadip Basu, Dariusz Plewczynski
11
Hu S, Zhang Z, Xiong H, Jiang M, Luo Y, Yan W, Zhao B. A tensor-based bi-random walks model for protein function prediction. BMC Bioinformatics 2022; 23:199. [PMID: 35637427 PMCID: PMC9150346 DOI: 10.1186/s12859-022-04747-2] [Received: 01/25/2022] [Accepted: 05/24/2022] [Indexed: 11/26/2022] Open
Abstract
Background: The accurate characterization of protein functions is critical to understanding life at the molecular level and has a huge impact on biomedicine and pharmaceuticals. Computationally predicting protein function has been studied in the past decades. Plagued by noise and errors in protein–protein interaction (PPI) networks, researchers have focused on the fusion of multi-omics data in recent years. A data model that appropriately integrates network topologies with biological data and preserves their intrinsic characteristics is still a bottleneck and an aspirational goal for protein function prediction. Results: In this paper, we propose the RWRT (Random Walks with Restart on Tensor) method to accomplish protein function prediction by applying bi-random walks on the tensor. RWRT first constructs a functional similarity tensor by combining protein interaction networks with multi-omics data derived from domain annotation and protein complex information. After this, RWRT extends the bi-random walks algorithm from a two-dimensional matrix to the tensor for scoring functional similarity between proteins. Finally, RWRT filters out possible pretenders based on the concept of cohesiveness coefficient and annotates target proteins with functions of the remaining functional partners. Experimental results indicate that RWRT performs significantly better than the state-of-the-art methods and improves the area under the receiver-operating curve (AUROC) by no less than 18%. Conclusions: The functional similarity tensor offers us an alternative, in that it is a collection of networks sharing the same nodes; however, the edges belong to different categories or represent interactions of different nature. We demonstrate that the tensor-based random walk model can not only discover more partners with similar functions but is also effectively free from the constraints of errors in protein interaction networks.
We believe that the performance of function prediction depends greatly on whether we can extract and exploit proper functional similarity information on protein correlations. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04747-2.
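The basic building block behind this method, a random walk with restart, can be sketched on a single small network. This is not RWRT itself: the tensor construction, the bi-random walk extension, and the cohesiveness filtering described in the abstract are all omitted, and the four-node path graph, restart probability, and function name are hypothetical.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W @ p + restart * e_seed until convergence,
    where W is the column-normalized adjacency matrix."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(transition[i][j] * p[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: a path 0 - 1 - 2 - 3, walking from node 0.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adjacency, seed=0)
print([round(s, 3) for s in scores])
```

On this path graph the stationary scores fall off with distance from the seed node, which is the property that lets such walks rank candidate functional partners by network proximity.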
Affiliation(s)
- Sai Hu
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
- Zhihong Zhang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
  - Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, Hunan, China
- Huijun Xiong
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
- Meiping Jiang
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410008, Hunan, China
  - NHC Key Laboratory of Birth Defect for Research and Prevention, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410100, Hunan, China
- Yingchun Luo
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410008, Hunan, China
  - NHC Key Laboratory of Birth Defect for Research and Prevention, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410100, Hunan, China
- Wei Yan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
- Bihai Zhao
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
  - Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, Hunan, China
12
Zhao G, Meng J, Wang C, Wang L, Wang H, Tian M, Ma L, Guo X, Xu B. Roles of the protein disulphide isomerases AccPDIA1 and AccPDIA3 in response to oxidant stress in Apis cerana cerana. Insect Mol Biol 2022; 31:10-23. [PMID: 34453759 DOI: 10.1111/imb.12733] [Received: 06/22/2021] [Revised: 08/19/2021] [Accepted: 08/25/2021] [Indexed: 06/13/2023]
Abstract
Protein disulphide isomerase (PDI) plays an important role in a variety of physiological processes through its oxidoreductase activity and molecular chaperone activity. In this study, we cloned two PDI family members, AccPDIA1 and AccPDIA3, from Apis cerana cerana. AccPDIA1 and AccPDIA3 had typical sequence features of PDI family members and were constitutively expressed in A. cerana cerana. The expression levels of AccPDIA1 and AccPDIA3 were generally upregulated after treatment with a variety of environmental stress factors. Inhibition assays showed that E. coli expressing recombinant AccPDIA1 and AccPDIA3 proteins was more resistant to oxidative stress than control E. coli. In addition, silencing AccPDIA1 or AccPDIA3 in A. cerana cerana resulted in significant changes in the expression levels of several antioxidant-related genes as well as the enzymatic activities of peroxidase (POD), superoxide dismutase (SOD) and catalase (CAT) and reduced the survival rate of A. cerana cerana under oxidative stress caused by high temperature. In conclusion, our results suggest that AccPDIA1 and AccPDIA3 may play important roles in the antioxidant activities of A. cerana cerana.
Affiliation(s)
- G Zhao
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - J Meng
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - C Wang
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - L Wang
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - H Wang
- College of Animal Science and Technology, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - M Tian
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - L Ma
- College of Animal Science and Technology, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - X Guo
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, Shandong, P. R. China
| | - B Xu
- College of Animal Science and Technology, Shandong Agricultural University, Taian, Shandong, P. R. China
| |
|
13
|
La SR, Ndhlovu A, Durand PM. The Ancient Origins of Death Domains Support the 'Original Sin' Hypothesis for the Evolution of Programmed Cell Death. J Mol Evol 2022; 90:95-113. [PMID: 35084524 DOI: 10.1007/s00239-021-10044-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 12/17/2021] [Indexed: 10/19/2022]
Abstract
The discovery of caspase homologs in bacteria highlighted the relationship between programmed cell death (PCD) evolution and eukaryogenesis. However, the origin of PCD genes in prokaryotes themselves (bacteria and archaea) is poorly understood and a source of controversy. Whether archaea also contain C14 peptidase enzymes and other death domains is largely unknown because of a historical dearth of genomic data. Archaeal genomic databases have grown significantly in the last decade, which allowed us to perform a detailed comparative study of the evolutionary histories of PCD-related death domains in major archaeal phyla, including the deepest branching phyla of Candidatus Aenigmarchaeota, Candidatus Woesearchaeota, and Euryarchaeota. We identified death domains associated with executioners of PCD, like the caspase homologs of the C14 peptidase family, in 321 archaea sequences. Of these, 15.58% were metacaspase type I orthologues and 84.42% were orthocaspases. Maximum likelihood phylogenetic analyses revealed a scattered distribution of orthocaspases and metacaspases in deep-branching bacteria and archaea. The tree topology was incongruent with the prokaryote 16S phylogeny, suggesting a common ancestry of PCD genes in prokaryotes and subsequent massive horizontal gene transfer coinciding with the divergence of archaea and bacteria. Previous arguments for the origin of PCD were philosophical in nature, with two popular propositions being the 'addiction' and 'original sin' hypotheses. Our data support the 'original sin' hypothesis, which argues for a pleiotropic origin of the PCD toolkit with pro-life and pro-death functions tracing back to the emergence of cellular life, the Last Universal Common Ancestor State.
Affiliation(s)
- So Ri La
- Evolutionary Studies Institute, University of Witwatersrand, Braamfontein, Johannesburg, South Africa.
| | - Andrew Ndhlovu
- Evolutionary Genomics Group, Department of Botany and Zoology, University of Stellenbosch, Stellenbosch, South Africa
| | - Pierre M Durand
- Evolutionary Studies Institute, University of Witwatersrand, Braamfontein, Johannesburg, South Africa
| |
|
14
|
Rojano E, Jabato FM, Perkins JR, Córdoba-Caballero J, García-Criado F, Sillitoe I, Orengo C, Ranea JAG, Seoane-Zonjic P. Assigning protein function from domain-function associations using DomFun. BMC Bioinformatics 2022; 23:43. [PMID: 35033002 PMCID: PMC8761305 DOI: 10.1186/s12859-022-04565-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2020] [Accepted: 01/05/2022] [Indexed: 12/03/2022] Open
Abstract
Background Protein function prediction remains a key challenge. Domain composition affects protein function. Here we present DomFun, a Ruby gem that uses associations between protein domains and functions, calculated using multiple indices based on tripartite network analysis. These domain-function associations are combined at the protein level, to generate protein-function predictions. Results We analysed 16 tripartite networks connecting homologous superfamily and FunFam domains from CATH-Gene3D with functional annotations from the three Gene Ontology (GO) sub-ontologies, KEGG, and Reactome. We validated the results using the CAFA 3 benchmark platform for GO annotation, finding that out of the multiple association metrics and domain datasets tested, Simpson index for FunFam domain-function associations combined with Stouffer’s method leads to the best performance in almost all scenarios. We also found that using FunFams led to better performance than superfamilies, and better results were found for GO molecular function compared to GO biological process terms. DomFun performed as well as the highest-performing method in certain CAFA 3 evaluation procedures in terms of $F_{max}$ and $S_{min}$. We also implemented our own benchmark procedure, Pathway Prediction Performance (PPP), which can be used to validate function prediction for additional annotations sources, such as KEGG and Reactome. Using PPP, we found similar results to those found with CAFA 3 for GO, moreover we found good performance for the other annotation sources. As with CAFA 3, Simpson index with Stouffer’s method led to the top performance in almost all scenarios. Conclusions DomFun shows competitive performance with other methods evaluated in CAFA 3 when predicting proteins function with GO, although results vary depending on the evaluation procedure. Through our own benchmark procedure, PPP, we have shown it can also make accurate predictions for KEGG and Reactome. It performs best when using FunFams, combining Simpson index derived domain-function associations using Stouffer’s method. The tool has been implemented so that it can be easily adapted to incorporate other protein features, such as domain data from other sources, amino acid k-mers and motifs. The DomFun Ruby gem is available from https://rubygems.org/gems/DomFun. Code maintained at https://github.com/ElenaRojano/DomFun. Validation procedure scripts can be found at https://github.com/ElenaRojano/DomFun_project. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04565-6.
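The two ingredients named in this abstract, the Simpson index and Stouffer's method, can be illustrated with a minimal Python sketch. This is not DomFun's code (DomFun is a Ruby gem) and does not reproduce how DomFun derives its scores. It only shows the textbook definitions, assuming the Simpson index is the overlap coefficient between two sets and that per-domain evidence is already available as z-scores. The function names and the protein identifiers are hypothetical.

```python
import math


def overlap_coefficient(a, b):
    """Simpson (overlap) index of two sets: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty sets (an assumption)
    return len(a & b) / min(len(a), len(b))


def stouffer_z(z_scores):
    """Unweighted Stouffer combination of z-scores: sum(z) / sqrt(k)."""
    z = list(z_scores)
    if not z:
        raise ValueError("need at least one z-score")
    return sum(z) / math.sqrt(len(z))


# Hypothetical example: proteins carrying a domain vs. proteins annotated
# with a function.
proteins_with_domain = {"P1", "P2", "P3"}
proteins_with_function = {"P2", "P3", "P4", "P5"}
association = overlap_coefficient(proteins_with_domain, proteins_with_function)

# Hypothetical per-domain z-scores for one protein-function pair.
combined = stouffer_z([1.0, 1.0, 1.0, 1.0])
```

In this toy case the two sets share two proteins and the smaller set has three members, so the association is 2/3. Four z-scores of 1.0 combine to 4 / sqrt(4) = 2.0.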
Affiliation(s)
- Elena Rojano
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain.,Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010, Malaga, Spain
| | - Fernando M Jabato
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain.,Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010, Malaga, Spain
| | - James R Perkins
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain.,CIBER of Rare Diseases, Av. Monforte de Lemos, 3-5. Pabellon 11. Planta 0, 28029, Madrid, Spain.,Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010, Malaga, Spain.
| | - José Córdoba-Caballero
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain
| | - Federico García-Criado
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain
| | - Ian Sillitoe
- Department of Structural and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK
| | - Juan A G Ranea
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain.,CIBER of Rare Diseases, Av. Monforte de Lemos, 3-5. Pabellon 11. Planta 0, 28029, Madrid, Spain.,Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010, Malaga, Spain
| | - Pedro Seoane-Zonjic
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010, Malaga, Spain.,CIBER of Rare Diseases, Av. Monforte de Lemos, 3-5. Pabellon 11. Planta 0, 28029, Madrid, Spain.,Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010, Malaga, Spain
| |
|
15
|
Zhao G, Guo D, Wang L, Li H, Wang C, Guo X. Functions of RPM1-interacting protein 4 in plant immunity. PLANTA 2021; 253:11. [PMID: 33389186 DOI: 10.1007/s00425-020-03527-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/24/2020] [Accepted: 12/02/2020] [Indexed: 05/20/2023]
Abstract
We reviewed recent advances related to RIN4, including its involvement in the immune process through posttranslational modifications, PM H+-ATPase activity regulation, interaction with EXO70 and identification of RIN4-associated NLR proteins. RPM1-interacting protein 4 (RIN4) is a conserved plant immunity regulator that has been extensively studied and can be modified by pathogenic effector proteins. RIN4 plays an important role in both PTI and ETI. In this article, we review the functions of the two conserved NOI domains of RIN4, the C-terminal cysteine residues required for membrane localization and the sites targeted and modified by effector proteins during plant immunity. In addition, we discuss the effect of RIN4 on the stomatal virulence of pathogens via the regulation of PM H+-ATPase activity, which is involved in the immune process through interactions with the exocyst subunit EXO70, and progress in the identification of RIN4-related R proteins in multiple species. This review provides new insights enhancing the current understanding of the immune function of RIN4.
Affiliation(s)
- Guangdong Zhao
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, 271018, Shandong, People's Republic of China
| | - Dezheng Guo
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, 271018, Shandong, People's Republic of China
| | - Lijun Wang
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, 271018, Shandong, People's Republic of China
| | - Han Li
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, 271018, Shandong, People's Republic of China
| | - Chen Wang
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, 271018, Shandong, People's Republic of China.
| | - Xingqi Guo
- State Key Laboratory of Crop Biology, College of Life Sciences, Shandong Agricultural University, Taian, 271018, Shandong, People's Republic of China.
| |
|
16
|
Fan K, Zhang Y. Pseudo2GO: A Graph-Based Deep Learning Method for Pseudogene Function Prediction by Borrowing Information From Coding Genes. Front Genet 2020; 11:807. [PMID: 33014009 PMCID: PMC7461887 DOI: 10.3389/fgene.2020.00807] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2020] [Accepted: 07/06/2020] [Indexed: 12/16/2022] Open
Abstract
Pseudogenes have recently shown increasing functional potential, though historically they were regarded as relics of evolution. Computational methods for predicting pseudogene functions in terms of Gene Ontology (GO) are important for directing experimental discovery. However, no pseudogene-specific computational methods have been proposed to directly predict their GO terms. The biggest challenge for pseudogene function prediction is the lack of sufficient features and functional annotations, which makes training a predictive model difficult. Considering the close functional similarity between pseudogenes and their parent coding genes, which share a large amount of DNA sequence, and the fact that coding genes have rich annotations, we aim to predict pseudogene functions by borrowing information from coding genes in a graph-based way. Here we propose Pseudo2GO, a graph-based deep learning semi-supervised model for pseudogene function prediction. A sequence similarity graph is first constructed to connect pseudogenes and coding genes. Multiple features are incorporated into the model as node attributes to make the graph an attributed graph, including expression profiles, interactions with microRNAs, protein-protein interactions (PPIs), and genetic interactions. Graph convolutional networks are used to propagate node attributes across the graph to make classifications on pseudogenes. Comparing Pseudo2GO with other frameworks adapted from popular protein function prediction methods, we demonstrated that our method has achieved state-of-the-art performance, significantly outperforming other methods in terms of the M-AUPR metric.
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- The Ohio State University Comprehensive Cancer Center, Columbus, OH, United States
| |
|
17
|
Fan K, Guan Y, Zhang Y. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions. Gigascience 2020; 9:giaa081. [PMID: 32770210 PMCID: PMC7414417 DOI: 10.1093/gigascience/giaa081] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2019] [Revised: 04/30/2020] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Identifying protein functions is important for many biological applications. Since experimental functional characterization of proteins is time-consuming and costly, accurate and efficient computational methods for predicting protein functions are in great demand for generating the testable hypotheses guiding large-scale experiments. RESULTS Here, we propose Graph2GO, a multi-modal graph-based representation learning model that can integrate heterogeneous information, including multiple types of interaction networks (sequence similarity network and protein-protein interaction network) and protein features (amino acid sequence, subcellular location, and protein domains) to predict protein functions on gene ontology. Comparing Graph2GO to BLAST, as a baseline model, and to two popular protein function prediction methods (Mashup and deepNF), we demonstrated that our model can achieve state-of-the-art performance. We show the robustness of our model by testing on multiple species. We also provide a web server supporting function query and downstream analysis on-the-fly. CONCLUSIONS Graph2GO is the first model that has utilized attributed network representation learning methods to model both interaction networks and protein features for predicting protein functions, and achieved promising performance. Our model can be easily extended to include more protein features to further improve the performance. Besides, Graph2GO is also applicable to other application scenarios involving biological networks, and the learned latent representations can be used as feature inputs for machine learning tasks in various downstream analyses.
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- The Ohio State University Comprehensive Cancer Center (OSUCCC - James), Columbus, OH 43210, USA
| |
|
18
|
Koo DCE, Bonneau R. Towards region-specific propagation of protein functions. Bioinformatics 2020; 35:1737-1744. [PMID: 30304483 PMCID: PMC6513163 DOI: 10.1093/bioinformatics/bty834] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 08/23/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features. RESULTS We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms. AVAILABILITY AND IMPLEMENTATION The code and features are freely available at: https://github.com/ek1203/rsfp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Da Chen Emily Koo
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
| | - Richard Bonneau
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA.,Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.,Center for Data Science, New York University, New York, NY, USA
| |
|
19
|
Cai Y, Wang J, Deng L. SDN2GO: An Integrated Deep Learning Model for Protein Function Prediction. Front Bioeng Biotechnol 2020; 8:391. [PMID: 32411695 PMCID: PMC7201018 DOI: 10.3389/fbioe.2020.00391] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2020] [Accepted: 04/07/2020] [Indexed: 02/01/2023] Open
Abstract
The assignment of function to proteins at a large scale is essential for understanding the molecular mechanism of life. However, only a very small percentage of the more than 179 million proteins in UniProtKB have Gene Ontology (GO) annotations supported by experimental evidence. In this paper, we proposed an integrated deep-learning-based classification model, named SDN2GO, to predict protein functions. SDN2GO applies convolutional neural networks to learn and extract features from sequences, protein domains, and known PPI networks, and then utilizes a weight classifier to integrate these features and achieve accurate predictions of GO terms. We constructed the training set and the independent test set according to the time-delayed principle of the Critical Assessment of Function Annotation (CAFA) and compared it with two highly competitive methods and the classic BLAST method on the independent test set. The results show that our method outperforms others on each sub-ontology of GO. We also investigated the performance of using protein domain information. We borrowed techniques from natural language processing (NLP) to process domain information and pre-trained a deep learning sub-model to extract the comprehensive features of domains. The experimental results demonstrate that the domain features we obtained substantially improve the performance of our model. Our deep learning models together with the data pre-processing scripts are publicly available as open source software at https://github.com/Charrick/SDN2GO.
Affiliation(s)
- Yideng Cai
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Jiacheng Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, China
- School of Software, Xinjiang University, Urumqi, China
| |
|
20
|
Kobren SN, Singh M. Systematic domain-based aggregation of protein structures highlights DNA-, RNA- and other ligand-binding positions. Nucleic Acids Res 2019; 47:582-593. [PMID: 30535108 PMCID: PMC6344845 DOI: 10.1093/nar/gky1224] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2018] [Accepted: 11/26/2018] [Indexed: 02/06/2023] Open
Abstract
Domains are fundamental subunits of proteins, and while they play major roles in facilitating protein–DNA, protein–RNA and other protein–ligand interactions, a systematic assessment of their various interaction modes is still lacking. A comprehensive resource identifying positions within domains that tend to interact with nucleic acids, small molecules and other ligands would expand our knowledge of domain functionality as well as aid in detecting ligand-binding sites within structurally uncharacterized proteins. Here, we introduce an approach to identify per-domain-position interaction ‘frequencies’ by aggregating protein co-complex structures by domain and ascertaining how often residues mapping to each domain position interact with ligands. We perform this domain-based analysis on ∼91000 co-complex structures, and infer positions involved in binding DNA, RNA, peptides, ions or small molecules across 4128 domains, which we refer to collectively as the InteracDome. Cross-validation testing reveals that ligand-binding positions for 2152 domains are highly consistent and can be used to identify residues facilitating interactions in ∼63–69% of human genes. Our resource of domain-inferred ligand-binding sites should be a great aid in understanding disease etiology: whereas these sites are enriched in Mendelian-associated and cancer somatic mutations, they are depleted in polymorphisms observed across healthy populations. The InteracDome is available at http://interacdome.princeton.edu.
Affiliation(s)
- Shilpa Nadimpalli Kobren
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA 02115, USA
| | - Mona Singh
- Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08544, USA.,Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, NJ 08544, USA
| |
|
21
|
Zeng C, Zhan W, Deng L. SDADB: a functional annotation database of protein structural domains. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5046758. [PMID: 29961821 PMCID: PMC6025185 DOI: 10.1093/database/bay064] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/11/2017] [Accepted: 06/04/2018] [Indexed: 12/27/2022]
Abstract
Annotating functional terms with individual domains is essential for understanding the functions of full-length proteins. We describe SDADB, a functional annotation database for structural domains. SDADB provides associations between gene ontology (GO) terms and SCOP domains calculated with an integrated framework. GO annotations are assigned probabilities of being correct, which are estimated with a Bayesian network by taking advantage of structural neighborhood mappings, SCOP-InterPro domain mapping information, position-specific scoring matrices (PSSMs) and sequence homolog features, with the most substantial contribution coming from high-coverage structure-based domain-protein mappings. The domain-protein mappings are computed using large-scale structure alignment. SDADB contains ontological terms with probabilistic scores for more than 214 000 distinct SCOP domains. It also provides additional features, including 3D structure alignment visualization, a GO hierarchical tree view, and search, browse and download options. Database URL: http://sda.denglab.org
Affiliation(s)
- Cheng Zeng
- School of Software, Central South University, Changsha 410075, China
| | - Weihua Zhan
- School of Electronics and Computer Science, Zhejiang Wanli University, Ningbo 315100, China
| | - Lei Deng
- School of Software, Central South University, Changsha 410075, China.,Shanghai Key Lab of Intelligent Information Processing, Shanghai 200433, China
| |
|
22
|
Mehrotra P, Ami VKG, Srinivasan N. Clustering of multi-domain protein sequences. Proteins 2018; 86:759-776. [PMID: 29675880 DOI: 10.1002/prot.25510] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2017] [Revised: 04/09/2018] [Accepted: 04/16/2018] [Indexed: 11/06/2022]
Abstract
The overall function of a multi-domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment-based methods commonly utilize domain-level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of other domains in the protein or of domain-linker regions when classifying multi-domain proteins. An alignment-free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi-domain protein sequences without a requirement of defining domain boundaries and sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi-domain protein sequences. In this article, CLAP-based classification has been explored on 5 datasets of multi-domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP-based classification scheme resulted in a clustering similar to that obtained from an alignment-based method. CLAP-based clusters obtained for full-length datasets were shown to comprise proteins with similar functions and domain architectures. Our study demonstrates that multi-domain proteins could be classified effectively by considering full-length sequences without a requirement of identification of domains in the sequence.
Affiliation(s)
- Prachi Mehrotra
- Indian Institute of Science Mathematics Initiative, Bangalore, 560012, India.,Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India
| | - Vimla Kany G Ami
- Institute of Bioinformatics and Applied Biotechnology, Bangalore, 560100, India
| | | |
|
23
|
Wang Z, Zhao C, Wang Y, Sun Z, Wang N. PANDA: Protein function prediction using domain architecture and affinity propagation. Sci Rep 2018; 8:3484. [PMID: 29472600 PMCID: PMC5823857 DOI: 10.1038/s41598-018-21849-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2017] [Accepted: 02/09/2018] [Indexed: 12/23/2022] Open
Abstract
We developed PANDA (Propagation of Affinity and Domain Architecture) to predict protein functions in the format of Gene Ontology (GO) terms. PANDA first executes a profile-profile alignment algorithm to search against the PfamA, KOG, COG, and SwissProt databases, and then launches PSI-BLAST against UniProt for homologue search. PANDA integrates a domain architecture inference algorithm based on Bayesian statistics that calculates the probability of having a GO term. All the candidate GO terms are pooled and filtered based on Z-score. After that, the remaining GO terms are clustered using an affinity propagation algorithm based on the GO directed acyclic graph, followed by a second round of filtering on the clusters of GO terms. We benchmarked the performance of all the baseline predictors PANDA integrates and also of every pooling and filtering step of PANDA. We found that PANDA achieves better performance in terms of area under the precision-recall curve compared to the baseline predictors. PANDA can be accessed from http://dna.cs.miami.edu/PANDA/.
Affiliation(s)
- Zheng Wang
- Department of Computer Science, University of Miami, 1364 Memorial Drive, P.O. Box 248154, Coral Gables, FL, 33124, USA.
| | - Chenguang Zhao
- School of Computing, University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS, 39406, USA
| | - Yiheng Wang
- School of Computing, University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS, 39406, USA
| | - Zheng Sun
- Department of Mathematics and Computer Science, The Citadel, 171 Moulrie Street, Charleston, SC, 29409, USA
| | - Nan Wang
- Department of Computer Science, New Jersey City University, 2039 Kennedy Blvd, Jersey City, NJ, 07305, USA
| |
|
24
|
Garzón JI, Deng L, Murray D, Shapira S, Petrey D, Honig B. A computational interactome and functional annotation for the human proteome. eLife 2016; 5. [PMID: 27770567 PMCID: PMC5115866 DOI: 10.7554/elife.18715] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2016] [Accepted: 10/19/2016] [Indexed: 12/14/2022] Open
Abstract
We present a database, PrePPI (Predicting Protein-Protein Interactions), of more than 1.35 million predicted protein-protein interactions (PPIs). Of these at least 127,000 are expected to constitute direct physical interactions although the actual number may be much larger (~500,000). The current PrePPI, which contains predicted interactions for about 85% of the human proteome, is related to an earlier version but is based on additional sources of interaction evidence and is far larger in scope. The use of structural relationships allows PrePPI to infer numerous previously unreported interactions. PrePPI has been subjected to a series of validation tests including reproducing known interactions, recapitulating multi-protein complexes, analysis of disease associated SNPs, and identifying functional relationships between interacting proteins. We show, using Gene Set Enrichment Analysis (GSEA), that predicted interaction partners can be used to annotate a protein's function. We provide annotations for most human proteins, including many annotated as having unknown function.
Affiliation(s)
- José Ignacio Garzón
- Center for Computational Biology and Bioinformatics, Department of Systems Biology, Columbia University, New York, United States
| | - Lei Deng
- Center for Computational Biology and Bioinformatics, Department of Systems Biology, Columbia University, New York, United States; School of Software, Central South University, Changsha, China
| | - Diana Murray
- Center for Computational Biology and Bioinformatics, Department of Systems Biology, Columbia University, New York, United States
| | - Sagi Shapira
- Center for Computational Biology and Bioinformatics, Department of Systems Biology, Columbia University, New York, United States; Department of Microbiology and Immunology, Columbia University, New York, United States
| | - Donald Petrey
- Center for Computational Biology and Bioinformatics, Department of Systems Biology, Columbia University, New York, United States; Howard Hughes Medical Institute, Columbia University, New York, United States
| | - Barry Honig
- Center for Computational Biology and Bioinformatics, Department of Systems Biology, Columbia University, New York, United States; Howard Hughes Medical Institute, Columbia University, New York, United States; Department of Biochemistry and Molecular Biophysics, Columbia University, New York, United States; Department of Medicine, Columbia University, New York, United States; Zuckerman Mind Brain Behavior Institute, Columbia University, New York, United States
| |
|
25
|
Abstract
Proteins are the workhorses of the cell and, over billions of years, they have evolved an amazing plethora of extremely diverse and versatile structures with equally diverse functions. Evolutionary emergence of new proteins and transitions between existing ones are believed to be rare or even impossible. However, recent advances in comparative genomics have repeatedly identified some 10%-30% of all genes as lacking any detectable similarity to existing proteins. Even after careful scrutiny, some of those orphan genes contain protein coding reading frames with detectable transcription and translation. Thus some proteins seem to have emerged from previously non-coding 'dark genomic matter'. These 'de novo' proteins tend to be disordered, fast evolving and weakly expressed, but they also rapidly assume novel and physiologically important functions. Here we review mechanisms by which 'de novo' proteins might be created, under which circumstances they may become fixed and why they are elusive. We propose a 'grow slow and moult' model in which first a reading frame is extended, coding for an initially disordered and non-globular appendage which, over time, becomes more structured and may also become associated with other proteins.
|
26
|
Dohmen E, Kremer LPM, Bornberg-Bauer E, Kemena C. DOGMA: domain-based transcriptome and proteome quality assessment. Bioinformatics 2016; 32:2577-81. [PMID: 27153665 DOI: 10.1093/bioinformatics/btw231] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 01/29/2016] [Accepted: 04/21/2016] [Indexed: 12/29/2022]
Abstract
MOTIVATION Genome studies have become cheaper and easier than ever before, due to the decreased costs of high-throughput sequencing and the free availability of analysis software. However, the quality of genome or transcriptome assemblies can vary considerably. Therefore, quality assessment of assemblies and annotations is a crucial aspect of genome analysis pipelines. RESULTS We developed DOGMA, a program for fast and easy quality assessment of transcriptome and proteome data based on conserved protein domains. DOGMA measures the completeness of a given transcriptome or proteome and provides information about domain content for further analysis. DOGMA completes a quality assessment within seconds. AVAILABILITY AND IMPLEMENTATION DOGMA is implemented in Python and published under the GNU GPL v.3 license. The source code is available at https://ebbgit.uni-muenster.de/domainWorld/DOGMA/ . CONTACTS: e.dohmen@wwu.de or c.kemena@wwu.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Elias Dohmen
- Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany; Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, Recklinghausen 45665, Germany
| | - Lukas P M Kremer
- Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany
| | - Carsten Kemena
- Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany
| |
|
27
|
Li HD, Omenn GS, Guan Y. A proteogenomic approach to understand splice isoform functions through sequence and expression-based computational modeling. Brief Bioinform 2016; 17:1024-1031. [PMID: 26740460 DOI: 10.1093/bib/bbv109] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/27/2015] [Revised: 11/03/2015] [Indexed: 01/23/2023]
Abstract
The products of multi-exon genes are a mixture of alternatively spliced isoforms, from which the translated proteins can have similar, different or even opposing functions. It is therefore essential to differentiate and annotate functions for individual isoforms. Computational approaches provide an efficient complement to expensive and time-consuming experimental studies. The input data of these methods range from DNA sequence, to RNA selection pressure, to expressed sequence tags, to full-length complementary DNA, to exon array, to RNA-seq expression, to proteomic data. Notably, RNA-seq technology generates quantitative profiling of transcript expression at the genome scale, with an unprecedented amount of expression data available for developing isoform function prediction methods. Integrative analysis of these data at different molecular levels enables a proteogenomic approach to systematically interrogate isoform functions. Here, we briefly review the state-of-the-art methods according to their input data sources, discuss their advantages and limitations and point out potential ways to improve prediction accuracies.
|
28
|
Das S, Orengo CA. Protein function annotation using protein domain family resources. Methods 2016; 93:24-34. [DOI: 10.1016/j.ymeth.2015.09.029] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 07/02/2015] [Revised: 09/28/2015] [Accepted: 09/29/2015] [Indexed: 01/25/2023]
|
29
|
Ochoa A, Storey JD, Llinás M, Singh M. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol 2015; 11:e1004509. [PMID: 26575353 PMCID: PMC4648515 DOI: 10.1371/journal.pcbi.1004509] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 09/23/2014] [Accepted: 08/03/2015] [Indexed: 01/25/2023]
Abstract
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning. 
Despite decades of research, it remains a challenge to distinguish homologous relationships between proteins from sequence similarities arising due to chance alone. This is an increasingly important problem as sequence database sizes continue to grow, and even today many computational analyses require that the statistics of billions of sequence comparisons be assessed automatically. Here we explore statistical significance evaluation on data that is stratified—that is, naturally partitioned into subsets that may differ in their amount of signal—and find a theoretically optimal criterion for automatically setting thresholds of significance for each stratum. For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis. Further, we identify weaknesses in the prevailing random sequence model for assessing statistical significance for a small subset of domain families with repetitive sequence patterns and known biological, structural, and evolutionary properties. Our theoretical findings in statistics are relevant not only for identifying protein domains, but for arbitrary stratified problems in genomics and beyond.
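The idea of stratifying by domain family can be sketched by computing q-values separately within each family rather than over all hits combined. This sketch swaps in the plain Benjamini-Hochberg procedure, which is not necessarily the estimator used in the paper, and it assumes each hit already carries a p-value. The function names and the toy input are hypothetical.

```python
from collections import defaultdict

def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def stratified_qvalues(hits):
    """hits: list of (family, p_value). Returns q-values computed within each family."""
    by_family = defaultdict(list)
    for idx, (family, _) in enumerate(hits):
        by_family[family].append(idx)
    result = [0.0] * len(hits)
    for indices in by_family.values():
        qs = bh_qvalues([hits[i][1] for i in indices])
        for i, q in zip(indices, qs):
            result[i] = q
    return result

# Toy hits: two for one family, one for another (p-values are invented).
hits = [("PF00069", 0.01), ("PF00069", 0.5), ("PF00400", 0.02)]
per_family_q = stratified_qvalues(hits)
```

A threshold applied to `per_family_q` then controls the error rate per stratum, which is the setting the abstract contrasts with a single combined analysis.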
Affiliation(s)
- Alejandro Ochoa
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
| | - John D. Storey
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
| | - Manuel Llinás
- Department of Biochemistry and Molecular Biology, and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Mona Singh
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| |
|
30
|
Triant DA, Pearson WR. Most partial domains in proteins are alignment and annotation artifacts. Genome Biol 2015; 16:99. [PMID: 25976240 PMCID: PMC4443539 DOI: 10.1186/s13059-015-0656-7] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 11/20/2014] [Accepted: 04/15/2015] [Indexed: 12/19/2022]
Abstract
BACKGROUND Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2). RESULTS We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts. CONCLUSIONS Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein's gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.
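The 50% criterion described in this abstract can be written as a small coverage check. This is a sketch only: the coordinate convention (1-based, inclusive positions on the family model) and the function names are assumptions, not taken from the paper.

```python
def model_coverage(hmm_start, hmm_end, model_length):
    """Fraction of the family model covered by an alignment.

    Coordinates are assumed to be 1-based and inclusive on the model.
    """
    return (hmm_end - hmm_start + 1) / model_length

def is_partial(hmm_start, hmm_end, model_length, threshold=0.5):
    """Flag a domain annotation that covers less than `threshold` of its model."""
    return model_coverage(hmm_start, hmm_end, model_length) < threshold

# An alignment covering model positions 1-40 of a 100-position model is flagged as partial.
short_hit = is_partial(1, 40, 100)
# An alignment covering positions 10-95 of the same model is not.
long_hit = is_partial(10, 95, 100)
```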
Affiliation(s)
- Deborah A Triant
- Department of Biochemistry and Molecular Genetics, University of Virginia, Box 800733, Charlottesville, VA, 22908, USA.
| | - William R Pearson
- Department of Biochemistry and Molecular Genetics, University of Virginia, Box 800733, Charlottesville, VA, 22908, USA.
| |
|
31
|
Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T. DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe. BMC Bioinformatics 2015; 16:96. [PMID: 25888481 PMCID: PMC4389672 DOI: 10.1186/s12859-015-0499-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 10/07/2014] [Accepted: 02/18/2015] [Indexed: 12/27/2022]
Abstract
Background Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature–based enzyme functional prediction tool to assign Enzyme Commission (EC) digits. Results DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes. Conclusions Our results offer preliminary confirmation of the existence of the hypothesized huge number of “hidden enzymes” in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. 
Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.
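One plausible reading of "top-down" assignment of EC digits is to extend an EC prefix one digit at a time, stopping as soon as no single digit is dominant among reference proteins that share a domain signature. The sketch below follows that reading only. The abstract does not describe the actual DomSign procedure, so the function `top_down_ec`, the `min_fraction` threshold, and the toy reference list are all assumptions.

```python
from collections import Counter

def top_down_ec(ec_numbers, min_fraction=0.8):
    """Extend an EC prefix digit by digit while one digit stays dominant.

    `ec_numbers` are four-field EC strings (e.g. "2.7.11.1") from reference
    proteins assumed to share one domain signature. Dominance is measured
    against the full reference set, so deeper digits need broad support.
    Returns a prefix padded with "-" (e.g. "2.7.11.-"), or None if even the
    first digit is not dominant.
    """
    candidates = [ec.split(".") for ec in ec_numbers]
    total = len(candidates)
    prefix = []
    for level in range(4):
        counts = Counter(c[level] for c in candidates if c[level] != "-")
        if not counts:
            break
        digit, n = counts.most_common(1)[0]
        if n / total < min_fraction:
            break
        prefix.append(digit)
        # Only references consistent with the chosen digit vote at the next level.
        candidates = [c for c in candidates if c[level] == digit]
    if not prefix:
        return None
    return ".".join(prefix + ["-"] * (4 - len(prefix)))

# Toy reference set (invented for illustration).
refs = ["2.7.11.1", "2.7.11.1", "2.7.11.22", "2.7.10.2", "2.7.11.1"]
prediction = top_down_ec(refs)
```

With this toy set the first three digits are supported by at least 80% of the references, while the fourth is not, so the prediction stops at three digits.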
Affiliation(s)
- Tianmin Wang
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan; Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
| | - Hiroshi Mori
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan; Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-E3-10 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
| | - Chong Zhang
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
| | - Ken Kurokawa
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan; Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-E3-10 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
| | - Xin-Hui Xing
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
| | - Takuji Yamada
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
| |
|
32
|
Mbandi SK, Hesse U, van Heusden P, Christoffels A. Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms. BMC Bioinformatics 2015; 16:58. [PMID: 25880035 PMCID: PMC4344733 DOI: 10.1186/s12859-015-0492-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/30/2014] [Accepted: 02/06/2015] [Indexed: 11/19/2022]
Abstract
BACKGROUND De novo transcriptome assembly of short transcribed fragments (transfrags) produced from sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to reduce redundancy typically involves reassembly or clustering of assembled sequences. However, these approaches are mostly based on common word heuristics and often create clusters of biologically unrelated sequences, resulting in loss of unique transfrags annotations and propagation of mis-assemblies. RESULTS Here, we propose a structured framework that consists of a few steps in pipeline architecture for Inferring Functionally Relevant Assembly-derived Transcripts (IFRAT). IFRAT combines 1) removal of identical subsequences, 2) error tolerant CDS prediction, 3) identification of coding potential, and 4) complements BLAST with a multiple domain architecture annotation that reduces non-specific domain annotation. We demonstrate that independent of the assembler, IFRAT selects bona fide transfrags (with CDS and coding potential) from the transcriptome assembly of a model organism without relying on post-assembly clustering or reassembly. The robustness of IFRAT is inferred on RNA-Seq data of Neurospora crassa assembled using de Bruijn graph-based assemblers, in single (Trinity and Oases-25) and multiple (Oases-Merge and additive or pooled) k-mer modes. Single k-mer assemblies contained fewer transfrags compared to the multiple k-mer assemblies. However, Trinity identified a comparable number of predicted coding sequence and gene loci to Oases pooled assembly. IFRAT selects bona fide transfrags representing over 94% of cumulative BLAST-derived functional annotations of the unfiltered assemblies. Between 4-6% are lost when orphan transfrags are excluded and this represents only a tiny fraction of annotation derived from functional transference by sequence similarity. 
The median length of bona fide transfrags ranged from 1.5 kb (Trinity) to 2 kb (Oases), which is consistent with the average coding sequence length in fungi. The fraction of transfrags that could be associated with gene ontology terms ranged from 33% to 50%, which is also high for domain-based annotation. We showed that unselected transfrags were mostly truncated and represent sequences from intronic, untranslated (5' and 3') regions and non-coding gene loci. CONCLUSIONS IFRAT simplifies post-assembly processing, providing a reference transcriptome enriched with functionally relevant assembly-derived transcripts for non-model organisms.
Affiliation(s)
- Stanley Kimbung Mbandi
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa.
| | - Uljana Hesse
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa.
| | - Peter van Heusden
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa.
| | - Alan Christoffels
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa.
| |
|
33
|
Gnanavel M, Mehrotra P, Rakshambikai R, Martin J, Srinivasan N, Bhaskara RM. CLAP: a web-server for automatic classification of proteins with special reference to multi-domain proteins. BMC Bioinformatics 2014; 15:343. [PMID: 25282152 PMCID: PMC4287353 DOI: 10.1186/1471-2105-15-343] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/26/2014] [Accepted: 09/30/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Due to the huge gap between the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only the sequence information hold great importance. For this, traditional alignment-based tools work well in most cases and clustering is performed on the basis of sequence similarity. But, in the case of multi-domain proteins, the alignment quality might be poor due to varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature, hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss out on the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better. RESULTS Our web-server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are a part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters that have high functional and domain architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that corroborated the sub-family level classification of that particular domain family. 
CONCLUSIONS CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/.
|
34
|
Hybrid and rogue kinases encoded in the genomes of model eukaryotes. PLoS One 2014; 9:e107956. [PMID: 25255313 PMCID: PMC4177888 DOI: 10.1371/journal.pone.0107956] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/30/2014] [Accepted: 08/18/2014] [Indexed: 11/19/2022]
Abstract
The highly modular nature of protein kinases generates diverse functional roles mediated by evolutionary events such as domain recombination, insertion and deletion of domains. Usually, the domain architecture of a kinase is related to the subfamily to which the kinase catalytic domain belongs. However, outlier kinases with unusual domain architectures serve in the expansion of the functional space of the protein kinase family. For example, Src kinases are made up of SH2 and SH3 domains in addition to the kinase catalytic domain. A kinase which lacks these two domains but retains sequence characteristics within the kinase catalytic domain is an outlier that is likely to have modes of regulation different from classical Src kinases. This study defines two types of outlier kinases, hybrids and rogues, depending on the nature of domain recombination. Hybrid kinases are those where the catalytic kinase domain belongs to a kinase subfamily but the domain architecture is typical of another kinase subfamily. Rogue kinases are those with a kinase catalytic domain characteristic of a kinase subfamily but a domain architecture typical of neither that subfamily nor any other kinase subfamily. This report provides a consolidated set of such hybrid and rogue kinases gleaned from six eukaryotic genomes (S. cerevisiae, D. melanogaster, C. elegans, M. musculus, T. rubripes and H. sapiens) and discusses their functions. The presence of such kinases necessitates a revisiting of the classification scheme of the protein kinase family using full-length sequences, apart from the classical classification using solely the sequences of kinase catalytic domains. The study of these kinases provides good insight for engineering signalling pathways for a desired output. Lastly, identification of hybrids and rogues in pathogenic protozoa such as P. falciparum sheds light on possible strategies in host-pathogen interactions.
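The hybrid and rogue definitions in this abstract amount to a simple decision rule, sketched below. The lookup table of typical architectures, the generic domain label "Kinase", and the second subfamily "SubfamilyB" are hypothetical placeholders; the study's own curated architectures are not given in the abstract.

```python
def classify_kinase(catalytic_subfamily, architecture, typical):
    """Classify a kinase as 'typical', 'hybrid' or 'rogue'.

    `typical` maps a subfamily name to the set of domain architectures
    (tuples of domain names) considered typical of that subfamily.
    """
    if architecture in typical.get(catalytic_subfamily, set()):
        return "typical"
    for subfamily, architectures in typical.items():
        # Architecture belongs to a different subfamily than the catalytic domain.
        if subfamily != catalytic_subfamily and architecture in architectures:
            return "hybrid"
    # Architecture is typical of no known subfamily.
    return "rogue"

# Hypothetical lookup table for illustration only.
typical = {
    "Src": {("SH3", "SH2", "Kinase")},
    "SubfamilyB": {("PH", "Kinase")},
}

classic = classify_kinase("Src", ("SH3", "SH2", "Kinase"), typical)
hybrid = classify_kinase("Src", ("PH", "Kinase"), typical)
rogue = classify_kinase("Src", ("Kinase",), typical)
```

Under this toy table, a Src-type catalytic domain without SH2 and SH3 domains is labelled rogue because its architecture matches no listed subfamily; with a richer table the same protein could instead be labelled hybrid.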
|
35
|
Abstract
Protein location and function can change dynamically depending on many factors, including environmental stress, disease state, age, developmental stage, and cell type. Here, we describe an integrative computational framework, called the conditional function predictor (CoFP; http://nbm.ajou.ac.kr/cofp/), for predicting changes in subcellular location and function on a proteome-wide scale. The essence of the CoFP approach is to cross-reference general knowledge about a protein and its known network of physical interactions, which typically pool measurements from diverse environments, against gene expression profiles that have been measured under specific conditions of interest. Using CoFP, we predict condition-specific subcellular locations, biological processes, and molecular functions of the yeast proteome under 18 specified conditions. In addition to highly accurate retrieval of previously known gold standard protein locations and functions, CoFP predicts previously unidentified condition-dependent locations and functions for nearly all yeast proteins. Many of these predictions can be confirmed using high-resolution cellular imaging. We show that, under DNA-damaging conditions, Tsr1, Caf120, Dip5, Skg6, Lte1, and Nnf2 change subcellular location and RNA polymerase I subunit A43, Ino2, and Ids2 show changes in DNA binding. Beyond specific predictions, this work reveals a global landscape of changing protein location and function, highlighting a surprising number of proteins that translocate from the mitochondria to the nucleus or from endoplasmic reticulum to Golgi apparatus under stress.
|
36
|
Khafif M, Cottret L, Balagué C, Raffaele S. Identification and phylogenetic analyses of VASt, an uncharacterized protein domain associated with lipid-binding domains in Eukaryotes. BMC Bioinformatics 2014; 15:222. [PMID: 24965341 PMCID: PMC4082322 DOI: 10.1186/1471-2105-15-222] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 01/17/2014] [Accepted: 06/19/2014] [Indexed: 01/25/2023]
Abstract
Background Several regulators of programmed cell death (PCD) in plants encode proteins with putative lipid-binding domains. Among them, VAD1 is a regulator of PCD propagation harboring a GRAM putative lipid-binding domain. However, the function of VAD1 at the subcellular level is unknown and the domain architecture of VAD1 has not been analyzed in detail. Results We analyzed sequence conservation across the plant kingdom in the VAD1 protein and identified an uncharacterized VASt (VAD1 Analog of StAR-related lipid transfer) domain. Using profile hidden Markov models (profile HMMs) and phylogenetic analysis, we found that this domain is conserved among eukaryotes and generally associates with various lipid-binding domains. Proteins containing both a GRAM and a VASt domain notably include the yeast Ysp2 cell death regulator and numerous uncharacterized proteins. Using structure-based phylogeny, we found that the VASt domain is structurally related to Bet v1-like domains. Conclusion We identified a novel protein domain ubiquitous in eukaryotic genomes and belonging to the Bet v1-like superfamily. Our findings open perspectives for the functional analysis of VASt-containing proteins and the characterization of novel mechanisms regulating PCD.
Affiliation(s)
- Sylvain Raffaele
- INRA, Laboratoire des Interactions Plantes-Microorganismes (LIPM), UMR441, 24 Chemin de Borde Rouge - Auzeville, CS52627, F31326 Castanet Tolosan Cedex, France.
37
Li HD, Menon R, Omenn GS, Guan Y. The emerging era of genomic data integration for analyzing splice isoform function. Trends Genet 2014; 30:340-7. [PMID: 24951248 DOI: 10.1016/j.tig.2014.05.005] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 03/18/2014] [Revised: 05/21/2014] [Accepted: 05/23/2014] [Indexed: 01/17/2023]
Abstract
The vast majority of multi-exon genes in humans undergo alternative splicing, which greatly increases the functional diversity of protein species. Predicting functions at the isoform level is essential to further our understanding of developmental abnormalities and cancers, which frequently exhibit aberrant splicing and dysregulation of isoform expression. However, determination of isoform function is very difficult, and efforts to predict isoform function have been limited in the functional genomics field. Deep sequencing of RNA now provides an unprecedented amount of expression data at the transcript level. We describe here emerging computational approaches that integrate such large-scale whole-transcriptome sequencing (RNA-seq) data for predicting the functions of alternatively spliced isoforms, and we discuss their applications in developmental and cancer biology. We outline future directions for isoform function prediction, emphasizing the need for heterogeneous genomic data integration and tissue-specific, dynamic isoform-level network modeling, which will allow the field to realize its full potential.
Affiliation(s)
- Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Rajasree Menon
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA; Department of Electrical Engineering and Computer Science, Ann Arbor, MI, USA.
38
Ghouila A, Florent I, Guerfali FZ, Terrapon N, Laouini D, Yahia SB, Gascuel O, Bréhélin L. Identification of divergent protein domains by combining HMM-HMM comparisons and co-occurrence detection. PLoS One 2014; 9:e95275. [PMID: 24901648 PMCID: PMC4046975 DOI: 10.1371/journal.pone.0095275] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 09/23/2013] [Accepted: 03/26/2014] [Indexed: 01/03/2023]
Abstract
Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from their utilization for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement of sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence — the general domain tendency to preferentially appear along with some favorite domains in the proteins — to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, it revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of part of these predictions shows that it contains several domain families that were missing in the two organisms. All new domain occurrences have been integrated in the EuPathDomains database, along with the GO annotations that can be deduced.
Affiliation(s)
- Amel Ghouila
- Institut de Biologie Computationnelle, LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France
- Computer Science Department, Faculty of Sciences of Tunis, Tunis, Tunisia
- Isabelle Florent
- Centre National de la Recherche Scientifique/Muséum National d'Histoire Naturelle, UMR7245 CNRS-MNHN, Molécules de Communication et Adaptation des Micro-organismes, Adaptation des Protozoaires à leur Environnement, Paris, France
- Fatma Zahra Guerfali
- Institut Pasteur de Tunis, LR11IPT02, Laboratory of Transmission, Control and Immunobiology of Infections (LTCII), Tunis-Belvédère, Tunisia
- Université Tunis El Manar, Tunis, Tunisia
- Nicolas Terrapon
- Centre National de la Recherche Scientifique, Aix-Marseille Université, CNRS UMR 7257, AFMB, Marseille, France
- Dhafer Laouini
- Institut Pasteur de Tunis, LR11IPT02, Laboratory of Transmission, Control and Immunobiology of Infections (LTCII), Tunis-Belvédère, Tunisia
- Université Tunis El Manar, Tunis, Tunisia
- Sadok Ben Yahia
- Computer Science Department, Faculty of Sciences of Tunis, Tunis, Tunisia
- Olivier Gascuel
- Institut de Biologie Computationnelle, LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France
- Laurent Bréhélin
- Institut de Biologie Computationnelle, LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France
39
Computational prediction of protein function based on weighted mapping of domains and GO terms. BioMed Research International 2014; 2014:641469. [PMID: 24868539 PMCID: PMC4017789 DOI: 10.1155/2014/641469] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/21/2013] [Accepted: 03/12/2014] [Indexed: 11/17/2022]
Abstract
In this paper, we propose a novel method, SeekFun, to predict protein function based on a weighted mapping of domains and GO terms. First, a weighted mapping of domains and GO terms is constructed from the GO annotations and domain composition of the proteins. The association strength between a domain and a GO term is weighted by symmetrical conditional probability. Second, the mapping is extended along the true paths of the terms based on the GO hierarchy. Finally, the terms associated with resident domains are transferred to the host protein, and the real annotations of the host protein are determined by the association strengths. Our careful comparisons demonstrate that SeekFun outperforms the compared methods in most cases. SeekFun provides a flexible and effective way to predict protein function. It benefits from the well-constructed mapping of domains and GO terms, as well as from the reasonable strategy for inferring the annotations of a protein from those of its domains.
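The domain-to-GO weighting step in this abstract can be illustrated with a minimal sketch. It assumes the common definition of symmetrical conditional probability, SCP(d, t) = P(d, t)^2 / (P(d) P(t)), estimated from co-annotation counts; the abstract does not give SeekFun's exact formula, and the function name `scp_weights`, the input layout, and the toy identifiers are illustrative assumptions rather than part of the published method.

```python
from collections import Counter
from itertools import product


def scp_weights(proteins):
    """Weight every co-annotated (domain, GO term) pair.

    proteins: dict mapping protein id -> (set of domains, set of GO terms).
    Assumed weighting: SCP(d, t) = n(d, t)^2 / (n(d) * n(t)), i.e.
    P(d | t) * P(t | d) estimated from protein counts.
    """
    domain_count = Counter()
    term_count = Counter()
    pair_count = Counter()
    for domains, terms in proteins.values():
        domain_count.update(domains)
        term_count.update(terms)
        pair_count.update(product(domains, terms))
    return {
        (d, t): n * n / (domain_count[d] * term_count[t])
        for (d, t), n in pair_count.items()
    }


# Toy data (hypothetical annotations, for illustration only).
toy = {
    "P1": ({"PF00069"}, {"GO:0004672"}),
    "P2": ({"PF00069", "PF00017"}, {"GO:0004672", "GO:0005515"}),
    "P3": ({"PF00017"}, {"GO:0005515"}),
}
weights = scp_weights(toy)
```

In this toy set, a pair that always co-occurs receives weight 1.0, while a pair seen together in only one of the two proteins carrying each annotation receives 0.25.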
40
Peng W, Wang J, Cai J, Chen L, Li M, Wu FX. Improving protein function prediction using domain and protein complexes in PPI networks. BMC Systems Biology 2014; 8:35. [PMID: 24655481 PMCID: PMC3994332 DOI: 10.1186/1752-0509-8-35] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Received: 09/06/2012] [Accepted: 03/14/2014] [Indexed: 01/25/2023]
Abstract
Background Characterization of unknown proteins through computational approaches is one of the most challenging problems in in silico biology and has attracted worldwide interest and great effort. Several computational methods have been proposed to address this problem, based either on homology mapping or on protein interaction networks. Results In this paper, two algorithms are proposed that integrate the protein-protein interaction (PPI) network, proteins’ domain information and protein complexes. The first is domain combination similarity (DCS), which combines the domain compositions of both proteins and their neighbors. The second is domain combination similarity in context of protein complexes (DSCP), which extends the protein functional similarity definition of DCS by combining the domain compositions of both proteins and the complexes that include them. The new algorithms are tested on networks of the model species Saccharomyces cerevisiae to predict functions of unknown proteins using cross-validation. Compared with several other existing algorithms, the results demonstrate the effectiveness of the proposed methods in protein function prediction. Furthermore, the DSCP algorithm, using experimentally determined complex data, is robust when a large percentage of the proteins in the network are unknown, and it outperforms DCS and several other existing algorithms. Conclusions The accuracy of predicting protein function can be improved by integrating the protein-protein interaction (PPI) network, proteins’ domain information and protein complexes.
Affiliation(s)
- Jianxin Wang
- School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, PR China.
41
Bhaskara RM, Mehrotra P, Rakshambikai R, Gnanavel M, Martin J, Srinivasan N. The relationship between classification of multi-domain proteins using an alignment-free approach and their functions: a case study with immunoglobulins. Molecular BioSystems 2014; 10:1082-93. [DOI: 10.1039/c3mb70443b] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 02/02/2023]
42
Liberal R, Pinney JW. Simple topological properties predict functional misannotations in a metabolic network. Bioinformatics 2013; 29:i154-61. [PMID: 23812979 PMCID: PMC3694667 DOI: 10.1093/bioinformatics/btt236] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/21/2022]
Abstract
Motivation: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism’s metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead end or disconnected reactions, can, therefore, be strong indications of misannotation. Results: We demonstrate that a machine-learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at three different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of in most cases when validated against recent genome-scale metabolic models. We also observe that, when the model is applied to draft metabolic networks for multiple species, there is a clear negative correlation between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes). Contact: j.pinney@imperial.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rodrigo Liberal
- Department of Life Sciences and Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
43
Enzyme reaction annotation using cloud techniques. BioMed Research International 2013; 2013:140237. [PMID: 24222895 PMCID: PMC3814047 DOI: 10.1155/2013/140237] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/07/2012] [Revised: 07/12/2013] [Accepted: 07/19/2013] [Indexed: 01/22/2023]
Abstract
An understanding of the activities of enzymes could help to elucidate the metabolic pathways of thousands of chemical reactions that are catalyzed by enzymes in living systems. Sophisticated applications such as drug design and metabolic reconstruction could be developed using accurate enzyme reaction annotation. Because accurate enzyme reaction annotation methods create potential for enhanced production capacity in these applications, they have received greater attention in the global market. We propose the enzyme reaction prediction (ERP) method as a novel tool to deduce enzyme reactions from domain architecture. We used several frequency relationships between architectures and reactions to enhance the annotation rates for single and multiple catalyzed reactions. The deluge of information which arose from high-throughput techniques in the postgenomic era has improved our understanding of biological data, although it presents obstacles in the data-processing stage. The high computational capacity provided by cloud computing has resulted in an exponential growth in the volume of incoming data. Cloud services also relieve the requirement for large-scale memory space required by this approach to analyze enzyme kinetic data. Our tool is designed as a single execution file; thus, it could be applied to any cloud platform in which multiple queries are supported.
44
Abstract
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
Affiliation(s)
- Robert Rentzsch
- Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany.
45
Saini A, Hou J. Progressive Clustering Based Method for Protein Function Prediction. Bull Math Biol 2013; 75:331-50. [DOI: 10.1007/s11538-013-9809-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 03/12/2012] [Accepted: 01/07/2013] [Indexed: 12/26/2022]
46
Abbas A, Kong XB, Liu Z, Jing BY, Gao X. Automatic peak selection by a Benjamini-Hochberg-based algorithm. PLoS One 2013; 8:e53112. [PMID: 23308147 PMCID: PMC3538655 DOI: 10.1371/journal.pone.0053112] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 07/27/2012] [Accepted: 11/26/2012] [Indexed: 11/25/2022]
Abstract
A common issue in bioinformatics is that computational methods often generate a large number of predictions sorted according to certain confidence scores. A key problem is then determining how many predictions must be selected to include most of the true predictions while maintaining reasonably high precision. In nuclear magnetic resonance (NMR)-based protein structure determination, for instance, computational peak picking methods are becoming more and more common, although expert knowledge remains the method of choice to determine how many peaks among thousands of candidate peaks should be taken into consideration to capture the true peaks. Here, we propose a Benjamini-Hochberg (B-H)-based approach that automatically selects the number of peaks. We formulate the peak selection problem as a multiple testing problem. Given a candidate peak list sorted by either volumes or intensities, we first convert the peaks into p-values and then apply the B-H-based algorithm to automatically select the number of peaks. The proposed approach is tested on the state-of-the-art peak picking methods, including WaVPeak [1] and PICKY [2]. Compared with the traditional fixed number-based approach, our approach returns significantly more true peaks. For instance, by combining WaVPeak or PICKY with the proposed method, the missing peak rates are on average reduced by 20% and 26%, respectively, in a benchmark set of 32 spectra extracted from eight proteins. The consensus of the B-H-selected peaks from both WaVPeak and PICKY achieves 88% recall and 83% precision, which significantly outperforms each individual method and the consensus method without using the B-H algorithm. The proposed method can be used as a standard procedure for any peak picking method and straightforwardly applied to some other prediction selection problems in bioinformatics.
The source code, documentation and example data of the proposed method are available at http://sfb.kaust.edu.sa/pages/software.aspx.
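The selection step described in this abstract builds on the standard Benjamini-Hochberg step-up procedure, which can be sketched as follows. The sketch assumes the p-values are already available; how peaks are converted into p-values is not described in the abstract and is not modeled here. The function name `benjamini_hochberg` and the example values are illustrative assumptions, not taken from the authors' released code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices selected by the standard B-H step-up procedure.

    Sort the p-values in ascending order, find the largest rank k such
    that p_(k) <= (k / m) * alpha, and select the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


# Hypothetical p-values for eight candidate peaks.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
selected = benjamini_hochberg(p, alpha=0.05)
```

Because the procedure is step-up, a p-value that fails its own threshold is still selected when a larger p-value passes at a higher rank, which is why the loop keeps the largest passing rank rather than stopping at the first failure.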
Affiliation(s)
- Ahmed Abbas
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Xin-Bing Kong
- Department of Statistics, Fudan University, Shanghai, China
- Zhi Liu
- Department of Mathematics, Faculty of Science and Technology, University of Macau, Taipa, Macau
- Bing-Yi Jing
- Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong
- Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
47
Abstract
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to smaller changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
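The grouping idea in this abstract, clustering proteins by their ordered domain content, can be sketched as below. The sketch assumes that domain hits are already available as (protein, domain, start position) tuples, for example parsed from a Pfam-A scan; the function name `domain_sequence_families`, the input layout, and the toy identifiers are illustrative assumptions, and overlapping-hit resolution is not handled.

```python
from collections import defaultdict


def domain_sequence_families(hits):
    """Group proteins that share the same ordered sequence of domains.

    hits: iterable of (protein_id, domain_id, start_position) tuples.
    Returns a dict mapping a tuple of domain ids (in order of their
    start position along the protein) to the list of matching proteins.
    """
    per_protein = defaultdict(list)
    for protein, domain, start in hits:
        per_protein[protein].append((start, domain))

    families = defaultdict(list)
    for protein, located in per_protein.items():
        # Order domains by start position to obtain the domain sequence.
        key = tuple(domain for _, domain in sorted(located))
        families[key].append(protein)
    return dict(families)


# Hypothetical hits: A and C share a domain order, B has it reversed.
hits = [
    ("A", "PF1", 10), ("A", "PF2", 50),
    ("B", "PF2", 5), ("B", "PF1", 80),
    ("C", "PF1", 3), ("C", "PF2", 40),
]
families = domain_sequence_families(hits)
```

Each protein is visited once per hit, so the work grows linearly with the number of hits, in line with the linear scaling the abstract reports.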
Affiliation(s)
- Lars-Gustav Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, Norway
- David W Ussery
- Centre for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
48
Messih MA, Chitale M, Bajic VB, Kihara D, Gao X. Protein domain recurrence and order can enhance prediction of protein functions. Bioinformatics 2012; 28:i444-i450. [PMID: 22962465 PMCID: PMC3436825 DOI: 10.1093/bioinformatics/bts398] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022]
Abstract
MOTIVATION Burgeoning sequencing technologies have generated massive amounts of genomic and proteomic data. Annotating the functions of proteins identified in this data has become a big and crucial problem. Various computational methods have been developed to infer the protein functions based on either the sequences or domains of proteins. The existing methods, however, ignore the recurrence and the order of the protein domains in this function inference. RESULTS We developed two new methods to infer protein functions based on protein domain recurrence and domain order. Our first method, DRDO, calculates the posterior probability of the Gene Ontology terms based on domain recurrence and domain order information, whereas our second method, DRDO-NB, relies on the naïve Bayes methodology using the same domain architecture information. Our large-scale benchmark comparisons show strong improvements in the accuracy of the protein function inference achieved by our new methods, demonstrating that domain recurrence and order can provide important information for inference of protein functions. AVAILABILITY The new models are provided as open source programs at http://sfb.kaust.edu.sa/Pages/Software.aspx. CONTACT dkihara@cs.purdue.edu, xin.gao@kaust.edu.sa SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics Online.
Affiliation(s)
- Mario Abdel Messih
- Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
49
Wass MN, Stanway R, Blagborough AM, Lal K, Prieto JH, Raine D, Sternberg MJE, Talman AM, Tomley F, Yates J, Sinden RE. Proteomic analysis of Plasmodium in the mosquito: progress and pitfalls. Parasitology 2012; 139:1131-45. [PMID: 22336136 PMCID: PMC3417538 DOI: 10.1017/s0031182012000133] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 11/17/2011] [Revised: 01/10/2012] [Accepted: 01/11/2012] [Indexed: 11/06/2022]
Abstract
Here we discuss proteomic analyses of whole cell preparations of the mosquito stages of malaria parasite development (i.e. gametocytes, microgamete, ookinete, oocyst and sporozoite) of Plasmodium berghei. We also include critiques of the proteomes of two cell fractions from the purified ookinete, namely the micronemes and cell surface. Whereas we summarise key biological interpretations of the data, we also try to identify key methodological constraints we have met, only some of which we were able to resolve. Recognising the need to translate the potential of current genome sequencing into functional understanding, we report our efforts to develop more powerful combinations of methods for the in silico prediction of protein function and location. We have applied this analysis to the proteome of the male gamete, a cell whose very simple structural organisation facilitated interpretation of data. Some of the in silico predictions made have now been supported by ongoing protein tagging and genetic knockout studies. We hope this discussion may assist future studies.
Affiliation(s)
- M. N. Wass
- The Centre for Bioinformatics, Department of Life Sciences, Imperial College, London SW7 2AZ
- R. Stanway
- The Malaria Centre, Department of Life Sciences, Imperial College, London SW7 2AZ
- University of Bern, Institute of Cell Biology, Baltzerstrasse 4, CH-3012 Bern
- A. M. Blagborough
- The Malaria Centre, Department of Life Sciences, Imperial College, London SW7 2AZ
- K. Lal
- The Malaria Centre, Department of Life Sciences, Imperial College, London SW7 2AZ
- J. H. Prieto
- The Scripps Research Institute, 10550 North Torrey Pines Rd., Department of Chemical Physiology, SR11, La Jolla, CA 92037
- D. Raine
- The Malaria Centre, Department of Life Sciences, Imperial College, London SW7 2AZ
- M. J. E. Sternberg
- The Centre for Bioinformatics, Department of Life Sciences, Imperial College, London SW7 2AZ
- A. M. Talman
- The Malaria Centre, Department of Life Sciences, Imperial College, London SW7 2AZ
- Department of Microbial Pathogenesis, Yale University School of Medicine, 295 Congress Avenue, New Haven, CT 06519
- F. Tomley
- Pathology & Infectious Diseases, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Hertfordshire AL9 7TA
- J. Yates
- The Scripps Research Institute, 10550 North Torrey Pines Rd., Department of Chemical Physiology, SR11, La Jolla, CA 92037
- R. E. Sinden
- The Malaria Centre, Department of Life Sciences, Imperial College, London SW7 2AZ
50
Wass MN, Barton G, Sternberg MJE. CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 2012; 40:W466-70. [PMID: 22641853 PMCID: PMC3394346 DOI: 10.1093/nar/gks489] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 12/19/2022]
Abstract
Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years, many function prediction methods have been developed using various sources of biological data, from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein–protein interaction data. In benchmarking on a set of 1686 proteins, CombFunc obtains precision and recall of 0.71 and 0.64, respectively, for GO molecular function terms. For biological process GO terms, precision of 0.74 and recall of 0.41 are obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.
Affiliation(s)
- Mark N Wass
- Centre for Bioinformatics, Imperial College London, London, SW7 2AZ, UK.