1
|
Tang W, Gui C, Zhang T. Expression, Purification, and Bioinformatic Prediction of Mycobacterium tuberculosis Rv0439c as a Potential NADP+-Retinol Dehydrogenase. Mol Biotechnol 2023:10.1007/s12033-023-00956-z. [PMID: 37989944 DOI: 10.1007/s12033-023-00956-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 10/23/2023] [Indexed: 11/23/2023]
Abstract
Although the genome of Mycobacterium tuberculosis (Mtb) H37Rv, the causative agent of tuberculosis, has been repeatedly annotated and updated, a range of proteins from this human pathogen have unknown functions. Mtb Rv0439c, a member of the short-chain dehydrogenase/reductases superfamily, has yet to be cloned and characterized, and its function remains unclear. In this work, we present for the first time the optimized expression and purification of this enzyme, as well as bioinformatic analysis to unveil its potential coenzyme and substrate. Optimized expression in Escherichia coli yielded soluble Rv0439c, while certain tag fusions resulted in insolubility. Sequence and docking analyses strongly suggested that Rv0439c has a clear preference for NADP+, with Arg53 being a key residue that confers coenzyme specificity. Furthermore, functional prediction using CLEAN and DEEPre servers suggested that this protein is a potential NADP+-retinol dehydrogenase (EC No. 1.1.1.300) in retinol metabolism, and this was supported by a BLASTp search and docking studies. Collectively, our findings provide a solid basis for future functional characterization and structural studies of Rv0439c, which will contribute to enhanced understanding of Mtb biology.
Affiliation(s)
- Wanggang Tang
- Bengbu Medical College Key Laboratory of Cancer Research and Clinical Laboratory Diagnosis, School of Laboratory Medicine, Bengbu Medical College, Anhui, 233030, China.
- Department of Biochemistry and Molecular Biology, School of Laboratory Medicine, Bengbu Medical College, Bengbu, 233030, Anhui, China.
| | - Chuanyue Gui
- Bengbu Medical College Key Laboratory of Cancer Research and Clinical Laboratory Diagnosis, School of Laboratory Medicine, Bengbu Medical College, Anhui, 233030, China
- School of Public Health, Bengbu Medical College, Bengbu, 233030, Anhui, China
| | - Tingting Zhang
- Bengbu Medical College Key Laboratory of Cancer Research and Clinical Laboratory Diagnosis, School of Laboratory Medicine, Bengbu Medical College, Anhui, 233030, China
- School of Public Health, Bengbu Medical College, Bengbu, 233030, Anhui, China
| |
|
2
|
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: A case study of BK channels. PLoS Comput Biol 2023; 19:e1011460. [PMID: 37713443 PMCID: PMC10529646 DOI: 10.1371/journal.pcbi.1011460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2023] [Revised: 09/27/2023] [Accepted: 08/24/2023] [Indexed: 09/17/2023] Open
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems such as protein folding where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches due to the limitation of data scarcity. One approach to overcome data scarcity is to incorporate physical principles such as through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of the BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ∆V1/2, with an RMSE ~ 32 mV and correlation coefficient of R ~ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ∆V1/2 agree quantitatively with prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.
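The abstract above reports agreement between predicted and measured ∆V1/2 as an RMSE and a correlation coefficient R. As a minimal sketch of how these two statistics are commonly computed (assuming R denotes the Pearson correlation, which the abstract does not state), the following uses invented numbers; the function names and the sample values are illustrative and are not taken from the paper.

```python
import math

def rmse(predicted, measured):
    """Root-mean-square error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical shifts in gating voltage (mV), invented for illustration only.
predicted_shift = [10.0, -20.0, 35.0, -5.0]
measured_shift = [15.0, -30.0, 30.0, 0.0]

print(f"RMSE = {rmse(predicted_shift, measured_shift):.1f} mV")
print(f"R = {pearson_r(predicted_shift, measured_shift):.2f}")
```

RMSE is expressed in the units of the measurement (here mV), whereas R is unitless, which is why the abstract quotes both.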
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
| | - Kelli M. White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| |
|
3
|
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: a case study of BK channels. bioRxiv 2023:2023.06.24.546384. [PMID: 37425916 PMCID: PMC10327070 DOI: 10.1101/2023.06.24.546384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems such as protein folding where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches due to the limitation of data scarcity. One approach to overcome data scarcity is to incorporate physical principles such as through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of the BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV1/2, with an RMSE ∼ 32 mV and correlation coefficient of R ∼ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV1/2 agree quantitatively with prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.
Author Summary
Deep machine learning has brought many exciting breakthroughs in chemistry, physics and biology. These models require large amounts of training data and struggle when the data are scarce. The latter is true for predictive modeling of the function of complex proteins such as ion channels, where only hundreds of mutational data may be available. Using the big potassium (BK) channel as a biologically important model system, we demonstrate that a reliable predictive model of its voltage gating property could be derived from only 473 mutational data by incorporating physics-derived features, which include dynamic properties from molecular dynamics simulations and energetic quantities from Rosetta mutation calculations. We show that the final random forest model captures key trends and hotspots in mutational effects of BK voltage gating, such as the important role of pore hydrophobicity. A particularly curious prediction is that mutations of two adjacent residues on the S5 helix would always have opposite effects on the gating voltage, which was confirmed by experimental characterization of four novel mutations. The current work demonstrates the importance and effectiveness of incorporating physics in predictive modeling of protein function with scarce data.
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, USA
| | - Kelli M. White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| |
|
4
|
Titus-McQuillan JE, Nanni AV, McIntyre LM, Rogers RL. Estimating transcriptome complexities across eukaryotes. BMC Genomics 2023; 24:254. [PMID: 37170194 PMCID: PMC10173493 DOI: 10.1186/s12864-023-09326-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 04/20/2023] [Indexed: 05/13/2023] Open
Abstract
BACKGROUND Genomic complexity is a growing field of evolution, with case studies for comparative evolutionary analyses in model and emerging non-model systems. Understanding complexity and the functional components of the genome is an untapped wealth of knowledge ripe for exploration. With the "remarkable lack of correspondence" between genome size and complexity, there needs to be a way to quantify complexity across organisms. In this study, we use a set of complexity metrics that allow for evaluating changes in complexity using TranD. RESULTS We ascertain if complexity is increasing or decreasing across transcriptomes and at what structural level, as complexity varies. In this study, we define three metrics - TpG, EpT, and EpG - to quantify the transcriptome's complexity that encapsulates the dynamics of alternative splicing. Here we compare complexity metrics across 1) whole genome annotations, 2) a filtered subset of orthologs, and 3) novel genes to elucidate the impacts of orthologs and novel genes in transcript model analysis. Effective Exon Number (EEN) is used to compare the distribution of exon sizes within transcripts against random expectations of uniform exon placement. EEN accounts for differences in exon size, which is important because novel gene differences in complexity for orthologs and whole-transcriptome analyses are biased towards low-complexity genes with few exons and few alternative transcripts. CONCLUSIONS With our metric analyses, we are able to quantify changes in complexity across diverse lineages with greater precision and accuracy than previous cross-species comparisons under ortholog conditioning. These analyses represent a step toward whole-transcriptome analysis in the emerging field of non-model evolutionary genomics, with key insights for evolutionary inference of complexity changes on deep timescales across the tree of life. We suggest a means to quantify biases generated in ortholog calling and correct complexity analysis for lineage-specific effects. With these metrics, we directly assay the quantitative properties of newly formed lineage-specific genes as they lower complexity.
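The abstract names the three metrics TpG, EpT, and EpG without expanding them. The sketch below assumes they stand for mean transcripts per gene, mean exons per transcript, and mean unique exons per gene; that reading, the toy annotation, and the function name are assumptions made for illustration and do not reproduce the TranD implementation.

```python
from statistics import mean

# Hypothetical toy annotation: gene -> transcript -> list of exon (start, end).
annotation = {
    "geneA": {
        "tx1": [(1, 100), (200, 300)],
        "tx2": [(1, 100), (400, 500)],
    },
    "geneB": {
        "tx1": [(10, 50)],
    },
}

def complexity_metrics(genes):
    """Return assumed TpG, EpT and EpG averages for a gene -> transcript -> exons map."""
    # Assumed TpG: number of transcripts annotated for each gene, averaged.
    tpg = mean(len(transcripts) for transcripts in genes.values())
    # Assumed EpT: number of exons in each transcript, averaged over transcripts.
    ept = mean(
        len(exons)
        for transcripts in genes.values()
        for exons in transcripts.values()
    )
    # Assumed EpG: number of distinct exons per gene, averaged over genes.
    epg = mean(
        len({exon for exons in transcripts.values() for exon in exons})
        for transcripts in genes.values()
    )
    return {"TpG": tpg, "EpT": ept, "EpG": epg}

metrics = complexity_metrics(annotation)
print(metrics)
```

In this toy case geneA shares its first exon between two transcripts, so it contributes three distinct exons to EpG rather than four.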
Affiliation(s)
- James E Titus-McQuillan
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
| | - Adalena V Nanni
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32611, USA
- University of Florida Genetics Institute, University of Florida, Gainesville, FL, 32611, USA
| | - Lauren M McIntyre
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32611, USA
- University of Florida Genetics Institute, University of Florida, Gainesville, FL, 32611, USA
| | - Rebekah L Rogers
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| |
|
5
|
Vu TTD, Jung J. Protein function prediction with gene ontology: from traditional to deep learning models. PeerJ 2021; 9:e12019. [PMID: 34513334 PMCID: PMC8395570 DOI: 10.7717/peerj.12019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/13/2021] [Accepted: 07/29/2021] [Indexed: 11/25/2022] Open
Abstract
Protein function prediction is a crucial part of genome annotation. Prediction methods have recently witnessed rapid development, owing to the emergence of high-throughput sequencing technologies. Among the available databases for identifying protein function terms, Gene Ontology (GO) is an important resource that describes the functional properties of proteins. Researchers are employing various approaches to efficiently predict the GO terms. Meanwhile, deep learning, a fast-evolving data-driven discipline, exhibits impressive potential with respect to assigning GO terms to amino acid sequences. Herein, we reviewed the currently available computational GO annotation methods for proteins, ranging from conventional to deep learning approaches. Further, we selected some suitable predictors from among the reviewed tools and conducted a mini comparison of their performance using a worldwide challenge dataset. Finally, we discussed the remaining major challenges in the field, and emphasized the future directions for protein function prediction with GO.
Affiliation(s)
- Thi Thuy Duong Vu
- Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
| | - Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
| |
|
6
|
Bordin N, Sillitoe I, Lees JG, Orengo C. Tracing Evolution Through Protein Structures: Nature Captured in a Few Thousand Folds. Front Mol Biosci 2021; 8:668184. [PMID: 34041266 PMCID: PMC8141709 DOI: 10.3389/fmolb.2021.668184] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/15/2021] [Accepted: 04/27/2021] [Indexed: 11/13/2022] Open
Abstract
This article is dedicated to the memory of Cyrus Chothia, who was a leading light in the world of protein structure evolution. His elegant analyses of protein families and their mechanisms of structural and functional evolution provided important evolutionary and biological insights and firmly established the value of structural perspectives. He was a mentor and supervisor to many other leading scientists who continued his quest to characterise structure and function space. He was also a generous and supportive colleague to those applying different approaches. In this article we review some of his accomplishments and the history of protein structure classifications, particularly SCOP and CATH. We also highlight some of the evolutionary insights these two classifications have brought. Finally, we discuss how the expansion and integration of protein sequence data into these structural families helps reveal the dark matter of function space and can inform the emergence of novel functions in Metazoa. Since we cover 25 years of structural classification, it has not been feasible to review all structure based evolutionary studies and hence we focus mainly on those undertaken by the SCOP and CATH groups and their collaborators.
Affiliation(s)
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Jonathan G Lees
- Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Oxford, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| |
|
7
|
Barot M, Gligorijević V, Cho K, Bonneau R. NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity. Bioinformatics 2021; 37:2414-2422. [PMID: 33576802 PMCID: PMC8388039 DOI: 10.1093/bioinformatics/btab098] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/18/2020] [Revised: 02/04/2021] [Accepted: 02/09/2021] [Indexed: 02/02/2023] Open
Abstract
Motivation Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critical Assessment of Functional Annotation. We are able to demonstrate that our approach performs well even in cases where a species has no network information available: when an organism’s PPI network is left out we can use our multi-species method to make predictions for the left-out organism with good performance. Availability and implementation The code is freely available at https://github.com/nowittynamesleft/NetQuilt. The data, including sequences, PPI networks and GO annotations are available at https://string-db.org/. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meet Barot
- Center for Data Science, New York University, New York, 10011, USA
| | | | - Kyunghyun Cho
- Center for Data Science, New York University, New York, 10011, USA
| | - Richard Bonneau
- Center for Data Science, New York University, New York, 10011, USA
- Center for Computational Biology, Flatiron Institute, New York, 10010, USA
| |
|
8
|
MacDougall A, Volynkin V, Saidi R, Poggioli D, Zellner H, Hatton-Ellis E, Joshi V, O’Donovan C, Orchard S, Auchincloss AH, Baratin D, Bolleman J, Coudert E, de Castro E, Hulo C, Masson P, Pedruzzi I, Rivoire C, Arighi C, Wang Q, Chen C, Huang H, Garavelli J, Vinayaka CR, Yeh LS, Natale DA, Laiho K, Martin MJ, Renaux A, Pichler K. UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase. Bioinformatics 2020; 36:4643-4648. [PMID: 32399560 PMCID: PMC7750954 DOI: 10.1093/bioinformatics/btaa485] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/30/2020] [Revised: 04/13/2020] [Accepted: 05/05/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The number of protein records in the UniProt Knowledgebase (UniProtKB: https://www.uniprot.org) continues to grow rapidly as a result of genome sequencing and the prediction of protein-coding genes. Providing functional annotation for these proteins presents a significant and continuing challenge. RESULTS In response to this challenge, UniProt has developed a method of annotation, known as UniRule, based on expertly curated rules, which integrates related systems (RuleBase, HAMAP, PIRSR, PIRNR) developed by the members of the UniProt consortium. UniRule uses protein family signatures from InterPro, combined with taxonomic and other constraints, to select sets of reviewed proteins which have common functional properties supported by experimental evidence. This annotation is propagated to unreviewed records in UniProtKB that meet the same selection criteria, most of which do not have (and are never likely to have) experimentally verified functional annotation. Release 2020_01 of UniProtKB contains 6496 UniRule rules which provide annotation for 53 million proteins, accounting for 30% of the 178 million records in UniProtKB. UniRule provides scalable enrichment of annotation in UniProtKB. AVAILABILITY AND IMPLEMENTATION UniRule rules are integrated into UniProtKB and can be viewed at https://www.uniprot.org/unirule/. UniRule rules and the code required to run the rules are publicly available for researchers who wish to annotate their own sequences. The implementation used to run the rules is known as UniFIRE and is available at https://gitlab.ebi.ac.uk/uniprot-public/unifire.
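The selection-and-propagation idea described in the abstract above (a family signature plus a taxonomic constraint selects records, and the annotation is copied to matching unreviewed records) can be sketched as follows. This is a toy illustration under stated assumptions: the class names, field names, signature identifier, accessions, and annotation text are all hypothetical, and the code does not reflect the actual UniRule rule format or the UniFIRE implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    signature: str   # hypothetical family-signature identifier
    taxon: str       # hypothetical taxonomic constraint
    annotation: str  # annotation text to propagate

@dataclass
class Record:
    accession: str
    signatures: frozenset
    lineage: tuple
    reviewed: bool
    annotations: list = field(default_factory=list)

def propagate(records, rules):
    """Copy a rule's annotation to each unreviewed record meeting its criteria."""
    for record in records:
        if record.reviewed:
            continue  # reviewed records keep their curated annotation
        for rule in rules:
            matches = rule.signature in record.signatures and rule.taxon in record.lineage
            if matches and rule.annotation not in record.annotations:
                record.annotations.append(rule.annotation)
    return records

# Invented example data, for illustration only.
rules = [Rule("SIG0001", "Bacteria", "hypothetical kinase activity")]
records = [
    Record("X00001", frozenset({"SIG0001"}), ("Bacteria", "Proteobacteria"), reviewed=False),
    Record("X00002", frozenset({"SIG0001"}), ("Eukaryota", "Metazoa"), reviewed=False),
    Record("X00003", frozenset({"SIG0001"}), ("Bacteria",), reviewed=True, annotations=["curated"]),
]
propagate(records, rules)

# Coverage figure quoted in the abstract: 53 million of 178 million records.
coverage = 53 / 178
print(f"coverage = {coverage:.1%}")
```

Only the first record satisfies both the signature and the taxonomic constraint while being unreviewed, so it is the only one that receives the propagated annotation. The coverage ratio evaluates to just under 0.30, consistent with the 30% stated in the abstract.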
Affiliation(s)
- Alistair MacDougall
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Vladimir Volynkin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Diego Poggioli
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Kantar Consulting, Casalecchio Di Reno, 40033 Bologna, Italy
| | - Hermann Zellner
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Emma Hatton-Ellis
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Vishal Joshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Claire O’Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sandra Orchard
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andrea H Auchincloss
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Delphine Baratin
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Jerven Bolleman
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Elisabeth Coudert
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Edouard de Castro
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Chantal Hulo
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Patrick Masson
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Ivo Pedruzzi
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Catherine Rivoire
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Cecilia Arighi
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - Qinghua Wang
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - Chuming Chen
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - Hongzhan Huang
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - John Garavelli
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - C R Vinayaka
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Lai-Su Yeh
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Kati Laiho
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Maria-Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alexandre Renaux
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Klemens Pichler
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
|
9
|
Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, Lewis KA, Georghiou G, Nguyen HN, Hamid MN, Davis L, Dogan T, Atalay V, Rifaioglu AS, Dalkıran A, Cetin Atalay R, Zhang C, Hurto RL, Freddolino PL, Zhang Y, Bhat P, Supek F, Fernández JM, Gemovic B, Perovic VR, Davidović RS, Sumonja N, Veljkovic N, Asgari E, Mofrad MRK, Profiti G, Savojardo C, Martelli PL, Casadio R, Boecker F, Schoof H, Kahanda I, Thurlby N, McHardy AC, Renaux A, Saidi R, Gough J, Freitas AA, Antczak M, Fabris F, Wass MN, Hou J, Cheng J, Wang Z, Romero AE, Paccanaro A, Yang H, Goldberg T, Zhao C, Holm L, Törönen P, Medlar AJ, Zosa E, Borukhov I, Novikov I, Wilkins A, Lichtarge O, Chi PH, Tseng WC, Linial M, Rose PW, Dessimoz C, Vidulin V, Dzeroski S, Sillitoe I, Das S, Lees JG, Jones DT, Wan C, Cozzetto D, Fa R, Torres M, Warwick Vesztrocy A, Rodriguez JM, Tress ML, Frasca M, Notaro M, Grossi G, Petrini A, Re M, Valentini G, Mesiti M, Roche DB, Reeb J, Ritchie DW, Aridhi S, Alborzi SZ, Devignes MD, Koo DCE, Bonneau R, Gligorijević V, Barot M, Fang H, Toppo S, Lavezzo E, Falda M, Berselli M, Tosatto SCE, Carraro M, Piovesan D, Ur Rehman H, Mao Q, Zhang S, Vucetic S, Black GS, Jo D, Suh E, Dayton JB, Larsen DJ, Omdahl AR, McGuffin LJ, Brackenridge DA, Babbitt PC, Yunes JM, Fontana P, Zhang F, Zhu S, You R, Zhang Z, Dai S, Yao S, Tian W, Cao R, Chandler C, Amezola M, Johnson D, Chang JM, Liao WH, Liu YW, Pascarelli S, Frank Y, Hoehndorf R, Kulmanov M, Boudellioua I, Politano G, Di Carlo S, Benso A, Hakala K, Ginter F, Mehryary F, Kaewphan S, Björne J, Moen H, Tolvanen MEE, Salakoski T, Kihara D, Jain A, Šmuc T, Altenhoff A, Ben-Hur A, Rost B, Brenner SE, Orengo CA, Jeffery CJ, Bosco G, Hogan DA, Martin MJ, O'Donovan C, Mooney SD, Greene CS, Radivojac P, Friedberg I. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 2019; 20:244. 
[PMID: 31744546 PMCID: PMC6864930 DOI: 10.1186/s13059-019-1835-8] [Citation(s) in RCA: 187] [Impact Index Per Article: 37.4] [Received: 05/16/2019] [Accepted: 09/24/2019] [Indexed: 12/23/2022]
Abstract
BACKGROUND: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Affiliation(s)
- Naihui Zhou
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Yuxiang Jiang
- Indiana University Bloomington, Bloomington, Indiana, USA
| | - Timothy R Bergquist
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Alexandra J Lee
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Balint Z Kacsoh
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Molecular and Systems Biology, Hanover, NH, USA
| | - Alex W Crocker
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Kimberley A Lewis
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - George Georghiou
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Huy N Nguyen
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Computer Science, Ames, IA, USA
| | - Md Nafiz Hamid
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Larry Davis
- Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Tunca Dogan
- Department of Computer Engineering, Hacettepe University, Ankara, Turkey.,European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Volkan Atalay
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
| | - Ahmet S Rifaioglu
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey.,Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey
| | - Alperen Dalkıran
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
| | - Rengul Cetin Atalay
- CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Rebecca L Hurto
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | | | - Fran Supek
- Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| | - José M Fernández
- INB Coordination Unit, Life Sciences Department, Barcelona Supercomputing Center, Barcelona, Catalonia, Spain.,(former) INB GN2, Structural and Computational Biology Programme, Spanish National Cancer Research Centre, Barcelona, Catalonia, Spain
| | - Branislava Gemovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Vladimir R Perovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Radoslav S Davidović
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Neven Sumonja
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Nevena Veljkovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Ehsaneddin Asgari
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering, University of California Berkeley, Berkeley, CA, USA.,Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Berkeley, CA, USA
| | | | - Giuseppe Profiti
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy.,National Research Council, IBIOM, Bologna, Italy
| | - Castrense Savojardo
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Pier Luigi Martelli
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Florian Boecker
- University of Bonn: INRES Crop Bioinformatics, Bonn, North Rhine-Westphalia, Germany
| | - Heiko Schoof
- INRES Crop Bioinformatics, University of Bonn, Bonn, Germany
| | - Indika Kahanda
- Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
| | - Natalie Thurlby
- University of Bristol, Computer Science, Bristol, Bristol, United Kingdom
| | - Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany.,RESIST, DFG Cluster of Excellence 2155, Brunswick, Germany
| | - Alexandre Renaux
- Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles - Vrije Universiteit Brussel, Brussels, Belgium.,Machine Learning Group, Université libre de Bruxelles, Brussels, Belgium.,Artificial Intelligence lab, Vrije Universiteit Brussel, Brussels, Belgium
| | - Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Julian Gough
- MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
| | - Alex A Freitas
- University of Kent, School of Computing, Canterbury, United Kingdom
| | - Magdalena Antczak
- School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
| | - Fabio Fabris
- University of Kent, School of Computing, Canterbury, United Kingdom
| | - Mark N Wass
- School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
| | - Jie Hou
- University of Missouri, Computer Science, Columbia, Missouri, USA.,Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
| | - Zheng Wang
- University of Miami, Coral Gables, Florida, USA
| | - Alfonso E Romero
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Alberto Paccanaro
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Haixuan Yang
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Galway, Ireland.,Technical University of Munich, Garching, Germany
| | - Tatyana Goldberg
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - Chenguang Zhao
- Faculty for Informatics, Garching, Germany.,Department for Bioinformatics and Computational Biology, Garching, Germany.,School of Computing Sciences and Computer Engineering, Hattiesburg, Mississippi, USA
| | - Liisa Holm
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Petri Törönen
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Alan J Medlar
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Elaine Zosa
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | | | - Ilya Novikov
- Baylor College of Medicine, Department of Biochemistry and Molecular Biology, Houston, TX, USA
| | - Angela Wilkins
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
| | - Olivier Lichtarge
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
| | - Po-Han Chi
- National TsingHua University, Hsinchu, Taiwan
| | - Wei-Cheng Tseng
- Department of Electrical Engineering in National Tsing Hua University, Hsinchu City, Taiwan
| | - Michal Linial
- The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Peter W Rose
- University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA
| | - Christophe Dessimoz
- Department of Computational Biology and Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Genetics, Evolution & Environment, and Department of Computer Science, University College London, London, UK.,Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Vedrana Vidulin
- Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
| | - Saso Dzeroski
- Jozef Stefan Institute, Ljubljana, Slovenia.,Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
| | - Ian Sillitoe
- Research Department of Structural and Molecular Biology, University College London, London, England
| | - Sayoni Das
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Jonathan Gill Lees
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom.,Department of Health and Life Sciences, Oxford Brookes University, London, UK
| | - David T Jones
- The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom.,Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom
| | - Cen Wan
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Domenico Cozzetto
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Rui Fa
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Mateo Torres
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Alex Warwick Vesztrocy
- Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom.,SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
| | - Jose Manuel Rodriguez
- Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid, Spain
| | - Michael L Tress
- Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Marco Frasca
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Marco Notaro
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Giuliano Grossi
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Alessandro Petrini
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Matteo Re
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Giorgio Valentini
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Marco Mesiti
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy.,Institut de Biologie Computationnelle, LIRMM, CNRS-UMR 5506, Universite de Montpellier, Montpellier, France
| | - Daniel B Roche
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - Jonas Reeb
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - David W Ritchie
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | - Sabeur Aridhi
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | | | - Marie-Dominique Devignes
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France.,University of Lorraine, Nancy, Lorraine, France.,Inria, Nancy, France
| | | | - Richard Bonneau
- NYU Center for Data Science, New York, 10010, NY, USA.,Flatiron Institute, CCB, New York, 10010, NY, USA
| | - Vladimir Gligorijević
- Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, New York, USA
| | - Meet Barot
- Center for Data Science, New York University, New York, 10011, NY, USA
| | - Hai Fang
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Stefano Toppo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Enrico Lavezzo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Marco Falda
- Department of Biology, University of Padova, Padova, Italy
| | - Michele Berselli
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Silvio C E Tosatto
- CNR Institute of Neuroscience, Padova, Italy.,Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Marco Carraro
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Hafeez Ur Rehman
- Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Khyber Pakhtoonkhwa, Pakistan
| | - Qizhong Mao
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA.,University of California, Riverside, Philadelphia, PA, USA
| | - Shanshan Zhang
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Slobodan Vucetic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Gage S Black
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Dane Jo
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Erica Suh
- Department of Biology, Brigham Young University, Provo, UT, USA
| | - Jonathan B Dayton
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Dallas J Larsen
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Ashton R Omdahl
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Liam J McGuffin
- School of Biological Sciences, University of Reading, Reading, England, United Kingdom
| | | | - Patricia C Babbitt
- Department of Pharmaceutical Chemistry, San Francisco, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Jeffrey M Yunes
- UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, San Francisco, 94158, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Paolo Fontana
- Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
| | - Feng Zhang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, Shanghai, China.,Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China
| | - Shanfeng Zhu
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Ronghui You
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Zihan Zhang
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Suyang Dai
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Shuwei Yao
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China
| | - Weidong Tian
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China.,Department of Pediatrics, Brain Tumor Center, Division of Experimental Hematology and Cancer Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Caleb Chandler
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Miguel Amezola
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Devon Johnson
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Jia-Ming Chang
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Wen-Hung Liao
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Yi-Wei Liu
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | | | | | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Maxat Kulmanov
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Imane Boudellioua
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.,Computer, Electrical and Mathematical Sciences Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Gianfranco Politano
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Stefano Di Carlo
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Alfredo Benso
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Kai Hakala
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
| | - Filip Ginter
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku, Turku, Finland
| | - Farrokh Mehryary
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
| | - Suwisa Kaewphan
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland.,Turku Centre for Computer Science (TUCS), Turku, Finland
| | - Jari Björne
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | | | | | - Tapio Salakoski
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | - Daisuke Kihara
- Department of Biological Sciences, Department of Computer Science, Purdue University, 47907, IN, USA.,Department of Pediatrics, University of Cincinnati, Cincinnati, 45229, OH, USA
| | - Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Tomislav Šmuc
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
| | - Adrian Altenhoff
- Department of Computer Science, ETH Zurich, Zurich, Switzerland.,SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Burkhard Rost
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany.,Institute for Food and Plant Sciences WZW, Technische Universität München, Freising, Germany
| | | | - Christine A Orengo
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Constance J Jeffery
- Biological Sciences, University of Illinois at Chicago, Chicago, Illinois, USA
| | - Giovanni Bosco
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Deborah A Hogan
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.,Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, Pennsylvania, USA
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.
| | - Iddo Friedberg
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.
10
Teso S, Masera L, Diligenti M, Passerini A. Combining learning and constraints for genome-wide protein annotation. BMC Bioinformatics 2019; 20:338. [PMID: 31208327 PMCID: PMC6580517 DOI: 10.1186/s12859-019-2875-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2018] [Accepted: 05/03/2019] [Indexed: 11/28/2022]
Abstract
Background: The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraint in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale. Results: We present Ocelot, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as Ocelot), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.
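The taxonomic constraint mentioned in this abstract (a protein labeled with a GO term should also carry every ancestor term) can be illustrated with a small sketch. This is not Ocelot's fuzzy-logic consistency layer; it is a simpler hard post-processing step, assumed here for illustration, that raises each ancestor's score to at least the score of its highest-scoring descendant. The term identifiers, the `PARENTS` map, and the function names are hypothetical placeholders rather than real GO terms or any published API.

```python
from typing import Dict, List, Set

# Toy child -> parents map. Term IDs are hypothetical placeholders, not real GO terms.
PARENTS: Dict[str, List[str]] = {
    "T:root": [],
    "T:binding": ["T:root"],
    "T:catalysis": ["T:root"],
    "T:ion_binding": ["T:binding"],
    "T:metalloenzyme": ["T:ion_binding", "T:catalysis"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every ancestor of `term`, excluding the term itself."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(scores: Dict[str, float], parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Raise each ancestor's score to at least the score of any scored descendant."""
    consistent = {term: scores.get(term, 0.0) for term in parents}
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            consistent[anc] = max(consistent[anc], score)
    return consistent


# Raw predictor output that violates the hierarchy: the child outscores its ancestors.
raw = {"T:metalloenzyme": 0.8, "T:binding": 0.3}
fixed = propagate(raw, PARENTS)
```

After propagation, no term scores higher than any of its parents, so thresholding the scores at any cutoff yields a label set that is closed under the ancestor relation.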
Affiliation(s)
- Stefano Teso
- Computer Science Department, KULeuven, Celestijnenlaan 200 A bus 2402, Leuven, 3001, Belgium
| | - Luca Masera
- Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 5, Povo di Trento, 38123, Italy
| | - Michelangelo Diligenti
- Department of Information Engineering and Mathematics, University of Siena, San Niccolò, via Roma, 56, Siena, 53100, Italy
| | - Andrea Passerini
- Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 5, Povo di Trento, 38123, Italy.
11
Fodeh SJ, Tiwari A. Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification. J Biomed Inform 2018; 86:160-166. [PMID: 30130573 DOI: 10.1016/j.jbi.2018.08.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/18/2017] [Revised: 08/13/2018] [Accepted: 08/17/2018] [Indexed: 11/25/2022]
Abstract
Gene Ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components and biological processes. GO has been the standard for describing the functions of specific genes in different model organisms. GO annotation, or the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches have been proposed for annotation, few have utilized knowledge available in the literature. In this manuscript, we describe the development and evaluation of an innovative predictive system to automatically assign molecular functions (GO terms) to genes using the biomedical literature. Because genes could be associated with multiple molecular functions, we posed the GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes. To address the multi-label aspect of the data, we used the binary-relevance method. Although we experimented with several classifiers, the combination of binary-relevance and K-nearest neighbor (KNN) classifier performed best. Our evaluation on a UniProtKB/Swiss-Prot dataset showed a best F1-measure of 0.84.
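The binary-relevance idea named in this abstract treats each label as its own yes/no problem. The sketch below pairs it with a plain k-nearest-neighbour vote on toy two-dimensional feature vectors. It is an illustration under stated assumptions, not the authors' system: the NMF feature-reduction step is omitted, and the label names, the data, and the function `knn_binary_relevance` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_binary_relevance(
    train: List[Tuple[Sequence[float], Set[str]]],
    query: Sequence[float],
    labels: Sequence[str],
    k: int = 3,
) -> Set[str]:
    """Predict a label set using one independent k-NN majority vote per label."""
    neighbours = sorted(train, key=lambda item: squared_distance(item[0], query))[:k]
    predicted: Set[str] = set()
    for label in labels:
        votes = sum(1 for _, label_set in neighbours if label in label_set)
        if votes * 2 > len(neighbours):  # strict majority among the neighbours
            predicted.add(label)
    return predicted


# Toy training set: (feature vector, set of hypothetical function labels).
TRAIN = [
    ([1.0, 0.0], {"kinase"}),
    ([0.9, 0.1], {"kinase", "atp_binding"}),
    ([0.8, 0.2], {"kinase", "atp_binding"}),
    ([0.0, 1.0], {"transporter"}),
    ([0.1, 0.9], {"transporter"}),
]
LABELS = ["kinase", "atp_binding", "transporter"]

prediction = knn_binary_relevance(TRAIN, [0.85, 0.15], LABELS)
```

Because each label is decided separately, a gene can receive zero, one, or several labels, which matches the multi-label framing described above.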
Affiliation(s)
- Samah Jamal Fodeh
- Yale Center for Medical Informatics, Yale University, 300 George st, Suite 501, New Haven, CT 06511, United States.
12
Vignolini T, Mengoni A, Fondi M. Template-Assisted Metabolic Reconstruction and Assembly of Hybrid Bacterial Models. Methods Mol Biol 2018; 1716:177-196. [PMID: 29222754 DOI: 10.1007/978-1-4939-7528-0_8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/07/2023]
Abstract
Intraspecific genomic exchanges happen frequently between bacteria living in the same natural environment and can also be performed artificially in the laboratory for basic research or genetic/metabolic engineering purposes. In silico metabolic reconstruction and simulation of the metabolism of the hybrid strains that result from these processes can be used to predict the phenotypic outcome of such genomic rearrangements; this can be especially helpful as a designing tool in the purview of synthetic biology. However, reconstructing the metabolism of a bacterium with a hybrid genome through in silico approaches is not a trivial task, as it requires taking into account the complex relationships existing between metabolic genes and how they change (or remain unchanged) when new genes are placed in a different genomic context. Furthermore, in order to "mix" the metabolic models of different bacterial strains one needs at least two different metabolic models to begin with, and reconstructing a genome-scale model from the ground up is a challenging task itself, requiring an intensive manual effort and a great deal of information. In this chapter, we propose two general protocols to address the aforementioned issues of: (1) quickly generating strain-specific metabolic models, given the relevant genomic sequence and an already existing, high-quality metabolic model of a different strain belonging to the same species, and (2) reconstructing the metabolic model of a hybrid strain containing genomic elements from two different parental strains.
Affiliation(s)
- Tiziano Vignolini
- LENS, European Laboratory for Non-linear Spectroscopy, University of Florence, Via Nello Carrara 1, 50019 Sesto Fiorentino, Florence, Italy
- Alessio Mengoni
- Department of Biology, University of Florence, Via Madonna del Piano 6, 50019 Sesto Fiorentino, Florence, Italy
- Marco Fondi
- Department of Biology, University of Florence, Via Madonna del Piano 6, 50019 Sesto Fiorentino, Florence, Italy.
|
13
|
HashGO: hashing gene ontology for protein function prediction. Comput Biol Chem 2017; 71:264-273. [DOI: 10.1016/j.compbiolchem.2017.09.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/07/2017] [Accepted: 09/25/2017] [Indexed: 10/18/2022]
|
14
|
ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017; 22:1732. [PMID: 29039790 PMCID: PMC6151571 DOI: 10.3390/molecules22101732] [Citation(s) in RCA: 114] [Impact Index Per Article: 16.3] [Received: 08/30/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 11/25/2022]
Abstract
With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from them because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in filling the gap between the huge number of protein sequences and their known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language “ProLan” to the protein function language “GOLan”, and build a neural machine translation model based on recurrent neural networks to translate “ProLan” into “GOLan”. We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated its performance on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that the proposed method is a promising direction for protein function prediction. In summary, we propose, for the first time, a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to protein function prediction.
|
15
|
Abstract
A biological experiment is the most reliable way of assigning function to a protein. However, in the era of high-throughput sequencing, scientists are unable to carry out experiments to determine the function of every single gene product. Therefore, to gain insights into the activity of these molecules and guide experiments, we must rely on computational means to functionally annotate the majority of sequence data. To understand how well these algorithms perform, we have established a challenge involving a broad scientific community in which we evaluate different annotation methods according to their ability to predict the associations between previously unannotated protein sequences and Gene Ontology terms. Here we discuss the rationale, benefits, and issues associated with evaluating computational methods in an ongoing community-wide challenge.
|
16
|
Abstract
The Gene Ontology (GO) is a formidable resource, but there are several considerations about it that are essential to understand the data and interpret it correctly. The GO is sufficiently simple that it can be used without deep understanding of its structure or how it is developed, which is both a strength and a weakness. In this chapter, we discuss some common misinterpretations of the ontology and the annotations. A better understanding of the pitfalls and the biases in the GO should help users make the most of this very rich resource. We also review some of the misconceptions and misleading assumptions commonly made about GO, including the effect of data incompleteness, the importance of annotation qualifiers, and the transitivity or lack thereof associated with different ontology relations. We also discuss several biases that can confound aggregate analyses such as gene enrichment analyses. For each of these pitfalls and biases, we suggest remedies and best practices.
Affiliation(s)
- Pascale Gaudet
- CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel-Servet, 1211, Geneva 4, Switzerland; Department of Human Protein Sciences, Faculty of Medicine, University of Geneva, 1211, Geneva, Switzerland.
- Christophe Dessimoz
- Department of Genetics, Evolution & Environment, University College London, Gower St, London, WC1E 6BT, UK; Swiss Institute of Bioinformatics, Biophore Building, 1015, Lausanne, Switzerland; Department of Ecology and Evolution, University of Lausanne, Street Biophore, 1015, Lausanne, Switzerland; Center of Integrative Genomics, University of Lausanne, Biophore, 1015, Lausanne, Switzerland; Department of Computer Science, University College London, Gower St, WC1E 6BT, London, UK
|
17
|
Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 2016; 590:2327-41. [PMID: 27423136 PMCID: PMC5937700 DOI: 10.1002/1873-3468.12307] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 06/27/2016] [Revised: 07/12/2016] [Accepted: 07/12/2016] [Indexed: 12/21/2022]
Abstract
Precision medicine and personalized health efforts propose leveraging complex molecular, medical and family history, along with other types of personal data toward better life. We argue that this ambitious objective will require advanced and specialized machine learning solutions. Simply skimming some low-hanging results off the data wealth might have limited potential. Instead, we need to better understand all parts of the system to define medically relevant causes and effects: how do particular sequence variants affect particular proteins and pathways? How do these effects, in turn, cause the health or disease-related phenotype? Toward this end, deeper understanding will not simply diffuse from deeper machine learning, but from more explicit focus on understanding protein function, context-specific protein interaction networks, and impact of variation on both.
Affiliation(s)
- Burkhard Rost
- Department of Informatics and Bioinformatics, Institute for Advanced Studies, Technical University of Munich, Garching, Germany
- Predrag Radivojac
- School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA
|
18
|
Cao R, Cheng J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016; 93:84-91. [PMID: 26370280 PMCID: PMC4894840 DOI: 10.1016/j.ymeth.2015.09.011] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Received: 03/31/2015] [Revised: 09/03/2015] [Accepted: 09/10/2015] [Indexed: 11/30/2022]
Abstract
MOTIVATIONS: Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene-gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. RESULTS: In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein-protein interaction and spatial gene-gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile-sequence comparison, profile-profile comparison, and domain co-occurrence networks according to the maximum F-measure.
Affiliation(s)
- Renzhi Cao
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
- Jianlin Cheng
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA.
|
19
|
Das S, Orengo CA. Protein function annotation using protein domain family resources. Methods 2016; 93:24-34. [DOI: 10.1016/j.ymeth.2015.09.029] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 07/02/2015] [Revised: 09/28/2015] [Accepted: 09/29/2015] [Indexed: 01/25/2023]
|
20
|
Sillitoe I, Furnham N. FunTree: advances in a resource for exploring and contextualising protein function evolution. Nucleic Acids Res 2015; 44:D317-23. [PMID: 26590404 PMCID: PMC4702901 DOI: 10.1093/nar/gkv1274] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 09/11/2015] [Accepted: 11/03/2015] [Indexed: 11/13/2022]
Abstract
FunTree is a resource that brings together protein sequence, structure and functional information, including overall chemical reaction and mechanistic data, for structurally defined domain superfamilies. Developed in tandem with the CATH database, the original FunTree contained just 276 superfamilies focused on enzymes. Here, we present an update of FunTree that has expanded to include 2340 superfamilies including both enzymes and proteins with non-enzymatic functions annotated by Gene Ontology (GO) terms. This allows the investigation of how novel functions have evolved within a structurally defined superfamily and provides a means to analyse trends across many superfamilies. This is done not only within the context of a protein's sequence and structure but also the relationships of their functions. New measures of functional similarity have been integrated, including for enzymes comparisons of overall reactions based on overall bond changes, reaction centres (the local environment atoms involved in the reaction) and the sub-structure similarities of the metabolites involved in the reaction and for non-enzymes semantic similarities based on the GO. To identify and highlight changes in function through evolution, ancestral character estimations are made and presented. All this is accessible through a new re-designed web interface that can be found at http://www.funtree.info.
Affiliation(s)
- Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK
- Nicholas Furnham
- Department of Pathogen Molecular Biology, London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, UK
|
21
|
Leuthaeuser JB, Knutson ST, Kumar K, Babbitt PC, Fetrow JS. Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity. Protein Sci 2015; 24:1423-39. [PMID: 26073648 PMCID: PMC4570537 DOI: 10.1002/pro.2724] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 04/10/2015] [Accepted: 06/10/2015] [Indexed: 01/27/2023]
Abstract
The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods.
Affiliation(s)
- Janelle B Leuthaeuser
- Department of Molecular Genetics and Genomics, Wake Forest University, Winston-Salem, North Carolina, 27106
- Stacy T Knutson
- Departments of Computer Science and Physics, Wake Forest University, Winston-Salem, North Carolina, 27106
- Kiran Kumar
- Departments of Computer Science and Physics, Wake Forest University, Winston-Salem, North Carolina, 27106
- Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, Institute for Quantitative Biosciences University of California San Francisco, San Francisco, California, 94158; Department of Pharmaceutical Chemistry, Institute for Quantitative Biosciences University of California San Francisco, San Francisco, California, 94158
- Jacquelyn S Fetrow
- Department of Molecular Genetics and Genomics, Wake Forest University, Winston-Salem, North Carolina, 27106; Departments of Computer Science and Physics, Wake Forest University, Winston-Salem, North Carolina, 27106; Office of the Provost, Maryland Hall 202, University of Richmond, VA, 23173
|
22
|
Ofer D, Linial M. ProFET: Feature engineering captures high-level protein functions. Bioinformatics 2015; 31:3429-36. [DOI: 10.1093/bioinformatics/btv345] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 03/04/2015] [Accepted: 05/29/2015] [Indexed: 11/13/2022]
|
23
|
Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T. DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe. BMC Bioinformatics 2015; 16:96. [PMID: 25888481 PMCID: PMC4389672 DOI: 10.1186/s12859-015-0499-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 10/07/2014] [Accepted: 02/18/2015] [Indexed: 12/27/2022]
Abstract
Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature–based enzyme functional prediction tool to assign Enzyme Commission (EC) digits. Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes. Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of “hidden enzymes” in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.
Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0499-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Tianmin Wang
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan; Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
- Hiroshi Mori
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan; Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-E3-10 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
- Chong Zhang
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
- Ken Kurokawa
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan; Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-E3-10 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
- Xin-Hui Xing
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
- Takuji Yamada
- Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
|
24
|
Jiang Y, Clark WT, Friedberg I, Radivojac P. The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. Bioinformatics 2015; 30:i609-16. [PMID: 25161254 PMCID: PMC4147924 DOI: 10.1093/bioinformatics/btu472] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Indexed: 12/31/2022]
Abstract
Motivation: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. Results: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuxiang Jiang
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
- Wyatt T Clark
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
- Iddo Friedberg
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
- Predrag Radivojac
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
|
25
|
Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem 2014; 53PB:251-276. [DOI: 10.1016/j.compbiolchem.2014.10.001] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Received: 06/12/2014] [Revised: 10/03/2014] [Accepted: 10/07/2014] [Indexed: 01/01/2023]
|
26
|
Reijnders MJ, van Heck RG, Lam CM, Scaife MA, Santos VAMD, Smith AG, Schaap PJ. Green genes: bioinformatics and systems-biology innovations drive algal biotechnology. Trends Biotechnol 2014; 32:617-26. [DOI: 10.1016/j.tibtech.2014.10.003] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 06/16/2014] [Revised: 09/30/2014] [Accepted: 10/01/2014] [Indexed: 01/18/2023]
|
27
|
Text as data: using text-based features for proteins representation and for computational prediction of their characteristics. Methods 2014; 74:54-64. [PMID: 25448299 DOI: 10.1016/j.ymeth.2014.10.027] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 04/12/2014] [Revised: 09/21/2014] [Accepted: 10/21/2014] [Indexed: 11/21/2022]
Abstract
The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein-features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.
|
28
|
Wittwer LD, Piližota I, Altenhoff AM, Dessimoz C. Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology. PeerJ 2014; 2:e607. [PMID: 25320677 PMCID: PMC4193403 DOI: 10.7717/peerj.607] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 05/30/2014] [Accepted: 09/12/2014] [Indexed: 11/20/2022]
Abstract
Orthology inference and other sequence analyses across multiple genomes typically start by performing exhaustive pairwise sequence comparisons, a process referred to as "all-against-all". As this process scales quadratically in terms of the number of sequences analysed, this step can become a bottleneck, thus limiting the number of genomes that can be simultaneously analysed. Here, we explored ways of speeding-up the all-against-all step while maintaining its sensitivity. By exploiting the transitivity of homology and, crucially, ensuring that homology is defined in terms of consistent protein subsequences, our proof-of-concept resulted in a 4× speedup while recovering >99.6% of all homologs identified by the full all-against-all procedure on empirical sequences sets. In comparison, state-of-the-art k-mer approaches are orders of magnitude faster but only recover 3-14% of all homologous pairs. We also outline ideas to further improve the speed and recall of the new approach. An open source implementation is provided as part of the OMA standalone software at http://omabrowser.org/standalone.
Affiliation(s)
- Lucas D Wittwer
- University College London, London, United Kingdom; Swiss Institute of Bioinformatics, Zurich, Switzerland; ETH Zurich, Department of Computer Science, Zurich, Switzerland
- Adrian M Altenhoff
- University College London, London, United Kingdom; Swiss Institute of Bioinformatics, Zurich, Switzerland; ETH Zurich, Department of Computer Science, Zurich, Switzerland
- Christophe Dessimoz
- University College London, London, United Kingdom; Swiss Institute of Bioinformatics, Zurich, Switzerland
|
29
|
Abstract
Proteomics techniques generate an avalanche of data and promise to satisfy biologists' long-held desire to measure absolute protein abundances on a genome-wide scale. However, can this knowledge be translated into a clearer picture of how cells invest their protein resources? This article aims to give a broad perspective on the composition of proteomes as gleaned from recent quantitative proteomics studies. We describe proteomaps, an approach for visualizing the composition of proteomes with a focus on protein abundances and functions. In proteomaps, each protein is shown as a polygon-shaped tile, with an area representing protein abundance. Functionally related proteins appear in adjacent regions. General trends in proteomes, such as the dominance of metabolism and protein production, become easily visible. We make interactive visualizations of published proteome datasets accessible at www.proteomaps.net. We suggest that evaluating the way protein resources are allocated by various organisms and cell types in different conditions will sharpen our understanding of how and why cells regulate the composition of their proteomes.
|
30
|
Becher D, Bernhardt J, Fuchs S, Riedel K. Metaproteomics to unravel major microbial players in leaf litter and soil environments: challenges and perspectives. Proteomics 2014; 13:2895-909. [PMID: 23894095 DOI: 10.1002/pmic.201300095] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Received: 03/06/2013] [Revised: 05/03/2013] [Accepted: 05/13/2013] [Indexed: 11/06/2022]
Abstract
Soil- and litter-borne microorganisms vitally contribute to biogeochemical cycles. However, changes in environmental parameters but also human interferences may alter species composition and elicit alterations in microbial activities. Soil and litter metaproteomics, implying the assignment of soil and litter proteins to specific phylogenetic and functional groups, has a great potential to provide essential new insights into the impact of microbial diversity on soil ecosystem functioning. This article will illuminate challenges and perspectives of current soil and litter metaproteomics research, starting with an introduction to an appropriate experimental design and state-of-the-art proteomics methodologies. This will be followed by a summary of important studies aimed at (i) the discovery of the major biotic drivers of leaf litter decomposition, (ii) metaproteomics analyses of rhizosphere-inhabiting microbes, and (iii) global approaches to study bioremediation processes. The review will be closed by a brief outlook on future developments and some concluding remarks, which should assist the reader to develop successful concepts for soil and litter metaproteomics studies.
Affiliation(s)
- Dörte Becher
- Ernst-Moritz-Arndt-University of Greifswald, Institute of Microbiology, Greifswald, Germany
|
31
|
Lee J, Gross SP, Lee J. Improved network community structure improves function prediction. Sci Rep 2014; 3:2197. [PMID: 23852097 PMCID: PMC3711050 DOI: 10.1038/srep02197] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 11/09/2012] [Accepted: 06/24/2013] [Indexed: 12/15/2022]
Abstract
We are overwhelmed by experimental data, and need better ways to understand large interaction datasets. While clustering related nodes in such networks (known as community detection) appears to be a promising approach, detecting such communities is computationally difficult. Further, how best to use such community information has not been determined. Here, within the context of protein function prediction, we address both issues. First, we apply a novel method that generates better modularity solutions than the current state of the art. Second, we develop a better method to use this community information to predict proteins' functions. We discuss when and why this community information is important. Our results should be useful for two distinct scientific communities: first, those using various cost functions to detect community structure, where our new optimization approach will improve solutions, and second, those working to extract novel functional information about individual nodes from large interaction datasets.
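The abstract above refers to "modularity solutions" without restating the score being optimized. As an illustrative sketch only, the standard Newman-Girvan modularity of a given partition can be computed as below; the function name, the edge-list input, and the toy graph are assumptions made for illustration, and the authors' optimization method itself is not shown.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # number of edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    for node, d in degree.items():
        degree_sum[community[node]] += d
    # Q = sum over communities of (fraction of internal edges) - (expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy example (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than placing all nodes in one community, which is the kind of comparison a modularity optimizer performs over many candidate partitions.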
Affiliation(s)
- Juyong Lee
- School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea.
|
32
|
Walking on a tissue-specific disease-protein-complex heterogeneous network for the discovery of disease-related protein complexes. BioMed Research International 2013; 2013:732650. [PMID: 24455720 PMCID: PMC3888695 DOI: 10.1155/2013/732650] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 09/11/2013] [Accepted: 10/07/2013] [Indexed: 11/29/2022]
Abstract
Besides the pinpointing of individual disease-related genes, associating protein complexes to human inherited diseases is also of great importance, because a biological function usually arises from the cooperative behaviour of multiple proteins in a protein complex. Moreover, knowledge about disease-related protein complexes could also enhance the inference of disease genes and pathogenic genetic variants. Here, we have designed a computational systems biology approach to systematically analyse potential relationships between diseases and protein complexes. First, we construct a heterogeneous network which is composed of a disease-disease similarity layer, a tissue-specific protein-protein interaction layer, and a protein complex membership layer. Then, we propose a random walk model on this disease-protein-complex network for identifying protein complexes that are related to a query disease. With a series of leave-one-out cross-validation experiments, we show that our method not only possesses high performance but also demonstrates robustness regarding the parameters and the network structure. We further predict a landscape of associations between human diseases and protein complexes. This landscape can be used to facilitate the inference of disease genes, thereby benefiting studies on pathology of diseases.
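The abstract does not spell out the random walk model, so the following is only a generic sketch of a random walk with restart on a small undirected network. The restart probability, the node names, and the single flat adjacency structure (rather than the three-layer network described above) are all assumptions made for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-12, max_iter=10000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adj: dict mapping each node to a list of its neighbours (undirected graph).
    seeds: set of query nodes; the walker jumps back to them with probability `restart`.
    """
    nodes = sorted(adj)
    if any(len(adj[n]) == 0 for n in nodes):
        raise ValueError("every node needs at least one neighbour")
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Spread the non-restarting mass of u evenly over its neighbours.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        if sum(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: a query disease D, proteins P1 and P2, complexes C1 and C2.
adj = {
    "D": ["P1"],
    "P1": ["D", "C1", "P2"],
    "C1": ["P1"],
    "P2": ["P1", "C2"],
    "C2": ["P2"],
}
scores = random_walk_with_restart(adj, seeds={"D"})
```

Ranking the complex nodes by their steady-state score places C1, which lies closer to the query disease, above C2.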
|
33
|
Abstract
Motivation: The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. Although various algorithms have been proposed for these tasks, evaluating their performance is difficult owing to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. Results: We propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein’s function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that it addresses several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.
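The abstract names misinformation, remaining uncertainty, and semantic distance without giving formulas. The sketch below assumes one plausible reading: each ontology term carries a precomputed information value, remaining uncertainty sums the information of true terms that were not predicted, misinformation sums the information of predicted terms that are not true, and the two are combined as a Euclidean norm. The term names and information values are hypothetical, and the averaging over proteins and decision thresholds needed to rank models is omitted.

```python
from math import sqrt


def remaining_uncertainty(true_terms, predicted_terms, info):
    """Information of true terms the predictor missed (recall-like quantity)."""
    return sum(info[t] for t in true_terms - predicted_terms)


def misinformation(true_terms, predicted_terms, info):
    """Information of predicted terms that are not true (precision-like quantity)."""
    return sum(info[t] for t in predicted_terms - true_terms)


def semantic_distance(true_terms, predicted_terms, info):
    """Euclidean combination of the two quantities (an assumed form)."""
    ru = remaining_uncertainty(true_terms, predicted_terms, info)
    mi = misinformation(true_terms, predicted_terms, info)
    return sqrt(ru ** 2 + mi ** 2)


# Hypothetical per-term information values, in bits.
info = {"term_a": 1.0, "term_b": 2.0, "term_c": 3.0}
true_terms = {"term_a", "term_b"}
predicted_terms = {"term_b", "term_c"}
ru = remaining_uncertainty(true_terms, predicted_terms, info)
mi = misinformation(true_terms, predicted_terms, info)
sd = semantic_distance(true_terms, predicted_terms, info)
```

Weighting errors by information rather than counting terms means that missing a specific, rarely annotated term costs more than missing a generic one.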
Affiliation(s)
- Wyatt T Clark
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN 47405, USA.
|
34
|
Ramsak Ž, Baebler Š, Rotter A, Korbar M, Mozetic I, Usadel B, Gruden K. GoMapMan: integration, consolidation and visualization of plant gene annotations within the MapMan ontology. Nucleic Acids Res 2013; 42:D1167-75. [PMID: 24194592 PMCID: PMC3965006 DOI: 10.1093/nar/gkt1056] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.8] [Indexed: 12/11/2022]
Abstract
GoMapMan (http://www.gomapman.org) is an open web-accessible resource for gene functional annotations in the plant sciences. It was developed to facilitate improvement, consolidation and visualization of gene annotations across several plant species. GoMapMan is based on the MapMan ontology, organized in the form of a hierarchical tree of biological concepts, which describe gene functions. Currently, genes of the model species Arabidopsis and three crop species (potato, tomato and rice) are included. The main features of GoMapMan are (i) dynamic and interactive gene product annotation through various curation options; (ii) consolidation of gene annotations for different plant species through the integration of orthologue group information; (iii) traceability of gene ontology changes and annotations; (iv) integration of external knowledge about genes from different public resources; and (v) providing gathered information to high-throughput analysis tools via dynamically generated export files. All of the GoMapMan functionalities are openly available, with the restriction on the curation functions, which require prior registration to ensure traceability of the implemented changes.
Affiliation(s)
- Živa Ramsak
- Department of Biotechnology and Systems Biology, National Institute of Biology, 1000 Ljubljana, Slovenia; Department of Knowledge Technologies, Jožef Stefan Institute, 1000 Ljubljana, Slovenia; Department of Biology, Institute for Biology I, RWTH Aachen University, D-52056 Aachen, Germany; and IBG-2: Plant Sciences, Institute for Bio- and Geosciences, Forschungszentrum Jülich, 52425 Jülich, Germany
|
35
|
Prediction and experimental validation of enzyme substrate specificity in protein structures. Proc Natl Acad Sci U S A 2013; 110:E4195-202. [PMID: 24145433 DOI: 10.1073/pnas.1305162110] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/18/2022]
Abstract
Structural Genomics aims to elucidate protein structures to identify their functions. Unfortunately, the variation of just a few residues can be enough to alter activity or binding specificity and limit the functional resolution of annotations based on sequence and structure; in enzymes, substrates are especially difficult to predict. Here, large-scale controls and direct experiments show that the local similarity of five or six residues selected because they are evolutionarily important and on the protein surface can suffice to identify an enzyme activity and substrate. A motif of five residues predicted that a previously uncharacterized Silicibacter sp. protein was a carboxylesterase for short fatty acyl chains, similar to hormone-sensitive-lipase-like proteins that share less than 20% sequence identity. Assays and directed mutations confirmed this activity and showed that the motif was essential for catalysis and substrate specificity. We conclude that evolutionary and structural information may be combined on a Structural Genomics scale to create motifs of mixed catalytic and noncatalytic residues that identify enzyme activity and substrate specificity.
|
36
|
Hou J, Jiang Y. Dynamically searching for a domain for protein function prediction. J Bioinform Comput Biol 2013; 11:1350008. [PMID: 23859272 DOI: 10.1142/s021972001350008x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
The availability of large amounts of protein-protein interaction (PPI) data makes it feasible to use computational approaches to predict protein functions. The base of existing computational approaches is to exploit the known function information of annotated proteins in the PPI data to predict functions of un-annotated proteins. However, these approaches consider the prediction domain (i.e. the set of proteins from which the functions are predicted) as unchangeable during the prediction procedure. This may lead to valuable information being overwhelmed by the unavoidable noise information in the PPI data when predicting protein functions, and in turn, the prediction results will be distorted. In this paper, we propose a novel method to dynamically predict protein functions from the PPI data. Our method regards the function prediction as a dynamic process of finding a suitable prediction domain, from which representative functions of the domain are selected to predict functions of un-annotated proteins. Our method exploits the topological structural information of a PPI network and the semantic relationship between protein functions to measure the relationship between proteins, dynamically select a suitable prediction domain and predict functions. The evaluation on real PPI datasets demonstrated the effectiveness of our proposed method, and generated better prediction results.
Affiliation(s)
- Jingyu Hou
- School of Information Technology, Deakin University, 221 Burwood Highway, Burwood, Victoria 3125, Australia.
|
37
|
Kotaru AR, Shameer K, Sundaramurthy P, Joshi RC. An improved hypergeometric probability method for identification of functionally linked proteins using phylogenetic profiles. Bioinformation 2013; 9:368-74. [PMID: 23750082 PMCID: PMC3669790 DOI: 10.6026/97320630009368] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/16/2013] [Accepted: 03/06/2013] [Indexed: 12/04/2022]
Abstract
Predicting the functions of proteins and alternatively spliced isoforms encoded in a genome is one of the important applications of bioinformatics in the post-genome era. Due to the practical limitation of experimental characterization of all proteins encoded in a genome using biochemical studies, bioinformatics methods provide powerful tools for function annotation and prediction. These methods also help minimize the growing sequence-to-function gap. Phylogenetic profiling is a bioinformatics approach to identify the influence of a trait across species and can be employed to infer the evolutionary history of proteins encoded in genomes. Here we propose an improved phylogenetic profile-based method which considers the co-evolution of the reference genome to derive the basic similarity measure, the background phylogeny of target genomes for profile generation and assigning weights to target genomes. The ordering of genomes and the runs of consecutive matches between the proteins were used to define phylogenetic relationships in the approach. We used the Escherichia coli K12 genome as the reference genome, and its 4195 proteins were used in the current analysis. We compared our approach with two existing methods, and our initial results show that its predictions outperform both. In addition, we have validated our method using a targeted protein-protein interaction network derived from the protein-protein interaction database STRING. Our preliminary results indicate that improvement in function prediction can be attained by using coevolution-based similarity measures and the runs on to the same scale instead of computing them in different scales. Our method can be applied at the whole-genome level for annotating hypothetical proteins from prokaryotic genomes.
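For orientation, the plain hypergeometric score that phylogenetic-profile methods of this kind start from can be sketched as follows. This is the textbook baseline, not the improved, weighted variant the abstract proposes; the function name and the toy presence/absence profiles are assumptions made for illustration.

```python
from math import comb


def cooccurrence_pvalue(profile_a, profile_b):
    """Hypergeometric tail probability of at least the observed co-occurrence.

    Each profile is a list of 0/1 flags, one per target genome, marking whether
    a homolog of the protein is present in that genome.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    total = len(profile_a)               # N: number of target genomes
    n = sum(profile_a)                   # genomes containing protein A
    m = sum(profile_b)                   # genomes containing protein B
    k = sum(1 for a, b in zip(profile_a, profile_b) if a and b)  # shared genomes
    # P(X >= k) where X is the overlap expected if the profiles were independent.
    tail = sum(comb(n, i) * comb(total - n, m - i) for i in range(k, min(n, m) + 1))
    return tail / comb(total, m)


# Hypothetical profiles across eight genomes.
identical = cooccurrence_pvalue([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])
disjoint = cooccurrence_pvalue([1, 1, 0, 0], [0, 0, 1, 1])
```

A small probability, as for the identical profiles, suggests the two proteins are gained and lost together more often than chance would predict, which is the signal used to propose a functional link.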
Affiliation(s)
- Appala Raju Kotaru
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, 247667, Roorkee, India
|
38
|
Schnoes AM, Ream DC, Thorman AW, Babbitt PC, Friedberg I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol 2013; 9:e1003063. [PMID: 23737737 PMCID: PMC3667760 DOI: 10.1371/journal.pcbi.1003063] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Received: 01/07/2013] [Accepted: 04/02/2013] [Indexed: 11/19/2022]
Abstract
The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent the “few articles - many proteins” phenomenon is. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high-throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlaps by a large amount. We also show that the information provided by high-throughput experiments is less specific than that provided by low-throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.
Experiments and observations are the vehicles used by science to understand the world around us. In the field of molecular biology, we are increasingly relying on high-throughput, genome-wide experiments to provide answers about the function of biological macromolecules. However, any experimental assay is essentially limited in the type of information it can discover. Here, we show that our increasing reliance on high-throughput experiments biases our understanding of protein function. While the primary source of information is experiments, the functions of many proteins are computationally annotated by sequence-based similarity, either directly or indirectly, to proteins whose function is experimentally determined. Therefore, any biases in experimental annotations can get amplified and entrenched in the majority of protein databases. We show here that high-throughput studies are biased towards certain aspects of protein function, and that they provide less information than low-throughput studies. While there is no clear solution to the phenomenon of bias from high-throughput experiments, recognizing its existence and its impact can help take steps to mitigate its effect.
Affiliation(s)
- Alexandra M. Schnoes
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, United States of America
- David C. Ream
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Alexander W. Thorman
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Patricia C. Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, United States of America
- Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Department of Computer Science and Software Engineering, Miami University, Oxford, Ohio, United States of America
|
39
|
López D, Pazos F. COPRED: prediction of fold, GO molecular function and functional residues at the domain level. Bioinformatics 2013; 29:1811-2. [PMID: 23720488 DOI: 10.1093/bioinformatics/btt283] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
SUMMARY Only recently the first resources devoted to the functional annotation of proteins at the domain level started to appear. The next step is to develop specific methodologies for predicting function at the domain level based on these resources, and to implement them in web servers to be used by the community. In this work, we present COPRED, a web server for the concomitant prediction of fold, molecular function and functional sites at the domain level, based on a methodology for domain molecular function prediction and a resource of domain functional annotations previously developed and benchmarked. AVAILABILITY AND IMPLEMENTATION COPRED can be freely accessed at http://csbg.cnb.csic.es/copred. The interface works in all standard web browsers. WebGL (natively supported by most browsers) is required for the in-line preview and manipulation of protein 3D structures. The website includes a detailed help section and usage examples. CONTACT pazos@cnb.csic.es.
Affiliation(s)
- Daniel López
- Systems Biology Department, Computational Systems Biology Group CNB-CSIC, c/ Darwin, 3. Cantoblanco, 28049 Madrid, Spain
|
40
|
Abstract
Disease-causing aberrations in the normal function of a gene define that gene as a disease gene. Proving a causal link between a gene and a disease experimentally is expensive and time-consuming. Comprehensive prioritization of candidate genes prior to experimental testing drastically reduces the associated costs. Computational gene prioritization is based on various pieces of correlative evidence that associate each gene with the given disease and suggest possible causal links. A fair amount of this evidence comes from high-throughput experimentation. Thus, well-developed methods are necessary to reliably deal with the quantity of information at hand. Existing gene prioritization techniques already significantly improve the outcomes of targeted experimental studies. Faster and more reliable techniques that account for novel data types are necessary for the development of new diagnostics, treatments, and cure for many diseases.
Affiliation(s)
- Yana Bromberg
- Department of Biochemistry and Microbiology, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, New Jersey, USA.
|
41
|
Bertsova YV, Fadeeva MS, Kostyrko VA, Serebryakova MV, Baykov AA, Bogachev AV. Alternative pyrimidine biosynthesis protein ApbE is a flavin transferase catalyzing covalent attachment of FMN to a threonine residue in bacterial flavoproteins. J Biol Chem 2013; 288:14276-14286. [PMID: 23558683 DOI: 10.1074/jbc.m113.455402] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Indexed: 11/06/2022]
Abstract
Na(+)-translocating NADH:quinone oxidoreductase (Na(+)-NQR) contains two flavin residues as redox-active prosthetic groups attached by a phosphoester bond to threonine residues in subunits NqrB and NqrC. We demonstrate here that flavinylation of truncated Vibrio harveyi NqrC at Thr-229 in Escherichia coli cells requires the presence of a co-expressed Vibrio apbE gene. The apbE genes cluster with genes for Na(+)-NQR and other FMN-binding flavoproteins in bacterial genomes and encode proteins with previously unknown function. Experiments with isolated NqrC and ApbE proteins confirmed that ApbE is the only protein factor required for NqrC flavinylation and also indicated that the reaction is Mg(2+)-dependent and proceeds with FAD but not FMN. Inactivation of the apbE gene in Klebsiella pneumoniae, wherein the nqr operon and apbE are well separated in the chromosome, resulted in a complete loss of the quinone reductase activity of Na(+)-NQR, consistent with its dependence on covalently bound flavin. Our data thus identify ApbE as a novel modifying enzyme, flavin transferase.
Affiliation(s)
- Yulia V Bertsova
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119992, Russia
- Maria S Fadeeva
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119992, Russia
- Vitaly A Kostyrko
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119992, Russia
- Marina V Serebryakova
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119992, Russia
- Alexander A Baykov
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119992, Russia
- Alexander V Bogachev
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119992, Russia.
|
42
|
Oberlin AT, Jurkovic DA, Balish MF, Friedberg I. Biological database of images and genomes: tools for community annotations linking image and genomic information. Database: The Journal of Biological Databases and Curation 2013; 2013:bat016. [PMID: 23550062 PMCID: PMC3708683 DOI: 10.1093/database/bat016] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/24/2023]
Abstract
Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype-genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas.
Affiliation(s)
- Andrew T Oberlin
- Department of Computer Science and Software Engineering, Miami University, Oxford, OH 45056, USA
|
43
|
Abstract
Background Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics. Results Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. 
This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool. Conclusions As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.
Affiliation(s)
- Hai Fang
- Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK.
|
44
|
Piovesan D, Martelli PL, Fariselli P, Profiti G, Zauli A, Rossi I, Casadio R. How to inherit statistically validated annotation within BAR+ protein clusters. BMC Bioinformatics 2013; 14 Suppl 3:S4. [PMID: 23514411 PMCID: PMC3584929 DOI: 10.1186/1471-2105-14-s3-s4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Abstract
Background In the genomic era a key issue is protein annotation, namely how to endow protein sequences, upon translation from the corresponding genes, with structural and functional features. Routinely this operation is electronically done by deriving and integrating information from previous knowledge. The reference database for protein sequences is UniProtKB divided into two sections, UniProtKB/TrEMBL which is automatically annotated and not reviewed and UniProtKB/Swiss-Prot which is manually annotated and reviewed. The annotation process is essentially based on sequence similarity search. The question therefore arises as to which extent annotation based on transfer by inheritance is valuable and specifically if it is possible to statistically validate inherited features when little homology exists among the target sequence and its template(s). Results In this paper we address the problem of annotating protein sequences in a statistically validated manner considering as a reference annotation resource UniProtKB. The test case is the set of 48,298 proteins recently released by the Critical Assessment of Function Annotations (CAFA) organization. We show that we can transfer after validation, Gene Ontology (GO) terms of the three main categories and Pfam domains to about 68% and 72% of the sequences, respectively. This is possible after alignment of the CAFA sequences towards BAR+, our annotation resource that allows discriminating among statistically validated and not statistically validated annotation. By comparing with a direct UniProtKB annotation, we find that besides validating annotation of some 78% of the CAFA set, we assign new and statistically validated annotation to 14.8% of the sequences and find new structural templates for about 25% of the chains, half of which share less than 30% sequence identity to the corresponding template/s. 
Conclusion Inheritance of annotation by transfer generally requires a careful selection of the identity value among the target and the template in order to transfer structural and/or functional features. Here we prove that even distantly remote homologs can be safely endowed with structural templates and GO and/or Pfam terms provided that annotation is done within clusters collecting cluster-related protein sequences and where a statistical validation of the shared structural and functional features is possible.
|
45
|
Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 2013; 14 Suppl 3:S7. [PMID: 23514582 PMCID: PMC3584931 DOI: 10.1186/1471-2105-14-s3-s7] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND Any method that de novo predicts protein function should do better than random. More challenging, it also ought to outperform simple homology-based inference. METHODS Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements. RESULTS AND CONCLUSIONS During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.
Affiliation(s)
- Tobias Hamp
- TUM, Department of Informatics, Bioinformatics & Computational Biology - I12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
|
46
|
Wong A, Shatkay H. Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge. BMC Bioinformatics 2013; 14 Suppl 3:S14. [PMID: 23514326 PMCID: PMC3584852 DOI: 10.1186/1471-2105-14-s3-s14] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/10/2022]
Abstract
BACKGROUND Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is yet unknown. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge. RESULTS We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). 
The overall prediction accuracy of Text-KNN, Base-Prior, and Base-Seq for molecular function classes is 62%, 43%, and 58%, while the overall accuracy for biological process classes is 17%, 11%, and 28%, respectively. Results obtained as part of the CAFA evaluation itself on the CAFA dataset are reported as well. CONCLUSIONS Our evaluation shows that the text-based classifier consistently outperforms the baseline classifier that is based on prior distribution, and typically has comparable performance to the baseline classifier that uses sequence similarity. Moreover, the results suggest that combining text features with other types of features can potentially lead to improved prediction performance. The preliminary results also suggest that while our text-based classifier can be used to predict both molecular function and biological process in which a protein is involved, the classifier performs significantly better for predicting molecular function than for predicting biological process. A similar trend was observed for other classifiers participating in the CAFA challenge.
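The general technique named in this abstract, a k-nearest neighbour classifier over text-based features, can be sketched as follows. This is an illustrative toy, not the Text-KNN system itself: the whitespace tokenisation, raw term counts, cosine similarity, majority vote, and the tiny training set are all assumptions, and the paper's statistical term selection is not reproduced.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words term-count vector (assumed feature scheme)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query_text, training, k=3):
    """Predict a label by majority vote among the k most similar training texts.

    training -- list of (text, label) pairs (hypothetical example data)
    """
    query = vectorize(query_text)
    ranked = sorted(
        training,
        key=lambda example: cosine(query, vectorize(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    toy_training = [
        ("kinase phosphorylation atp binding", "kinase"),
        ("atp kinase signal transduction", "kinase"),
        ("membrane transport channel ion", "transporter"),
        ("ion channel membrane potassium", "transporter"),
        ("dna binding transcription factor", "transcription"),
    ]
    print(knn_predict("potassium ion channel", toy_training))
```

A prior-based baseline of the kind the abstract compares against would ignore the query text entirely and always return the most frequent training label.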
Affiliation(s)
- Andrew Wong
- Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, ON, K7L 3N6, Canada
|
47
|
Wang Z, Cao R, Cheng J. Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks. BMC Bioinformatics 2013; 14 Suppl 3:S3. [PMID: 23514381 PMCID: PMC3584933 DOI: 10.1186/1471-2105-14-s3-s3] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 12/27/2022]
Abstract
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function predictions when no homology exists. These results show that our approach can combine the complementary strengths of the most widely used BLAST-based function prediction methods, the more sensitive profile-profile comparison-based homology detection methods that are rarely used in function prediction, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).
Affiliation(s)
- Zheng Wang
- Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA
|
48
|
Lopez D, Pazos F. Concomitant prediction of function and fold at the domain level with GO-based profiles. BMC Bioinformatics 2013; 14 Suppl 3:S12. [PMID: 23514233 PMCID: PMC3584904 DOI: 10.1186/1471-2105-14-s3-s12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/02/2022]
Abstract
Predicting the function of newly sequenced proteins is crucial due to the pace at which these raw sequences are being obtained. Almost all resources for predicting protein function assign functional terms to whole chains, and do not distinguish which particular domain is responsible for the allocated function. This is not a limitation of the methodologies themselves; it arises because the functional annotation databases that these methods use to transfer functional terms to new proteins are annotated on a whole-chain basis. Nevertheless, domains are the basic evolutionary and often functional units of proteins. In many cases, the domains of a protein chain have distinct molecular functions, independent of each other. For that reason, resources with functional annotations at the domain level are required, as well as methodologies adapted to these resources for predicting the function of individual domains. We present a methodology for predicting the molecular function of individual domains, based on a previously developed database of functional annotations at the domain level. The approach, which we show outperforms a standard method based on sequence searches in assigning function, concomitantly predicts the structural fold of the domains and can give hints on the functionally important residues associated with the predicted function.
Affiliation(s)
- Daniel Lopez
- Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), C/ Darwin 3, 28049 Madrid, Spain
|
49
|
Erdin S, Venner E, Lisewski AM, Lichtarge O. Function prediction from networks of local evolutionary similarity in protein structure. BMC Bioinformatics 2013; 14 Suppl 3:S6. [PMID: 23514548 PMCID: PMC3584919 DOI: 10.1186/1471-2105-14-s3-s6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
Background Annotating protein function with both high accuracy and sensitivity remains a major challenge in structural genomics. One proven computational strategy has been to group a few key functional amino acids into templates and search for these templates in other protein structures, so as to transfer function when a match is found. To this end, we previously developed Evolutionary Trace Annotation (ETA) and showed that diffusing known annotations over a network of template matches on a structural genomic scale improved predictions of function. In order to further increase sensitivity, we now let each protein contribute multiple templates rather than just one, and also let the template size vary. Results Retrospective benchmarks in 605 Structural Genomics enzymes showed that multiple templates increased sensitivity by up to 14% when combined with single template predictions even as they maintained the accuracy over 91%. Diffusing function globally on networks of single and multiple template matches marginally increased the area under the ROC curve over 0.97, but in a subset of proteins that could not be annotated by ETA, the network approach recovered annotations for the most confident 20-23 of 91 cases with 100% accuracy. Conclusions We improve the accuracy and sensitivity of predictions by using multiple templates per protein structure when constructing networks of ETA matches and diffusing annotations.
Affiliation(s)
- Serkan Erdin
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA
|
50
|
Li W, Cong Q, Kinch LN, Grishin NV. Seq2Ref: a web server to facilitate functional interpretation. BMC Bioinformatics 2013; 14:30. [PMID: 23356573 PMCID: PMC3573977 DOI: 10.1186/1471-2105-14-30] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/03/2012] [Accepted: 01/15/2013] [Indexed: 01/01/2023]
Abstract
Background The size of the protein sequence database has been exponentially increasing due to advances in genome sequencing. However, experimentally characterized proteins only constitute a small portion of the database, such that the majority of sequences have been annotated by computational approaches. Current automatic annotation pipelines inevitably introduce errors, making the annotations unreliable. Instead of such error-prone automatic annotations, functional interpretation should rely on annotations of ‘reference proteins’ that have been experimentally characterized or manually curated. Results The Seq2Ref server uses BLAST to detect proteins homologous to a query sequence and identifies the reference proteins among them. Seq2Ref then reports publications with experimental characterizations of the identified reference proteins that might be relevant to the query. Furthermore, a plurality-based rating system is developed to evaluate the homologous relationships and rank the reference proteins by their relevance to the query. Conclusions The reference proteins detected by our server will lend insight into proteins of unknown function and provide extensive information to develop in-depth understanding of uncharacterized proteins. Seq2Ref is available at: http://prodata.swmed.edu/seq2ref.
Affiliation(s)
- Wenlin Li
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
|