Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Rentzsch R, Orengo CA. Protein function prediction--the power of multiplicity. Trends Biotechnol 2009;27:210-9. [PMID: 19251332 DOI: 10.1016/j.tibtech.2009.01.002] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2008] [Revised: 01/21/2009] [Accepted: 01/23/2009] [Indexed: 01/07/2023]

For:	Rentzsch R, Orengo CA. Protein function prediction--the power of multiplicity. Trends Biotechnol 2009;27:210-9. [PMID: 19251332 DOI: 10.1016/j.tibtech.2009.01.002] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2008] [Revised: 01/21/2009] [Accepted: 01/23/2009] [Indexed: 01/07/2023]

Number

Cited by Other Article(s)

Tang W, Gui C, Zhang T. Expression, Purification, and Bioinformatic Prediction of Mycobacterium tuberculosis Rv0439c as a Potential NADP⁺-Retinol Dehydrogenase. Mol Biotechnol 2023:10.1007/s12033-023-00956-z. [PMID: 37989944 DOI: 10.1007/s12033-023-00956-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2023] [Accepted: 10/23/2023] [Indexed: 11/23/2023]

Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: A case study of BK channels. PLoS Comput Biol 2023;19:e1011460. [PMID: 37713443 PMCID: PMC10529646 DOI: 10.1371/journal.pcbi.1011460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2023] [Revised: 09/27/2023] [Accepted: 08/24/2023] [Indexed: 09/17/2023] Open

Abstract

Machine learning has played transformative roles in numerous chemical and biophysical problems such as protein folding where large amount of data exists. Nonetheless, many important problems remain challenging for data-driven machine learning approaches due to the limitation of data scarcity. One approach to overcome data scarcity is to incorporate physical principles such as through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ∆V1/2, with a RMSE ~ 32 mV and correlation coefficient of R ~ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ∆V1/2 agree quantitatively with prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.

Collapse

Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: a case study of BK channels. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.24.546384. [PMID: 37425916 PMCID: PMC10327070 DOI: 10.1101/2023.06.24.546384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/11/2023]

Abstract

Machine learning has played transformative roles in numerous chemical and biophysical problems such as protein folding where large amount of data exists. Nonetheless, many important problems remain challenging for data-driven machine learning approaches due to the limitation of data scarcity. One approach to overcome data scarcity is to incorporate physical principles such as through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV 1/2 , with a RMSE ∼ 32 mV and correlation coefficient of R ∼ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V 1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV 1/2 agree quantitatively with prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.

Author Summary

Deep machine learning has brought many exciting breakthroughs in chemistry, physics and biology. These models require large amount of training data and struggle when the data is scarce. The latter is true for predictive modeling of the function of complex proteins such as ion channels, where only hundreds of mutational data may be available. Using the big potassium (BK) channel as a biologically important model system, we demonstrate that a reliable predictive model of its voltage gating property could be derived from only 473 mutational data by incorporating physics-derived features, which include dynamic properties from molecular dynamics simulations and energetic quantities from Rosetta mutation calculations. We show that the final random forest model captures key trends and hotspots in mutational effects of BK voltage gating, such as the important role of pore hydrophobicity. A particularly curious prediction is that mutations of two adjacent residues on the S5 helix would always have opposite effects on the gating voltage, which was confirmed by experimental characterization of four novel mutations. The current work demonstrates the importance and effectiveness of incorporating physics in predictive modeling of protein function with scarce data.

Collapse

Titus-McQuillan JE, Nanni AV, McIntyre LM, Rogers RL. Estimating transcriptome complexities across eukaryotes. BMC Genomics 2023;24:254. [PMID: 37170194 PMCID: PMC10173493 DOI: 10.1186/s12864-023-09326-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2023] [Accepted: 04/20/2023] [Indexed: 05/13/2023] Open

Abstract

BACKGROUND

Genomic complexity is a growing field of evolution, with case studies for comparative evolutionary analyses in model and emerging non-model systems. Understanding complexity and the functional components of the genome is an untapped wealth of knowledge ripe for exploration. With the "remarkable lack of correspondence" between genome size and complexity, there needs to be a way to quantify complexity across organisms. In this study, we use a set of complexity metrics that allow for evaluating changes in complexity using TranD.

RESULTS

We ascertain if complexity is increasing or decreasing across transcriptomes and at what structural level, as complexity varies. In this study, we define three metrics - TpG, EpT, and EpG- to quantify the transcriptome's complexity that encapsulates the dynamics of alternative splicing. Here we compare complexity metrics across 1) whole genome annotations, 2) a filtered subset of orthologs, and 3) novel genes to elucidate the impacts of orthologs and novel genes in transcript model analysis. Effective Exon Number (EEN) issued to compare the distribution of exon sizes within transcripts against random expectations of uniform exon placement. EEN accounts for differences in exon size, which is important because novel gene differences in complexity for orthologs and whole-transcriptome analyses are biased towards low-complexity genes with few exons and few alternative transcripts.

CONCLUSIONS

With our metric analyses, we are able to quantify changes in complexity across diverse lineages with greater precision and accuracy than previous cross-species comparisons under ortholog conditioning. These analyses represent a step toward whole-transcriptome analysis in the emerging field of non-model evolutionary genomics, with key insights for evolutionary inference of complexity changes on deep timescales across the tree of life. We suggest a means to quantify biases generated in ortholog calling and correct complexity analysis for lineage-specific effects. With these metrics, we directly assay the quantitative properties of newly formed lineage-specific genes as they lower complexity.

Collapse

Vu TTD, Jung J. Protein function prediction with gene ontology: from traditional to deep learning models. PeerJ 2021;9:e12019. [PMID: 34513334 PMCID: PMC8395570 DOI: 10.7717/peerj.12019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 07/29/2021] [Indexed: 11/25/2022] Open

Bordin N, Sillitoe I, Lees JG, Orengo C. Tracing Evolution Through Protein Structures: Nature Captured in a Few Thousand Folds. Front Mol Biosci 2021;8:668184. [PMID: 34041266 PMCID: PMC8141709 DOI: 10.3389/fmolb.2021.668184] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Accepted: 04/27/2021] [Indexed: 11/13/2022] Open

Barot M, Gligorijević V, Cho K, Bonneau R. NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity. Bioinformatics 2021;37:2414-2422. [PMID: 33576802 PMCID: PMC8388039 DOI: 10.1093/bioinformatics/btab098] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2020] [Revised: 02/04/2021] [Accepted: 02/09/2021] [Indexed: 02/02/2023] Open

Abstract

Motivation

Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks.

Results

In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critial Assessment of Functional Annotation. We are able to demonstrate that our approach performs well even in cases where a species has no network information available: when an organism’s PPI network is left out we can use our multi-species method to make predictions for the left-out organism with good performance.

Availability and implementation

The code is freely available at https://github.com/nowittynamesleft/NetQuilt. The data, including sequences, PPI networks and GO annotations are available at https://string-db.org/.

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse

MacDougall A, Volynkin V, Saidi R, Poggioli D, Zellner H, Hatton-Ellis E, Joshi V, O’Donovan C, Orchard S, Auchincloss AH, Baratin D, Bolleman J, Coudert E, de Castro E, Hulo C, Masson P, Pedruzzi I, Rivoire C, Arighi C, Wang Q, Chen C, Huang H, Garavelli J, Vinayaka CR, Yeh LS, Natale DA, Laiho K, Martin MJ, Renaux A, Pichler K. UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase. Bioinformatics 2020;36:4643-4648. [PMID: 32399560 PMCID: PMC7750954 DOI: 10.1093/bioinformatics/btaa485] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2020] [Revised: 04/13/2020] [Accepted: 05/05/2020] [Indexed: 11/13/2022] Open

Affiliation(s)

Alistair MacDougall European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Vladimir Volynkin European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Rabie Saidi European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Diego Poggioli European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK Kantar Consulting, Casalecchio Di Reno, 40033 Bologna, Italy
Hermann Zellner European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Emma Hatton-Ellis European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Vishal Joshi European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Claire O’Donovan European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Sandra Orchard European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Andrea H Auchincloss SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Delphine Baratin SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Jerven Bolleman SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Elisabeth Coudert SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Edouard de Castro SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Chantal Hulo SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Patrick Masson SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Ivo Pedruzzi SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Catherine Rivoire SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
Cecilia Arighi Protein Information Resource, University of Delaware, Newark, DE 19711, USA
Qinghua Wang Protein Information Resource, University of Delaware, Newark, DE 19711, USA
Chuming Chen Protein Information Resource, University of Delaware, Newark, DE 19711, USA
Hongzhan Huang Protein Information Resource, University of Delaware, Newark, DE 19711, USA
John Garavelli Protein Information Resource, University of Delaware, Newark, DE 19711, USA
C R Vinayaka Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
Lai-Su Yeh Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
Darren A Natale Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
Kati Laiho Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
Maria-Jesus Martin European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Alexandre Renaux European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Klemens Pichler European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Collapse

Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, Lewis KA, Georghiou G, Nguyen HN, Hamid MN, Davis L, Dogan T, Atalay V, Rifaioglu AS, Dalkıran A, Cetin Atalay R, Zhang C, Hurto RL, Freddolino PL, Zhang Y, Bhat P, Supek F, Fernández JM, Gemovic B, Perovic VR, Davidović RS, Sumonja N, Veljkovic N, Asgari E, Mofrad MRK, Profiti G, Savojardo C, Martelli PL, Casadio R, Boecker F, Schoof H, Kahanda I, Thurlby N, McHardy AC, Renaux A, Saidi R, Gough J, Freitas AA, Antczak M, Fabris F, Wass MN, Hou J, Cheng J, Wang Z, Romero AE, Paccanaro A, Yang H, Goldberg T, Zhao C, Holm L, Törönen P, Medlar AJ, Zosa E, Borukhov I, Novikov I, Wilkins A, Lichtarge O, Chi PH, Tseng WC, Linial M, Rose PW, Dessimoz C, Vidulin V, Dzeroski S, Sillitoe I, Das S, Lees JG, Jones DT, Wan C, Cozzetto D, Fa R, Torres M, Warwick Vesztrocy A, Rodriguez JM, Tress ML, Frasca M, Notaro M, Grossi G, Petrini A, Re M, Valentini G, Mesiti M, Roche DB, Reeb J, Ritchie DW, Aridhi S, Alborzi SZ, Devignes MD, Koo DCE, Bonneau R, Gligorijević V, Barot M, Fang H, Toppo S, Lavezzo E, Falda M, Berselli M, Tosatto SCE, Carraro M, Piovesan D, Ur Rehman H, Mao Q, Zhang S, Vucetic S, Black GS, Jo D, Suh E, Dayton JB, Larsen DJ, Omdahl AR, McGuffin LJ, Brackenridge DA, Babbitt PC, Yunes JM, Fontana P, Zhang F, Zhu S, You R, Zhang Z, Dai S, Yao S, Tian W, Cao R, Chandler C, Amezola M, Johnson D, Chang JM, Liao WH, Liu YW, Pascarelli S, Frank Y, Hoehndorf R, Kulmanov M, Boudellioua I, Politano G, Di Carlo S, Benso A, Hakala K, Ginter F, Mehryary F, Kaewphan S, Björne J, Moen H, Tolvanen MEE, Salakoski T, Kihara D, Jain A, Šmuc T, Altenhoff A, Ben-Hur A, Rost B, Brenner SE, Orengo CA, Jeffery CJ, Bosco G, Hogan DA, Martin MJ, O'Donovan C, Mooney SD, Greene CS, Radivojac P, Friedberg I. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 2019;20:244. [PMID: 31744546 PMCID: PMC6864930 DOI: 10.1186/s13059-019-1835-8] [Citation(s) in RCA: 187] [Impact Index Per Article: 37.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Accepted: 09/24/2019] [Indexed: 12/23/2022] Open

Affiliation(s)

Naihui Zhou Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Bioinformatics and Computational Biology, Ames, IA, USA
Yuxiang Jiang Indiana University Bloomington, Bloomington, Indiana, USA
Timothy R Bergquist Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
Alexandra J Lee Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
Balint Z Kacsoh Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Molecular and Systems Biology, Hanover, NH, USA
Alex W Crocker Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
Kimberley A Lewis Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
George Georghiou European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
Huy N Nguyen Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Computer Science, Ames, IA, USA
Md Nafiz Hamid Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Bioinformatics and Computational Biology, Ames, IA, USA
Larry Davis Program in Bioinformatics and Computational Biology, Ames, IA, USA
Tunca Dogan Department of Computer Engineering, Hacettepe University, Ankara, Turkey.,European Molecular Biolo gy Labora tory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
Volkan Atalay Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
Ahmet S Rifaioglu Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey.,Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey
Alperen Dalkıran Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
Rengul Cetin Atalay CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
Chengxin Zhang Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
Rebecca L Hurto Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
Peter L Freddolino Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
Yang Zhang Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
Prajwal Bhat Achira Labs, Bangalore, India
Fran Supek Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
José M Fernández INB Coordination Unit, Life Sciences Department, Barcelona Supercomputing Center, Barcelona, Catalonia, Spain.,(former) INB GN2, Structural and Computational Biology Programme, Spanish National Cancer Research Centre, Barcelona, Catalonia, Spain
Branislava Gemovic Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
Vladimir R Perovic Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
Radoslav S Davidović Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
Neven Sumonja Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
Nevena Veljkovic Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
Ehsaneddin Asgari Molecular Cell Biomechanics Laboratory, Departments of Bioengineering, University of California Berkeley, Berkeley, CA, USA.,Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Berkeley, CA, USA
Mohammad R K Mofrad Departments of Bioengineering and Mechanical Engineering, Berkeley, CA, USA
Giuseppe Profiti Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy.,National Research Council, IBIOM, Bologna, Italy
Castrense Savojardo Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
Pier Luigi Martelli Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
Rita Casadio Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
Florian Boecker University of Bonn: INRES Crop Bioinformatics, Bonn, North Rhine-Westphalia, Germany
Heiko Schoof INRES Crop Bioinformatics, University of Bonn, Bonn, Germany
Indika Kahanda Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
Natalie Thurlby University of Bristol, Computer Science, Bristol, Bristol, United Kingdom
Alice C McHardy Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany.,RESIST, DFG Cluster of Excellence 2155, Brunswick, Germany
Alexandre Renaux Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles - Vrije Universiteit Brussel, Brussels, Belgium.,Machine Learning Group, Université libre de Bruxelles, Brussels, Belgium.,Artificial Intelligence lab, Vrije Universiteit Brussel, Brussels, Belgium
Rabie Saidi European Molecular Biolo gy Labora tory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
Julian Gough MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
Alex A Freitas University of Kent, School of Computing, Canterbury, United Kingdom
Magdalena Antczak School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
Fabio Fabris University of Kent, School of Computing, Canterbury, United Kingdom
Mark N Wass School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
Jie Hou University of Missouri, Computer Science, Columbia, Missouri, USA.,Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
Jianlin Cheng Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
Zheng Wang University of Miami, Coral Gables, Florida, USA
Alfonso E Romero Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
Alberto Paccanaro Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
Haixuan Yang School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Galway, Ireland.,Technical University of Munich, Garching, Germany
Tatyana Goldberg Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
Chenguang Zhao Faculty for Informatics, Garching, Germany.,Department for Bioinformatics and Computational Biology, Garching, Germany.,School of Computing Sciences and Computer Engineering, Hattiesburg, Mississippi, USA
Liisa Holm Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
Petri Törönen Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
Alan J Medlar Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
Elaine Zosa Institute of Biotechnology, University of Helsinki, Helsinki, Finland
Itamar Borukhov Compugen Ltd., Holon, Israel
Ilya Novikov Baylor College of Medicine, Department of Biochemistry and Molecular Biology, Houston, TX, USA
Angela Wilkins Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
Olivier Lichtarge Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
Po-Han Chi National TsingHua University, Hsinchu, Taiwan
Wei-Cheng Tseng Department of Electrical Engineering in National Tsing Hua University, Hsinchu City, Taiwan
Michal Linial The Hebrew University of Jerusalem, Jerusalem, Israel
Peter W Rose University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA
Christophe Dessimoz Department of Computational Biology and Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Genetics, Evolution & Environment, and Department of Computer Science, University College London, London, UK.,Swiss Institute of Bioinformatics, Lausanne, Switzerland
Vedrana Vidulin Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
Saso Dzeroski Jozef Stefan Institute, Ljubljana, Slovenia.,Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
Ian Sillitoe Research Department of Structural and Molecular Biology, University College London, London, England
Sayoni Das Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
Jonathan Gill Lees Research Department of Structural and Molecular Biology, University College London, London, United Kingdom.,Department of Health and Life Sciences, Oxford Brookes University, London, UK
David T Jones The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom.,Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom
Cen Wan Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
Domenico Cozzetto Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
Rui Fa Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
Mateo Torres Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
Alex Warwick Vesztrocy Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom.,SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
Jose Manuel Rodriguez Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid, Spain
Michael L Tress Spanish National Cancer Research Centre (CNIO), Madrid, Spain
Marco Frasca Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
Marco Notaro Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
Giuliano Grossi Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
Alessandro Petrini Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
Matteo Re Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
Giorgio Valentini Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
Marco Mesiti Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy.,Institut de Biologie Computationnelle, LIRMM, CNRS-UMR 5506, Universite de Montpellier, Montpellier, France
Daniel B Roche Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
Jonas Reeb Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
David W Ritchie University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
Sabeur Aridhi University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
Seyed Ziaeddin Alborzi University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France.,Inria, Nancy, France
Marie-Dominique Devignes University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France.,University of Lorraine, Nancy, Lorraine, France.,Inria, Nancy, France
Da Chen Emily Koo Department of Biology, New York University, New York, NY, USA
Richard Bonneau NYU Center for Data Science, New York, 10010, NY, USA.,Flatiron Institute, CCB, New York, 10010, NY, USA
Vladimir Gligorijević Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, New York, USA
Meet Barot Center for Data Science, New York University, New York, 10011, NY, USA
Hai Fang Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Stefano Toppo Department of Molecular Medicine, University of Padova, Padova, Italy
Enrico Lavezzo Department of Molecular Medicine, University of Padova, Padova, Italy
Marco Falda Department of Biology, University of Padova, Padova, Italy
Michele Berselli Department of Molecular Medicine, University of Padova, Padova, Italy
Silvio C E Tosatto CNR Institute of Neuroscience, Padova, Italy.,Department of Biomedical Sciences, University of Padua, Padova, Italy
Marco Carraro Department of Biomedical Sciences, University of Padua, Padova, Italy
Damiano Piovesan Department of Biomedical Sciences, University of Padua, Padova, Italy
Hafeez Ur Rehman Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Khyber Pakhtoonkhwa, Pakistan
Qizhong Mao Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA.,University of California, Riverside, Philadelphia, PA, USA
Shanshan Zhang Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
Slobodan Vucetic Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
Gage S Black Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
Dane Jo Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
Erica Suh Department of Biology, Brigham Young University, Provo, UT, USA
Jonathan B Dayton Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
Dallas J Larsen Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
Ashton R Omdahl Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
Liam J McGuffin School of Biological Sciences, University of Reading, Reading, England, United Kingdom
Danielle A Brackenridge School of Biological Sciences, University of Reading, Reading, England, United Kingdom
Patricia C Babbitt Department of Pharmaceutical Chemistry, San Francisco, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
Jeffrey M Yunes UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, San Francisco, 94158, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
Paolo Fontana Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
Feng Zhang State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, Shanghai, China.,Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China
Shanfeng Zhu School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
Ronghui You School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
Zihan Zhang School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
Suyang Dai School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
Shuwei Yao School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China
Weidong Tian State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China.,Department of Pediatrics, Brain Tumor Center, Division of Experimental Hematology and Cancer Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
Renzhi Cao Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
Caleb Chandler Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
Miguel Amezola Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
Devon Johnson Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
Jia-Ming Chang Department of Computer Science, National Chengchi University, Taipei, Taiwan
Wen-Hung Liao Department of Computer Science, National Chengchi University, Taipei, Taiwan
Yi-Wei Liu Department of Computer Science, National Chengchi University, Taipei, Taiwan
Stefano Pascarelli Okinawa Institute of Science and Technology, Tancha, Okinawa, Japan
Yotam Frank Tel Aviv University, Tel Aviv, Israel
Robert Hoehndorf Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
Maxat Kulmanov Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
Imane Boudellioua Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.,Computer, Electrical and Mathematical Sciences Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Gianfranco Politano Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
Stefano Di Carlo Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
Alfredo Benso Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
Kai Hakala Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
Filip Ginter Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku, Turku, Finland
Farrokh Mehryary Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
Suwisa Kaewphan Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland.,Turku Centre for Computer Science (TUCS), Turku, Finland
Jari Björne Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
Hans Moen University of Turku, Turku, Finland
Martti E E Tolvanen Department of Future Technologies, University of Turku, Turku, Finland
Tapio Salakoski Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
Daisuke Kihara Department of Biological Sciences, Department of Computer Science, Purdue University, 47907, IN, USA.,Department of Pediatrics, University of Cincinnati, Cincinnati, 45229, OH, USA
Aashish Jain Department of Computer Science, Purdue University, West Lafayette, IN, USA
Tomislav Šmuc Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
Adrian Altenhoff Department of Computer Science, ETH Zurich, Zurich, Switzerland.,SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Asa Ben-Hur Department of Computer Science, Colorado State University, Fort Collins, CO, USA
Burkhard Rost Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany.,Institute for Food and Plant Sciences WZW, Technische Universität München, Freising, Germany
Steven E Brenner University of California, Berkeley, CA, USA
Christine A Orengo Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
Constance J Jeffery Biological Sciences, University of Illinois at Chicago, Chicago, Illinois, USA
Giovanni Bosco Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
Deborah A Hogan Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
Maria J Martin European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
Claire O'Donovan European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
Sean D Mooney Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
Casey S Greene Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.,Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, Pennsylvania, USA
Predrag Radivojac Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.
Iddo Friedberg Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.

Collapse

Teso S, Masera L, Diligenti M, Passerini A. Combining learning and constraints for genome-wide protein annotation. BMC Bioinformatics 2019;20:338. [PMID: 31208327 PMCID: PMC6580517 DOI: 10.1186/s12859-019-2875-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2018] [Accepted: 05/03/2019] [Indexed: 11/28/2022] Open

Abstract

Background

The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraints in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale.

Results

We present Ocelot, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as Ocelot), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2875-5) contains supplementary material, which is available to authorized users.

Collapse

Fodeh SJ, Tiwari A. Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification. J Biomed Inform 2018;86:160-166. [PMID: 30130573 DOI: 10.1016/j.jbi.2018.08.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2017] [Revised: 08/13/2018] [Accepted: 08/17/2018] [Indexed: 11/25/2022]

Vignolini T, Mengoni A, Fondi M. Template-Assisted Metabolic Reconstruction and Assembly of Hybrid Bacterial Models. Methods Mol Biol 2018;1716:177-196. [PMID: 29222754 DOI: 10.1007/978-1-4939-7528-0_8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

HashGO: hashing gene ontology for protein function prediction. Comput Biol Chem 2017;71:264-273. [DOI: 10.1016/j.compbiolchem.2017.09.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2017] [Accepted: 09/25/2017] [Indexed: 10/18/2022]

ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017;22:molecules22101732. [PMID: 29039790 PMCID: PMC6151571 DOI: 10.3390/molecules22101732] [Citation(s) in RCA: 114] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 11/25/2022] Open

Community-Wide Evaluation of Computational Function Prediction. Methods Mol Biol 2017;1446:133-146. [PMID: 27812940 DOI: 10.1007/978-1-4939-3743-1_10] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]

Gaudet P, Dessimoz C. Gene Ontology: Pitfalls, Biases, and Remedies. Methods Mol Biol 2017;1446:189-205. [PMID: 27812944 DOI: 10.1007/978-1-4939-3743-1_14] [Citation(s) in RCA: 77] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]

Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 2016;590:2327-41. [PMID: 27423136 PMCID: PMC5937700 DOI: 10.1002/1873-3468.12307] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2016] [Revised: 07/12/2016] [Accepted: 07/12/2016] [Indexed: 12/21/2022]

Cao R, Cheng J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016;93:84-91. [PMID: 26370280 PMCID: PMC4894840 DOI: 10.1016/j.ymeth.2015.09.011] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2015] [Revised: 09/03/2015] [Accepted: 09/10/2015] [Indexed: 11/30/2022] Open

Das S, Orengo CA. Protein function annotation using protein domain family resources. Methods 2016;93:24-34. [DOI: 10.1016/j.ymeth.2015.09.029] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2015] [Revised: 09/28/2015] [Accepted: 09/29/2015] [Indexed: 01/25/2023] Open

Sillitoe I, Furnham N. FunTree: advances in a resource for exploring and contextualising protein function evolution. Nucleic Acids Res 2015;44:D317-23. [PMID: 26590404 PMCID: PMC4702901 DOI: 10.1093/nar/gkv1274] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2015] [Accepted: 11/03/2015] [Indexed: 11/13/2022] Open

Leuthaeuser JB, Knutson ST, Kumar K, Babbitt PC, Fetrow JS. Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity. Protein Sci 2015;24:1423-39. [PMID: 26073648 PMCID: PMC4570537 DOI: 10.1002/pro.2724] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2015] [Accepted: 06/10/2015] [Indexed: 01/27/2023]

Abstract

The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods.

Collapse

Ofer D, Linial M. ProFET: Feature engineering captures high-level protein functions. Bioinformatics 2015;31:3429-36. [DOI: 10.1093/bioinformatics/btv345] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2015] [Accepted: 05/29/2015] [Indexed: 11/13/2022] Open

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T. DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe. BMC Bioinformatics 2015;16:96. [PMID: 25888481 PMCID: PMC4389672 DOI: 10.1186/s12859-015-0499-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2014] [Accepted: 02/18/2015] [Indexed: 12/27/2022] Open

Abstract

Background

Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature–based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results

DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions

Our results offer preliminarily confirmation of the existence of the hypothesized huge number of “hidden enzymes” in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0499-y) contains supplementary material, which is available to authorized users.

Collapse

Jiang Y, Clark WT, Friedberg I, Radivojac P. The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. ACTA ACUST UNITED AC 2015;30:i609-16. [PMID: 25161254 PMCID: PMC4147924 DOI: 10.1093/bioinformatics/btu472] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]

Abstract

Motivation: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy.

Results: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable.

Contact:predrag@indiana.edu

Supplementary information:Supplementary data are available at Bioinformatics online.

Collapse

Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem 2014;53PB:251-276. [DOI: 10.1016/j.compbiolchem.2014.10.001] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2014] [Revised: 10/03/2014] [Accepted: 10/07/2014] [Indexed: 01/01/2023]

Reijnders MJ, van Heck RG, Lam CM, Scaife MA, Santos VAMD, Smith AG, Schaap PJ. Green genes: bioinformatics and systems-biology innovations drive algal biotechnology. Trends Biotechnol 2014;32:617-26. [DOI: 10.1016/j.tibtech.2014.10.003] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2014] [Revised: 09/30/2014] [Accepted: 10/01/2014] [Indexed: 01/18/2023]

Text as data: using text-based features for proteins representation and for computational prediction of their characteristics. Methods 2014;74:54-64. [PMID: 25448299 DOI: 10.1016/j.ymeth.2014.10.027] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2014] [Revised: 09/21/2014] [Accepted: 10/21/2014] [Indexed: 11/21/2022] Open

Wittwer LD, Piližota I, Altenhoff AM, Dessimoz C. Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology. PeerJ 2014;2:e607. [PMID: 25320677 PMCID: PMC4193403 DOI: 10.7717/peerj.607] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2014] [Accepted: 09/12/2014] [Indexed: 11/20/2022] Open

Visual account of protein investment in cellular functions. Proc Natl Acad Sci U S A 2014;111:8488-93. [PMID: 24889604 DOI: 10.1073/pnas.1314810111] [Citation(s) in RCA: 216] [Impact Index Per Article: 21.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open

Becher D, Bernhardt J, Fuchs S, Riedel K. Metaproteomics to unravel major microbial players in leaf litter and soil environments: challenges and perspectives. Proteomics 2014;13:2895-909. [PMID: 23894095 DOI: 10.1002/pmic.201300095] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2013] [Revised: 05/03/2013] [Accepted: 05/13/2013] [Indexed: 11/06/2022]

Lee J, Gross SP, Lee J. Improved network community structure improves function prediction. Sci Rep 2014;3:2197. [PMID: 23852097 PMCID: PMC3711050 DOI: 10.1038/srep02197] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2012] [Accepted: 06/24/2013] [Indexed: 12/15/2022] Open

Walking on a tissue-specific disease-protein-complex heterogeneous network for the discovery of disease-related protein complexes. BIOMED RESEARCH INTERNATIONAL 2013;2013:732650. [PMID: 24455720 PMCID: PMC3888695 DOI: 10.1155/2013/732650] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/11/2013] [Accepted: 10/07/2013] [Indexed: 11/29/2022]

Clark WT, Radivojac P. Information-theoretic evaluation of predicted ontological annotations. Bioinformatics 2013;29:i53-61. [PMID: 23813009 PMCID: PMC3694662 DOI: 10.1093/bioinformatics/btt228] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open

Ramsak Ž, Baebler Š, Rotter A, Korbar M, Mozetic I, Usadel B, Gruden K. GoMapMan: integration, consolidation and visualization of plant gene annotations within the MapMan ontology. Nucleic Acids Res 2013;42:D1167-75. [PMID: 24194592 PMCID: PMC3965006 DOI: 10.1093/nar/gkt1056] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open

Prediction and experimental validation of enzyme substrate specificity in protein structures. Proc Natl Acad Sci U S A 2013;110:E4195-202. [PMID: 24145433 DOI: 10.1073/pnas.1305162110] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Hou J, Jiang Y. Dynamically searching for a domain for protein function prediction. J Bioinform Comput Biol 2013;11:1350008. [PMID: 23859272 DOI: 10.1142/s021972001350008x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Kotaru AR, Shameer K, Sundaramurthy P, Joshi RC. An improved hypergeometric probability method for identification of functionally linked proteins using phylogenetic profiles. Bioinformation 2013;9:368-74. [PMID: 23750082 PMCID: PMC3669790 DOI: 10.6026/97320630009368] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2013] [Accepted: 03/06/2013] [Indexed: 12/04/2022] Open

Schnoes AM, Ream DC, Thorman AW, Babbitt PC, Friedberg I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol 2013;9:e1003063. [PMID: 23737737 PMCID: PMC3667760 DOI: 10.1371/journal.pcbi.1003063] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2013] [Accepted: 04/02/2013] [Indexed: 11/19/2022] Open

Abstract

The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent is the “few articles - many proteins” phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlap by a large amount. We also show that the information provided by high throughput experiments is less specific than those provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.

Experiments and observations are the vehicles used by science to understand the world around us. In the field of molecular biology, we are increasingly relying on high-throughput, genome-wide experiments to provide answers about the function of biological macromolecules. However, any experimental assay is essentially limited in the type of information it can discover. Here, we show that our increasing reliance on high-throughput experiments biases our understanding of protein function. While the primary source of information is experiments, the functions of many proteins are computationally annotated by sequence-based similarity, either directly or indirectly, to proteins whose function is experimentally determined. Therefore, any biases in experimental annotations can get amplified and entrenched in the majority of protein databases. We show here that high-throughput studies are biased towards certain aspects of protein function, and that they provide less information than low-throughput studies. While there is no clear solution to the phenomenon of bias from high-throughput experiments, recognizing its existence and its impact can help take steps to mitigate its effect.

Collapse

López D, Pazos F. COPRED: prediction of fold, GO molecular function and functional residues at the domain level. Bioinformatics 2013;29:1811-2. [PMID: 23720488 DOI: 10.1093/bioinformatics/btt283] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Bromberg Y. Chapter 15: disease gene prioritization. PLoS Comput Biol 2013;9:e1002902. [PMID: 23633938 PMCID: PMC3635969 DOI: 10.1371/journal.pcbi.1002902] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open

Bertsova YV, Fadeeva MS, Kostyrko VA, Serebryakova MV, Baykov AA, Bogachev AV. Alternative pyrimidine biosynthesis protein ApbE is a flavin transferase catalyzing covalent attachment of FMN to a threonine residue in bacterial flavoproteins. J Biol Chem 2013;288:14276-14286. [PMID: 23558683 DOI: 10.1074/jbc.m113.455402] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open

Oberlin AT, Jurkovic DA, Balish MF, Friedberg I. Biological database of images and genomes: tools for community annotations linking image and genomic information. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013;2013:bat016. [PMID: 23550062 PMCID: PMC3708683 DOI: 10.1093/database/bat016] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]

Fang H, Gough J. A domain-centric solution to functional genomics via dcGO Predictor. BMC Bioinformatics 2013;14 Suppl 3:S9. [PMID: 23514627 PMCID: PMC3584936 DOI: 10.1186/1471-2105-14-s3-s9] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open

Abstract

Background

Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics.

Results

Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool.

Conclusions

As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.

Collapse

Piovesan D, Martelli PL, Fariselli P, Profiti G, Zauli A, Rossi I, Casadio R. How to inherit statistically validated annotation within BAR+ protein clusters. BMC Bioinformatics 2013;14 Suppl 3:S4. [PMID: 23514411 PMCID: PMC3584929 DOI: 10.1186/1471-2105-14-s3-s4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Abstract

Background

In the genomic era a key issue is protein annotation, namely how to endow protein sequences, upon translation from the corresponding genes, with structural and functional features. Routinely this operation is electronically done by deriving and integrating information from previous knowledge. The reference database for protein sequences is UniProtKB divided into two sections, UniProtKB/TrEMBL which is automatically annotated and not reviewed and UniProtKB/Swiss-Prot which is manually annotated and reviewed. The annotation process is essentially based on sequence similarity search. The question therefore arises as to which extent annotation based on transfer by inheritance is valuable and specifically if it is possible to statistically validate inherited features when little homology exists among the target sequence and its template(s).

Results

In this paper we address the problem of annotating protein sequences in a statistically validated manner considering as a reference annotation resource UniProtKB. The test case is the set of 48,298 proteins recently released by the Critical Assessment of Function Annotations (CAFA) organization. We show that we can transfer after validation, Gene Ontology (GO) terms of the three main categories and Pfam domains to about 68% and 72% of the sequences, respectively. This is possible after alignment of the CAFA sequences towards BAR+, our annotation resource that allows discriminating among statistically validated and not statistically validated annotation. By comparing with a direct UniProtKB annotation, we find that besides validating annotation of some 78% of the CAFA set, we assign new and statistically validated annotation to 14.8% of the sequences and find new structural templates for about 25% of the chains, half of which share less than 30% sequence identity to the corresponding template/s.

Conclusion

Inheritance of annotation by transfer generally requires a careful selection of the identity value among the target and the template in order to transfer structural and/or functional features. Here we prove that even distantly remote homologs can be safely endowed with structural templates and GO and/or Pfam terms provided that annotation is done within clusters collecting cluster-related protein sequences and where a statistical validation of the shared structural and functional features is possible.

Collapse

Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 2013;14 Suppl 3:S7. [PMID: 23514582 PMCID: PMC3584931 DOI: 10.1186/1471-2105-14-s3-s7] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Wong A, Shatkay H. Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge. BMC Bioinformatics 2013;14 Suppl 3:S14. [PMID: 23514326 PMCID: PMC3584852 DOI: 10.1186/1471-2105-14-s3-s14] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is yet unknown. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge.

RESULTS

We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). The overall prediction accuracy of Text-KNN, Base-Prior, and Base-Seq for molecular function classes are 62%, 43%, and 58% while the overall accuracy for biological process classes are 17%, 11%, and 28% respectively. Results obtained as part of the CAFA evaluation itself on the CAFA dataset are reported as well.

CONCLUSIONS

Our evaluation shows that the text-based classifier consistently outperforms the baseline classifier that is based on prior distribution, and typically has comparable performance to the baseline classifier that uses sequence similarity. Moreover, the results suggest that combining text features with other types of features can potentially lead to improved prediction performance. The preliminary results also suggest that while our text-based classifier can be used to predict both molecular function and biological process in which a protein is involved, the classifier performs significantly better for predicting molecular function than for predicting biological process. A similar trend was observed for other classifiers participating in the CAFA challenge.

Collapse

Wang Z, Cao R, Cheng J. Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks. BMC Bioinformatics 2013;14 Suppl 3:S3. [PMID: 23514381 PMCID: PMC3584933 DOI: 10.1186/1471-2105-14-s3-s3] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open

Abstract

Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progresses in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindingly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. Particularly, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).

Collapse

Lopez D, Pazos F. Concomitant prediction of function and fold at the domain level with GO-based profiles. BMC Bioinformatics 2013;14 Suppl 3:S12. [PMID: 23514233 PMCID: PMC3584904 DOI: 10.1186/1471-2105-14-s3-s12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open

Erdin S, Venner E, Lisewski AM, Lichtarge O. Function prediction from networks of local evolutionary similarity in protein structure. BMC Bioinformatics 2013;14 Suppl 3:S6. [PMID: 23514548 PMCID: PMC3584919 DOI: 10.1186/1471-2105-14-s3-s6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open

Li W, Cong Q, Kinch LN, Grishin NV. Seq2Ref: a web server to facilitate functional interpretation. BMC Bioinformatics 2013;14:30. [PMID: 23356573 PMCID: PMC3573977 DOI: 10.1186/1471-2105-14-30] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2012] [Accepted: 01/15/2013] [Indexed: 01/01/2023] Open