1
|
Chitale M, Khan IK, Kihara D. In-depth performance evaluation of PFP and ESG sequence-based function prediction methods in CAFA 2011 experiment. BMC Bioinformatics 2013; 14 Suppl 3:S2. [PMID: 23514353 PMCID: PMC3584938 DOI: 10.1186/1471-2105-14-s3-s2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Many Automatic Function Prediction (AFP) methods were developed to cope with an increasing growth of the number of gene sequences that are available from high throughput sequencing experiments. To support the development of AFP methods, it is essential to have community wide experiments for evaluating performance of existing AFP methods. Critical Assessment of Function Annotation (CAFA) is one such community experiment. The meeting of CAFA was held as a Special Interest Group (SIG) meeting at the Intelligent Systems in Molecular Biology (ISMB) conference in 2011. Here, we perform a detailed analysis of two sequence-based function prediction methods, PFP and ESG, which were developed in our lab, using the predictions submitted to CAFA. RESULTS We evaluate PFP and ESG using four different measures in comparison with BLAST, Prior, and GOtcha. In addition to the predictions submitted to CAFA, we further investigate performance of a different scoring function to rank order predictions by PFP as well as PFP/ESG predictions enriched with Priors that simply adds frequently occurring Gene Ontology terms as a part of predictions. Prediction accuracies of each method were also evaluated separately for different functional categories. Successful and unsuccessful predictions by PFP and ESG are also discussed in comparison with BLAST. CONCLUSION The in-depth analysis discussed here will complement the overall assessment by the CAFA organizers. Since PFP and ESG are based on sequence database search results, our analyses are not only useful for PFP and ESG users but will also shed light on the relationship of the sequence similarity space and functions that can be inferred from the sequences.
Collapse
Affiliation(s)
- Meghana Chitale
- Department of Computer Science, Purdue University, 305 N, University Street, West Lafayette, Indiana 47907, USA
| | | | | |
Collapse
|
2
|
Ashkenazi S, Snir R, Ofran Y. Assessing the relationship between conservation of function and conservation of sequence using photosynthetic proteins. ACTA ACUST UNITED AC 2012; 28:3203-10. [PMID: 23080118 DOI: 10.1093/bioinformatics/bts608] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Assessing the false positive rate of function prediction methods is difficult, as it is hard to establish that a protein does not have a certain function. To determine to what extent proteins with similar sequences have a common function, we focused on photosynthesis-related proteins. A protein that comes from a non-photosynthetic organism is, undoubtedly, not involved in photosynthesis. RESULTS We show that function diverges very rapidly: 70% of the close homologs of photosynthetic proteins come from non-photosynthetic organisms. Therefore, high sequence similarity, in most cases, is not tantamount to similar function. However, we found that many functionally similar proteins often share short sequence elements, which may correspond to a functional site and could reveal functional similarities more accurately than sequence similarity. CONCLUSIONS These results shed light on the way biological function is conserved in evolution and may help improve large-scale analysis of protein function.
Collapse
Affiliation(s)
- Shaul Ashkenazi
- The Goodman faculty of life sciences, Bar Ilan University, Ramat Gan 52900, Israel
| | | | | |
Collapse
|
3
|
Vihinen M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 2012; 13 Suppl 4:S2. [PMID: 22759650 PMCID: PMC3303716 DOI: 10.1186/1471-2164-13-s4-s2] [Citation(s) in RCA: 155] [Impact Index Per Article: 12.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022] Open
Abstract
Background Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often problematic to evaluate the true performance and applicability of computational tools as some knowledge about computer science and statistics would be needed. Results Instructions are given on how to interpret and compare method evaluation results. For systematic method performance analysis is needed established benchmark datasets which contain cases with known outcome, and suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, benchmark database for variations. There is no single measure that alone could describe all the aspects of method performance. Predictions of genetic variation effects on DNA, RNA and protein level are important as information about variants can be produced much faster than their disease relevance can be experimentally verified. Therefore numerous prediction tools have been developed, however, systematic analyses of their performance and comparison have just started to emerge. Conclusions The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
Collapse
Affiliation(s)
- Mauno Vihinen
- Institute of Biomedical Technology, University of Tampere, Finland.
| |
Collapse
|
4
|
Erdin S, Lisewski AM, Lichtarge O. Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol 2011; 21:180-8. [PMID: 21353529 DOI: 10.1016/j.sbi.2011.02.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2011] [Accepted: 02/03/2011] [Indexed: 11/16/2022]
Abstract
Genomic centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, less than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.
Collapse
Affiliation(s)
- Serkan Erdin
- Department of Molecular and Human Genetics, 1 Baylor Plaza, Baylor College of Medicine, Houston, TX 77030, USA
| | | | | |
Collapse
|
5
|
Jaroszewski L, Li Z, Krishna SS, Bakolitsa C, Wooley J, Deacon AM, Wilson IA, Godzik A. Exploration of uncharted regions of the protein universe. PLoS Biol 2009; 7:e1000205. [PMID: 19787035 PMCID: PMC2744874 DOI: 10.1371/journal.pbio.1000205] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2009] [Accepted: 08/19/2009] [Indexed: 12/02/2022] Open
Abstract
Determination of first protein structures, from hundreds of families of unknown function, have shown that divergence, rather than novelty, is the dominant force that shapes the evolution of the protein universe. The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, as most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results infer that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Notwithstanding, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies. More than 40% of known proteins lack any annotation within public databases and are usually referred to as hypothetical proteins despite most of them being real and many being evolutionarily conserved and thus expected to play important biological roles. Determination of the three-dimensional structures of representatives of more than 240 families of protein domains of unknown function by the Protein Structure Initiative has provided a unique sample of regions of the protein universe that, until this systematic effort, were completely uncharacterized. Analysis of these structures reveals that most of the 240 families can be considered as remote homologs of already known protein families. Such distant evolutionary links can sometimes be predicted by current state-of-the-art sequence comparison tools, but structural analysis has led to the first hypotheses about biological functions for many of these uncharacterized proteins, and serves as a starting point for experimental studies. The rapid pace of discovery of such relationships appears to suggest that the protein universe is made up of a relatively small and stable number of ‘extended neighborhoods’ that bring together distantly related protein families. Thus, the vast uncharacterized part of protein universe, called by some “the dark matter of protein space”, may consist mainly of highly divergent homologs. Continued structural characterization of these previously under-investigated regions of the protein universe should further help unravel the patterns and rules that led to such divergence in the evolution of protein structure and function.
Collapse
Affiliation(s)
- Lukasz Jaroszewski
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
| | - Zhanwen Li
- Joint Center for Molecular Modeling, Burnham Institute for Medical Research, La Jolla, California, United States of America
| | - S. Sri Krishna
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
| | - Constantina Bakolitsa
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
| | - John Wooley
- Joint Center for Structural Genomics, Bioinformatics Core, Center for Research in Biological Systems, University of California San Diego, La Jolla, California, United States of America
| | - Ashley M. Deacon
- Joint Center for Structural Genomics, Structure Determination Core, Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, California, United States of America
| | - Ian A. Wilson
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, California, United States of America
| | - Adam Godzik
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
- Joint Center for Molecular Modeling, Burnham Institute for Medical Research, La Jolla, California, United States of America
- Joint Center for Structural Genomics, Bioinformatics Core, Center for Research in Biological Systems, University of California San Diego, La Jolla, California, United States of America
- * E-mail:
| |
Collapse
|
6
|
Ooi HS, Kwo CY, Wildpaner M, Sirota FL, Eisenhaber B, Maurer-Stroh S, Wong WC, Schleiffer A, Eisenhaber F, Schneider G. ANNIE: integrated de novo protein sequence annotation. Nucleic Acids Res 2009; 37:W435-40. [PMID: 19389726 PMCID: PMC2703921 DOI: 10.1093/nar/gkp254] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Function prediction of proteins with computational sequence analysis requires the use of dozens of prediction tools with a bewildering range of input and output formats. Each of these tools focuses on a narrow aspect and researchers are having difficulty obtaining an integrated picture. ANNIE is the result of years of close interaction between computational biologists and computer scientists and automates an essential part of this sequence analytic process. It brings together over 20 function prediction algorithms that have proven sufficiently reliable and indispensable in daily sequence analytic work and are meant to give scientists a quick overview of possible functional assignments of sequence segments in the query proteins. The results are displayed in an integrated manner using an innovative AJAX-based sequence viewer. ANNIE is available online at: http://annie.bii.a-star.edu.sg. This website is free and open to all users and there is no login requirement.
Collapse
|
7
|
Fontana P, Cestaro A, Velasco R, Formentin E, Toppo S. Rapid annotation of anonymous sequences from genome projects using semantic similarities and a weighting scheme in gene ontology. PLoS One 2009; 4:e4619. [PMID: 19247487 PMCID: PMC2645684 DOI: 10.1371/journal.pone.0004619] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2008] [Accepted: 01/09/2009] [Indexed: 11/22/2022] Open
Abstract
Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to process quickly thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage.
Collapse
Affiliation(s)
- Paolo Fontana
- FEM-IASMA Research Center, San Michele all'Adige (TN), Italy
| | | | | | | | - Stefano Toppo
- Department of Biological Chemistry, University of Padova, Padova, Italy
- * E-mail:
| |
Collapse
|