Reference Citation Analysis

For: Rodrigues APC, Grant BJ, Godzik A, Friedberg I. The 2006 automated function prediction meeting. BMC Bioinformatics 2007;8 Suppl 4:S1-4. [PMID: 17570143 PMCID: PMC1892079 DOI: 10.1186/1471-2105-8-s4-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]

Cited by Other Article(s)

Chitale M, Khan IK, Kihara D. In-depth performance evaluation of PFP and ESG sequence-based function prediction methods in CAFA 2011 experiment. BMC Bioinformatics 2013;14 Suppl 3:S2. [PMID: 23514353 PMCID: PMC3584938 DOI: 10.1186/1471-2105-14-s3-s2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/23/2023]

Ashkenazi S, Snir R, Ofran Y. Assessing the relationship between conservation of function and conservation of sequence using photosynthetic proteins. Bioinformatics 2012;28:3203-10. [PMID: 23080118 DOI: 10.1093/bioinformatics/bts608] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]

Vihinen M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 2012;13 Suppl 4:S2. [PMID: 22759650 PMCID: PMC3303716 DOI: 10.1186/1471-2164-13-s4-s2] [Citation(s) in RCA: 155] [Impact Index Per Article: 12.9] [Indexed: 11/26/2022]

Abstract

Background

Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary (two-state) classifiers are the most common applications, and they are usually based on machine learning approaches. For the end user it is often difficult to evaluate the true performance and applicability of computational tools, because doing so requires some knowledge of computer science and statistics.

Results

Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets that contain cases with known outcomes, together with suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. No single measure can describe all aspects of method performance. Predictions of the effects of genetic variation at the DNA, RNA and protein levels are important because information about variants can be produced much faster than their disease relevance can be experimentally verified. Numerous prediction tools have therefore been developed; however, systematic analyses and comparisons of their performance have only just started to emerge.

Conclusions

The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced: sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis, they provide a good picture of the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
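As an illustration of the six measures named above, the following minimal sketch computes them from the four confusion-matrix counts of a two-state classifier, using their standard textbook definitions. The function name `binary_metrics`, the dictionary keys, and the example counts are hypothetical and are not taken from the article; returning NaN for an undefined ratio is likewise an assumption made here for convenience.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Six standard measures for a two-state classifier.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives and false negatives (hypothetical helper, not from
    the article).
    """
    def ratio(num, den):
        # Assumption: report NaN when a measure is undefined (zero denominator).
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation coefficient
    }


# Made-up counts, for illustration only.
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting all six values side by side reflects the abstract's point that no single measure describes every aspect of performance; accuracy alone, for example, can look favorable on an unbalanced dataset while the Matthews correlation coefficient stays low.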

Erdin S, Lisewski AM, Lichtarge O. Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol 2011;21:180-8. [PMID: 21353529 DOI: 10.1016/j.sbi.2011.02.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 02/03/2011] [Accepted: 02/03/2011] [Indexed: 11/16/2022]

Jaroszewski L, Li Z, Krishna SS, Bakolitsa C, Wooley J, Deacon AM, Wilson IA, Godzik A. Exploration of uncharted regions of the protein universe. PLoS Biol 2009;7:e1000205. [PMID: 19787035 PMCID: PMC2744874 DOI: 10.1371/journal.pbio.1000205] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.2] [Received: 04/08/2009] [Accepted: 08/19/2009] [Indexed: 12/02/2022]

Abstract

Determination of the first protein structures from hundreds of families of unknown function has shown that divergence, rather than novelty, is the dominant force that shapes the evolution of the protein universe.

The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, as most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results imply that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Notwithstanding, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies.

More than 40% of known proteins lack any annotation within public databases and are usually referred to as hypothetical proteins despite most of them being real and many being evolutionarily conserved and thus expected to play important biological roles. Determination of the three-dimensional structures of representatives of more than 240 families of protein domains of unknown function by the Protein Structure Initiative has provided a unique sample of regions of the protein universe that, until this systematic effort, were completely uncharacterized. Analysis of these structures reveals that most of the 240 families can be considered as remote homologs of already known protein families. Such distant evolutionary links can sometimes be predicted by current state-of-the-art sequence comparison tools, but structural analysis has led to the first hypotheses about biological functions for many of these uncharacterized proteins, and serves as a starting point for experimental studies. The rapid pace of discovery of such relationships appears to suggest that the protein universe is made up of a relatively small and stable number of ‘extended neighborhoods’ that bring together distantly related protein families. Thus, the vast uncharacterized part of the protein universe, called by some “the dark matter of protein space”, may consist mainly of highly divergent homologs. Continued structural characterization of these previously under-investigated regions of the protein universe should further help unravel the patterns and rules that led to such divergence in the evolution of protein structure and function.

Ooi HS, Kwo CY, Wildpaner M, Sirota FL, Eisenhaber B, Maurer-Stroh S, Wong WC, Schleiffer A, Eisenhaber F, Schneider G. ANNIE: integrated de novo protein sequence annotation. Nucleic Acids Res 2009;37:W435-40. [PMID: 19389726 PMCID: PMC2703921 DOI: 10.1093/nar/gkp254] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 12/21/2022]

Fontana P, Cestaro A, Velasco R, Formentin E, Toppo S. Rapid annotation of anonymous sequences from genome projects using semantic similarities and a weighting scheme in gene ontology. PLoS One 2009;4:e4619. [PMID: 19247487 PMCID: PMC2645684 DOI: 10.1371/journal.pone.0004619] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 11/09/2008] [Accepted: 01/09/2009] [Indexed: 11/22/2022]