351
Wang J, Zhou X, Zhu J, Zhou C, Guo Z. Revealing and avoiding bias in semantic similarity scores for protein pairs. BMC Bioinformatics 2010; 11:290. [PMID: 20509916 PMCID: PMC2903568 DOI: 10.1186/1471-2105-11-290] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 02/09/2010] [Accepted: 05/28/2010] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND Semantic similarity scores for protein pairs are widely applied in functional genomics research for finding functional clusters of proteins, predicting protein functions and protein-protein interactions, and for identifying putative disease genes. However, because some proteins, such as those related to diseases, tend to be studied more intensively, annotations are likely to be biased, which may affect applications based on semantic similarity measures. Thus, it is necessary to evaluate the effects of the bias on semantic similarity scores between proteins and then find a method to avoid them. RESULTS First, we evaluated 14 commonly used semantic similarity scores for protein pairs and demonstrated that they significantly correlated with the numbers of annotation terms for the proteins (also known as the protein annotation length). These results suggested that current applications of the semantic similarity scores between proteins might be unreliable. Then, to reduce this annotation bias effect, we proposed normalizing the semantic similarity scores between proteins using the power transformation of the scores. We provide evidence that this improves performance in some applications. CONCLUSIONS Current semantic similarity measures for protein pairs are highly dependent on protein annotation lengths, which are subject to biological research bias. This affects applications that are based on these semantic similarity scores, especially in clustering studies that rely on score magnitudes. The normalized scores proposed in this paper can reduce the effects of this bias to some extent.
Affiliation(s)
- Jing Wang
  - Bioinformatics Centre, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Xianxiao Zhou
  - Bioinformatics Centre, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Jing Zhu
  - Bioinformatics Centre, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Chenggui Zhou
  - Bioinformatics Centre, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Zheng Guo
  - Bioinformatics Centre, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150086, China
352
Hawkins T, Chitale M, Kihara D. Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP. BMC Bioinformatics 2010; 11:265. [PMID: 20482861 PMCID: PMC2882935 DOI: 10.1186/1471-2105-11-265] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 10/15/2009] [Accepted: 05/19/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A new paradigm of biological investigation takes advantage of technologies that produce large high throughput datasets, including genome sequences, interactions of proteins, and gene expression. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins can lack the functional evidence necessary to infer their biological relevance. RESULTS Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genome sequences, Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (malaria). The number of annotated genes is increased by PFP to over 90% for all of the genomes. Using the large coverage of the function annotation, we introduced the functional similarity networks which represent the functional space of the proteomes. Four different functional similarity networks are constructed for each proteome, one each by considering similarity in a single Gene Ontology (GO) category, i.e. Biological Process, Cellular Component, and Molecular Function, and another one by considering overall similarity with the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction network. Moreover, the funSim score network is distinct from the single GO-score networks by showing a higher clustering degree exponent value and thus has a higher tendency to be hierarchical. In addition, examining function assignments to the protein-protein interaction network and local regions of genomes has identified numerous cases where subnetworks or local regions have functionally coherent proteins. These results will help interpreting interactions of proteins and gene orders in a genome. Several examples of both analyses are highlighted. 
CONCLUSION The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact on researchers' ability to interpret the immense biological data that are being generated today. The newly introduced functional similarity networks of the three organisms show different network properties as compared with the protein-protein interaction networks.
Affiliation(s)
- Troy Hawkins
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
353
Zhang Y, Xuan J, de los Reyes BG, Clarke R, Ressom HW. Reconstruction of gene regulatory modules in cancer cell cycle by multi-source data integration. PLoS One 2010; 5:e10268. [PMID: 20422009 PMCID: PMC2858157 DOI: 10.1371/journal.pone.0010268] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 10/16/2009] [Accepted: 03/25/2010] [Indexed: 12/31/2022] Open
Abstract
Background Precise regulation of the cell cycle is crucial to the growth and development of all organisms. Understanding the regulatory mechanism of the cell cycle is crucial to unraveling many complicated diseases, most notably cancer. Multiple sources of biological data are available to study the dynamic interactions among many genes that are related to the cancer cell cycle. Integrating these informative and complementary data sources can help to infer a mutually consistent gene transcriptional regulatory network with strong similarity to the underlying gene regulatory relationships in cancer cells. Results and Principal Findings We propose an integrative framework that infers gene regulatory modules from the cell cycle of cancer cells by incorporating multiple sources of biological data, including gene expression profiles, gene ontology, and molecular interaction. Among 846 human genes with putative roles in cell cycle regulation, we identified 46 transcription factors and 39 gene ontology groups. We reconstructed regulatory modules to infer the underlying regulatory relationships. Four regulatory network motifs were identified from the interaction network. The relationship between each transcription factor and predicted target gene groups was examined by training a recurrent neural network whose topology mimics the network motif(s) to which the transcription factor was assigned. Inferred network motifs related to eight well-known cell cycle genes were confirmed by gene set enrichment analysis, binding site enrichment analysis, and comparison with previously published experimental results. Conclusions We established a robust method that can accurately infer underlying relationships between a given transcription factor and its downstream target genes by integrating different layers of biological data. Our method could also be beneficial to biologists for predicting the components of regulatory modules in which any candidate gene is involved. 
Such predictions can then be used to design a more streamlined experimental approach for biological validation. Understanding the dynamics of these modules will shed light on the processes that occur in cancer cells resulting from errors in cell cycle regulation.
Affiliation(s)
- Yuji Zhang
  - Lombardi Comprehensive Cancer Center, Georgetown University, Washington, D. C., United States of America
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, United States of America
- Jianhua Xuan
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, United States of America
- Robert Clarke
  - Lombardi Comprehensive Cancer Center, Georgetown University, Washington, D. C., United States of America
- Habtom W. Ressom
  - Lombardi Comprehensive Cancer Center, Georgetown University, Washington, D. C., United States of America
354
Lutter D, Marr C, Krumsiek J, Lang EW, Theis FJ. Intronic microRNAs support their host genes by mediating synergistic and antagonistic regulatory effects. BMC Genomics 2010; 11:224. [PMID: 20370903 PMCID: PMC2865499 DOI: 10.1186/1471-2164-11-224] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.6] [Received: 10/05/2009] [Accepted: 04/06/2010] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND MicroRNA-mediated control of gene expression via translational inhibition has substantial impact on cellular regulatory mechanisms. About 37% of mammalian microRNAs appear to be located within introns of protein coding genes, linking their expression to the promoter-driven regulation of the host gene. In our study we investigate this linkage towards a relationship beyond transcriptional co-regulation. RESULTS Using measures based on both annotation and experimental data, we show that intronic microRNAs tend to support their host genes by regulation of target gene expression with significantly correlated expression patterns. We used expression data of three differentiating cell types and compared gene expression profiles of host and target genes. Many microRNA target genes show expression patterns significantly correlated with the expressions of the microRNA host genes. By calculating functional similarities between host and predicted microRNA target genes based on GO annotations, we confirm that many microRNAs link host and target gene activity in either a synergistic or an antagonistic manner. CONCLUSIONS These two regulatory effects may result from fine tuning of target gene expression functionally related to the host or knock-down of remaining opponent target gene expression. This finding makes it possible to extend the common practice of mapping large-scale gene expression data to protein-associated genes with the functionality of co-expressed intronic microRNAs.
Affiliation(s)
- Dominik Lutter
  - Institute of Bioinformatics and Systems Biology, CMB, Helmholtz Zentrum München, Germany.
355
Zeng J, Zhu S, Liew AWC, Yan H. Multiconstrained gene clustering based on generalized projections. BMC Bioinformatics 2010; 11:164. [PMID: 20356386 PMCID: PMC3098054 DOI: 10.1186/1471-2105-11-164] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/08/2009] [Accepted: 03/31/2010] [Indexed: 11/10/2022] Open
Abstract
Background Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints into an optimal clustering solution remains an unsolved problem. Results We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions.
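The iterative projection idea in this abstract can be illustrated with the classical POCS scheme on two simple convex sets. This is only a minimal sketch: the sets (a unit ball and a half-space), the helper names, and the starting point are assumptions chosen for illustration, and they are not the constraint sets or the generalized projector used in the paper.

```python
import numpy as np

def project_ball(x, center, radius):
    """Euclidean projection onto the closed ball {y : ||y - center|| <= radius}."""
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def project_halfspace(x, a, b):
    """Euclidean projection onto the half-space {y : a . y <= b}."""
    violation = a @ x - b
    return x if violation <= 0 else x - violation * a / (a @ a)

def pocs(x0, projectors, iters=200, tol=1e-10):
    """Cycle through the projectors until the iterate stops moving."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_new = x
        for project in projectors:
            x_new = project(x_new)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Hypothetical constraint sets: a unit ball and the half-space x + y <= 0.5.
center = np.array([0.0, 0.0])
a = np.array([1.0, 1.0])
projectors = [
    lambda x: project_ball(x, center, 1.0),
    lambda x: project_halfspace(x, a, 0.5),
]
solution = pocs([3.0, 2.0], projectors)
print(solution)
```

When the sets are convex and their intersection is non-empty, the cyclic projections converge to a point that satisfies every constraint, which is the property the abstract relies on when it speaks of a consistent solution in the intersection set.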
Affiliation(s)
- Jia Zeng
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
356
Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 2010; 26:976-8. [DOI: 10.1093/bioinformatics/btq064] [Citation(s) in RCA: 712] [Impact Index Per Article: 50.9] [Indexed: 11/14/2022] Open
357
Chin CH, Chen SH, Ho CW, Ko MT, Lin CY. A hub-attachment based method to detect functional modules from confidence-scored protein interactions and expression profiles. BMC Bioinformatics 2010; 11 Suppl 1:S25. [PMID: 20122197 PMCID: PMC3009496 DOI: 10.1186/1471-2105-11-s1-s25] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Indexed: 12/11/2022] Open
Abstract
Background Many research results show that biological systems are composed of functional modules. Members of the same module usually have common functions. This is useful information for understanding how biological systems work. Therefore, detecting functional modules is an important research topic in the post-genome era. One approach to detecting functional modules is to find dense regions in Protein-Protein Interaction (PPI) networks. Most current methods neglect the confidence scores of interactions and pay little attention to using gene expression data to improve their results. Results In this paper, we propose a novel hub-attachment based method to detect functional modules from confidence-scored protein interactions and expression profiles, and we name it HUNTER. Our method can not only extract functional modules from a weighted PPI network, but also use gene expression data as optional input to increase the quality of outcomes. Using HUNTER on yeast data, we found that it can discover more novel components related to the RNA polymerase complex in the yeast interactome than existing methods. These new components show a close relationship with polymerase in a functional analysis based on Gene Ontology. Conclusion A C++ implementation of our prediction method, dataset and supplementary material are available at http://hub.iis.sinica.edu.tw/Hunter/. Our proposed HUNTER method has been applied to yeast data, and the empirical results show that our method can accurately identify functional modules. Such applications of our algorithm can reconstruct the biological machinery, identify undiscovered components and decipher common sub-modules inside complexes such as RNA polymerases I, II and III.
Affiliation(s)
- Chia-Hao Chin
  - Institute of Information Science, Academia Sinica, No. 128 Yan-Chiu-Yuan Rd., Sec. 2, Taipei 115, Taiwan.
358
Schönhuth A, Salari R, Hormozdiari F, Cherkasov A, Cenk Sahinalp S. Towards Improved Assessment of Functional Similarity in Large-Scale Screens: A Study on Indel Length. J Comput Biol 2010; 17:1-20. [DOI: 10.1089/cmb.2009.0031] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Affiliation(s)
- Alexander Schönhuth
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Raheleh Salari
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Fereydoun Hormozdiari
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Artem Cherkasov
  - Division of Infectious Diseases, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- S. Cenk Sahinalp
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
359
Ali W, Deane CM. Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 2009; 25:3166-73. [PMID: 19797409 PMCID: PMC2778333 DOI: 10.1093/bioinformatics/btp569] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 06/17/2009] [Revised: 09/25/2009] [Accepted: 09/29/2009] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Functional module detection within protein interaction networks is a challenging problem due to the sparsity of data and presence of errors. Computational techniques for this task range from purely graph theoretical approaches involving single networks to alignment of multiple networks from several species. Current network alignment methods all rely on protein sequence similarity to map proteins across species. RESULTS Here we carry out network alignment using a protein functional similarity measure. We show that using functional similarity to map proteins across species improves network alignment in terms of functional coherence and overlap with experimentally verified protein complexes. Moreover, the results from functional similarity-based network alignment display little overlap (<15%) with sequence similarity-based alignment. Our combined approach integrating sequence and function-based network alignment alongside graph clustering properties offers a 200% increase in coverage of experimental datasets and comparable accuracy to current network alignment methods. AVAILABILITY Program binaries and source code are freely available at http://www.stats.ox.ac.uk/research/bioinfo/resources. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Waqar Ali
  - Department of Statistics, University of Oxford, OX1 3TG, UK.
360
Dutkowski J, Tiuryn J. Phylogeny-guided interaction mapping in seven eukaryotes. BMC Bioinformatics 2009; 10:393. [PMID: 19948065 PMCID: PMC2793266 DOI: 10.1186/1471-2105-10-393] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 07/01/2009] [Accepted: 11/30/2009] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND The assembly of reliable and complete protein-protein interaction (PPI) maps remains one of the significant challenges in systems biology. Computational methods which integrate and prioritize interaction data can greatly aid in approaching this goal. RESULTS We developed a Bayesian inference framework which uses phylogenetic relationships to guide the integration of PPI evidence across multiple datasets and species, providing more accurate predictions. We apply our framework to reconcile seven eukaryotic interactomes: H. sapiens, M. musculus, R. norvegicus, D. melanogaster, C. elegans, S. cerevisiae and A. thaliana. Comprehensive GO-based quality assessment indicates a 5% to 44% score increase in predicted interactomes compared to the input data. Further support is provided by gold-standard MIPS, CYC2008 and HPRD datasets. We demonstrate the ability to recover known PPIs in well-characterized yeast and human complexes (26S proteasome, endosome and exosome) and suggest possible new partners interacting with the putative SWI/SNF chromatin remodeling complex in A. thaliana. CONCLUSION Our phylogeny-guided approach compares favorably to two standard methods for mapping PPIs across species. Detailed analysis of predictions in selected functional modules uncovers specific PPI profiles among homologous proteins, establishing interaction-based partitioning of protein families. Provided evidence also suggests that interactions within core complex subunits are in general more conserved and easier to transfer accurately to other organisms, than interactions between these subunits.
361
Schlicker A, Albrecht M. FunSimMat update: new features for exploring functional similarity. Nucleic Acids Res 2009; 38:D244-8. [PMID: 19923227 PMCID: PMC2808991 DOI: 10.1093/nar/gkp979] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 12/22/2022] Open
Abstract
Quantifying the functional similarity of genes and their products based on Gene Ontology annotation is an important tool for diverse applications like the analysis of gene expression data, the prediction and validation of protein functions and interactions, and the prioritization of disease genes. The Functional Similarity Matrix (FunSimMat, http://www.funsimmat.de) is a comprehensive database providing various precomputed functional similarity values for proteins in UniProtKB and for protein families in Pfam and SMART. With this update, we significantly increase the coverage of FunSimMat by adding data from the Gene Ontology Annotation project as well as new functional similarity measures. The applicability of the database is greatly extended by the implementation of a new Gene Ontology-based method for disease gene prioritization. Two new visualization tools allow an interactive analysis of the functional relationships between proteins or protein families. This is enhanced further by the introduction of an automatically derived hierarchy of annotation classes. Additional changes include a revised user front-end and a new RESTlike interface for improving the user-friendliness and online accessibility of FunSimMat.
Affiliation(s)
- Andreas Schlicker
  - Max Planck Institute for Informatics, Campus E1.4, 66123 Saarbrücken, Germany.
362
Jing L, Ng MK, Liu Y. Construction of gene networks with hybrid approach from expression profile and gene ontology. IEEE Trans Inf Technol Biomed 2009; 14:107-18. [PMID: 19789116 DOI: 10.1109/titb.2009.2033056] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Gene regulatory networks have long been studied in model organisms as a means of identifying functional relationships among genes or their corresponding products. Despite many existing methods for genome-wide construction of such networks, solutions to the gene regulatory network problem are not trivial. Here, we present a hybrid approach with gene expression profiles and gene ontology (HAEO). HAEO makes use of multiple methods (overlapping clustering and reverse engineering methods) to effectively and efficiently construct gene regulatory networks from multiple sources (gene expression profiles and gene ontology). Application to a yeast cell cycle dataset demonstrates HAEO's ability to construct validated gene regulatory networks, including potential gene regulatory pairs that cannot be discovered by general inference methods, and to identify cycles (i.e., feedback loops) between genes. We also experimentally study the efficiency of building networks and show that the proposed method, HAEO, is much faster than the Bayesian network method.
Affiliation(s)
- Liping Jing
  - School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China.
363
Herrmann C, Bérard S, Tichit L. SimCT: a generic tool to visualize ontology-based relationships for biological objects. Bioinformatics 2009; 25:3197-8. [PMID: 19776214 PMCID: PMC2778334 DOI: 10.1093/bioinformatics/btp553] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
We present a web-based service, SimCT, which graphically displays the relationships between biological objects (e.g. genes or proteins) based on their annotations to a biomedical ontology. The result is presented as a tree of these objects, which can be viewed and explored through a dedicated Java applet designed to highlight relevant features. Unlike the numerous tools that search for overrepresented terms, SimCT draws a simplified representation of biological terms present in the set of objects, and can be applied to any ontology for which annotation data is available. Being web-based, it does not require prior installation, and provides an intuitive, easy-to-use service. AVAILABILITY http://tagc.univ-mrs.fr/SimCT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Carl Herrmann
  - TAGC-U928 Inserm, Faculté des Sciences, Université de la Méditerranée, Campus de Luminy Case 928, Marseille, France.
364
Friedel CC, Krumsiek J, Zimmer R. Bootstrapping the Interactome: Unsupervised Identification of Protein Complexes in Yeast. J Comput Biol 2009; 16:971-87. [DOI: 10.1089/cmb.2009.0023] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.5] [Indexed: 01/10/2023] Open
Affiliation(s)
- Caroline C. Friedel
  - Institut für Informatik, Ludwig-Maximilians-Universität München, München, Germany
- Jan Krumsiek
  - Institut für Informatik, Ludwig-Maximilians-Universität München, München, Germany
- Ralf Zimmer
  - Institut für Informatik, Ludwig-Maximilians-Universität München, München, Germany
365
Abstract
In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization. We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies. Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly becoming a focus of both effort by biomedical experts and application of automated annotation procedures to create corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these evolutions, we can expect to see them gaining more relevance and even becoming as essential as sequence similarity is today in biomedical research.
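As a rough illustration of the node-based, pairwise strategy named in this abstract, the sketch below computes a Resnik-style term similarity (information content of the most informative common ancestor) and combines term scores into an entity-level score with a best-match average. The toy ontology, the annotations, and the function names are all hypothetical; real use would draw on GO and an annotation corpus.

```python
import math

# Hypothetical toy ontology (child -> parents) and annotations; not real GO data.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}
ANNOTATIONS = {
    "P1": {"dna_binding"},
    "P2": {"rna_binding"},
    "P3": {"kinase"},
    "P4": {"dna_binding", "kinase"},
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC = log(N / n), where n entities are annotated with the term or a descendant.

    Assumes every term is used by at least one entity, as in the toy data.
    """
    hits = sum(
        any(term in ancestors(t) for t in terms)
        for terms in ANNOTATIONS.values()
    )
    return math.log(len(ANNOTATIONS) / hits)

def resnik(t1, t2):
    """Node-based term similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)

def best_match_average(e1, e2):
    """Pairwise entity similarity: mean of the best matches in both directions."""
    a, b = sorted(ANNOTATIONS[e1]), sorted(ANNOTATIONS[e2])
    row_best = [max(resnik(s, t) for t in b) for s in a]
    col_best = [max(resnik(s, t) for s in a) for t in b]
    return 0.5 * (sum(row_best) / len(a) + sum(col_best) / len(b))

for pair in [("P1", "P2"), ("P1", "P3"), ("P1", "P4")]:
    print(pair, round(best_match_average(*pair), 3))
```

A groupwise measure would instead compare the two annotation sets directly (for example by set overlap of their ancestor terms) rather than combining term-to-term scores, and an edge-based measure would replace information content with path lengths in the ontology graph.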
Affiliation(s)
- Catia Pesquita
  - LaSIGE, Faculty of Sciences, University of Lisboa, Lisboa, Portugal.
366
Andreopoulos B, Winter C, Labudde D, Schroeder M. Triangle network motifs predict complexes by complementing high-error interactomes with structural information. BMC Bioinformatics 2009; 10:196. [PMID: 19558694 PMCID: PMC2714575 DOI: 10.1186/1471-2105-10-196] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 01/14/2009] [Accepted: 06/27/2009] [Indexed: 11/30/2022] Open
Abstract
Background Many high-throughput studies produce protein-protein interaction networks (PPINs) with many errors and missing information. Even for genome-wide approaches, there is often a low overlap between PPINs produced by different studies. Second-level neighbors separated by two protein-protein interactions (PPIs) were previously used for predicting protein function and finding complexes in high-error PPINs. We retrieve second-level neighbors in PPINs, and complement these with structural domain-domain interactions (SDDIs) representing binding evidence on proteins, forming PPI-SDDI-PPI triangles. Results We find low overlap between PPINs, SDDIs and known complexes, all well below 10%. We evaluate the overlap of PPI-SDDI-PPI triangles with known complexes from the Munich Information Center for Protein Sequences (MIPS). PPI-SDDI-PPI triangles have ~20 times higher overlap with MIPS complexes than second-level neighbors in PPINs without SDDIs. The biological interpretation for triangles is that an SDDI causes two proteins to be observed with common interaction partners in high-throughput experiments. The relatively few SDDIs overlapping with PPINs are part of highly connected SDDI components, and are more likely to be detected in experimental studies. We demonstrate the utility of PPI-SDDI-PPI triangles by reconstructing myosin-actin processes in the nucleus, cytoplasm, and cytoskeleton, which were not obvious in the original PPIN. Using other complementary datatypes in place of SDDIs to form triangles, such as PubMed co-occurrences or threading information, results in a similar ability to find protein complexes. Conclusion Given high-error PPINs with missing information, triangles of mixed datatypes are a promising direction for finding protein complexes. Integrating PPINs with SDDIs improves finding complexes. Structural SDDIs partially explain the high functional similarity of second-level neighbors in PPINs. We estimate that relatively little structural information would be sufficient for finding complexes involving most of the proteins and interactions in a typical PPIN.
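The triangle construction described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `find_triangles`, the edge-list inputs, and the toy protein labels are all hypothetical, and the sketch only assumes that a PPI-SDDI-PPI triangle consists of two proteins that share a PPI partner (second-level neighbors) and are also linked by an SDDI.

```python
from itertools import combinations


def find_triangles(ppi_edges, sddi_edges):
    """Return (a, b, hub) triples where a-hub and b-hub are PPIs and a-b is an SDDI.

    Hypothetical helper for illustration; inputs are iterables of 2-tuples.
    """
    # Build an undirected adjacency map from the PPI edge list.
    neighbors = {}
    for u, v in ppi_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Store SDDIs as unordered pairs so (a, b) and (b, a) match.
    sddi = {frozenset(edge) for edge in sddi_edges}

    triangles = set()
    for hub, nbrs in neighbors.items():
        # Any two distinct PPI partners of the same hub are second-level neighbors.
        for a, b in combinations(sorted(nbrs - {hub}), 2):
            if frozenset((a, b)) in sddi:
                triangles.add((a, b, hub))
    return triangles


# Toy data: A, B and D all interact with C; only A and B share an SDDI.
ppi = [("A", "C"), ("B", "C"), ("D", "C")]
sddi = [("B", "A")]
print(find_triangles(ppi, sddi))
```

On the toy data only the pair A, B closes a triangle through C, since the other second-level neighbor pairs lack SDDI support.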
Affiliation(s)
- Bill Andreopoulos
- Biotechnology Center (BIOTEC), Technische Universität Dresden, 01307 Dresden, Germany.
|
367
|
Friedel CC, Dölken L, Ruzsics Z, Koszinowski UH, Zimmer R. Conserved principles of mammalian transcriptional regulation revealed by RNA half-life. Nucleic Acids Res 2009; 37:e115. [PMID: 19561200 PMCID: PMC2761256 DOI: 10.1093/nar/gkp542] [Citation(s) in RCA: 168] [Impact Index Per Article: 11.2] [Indexed: 01/08/2023] Open
Abstract
RNA levels in a cell are regulated by the relative rates of RNA synthesis and decay. We recently developed a new approach for measuring both RNA synthesis and decay in a single experimental setting by biosynthetic labeling of newly transcribed RNA. Here, we show that this provides measurements of RNA half-lives from microarray data with unprecedented accuracy. Based on such measurements of RNA half-lives for human B-cells and mouse fibroblasts, we identified conserved regulatory principles for a large number of biological processes. We show that different regulatory patterns between functionally similar proteins are characterized by differences in the half-life of the corresponding transcripts and can be identified by measuring RNA half-life. We identify more than 100 protein families which show such differential regulatory patterns in both species. Additionally, we provide strong evidence that the activity of protein complexes consisting of subunits with overall long transcript half-lives can be regulated by transcriptional regulation of individual key subunits with short-lived transcripts. Based on this observation, we predict more than 100 key regulatory subunits for human complexes, of which 28% could be confirmed in mice (P < 10⁻⁹). Therefore, this atlas of transcript half-lives provides new fundamental insights into many cellular processes.
Affiliation(s)
- Caroline C. Friedel
- Institute for Informatics, Ludwig-Maximilians-Universität München, Munich 80333 and Max von Pettenkofer-Institute, Ludwig-Maximilians-Universität München, Munich 80337, Germany
- *To whom correspondence should be addressed. Tel: +49-89-2180-4056; Fax: +49-89-2180-4054; Correspondence may also be addressed to Dr Lars Dölken. Tel: +49-89-5160-5290; Fax: +49-89-5160-5292;
- Lars Dölken
- Institute for Informatics, Ludwig-Maximilians-Universität München, Munich 80333 and Max von Pettenkofer-Institute, Ludwig-Maximilians-Universität München, Munich 80337, Germany
- Zsolt Ruzsics
- Institute for Informatics, Ludwig-Maximilians-Universität München, Munich 80333 and Max von Pettenkofer-Institute, Ludwig-Maximilians-Universität München, Munich 80337, Germany
- Ulrich H. Koszinowski
- Institute for Informatics, Ludwig-Maximilians-Universität München, Munich 80333 and Max von Pettenkofer-Institute, Ludwig-Maximilians-Universität München, Munich 80337, Germany
- Ralf Zimmer
- Institute for Informatics, Ludwig-Maximilians-Universität München, Munich 80333 and Max von Pettenkofer-Institute, Ludwig-Maximilians-Universität München, Munich 80337, Germany
|
368
|
Blankenburg H, Finn RD, Prlić A, Jenkinson AM, Ramírez F, Emig D, Schelhorn SE, Büch J, Lengauer T, Albrecht M. DASMI: exchanging, annotating and assessing molecular interaction data. Bioinformatics 2009; 25:1321-8. [PMID: 19420069 PMCID: PMC2677739 DOI: 10.1093/bioinformatics/btp142] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION Ever-increasing amounts of biological interaction data are being accumulated worldwide, but they are currently not readily accessible to the biologist at a single site. New techniques are required for retrieving, sharing and presenting data spread over the Internet. RESULTS We introduce the DASMI system for the dynamic exchange, annotation and assessment of molecular interaction data. DASMI is based on the widely used Distributed Annotation System (DAS) and consists of a data exchange specification, web servers for providing the interaction data and clients for data integration and visualization. The decentralized architecture of DASMI affords the online retrieval of the most recent data from distributed sources and databases. DASMI can also be extended easily by adding new data sources and clients. We describe all DASMI components and demonstrate their use for protein and domain interactions. AVAILABILITY The DASMI tools are available at http://www.dasmi.de/ and http://ipfam.sanger.ac.uk/graph. The DAS registry and the DAS 1.53E specification are found at http://www.dasregistry.org/.
Affiliation(s)
- Hagen Blankenburg
- Max Planck Institute for Informatics, Campus E 1.4, 66123 Saarbrücken, Germany
|
369
|
Liao CS, Lu K, Baym M, Singh R, Berger B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 2009; 25:i253-8. [PMID: 19477996 PMCID: PMC2687957 DOI: 10.1093/bioinformatics/btp203] [Citation(s) in RCA: 173] [Impact Index Per Article: 11.5] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION With the increasing availability of large protein-protein interaction networks, the question of protein network alignment is becoming central to systems biology. Network alignment is further delineated into two sub-problems: local alignment, to find small conserved motifs across networks, and global alignment, which attempts to find a best mapping between all nodes of the two networks. In this article, our aim is to improve upon existing global alignment results. Better network alignment will enable, among other things, more accurate identification of functional orthologs across species. RESULTS We introduce IsoRankN (IsoRank-Nibble), a global multiple-network alignment tool based on spectral clustering on the induced graph of pairwise alignment scores. IsoRankN outperforms existing algorithms for global network alignment in coverage and consistency on multiple alignments of the five available eukaryotic networks. Being based on spectral methods, IsoRankN is both error tolerant and computationally efficient. AVAILABILITY Our software is available freely for non-commercial purposes on request from: http://isorank.csail.mit.edu/.
Affiliation(s)
- Chung-Shou Liao
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
|
370
|
Blankenburg H, Ramírez F, Büch J, Albrecht M. DASMIweb: online integration, analysis and assessment of distributed protein interaction data. Nucleic Acids Res 2009; 37:W122-8. [PMID: 19502495 PMCID: PMC2703953 DOI: 10.1093/nar/gkp438] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022] Open
Abstract
In recent years, we have witnessed a substantial increase in the amount of available protein interaction data. However, most data are currently not readily accessible to the biologist at a single site, but scattered over multiple online repositories. Therefore, we have developed the DASMIweb server that affords the integration, analysis and qualitative assessment of distributed sources of interaction data in a dynamic fashion. Since DASMIweb allows for querying many different resources of protein and domain interactions simultaneously, it serves as an important starting point for interactome studies and assists the user in finding publicly accessible interaction data with minimal effort. The pool of queried resources is fully configurable and supports the inclusion of the user's own interaction data or confidence scores. In particular, DASMIweb integrates confidence measures like functional similarity scores to assess individual interactions. The retrieved results can be exported in different file formats like MITAB or SIF. DASMIweb is freely available at http://www.dasmiweb.de.
Affiliation(s)
- Hagen Blankenburg
- Max Planck Institute for Informatics, Campus E1.4, 66123 Saarbrücken, Germany.
|
371
|
Dotan-Cohen D, Kasif S, Melkman AA. Seeing the forest for the trees: using the Gene Ontology to restructure hierarchical clustering. Bioinformatics 2009; 25:1789-95. [PMID: 19497934 PMCID: PMC2705235 DOI: 10.1093/bioinformatics/btp327] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022]
Abstract
Motivation: There is a growing interest in improving the cluster analysis of expression data by incorporating into it prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity. Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data. Contact: dotna@cs.bgu.ac.il Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dikla Dotan-Cohen
- Department of Computer Science, Ben-Gurion University, Beer Sheva, Israel 84105.
|
372
|
Merkl R, Wiezer A. GO4genome: a prokaryotic phylogeny based on genome organization. J Mol Evol 2009; 68:550-62. [PMID: 19436929 PMCID: PMC3085772 DOI: 10.1007/s00239-009-9233-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 06/27/2008] [Revised: 03/10/2009] [Accepted: 04/03/2009] [Indexed: 11/24/2022]
Abstract
Determining the phylogeny of closely related prokaryotes may fail in an analysis of rRNA or a small set of sequences. Whole-genome phylogeny utilizes the maximally available sample space. For a precise determination of genome similarity, two aspects have to be considered when developing an algorithm of whole-genome phylogeny: (1) gene order conservation is a more precise signal than gene content; and (2) when using sequence similarity, failures in identifying orthologues or the in situ replacement of genes via horizontal gene transfer may give misleading results. GO4genome is a new paradigm, which is based on a detailed analysis of gene function and the location of the respective genes. For characterization of genes, the algorithm uses gene ontology enabling a comparison of function independent of evolutionary relationship. After the identification of locally optimal series of gene functions, their length distribution is utilized to compute a phylogenetic distance. The outcome is a classification of genomes based on metabolic capabilities and their organization. Thus, the impact of effects on genome organization that are not covered by methods of molecular phylogeny can be studied. Genomes of strains belonging to Escherichia coli, Shigella, Streptococcus, Methanosarcina, and Yersinia were analyzed. Differences from the findings of classical methods are discussed.
Affiliation(s)
- Rainer Merkl
- Institut für Biophysik und Physikalische Biochemie, Universität Regensburg, 93040, Regensburg, Germany.
|
373
|
Chitale M, Hawkins T, Park C, Kihara D. ESG: extended similarity group method for automated protein function prediction. Bioinformatics 2009; 25:1739-45. [PMID: 19435743 DOI: 10.1093/bioinformatics/btp309] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION The importance of accurate automatic protein function prediction is ever increasing in the face of a large number of newly sequenced genomes and proteomics data that are awaiting biological interpretation. Conventional methods have focused on high sequence similarity-based annotation transfer, which relies on the concept of homology. However, many cases have been reported in which simple transfer of function from the top hits of a homology search causes erroneous annotation. New methods are required that handle sequence similarity in a more robust way, combining signals from strongly and weakly similar proteins to predict function for unknown proteins with high reliability. RESULTS We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned a probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We show how the statistical framework of ESG improves the prediction accuracy by iteratively taking into account the neighborhood of the query protein in the sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. The iterative search is found to be effective in capturing multiple domains in a query protein, enabling accurate prediction of several functions that originate from different domains. AVAILABILITY The ESG web server is available for automated protein function prediction at http://dragon.bio.purdue.edu/ESG/.
Affiliation(s)
- Meghana Chitale
- Department of Computer Science, Purdue University, IN 47907, USA
|
374
|
Dotan-Cohen D, Letovsky S, Melkman AA, Kasif S. Biological process linkage networks. PLoS One 2009; 4:e5313. [PMID: 19390589 PMCID: PMC2669181 DOI: 10.1371/journal.pone.0005313] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Received: 08/04/2008] [Accepted: 03/24/2009] [Indexed: 12/21/2022] Open
Abstract
Background The traditional approach to studying complex biological networks is based on the identification of interactions between internal components of signaling or metabolic pathways. By comparison, little is known about interactions between higher order biological systems, such as biological pathways and processes. We propose a methodology for gleaning patterns of interactions between biological processes by analyzing protein-protein interactions, transcriptional co-expression and genetic interactions. At the heart of the methodology are the concept of Linked Processes and the resultant network of biological processes, the Process Linkage Network (PLN). Results We construct, catalogue, and analyze different types of PLNs derived from different data sources and different species. When applied to the Gene Ontology, many of the resulting links connect processes that are distant from each other in the hierarchy, even though the connection makes eminent sense biologically. Some others, however, carry an element of surprise and may reflect mechanisms that are unique to the organism under investigation. In this aspect our method complements the link structure between processes inherent in the Gene Ontology, which by its very nature is species-independent. As a practical application of the linkage of processes we demonstrate that it can be effectively used in protein function prediction, having the power to increase both the coverage and the accuracy of predictions, when carefully integrated into prediction methods. Conclusions Our approach constitutes a promising new direction towards understanding the higher levels of organization of the cell as a system which should help current efforts to re-engineer ontologies and improve our ability to predict which proteins are involved in specific biological processes.
Affiliation(s)
- Dikla Dotan-Cohen
- Department of Computer Science, Ben-Gurion University, Beer Sheva, Israel.
|
375
|
Shin CJ, Wong S, Davis MJ, Ragan MA. Protein-protein interaction as a predictor of subcellular location. BMC Systems Biology 2009; 3:28. [PMID: 19243629 PMCID: PMC2663780 DOI: 10.1186/1752-0509-3-28] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.4] [Received: 12/31/2008] [Accepted: 02/25/2009] [Indexed: 11/10/2022]
Abstract
Background Many biological processes are mediated by dynamic interactions between and among proteins. In order to interact, two proteins must co-occur spatially and temporally. As protein-protein interactions (PPIs) and subcellular location (SCL) are discovered via separate empirical approaches, PPI and SCL annotations are independent and might complement each other in helping us to understand the role of individual proteins in cellular networks. We expect reliable PPI annotations to show that proteins interacting in vivo are co-located in the same cellular compartment. Our goal here is to evaluate the potential of using PPI annotation in determining SCL of proteins in human, mouse, fly and yeast, and to identify and quantify the factors that contribute to this complementarity. Results Using publicly available data, we evaluate the hypothesis that interacting proteins must be co-located within the same subcellular compartment. Based on a large, manually curated PPI dataset, we demonstrate that a substantial proportion of interacting proteins are in fact co-located. We develop an approach to predict the SCL of a protein based on the SCL of its interaction partners, given sufficient confidence in the interaction itself. The frequency of false positive PPIs can be reduced by use of six lines of supporting evidence, three based on type of recorded evidence (empirical approach, multiplicity of databases, and multiplicity of literature citations) and three based on type of biological evidence (inferred biological process, domain-domain interactions, and orthology relationships), with biological evidence more effective than recorded evidence. Our approach performs better than four existing prediction methods in identifying the SCL of membrane proteins, and as well as or better for soluble proteins. Conclusion Understanding cellular systems requires knowledge of the SCL of interacting proteins. We show how PPI data can be used more effectively to yield reliable SCL predictions for both soluble and membrane proteins. Scope exists for further improvement in our understanding of cellular function through consideration of the biological context of molecular interactions.
Affiliation(s)
- Chang Jin Shin
- The University of Queensland, Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, QLD, Australia.
|
376
|
Hawkins T, Chitale M, Luban S, Kihara D. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins 2009; 74:566-82. [PMID: 18655063 DOI: 10.1002/prot.22172] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Indexed: 11/12/2022]
Abstract
Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the malaria parasite (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/.
Affiliation(s)
- Troy Hawkins
- Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907, USA
|
377
|
Alexopoulou D, Andreopoulos B, Dietze H, Doms A, Gandon F, Hakenberg J, Khelif K, Schroeder M, Wächter T. Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy. BMC Bioinformatics 2009; 10:28. [PMID: 19159460 PMCID: PMC2663782 DOI: 10.1186/1471-2105-10-28] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 09/16/2008] [Accepted: 01/21/2009] [Indexed: 11/24/2022] Open
Abstract
Background Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. Results The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate. Conclusion Metadata is valuable for disambiguation, but requires high quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well structured ontologies can play a very important role in improving disambiguation. Availability The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
Affiliation(s)
- Dimitra Alexopoulou
- Biotechnology Center (BIOTEC), Technische Universität Dresden, 01062, Dresden, Germany.
|
378
|
Brisson L, Collard M. How to Semantically Enhance a Data Mining Process? ENTERP INF SYST-UK 2009. [DOI: 10.1007/978-3-642-00670-8_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
379
|
Ovaska K, Laakso M, Hautaniemi S. Fast gene ontology based clustering for microarray experiments. BioData Min 2008; 1:11. [PMID: 19025591 PMCID: PMC2613876 DOI: 10.1186/1756-0381-1-11] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.1] [Received: 04/08/2008] [Accepted: 11/21/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Analysis of a microarray experiment often results in a list of hundreds of disease-associated genes. In order to suggest common biological processes and functions for these genes, Gene Ontology annotations with statistical testing are widely used. However, these analyses can produce a very large number of significantly altered biological processes. Thus, it is often challenging to interpret GO results and identify novel testable biological hypotheses. RESULTS We present fast software for advanced gene annotation using semantic similarity for Gene Ontology terms combined with clustering and heat map visualisation. The methodology allows rapid identification of genes sharing the same Gene Ontology cluster. CONCLUSION Our R based semantic similarity open-source package has a speed advantage of over 2000-fold compared to existing implementations. From the resulting hierarchical clustering dendrogram genes sharing a GO term can be identified, and their differences in the gene expression patterns can be seen from the heat map. These methods facilitate advanced annotation of genes resulting from data analysis.
Affiliation(s)
- Kristian Ovaska
- Computational Systems Biology Laboratory, Institute of Biomedicine and Genome-Scale Biology Program, Biomedicum Helsinki, University of Helsinki, Finland
|
380
|
Xu T, Du L, Zhou Y. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinformatics 2008; 9:472. [PMID: 18986551 PMCID: PMC2612010 DOI: 10.1186/1471-2105-9-472] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.6] [Received: 03/18/2008] [Accepted: 11/06/2008] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Researchers interested in analysing the expression patterns of functionally related genes usually hope to improve the accuracy of their results beyond the boundaries of currently available experimental data. Gene ontology (GO) data provides a novel way to measure the functional relationship between gene products. Many approaches have been reported for calculating the similarities between two GO terms, known as semantic similarities. However, biologists are more interested in the relationship between gene products than in the scores linking the GO terms. To highlight the relationships among genes, recent studies have focused on functional similarities. RESULTS In this study, we evaluated five functional similarity methods using both protein-protein interaction (PPI) and expression data of S. cerevisiae. The receiver operating characteristics (ROC) and correlation coefficient analysis of these methods showed that the maximum method outperformed the other methods. Statistical comparison of multiple- and single-term annotated proteins in biological process ontology indicated that genes with multiple GO terms may be more reliable for separating true positives from noise. CONCLUSION This study demonstrated the reliability of current approaches that elevate the similarity of GO terms to the similarity of proteins. Suggestions for further improvements in functional similarity analysis are also provided.
Affiliation(s)
- Tao Xu
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Chinese National Human Genome Center, Shanghai, PR China.
|
381
|
Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G. Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008; 24:i126-132. [PMID: 18689813 DOI: 10.1093/bioinformatics/btn299] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words. RESULTS We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes. AVAILABILITY A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. GNAT will also be available within the BioCreativeMetaService project, see http://bcms.bioinfo.cnio.es. SUPPLEMENTARY INFORMATION The test data set, lexica, and links to external data are available at http://cbioc.eas.asu.edu/gnat/
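The F-measure quoted in the abstract above can be checked from the reported precision and recall. The short Python sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state the weighting, but the reported numbers are consistent with it. The function name `f_measure` and its `beta` parameter are illustrative, not part of GNAT.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is retrieved correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for the 13-species benchmark.
print(round(100 * f_measure(0.908, 0.738), 1))  # prints 81.4
```

With 90.8% precision and 73.8% recall, the balanced F-measure comes to 81.4%, matching the value reported for the inter-species benchmark.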
Collapse
Affiliation(s)
- Jörg Hakenberg
- Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA.
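The F-measure quoted in the GNAT abstract can be checked from its stated precision and recall. The sketch below assumes the balanced F1 definition (the harmonic mean of precision and recall), which the abstract does not spell out; the function name `f_measure` and the `beta` parameter are illustrative and not part of GNAT.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract: 90.8% precision at 73.8% recall.
gnat_f1 = f_measure(0.908, 0.738)
print(f"{gnat_f1:.1%}")  # prints 81.4%
```

Under that assumption the computed value agrees with the reported 81.4%.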
|
382
|
Chagoyen M, Carazo JM, Pascual-Montano A. Assessment of protein set coherence using functional annotations. BMC Bioinformatics 2008; 9:444. [PMID: 18937846 PMCID: PMC2588600 DOI: 10.1186/1471-2105-9-444] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/08/2008] [Accepted: 10/20/2008] [Indexed: 11/23/2022] Open
Abstract
Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at
|
383
|
Schelhorn SE, Lengauer T, Albrecht M. An integrative approach for predicting interactions of protein regions. Bioinformatics 2008; 24:i35-41. [PMID: 18689837 DOI: 10.1093/bioinformatics/btn290] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION Protein-protein interactions are commonly mediated by the physical contact of distinct protein regions. Computational identification of interacting protein regions aids in the detailed understanding of protein networks and supports the prediction of novel protein interactions and the reconstruction of protein complexes. RESULTS We introduce an integrative approach for predicting protein region interactions using a probabilistic model fitted to an observed protein network. In particular, we consider globular domains, short linear motifs and coiled-coil regions as potential protein-binding regions. Possible cooperations between multiple regions within the same protein are taken into account. A fine-grained confidence system allows for varying the impact of specific protein interactions and region annotations on the modeling process. We apply our prediction approach to a large training set using a maximum likelihood method, compare different scoring functions for region interactions and validate the predicted interactions against a collection of experimentally observed interactions. In addition, we analyze prediction performance with respect to the inclusion of different region types, the incorporation of confidence values for training data and the utilization of predicted protein interactions. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
384
|
Hakenberg J, Plake C, Royer L, Strobelt H, Leser U, Schroeder M. Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biol 2008; 9 Suppl 2:S14. [PMID: 18834492 PMCID: PMC2559985 DOI: 10.1186/gb-2008-9-s2-s14] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins. RESULTS We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask. CONCLUSION For gene mention normalization, our approach outperforms strategies that utilize only the matching of gene names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages. AVAILABILITY Our methods for gene, protein, and species identification, and extraction of protein-protein interactions are available as part of the BioCreative Meta Services (BCMS), see http://bcms.bioinfo.cnio.es/.
Affiliation(s)
- Jörg Hakenberg
- Biotechnological Centre, Technische Universität Dresden, Dresden, Germany.
|
385
|
Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics 2008; 9:327. [PMID: 18680592 PMCID: PMC2518162 DOI: 10.1186/1471-2105-9-327] [Citation(s) in RCA: 143] [Impact Index Per Article: 8.9] [Received: 02/22/2008] [Accepted: 08/04/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The availability of various high-throughput experimental and computational methods allows biologists to rapidly infer functional relationships between genes. It is often necessary to evaluate these predictions computationally, a task that requires a reference database for functional relatedness. One such reference is the Gene Ontology (GO). A number of groups have suggested that the semantic similarity of the GO annotations of genes can serve as a proxy for functional relatedness. Here we evaluate a simple measure of semantic similarity, term overlap (TO). RESULTS We computed the TO for randomly selected gene pairs from the mouse genome. For comparison, we implemented six previously reported semantic similarity measures that share the feature of using computation of probabilities of terms to infer information content, in addition to three vector based approaches and a normalized version of the TO measure. We find that the overlap measure is highly correlated with the others but differs in detail. TO is at least as good a predictor of sequence similarity as the other measures. We further show that term overlap may avoid some problems that affect the probability-based measures. Term overlap is also much faster to compute than the information content-based measures. CONCLUSION Our experiments suggest that term overlap can serve as a simple and fast alternative to other approaches which use explicit information content estimation or require complex pre-calculations, while also avoiding problems that some other measures may encounter.
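The term overlap (TO) measure described in this abstract is simple enough to sketch. The code below is a minimal illustration, not the authors' implementation: it assumes each gene's annotation set has already been propagated up the GO graph so that it contains all ancestor terms, and it assumes the normalized variant divides by the size of the smaller annotation set. The term identifiers are hypothetical placeholders.

```python
from typing import Set


def term_overlap(terms_a: Set[str], terms_b: Set[str]) -> int:
    """Number of GO terms shared by two genes."""
    return len(terms_a & terms_b)


def normalized_term_overlap(terms_a: Set[str], terms_b: Set[str]) -> float:
    """Term overlap divided by the size of the smaller annotation set
    (an assumed normalization; returns 0.0 if either set is empty)."""
    smaller = min(len(terms_a), len(terms_b))
    if smaller == 0:
        return 0.0
    return term_overlap(terms_a, terms_b) / smaller


# Hypothetical annotation sets, already including ancestor terms.
gene_a = {"GO:A", "GO:B", "GO:C", "GO:D"}
gene_b = {"GO:B", "GO:C", "GO:E"}

to = term_overlap(gene_a, gene_b)              # 2 shared terms
nto = normalized_term_overlap(gene_a, gene_b)  # 2 / 3
```

Because the computation is only a set intersection, it needs no term probabilities or other pre-calculation, which is consistent with the speed advantage the abstract reports.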
|
386
|
Krumsiek J, Friedel CC, Zimmer R. ProCope--protein complex prediction and evaluation. Bioinformatics 2008; 24:2115-6. [PMID: 18635566 DOI: 10.1093/bioinformatics/btn376] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Indexed: 01/27/2023] Open
Abstract
SUMMARY Recent advances in high-throughput technology have increased the quantity of available data on protein complexes and stimulated the development of many new prediction methods. In this article, we present ProCope, a Java software suite for the prediction and evaluation of protein complexes from affinity purification experiments which integrates the major methods for calculating interaction scores and predicting protein complexes published over the last years. Methods can be accessed via a graphical user interface, command line tools and a Java API. Using ProCope, existing algorithms can be applied quickly and reproducibly on new experimental results, individual steps of the different algorithms can be combined in new and innovative ways and new methods can be implemented and integrated in the existing prediction framework. AVAILABILITY Source code and executables are available at http://www.bio.ifi.lmu.de/Complexes/ProCope/.
Affiliation(s)
- Jan Krumsiek
- Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 München, Germany
|
387
|
Pesquita C, Faria D, Bastos H, Ferreira AEN, Falcão AO, Couto FM. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 2008; 9 Suppl 5:S4. [PMID: 18460186 PMCID: PMC2367622 DOI: 10.1186/1471-2105-9-s5-s4] [Citation(s) in RCA: 191] [Impact Index Per Article: 11.9] [Indexed: 11/10/2022] Open
Abstract
Background Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue is whether electronic annotations should be used in semantic similarity calculations. Results We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation. Conclusions This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.
Affiliation(s)
- Catia Pesquita
- XLDB, Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Campo Grande-Edifício C6, Lisboa, Portugal.
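The abstract's point about combination approaches can be illustrated with a toy example. The sketch below assumes a precomputed matrix of term-to-term similarities (rows are the terms of one protein, columns the terms of the other) and takes the best-match average to be the mean of the average row maximum and the average column maximum. The function names and matrices are illustrative and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def combine_average(sim: Matrix) -> float:
    """Mean over all term pairs."""
    return sum(sum(row) for row in sim) / (len(sim) * len(sim[0]))


def combine_maximum(sim: Matrix) -> float:
    """Single best term pair."""
    return max(max(row) for row in sim)


def combine_best_match_average(sim: Matrix) -> float:
    """Mean of each term's best match, averaged over both directions."""
    row_best = [max(row) for row in sim]
    col_best = [max(col) for col in zip(*sim)]
    return (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2


# A protein with two unrelated terms compared with itself.
self_sim = [[1.0, 0.0],
            [0.0, 1.0]]

# Two proteins that share one term and each carry one unrelated term.
partial = [[1.0, 0.0],
           [0.0, 0.0]]

print(combine_average(self_sim),             # 0.5
      combine_maximum(self_sim),             # 1.0
      combine_best_match_average(self_sim))  # 1.0
```

In this toy case the average gives an identical pair only 0.5, and the value falls further as more unrelated terms are added, while the maximum gives the partially matching pair the same 1.0 as the identical one. The best-match average separates the two cases (1.0 versus 0.5).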
|
388
|
|
389
|
Protein function assignment through mining cross-species protein-protein interactions. PLoS One 2008; 3:e1562. [PMID: 18253506 PMCID: PMC2216687 DOI: 10.1371/journal.pone.0001562] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 09/21/2007] [Accepted: 01/12/2008] [Indexed: 11/19/2022] Open
Abstract
Background As we move into the post genome-sequencing era, an immediate challenge is how to make best use of the large amount of high-throughput experimental data to assign functions to currently uncharacterized proteins. We here describe CSIDOP, a new method for protein function assignment based on shared interacting domain patterns extracted from cross-species protein-protein interaction data. Methodology/Principal Findings The proposed method is assessed both biologically and statistically over the genome of H. sapiens. The CSIDOP method is capable of predicting protein function with an accuracy of 95.42% using 2,972 gene ontology (GO) functional categories. In addition, we are able to assign novel functional annotations for 181 previously uncharacterized proteins in H. sapiens. Furthermore, we demonstrate that for proteins that are characterized by GO, the CSIDOP may predict extra functions. This is attractive as a protein normally executes a variety of functions in different processes and its current GO annotation may be incomplete. Conclusions/Significance It can be shown through experimental results that the CSIDOP method is reliable and practical in use. The method will continue to improve as more high-quality interaction data become available and is readily scalable to a genome-wide application.
|
390
|
del Pozo A, Pazos F, Valencia A. Defining functional distances over gene ontology. BMC Bioinformatics 2008; 9:50. [PMID: 18221506 PMCID: PMC2375122 DOI: 10.1186/1471-2105-9-50] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Received: 07/11/2007] [Accepted: 01/25/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A fundamental problem when trying to define the functional relationships between proteins is the difficulty in quantifying functional similarities, even when well-structured ontologies exist regarding the activity of proteins (i.e. the 'gene ontology', GO). However, functional metrics can overcome the problems in comparing and evaluating functional assignments and predictions. As a reference of proximity, previous approaches to compare GO terms considered linkage in terms of ontology weighted by a probability distribution that balances the non-uniform 'richness' of different parts of the Directed Acyclic Graph. Here, we have followed a different approach to quantify functional similarities between GO terms. RESULTS We propose a new method to derive 'functional distances' between GO terms that is based on the simultaneous occurrence of terms in the same set of Interpro entries, instead of relying on the structure of the GO. The coincidence of GO terms reveals natural biological links between the GO functions and defines a distance model Df which fulfils the properties of a Metric Space. The distances obtained in this way can be represented as a hierarchical 'Functional Tree'. CONCLUSION The method proposed provides a new definition of distance that enables the similarity between GO terms to be quantified. Additionally, the 'Functional Tree' defines groups with biological meaning, enhancing its utility for protein function comparison and prediction. Finally, this approach could be used for function-based protein searches in databases, and for analysing the gene clusters produced by DNA array experiments.
Affiliation(s)
- Angela del Pozo
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernandez Almagro, 3, E-28029 Madrid, Spain.
|
391
|
|
392
|
Liu ZP, Wu LY, Wang Y, Chen L, Zhang XS. Predicting gene ontology functions from protein's regional surface structures. BMC Bioinformatics 2007; 8:475. [PMID: 18070366 PMCID: PMC2233648 DOI: 10.1186/1471-2105-8-475] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/27/2007] [Accepted: 12/11/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Annotation of protein functions is an important task in the post-genomic era. Most early approaches for this task exploit only the sequence or global structure information. However, protein surfaces are believed to be crucial to protein functions because they are the main interfaces to facilitate biological interactions. Recently, several databases related to structural surfaces, such as pockets and cavities, have been constructed with a comprehensive library of identified surface structures. For example, CASTp provides identification and measurements of surface accessible pockets as well as interior inaccessible cavities. RESULTS A novel method was proposed to predict the Gene Ontology (GO) functions of proteins from the pocket similarity network, which is constructed according to the structure similarities of pockets. The statistics of the networks were presented to explore the relationship between the similar pockets and GO functions of proteins. Cross-validation experiments were conducted to evaluate the performance of the proposed method. Results and codes are available at: http://zhangroup.aporc.org/bioinfo/PSN/. CONCLUSION The computational results demonstrate that the proposed method based on the pocket similarity network is effective and efficient for predicting GO functions of proteins in terms of both computational complexity and prediction accuracy. The proposed method revealed a strong relationship between small surface patterns (or pockets) and GO functions, which can be further used to identify active sites or functional motifs. The high-quality performance of the prediction method together with the statistics also indicates that pockets play essential roles in biological interactions or the GO functions. Moreover, in addition to pockets, the proposed network framework can also be used for adopting other protein spatial surface patterns to predict the protein functions.
Affiliation(s)
- Zhi-Ping Liu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China.
|
393
|
Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007; 8:995-1005. [PMID: 18037900 DOI: 10.1038/nrm2281] [Citation(s) in RCA: 354] [Impact Index Per Article: 20.8] [Indexed: 11/09/2022]
|
394
|
Abstract
Functional similarity based on Gene Ontology (GO) annotation is used in diverse applications like gene clustering, gene expression data analysis, protein interaction prediction and evaluation. However, there exists no comprehensive resource of functional similarity values although such a database would facilitate the use of functional similarity measures in different applications. Here, we describe FunSimMat (Functional Similarity Matrix, http://funsimmat.bioinf.mpi-inf.mpg.de/), a large new database that provides several different semantic similarity measures for GO terms. It offers various precomputed functional similarity values for proteins contained in UniProtKB and for protein families in Pfam and SMART. The web interface allows users to efficiently perform both semantic similarity searches with GO terms and functional similarity searches with proteins or protein families. All results can be downloaded in tab-delimited files for use with other tools. An additional XML–RPC interface gives automatic online access to FunSimMat for programs and remote services.
Affiliation(s)
- Andreas Schlicker
- Max Planck Institute for Informatics, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany.
|
395
|
Ramírez F, Schlicker A, Assenov Y, Lengauer T, Albrecht M. Computational analysis of human protein interaction networks. Proteomics 2007; 7:2541-52. [PMID: 17647236 DOI: 10.1002/pmic.200600924] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.4] [Indexed: 11/08/2022]
Abstract
Large amounts of human protein interaction data have been produced by experiments and prediction methods. However, the experimental coverage of the human interactome is still low in contrast to predicted data. To gain insight into the value of publicly available human protein network data, we compared predicted datasets, high-throughput results from yeast two-hybrid screens, and literature-curated protein-protein interactions. This evaluation is not only important for further methodological improvements, but also for increasing the confidence in functional hypotheses derived from predictions. Therefore, we assessed the quality and the potential bias of the different datasets using functional similarity based on the Gene Ontology, structural iPfam domain-domain interactions, likelihood ratios, and topological network parameters. This analysis revealed major differences between predicted datasets, but some of them also scored at least as high as the experimental ones regarding multiple quality measures. Therefore, since only small pairwise overlap between most datasets is observed, they may be combined to enlarge the available human interactome data. For this purpose, we additionally studied the influence of protein length on data quality and the number of disease proteins covered by each dataset. We could further demonstrate that protein interactions predicted by more than one method achieve an elevated reliability.
Affiliation(s)
- Fidel Ramírez
- Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
|
396
|
Chabalier J, Mosser J, Burgun A. A transversal approach to predict gene product networks from ontology-based similarity. BMC Bioinformatics 2007; 8:235. [PMID: 17605807 PMCID: PMC1940024 DOI: 10.1186/1471-2105-8-235] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Received: 11/10/2006] [Accepted: 07/02/2007] [Indexed: 01/19/2023] Open
Abstract
Background Interpretation of transcriptomic data is usually made through a "standard" approach which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to underline functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and expression data. Results The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks that proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity. Conclusion Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity and expression data.
Affiliation(s)
- Julie Chabalier
- E.A 3888, Modélisation Conceptuelle des Connaissances Biomédicales, Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 Rennes Cedex, France
- Jean Mosser
- CNRS UMR 6061 Génétique et Développement, Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 Rennes Cedex, France
- OUEST-genopole transcriptomic platform, Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 Rennes, France
- Anita Burgun
- E.A 3888, Modélisation Conceptuelle des Connaissances Biomédicales, Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 Rennes Cedex, France
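The Vector Space Model comparison described in this abstract can be sketched as a cosine similarity between weighted annotation vectors. The abstract does not state which weighting scheme is used, so the inverse-document-frequency weights below are an assumption made for illustration, and the gene and term names are hypothetical.

```python
import math
from typing import Dict, Set


def idf_weights(annotations: Dict[str, Set[str]]) -> Dict[str, float]:
    """Assumed weighting: log(N / n_t), so rarer terms weigh more."""
    n_genes = len(annotations)
    counts: Dict[str, int] = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: math.log(n_genes / c) for term, c in counts.items()}


def cosine_similarity(terms_a: Set[str], terms_b: Set[str],
                      weights: Dict[str, float]) -> float:
    """Cosine of the angle between two weighted annotation vectors."""
    dot = sum(weights[t] ** 2 for t in terms_a & terms_b)
    norm_a = math.sqrt(sum(weights[t] ** 2 for t in terms_a))
    norm_b = math.sqrt(sum(weights[t] ** 2 for t in terms_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical annotations: "t_common" annotates every gene product.
annotations = {
    "g1": {"t_common", "t_x"},
    "g2": {"t_common", "t_x"},
    "g3": {"t_common", "t_y"},
}
weights = idf_weights(annotations)

sim_12 = cosine_similarity(annotations["g1"], annotations["g2"], weights)
sim_13 = cosine_similarity(annotations["g1"], annotations["g3"], weights)
```

With this weighting a term that annotates every gene product gets weight zero, so g1 and g3, which share only that term, score 0, while g1 and g2 score 1. Computing the score for every pair yields the matrix of gene product similarities that the abstract mentions.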
|
397
|
Schlicker A, Huthmacher C, Ramírez F, Lengauer T, Albrecht M. Functional evaluation of domain-domain interactions and human protein interaction networks. Bioinformatics 2007; 23:859-65. [PMID: 17456608 DOI: 10.1093/bioinformatics/btm012] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Large amounts of protein and domain interaction data are being produced by experimental high-throughput techniques and computational approaches. To gain insight into the value of the provided data, we used our new similarity measure based on the Gene Ontology (GO) to evaluate the molecular functions and biological processes of interacting proteins or domains. The applied measure particularly addresses the frequent annotation of proteins or domains with multiple GO terms. RESULTS Using our similarity measure, we compare predicted domain-domain and human protein-protein interactions with experimentally derived interactions. The results show that our similarity measure is of significant benefit in quality assessment and confidence ranking of domain and protein networks. We also derive useful confidence score thresholds for dividing domain interaction predictions into subsets of low and high confidence. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andreas Schlicker
- Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
|
398
|
Schlicker A, Rahnenführer J, Albrecht M, Lengauer T, Domingues FS. GOTax: investigating biological processes and biochemical activities along the taxonomic tree. Genome Biol 2007; 8:R33. [PMID: 17346342 PMCID: PMC1868936 DOI: 10.1186/gb-2007-8-3-r33] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 10/23/2006] [Revised: 01/18/2007] [Accepted: 03/08/2007] [Indexed: 11/10/2022] Open
Abstract
We describe GOTax, a comparative genomics platform that integrates protein annotation with protein family classification and taxonomy. User-defined sets of proteins, protein families, annotation terms or taxonomic groups can be selected and compared, allowing for the analysis of distribution of biological processes and molecular activities over different taxonomic groups. In particular, a measure of functional similarity is available for comparing proteins and protein families, establishing functional relationships independent of evolution.
Affiliation(s)
- Andreas Schlicker
- Department of Computational Biology and Applied Algorithmics, Max-Planck-Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
- Jörg Rahnenführer
- Department of Computational Biology and Applied Algorithmics, Max-Planck-Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
- Mario Albrecht
- Department of Computational Biology and Applied Algorithmics, Max-Planck-Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
- Thomas Lengauer
- Department of Computational Biology and Applied Algorithmics, Max-Planck-Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
- Francisco S Domingues
- Department of Computational Biology and Applied Algorithmics, Max-Planck-Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
|