1
Tian Z, Fang H, Teng Z, Ye Y. GOGCN: Graph Convolutional Network on Gene Ontology for Functional Similarity Analysis of Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1053-1064. [PMID: 35687647 DOI: 10.1109/tcbb.2022.3181300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The measurement of gene functional similarity plays a critical role in numerous biological applications, such as gene clustering and the construction of gene similarity networks. However, most existing approaches still rely heavily on traditional computational strategies, which are not guaranteed to achieve satisfactory performance. In this study, we propose a novel computational approach called GOGCN to measure gene functional similarity by modeling the Gene Ontology (GO) through a Graph Convolutional Network (GCN). GOGCN is a graph-based approach that performs sufficient representation learning for terms and relations in the GO graph. First, GOGCN employs a GCN-based knowledge graph embedding (KGE) model to learn vector representations (i.e., embeddings) for all entities (i.e., terms). Second, GOGCN calculates the semantic similarity between two terms based on their corresponding vector representations. Finally, GOGCN estimates gene functional similarity by making use of the pair-wise strategy. During the representation learning period, GOGCN promotes semantic interaction between terms through GCN, thereby capturing the rich structural information of the GO graph. Further experimental results on various datasets suggest that GOGCN is superior to the other state-of-the-art approaches, which shows its reliability and effectiveness.
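The last two steps described in this abstract (term similarity from embeddings, then a pair-wise aggregation to the gene level) can be illustrated with a minimal Python sketch. It assumes cosine similarity between term vectors and a best-match average over term pairs; the abstract names neither choice, and the term IDs and 2-D vectors below are made-up placeholders rather than GCN output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match_average(terms_a, terms_b, emb):
    """Pair-wise gene similarity (assumed strategy): each term is matched
    with its most similar term in the other gene's annotation set, and the
    best-match scores from both directions are averaged."""
    row_best = [max(cosine(emb[s], emb[t]) for t in terms_b) for s in terms_a]
    col_best = [max(cosine(emb[s], emb[t]) for s in terms_a) for t in terms_b]
    return (sum(row_best) + sum(col_best)) / (len(terms_a) + len(terms_b))

# Hypothetical term embeddings; a real pipeline would learn these on the GO graph.
emb = {
    "GO:A": [1.0, 0.0],
    "GO:B": [0.0, 1.0],
    "GO:C": [1.0, 1.0],
}
gene1 = ["GO:A", "GO:C"]  # annotation set of a first hypothetical gene
gene2 = ["GO:A"]          # annotation set of a second hypothetical gene
score = best_match_average(gene1, gene2, emb)
```

Other aggregation rules, such as taking the maximum over all term pairs, fit the same interface by swapping the function that combines the term-level scores.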
2
Bandyopadhyay SS, Halder AK, Saha S, Chatterjee P, Nasipuri M, Basu S. Assessment of GO-Based Protein Interaction Affinities in the Large-Scale Human-Coronavirus Family Interactome. Vaccines (Basel) 2023; 11:549. [PMID: 36992133 PMCID: PMC10059867 DOI: 10.3390/vaccines11030549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 02/19/2023] [Accepted: 02/23/2023] [Indexed: 03/03/2023]
Abstract
SARS-CoV-2 is a novel coronavirus that replicates itself via interacting with the host proteins. As a result, identifying virus and host protein-protein interactions could help researchers better understand the virus disease transmission behavior and identify possible COVID-19 drugs. The International Committee on Virus Taxonomy has determined that nCoV is genetically 89% similar to the SARS-CoV of the 2003 epidemic. This paper focuses on assessing the host-pathogen protein interaction affinity of the coronavirus family, which has 44 different variants. In light of these considerations, a GO-semantic scoring function is provided based on Gene Ontology (GO) graphs for determining the binding affinity of any two proteins at the organism level. Based on the availability of the GO annotation of the proteins, 11 viral variants, viz., SARS-CoV-2, SARS, MERS, Bat coronavirus HKU3, Bat coronavirus Rp3/2004, Bat coronavirus HKU5, Murine coronavirus, Bovine coronavirus, Rat coronavirus, Bat coronavirus HKU4, Bat coronavirus 133/2005, are considered from 44 viral variants. The fuzzy scoring function of the entire host-pathogen network has been processed with ~180 million potential interactions generated from 19,281 host proteins and around 242 viral proteins. ~4.5 million potential level one host-pathogen interactions are computed based on the estimated interaction affinity threshold. The resulting host-pathogen interactome is also validated with state-of-the-art experimental networks. The study has also been extended further toward the drug-repurposing study by analyzing the FDA-listed COVID drugs.
Affiliation(s)
- Soumyendu Sekhar Bandyopadhyay
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
  - Department of Computer Science and Engineering, School of Engineering and Technology, Adamas University, Kolkata 700126, India
- Anup Kumar Halder
  - Faculty of Mathematics and Information Sciences, Warsaw University of Technology, 00-662 Warsaw, Poland
- Sovan Saha
  - Department of Computer Science and Engineering (Artificial Intelligence and Machine Learning), Techno Main Salt Lake, Sector V, Kolkata 700091, India
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata 700152, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
3
Zhang J, Zhu M, Qian Y. protein2vec: Predicting Protein-Protein Interactions Based on LSTM. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1257-1266. [PMID: 32750870 DOI: 10.1109/tcbb.2020.3003941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
The semantic similarity of gene ontology (GO) terms is widely used to predict protein-protein interactions (PPIs). The traditional semantic similarity measures are based mainly on manually crafted features, which may ignore some important hidden information of the gene ontology. Moreover, those methods usually obtain the similarity between proteins from the similarity between GO terms by some simple statistical rules, such as MAX and BMA (best-match average), oversimplifying the possibly complex relationship between the proteins and the GO terms annotated with them. To overcome these two deficiencies, we propose a new method named protein2vec, which characterizes a protein with a vector based on the GO terms annotated to it and combines the information of both the GO and known PPIs. We first apply a network embedding algorithm to the GO network to generate a feature vector for each GO term. Then, Long Short-Term Memory (LSTM) encodes the feature vectors of the GO terms annotated with a protein into another vector (called the protein vector). Finally, two protein vectors are forwarded into a feedforward neural network to predict the interaction between the two corresponding proteins. The experimental results show that protein2vec outperforms almost all commonly used traditional semantic similarity methods.
4
Tian Z, Fang H, Ye Y, Zhu Z. A novel gene functional similarity calculation model by utilizing the specificity of terms and relationships in gene ontology. BMC Bioinformatics 2022; 23:47. [PMID: 35057740 PMCID: PMC8772239 DOI: 10.1186/s12859-022-04557-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2021] [Accepted: 01/03/2022] [Indexed: 11/18/2022]
Abstract
Background Recently, with the foundation and development of gene ontology (GO) resources, numerous works have been proposed to compute functional similarity of genes and have achieved a series of successes in some research fields. Focusing on the calculation of the information content (IC) of terms is the main idea of these methods, which is essential for measuring functional similarity of genes. However, most approaches have some deficiencies, especially when measuring the IC of both GO terms and their corresponding annotated term sets. As a result, measuring functional similarity of genes accurately is still challenging. Results In this article, we propose a novel gene functional similarity calculation method, which especially encapsulates the specificity of terms and edges (STE). The proposed method mainly contains three steps. Firstly, a novel computing model is put forward to compute the IC of terms. This model has the ability to exploit the specific structural information of GO terms. Secondly, the IC of term sets is computed by capturing the genetic structure between the terms contained in the set. Lastly, we measure the gene functional similarity according to the IC overlap ratio of the corresponding annotated gene sets. The proposed method accurately measures the IC of not only GO terms but also the annotated term sets by leveraging the specificity of edges in the GO graph. Conclusions We conduct experiments on gene functional classification in biological pathways, gene expression datasets, and protein-protein interaction datasets. Extensive experimental results show that the proposed STE performs better than several baseline methods.
5
Paul M, Anand A. A New Family of Similarity Measures for Scoring Confidence of Protein Interactions Using Gene Ontology. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:19-30. [PMID: 34029194 DOI: 10.1109/tcbb.2021.3083150] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
Large-scale protein-protein interaction (PPI) data has the potential to play a significant role in the endeavor of understanding cellular processes. However, the presence of a considerable fraction of false positives is a bottleneck in realizing this potential. There have been continuous efforts to utilize complementary resources for scoring confidence of PPIs in such a manner that false positive interactions get a low confidence score. Gene Ontology (GO), a taxonomy of biological terms to represent the properties of gene products and their relations, has been widely used for this purpose. We utilize GO to introduce a new set of specificity measures: Relative Depth Specificity (RDS), Relative Node-based Specificity (RNS), and Relative Edge-based Specificity (RES), leading to a new family of similarity measures. We use these similarity measures to obtain a confidence score for each PPI. We evaluate the new measures using four different benchmarks. We show that all three measures are quite effective. Notably, RNS and RES more effectively distinguish true PPIs from false positives than the existing alternatives. RES also shows a robust set-discriminating power and can be useful for protein functional clustering as well.
6
Queirós P, Novikova P, Wilmes P, May P. Unification of functional annotation descriptions using text mining. Biol Chem 2021; 402:983-990. [PMID: 33984880 DOI: 10.1515/hsz-2021-0125] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/27/2021] [Accepted: 05/03/2021] [Indexed: 02/06/2023]
Abstract
A common approach to genome annotation involves the use of homology-based tools for the prediction of the functional role of proteins. The quality of functional annotations is dependent on the reference data used; as such, choosing the appropriate sources is crucial. Unfortunately, no single reference data source can be universally considered the gold standard, thus using multiple references could potentially increase annotation quality and coverage. However, this comes with challenges, particularly due to the introduction of redundant and exclusive annotations. Through text mining it is possible to identify highly similar functional descriptions, thus strengthening the confidence of the final protein functional annotation and providing a redundancy-free output. Here we present UniFunc, a text mining approach that is able to detect similar functional descriptions with high precision. UniFunc was built as a small module and can be independently used or integrated into protein function annotation pipelines. By removing the need to individually analyse and compare annotation results, UniFunc streamlines the complementary use of multiple reference datasets.
Affiliation(s)
- Paul Wilmes
  - Systems Ecology, Esch-sur-Alzette, Luxembourg
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 4362, Esch-sur-Alzette, Luxembourg
7
Queirós P, Delogu F, Hickl O, May P, Wilmes P. Mantis: flexible and consensus-driven genome annotation. Gigascience 2021; 10:giab042. [PMID: 34076241 PMCID: PMC8170692 DOI: 10.1093/gigascience/giab042] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 11/04/2020] [Revised: 03/22/2021] [Accepted: 05/14/2021] [Indexed: 12/22/2022]
Abstract
BACKGROUND The rapid development of the (meta-)omics fields has produced an unprecedented amount of high-resolution and high-fidelity data. Through the use of these datasets we can infer the role of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation can be described as the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, challenges remain in terms of speed, flexibility, and reproducibility. In the big data era, it is also increasingly important to cease limiting our findings to a single reference, coalescing knowledge from different data sources, and thus overcoming some limitations in overly relying on computationally generated data from single sources. RESULTS We implemented a protein annotation tool, Mantis, which uses database identifiers intersection and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for the customization of reference data and execution parameters, and is reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which significantly improved annotation performance compared to sequence-wide annotation. The parallelized implementation of Mantis results in short runtimes while also outputting high coverage and high-quality protein function annotations. CONCLUSIONS Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis.
Affiliation(s)
- Pedro Queirós
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Francesco Delogu
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Oskar Hickl
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Paul Wilmes
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
8
Nepomuceno-Chamorro IA, Nepomuceno JA, Galván-Rojas JL, Vega-Márquez B, Rubio-Escudero C. Using prior knowledge in the inference of gene association networks. APPL INTELL 2020. [DOI: 10.1007/s10489-020-01705-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
9
Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures. BIOMED RESEARCH INTERNATIONAL 2019; 2019:6750296. [PMID: 30809545 PMCID: PMC6369486 DOI: 10.1155/2019/6750296] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/12/2018] [Accepted: 01/13/2019] [Indexed: 11/30/2022]
Abstract
In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applied SSM measures to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with an increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
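The idea of scoring many term pairs concurrently can be sketched in Python for the Resnik measure, where the similarity of two terms is the information content, -log p(c), of their most informative common ancestor c. This is only an illustration under stated assumptions: the five-term ontology, the probabilities, and the function names are hypothetical, and the abstract does not describe how the authors implemented their threads.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Toy ontology: child -> parents (hypothetical terms, not real GO IDs).
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "a1": ["a"], "a2": ["a"]}
# Hypothetical annotation probabilities p(t); the root covers every annotation.
PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 0.25, "a2": 0.25}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def resnik(pair):
    """Resnik similarity: IC of the most informative common ancestor."""
    term1, term2 = pair
    common = ancestors(term1) & ancestors(term2)
    return max(-math.log(PROB[c]) for c in common)

def resnik_threaded(pairs, workers=4):
    """Score term pairs on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(resnik, pairs))

pairs = [("a1", "a2"), ("a1", "b"), ("a1", "a1")]
scores = resnik_threaded(pairs)
```

In CPython, threads running pure-Python arithmetic like this are limited by the global interpreter lock, so the sketch shows the partitioning pattern rather than a guaranteed speedup; a process pool or a distributed framework would be the usual substitute for CPU-bound scoring.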
10
Zhou S, Kang H, Yao B, Gong Y. An automated pipeline for analyzing medication event reports in clinical settings. BMC Med Inform Decis Mak 2018; 18:113. [PMID: 30526590 PMCID: PMC6284273 DOI: 10.1186/s12911-018-0687-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 02/02/2023]
Abstract
BACKGROUND Medication events in clinical settings are significant threats to patient safety. Analyzing and learning from the medication event reports is an important way to prevent the recurrence of these events. Currently, the analysis of medication event reports is ineffective and requires heavy workloads for clinicians. An automated pipeline is proposed to help clinicians deal with the accumulated reports, extract valuable information and generate feedback from the reports. Thus, the strategy of medication event prevention can be further developed based on the lessons learned. METHODS In order to build the automated pipeline, four classic machine learning classifiers (i.e., support vector machine, Naïve Bayes, random forest, and multi-layer perceptron) were compared to identify the event originating stages, event types, and event causes from the medication event reports. The precision, recall and F-1 measure were calculated to assess the performance of the classifiers. Further, a strategy to measure the similarity of medication event reports in our pipeline was established and evaluated by human subjects through a questionnaire. RESULTS We developed three classifiers to identify the medication event originating stages, event types and causes, respectively. For the event originating stages, a support vector machine classifier obtains the best performance with an F-1 measure of 0.792. For the event types, a support vector machine classifier exhibits the best performance with an F-1 measure of 0.758. And for the event causes, a random forest classifier reaches an F-1 measure of 0.925. The questionnaire results show that the similarity measurement is consistent with the domain experts in the task of identifying similar reports. CONCLUSION We developed and evaluated an automated pipeline that could identify three attributes from the medication event reports and calculate the similarity scores between the reports based on the attributes. The pipeline is expected to improve the efficiency of analyzing the medication event reports and to learn from the reports in a timely manner.
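For readers unfamiliar with the evaluation metrics named in this abstract, the following Python sketch computes precision, recall, and the F-1 measure for one class from a list of gold labels and predictions. The label names and the five example reports are invented for illustration, and the abstract does not say how per-class scores were averaged, so only the single-class case is shown.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F-1 for one class treated as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical originating-stage labels for five reports (not data from the paper).
y_true = ["prescribing", "dispensing", "prescribing", "administering", "prescribing"]
y_pred = ["prescribing", "prescribing", "prescribing", "administering", "dispensing"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "prescribing")
```

Here two of the three predicted "prescribing" labels are correct and two of the three true "prescribing" reports are found, so precision, recall, and F-1 all equal 2/3.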
Affiliation(s)
- Sicheng Zhou
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
- Hong Kang
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
- Bin Yao
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
- Yang Gong
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
11
Dutta P, Basu S, Kundu M. Assessment of Semantic Similarity between Proteins Using Information Content and Topological Properties of the Gene Ontology Graph. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:839-849. [PMID: 28371781 DOI: 10.1109/tcbb.2017.2689762] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 06/07/2023]
Abstract
The semantic similarity between two interacting proteins can be estimated by combining the similarity scores of the GO terms associated with the proteins. A greater number of similar GO annotations between two proteins indicates greater interaction affinity. Existing semantic similarity measures make use of the GO graph structure, the information content of GO terms, or a combination of both. In this paper, we present a hybrid approach which utilizes both the topological features of the GO graph and information contents of the GO terms. More specifically, we 1) consider a fuzzy clustering of the GO graph based on the level of association of the GO terms, 2) estimate the GO term memberships to each cluster center based on the respective shortest path lengths, and 3) assign weightage to GO term pairs on the basis of their dissimilarity with respect to the cluster centers. We test the performance of our semantic similarity measure against seven other previously published similarity measures using benchmark protein-protein interaction datasets of Homo sapiens and Saccharomyces cerevisiae based on sequence similarity, Pfam similarity, area under ROC curve, and F1 measure.
12
Zhang J, Jia K, Jia J, Qian Y. An improved approach to infer protein-protein interaction based on a hierarchical vector space model. BMC Bioinformatics 2018; 19:161. [PMID: 29699476 PMCID: PMC5921294 DOI: 10.1186/s12859-018-2152-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/15/2017] [Accepted: 04/09/2018] [Indexed: 02/06/2023]
Abstract
BACKGROUND Comparing and classifying functions of gene products are important in today's biomedical research. The semantic similarity derived from the Gene Ontology (GO) annotation has been regarded as one of the most widely used indicators for protein interaction. Among the various approaches proposed, those based on the vector space model are relatively simple, but their effectiveness is far from satisfactory. RESULTS We propose a Hierarchical Vector Space Model (HVSM) for computing semantic similarity between different genes or their products, which enhances the basic vector space model by introducing the relation between GO terms. Besides the directly annotated terms, HVSM also takes their ancestors and descendants related by "is_a" and "part_of" relations into account. Moreover, HVSM introduces the concept of a Certainty Factor to calibrate the semantic similarity based on the number of terms annotated to genes. To assess the performance of our method, we applied HVSM to Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets. Compared with TCSS, Resnik, and other classic similarity measures, HVSM achieved significant improvement for distinguishing positive from negative protein interactions. We also tested its correlation with sequence, EC, and Pfam similarity using the online tool CESSM. CONCLUSIONS HVSM showed an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to basic VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC using AUC scores. The CESSM test showed that HVSM was comparable to SimGIC, and superior to all other similarity measures in CESSM as well as TCSS. Supplementary information and the software are available at https://github.com/kejia1215/HVSM.
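Why adding hierarchy to a vector space model matters can be seen in a small Python sketch. It compares a flat binary vector space model with a variant that also switches on the ancestors of each annotated term, using cosine similarity in both cases. This is a simplified stand-in and not the HVSM method itself: the abstract does not give HVSM's term weights, its use of descendants, or the Certainty Factor formula, and the ontology and gene annotations below are hypothetical.

```python
import math

# Toy ontology: child -> parents, mixing "is_a" and "part_of" edges (hypothetical).
PARENTS = {"root": [], "x": ["root"], "y": ["root"], "x1": ["x"], "x2": ["x"]}

def with_ancestors(terms):
    """Expand an annotation set with every ancestor of each term."""
    expanded, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(PARENTS[term])
    return expanded

def binary_cosine(set_a, set_b):
    """Cosine similarity of two binary term vectors, written with sets."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

gene1, gene2 = {"x1"}, {"x2"}  # sibling terms, no shared direct annotation
flat_score = binary_cosine(gene1, gene2)
hier_score = binary_cosine(with_ancestors(gene1), with_ancestors(gene2))
```

The flat model scores the two genes as unrelated because they share no direct term, while the expanded vectors share `x` and `root` and therefore receive a non-zero score.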
Affiliation(s)
- Jiongmin Zhang
  - Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
- Ke Jia
  - Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
- Jinmeng Jia
  - School of life science, East China Normal University, Dongchuan Road, Shanghai, 200241 China
- Ying Qian
  - Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
13
Rodríguez-García MÁ, Hoehndorf R. Inferring ontology graph structures using OWL reasoning. BMC Bioinformatics 2018; 19:7. [PMID: 29304741 PMCID: PMC5756413 DOI: 10.1186/s12859-017-1999-8] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 05/24/2017] [Accepted: 12/13/2017] [Indexed: 12/14/2022]
Abstract
BACKGROUND Ontologies are representations of a conceptualization of a domain. Traditionally, ontologies in biology were represented as directed acyclic graphs (DAG) which represent the backbone taxonomy and additional relations between classes. These graphs are widely exploited for data analysis in the form of ontology enrichment or computation of semantic similarity. More recently, ontologies are developed in a formal language such as the Web Ontology Language (OWL) and consist of a set of axioms through which classes are defined or constrained. While the taxonomy of an ontology can be inferred directly from the axioms of an ontology as one of the standard OWL reasoning tasks, creating general graph structures from OWL ontologies that exploit the ontologies' semantic content remains a challenge. RESULTS We developed a method to transform ontologies into graphs using an automated reasoner while taking into account all relations between classes. Searching for (existential) patterns in the deductive closure of ontologies, we can identify relations between classes that are implied but not asserted and generate graph structures that encode a large part of the ontologies' semantic content. We demonstrate the advantages of our method by applying it to inference of protein-protein interactions through semantic similarity over the Gene Ontology and demonstrate that performance is increased when graph structures are inferred using deductive inference according to our method. Our software and experiment results are available at http://github.com/bio-ontology-research-group/Onto2Graph. CONCLUSIONS Onto2Graph is a method to generate graph structures from OWL ontologies using automated reasoning. The resulting graphs can be used for improved ontology visualization and ontology-based data analysis.
Affiliation(s)
- Miguel Ángel Rodríguez-García
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
14
Mazandu GK, Chimusa ER, Mulder NJ. Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Brief Bioinform 2017; 18:886-901. [PMID: 27473066 DOI: 10.1093/bib/bbw067] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 01/27/2016] [Indexed: 01/02/2023]
Abstract
Gene Ontology (GO) semantic similarity tools enable retrieval of semantic similarity scores, which incorporate biological knowledge embedded in the GO structure for comparing or classifying different proteins or list of proteins based on their GO annotations. This facilitates a better understanding of biological phenomena underlying the corresponding experiment and enables the identification of processes pertinent to different biological conditions. Currently, about 14 tools are available, which may play an important role in improving protein analyses at the functional level using different GO semantic similarity measures. Here we survey these tools to provide a comprehensive view of the challenges and advances made in this area to avoid redundant effort in developing features that already exist, or implementing ideas already proven to be obsolete in the context of GO. This helps researchers, tool developers, as well as end users, understand the underlying semantic similarity measures implemented through knowledge of pertinent features of, and issues related to, a particular tool. This should empower users to make appropriate choices for their biological applications and ensure effective knowledge discovery based on GO annotations.
15
Yu G, Lu C, Wang J. NoGOA: predicting noisy GO annotations using evidences and sparse representation. BMC Bioinformatics 2017; 18:350. [PMID: 28732468 PMCID: PMC5521088 DOI: 10.1186/s12859-017-1764-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/02/2017] [Accepted: 07/14/2017] [Indexed: 01/11/2023]
Abstract
BACKGROUND Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resource limitations, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that there are still a considerable number of noisy (or incorrect) annotations. Given the wide application of annotations, however, identifying noisy annotations is an important yet seldom studied open problem. RESULTS We introduce a novel approach called NoGOA to predict noisy annotations. First, NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Second, it preliminarily predicts noisy annotations of a gene based on aggregated votes from semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived at different periods, and then weights entries of the association matrix via estimated ratios and propagates weights to ancestors of direct annotations using the GO hierarchy. Finally, it integrates the evidence-weighted association matrix and aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and that removing noisy annotations improves the performance of gene function prediction. CONCLUSIONS The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA.
Affiliation(s)
- Guoxian Yu
- College of Computer and Information Sciences, Southwest University, Chongqing, China.
| | - Chang Lu
- College of Computer and Information Sciences, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Sciences, Southwest University, Chongqing, China
| |
|
16
|
Kang H, Gong Y. Developing a similarity searching module for patient safety event reporting system using semantic similarity measures. BMC Med Inform Decis Mak 2017; 17:75. [PMID: 28699567 PMCID: PMC5506579 DOI: 10.1186/s12911-017-0467-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022] Open
Abstract
Background: The most important knowledge in the field of patient safety concerns the prevention and reduction of patient safety events (PSE) during treatment and care. The similarities and patterns among the events may go unnoticed if they are not properly reported and analyzed. There is an urgent need for a PSE reporting system that can dynamically measure the similarities of the events and thus promote event analysis and learning. Methods: In this study, three prevailing semantic similarity algorithms were implemented to measure the similarities of the 366 PSE annotated by the taxonomy of the Agency for Healthcare Research and Quality (AHRQ). The performance of each algorithm was then evaluated by a group of domain experts on a 4-point Likert scale. The consistency between the scales of the algorithms and those of the experts was measured and compared with randomly assigned scales. The similarity algorithms and scores, as a self-learning and self-updating module, were then integrated into the system. Results: The similarity scores show higher consistency with the experts' review than randomly assigned scores do. Moreover, incorporating the algorithms into our reporting system enables a mechanism to learn and update based upon PSE similarity. Conclusion: Integrating semantic similarity algorithms into a PSE reporting system can help us learn from previous events and provide timely knowledge support to the reporters. With the knowledge base in the PSE domain, the new generation reporting system holds promise in educating healthcare providers and preventing the recurrence and serious consequences of PSE.
Affiliation(s)
- Hong Kang
- School of Biomedical Informatics, the University of Texas Health Science Center at Houston, 7000 Fannin St., Houston, TX, 77030, USA
| | - Yang Gong
- School of Biomedical Informatics, the University of Texas Health Science Center at Houston, 7000 Fannin St., Houston, TX, 77030, USA.
| |
|
17
|
Kang H, Gong Y. A Novel Schema to Enhance Data Quality of Patient Safety Event Reports. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:1840-1849. [PMID: 28269943 PMCID: PMC5333227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
The most important knowledge in the field of patient safety concerns the prevention and reduction of patient safety events (PSEs). It is believed that PSE reporting systems could be a good resource for sharing and learning from previous cases. However, the success of such systems in healthcare is yet to be seen. One reason is that the quality of most PSE reports is unsatisfactory, because reporting systems return little knowledge to reporters, which leads them to report halfheartedly. In this study, we designed a PSE similarity searching model based on semantic similarity measures, and proposed a novel schema for a PSE reporting system that can effectively learn from previous experiences and inform subsequent actions in a timely manner. This system will not only help improve report quality but also serve as a knowledge base and education tool to guide healthcare providers in preventing the recurrence of PSEs.
Affiliation(s)
- Hong Kang
- School of Biomedical Informatics, the University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Yang Gong
- School of Biomedical Informatics, the University of Texas Health Science Center at Houston, Houston, TX, USA
| |
|
18
|
Abstract
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Identifying the nearest well-annotated homologue of a protein of interest, predicting where misannotation has occurred, and knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene product, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Affiliation(s)
- Gemma L Holliday
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA.
| | - Rebecca Davidson
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
| | - Eyal Akiva
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
| | - Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
| |
|
19
|
Abstract
Gene Ontology-based semantic similarity (SS) allows the comparison of GO terms or entities annotated with GO terms by leveraging the ontology structure and properties and annotation corpora. In the last decade the number and diversity of SS measures based on GO has grown considerably, and their applications range from functional coherence evaluation to protein interaction prediction and disease gene prioritization. Understanding how SS measures work, what issues can affect their performance and how they compare to each other in different evaluation settings is crucial to gain a comprehensive view of this area and choose the most appropriate approaches for a given application. In this chapter, we provide a guide to understanding and selecting SS measures for biomedical researchers. We present a straightforward categorization of SS measures and describe the main strategies they employ. We discuss the intrinsic and external issues that affect their performance, and how these can be addressed. We summarize comparative assessment studies, highlighting the top measures in different settings, and compare different implementation strategies and their use. Finally, we discuss some of the extant challenges and opportunities, namely the increased semantic complexity of GO and the need for fast and efficient computation, pointing the way towards the future generation of SS measures.
Affiliation(s)
- Catia Pesquita
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Edifício C6, Piso 3, Campo Grande, 1749-016, Lisbon, Portugal.
| |
|
20
|
Lu C, Wang J, Zhang Z, Yang P, Yu G. NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity. Comput Biol Chem 2016; 65:203-211. [PMID: 27670689 DOI: 10.1016/j.compbiolchem.2016.09.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/02/2016] [Accepted: 09/07/2016] [Indexed: 10/21/2022]
Abstract
Gene Ontology (GO) provides GO annotations (GOA) that associate gene products with GO terms that summarize their cellular, molecular and functional aspects in the context of biological pathways. The GO Consortium (GOC) resorts to various quality assurance measures to ensure the correctness of annotations. Due to resource limitations, only a small portion of annotations are manually added or checked by GO curators, and a large portion of available annotations are computationally inferred. While computationally inferred annotations provide greater coverage of known genes, they may also introduce annotation errors (noise) that could mislead the interpretation of gene functions and their roles in cellular and biological processes. In this paper, we investigate how to identify noisy annotations, a rarely addressed problem, and propose a novel approach called NoisyGOA. NoisyGOA first measures taxonomic similarity between ontological terms using the GO hierarchy and semantic similarity between genes. Next, it leverages the taxonomic similarity and semantic similarity to predict noisy annotations. We compare NoisyGOA with alternative methods on identifying noisy annotations under different simulated cases of noisy annotations, and on archived GO annotations. NoisyGOA achieved higher accuracy than the alternative methods in the comparison. These results demonstrate that both taxonomic similarity and semantic similarity contribute to the identification of noisy annotations. Our study shows that annotation errors are predictable and that removing noisy annotations improves the performance of gene function prediction. This study can prompt the community to study methods for removing inaccurate annotations, a critical step for annotating gene and pathway functions.
Affiliation(s)
- Chang Lu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Zili Zhang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Pengyi Yang
- School of Mathematics and Statistics, The University of Sydney, New South Wales, Australia
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China.
| |
|
21
|
Ehsani R, Drabløs F. TopoICSim: a new semantic similarity measure based on gene ontology. BMC Bioinformatics 2016; 17:296. [PMID: 27473391 PMCID: PMC4966780 DOI: 10.1186/s12859-016-1160-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 03/12/2016] [Accepted: 07/21/2016] [Indexed: 01/14/2023] Open
Abstract
Background: The Gene Ontology (GO) is a dynamic, controlled vocabulary that describes the cellular function of genes and proteins according to three major categories: biological process, molecular function and cellular component. It has become widely used in many bioinformatics applications for annotating genes and measuring their semantic similarity, rather than their sequence similarity. Generally speaking, semantic similarity measures involve the GO tree topology, information content of GO terms, or a combination of both. Results: Here we present a new semantic similarity measure called TopoICSim (Topological Information Content Similarity) which uses information on the specific paths between GO terms based on the topology of the GO tree, and the distribution of information content along these paths. The TopoICSim algorithm was evaluated on two human benchmark datasets based on KEGG pathways and Pfam domains grouped as clans, using GO terms from either the biological process or molecular function. The performance of the TopoICSim measure compared favorably to five existing methods. Furthermore, the TopoICSim similarity was also tested on gene/protein sets defined by correlated gene expression, using three human datasets, and showed improved performance compared to two previously published similarity measures. Finally, we used an online benchmarking resource which evaluates any similarity measure against a set of 11 similarity measures in three tests, using gene/protein sets based on sequence similarity, Pfam domains, and enzyme classifications. The results for TopoICSim showed improved performance relative to most of the measures included in the benchmarking, and in particular a very robust performance throughout the different tests. Conclusions: The TopoICSim similarity measure provides a competitive method with robust performance for quantification of semantic similarity between genes and proteins based on GO annotations.
An R script for TopoICSim is available at http://bigr.medisin.ntnu.no/tools/TopoICSim.R.
Affiliation(s)
- Rezvan Ehsani
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway; Department of Mathematics, University of Zabol, Zabol, Iran
| | - Finn Drabløs
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway.
| |
|
22
|
Al-Dalky R, Taha K, Al Homouz D, Qasaimeh M. Applying Monte Carlo Simulation to Biomedical Literature to Approximate Genetic Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:494-504. [PMID: 26415184 DOI: 10.1109/tcbb.2015.2481399] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/05/2023]
Abstract
Biologists often need to know the set of genes associated with a given set of genes or a given disease. We propose in this paper a classifier system called Monte Carlo for Genetic Network (MCforGN) that can construct genetic networks, identify functionally related genes, and predict gene-disease associations. MCforGN identifies functionally related genes based on their co-occurrences in the abstracts of biomedical literature. For a given gene g, the system first extracts the set of genes found within the abstracts of biomedical literature associated with g. It then ranks these genes to determine the ones with high co-occurrences with g. It overcomes the limitations of current approaches that employ analytical deterministic algorithms by applying Monte Carlo Simulation to approximate genetic networks. It does so by conducting repeated random sampling to obtain numerical results and to optimize these results. Moreover, it analyzes results to obtain the probabilities of different genes' co-occurrences using a series of statistical tests. MCforGN can detect gene-disease associations by employing a combination of centrality measures (to identify the central genes in disease-specific genetic networks) and Monte Carlo Simulation. MCforGN aims at enhancing state-of-the-art biological text mining by applying novel extraction techniques. We evaluated MCforGN by comparing it experimentally with nine approaches. Results showed marked improvement.
|
23
|
Zhang SB, Tang QR. Protein-protein interaction inference based on semantic similarity of Gene Ontology terms. J Theor Biol 2016; 401:30-7. [PMID: 27117309 DOI: 10.1016/j.jtbi.2016.04.020] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 11/10/2015] [Revised: 03/14/2016] [Accepted: 04/16/2016] [Indexed: 11/29/2022]
Abstract
Identifying protein-protein interactions is important in molecular biology. Experimental methods for this task have their limitations, and computational approaches have attracted increasing attention from the biological community. The semantic similarity derived from Gene Ontology (GO) annotation has been regarded as one of the most powerful indicators of protein interaction. However, conventional methods based on GO similarity fail to take advantage of the specificity of GO terms in the ontology graph. We propose a GO-based method to predict protein-protein interaction by integrating different kinds of similarity measures derived from the intrinsic structure of the GO graph. We extended five existing methods to derive the semantic similarity measures from the descending part of two GO terms in the GO graph, then adopted a feature integration strategy to combine both the ascending and the descending similarity scores derived from the three sub-ontologies to construct various kinds of features to characterize each protein pair. Support vector machines (SVM) were employed as discriminative classifiers, and five-fold cross validation experiments were conducted on both human and yeast protein-protein interaction datasets to evaluate the performance of different kinds of integrated features. The experimental results suggest that the best performance is achieved by the feature that combines information from both the ascending and the descending parts of the three ontologies. Our method is appealing for effective prediction of protein-protein interaction.
Affiliation(s)
- Shu-Bo Zhang
- Department of Computer Science, Guangzhou Maritime Institute, Room 803, Building 88, Dashabei Road, Huangpu District, Guangzhou 510725, PR China.
| | - Qiang-Rong Tang
- Department of Shipping, Guangzhou Marine Institute, Room 205, Shipping Building, Hongshan No. 3 Road, Huangpu District, Guangzhou 510725, PR China.
| |
|
24
|
Hoehndorf R, Schofield PN, Gkoutos GV. The role of ontologies in biological and biomedical research: a functional perspective. Brief Bioinform 2015; 16:1069-80. [PMID: 25863278 PMCID: PMC4652617 DOI: 10.1093/bib/bbv011] [Citation(s) in RCA: 119] [Impact Index Per Article: 11.9] [Received: 11/26/2014] [Revised: 01/20/2015] [Indexed: 12/19/2022] Open
Abstract
Ontologies are widely used in biological and biomedical research. Their success lies in their combination of four main features present in almost all ontologies: provision of standard identifiers for classes and relations that represent the phenomena within a domain; provision of a vocabulary for a domain; provision of metadata that describes the intended meaning of the classes and relations in ontologies; and the provision of machine-readable axioms and definitions that enable computational access to some aspects of the meaning of classes and relations. While each of these features enables applications that facilitate data integration, data access and analysis, a great potential lies in the possibility of combining these four features to support integrative analysis and interpretation of multimodal data. Here, we provide a functional perspective on ontologies in biology and biomedicine, focusing on what ontologies can do and describing how they can be used in support of integrative research. We also outline perspectives for using ontologies in data-driven science, in particular their application in structured data mining and machine learning applications.
|
25
|
Pérez-Nueno VI, Souchet M, Karaboga AS, Ritchie DW. GESSE: Predicting Drug Side Effects from Drug–Target Relationships. J Chem Inf Model 2015; 55:1804-23. [DOI: 10.1021/acs.jcim.5b00120] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
Affiliation(s)
- Violeta I. Pérez-Nueno
- Harmonic Pharma, Espace Transfert, 615 rue du Jardin Botanique, 54600 Villers-les-Nancy, France
| | - Michel Souchet
- Harmonic Pharma, Espace Transfert, 615 rue du Jardin Botanique, 54600 Villers-les-Nancy, France
| | - Arnaud S. Karaboga
- Harmonic Pharma, Espace Transfert, 615 rue du Jardin Botanique, 54600 Villers-les-Nancy, France
| | - David W. Ritchie
- INRIA Nancy − Grand Est, Equipe Capsid, 615 rue du Jardin Botanique, 54600 Villers-les-Nancy, France
| |
|
26
|
Harispe S, Ranwez S, Janaqi S, Montmain J. Semantic Similarity from Natural Language and Ontology Analysis. Synthesis Lectures on Human Language Technologies 2015. [DOI: 10.2200/s00639ed1v01y201504hlt027] [Citation(s) in RCA: 113] [Impact Index Per Article: 11.3] [Indexed: 12/25/2022]
|
27
|
Jeong JC, Chen X. A New Semantic Functional Similarity over Gene Ontology. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:322-334. [PMID: 26357220 DOI: 10.1109/tcbb.2014.2343963] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/05/2023]
Abstract
Identifying functionally similar or closely related genes and gene products has significant impacts on biological and clinical studies as well as drug discovery. In this paper, we propose an effective and practically useful method for measuring both gene and gene product similarity by integrating the topology of gene ontology, known functional domains and their functional annotations. The proposed method is comprehensively evaluated through statistical analysis of the similarities derived from sequence, structure and phylogenetic profiles, and clustering analysis of disease gene clusters. Our results show that the proposed method clearly outperforms other conventional methods. Furthermore, literature analysis also reveals that the proposed method is both statistically and biologically promising for identifying functionally similar genes or gene products. In particular, we demonstrate that the proposed functional similarity metric is capable of discovering new disease-related genes or gene products.
|
28
|
Zhang SB, Lai JH. Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information. Gene 2015; 558:108-17. [DOI: 10.1016/j.gene.2014.12.062] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/16/2014] [Revised: 12/15/2014] [Accepted: 12/24/2014] [Indexed: 11/25/2022]
|
29
|
Speegle G. P-Finder: Reconstruction of Signaling Networks from Protein-Protein Interactions and GO Annotations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:309-321. [PMID: 26357219 DOI: 10.1109/tcbb.2014.2355216] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Because most complex genetic diseases are caused by defects of cell signaling, illuminating a signaling cascade is essential for understanding their mechanisms. We present three novel computational algorithms to reconstruct signaling networks between a starting protein and an ending protein using genome-wide protein-protein interaction (PPI) networks and gene ontology (GO) annotation data. A signaling network is represented as a directed acyclic graph in a merged form of multiple linear pathways. An advanced semantic similarity metric is applied for weighting PPIs as the preprocessing of all three methods. The first algorithm repeatedly extends the list of nodes based on path frequency towards an ending protein. The second algorithm repeatedly appends edges based on the occurrence of network motifs which indicate the link patterns more frequently appearing in a PPI network than in a random graph. The last algorithm uses the information propagation technique which iteratively updates edge orientations based on the path strength and merges the selected directed edges. Our experimental results demonstrate that the proposed algorithms achieve higher accuracy than previous methods when they are tested on well-studied pathways of S. cerevisiae. Furthermore, we introduce an interactive web application tool, called P-Finder, to visualize reconstructed signaling networks.
|
30
|
Lian YH, Fang MX, Chen LG. Constructing protein-protein interaction network of hypertension with blood stasis syndrome via digital gene expression sequencing and database mining. JOURNAL OF INTEGRATIVE MEDICINE-JIM 2014; 12:476-82. [PMID: 25412665 DOI: 10.1016/s2095-4964(14)60058-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
OBJECTIVE To construct a protein-protein interaction (PPI) network in hypertension patients with blood-stasis syndrome (BSS) by using digital gene expression (DGE) sequencing and database mining techniques. METHODS DGE analysis based on the Solexa Genome Analyzer platform was performed on vascular endothelial cells incubated with serum of hypertension patients with BSS. The differentially expressed genes were filtered by comparing the expression levels between the different experimental groups. Then functional categories and enriched pathways of the unique genes for BSS were analyzed using Database for Annotation, Visualization and Integrated Discovery (DAVID) to select those in the enrichment pathways. Interologous Interaction Database (I2D) was used to construct PPI networks with the selected genes for hypertension patients with BSS. The potential candidate genes related to BSS were identified by comparing the number of relationships among genes. Confirmed by quantitative reverse transcription-polymerase chain reaction (qRT-PCR), gene ontology (GO) analysis was used to infer the functional annotations of the potential candidate genes for BSS. RESULTS With gene enrichment analysis using DAVID, a list of 58 genes was chosen from the unique genes. The selected 58 genes were analyzed using I2D, and a PPI network was constructed. Based on the network analysis results, candidate genes for BSS were identified: DDIT3, JUN, HSPA8, NFIL3, HSPA5, HIST2H2BE, H3F3B, CEBPB, SAT1 and GADD45A. Verified through qRT-PCR and analyzed by GO, the functional annotations of the potential candidate genes were explored. CONCLUSION Compared with previous methodologies reported in the literature, the present DGE analysis and data mining method have shown a great improvement in analyzing BSS.
Affiliation(s)
- Yong-hong Lian
- Institute of Integrated Chinese and Western Medicine, Medical School of Ji'nan University, Guangzhou 510632, Guangdong Province, China
| | - Mei-xia Fang
- Institute of Integrated Chinese and Western Medicine, Medical School of Ji'nan University, Guangzhou 510632, Guangdong Province, China
| | - Li-guo Chen
- Institute of Integrated Chinese and Western Medicine, Medical School of Ji'nan University, Guangzhou 510632, Guangdong Province, China
| |
|
31
|
Konopka BM, Golda T, Kotulska M. Evaluating the Significance of Protein Functional Similarity Based on Gene Ontology. J Comput Biol 2014; 21:809-22. [DOI: 10.1089/cmb.2014.0181] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023] Open
Affiliation(s)
- Bogumil M. Konopka
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland
| | - Tomasz Golda
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland
| | - Malgorzata Kotulska
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland
| |
|
32
|
Taha K. RGFinder: a system for determining semantically related genes using GO graph minimum spanning tree. IEEE Trans Nanobioscience 2014; 14:24-37. [PMID: 25343765 DOI: 10.1109/tnb.2014.2363295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Abstract
Biologists often need to know the set S' of genes that are the most functionally and semantically related to a given set S of genes. For determining the set S', most current gene similarity measures overlook the structural dependencies among the Gene Ontology (GO) terms annotating the set S, which may lead to erroneous results. We introduce in this paper a biological search engine called RGFinder that considers the structural dependencies among GO terms by employing the concept of existence dependency. RGFinder assigns a weight to each edge in GO graph to represent the degree of relatedness between the two GO terms connected by the edge. The value of the weight is determined based on the following factors: 1) type of the relation represented by the edge (e.g., an "is-a" relation is assigned a different weight than a "part-of" relation), 2) the functional relationship between the two GO terms connected by the edge, and 3) the string-substring relationship between the names of the two GO terms connected by the edge. RGFinder then constructs a minimum spanning tree of GO graph based on these weights. In the framework of RGFinder, the set S' is annotated to the GO terms located at the lowest convergences of the subtree of the minimum spanning tree that passes through the GO terms annotating set S. We evaluated RGFinder experimentally and compared it with four gene set enrichment systems. Results showed marked improvement.
|
33
|
Taboada M, Rodríguez H, Martínez D, Pardo M, Sobrido MJ. Automated semantic annotation of rare disease cases: a case study. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau045. [PMID: 24903515 PMCID: PMC4207225 DOI: 10.1093/database/bau045] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
MOTIVATION: As the number of clinical reports in the peer-reviewed medical literature keeps growing, there is an increasing need for online search tools to find and analyze publications on patients with similar clinical characteristics. This problem is especially critical and challenging for rare diseases, where publications of large series are scarce. Through an applied example, we illustrate how to automatically identify new relevant cases and semantically annotate the relevant literature about patient case reports to capture the phenotype of a rare disease named cerebrotendinous xanthomatosis. RESULTS: Our results confirm that it is possible to automatically identify new relevant case reports with high precision and to annotate them with satisfactory quality (74% F-measure). Automated annotation with an emphasis on fully describing all phenotypic abnormalities found in a disease may facilitate curation efforts by supplying phenotype retrieval and assessment of their frequency. Availability and Supplementary information: http://www.usc.es/keam/PhenotypeAnnotation/. Database URL: http://www.usc.es/keam/PhenotypeAnnotation/
Affiliation(s)
- Maria Taboada, Hadriana Rodríguez, Diego Martínez, María Pardo, María Jesús Sobrido
  - Department of Electronics & Computer Science, Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Department of Neurology, University Hospital Clinico of Santiago de Compostela and Fundación Pública Galega de Medicina Xenómica-Instituto de Investigación Sanitaria de Santiago (IDIS) and Centro de Investigación Biomédica en red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
|
34
|
Semantic particularity measure for functional characterization of gene sets using gene ontology. PLoS One 2014; 9:e86525. [PMID: 24489737 PMCID: PMC3904913 DOI: 10.1371/journal.pone.0086525] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2013] [Accepted: 12/11/2013] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions, and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing the genes' molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. RESULTS We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. CONCLUSION Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies.
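The frequency-based informativeness mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition IC(t) = -log p(t), where p(t) is the fraction of annotated genes carrying term t; the toy corpus, the term identifiers, and the function name `information_content` are hypothetical, and the paper's particularity measure itself is not reproduced here.

```python
import math
from collections import Counter

def information_content(annotations):
    """Return IC(t) = -log(freq(t) / total) for each term in a corpus.

    `annotations` maps a gene to the set of GO terms annotating it
    (hypothetical toy input; ancestor terms are assumed to be
    already propagated to each gene).
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    total = len(annotations)
    return {term: -math.log(n / total) for term, n in counts.items()}

# Toy corpus: every gene carries the root term, so the root is uninformative.
corpus = {
    "geneA": {"GO:root", "GO:x"},
    "geneB": {"GO:root", "GO:x", "GO:y"},
    "geneC": {"GO:root"},
    "geneD": {"GO:root", "GO:y"},
}
ic = information_content(corpus)
```

Under this definition a term annotating every gene has zero information content, while rarer terms score higher.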
|
35
|
Bandyopadhyay S, Mallick K. A New Path Based Hybrid Measure for Gene Ontology Similarity. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:116-127. [PMID: 26355512 DOI: 10.1109/tcbb.2013.149] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Gene Ontology (GO) consists of a controlled vocabulary of terms, annotating a gene or gene product, structured in a directed acyclic graph. In the graph, semantic relations connect the terms, which represent the knowledge of functional description and cellular component information of gene products. GO similarity gives a numerical representation of the biological relationship among a set of genes, which can be used to infer various biological facts such as protein interaction, structural similarity, gene clustering, etc. Here we introduce a new shortest-path-based hybrid measure of ontological similarity between two terms, which combines both the structure of the GO graph and the information content of the terms. The similarity between two terms t1 and t2, referred to as GOSim(PBHM)(t1,t2), has two components: one obtained from the common ancestors of t1 and t2, the other from their remaining ancestors. The proposed path-based hybrid measure does not suffer from the well-known shallow annotation problem. Its superiority with respect to some other popular measures is established for protein-protein interaction prediction, correlation with gene expression and functional classification of genes in a biological pathway. Finally, the proposed measure is utilized to compute the average GO similarity score among the genes that are experimentally validated targets of some microRNAs. Results demonstrate that the targets of a given miRNA have a high degree of similarity in the biological process category of GO.
|
36
|
Price T, Peña FI, Cho YR. Survey: Enhancing protein complex prediction in PPI networks with GO similarity weighting. Interdiscip Sci 2013; 5:196-210. [PMID: 24307411 DOI: 10.1007/s12539-013-0174-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2013] [Revised: 03/30/2013] [Accepted: 06/12/2013] [Indexed: 11/29/2022]
Abstract
Predicting protein complexes from protein-protein interaction (PPI) networks has been the focus of many computational approaches over the last decade. These methods tend to vary in performance based on the structure of the network and the parameters provided to the algorithm. Here, we evaluate the merits of enhancing PPI networks with semantic similarity edge weights using Gene Ontology (GO) and its annotation data. We compare the cluster features and predictive efficacy of six well-known unweighted protein complex detection methods (Clique Percolation, MCODE, DPClus, IPCA, Graph Entropy, and CoAch) against updated weighted implementations. We conclude that incorporating semantic similarity edge weighting in PPI network analysis unequivocally increases the performance of these methods.
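As a rough illustration of the edge weighting described in this abstract, the sketch below attaches a similarity score to each PPI edge. The Jaccard overlap of annotation sets is only a simple stand-in chosen for brevity; it is an assumption, not one of the semantic similarity measures evaluated in the survey, and the names `jaccard` and `weight_edges` and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two annotation sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weight_edges(edges, annotations, similarity=jaccard):
    """Attach a similarity weight to every undirected PPI edge.

    Proteins missing from `annotations` are treated as unannotated.
    """
    return {
        (u, v): similarity(annotations.get(u, set()), annotations.get(v, set()))
        for u, v in edges
    }

# Toy network: P4 has no annotations, so its edge gets weight 0.0.
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P4")]
annotations = {
    "P1": {"GO:a", "GO:b"},
    "P2": {"GO:a", "GO:b", "GO:c"},
    "P3": {"GO:d"},
}
weights = weight_edges(edges, annotations)
```

Any term-level or gene-level semantic similarity function with the same two-argument shape could be passed as `similarity` instead.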
Affiliation(s)
- True Price
- Department of Computer Science, Baylor University, Waco, TX, 76798, USA
|
37
|
Cho YR, Mina M, Lu Y, Kwon N, Guzzi PH. M-Finder: Uncovering functionally associated proteins from interactome data integrated with GO annotations. Proteome Sci 2013; 11:S3. [PMID: 24565382 PMCID: PMC3909039 DOI: 10.1186/1477-5956-11-s1-s3] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) play a key role in understanding the mechanisms of cellular processes. The availability of interactome data has catalyzed the development of computational approaches to elucidate functional behaviors of proteins on a system level. Gene Ontology (GO) and its annotations are a significant resource for functional characterization of proteins. Because of wide coverage, GO data have often been adopted as a benchmark for protein function prediction on the genomic scale. RESULTS We propose a computational approach, called M-Finder, for functional association pattern mining. This method employs semantic analytics to integrate the genome-wide PPIs with GO data. We also introduce an interactive web application tool that visualizes a functional association network linked to a protein specified by a user. The proposed approach comprises two major components. First, the PPIs that have been generated by high-throughput methods are weighted in terms of their functional consistency using GO and its annotations. We assess two advanced semantic similarity metrics which quantify the functional association level of each interacting protein pair. We demonstrate that these measures outperform the other existing methods by evaluating their agreement to other biological features, such as sequence similarity, the presence of common Pfam domains, and core PPIs. Second, the information flow-based algorithm is employed to discover a set of proteins functionally associated with the protein in a query and their links efficiently. This algorithm reconstructs a functional association network of the query protein. The output network size can be flexibly determined by parameters. CONCLUSIONS M-Finder provides a useful framework to investigate functional association patterns with any protein. This software will also allow users to perform further systematic analysis of a set of proteins for any specific function. It is available online at http://bionet.ecs.baylor.edu/mfinder.
|
38
|
Deng Y, Gao L, Wang B. ppiPre: predicting protein-protein interactions by combining heterogeneous features. BMC SYSTEMS BIOLOGY 2013; 7 Suppl 2:S8. [PMID: 24565177 PMCID: PMC3851814 DOI: 10.1186/1752-0509-7-s2-s8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
BACKGROUND Protein-protein interactions (PPIs) are crucial in cellular processes. Since the current biological experimental techniques are time-consuming and expensive, and the results suffer from the problems of incompleteness and noise, developing computational methods and software tools to predict PPIs is necessary. Although several approaches have been proposed, the species supported are often limited and additional data like homologous interactions in other species, protein sequence and protein expression are often required. Moreover, the predictive abilities of different features for different kinds of PPI data have not been studied. RESULTS In this paper, we propose ppiPre, an open-source framework for PPI analysis and prediction using a combination of heterogeneous features including three GO-based semantic similarities, one KEGG-based co-pathway similarity and three topology-based similarities. It supports up to twenty species. Only the original PPI data and gold-standard PPI data are required from users. The experiments on binary and co-complex gold-standard yeast PPI data sets show large differences among the predictive abilities of different features on different kinds of PPI data sets. The prediction performance on the two data sets shows that ppiPre is capable of handling PPI data of different kinds and sizes. ppiPre is implemented in the R language and is freely available on the CRAN (http://cran.r-project.org/web/packages/ppiPre/). CONCLUSIONS We applied our framework to both binary and co-complex gold-standard PPI data sets. The detailed analysis on three GO aspects suggests that different GO aspects should be used on different kinds of data sets, and that combining all three aspects of GO often gives the best result. The analysis also shows that using only features based on the topology of the PPI network can give very good results when predicting the co-complex PPI data. ppiPre provides useful functions for analysing PPI data and can be used to predict PPIs for multiple species.
|
39
|
Almaça J, Faria D, Sousa M, Uliyakina I, Conrad C, Sirianant L, Clarke L, Martins J, Santos M, Heriché JK, Huber W, Schreiber R, Pepperkok R, Kunzelmann K, Amaral M. High-Content siRNA Screen Reveals Global ENaC Regulators and Potential Cystic Fibrosis Therapy Targets. Cell 2013; 154:1390-400. [DOI: 10.1016/j.cell.2013.08.045] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2012] [Revised: 07/08/2013] [Accepted: 08/26/2013] [Indexed: 01/07/2023]
|
40
|
Abstract
We introduce in this paper a biological search engine called GRtoGR. Given a set S of genes, GRtoGR determines from the GO graph the most significant Lowest Common Ancestor (LCA) of the GO terms annotating S. This significant LCA annotates the genes that are the most semantically related to the set S. The framework of GRtoGR refines the concept of LCA by introducing the concepts of Relevant Lowest Common Ancestor (RLCA) and Semantically Relevant Lowest Common Ancestor (SRLCA). An SRLCA is the most significant LCA of the GO terms annotating the set S. We observe that the existence of the GO terms annotating the set S is dependent on the existence of this SRLCA in the GO graph. That is, the terms annotating a given set of genes usually have existence dependency relationships with the SRLCA of these terms. We evaluated GRtoGR experimentally and compared it with nine other methods. Results showed marked improvement.
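The plain LCA notion that this abstract builds on can be sketched as follows. This is only the textbook lowest-common-ancestor computation on a DAG, assuming a child-to-parents mapping; it does not implement the RLCA or SRLCA refinements, whose definitions are not given here, and the toy graph and function names are hypothetical.

```python
def ancestors(term, parents):
    """All ancestors of `term` (including itself) in a DAG given child -> parents."""
    seen, stack = set(), [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def lowest_common_ancestors(terms, parents):
    """Common ancestors of all `terms` that have no other common ancestor below them."""
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # A common ancestor is not "lowest" if it is a proper ancestor
    # of another common ancestor.
    not_lowest = set()
    for c in common:
        not_lowest |= ancestors(c, parents) - {c}
    return common - not_lowest

# Toy DAG: D has two parents, as GO terms often do.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}
lca_ed = lowest_common_ancestors(["E", "D"], parents)
lca_cb = lowest_common_ancestors(["C", "B"], parents)
```

Because GO is a DAG rather than a tree, the result is a set: two terms can have several lowest common ancestors.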
|
41
|
Bresso E, Grisoni R, Marchetti G, Karaboga AS, Souchet M, Devignes MD, Smaïl-Tabbone M. Integrative relational machine-learning for understanding drug side-effect profiles. BMC Bioinformatics 2013; 14:207. [PMID: 23802887 PMCID: PMC3710241 DOI: 10.1186/1471-2105-14-207] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2013] [Accepted: 06/21/2013] [Indexed: 12/25/2022] Open
Abstract
Background Drug side effects represent a common reason for stopping drug development during clinical trials. Improving our ability to understand drug side effects is necessary to reduce attrition rates during drug development as well as the risk of discovering novel side effects in available drugs. Today, most investigations deal with isolated side effects and overlook possible redundancy and their frequent co-occurrence. Results In this work, drug annotations are collected from SIDER and DrugBank databases. Terms describing individual side effects reported in SIDER are clustered with a semantic similarity measure into term clusters (TCs). Maximal frequent itemsets are extracted from the resulting drug × TC binary table, leading to the identification of what we call side-effect profiles (SEPs). A SEP is defined as the longest combination of TCs which are shared by a significant number of drugs. Frequent SEPs are explored on the basis of integrated drug and target descriptors using two machine learning methods: decision trees and inductive-logic programming. Although both methods yield explicit models, the inductive-logic programming method performs relational learning and is able to exploit not only drug properties but also background knowledge. Learning efficiency is evaluated by cross-validation and direct testing with new molecules. Comparison of the two machine-learning methods shows that the inductive-logic-programming method displays a greater sensitivity than decision trees and successfully exploits background knowledge such as functional annotations and pathways of drug targets, thereby producing rich and expressive rules. All models and theories are available on a dedicated web site. Conclusions Side-effect profiles covering a significant number of drugs have been extracted from a drug × side-effect association table. Integration of background knowledge concerning both chemical and biological spaces has been combined with a relational learning method for discovering rules which explicitly characterize drug-SEP associations. These rules are successfully used for predicting SEPs associated with new drugs.
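The extraction of maximal frequent itemsets from a drug × TC binary table can be sketched with a brute-force enumeration. This is an illustrative toy under stated assumptions, not the mining algorithm used by the authors: the table contents, the `min_support` threshold, and the function name are hypothetical, and the exhaustive search is only practical for a handful of term clusters.

```python
from itertools import combinations

def maximal_frequent_itemsets(table, min_support):
    """Brute-force maximal frequent itemsets.

    `table` maps each drug to the set of term clusters (TCs) it exhibits.
    An itemset is frequent if at least `min_support` drugs contain it, and
    maximal if no frequent proper superset exists.
    """
    items = sorted(set().union(*table.values()))
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for tcs in table.values() if candidate <= tcs)
            if support >= min_support:
                frequent.append(candidate)
    # Keep only itemsets that are not strictly contained in another frequent one.
    return [s for s in frequent if not any(s < other for other in frequent)]

# Toy drug x TC table (rows are drugs, sets list the TCs marked 1).
table = {
    "drug1": {"TC1", "TC2", "TC3"},
    "drug2": {"TC1", "TC2"},
    "drug3": {"TC1", "TC2", "TC4"},
    "drug4": {"TC3", "TC4"},
}
profiles = maximal_frequent_itemsets(table, min_support=2)
```

In this toy table the pair {TC1, TC2} is shared by three drugs and absorbs its single-item subsets, which is the sense in which a profile is the longest shared combination.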
|
42
|
Taha K. Determining the Semantic Similarities Among Gene Ontology Terms. IEEE J Biomed Health Inform 2013; 17:512-25. [DOI: 10.1109/jbhi.2013.2248742] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
43
|
Benabderrahmane S. Biomedical Knowledge Extraction Using Fuzzy Differential Profiles and Semantic Ranking. Artif Intell Med 2013. [DOI: 10.1007/978-3-642-38326-7_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
44
|
Abstract
Ontologies are now pervasive in biomedicine, where they serve as a means to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. For this purpose, research on biomedical ontologies applies theories and methods from diverse disciplines such as information management, knowledge representation, cognitive science, linguistics and philosophy. Depending on the desired applications in which ontologies are being applied, the evaluation of research in biomedical ontologies must follow different strategies. Here, we provide a classification of research problems in which ontologies are being applied, focusing on the use of ontologies in basic and translational research, and we demonstrate how research results in biomedical ontologies can be evaluated. The evaluation strategies depend on the desired application and measure the success of using an ontology for a particular biomedical problem. For many applications, the success can be quantified, thereby facilitating the objective evaluation and comparison of research in biomedical ontology. The objective, quantifiable comparison of research results based on scientific applications opens up the possibility for systematically improving the utility of ontologies in biomedical research.
Affiliation(s)
- Robert Hoehndorf
- Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, SY23 3DB, UK.
|
45
|
Škunca N, Altenhoff A, Dessimoz C. Quality of computationally inferred gene ontology annotations. PLoS Comput Biol 2012; 8:e1002533. [PMID: 22693439 PMCID: PMC3364937 DOI: 10.1371/journal.pcbi.1002533] [Citation(s) in RCA: 97] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2011] [Accepted: 04/01/2012] [Indexed: 01/10/2023] Open
Abstract
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon, an important outcome given that >98% of all annotations are inferred without direct curation. In the UniProt Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these “electronic GO annotations” are generally perceived as unreliable; they are disregarded in many studies. In this article, we introduce novel methodology to systematically evaluate the quality of electronic annotations. We then provide the first comprehensive assessment of the reliability of electronic GO annotations. Overall, we found that electronic annotations are more reliable than generally believed, to an extent that they are competitive with annotations inferred by curators when they use evidence other than experiments from primary literature. But we also report significant variations among inference methods, types of annotations, and organisms. This work provides guidance for Gene Ontology users and lays the foundations for improving computational approaches to GO function inference.
Affiliation(s)
- Nives Škunca
  - Ruđer Bošković Institute, Division of Electronics, Zagreb, Croatia
  - ETH Zurich, Computer Science, Zurich, Switzerland
- Adrian Altenhoff
  - ETH Zurich, Computer Science, Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Zurich, Switzerland
- Christophe Dessimoz
  - ETH Zurich, Computer Science, Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Zurich, Switzerland
  - EMBL-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
|
46
|
A topology-based metric for measuring term similarity in the gene ontology. Adv Bioinformatics 2012; 2012:975783. [PMID: 22666244 PMCID: PMC3361142 DOI: 10.1155/2012/975783] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2011] [Revised: 02/29/2012] [Accepted: 03/13/2012] [Indexed: 12/25/2022] Open
Abstract
The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have led to the growth in its popularity. In order to exploit the extent of biological knowledge that GO offers in describing genes or groups of genes, there is a need for an efficient, scalable similarity measure for GO terms and GO-annotated proteins. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using a ROC analysis on human protein-protein interaction datasets and correlation coefficient analysis on the selected set of protein pairs from the CESSM online tool. This metric achieves good performance compared to the existing annotation-based GO measures. We used this new metric to assess functional similarity between orthologues, and show that it is effective at determining whether orthologues are annotated with similar functions and identifying cases where annotation is inconsistent between orthologues.
|
47
|
Pivovarov R, Elhadad N. A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts. J Biomed Inform 2012; 45:471-81. [PMID: 22289420 DOI: 10.1016/j.jbi.2012.01.002] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2011] [Revised: 01/06/2012] [Accepted: 01/08/2012] [Indexed: 10/14/2022]
Abstract
An open research question when leveraging ontological knowledge is when to treat different concepts separately from each other and when to aggregate them. For instance, concepts for the terms "paroxysmal cough" and "nocturnal cough" might be aggregated in a kidney disease study, but should be left separate in a pneumonia study. Determining whether two concepts are similar enough to be aggregated can help build better datasets for data mining purposes and avoid signal dilution. Quantifying the similarity among concepts is a difficult task, however, in part because such similarity is context-dependent. We propose a comprehensive method, which computes a similarity score for a concept pair by combining data-driven and ontology-driven knowledge. We demonstrate our method on concepts from SNOMED-CT and on a corpus of clinical notes of patients with chronic kidney disease. By combining information from usage patterns in clinical notes and from ontological structure, the method can prune out concepts that are simply related from those which are semantically similar. When evaluated against a list of concept pairs annotated for similarity, our method reaches an AUC (area under the curve) of 92%.
Affiliation(s)
- Rimma Pivovarov
- Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, VC-5, New York, NY 10032, USA
|
48
|
Ramírez F, Lawyer G, Albrecht M. Novel search method for the discovery of functional relationships. Bioinformatics 2011; 28:269-76. [PMID: 22180409 PMCID: PMC3259435 DOI: 10.1093/bioinformatics/btr631] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
Motivation: Numerous annotations are available that functionally characterize genes and proteins with regard to molecular process, cellular localization, tissue expression, protein domain composition, protein interaction, disease association and other properties. Searching this steadily growing amount of information can lead to the discovery of new biological relationships between genes and proteins. To facilitate the searches, methods are required that measure the annotation similarity of genes and proteins. However, most current similarity methods are focused only on annotations from the Gene Ontology (GO) and do not take other annotation sources into account. Results: We introduce the new method BioSim that incorporates multiple sources of annotations to quantify the functional similarity of genes and proteins. We compared the performance of our method with four other well-known methods adapted to use multiple annotation sources. We evaluated the methods by searching for known functional relationships using annotations based only on GO or on our large data warehouse BioMyn. This warehouse integrates many diverse annotation sources of human genes and proteins. We observed that the search performance improved substantially for almost all methods when multiple annotation sources were included. In particular, our method outperformed the other methods in terms of recall and average precision. Contact: mario.albrecht@mpi-inf.mpg.de Supplementary Information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fidel Ramírez
- Max Planck Institute for Informatics, Campus E1.4, 66123 Saarbrücken, Germany
|
49
|
Hoehndorf R, Dumontier M, Gennari JH, Wimalaratne S, de Bono B, Cook DL, Gkoutos GV. Integrating systems biology models and biomedical ontologies. BMC SYSTEMS BIOLOGY 2011; 5:124. [PMID: 21835028 PMCID: PMC3170340 DOI: 10.1186/1752-0509-5-124] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/06/2011] [Accepted: 08/11/2011] [Indexed: 01/30/2023]
Abstract
BACKGROUND Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology. RESULTS We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models. CONCLUSIONS We establish an information flow between biomedical ontologies and biosimulation models and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms.
Affiliation(s)
- Robert Hoehndorf
  - Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Michel Dumontier
  - Department of Biology, Carleton University, 1125 Colonel By Drive, Ottawa, K1S 5B6, Canada
  - School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, K1S 5B6, Canada
- John H Gennari
  - Biomedical & Health Informatics, Department of Medical Education and Biomedical Informatics, University of Washington, 1959 NE Pacific Street, Box 357420, Seattle, Washington 98195, USA
- Sarala Wimalaratne
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Bernard de Bono
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Daniel L Cook
  - Department of Physiology & Biophysics, University of Washington, 1705 NE Pacific Street, Box 357290, Seattle, Washington 98195, USA
  - Department of Biological Structure, University of Washington, 1959 NE Pacific Street, Box 357420, Seattle, Washington 98195, USA
- Georgios V Gkoutos
  - Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
|
50
|
Cho JH, Wang K, Galas DJ. An integrative approach to inferring biologically meaningful gene modules. BMC SYSTEMS BIOLOGY 2011; 5:117. [PMID: 21791051 PMCID: PMC3156758 DOI: 10.1186/1752-0509-5-117] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/07/2011] [Accepted: 07/26/2011] [Indexed: 12/21/2022]
Abstract
Background The ability to construct biologically meaningful gene networks and modules is critical for contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to shed light on the functioning of complex biological systems, most modules in these networks have shown little association with meaningful biological function. We have devised a method which directly incorporates gene ontology (GO) annotation in construction of gene modules in order to gain better functional association. Results We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM) that integrates various gene-gene pairwise similarity values, including information obtained from gene expression, protein-protein interactions and GO annotations, in the construction of modules using affinity propagation clustering. We demonstrated the performance of the proposed method using data from two complex biological responses: 1. the osmotic shock response in Saccharomyces cerevisiae, and 2. the prion-induced pathogenic mouse model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly stronger association with biological functions. Conclusions The incorporation of semantic similarity based on GO annotation with gene expression and protein-protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high throughput experimental data at the gene network level.
Affiliation(s)
- Ji-Hoon Cho
- Institute for Systems Biology, Seattle, WA 98109, USA
|