1
Beverley J, Babcock S, Carvalho G, Cowell LG, Duesing S, He Y, Hurley R, Merrell E, Scheuermann RH, Smith B. Coordinating virus research: The Virus Infectious Disease Ontology. PLoS One 2024; 19:e0285093. [PMID: 38236918 PMCID: PMC10796065 DOI: 10.1371/journal.pone.0285093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Accepted: 04/12/2023] [Indexed: 01/22/2024] Open
Abstract
The COVID-19 pandemic prompted immense work on the investigation of the SARS-CoV-2 virus. Rapid, accurate, and consistent interpretation of generated data is thereby of fundamental concern. Ontologies (structured, controlled vocabularies) are designed to support consistency of interpretation, and thereby to prevent the development of data silos. This paper describes how ontologies are serving this purpose in the COVID-19 research domain, by following principles of the Open Biological and Biomedical Ontology (OBO) Foundry and by reusing existing ontologies such as the Infectious Disease Ontology (IDO) Core, which provides terminological content common to investigations of all infectious diseases. We report here on the development of an IDO extension, the Virus Infectious Disease Ontology (VIDO), a reference ontology covering viral infectious diseases. We motivate term and definition choices, showcase reuse of terms from existing OBO ontologies, illustrate how ontological decisions were motivated by relevant life science research, and connect VIDO to the Coronavirus Infectious Disease Ontology (CIDO). We next use terms from these ontologies to annotate selections from life science research on SARS-CoV-2, highlighting how ontologies employing a common upper-level vocabulary may be seamlessly interwoven. Finally, we outline future work, including bacteria and fungus infectious disease reference ontologies currently under development, then cite uses of VIDO and CIDO in host-pathogen data analytics, electronic health record annotation, and ontology conflict-resolution projects.
Affiliation(s)
- John Beverley
  - Department of Philosophy, University at Buffalo, Buffalo, NY, United States of America
  - National Center for Ontological Research, Buffalo, NY, United States of America
- Shane Babcock
  - National Center for Ontological Research, Buffalo, NY, United States of America
  - Air Force Research Laboratory, Wright Patterson Air Force Base, Riverside, OH, United States of America
- Gustavo Carvalho
  - Department of Cognitive Science, Northwestern University, Evanston, IL, United States of America
- Lindsay G. Cowell
  - Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
- Sebastian Duesing
  - Department of Philosophy, Loyola University, Chicago, IL, United States of America
- Yongqun He
  - Computational Medicine and Bioinformatics, University of Michigan Medical School, He Group, Ann Arbor, MI, United States of America
- Regina Hurley
  - National Center for Ontological Research, Buffalo, NY, United States of America
  - Department of Philosophy, Northwestern University, Evanston, IL, United States of America
- Eric Merrell
  - Department of Philosophy, University at Buffalo, Buffalo, NY, United States of America
  - National Center for Ontological Research, Buffalo, NY, United States of America
- Richard H. Scheuermann
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
  - Department of Pathology, University of California, San Diego, CA, United States of America
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, United States of America
- Barry Smith
  - Department of Philosophy, University at Buffalo, Buffalo, NY, United States of America
  - National Center for Ontological Research, Buffalo, NY, United States of America
2
Hernández L, Estévez-Priego E, López-Pérez L, Fernanda Cabrera-Umpiérrez M, Arredondo MT, Fico G. HeNeCOn: An ontology for integrative research in Head and Neck cancer. Int J Med Inform 2024; 181:105284. [PMID: 37981440 DOI: 10.1016/j.ijmedinf.2023.105284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 07/14/2023] [Accepted: 11/01/2023] [Indexed: 11/21/2023]
Abstract
BACKGROUND Head and Neck Cancer (HNC) has a high incidence and prevalence in the worldwide population. The broad terminology associated with these diseases and their multimodality treatments generates large amounts of heterogeneous clinical data, which motivates the construction of a high-quality harmonization model to standardize this multi-source clinical data in terms of format and semantics. The use of ontologies and semantic techniques is a well-known approach to face this challenge. OBJECTIVE This work aims to provide a clinically reliable data model for HNC processes during all phases of the disease: prognosis, treatment, and follow-up. Therefore, we built the first ontology specifically focused on the HNC domain, named HeNeCOn (Head and Neck Cancer Ontology). METHODS First, an annotated dataset was established to provide a formal reference description of HNC. Then, 170 clinical variables were organized into a taxonomy, and later expanded and mapped to formalize and integrate multiple databases into the HeNeCOn ontology. The outcomes of this iterative process were reviewed and validated by clinicians and statisticians. RESULTS HeNeCOn is an ontology consisting of 502 classes, a taxonomy with a hierarchical structure, semantic definitions of 283 medical terms and detailed relations between them, which can be used as a tool for information extraction and knowledge management. CONCLUSION HeNeCOn is a reusable, extendible and standardized ontology which establishes a reference data model for terminology structure and standard definitions in the Head and Neck Cancer domain. This ontology allows handling both current and newly generated knowledge in Head and Neck cancer research, by means of data linking and mapping with other public ontologies.
Affiliation(s)
- Liss Hernández
  - Universidad Politécnica de Madrid-Life Supporting Technologies Research Group, ETSIT, 28040 Madrid, Spain
- Estefanía Estévez-Priego
  - Universidad Politécnica de Madrid-Life Supporting Technologies Research Group, ETSIT, 28040 Madrid, Spain
- Laura López-Pérez
  - Universidad Politécnica de Madrid-Life Supporting Technologies Research Group, ETSIT, 28040 Madrid, Spain
- María Teresa Arredondo
  - Universidad Politécnica de Madrid-Life Supporting Technologies Research Group, ETSIT, 28040 Madrid, Spain
- Giuseppe Fico
  - Universidad Politécnica de Madrid-Life Supporting Technologies Research Group, ETSIT, 28040 Madrid, Spain
3
Wang H, Zheng H, Chen DZ. TANGO: A GO-Term Embedding Based Method for Protein Semantic Similarity Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:694-706. [PMID: 35030084 DOI: 10.1109/tcbb.2022.3143480] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
We aim to quantitatively predict protein semantic similarities (PSS), which is vital to making biological discoveries. Previously, researchers commonly exploited Gene Ontology (GO) graphs (containing standardized hierarchically-organized GO terms for annotating distinct protein attributes) to learn GO term embeddings (vector representations) for quantifying protein attribute similarities and aggregate these embeddings to form protein embeddings for similarity measurement. However, two key properties of GO terms and annotated proteins are not yet well-explored by these learning-based methods: (1) taxonomy relations between GO terms; (2) GO terms' different contributions in describing protein semantics. In this paper, we propose TANGO, a new framework composed of a TAxoNomy-aware embedding module and an aggreGatiOn module. Our Embedding Module encodes taxonomic information into GO term embeddings by incorporating GO term topological distances in the GO graph hierarchy. Hence, distances between GO term embeddings can be used to more accurately measure shared meanings between correlated protein attributes. Our Aggregation Module automatically determines the contributions of GO terms when merging into the target protein embeddings, by mining GO term concept dependency relations in the GO graph and correlations in protein annotations. We conduct extensive experiments on several public datasets. On two PSS metrics, our new method significantly outperforms known methods by a large margin.
4
Wood EC, Glen AK, Kvarfordt LG, Womack F, Acevedo L, Yoon TS, Ma C, Flores V, Sinha M, Chodpathumwan Y, Termehchy A, Roach JC, Mendoza L, Hoffman AS, Deutsch EW, Koslicki D, Ramsey SA. RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine. BMC Bioinformatics 2022; 23:400. [PMID: 36175836 PMCID: PMC9520835 DOI: 10.1186/s12859-022-04932-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 03/25/2022] [Accepted: 09/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB) in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions. Within that project and the broader field, there is a need for a framework that can efficiently and reproducibly build an integrated, standards-compliant, and comprehensive biomedical knowledge graph that can be downloaded in standard serialized form or queried via a public application programming interface (API). RESULTS To create a knowledge provider system within the Translator project, we have developed RTX-KG2, an open-source software system for building (and hosting a web API for querying) a biomedical knowledge graph that uses an Extract-Transform-Load approach to integrate 70 knowledge sources (including the aforementioned core six sources) into a knowledge graph with provenance information including (where available) citations. The semantic layer and schema for RTX-KG2 follow the standard Biolink model to maximize interoperability. RTX-KG2 is currently being used by multiple Translator reasoning agents, both in its downloadable form and via its SmartAPI-registered interface. Serializations of RTX-KG2 are available for download in both the pre-canonicalized form and in canonicalized form (in which synonyms are merged). The current canonicalized version (KG2.7.3) of RTX-KG2 contains 6.4M nodes and 39.3M edges with a hierarchy of 77 relationship types from Biolink. CONCLUSION RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema.
RTX-KG2 is publicly available for querying via its API at arax.rtx.ai/api/rtxkg2/v1.2/openapi.json. The code to build RTX-KG2 is publicly available at github:RTXteam/RTX-KG2.
Affiliation(s)
- E C Wood
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Amy K Glen
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Lindsey G Kvarfordt
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Finn Womack
  - Computer Science and Engineering, Penn State University, State College, PA, USA
- Liliana Acevedo
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Timothy S Yoon
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Chunyu Ma
  - Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
- Veronica Flores
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Meghamala Sinha
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Arash Termehchy
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Andrew S Hoffman
  - Interdisciplinary Hub for Digitalization and Society, Radboud University, Nijmegen, The Netherlands
- David Koslicki
  - Computer Science and Engineering, Penn State University, State College, PA, USA
  - Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
  - Department of Biology, Penn State University, State College, PA, USA
- Stephen A Ramsey
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
  - Department of Biomedical Sciences, Oregon State University, Corvallis, OR, USA
5
Chen J, Althagafi A, Hoehndorf R. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics 2021; 37:853-860. [PMID: 33051643 PMCID: PMC8248315 DOI: 10.1093/bioinformatics/btaa879] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 03/30/2020] [Revised: 08/26/2020] [Accepted: 09/28/2020] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION Over the past years, many computational methods have been developed to incorporate information about phenotypes for the disease–gene prioritization task. These methods generally compute the similarity between a patient’s phenotypes and a database of gene–phenotype associations to find the most phenotypically similar match. The main limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes, which is incomplete in humans as well as in many model organisms, such as the mouse and fish. Information about functions of gene products and anatomical site of gene expression is available for more genes and can also be related to phenotypes through ontologies and machine-learning models. RESULTS We developed a novel graph-based machine-learning method for biomedical ontologies, which is able to exploit axioms in ontologies and other graph-structured data. Using our machine-learning method, we embed genes based on their associated phenotypes, functions of the gene products and anatomical location of gene expression. We then develop a machine-learning model to predict gene–disease associations based on the associations between genes and multiple biomedical ontologies, and this model significantly improves over state-of-the-art methods. Furthermore, we extend phenotype-based gene prioritization methods significantly to all genes that are associated with phenotypes, functions or site of expression. AVAILABILITY AND IMPLEMENTATION Software and data are available at https://github.com/bio-ontology-research-group/DL2Vec. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Chen
  - Computational Bioscience Research Center (CBRC), Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Azza Althagafi
  - Computational Bioscience Research Center (CBRC), Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
  - Computer Science Department, College of Computers and Information Technology, Taif University, Taif 26571, Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center (CBRC), Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
6
Liu-Wei W, Kafkas Ş, Chen J, Dimonaco NJ, Tegnér J, Hoehndorf R. DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes. Bioinformatics 2021; 37:2722-2729. [PMID: 33682875 PMCID: PMC8428617 DOI: 10.1093/bioinformatics/btab147] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 08/13/2020] [Revised: 01/18/2021] [Accepted: 03/01/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. RESULTS We developed DeepViral, a deep learning based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. AVAILABILITY Code and datasets for reproduction and customization are available at https://github.com/bio-ontology-research-group/DeepViral. Prediction results for 14 virus families are available at https://doi.org/10.5281/zenodo.4429824.
Affiliation(s)
- Wang Liu-Wei
  - Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Şenay Kafkas
  - Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Jun Chen
  - Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Nicholas J Dimonaco
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, SY23 3BQ, Wales, UK
- Jesper Tegnér
  - Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
  - Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Robert Hoehndorf
  - Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
7
Dhombres F, Charlet J. Design and Use of Semantic Resources: Findings from the Section on Knowledge Representation and Management of the 2020 International Medical Informatics Association Yearbook. Yearb Med Inform 2020; 29:163-168. [PMID: 32823311 PMCID: PMC7442529 DOI: 10.1055/s-0040-1702010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023] Open
Abstract
OBJECTIVE To select, present, and summarize the best papers in the field of Knowledge Representation and Management (KRM) published in 2019. METHODS A comprehensive and standardized review of the biomedical informatics literature was performed to select the most interesting KRM papers published in 2019, based on PubMed and ISI Web Of Knowledge queries. RESULTS Four best papers were selected among 1,189 publications retrieved, following the usual International Medical Informatics Association Yearbook reviewing process. In 2019, research areas covered by pre-selected papers were represented by the design of semantic resources (methods, visualization, curation) and the application of semantic representations for the integration/enrichment of biomedical data. Besides new ontologies and sound methodological guidance to rethink knowledge base design, we observed large-scale applications, promising results for phenotype characterization, semantic-aware machine learning solutions for biomedical data analysis, and semantic provenance information representations for scientific reproducibility evaluation. CONCLUSION In the KRM selection for 2019, research on knowledge representation demonstrated significant contributions both in the design and in the application of semantic resources. Semantic representations serve a great variety of applications across many medical domains, with actionable results.
Affiliation(s)
- Ferdinand Dhombres
  - Sorbonne Université, Université Paris Nord, INSERM, UMR_S 1142, LIMICS, Paris, France
  - Médecine Sorbonne Université, Service de Médecine Fœtale, Hôpital Armand Trousseau, Paris, France
- Jean Charlet
  - Sorbonne Université, Université Paris Nord, INSERM, UMR_S 1142, LIMICS, Paris, France
  - AP-HP, DRCI, Paris, France