1
Vogt L. FAIR data representation in times of eScience: a comparison of instance-based and class-based semantic representations of empirical data using phenotype descriptions as example. J Biomed Semantics 2021; 12:20. [PMID: 34823588 PMCID: PMC8613519 DOI: 10.1186/s13326-021-00254-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/17/2021] [Accepted: 11/11/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND: The size, velocity, and heterogeneity of Big Data outclass conventional data management tools and require data and metadata to be fully machine-actionable (i.e., eScience-compliant) and thus findable, accessible, interoperable, and reusable (FAIR). This can be achieved by using ontologies and by representing them as semantic graphs. Here, we discuss two different semantic graph approaches to representing empirical data and metadata in a knowledge graph, with phenotype descriptions as an example. Almost all phenotype descriptions are still being published as unstructured natural language texts, with far-reaching consequences for their FAIRness, substantially impeding their overall usability within the life sciences. However, with an increasing number of anatomy ontologies becoming available and semantic applications emerging, a solution to this problem is within reach. Researchers are starting to document and communicate phenotype descriptions through the Web in the form of highly formalized and structured semantic graphs that use ontology terms and Uniform Resource Identifiers (URIs) to circumvent the problems connected with unstructured texts.
RESULTS: Using phenotype descriptions as an example, we compare and evaluate two basic representations of empirical data and their accompanying metadata in the form of semantic graphs: the class-based TBox semantic graph approach called Semantic Phenotype and the instance-based ABox semantic graph approach called Phenotype Knowledge Graph. Their main difference is that only the ABox approach allows for identifying every individual part and property mentioned in the description in a knowledge graph. This technical difference results in substantial practical consequences that significantly affect the overall usability of empirical data. The consequences affect findability, accessibility, and explorability of empirical data as well as their comparability, expandability, universal usability and reusability, and overall machine-actionability. Moreover, TBox semantic graphs often require querying under entailment regimes, which is computationally more complex.
CONCLUSIONS: We conclude that, from a conceptual point of view, the advantages of the instance-based ABox semantic graph approach outweigh its shortcomings and outweigh the advantages of the class-based TBox semantic graph approach. Therefore, we recommend the instance-based ABox approach as a FAIR approach for documenting and communicating empirical data and metadata in a knowledge graph.
Affiliation(s)
- Lars Vogt
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1B, 30167, Hanover, Germany.
2
Fujiwara T, Yamamoto Y, Kim JD, Buske O, Takagi T. PubCaseFinder: A Case-Report-Based, Phenotype-Driven Differential-Diagnosis System for Rare Diseases. Am J Hum Genet 2018; 103:389-399. [PMID: 30173820 DOI: 10.1016/j.ajhg.2018.08.003] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 04/12/2018] [Accepted: 08/01/2018] [Indexed: 01/29/2023]
Abstract
Recently, to speed up the differential-diagnosis process based on symptoms and signs observed from an affected individual in the diagnosis of rare diseases, researchers have developed and implemented phenotype-driven differential-diagnosis systems. The performance of those systems relies on the quantity and quality of underlying databases of disease-phenotype associations (DPAs). Although such databases are often developed by manual curation, they inherently suffer from limited coverage. To address this problem, we propose a text-mining approach to increase the coverage of DPA databases and consequently improve the performance of differential-diagnosis systems. Our analysis showed that a text-mining approach using one million case reports obtained from PubMed could increase the coverage of manually curated DPAs in Orphanet by 125.6%. We also present PubCaseFinder (see Web Resources), a new phenotype-driven differential-diagnosis system in a freely available web application. By utilizing automatically extracted DPAs from case reports in addition to manually curated DPAs, PubCaseFinder improves the performance of automated differential diagnosis. Moreover, PubCaseFinder helps clinicians search for relevant case reports by using phenotype-based comparisons and confirm the results with detailed contextual information.
3
Vogt L. Towards a semantic approach to numerical tree inference in phylogenetics. Cladistics 2018; 34:200-224. [PMID: 34645075 DOI: 10.1111/cla.12195] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Accepted: 02/03/2017] [Indexed: 12/24/2022]
Abstract
Conventional approaches to phylogeny reconstruction require a character analysis step prior to and methodologically separated from a numerical tree inference step. The former results in a character matrix that contains the empirical data analysed in the latter. This separation of steps involves various methodological and conceptual problems (e.g. homology assessment independent of tree inference and character optimization, character dependencies, discounting of alternative homology hypotheses). In morphology, the character analysis step covers the stages of morphological comparative studies, homology assessment and the identification and coding of morphological characters. Unfortunately, only the last stage requires some formalism, whereas the preceding stages are commonly regarded as pre-rational and intuitive, which is why their reproducibility and analytical accessibility are limited. Here, I introduce a rationale for a semantic approach to numerical tree inference that uses sets of semantic instance anatomies as data source instead of character matrices, thereby avoiding the above-mentioned problems. A semantic instance anatomy is an ontology-based description of the anatomical organization of a specimen in the form of a semantic graph. The semantic approach to numerical tree inference combines and integrates the steps of character analysis and numerical tree inference and makes both analytically accessible and communicable. Before outlining first steps for a research programme dedicated to the semantic approach to numerical tree inference, I discuss in detail the methodological, conceptual, and computational challenges and requirements that first have to be dealt with before adequate algorithms can be developed.
Affiliation(s)
- Lars Vogt
- Institut für Evolutionsbiologie und Ökologie, Universität Bonn, An der Immenburg 1, Bonn, D-53121, Germany
4
Bello SM, Shimoyama M, Mitraka E, Laulederkind SJF, Smith CL, Eppig JT, Schriml LM. Disease Ontology: improving and unifying disease annotations across species. Dis Model Mech 2018; 11:dmm.032839. [PMID: 29590633 PMCID: PMC5897730 DOI: 10.1242/dmm.032839] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 10/31/2017] [Accepted: 02/08/2018] [Indexed: 11/20/2022]
Abstract
Model organisms are vital to uncovering the mechanisms of human disease and developing new therapeutic tools. Researchers collecting and integrating relevant model organism and/or human data often apply disparate terminologies (vocabularies and ontologies), making comparisons and inferences difficult. A unified disease ontology is required that connects data annotated using diverse disease terminologies, and in which the terminology relationships are continuously maintained. The Mouse Genome Database (MGD, http://www.informatics.jax.org), Rat Genome Database (RGD, http://rgd.mcw.edu) and Disease Ontology (DO, http://www.disease-ontology.org) projects are collaborating to augment DO, aligning and incorporating disease terms used by MGD and RGD, and improving DO as a tool for unifying disease annotations across species. Coordinated assessment of MGD's and RGD's disease term annotations identified new terms that enhance DO's representation of human diseases. Expansion of DO term content and cross-references to clinical vocabularies (e.g. OMIM, ORDO, MeSH) has enriched the DO's domain coverage and utility for annotating many types of data generated from experimental and clinical investigations. The extension of the anatomy-based DO classification structure of disease improves accessibility of terms and facilitates application of DO for computational research. A consistent representation of disease associations across data types from cellular to whole organism, generated from clinical and model organism studies, will promote the integration, mining and comparative analysis of these data. The coordinated enrichment of the DO and adoption of DO by MGD and RGD demonstrate DO's usability across human data, MGD, RGD and the rest of the model organism database community.
Summary: Analyzing diverse disease data requires a comprehensive, robust disease ontology to integrate annotations and retrieve accurate, interpretable results. MGD, RGD and DO are working in collaboration to achieve this goal.
Affiliation(s)
- Mary Shimoyama
- Department of Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI, USA
- Elvira Mitraka
- Department of Epidemiology and Public Health, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Lynn M Schriml
- Department of Epidemiology and Public Health, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
5
Vilar S, Hripcsak G. The role of drug profiles as similarity metrics: applications to repurposing, adverse effects detection and drug-drug interactions. Brief Bioinform 2017; 18:670-681. [PMID: 27273288 PMCID: PMC6078166 DOI: 10.1093/bib/bbw048] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 02/25/2016] [Revised: 04/18/2016] [Indexed: 12/30/2022]
Abstract
The explosion in the availability of big data sources, along with developments in computational methods, provides a useful framework to study drugs' actions, such as interactions with pharmacological targets and off-targets. Databases related to protein interactions, adverse effects and genomic profiles are available for the construction of computational models. In this article, we focus on the description of biological profiles for drugs that can be used as a system to compare similarity and create methods to predict and analyze drugs' actions. We highlight profiles constructed with different biological data, such as target-protein interactions, gene expression measurements, adverse effects and disease profiles. We focus on the discovery of new targets or pathways for drugs already in the pharmaceutical market, also called drug repurposing, on the interaction with off-targets responsible for adverse reactions, and on drug-drug interaction analysis. The current and future applications, strengths and challenges facing all these methods are also discussed. Biological profiles or signatures are an important source of data for analyzing biological actions in depth, with important implications for drug-related studies.
Affiliation(s)
- Santiago Vilar
- Department of Biomedical Informatics, Columbia University Medical Center, New York, NY 10032, USA (corresponding author)
- George Hripcsak
- Department of Biomedical Informatics, Columbia University Medical Center, New York, NY 10032, USA (corresponding author)
6
Manda P, Balhoff JP, Lapp H, Mabee P, Vision TJ. Using the phenoscape knowledgebase to relate genetic perturbations to phenotypic evolution. Genesis 2015. [PMID: 26220875 DOI: 10.1002/dvg.22878] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/03/2023]
Abstract
The abundance of phenotypic diversity among species can enrich our knowledge of development and genetics beyond the limits of variation that can be observed in model organisms. The Phenoscape Knowledgebase (KB) is designed to enable exploration and discovery of phenotypic variation among species. Because phenotypes in the KB are annotated using standard ontologies, evolutionary phenotypes can be compared with phenotypes from genetic perturbations in model organisms. To illustrate the power of this approach, we review the use of the KB to find taxa showing evolutionary variation similar to that of a query gene. Matches are made between the full set of phenotypes described for a gene and an evolutionary profile, the latter of which is defined as the set of phenotypes that are variable among the daughters of any node on the taxonomic tree. Phenoscape's semantic similarity interface allows the user to assess the statistical significance of each match and flags matches that may only result from differences in annotation coverage between genetic and evolutionary studies. Tools such as this will help meet the challenge of relating the growing volume of genetic knowledge in model organisms to the diversity of phenotypes in nature. The Phenoscape KB is available at http://kb.phenoscape.org.
Affiliation(s)
- Prashanti Manda
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina; US National Evolutionary Synthesis Center, Durham, North Carolina
- James P Balhoff
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina; US National Evolutionary Synthesis Center, Durham, North Carolina
- Hilmar Lapp
- US National Evolutionary Synthesis Center, Durham, North Carolina; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina
- Paula Mabee
- Department of Biology, University of South Dakota, Vermillion, South Dakota
- Todd J Vision
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina; US National Evolutionary Synthesis Center, Durham, North Carolina
7
Bello SM, Smith CL, Eppig JT. Allele, phenotype and disease data at Mouse Genome Informatics: improving access and analysis. Mamm Genome 2015; 26:285-94. [PMID: 26162703 PMCID: PMC4534497 DOI: 10.1007/s00335-015-9582-y] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 05/15/2015] [Accepted: 06/23/2015] [Indexed: 11/16/2022]
Abstract
A core part of the Mouse Genome Informatics (MGI) resource is the collection of mouse mutations and the annotation of phenotypes and diseases displayed by mice carrying these mutations. These data are integrated with the rest of the data in MGI and exported to numerous other resources. The use of mouse phenotype data to drive translational research into human disease has expanded rapidly with the improvements in sequencing technology. MGI has implemented many improvements in allele and phenotype data annotation, search, and display to facilitate access to these data through multiple avenues. For example, the description of alleles has been modified to include more detailed categories of allele attributes. This allows improved discrimination between mutation types. Further, connections have been created between mutations involving multiple genes and each of the genes overlapping the mutation. This allows users to readily find all mutations affecting a gene and see all genes affected by a mutation. In a similar manner, the genes expressed by transgenic or knock-in alleles are now connected to these alleles. The advanced search forms and public reports have been updated to take advantage of these improvements. These search forms and reports are used by an expanding number of researchers to identify novel human disease genes and mouse models of human disease.
Affiliation(s)
- Susan M Bello
- Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME, 04609, USA