1
Louarn M, Collet G, Barré È, Fest T, Dameron O, Siegel A, Chatonnet F. Regulus infers signed regulatory relations from few samples' information using discretization and likelihood constraints. PLoS Comput Biol 2024; 20:e1011816. [PMID: 38252636 PMCID: PMC10833539 DOI: 10.1371/journal.pcbi.1011816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Revised: 02/01/2024] [Accepted: 01/08/2024] [Indexed: 01/24/2024] Open
Abstract
MOTIVATION Transcriptional regulation is performed by transcription factors (TFs) binding to DNA in context-dependent regulatory regions and determines the activation or inhibition of gene expression. Current methods for inferring transcriptional regulatory circuits, based on one or all of TF, region and gene activity measurements, require a large number of samples to rank the candidate TF-gene regulatory relations and rarely predict whether they are activations or inhibitions. We hypothesize that transcriptional regulatory circuits can be inferred from fewer samples by (1) fully integrating information on TF binding, gene expression and regulatory region accessibility, (2) reducing data complexity and (3) using biology-based likelihood constraints to determine the global consistency between a candidate TF-gene relation and patterns of gene expression and region activation, and to qualify regulations as activations or inhibitions. RESULTS We introduce Regulus, a method that computes TF-gene relations from gene expression, regulatory region activity and TF binding site data, together with the genomic locations of all entities. After gene expressions and region activities are aggregated into patterns, the data are integrated into an RDF (Resource Description Framework) endpoint. A dedicated SPARQL (SPARQL Protocol and RDF Query Language) query retrieves all potential relations between expressed TFs and genes involving active regulatory regions. These TF-region-gene relations are then filtered using biological likelihood constraints, which also qualify them as activations or inhibitions. Regulus provides signed relations consistent with public databases and, when applied to biological data, identifies both known and potential new regulators.
Regulus is devoted to context-specific transcriptional circuit inference in human settings where samples are scarce and cell populations are closely related, using discretization into patterns and likelihood reasoning to decipher the most robust regulatory relations.
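The discretization and sign-qualification idea in this abstract can be illustrated with a minimal Python sketch. It is not the Regulus implementation: the thresholds, the three-level encoding and the functions `discretize` and `sign_relation` are hypothetical, and the rule used here (activation when the TF and gene patterns vary together, inhibition when they vary in opposite directions) is a simplifying assumption rather than the published constraint set.

```python
def discretize(values, low, high):
    """Map each measurement to -1 (low), 0 (medium) or +1 (high).

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    return tuple(-1 if v < low else (1 if v > high else 0) for v in values)


def sign_relation(tf_pattern, gene_pattern):
    """Qualify a candidate TF-gene relation from two discretized patterns.

    Returns "+" if TF and gene agree in every informative cell population,
    "-" if they are opposite in every informative population, and None if
    the patterns are inconsistent or carry no information.
    """
    # Only populations where both entities are clearly low or high count.
    products = [t * g for t, g in zip(tf_pattern, gene_pattern)
                if t != 0 and g != 0]
    if not products:
        return None
    if all(p > 0 for p in products):
        return "+"
    if all(p < 0 for p in products):
        return "-"
    return None


# Toy expression values across three cell populations (made-up numbers).
tf = discretize([1.0, 5.0, 10.0], low=3.0, high=8.0)
gene = discretize([9.5, 4.0, 0.5], low=3.0, high=8.0)
print(tf, gene, sign_relation(tf, gene))
```

Reducing each profile to a short pattern is what makes a rule of this kind usable with few samples: the comparison depends only on the direction of change, not on a correlation estimated over many measurements.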
Affiliation(s)
- Marine Louarn
  - Univ Rennes, CNRS, Inria, IRISA - UMR 6074, Rennes, France
  - UMR_S 1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, France
- Ève Barré
  - Univ Rennes, CNRS, Inria, IRISA - UMR 6074, Rennes, France
- Thierry Fest
  - UMR_S 1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, France
  - Laboratoire d’Hématologie, Pôle de Biologie, CHU de Rennes, Rennes, France
- Anne Siegel
  - Univ Rennes, CNRS, Inria, IRISA - UMR 6074, Rennes, France
- Fabrice Chatonnet
  - UMR_S 1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, France
  - Laboratoire d’Hématologie, Pôle de Biologie, CHU de Rennes, Rennes, France
2
García-Closas M, Ahearn TU, Gaudet MM, Hurson AN, Balasubramanian JB, Choudhury PP, Gerlanc NM, Patel B, Russ D, Abubakar M, Freedman ND, Wong WSW, Chanock SJ, Berrington de Gonzalez A, Almeida JS. Moving Toward Findable, Accessible, Interoperable, Reusable Practices in Epidemiologic Research. Am J Epidemiol 2023; 192:995-1005. [PMID: 36804665 PMCID: PMC10505418 DOI: 10.1093/aje/kwad040] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/25/2022] [Revised: 11/28/2022] [Accepted: 02/16/2023] [Indexed: 02/22/2023] Open
Abstract
Data sharing is essential for reproducibility of epidemiologic research, replication of findings, pooled analyses in consortia efforts, and maximizing study value to address multiple research questions. However, barriers related to confidentiality, costs, and incentives often limit the extent and speed of data sharing. Epidemiological practices that follow Findable, Accessible, Interoperable, Reusable (FAIR) principles can address these barriers by making data resources findable with the necessary metadata, accessible to authorized users, and interoperable with other data, to optimize the reuse of resources with appropriate credit to its creators. We provide an overview of these principles and describe approaches for implementation in epidemiology. Increasing degrees of FAIRness can be achieved by moving data and code from on-site locations to remote, accessible ("Cloud") data servers, using machine-readable and nonproprietary files, and developing open-source code. Adoption of these practices will improve daily work and collaborative analyses and facilitate compliance with data sharing policies from funders and scientific journals. Achieving a high degree of FAIRness will require funding, training, organizational support, recognition, and incentives for sharing research resources, both data and code. However, these costs are outweighed by the benefits of making research more reproducible, impactful, and equitable by facilitating the reuse of precious research resources by the scientific community.
Affiliation(s)
- Montserrat García-Closas
  - Correspondence to Dr. Montserrat García-Closas, Trans-Divisional Research Program, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD 20850
3
Touré V, Krauss P, Gnodtke K, Buchhorn J, Unni D, Horki P, Raisaro JL, Kalt K, Teixeira D, Crameri K, Österle S. FAIRification of health-related data using semantic web technologies in the Swiss Personalized Health Network. Sci Data 2023; 10:127. [PMID: 36899064 PMCID: PMC10006404 DOI: 10.1038/s41597-023-02028-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 02/17/2023] [Indexed: 03/12/2023] Open
Abstract
The Swiss Personalized Health Network (SPHN) is a government-funded initiative developing federated infrastructures for a responsible and efficient secondary use of health data for research purposes in compliance with the FAIR principles (Findable, Accessible, Interoperable and Reusable). We built a common standard infrastructure with a fit-for-purpose strategy to bring together health-related data and ease the work of both data providers to supply data in a standard manner and researchers by enhancing the quality of the collected data. As a result, the SPHN Resource Description Framework (RDF) schema was implemented together with a data ecosystem that encompasses data integration, validation tools, analysis helpers, training and documentation for representing health metadata and data in a consistent manner and reaching nationwide data interoperability goals. Data providers can now efficiently deliver several types of health data in a standardised and interoperable way while a high degree of flexibility is granted for the various demands of individual research projects. Researchers in Switzerland have access to FAIR health data for further use in RDF triplestores.
Affiliation(s)
- Vasundra Touré
  - Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Philip Krauss
  - Trivadis - Part of Accenture, 4051, Basel, Switzerland
- Kristin Gnodtke
  - Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Deepak Unni
  - Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Petar Horki
  - Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Jean Louis Raisaro
  - Health Informatics and Data Privacy Group, Biomedical Data Science Center, 1010 Lausanne University Hospital, Lausanne, Switzerland
- Katie Kalt
  - Clinical Data Platform Research, Directorate of Research and Education, Zurich University Hospital, 8091, Zurich, Switzerland
- Daniel Teixeira
  - DSI - Data Group, Geneva University Hospital, 1205, Geneva, Switzerland
- Katrin Crameri
  - Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Sabine Österle
  - Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
4
Laufs D, Peters M, Schultz C. Data platforms for open life sciences-A systematic analysis of management instruments. PLoS One 2022; 17:e0276204. [PMID: 36282849 PMCID: PMC9595524 DOI: 10.1371/journal.pone.0276204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Accepted: 10/02/2022] [Indexed: 11/05/2022] Open
Abstract
Open data platforms are interfaces between data demand of and supply from their users. Yet, data platform providers frequently struggle to aggregate data to suit their users' needs and to establish a high intensity of data exchange in a collaborative environment. Here, using open life science data platforms as an example for a diverse data structure, we systematically categorize these platforms based on their technology intermediation and the range of domains they cover to derive general and specific success factors for their management instruments. Our qualitative content analysis is based on 39 in-depth interviews with experts employed by data platforms and external stakeholders. We thus complement peer initiatives which focus solely on data quality, by additionally highlighting the data platforms' role to enable data utilization for innovative output. Based on our analysis, we propose a clearly structured and detailed guideline for seven management instruments. This guideline helps to establish and operationalize data platforms and to best exploit the data provided. Our findings support further exploitation of the open innovation potential in the life sciences and beyond.
Affiliation(s)
- Daniel Laufs
  - Technology Management Research Group, Faculty of Business, Economics and Social Sciences, Kiel University, Kiel, SH, Germany
- Mareike Peters
  - Technology Management Research Group, Faculty of Business, Economics and Social Sciences, Kiel University, Kiel, SH, Germany
- Carsten Schultz
  - Technology Management Research Group, Faculty of Business, Economics and Social Sciences, Kiel University, Kiel, SH, Germany
5
Louarn M, Chatonnet F, Garnier X, Fest T, Siegel A, Faron C, Dameron O. Improving reusability along the data life cycle: a regulatory circuits case study. J Biomed Semantics 2022; 13:11. [PMID: 35346379 PMCID: PMC8962212 DOI: 10.1186/s13326-022-00266-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/02/2021] [Accepted: 03/07/2022] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND In the life sciences, there has been a long-standing effort to standardize and integrate reference datasets and databases. Despite these efforts, the data of many studies are provided in specific, non-standard formats. This hampers the capacity to reuse study data in other pipelines, to reuse pipeline results in other studies, and to enrich the data with additional information. The Regulatory Circuits project is one of the largest efforts to integrate human cell genomics data to predict tissue-specific transcription factor-gene interaction networks. In spite of its success, it exhibits the usual shortcomings that limit its update, its reuse (as a whole or partially), and its extension with new data samples. To address these limitations, the resource was previously integrated into an RDF triplestore so that TF-gene interaction networks could be generated with two SPARQL queries. However, this triplestore did not store the computed networks and did not integrate metadata about tissues and samples, which limited the reuse of the dataset. In particular, it does not allow reusing only a portion of Regulatory Circuits when a study focuses on a subset of the tissues, nor combining the samples described in the datasets with samples from other studies. Overall, these limitations call for a complete, flexible and reusable representation of the Regulatory Circuits dataset based on Semantic Web technologies. RESULTS We provide a modular RDF representation of Regulatory Circuits, called Linked Extended Regulatory Circuits (LERC). It consists of (i) descriptions of biological and experimental context mapped to reference databases, (ii) annotations about TF-gene interactions at the sample level for 808 samples, (iii) annotations about TF-gene interactions at the tissue level for 394 tissues, and (iv) metadata connecting the knowledge graphs cited above.
LERC is organised into 1,205 RDF named graphs representing the biological data, the sample-specific and tissue-specific networks, and the corresponding metadata. In total it contains 3,910,794,050 triples and is available as a SPARQL endpoint. CONCLUSION The flexible and modular architecture of LERC supports biologically relevant SPARQL queries. It allows easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates their reuse in other studies. ASSOCIATED WEBSITE: https://regulatorycircuits-lod.genouest.org.
Affiliation(s)
- Marine Louarn
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
  - UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000 France
- Fabrice Chatonnet
  - UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000 France
  - Laboratoire d’Hématologie, Pôle de Biologie, Centre Hospitalier Universitaire de Rennes, Rennes, 35033 France
- Xavier Garnier
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
- Thierry Fest
  - UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000 France
  - Laboratoire d’Hématologie, Pôle de Biologie, Centre Hospitalier Universitaire de Rennes, Rennes, 35033 France
- Anne Siegel
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
- Catherine Faron
  - Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France
- Olivier Dameron
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
6
Abstract
Knowledge graphs (KGs) have rapidly emerged as an important area in AI over the last ten years. Building on a storied tradition of graphs in the AI community, a KG may be simply defined as a directed, labeled, multi-relational graph with some form of semantics. In part, this has been fueled by increased publication of structured datasets on the Web, and well-publicized successes of large-scale projects such as the Google Knowledge Graph and the Amazon Product Graph. However, another factor that is less discussed, but which has been equally instrumental in the success of KGs, is the cross-disciplinary nature of academic KG research. Arguably, because of the diversity of this research, a synthesis of how different KG research strands all tie together could serve a useful role in enabling more ‘moonshot’ research and large-scale collaborations. This review of the KG research landscape attempts to provide such a synthesis by first showing what the major strands of research are, and how those strands map to different communities, such as Natural Language Processing, Databases and Semantic Web. A unified framework is suggested in which to view the distinct, but overlapping, foci of KG research within these communities.
7
High-dimensional role of AI and machine learning in cancer research. Br J Cancer 2022; 126:523-532. [PMID: 35013580 PMCID: PMC8854697 DOI: 10.1038/s41416-021-01689-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 04/23/2021] [Revised: 11/23/2021] [Accepted: 12/23/2021] [Indexed: 01/12/2023] Open
Abstract
Artificial Intelligence and Machine Learning offer several advantages in cancer research, primarily scaling up information processing and increasing the accuracy of clinical decision-making. The key enabling tools currently in use in Precision, Digital and Translational Medicine, here named 'Intelligent Systems' (IS), leverage unprecedented data volumes and aim to model their underlying heterogeneous influences and the variables correlated with patients' outcomes. As the functionality and performance of IS are associated with complex diagnosis and therapy decisions, a rich spectrum of patterns and features detected in high-dimensional data may be critical for inference purposes. Many challenges are also present in this discovery task. The first is generating interpretable model results from a mix of structured and unstructured input information. The second is designing and implementing automated clinical decision processes for drawing disease trajectories and patient profiles. Ultimately, the clinical impact depends on the data being effectively subjected to steps such as harmonisation, integration and validation. The aim of this work is to discuss the transformative value of IS applied to multimodal data acquired through various interrelated cancer domains (high-throughput genomics, experimental biology, medical image processing, radiomics, patient electronic records, etc.).
8
A Domain-Adaptable Heterogeneous Information Integration Platform: Tourism and Biomedicine Domains. Information 2021. [DOI: 10.3390/info12110435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
In recent years, information integration systems have become very popular in mashup-type applications. Information sources are normally presented in an individual and unrelated fashion, and the development of new technologies to reduce the negative effects of information dispersion is needed. A major challenge is the integration and implementation of processing pipelines using different technologies promoting the emergence of advanced architectures capable of processing such a number of diverse sources. This paper describes a semantic domain-adaptable platform to integrate those sources and provide high-level functionalities, such as recommendations, shallow and deep natural language processing, text enrichment, and ontology standardization. Our proposed intelligent domain-adaptable platform (IDAP) has been implemented and tested in the tourism and biomedicine domains to demonstrate the adaptability, flexibility, modularity, and utility of the platform. Questionnaires, performance metrics, and A/B control groups’ evaluations have shown improvements when using IDAP in learning environments.
9
Bresso E, Monnin P, Bousquet C, Calvier FE, Ndiaye NC, Petitpain N, Smaïl-Tabbone M, Coulet A. Investigating ADR mechanisms with Explainable AI: a feasibility study with knowledge graph mining. BMC Med Inform Decis Mak 2021; 21:171. [PMID: 34039343 PMCID: PMC8157660 DOI: 10.1186/s12911-021-01518-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/15/2020] [Accepted: 05/05/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Adverse drug reactions (ADRs) are statistically characterized within randomized clinical trials and postmarketing pharmacovigilance, but their molecular mechanism remains unknown in most cases. This is true even for hepatic or skin toxicities, which are classically monitored during drug design. Aside from clinical trials, many elements of knowledge about drug ingredients are available in open-access knowledge graphs, such as their properties, interactions, or involvements in pathways. In addition, drug classifications that label drugs as either causative or not for several ADRs have been established. METHODS We propose in this paper to mine knowledge graphs to identify biomolecular features that may enable automatic reproduction of expert classifications that distinguish drugs causative or not for a given type of ADR. From an Explainable AI perspective, we explore simple classification techniques such as Decision Trees and Classification Rules because they provide human-readable models, which explain the classification itself, but may also provide elements of explanation for molecular mechanisms behind ADRs. In summary, (1) we mine a knowledge graph for features; (2) we train classifiers to distinguish, on the basis of extracted features, drugs associated or not with two commonly monitored ADRs: drug-induced liver injuries (DILI) and severe cutaneous adverse reactions (SCAR); (3) we isolate features that are both efficient in reproducing expert classifications and interpretable by experts (i.e., Gene Ontology terms, drug targets, or pathway names); and (4) we manually evaluate in a mini-study how they may be explanatory. RESULTS Extracted features reproduce with good fidelity the classifications of drugs causative or not for DILI and SCAR (Accuracy = 0.74 and 0.81, respectively).
Experts fully agreed that 73% and 38% of the most discriminative features are possibly explanatory for DILI and SCAR, respectively; and partially agreed (2/3) for 90% and 77% of them. CONCLUSION Knowledge graphs provide sufficiently diverse features to enable simple and explainable models to distinguish between drugs that are causative or not for ADRs. In addition to explaining classifications, the most discriminative features appear to be good candidates for investigating ADR mechanisms further.
Affiliation(s)
- Emmanuel Bresso
  - Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
  - Centre d’Investigations Cliniques Plurithématique 1433, Inserm 1116, CHRU de Nancy, Université de Lorraine, Nancy, France
- Pierre Monnin
  - Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
  - Orange, Belfort, France
- Cédric Bousquet
  - Service de santé publique et information médicale, CHU de Saint Etienne, Saint Etienne, France
  - Sorbonne Université, Inserm, Université Paris 13, LIMICS, Paris, France
- François-Elie Calvier
  - Service de santé publique et information médicale, CHU de Saint Etienne, Saint Etienne, France
- Nadine Petitpain
  - Centre Régional de Pharmacovigilance, CHRU of Nancy, Nancy, France
- Adrien Coulet
  - Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
  - Inria Paris, Paris, France
  - Centre de Recherche des Cordeliers, INSERM, Sorbonne Université, Université de Paris, Paris, France
10
Kamdar MR, Musen MA. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 2021; 8:24. [PMID: 33479214 PMCID: PMC7819992 DOI: 10.1038/s41597-021-00797-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/04/2020] [Accepted: 12/04/2020] [Indexed: 01/29/2023] Open
Abstract
While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 biomedical linked open data sources into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
Affiliation(s)
- Maulik R Kamdar
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  - Elsevier Health Markets, Philadelphia, PA, USA
- Mark A Musen
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
11
Thessen AE, Grondin CJ, Kulkarni RD, Brander S, Truong L, Vasilevsky NA, Callahan TJ, Chan LE, Westra B, Willis M, Rothenberg SE, Jarabek AM, Burgoon L, Korrick SA, Haendel MA. Community Approaches for Integrating Environmental Exposures into Human Models of Disease. Environmental Health Perspectives 2020; 128:125002. [PMID: 33369481 PMCID: PMC7769179 DOI: 10.1289/ehp7215] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 04/09/2020] [Revised: 11/30/2020] [Accepted: 12/04/2020] [Indexed: 05/03/2023]
Abstract
BACKGROUND A critical challenge in genomic medicine is identifying the genetic and environmental risk factors for disease. Currently, the available data links a majority of known coding human genes to phenotypes, but the environmental component of human disease is extremely underrepresented in these linked data sets. Without environmental exposure information, our ability to realize precision health is limited, even with the promise of modern genomics. Achieving integration of gene, phenotype, and environment will require extensive translation of data into a standard, computable form and the extension of the existing gene/phenotype data model. The data standards and models needed to achieve this integration do not currently exist. OBJECTIVES Our objective is to foster development of community-driven data-reporting standards and a computational model that will facilitate the inclusion of exposure data in computational analysis of human disease. To this end, we present a preliminary semantic data model and use cases and competency questions for further community-driven model development and refinement. DISCUSSION There is a real desire by the exposure science, epidemiology, and toxicology communities to use informatics approaches to improve their research workflow, gain new insights, and increase data reuse. Critical to success is the development of a community-driven data model for describing environmental exposures and linking them to existing models of human disease. https://doi.org/10.1289/EHP7215.
Affiliation(s)
- Anne E. Thessen
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, USA
  - Ronin Institute for Independent Scholarship, Montclair, New Jersey, USA
- Cynthia J. Grondin
  - Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina, USA
- Resham D. Kulkarni
  - Biomedical Informatics and Data Science, Frederick National Laboratory for Cancer Research, Frederick, Maryland, USA
- Susanne Brander
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, USA
- Lisa Truong
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, USA
- Nicole A. Vasilevsky
  - Oregon Clinical & Translational Research Institute, Oregon Health & Science University, Portland, Oregon, USA
  - Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, Oregon, USA
- Tiffany J. Callahan
  - Computational Bioscience Program, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado, USA
  - Department of Pharmacology, School of Medicine, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado, USA
- Lauren E. Chan
  - Nutrition, Oregon State University, Corvallis, Oregon, USA
- Brian Westra
  - University Libraries, University of Iowa, Iowa City, Iowa, USA
- Mary Willis
  - School of Biological and Population Health Sciences, College of Public Health and Human Sciences, Oregon State University, Corvallis, Oregon, USA
- Sarah E. Rothenberg
  - School of Biological and Population Health Sciences, College of Public Health and Human Sciences, Oregon State University, Corvallis, Oregon, USA
- Annie M. Jarabek
  - Center for Public Health and Environmental Assessment, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina, USA
- Lyle Burgoon
  - U.S. Army Engineering Research and Development Center, Vicksburg, Mississippi, USA
- Susan A. Korrick
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Melissa A. Haendel
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, USA
12
Kamdar MR, Stanley CE, Carroll M, Wogulis L, Dowling W, Deus HF, Samarasinghe M. Text Snippets to Corroborate Medical Relations: An Unsupervised Approach using a Knowledge Graph and Embeddings. AMIA Joint Summits on Translational Science Proceedings 2020; 2020:288-297. [PMID: 32477648 PMCID: PMC7233036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Knowledge graphs have been shown to significantly improve search results. Usually populated by subject matter experts, relations therein need to keep up to date with medical literature in order for search to remain relevant. Dynamically identifying text snippets in literature that confirm or deny knowledge graph triples is increasingly becoming the differentiator between trusted and untrusted medical decision support systems. This work describes our approach to mapping triples to medical text. A medical knowledge graph is used as a source of triples that are used to find matching sentences in reference text. Our unsupervised approach uses phrase embeddings and cosine similarity measures, and boosts candidate text snippets when certain key concepts exist. Using this approach, we can accurately map semantic relations within the medical knowledge graph to text snippets with a precision of 61.4% and recall of 86.3%. This method will be used to develop a novel application in the future to retrieve medical relations and corroborating snippets from medical text given a user query.
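The scoring step described in this abstract (cosine similarity between embeddings, with a boost when key concepts appear) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' system: the function names, the additive `boost` value, the substring test for key concepts and the toy two-dimensional vectors are all hypothetical, and real phrase embeddings would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_snippets(triple_vec, snippets, key_concepts, boost=0.1):
    """Rank (text, vector) snippets against a triple embedding.

    The score is the cosine similarity, plus a hypothetical additive
    boost when the snippet text mentions any key concept.
    """
    scored = []
    for text, vec in snippets:
        score = cosine_similarity(triple_vec, vec)
        if any(c.lower() in text.lower() for c in key_concepts):
            score += boost
        scored.append((score, text))
    # Highest score first.
    return sorted(scored, reverse=True)


# Toy data: made-up vectors standing in for phrase embeddings.
triple_vec = [1.0, 0.0]
snippets = [
    ("Aspirin treats headache.", [0.9, 0.1]),
    ("The weather is mild.", [0.0, 1.0]),
]
for score, text in rank_snippets(triple_vec, snippets, ["aspirin"]):
    print(round(score, 3), text)
```

Applying a cutoff to the ranked scores is what would trade precision against recall in a setup like this; where that cutoff sits is a tuning decision the sketch leaves open.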
Affiliation(s)
- Linda Wogulis
  - Elsevier, Health and Commercial Markets, Philadelphia, PA
- Helena F Deus
  - Elsevier, Health and Commercial Markets, Philadelphia, PA