1
Herr BW, Hardi J, Quardokus EM, Bueckle A, Chen L, Wang F, Caron AR, Osumi-Sutherland D, Musen MA, Börner K. Specimen, biological structure, and spatial ontologies in support of a Human Reference Atlas. Sci Data 2023; 10:171. [PMID: 36973309 PMCID: PMC10043028 DOI: 10.1038/s41597-023-01993-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Accepted: 01/30/2023] [Indexed: 03/29/2023] Open
Abstract
The Human Reference Atlas (HRA) is defined as a comprehensive, three-dimensional (3D) atlas of all the cells in the healthy human body. It is compiled by an international team of experts who develop standard terminologies that they link to 3D reference objects, describing anatomical structures. The third HRA release (v1.2) covers spatial reference data and ontology annotations for 26 organs. Experts access the HRA annotations via spreadsheets and view reference object models in 3D editing tools. This paper introduces the Common Coordinate Framework (CCF) Ontology v2.0.1 that interlinks specimen, biological structure, and spatial data, together with the CCF API that makes the HRA programmatically accessible and interoperable with Linked Open Data (LOD). We detail how real-world user needs and experimental data guide CCF Ontology design and implementation, present CCF Ontology classes and properties together with exemplary usage, and report on validation methods. The CCF Ontology graph database and API are used in the HuBMAP portal, HRA Organ Gallery, and other applications that support data queries across multiple, heterogeneous sources.
Affiliation(s)
- Bruce W Herr
  - Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Josef Hardi
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, 94305, USA
- Ellen M Quardokus
  - Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Andreas Bueckle
  - Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Lu Chen
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, 11794, USA
- Fusheng Wang
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, 11794, USA
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, 11794, USA
- Anita R Caron
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Mark A Musen
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, 94305, USA
- Katy Börner
  - Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
2
3
van Reisen M, Oladipo F, Stokmans M, Mpezamihigo M, Folorunso S, Schultes E, Basajja M, Aktau A, Amare SY, Taye GT, Purnama Jati PH, Chindoza K, Wirtz M, Ghardallou M, van Stam G, Ayele W, Nalugala R, Abdullahi I, Osigwe O, Graybeal J, Medhanyie AA, Kawu AA, Liu F, Wolstencroft K, Flikkenschild E, Lin Y, Stocker J, Musen MA. Design of a FAIR digital data health infrastructure in Africa for COVID-19 reporting and research. Adv Genet (Hoboken) 2021; 2:e10050. [PMID: 34514430 PMCID: PMC8420285 DOI: 10.1002/ggn2.10050] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/30/2021] [Revised: 05/20/2021] [Accepted: 05/21/2021] [Indexed: 12/13/2022]
Abstract
The limited volume of COVID‐19 data from Africa raises concerns for global genome research, which requires a diversity of genotypes for accurate disease prediction, including on the provenance of the new SARS‐CoV‐2 mutations. The Virus Outbreak Data Network (VODAN)‐Africa studied the possibility of increasing the production of clinical data, finding concerns about data ownership, and the limited use of health data for quality treatment at point of care. To address this, VODAN Africa developed an architecture to record clinical health data and research data collected on the incidence of COVID‐19, producing these as human‐ and machine‐readable data objects in a distributed architecture of locally governed, linked, human‐ and machine‐readable data. This architecture supports analytics at the point of care and—through data visiting, across facilities—for generic analytics. An algorithm was run across FAIR Data Points to visit the distributed data and produce aggregate findings. The FAIR data architecture is deployed in Uganda, Ethiopia, Liberia, Nigeria, Kenya, Somalia, Tanzania, Zimbabwe, and Tunisia.
Affiliation(s)
- Mirjam van Reisen
  - Leiden University Leiden Netherlands
  - Leiden University Medical Centre (LUMC) Leiden University Leiden Netherlands
  - Leiden Institute of Advanced Computer Science (LIACS) Leiden University Leiden Netherlands
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
- Mia Stokmans
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
- Sakinat Folorunso
  - Department of Computer Science Olabisi Onabanjo University Ago Iwoye Nigeria
- Mariam Basajja
  - Leiden University Leiden Netherlands
  - Leiden Institute of Advanced Computer Science (LIACS) Leiden University Leiden Netherlands
- Aliya Aktau
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
- Getu Tadele Taye
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
  - Department of Health informatics, School of Public Health Mekelle University Mek'ele Ethiopia
- Putu Hadi Purnama Jati
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
  - Badan Pusat Statistik Central Jakarta Indonesia
- Kudakwashe Chindoza
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
  - Department of Computer Science Great Zimbabwe University Masvingo Zimbabwe
- Morgane Wirtz
  - Faculty of Humanities and Digital Sciences Tilburg University Tilburg Netherlands
- Meriem Ghardallou
  - Department of Community Medicine Université de Sousse Sousse Tunisia
- Wondimu Ayele
  - Department of Biostatistics and Epidemiology, School of Public health College of Health Sciences Addis Ababa University Addis Ababa Ethiopia
- John Graybeal
  - Stanford Center for Biomedical Informatics Research Stanford University Stanford California USA
- Araya Abrha Medhanyie
  - Department of Reproductive health, School of Public Health Mekelle University Mek'ele Ethiopia
- Erik Flikkenschild
  - Leiden University Medical Centre (LUMC) Leiden University Leiden Netherlands
- Yi Lin
  - Leiden University Leiden Netherlands
- Joëlle Stocker
  - Department of Geosciences Utrecht University Utrecht Netherlands
- Mark A Musen
  - Stanford Center for Biomedical Informatics Research Stanford University Stanford California USA
4
Maitra A, Kamdar MR, Zulman DM, Haverfield MC, Brown-Johnson C, Schwartz R, Israni ST, Verghese A, Musen MA. Using ethnographic methods to classify the human experience in medicine: a case study of the presence ontology. J Am Med Inform Assoc 2021; 28:1900-1909. [PMID: 34151988 PMCID: PMC8363802 DOI: 10.1093/jamia/ocab091] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/08/2021] [Revised: 04/26/2021] [Accepted: 05/13/2021] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE: Although social and environmental factors are central to provider-patient interactions, the data that reflect these factors can be incomplete, vague, and subjective. We sought to create a conceptual framework to describe and classify data about presence, the domain of interpersonal connection in medicine. METHODS: Our top-down approach for ontology development based on the concept of "relationality" included the following: 1) a broad survey of the social sciences literature and a systematic literature review of >20 000 articles around interpersonal connection in medicine, 2) relational ethnography of clinical encounters (n = 5 pilot, 27 full), and 3) interviews about relational work with 40 medical and nonmedical professionals. We formalized the model using the Web Ontology Language in the Protégé ontology editor. We iteratively evaluated and refined the Presence Ontology through manual expert review and automated annotation of literature. RESULTS AND DISCUSSION: The Presence Ontology facilitates the naming and classification of concepts that would otherwise be vague. Our model categorizes contributors to healthcare encounters and factors such as communication, emotions, tools, and environment. Ontology evaluation indicated that cognitive models (both patients' explanatory models and providers' caregiving approaches) influenced encounters and were subsequently incorporated. We show how ethnographic methods based in relationality can aid the representation of experiential concepts (eg, empathy, trust). Our ontology could support investigative methods to improve healthcare processes for both patients and healthcare providers, including annotation of videotaped encounters, development of clinical instruments to measure presence, or implementation of electronic health record-based reminders for providers. CONCLUSION: The Presence Ontology provides a model for using ethnographic approaches to classify interpersonal data.
Affiliation(s)
- Amrapali Maitra
  - Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts, USA
  - Presence Center, Stanford University School of Medicine, Stanford, California, USA
- Maulik R Kamdar
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Donna M Zulman
  - Division of Primary Care and Population Health, Stanford University, Stanford, California, USA
  - Center for Innovation to Implementation, VA Palo Alto Health Care System, Menlo Park, California, USA
- Marie C Haverfield
  - Department of Communication Studies, San Jose State University, San Jose, California, USA
- Cati Brown-Johnson
  - Division of Primary Care and Population Health, Stanford University, Stanford, California, USA
- Rachel Schwartz
  - WellMD Center, Stanford University School of Medicine, Stanford, California, USA
- Abraham Verghese
  - Presence Center, Stanford University School of Medicine, Stanford, California, USA
- Mark A Musen
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
5
Abstract
While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 biomedical linked open data sources into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
Affiliation(s)
- Maulik R Kamdar
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  - Elsevier Health Markets, Philadelphia, PA, USA
- Mark A Musen
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
6
Tu SW, Nyulas CI, Tudorache T, Musen MA, Martinuzzi A, van Gool C, Mea VD, Chute CG, Frattura L, Hardiker N, Napel HT, Madden R, Almborg AH, Ginige JA, Sykes C, Cekik C, Jakob R. Toward a Harmonized WHO Family of International Classifications Content Model. Stud Health Technol Inform 2020; 270:1409-1410. [PMID: 32570683 DOI: 10.3233/shti200466] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
An overarching WHO-FIC Content Model will allow uniform modeling of classifications in the WHO Family of International Classifications (WHO-FIC) and promote their joint use. We provide an initial conceptualization of such a model.
Affiliation(s)
- Coen van Gool
  - National Institute for Public Health and the Environment, The Netherlands
- Huib Ten Napel
  - National Institute for Public Health and the Environment, The Netherlands
7
O'Connor MJ, Warzel DB, Martínez-Romero M, Hardi J, Willrett D, Egyedi AL, Eftekhari A, Graybeal J, Musen MA. Unleashing the value of Common Data Elements through the CEDAR Workbench. AMIA Annu Symp Proc 2020; 2019:681-690. [PMID: 32308863 PMCID: PMC7153094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Developing promising treatments in biomedicine often requires aggregation and analysis of data from disparate sources across the healthcare and research spectrum. To facilitate these approaches, there is a growing focus on supporting interoperation of datasets by standardizing data-capture and reporting requirements. Common Data Elements (CDEs), precise specifications of questions and the set of allowable answers to each question, are increasingly being adopted to help meet these standardization goals. While CDEs can provide a strong conceptual foundation for interoperation, there are no widely recognized serialization or interchange formats to describe and exchange their definitions. As a result, CDEs defined in one system cannot easily be reused by other systems. An additional problem is that current CDE-based systems tend to be rather heavyweight and cannot be easily adopted and used by third parties. To address these problems, we developed extensions to a metadata management system called the CEDAR Workbench to provide a platform to simplify the creation, exchange, and use of CDEs. We show how the resulting system allows users to quickly define and share CDEs and to immediately use these CDEs to build and deploy Web-based forms to acquire conforming metadata. We also show how we incorporated a large CDE library from the National Cancer Institute's caDSR system and made these CDEs publicly available for general use.
Affiliation(s)
- Martin J O'Connor
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Denise B Warzel
  - Cancer Informatics Branch, National Cancer Institute, Bethesda, MD, USA
- Josef Hardi
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Debra Willrett
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Attila L Egyedi
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- John Graybeal
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Mark A Musen
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
8
Kamdar MR, Fernández JD, Polleres A, Tudorache T, Musen MA. Enabling Web-scale data integration in biomedicine through Linked Open Data. NPJ Digit Med 2019; 2:90. [PMID: 31531395 PMCID: PMC6736878 DOI: 10.1038/s41746-019-0162-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 04/09/2019] [Accepted: 08/06/2019] [Indexed: 01/17/2023] Open
Abstract
The biomedical data landscape is fragmented with several isolated, heterogeneous data and knowledge sources, which use varying formats, syntaxes, schemas, and entity notations, existing on the Web. Biomedical researchers face severe logistical and technical challenges to query, integrate, analyze, and visualize data from multiple diverse sources in the context of available biomedical knowledge. Semantic Web technologies and Linked Data principles may aid Web-scale semantic processing and data integration in biomedicine. The biomedical research community has been one of the earliest adopters of these technologies and principles to publish data and knowledge on the Web as linked graphs and ontologies, hence creating the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we provide our perspective on some opportunities proffered by the use of LSLOD to integrate biomedical data and knowledge in three domains: (1) pharmacology, (2) cancer research, and (3) infectious diseases. We will discuss some of the major challenges that hinder the widespread use and consumption of LSLOD by the biomedical research community. Finally, we provide a few technical solutions and insights that can address these challenges. Eventually, LSLOD can enable the development of scalable, intelligent infrastructures that support artificial intelligence methods for augmenting human intelligence to achieve better clinical outcomes for patients, to enhance the quality of biomedical research, and to improve our understanding of living systems.
Affiliation(s)
- Maulik R. Kamdar
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Javier D. Fernández
  - Vienna University of Economics & Business, Vienna, Austria
  - Complexity Science Hub Vienna, Vienna, Austria
- Axel Polleres
  - Vienna University of Economics & Business, Vienna, Austria
  - Complexity Science Hub Vienna, Vienna, Austria
- Tania Tudorache
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Mark A. Musen
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
9
Abstract
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
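As a rough illustration of the two checks this abstract describes (testing field values against an expected type, and grouping variant spellings of a field name), the following Python sketch may help. It is not the authors' method: the records, the field names, the `EXPECTED` mapping, and the normalization rule are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample records; names and values are illustrative only.
records = [
    {"Age": "34", "is_tumor": "yes"},
    {"age": "unknown", "is tumor": "TRUE"},
    {"AGE": "12.5", "is-tumor": "n/a"},
]

BOOLEAN_VALUES = {"true", "false", "yes", "no", "0", "1"}


def normalize_field(name):
    """Lowercase and drop non-alphanumerics so spelling variants collapse."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def is_numeric(value):
    """Return True when the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_boolean(value):
    """Return True when the string is a recognized binary token."""
    return value.strip().lower() in BOOLEAN_VALUES


# Expected type per normalized field name (an assumption of this sketch).
EXPECTED = {"age": is_numeric, "istumor": is_boolean}


def audit(records):
    """Group field-name variants and list values that fail their type check."""
    variants = defaultdict(set)
    invalid = []
    for index, record in enumerate(records):
        for name, value in record.items():
            key = normalize_field(name)
            variants[key].add(name)
            check = EXPECTED.get(key)
            if check is not None and not check(value):
                invalid.append((index, name, value))
    return dict(variants), invalid


variants, invalid = audit(records)
```

Here `variants` maps each normalized name to the spellings seen in the data, and `invalid` lists the record index, field name, and value of each entry that fails its check. A real audit would need a far richer notion of name similarity than this simple normalization.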
Affiliation(s)
- Rafael S. Gonçalves
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA
- Mark A. Musen
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA
10
Martínez-Romero M, O'Connor MJ, Egyedi AL, Willrett D, Hardi J, Graybeal J, Musen MA. Using association rule mining and ontologies to generate metadata recommendations from multiple biomedical databases. Database (Oxford) 2019; 2019:baz059. [PMID: 31210270 PMCID: PMC6866600 DOI: 10.1093/database/baz059] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/09/2019] [Revised: 03/21/2019] [Accepted: 04/15/2019] [Indexed: 12/28/2022]
Abstract
Metadata, the machine-readable descriptions of the data, are increasingly seen as crucial for describing the vast array of biomedical datasets that are currently being deposited in public repositories. While most public repositories have firm requirements that metadata must accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata acquisition process is onerous and time consuming, with little interactive guidance or assistance provided to users. Secondary problems include the lack of validation and sparse use of standardized terms or ontologies when authoring metadata. There is a pressing need for improvements to the metadata acquisition process that will help users to enter metadata quickly and accurately. In this paper, we outline a recommendation system for metadata that aims to address this challenge. Our approach uses association rule mining to uncover hidden associations among metadata values and to represent them in the form of association rules. These rules are then used to present users with real-time recommendations when authoring metadata. The novelties of our method are that it is able to combine analyses of metadata from multiple repositories when generating recommendations and can enhance those recommendations by aligning them with ontology terms. We implemented our approach as a service integrated into the CEDAR Workbench metadata authoring platform, and evaluated it using metadata from two public biomedical repositories: US-based National Center for Biotechnology Information BioSample and European Bioinformatics Institute BioSamples. The results show that our approach is able to use analyses of previously entered metadata coupled with ontology-based mappings to present users with accurate recommendations when authoring metadata.
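The general idea of recommending metadata values from association rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CEDAR implementation: it mines only single-antecedent rules, the records and thresholds are hypothetical, and the ontology alignment step described in the abstract is omitted.

```python
from collections import Counter
from itertools import permutations

# Hypothetical, previously entered metadata records (illustrative values only).
records = [
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "female"},
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "male"},
    {"organism": "Homo sapiens", "tissue": "brain", "sex": "male"},
    {"organism": "Mus musculus", "tissue": "brain", "sex": "male"},
]


def mine_rules(records, min_support=0.25, min_confidence=0.6):
    """Mine rules of the form (field=value) -> (field=value)."""
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = sorted(record.items())
        item_counts.update(items)
        # Ordered pairs, so each co-occurrence yields a rule in both directions.
        pair_counts.update(permutations(items, 2))
    rules = {}
    for (antecedent, consequent), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules[(antecedent, consequent)] = (support, confidence)
    return rules


def recommend(rules, entered, target_field):
    """Rank candidate values for target_field given the fields entered so far."""
    scores = {}
    for (antecedent, (field, value)), (_, confidence) in rules.items():
        if field == target_field and antecedent in entered.items():
            scores[value] = max(scores.get(value, 0.0), confidence)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


rules = mine_rules(records)
suggestions = recommend(rules, {"tissue": "liver"}, "organism")
```

With these toy records, a user who has entered `tissue = liver` is offered `Homo sapiens` for `organism`, because every liver record in the sample has that organism. Entering `tissue = brain` yields no suggestion, since neither organism reaches the confidence threshold.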
Affiliation(s)
- Marcos Martínez-Romero
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
- Martin J O'Connor
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
- Attila L Egyedi
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
- Debra Willrett
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
- Josef Hardi
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
- John Graybeal
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
- Mark A Musen
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
11
Geller J, Keloth VK, Musen MA. How Sustainable are Biomedical Ontologies? AMIA Annu Symp Proc 2018; 2018:470-479. [PMID: 30815087 PMCID: PMC6371329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
BioPortal is widely regarded to be the world's most comprehensive repository of biomedical ontologies. With a coverage of many biomedical subfields by 716 ontologies (June 27, 2018), BioPortal is an extremely diverse repository. BioPortal maintains easily accessible information about the ontologies submitted by ontology curators. This includes size (concepts/classes, relationships/properties), number of projects, update history, and access history. Ontologies vary by size (from a few concepts to hundreds of thousands), by frequency of update/visit and by number of projects. Interestingly, some ontologies are rarely updated even though they contain thousands of concepts. In an informal email inquiry, we attempted to understand the reasons why ontologies that were built with a major investment of effort are apparently not sustained. Our analysis indicates that lack of funding, unavailability of human resources, and folding of ontologies into other ontologies are the most common among several other factors for discontinued maintenance of these ontologies.
Affiliation(s)
- James Geller
  - New Jersey Institute of Technology, Newark, New Jersey, USA
12
Bukhari SAC, O'Connor MJ, Martínez-Romero M, Egyedi AL, Willrett D, Graybeal J, Musen MA, Rubelt F, Cheung KH, Kleinstein SH. The CAIRR Pipeline for Submitting Standards-Compliant B and T Cell Receptor Repertoire Sequencing Studies to the National Center for Biotechnology Information Repositories. Front Immunol 2018; 9:1877. [PMID: 30166985 PMCID: PMC6105692 DOI: 10.3389/fimmu.2018.01877] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 06/01/2018] [Accepted: 07/30/2018] [Indexed: 11/13/2022] Open
Abstract
The adaptation of high-throughput sequencing to the B cell receptor and T cell receptor has made it possible to characterize the adaptive immune receptor repertoire (AIRR) at unprecedented depth. These AIRR sequencing (AIRR-seq) studies offer tremendous potential to increase the understanding of adaptive immune responses in vaccinology, infectious disease, autoimmunity, and cancer. The increasingly wide application of AIRR-seq is leading to a critical mass of studies being deposited in the public domain, offering the possibility of novel scientific insights through secondary analyses and meta-analyses. However, effective sharing of these large-scale data remains a challenge. The AIRR community has proposed minimal information about adaptive immune receptor repertoire (MiAIRR), a standard for reporting AIRR-seq studies. The MiAIRR standard has been operationalized using the National Center for Biotechnology Information (NCBI) repositories. Submissions of AIRR-seq data to the NCBI repositories typically use a combination of web-based and flat-file templates and include only a minimal amount of terminology validation. As a result, AIRR-seq studies at the NCBI are often described using inconsistent terminologies, limiting scientists' ability to access, find, interoperate, and reuse the data sets. In order to improve metadata quality and ease submission of AIRR-seq studies to the NCBI, we have leveraged the software framework developed by the Center for Expanded Data Annotation and Retrieval (CEDAR), which develops technologies involving the use of data standards and ontologies to improve metadata quality. The resulting CEDAR-AIRR (CAIRR) pipeline enables data submitters to: (i) create web-based templates whose entries are controlled by ontology terms, (ii) generate and validate metadata, and (iii) submit the ontology-linked metadata and sequence files (FASTQ) to the NCBI BioProject, BioSample, and Sequence Read Archive databases. 
Overall, CAIRR provides a web-based metadata submission interface that supports compliance with the MiAIRR standard. This pipeline is available at http://cairr.miairr.org, and will facilitate the NCBI submission process and improve the metadata quality of AIRR-seq studies.
Affiliation(s)
- Syed Ahmad Chan Bukhari
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT, United States
- Martin J O'Connor
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Marcos Martínez-Romero
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Attila L Egyedi
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Debra Willrett
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- John Graybeal
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Mark A Musen
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Florian Rubelt
  - Department of Microbiology and Immunology, Institute for Immunity, Transplantation and Infection, Stanford University School of Medicine, Stanford, CA, United States
- Kei-Hoi Cheung
  - Department of Emergency Medicine, Yale School of Medicine, Yale University, New Haven, CT, United States
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, United States
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
- Steven H Kleinstein
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT, United States
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
13
Bukhari SAC, Martínez-Romero M, O'Connor MJ, Egyedi AL, Willrett D, Graybeal J, Musen MA, Cheung KH, Kleinstein SH. CEDAR OnDemand: a browser extension to generate ontology-based scientific metadata. BMC Bioinformatics 2018; 19:268. [PMID: 30012108 PMCID: PMC6048706 DOI: 10.1186/s12859-018-2247-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/05/2017] [Accepted: 06/14/2018] [Indexed: 12/17/2022] Open
Abstract
Background: Public biomedical data repositories often provide web-based interfaces to collect experimental metadata. However, these interfaces typically reflect the ad hoc metadata specification practices of the associated repositories, leading to a lack of standardization in the collected metadata. This lack of standardization limits the ability of the source datasets to be broadly discovered, reused, and integrated with other datasets. To increase reuse, discoverability, and reproducibility of the described experiments, datasets should be appropriately annotated by using agreed-upon terms, ideally from ontologies or other controlled term sources. Results: This work presents “CEDAR OnDemand”, a browser extension powered by the NCBO (National Center for Biomedical Ontology) BioPortal that enables users to seamlessly enter ontology-based metadata through existing web forms native to individual repositories. CEDAR OnDemand analyzes the web page contents to identify the text input fields and associate them with relevant ontologies which are recommended automatically based upon input fields’ labels (using the NCBO ontology recommender) and a pre-defined list of ontologies. These field-specific ontologies are used for controlling metadata entry. CEDAR OnDemand works for any web form designed in the HTML format. We demonstrate how CEDAR OnDemand works through the NCBI (National Center for Biotechnology Information) BioSample web-based metadata entry. Conclusion: CEDAR OnDemand helps lower the barrier of incorporating ontologies into standardized metadata entry for public data repositories. CEDAR OnDemand is available freely on the Google Chrome store https://chrome.google.com/webstore/search/CEDAROnDemand
Affiliation(s)
- Marcos Martínez-Romero
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Martin J O'Connor
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Attila L Egyedi
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Debra Willrett
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- John Graybeal
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Kei-Hoi Cheung
- Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Emergency Medicine and Yale Center for Medical Informatics, Yale University School of Medicine, New Haven, CT, USA
- Steven H Kleinstein
- Department of Pathology, Yale School of Medicine, New Haven, CT, USA; Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
14
Kamdar MR, Musen MA. Mechanism-based Pharmacovigilance over the Life Sciences Linked Open Data Cloud. AMIA Annu Symp Proc 2018; 2017:1014-1023. [PMID: 29854169 PMCID: PMC5977627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Adverse drug reactions (ADR) result in significant morbidity and mortality in patients, and a substantial proportion of these ADRs are caused by drug-drug interactions (DDIs). Pharmacovigilance methods are used to detect unanticipated DDIs and ADRs by mining Spontaneous Reporting Systems, such as the US FDA Adverse Event Reporting System (FAERS). However, these methods do not provide mechanistic explanations for the discovered drug-ADR associations in a systematic manner. In this paper, we present a systems pharmacology-based approach to perform mechanism-based pharmacovigilance. We integrate data and knowledge from four different sources using Semantic Web Technologies and Linked Data principles to generate a systems network. We present a network-based Apriori algorithm for association mining in FAERS reports. We evaluate our method against existing pharmacovigilance methods for three different validation sets. Our method has AUROC statistics of 0.7-0.8, similar to current methods, and event-specific thresholds generate AUROC statistics greater than 0.75 for certain ADRs. Finally, we discuss the benefits of using Semantic Web technologies to attain the objectives for mechanism-based pharmacovigilance.
Affiliation(s)
- Maulik R Kamdar
- Center for Biomedical Informatics Research, Stanford University, CA, USA
- Mark A Musen
- Center for Biomedical Informatics Research, Stanford University, CA, USA
15
Martínez-Romero M, O'Connor MJ, Shankar RD, Panahiazar M, Willrett D, Egyedi AL, Gevaert O, Graybeal J, Musen MA. Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations. AMIA Annu Symp Proc 2018; 2017:1272-1281. [PMID: 29854196 PMCID: PMC5977712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.
Affiliation(s)
- Martin J O'Connor
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Ravi D Shankar
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Maryam Panahiazar
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Debra Willrett
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Attila L Egyedi
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Olivier Gevaert
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- John Graybeal
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Mark A Musen
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
16
Tomczak A, Mortensen JM, Winnenburg R, Liu C, Alessi DT, Swamy V, Vallania F, Lofgren S, Haynes W, Shah NH, Musen MA, Khatri P. Interpretation of biological experiments changes with evolution of the Gene Ontology and its annotations. Sci Rep 2018; 8:5115. [PMID: 29572502 PMCID: PMC5865181 DOI: 10.1038/s41598-018-23395-2] [Citation(s) in RCA: 65] [Impact Index Per Article: 10.8] [Received: 11/08/2017] [Accepted: 03/12/2018] [Indexed: 12/12/2022] Open
Abstract
Gene Ontology (GO) enrichment analysis is ubiquitously used for interpreting high throughput molecular data and generating hypotheses about underlying biological phenomena of experiments. However, the two building blocks of this analysis — the ontology and the annotations — evolve rapidly. We used gene signatures derived from 104 disease analyses to systematically evaluate how enrichment analysis results were affected by evolution of the GO over a decade. We found low consistency between enrichment analyses results obtained with early and more recent GO versions. Furthermore, there continues to be a strong annotation bias in the GO annotations where 58% of the annotations are for 16% of the human genes. Our analysis suggests that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time. Hence, researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.
Affiliation(s)
- Aurelie Tomczak
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA; Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Jonathan M Mortensen
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Rainer Winnenburg
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Charles Liu
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA
- Dominique T Alessi
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Varsha Swamy
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA
- Francesco Vallania
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA
- Shane Lofgren
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA
- Winston Haynes
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA
- Nigam H Shah
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Purvesh Khatri
- Stanford Institute for Immunity, Transplantation and Infection (ITI), Stanford University, Stanford, CA, 94305, USA; Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
17
Abstract
Biomedical ontologies are large: Several ontologies in the BioPortal repository contain thousands or even hundreds of thousands of entities. The development and maintenance of such large ontologies is difficult. To support ontology authors and repository developers in their work, it is crucial to improve our understanding of how these ontologies are explored, queried, reused, and used in downstream applications by biomedical researchers. We present an exploratory empirical analysis of user activities in the BioPortal ontology repository by analyzing BioPortal interaction logs across different access modes over several years. We investigate how users of BioPortal query and search for ontologies and their classes, how they explore the ontologies, and how they reuse classes from different ontologies. Additionally, through three real-world scenarios, we not only analyze the usage of ontologies for annotation tasks but also compare it to the browsing and querying behaviors of BioPortal users. For our investigation, we use several different visualization techniques. To inspect large amounts of interaction, reuse, and real-world usage data at a glance, we make use of and extend PolygOnto, a visualization method that has been successfully used to analyze reuse of ontologies in previous work. Our results show that exploration, query, reuse, and actual usage behaviors rarely align, suggesting that different users tend to explore, query and use different parts of an ontology. Finally, we highlight and discuss differences and commonalities among users of BioPortal.
Affiliation(s)
- Maulik R Kamdar
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Simon Walk
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
18
Abstract
Developers of computer-based decision-support tools frequently adopt either pattern recognition or artificial intelligence techniques as the basis for their programs. Because these developers often choose to accentuate the differences between these alternative approaches, the more fundamental similarities are frequently overlooked. The principal challenge in the creation of any clinical consultation program - regardless of the methodology that is used - lies in creating a computational model of the application domain. The difficulty in generating such a model manifests itself in symptoms that workers in the expert systems community have labeled “the knowledge-acquisition bottleneck” and “the problem of brittleness”. This paper explores these two symptoms and shows how the development of consultation programs based on pattern-recognition techniques is subject to analogous difficulties. The expert systems and pattern recognition communities must recognize that they face similar challenges, and must unite to develop methods that assist with the process of building models of complex application tasks.
19
Hadley D, Pan J, El-Sayed O, Aljabban J, Aljabban I, Azad TD, Hadied MO, Raza S, Rayikanti BA, Chen B, Paik H, Aran D, Spatz J, Himmelstein D, Panahiazar M, Bhattacharya S, Sirota M, Musen MA, Butte AJ. Precision annotation of digital samples in NCBI's gene expression omnibus. Sci Data 2017; 4:170125. [PMID: 28925997 PMCID: PMC5604135 DOI: 10.1038/sdata.2017.125] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 01/05/2017] [Accepted: 07/28/2017] [Indexed: 12/16/2022] Open
Abstract
The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample meta-data remains poorly described by unstructured free-text attributes, preventing its large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO as a web application (http://STARGEO.org) to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we target a small group of biomedical graduate students to show rapid crowd-curation of precise sample annotations across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open ‘big data’ under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.
Affiliation(s)
- Dexter Hadley
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- James Pan
- Department of Neurosurgery, Stanford University School of Medicine, Stanford, California 94305, USA
- Osama El-Sayed
- University of Illinois College of Medicine, Chicago, Illinois 60612, USA
- Jihad Aljabban
- Harvard Medical School Department of Immunology, Harvard University, Boston, Massachusetts 02115, USA
- Imad Aljabban
- Harvard Medical School Department of Immunology, Harvard University, Boston, Massachusetts 02115, USA
- Tej D Azad
- Department of Neurosurgery, Stanford University School of Medicine, Stanford, California 94305, USA
- Mohamad O Hadied
- Wayne State University School of Medicine, Detroit, Michigan 48201, USA
- Shuaib Raza
- Yale School of Medicine, Yale University, New Haven, Connecticut 06519, USA
- Bin Chen
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Hyojung Paik
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Dvir Aran
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Jordan Spatz
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Daniel Himmelstein
- Program in Biological & Medical Informatics, University of California, San Francisco, CA 94158, USA
- Maryam Panahiazar
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Sanchita Bhattacharya
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Marina Sirota
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California 94305, USA
- Atul J Butte
- Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA
20
Gonçalves RS, Tu SW, Nyulas CI, Tierney MJ, Musen MA. An ontology-driven tool for structured data acquisition using Web forms. J Biomed Semantics 2017; 8:26. [PMID: 28764813 PMCID: PMC5540339 DOI: 10.1186/s13326-017-0133-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 02/09/2016] [Accepted: 06/26/2017] [Indexed: 11/13/2022] Open
Abstract
Background Structured data acquisition is a common task that is widely performed in biomedicine. However, current solutions for this task are far from providing a means to structure data in such a way that it can be automatically employed in decision making (e.g., in our example application domain of clinical functional assessment, for determining eligibility for disability benefits) based on conclusions derived from acquired data (e.g., assessment of impaired motor function). To use data in these settings, we need it structured in a way that can be exploited by automated reasoning systems, for instance, in the Web Ontology Language (OWL), the de facto ontology language for the Web. Results We tackle the problem of generating Web-based assessment forms from OWL ontologies, and aggregating input gathered through these forms as an ontology of “semantically-enriched” form data that can be queried using an RDF query language, such as SPARQL. We developed an ontology-based structured data acquisition system, which we present through its specific application to the clinical functional assessment domain. We found that data gathered through our system is highly amenable to automatic analysis using queries. Conclusions We demonstrated how ontologies can be used to help structure Web-based forms and to semantically enrich the data elements of the acquired structured data. The ontologies associated with the enriched data elements enable automated inferences and provide a rich vocabulary for performing queries.
Affiliation(s)
- Rafael S Gonçalves
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Samson W Tu
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Csongor I Nyulas
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Michael J Tierney
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
21
Tso GJ, Tu SW, Musen MA, Goldstein MK. High-Risk Drug-Drug Interactions Between Clinical Practice Guidelines for Management of Chronic Conditions. AMIA Jt Summits Transl Sci Proc 2017; 2017:531-539. [PMID: 28815153 PMCID: PMC5543385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Abstract
Clinicians and clinical decision-support systems often follow pharmacotherapy recommendations for patients based on clinical practice guidelines (CPGs). In multimorbid patients, these recommendations can potentially have clinically significant drug-drug interactions (DDIs). In this study, we describe and validate a method for programmatically detecting DDIs among CPG recommendations. The system extracts pharmacotherapy intervention recommendations from narrative CPGs, normalizes the terms, creates a mapping of drugs and drug classes, and then identifies occurrences of DDIs between CPG pairs. We used this system to analyze 75 CPGs written by authoring entities in the United States that discuss outpatient management of common chronic diseases. Using a reference list of high-risk DDIs, we identified 2198 of these DDIs in 638 CPG pairs (46 unique CPGs). Only 9 high-risk DDIs were discussed by both CPGs in a pairing. In 69 of the pairings, neither CPG had a pharmacologic reference or a warning of the possibility of a DDI.
Affiliation(s)
- Geoffrey J. Tso
- VA Palo Alto Health Care System, Palo Alto, CA; Stanford University, Stanford, CA
- Mary K. Goldstein
- VA Palo Alto Health Care System, Palo Alto, CA; Stanford University, Stanford, CA
22
Martínez-Romero M, Jonquet C, O'Connor MJ, Graybeal J, Pazos A, Musen MA. NCBO Ontology Recommender 2.0: an enhanced approach for biomedical ontology recommendation. J Biomed Semantics 2017; 8:21. [PMID: 28592275 PMCID: PMC5463318 DOI: 10.1186/s13326-017-0128-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 10/28/2016] [Accepted: 04/13/2017] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND Ontologies and controlled terminologies have become increasingly important in biomedical research. Researchers use ontologies to annotate their data with ontology terms, enabling better data integration and interoperability across disparate datasets. However, the number, variety and complexity of current biomedical ontologies make it cumbersome for researchers to determine which ones to reuse for their specific needs. To overcome this problem, in 2010 the National Center for Biomedical Ontology (NCBO) released the Ontology Recommender, which is a service that receives a biomedical text corpus or a list of keywords and suggests ontologies appropriate for referencing the indicated terms. METHODS We developed a new version of the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a novel recommendation approach that evaluates the relevance of an ontology to biomedical text data according to four different criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. RESULTS Our evaluation shows that the enhanced recommender provides higher quality suggestions than the original approach, providing better coverage of the input data, more detailed information about their concepts, increased specialization for the domain of the input data, and greater acceptance and use in the community. In addition, it provides users with more explanatory information, along with suggestions of not only individual ontologies but also groups of ontologies to use together. It also can be customized to fit the needs of different ontology recommendation scenarios. CONCLUSIONS Ontology Recommender 2.0 suggests relevant ontologies for annotating biomedical text data. It combines the strengths of its predecessor with a range of adjustments and new features that improve its reliability and usefulness. Ontology Recommender 2.0 recommends over 500 biomedical ontologies from the NCBO BioPortal platform, where it is openly available (both via the user interface at http://bioportal.bioontology.org/recommender, and via a Web service API).
Affiliation(s)
- Marcos Martínez-Romero
- Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA, 94305-5479, USA
- Clement Jonquet
- Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA, 94305-5479, USA; Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), University of Montpellier, 161 rue Ada, 34095, Montpellier, Cdx 5, France
- Martin J O'Connor
- Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA, 94305-5479, USA
- John Graybeal
- Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA, 94305-5479, USA
- Alejandro Pazos
- Department of Information and Communication Technologies, Computer Science Building, Elviña Campus, University of A Coruña, 15071, A Coruña, Spain
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford University School of Medicine, Stanford, CA, 94305-5479, USA
23
Abstract
Integrated approaches for pharmacology are required for the mechanism-based predictions of adverse drug reactions that manifest due to concomitant intake of multiple drugs. These approaches require the integration and analysis of biomedical data and knowledge from multiple, heterogeneous sources with varying schemas, entity notations, and formats. To tackle these integrative challenges, the Semantic Web community has published and linked several datasets in the Life Sciences Linked Open Data (LSLOD) cloud using established W3C standards. We present the PhLeGrA platform for Linked Graph Analytics in Pharmacology in this paper. Through query federation, we integrate four sources from the LSLOD cloud and extract a drug-reaction network, composed of distinct entities. We represent this graph as a hidden conditional random field (HCRF), a discriminative latent variable model that is used for structured output predictions. We calculate the underlying probability distributions in the drug-reaction HCRF using the datasets from the U.S. Food and Drug Administration's Adverse Event Reporting System. We predict the occurrence of 146 adverse reactions due to multiple drug intake with an AUROC statistic greater than 0.75. The PhLeGrA platform can be extended to incorporate other sources published using Semantic Web technologies, as well as to discover other types of pharmacological associations.
Affiliation(s)
- Maulik R Kamdar
- Center for Biomedical Informatics Research, Stanford University, USA
- Mark A Musen
- Center for Biomedical Informatics Research, Stanford University, USA
24
Lou Y, Tu SW, Nyulas C, Tudorache T, Chalmers RJG, Musen MA. Use of ontology structure and Bayesian models to aid the crowdsourcing of ICD-11 sanctioning rules. J Biomed Inform 2017; 68:20-34. [PMID: 28192233 DOI: 10.1016/j.jbi.2017.02.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 08/17/2016] [Revised: 02/02/2017] [Accepted: 02/08/2017] [Indexed: 11/18/2022]
Abstract
The International Classification of Diseases (ICD) is the de facto standard international classification for mortality reporting and for many epidemiological, clinical, and financial use cases. The next version of ICD, ICD-11, will be submitted for approval by the World Health Assembly in 2018. Unlike previous versions of ICD, where coders mostly select single codes from pre-enumerated disease and disorder codes, ICD-11 coding will allow extensive use of multiple codes to give more detailed disease descriptions. For example, "severe malignant neoplasms of left breast" may be coded using the combination of a "stem code" (e.g., code for malignant neoplasms of breast) with a variety of "extension codes" (e.g., codes for laterality and severity). The use of multiple codes (a process called post-coordination), while avoiding the pitfall of having to pre-enumerate vast number of possible disease and qualifier combinations, risks the creation of meaningless expressions that combine stem codes with inappropriate qualifiers. To prevent that from happening, "sanctioning rules" that define legal combinations are necessary. In this work, we developed a crowdsourcing method for obtaining sanctioning rules for the post-coordination of concepts in ICD-11. Our method utilized the hierarchical structures in the domain to improve the accuracy of the sanctioning rules and to lower the crowdsourcing cost. We used Bayesian networks to model crowd workers' skills, the accuracy of their responses, and our confidence in the acquired sanctioning rules. We applied reinforcement learning to develop an agent that constantly adjusted the confidence cutoffs during the crowdsourcing process to maximize the overall quality of sanctioning rules under a fixed budget. Finally, we performed formative evaluations using a skin-disease branch of the draft ICD-11 and demonstrated that the crowd-sourced sanctioning rules replicated those defined by an expert dermatologist with high precision and recall. This work demonstrated that a crowdsourcing approach could offer a reasonably efficient method for generating a first draft of sanctioning rules that subject matter experts could verify and edit, thus relieving them of the tedium and cost of formulating the initial set of rules.
Affiliation(s)
- Yun Lou
- Stanford University, Stanford, CA, USA
25
Leung TI, Goldstein MK, Musen MA, Cronkite R, Chen JH, Gottlieb A, Leitersdorf E. The New HIT: Human Health Information Technology. Stud Health Technol Inform 2017; 245:768-772. [PMID: 29295202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Humanism in medicine is defined as health care providers' attitudes and actions that demonstrate respect for patients' values and concerns in relation to their social, psychological and spiritual life domains. Specifically, humanistic clinical medicine involves showing respect for the patient, building a personal connection, and eliciting and addressing a patient's emotional response to illness. Health information technology (IT) often interferes with humanistic clinical practice, potentially disabling these core aspects of the therapeutic patient-physician relationship. Health IT has evolved rapidly in recent years - and the imperative to maintain humanism in practice has never been greater. In this vision paper, we aim to discuss why preserving humanism is imperative in the design and implementation of health IT systems.
Affiliation(s)
- Tiffany I Leung
- Faculty of Health, Medicine and Life Sciences, Maastricht University, Maastricht, The Netherlands
- Mary K Goldstein
- Department of Veterans Affairs, VA Palo Alto Health Care System, Palo Alto, CA, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Ruth Cronkite
- Department of Veterans Affairs, VA Palo Alto Health Care System, Palo Alto, CA, USA
- Jonathan H Chen
- Division of General Medical Disciplines, Stanford University, Stanford, CA, USA
- Assaf Gottlieb
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
- Eran Leitersdorf
- Center for Research, Prevention, and Treatment of Atherosclerosis Internal Medicine Department, Hadassah Hebrew University Medical Center, Jerusalem, Israel
26
Abstract
Reusing ontologies and their terms is a principle and best practice that most ontology development methodologies strongly encourage. Reuse comes with the promise to support the semantic interoperability and to reduce engineering costs. In this paper, we present a descriptive study of the current extent of term reuse and overlap among biomedical ontologies. We use the corpus of biomedical ontologies stored in the BioPortal repository, and analyze different types of reuse and overlap constructs. While we find an approximate term overlap between 25-31%, the term reuse is only <9%, with most ontologies reusing fewer than 5% of their terms from a small set of popular ontologies. Clustering analysis shows that the terms reused by a common set of ontologies have >90% semantic similarity, hinting that ontology developers tend to reuse terms that are sibling or parent-child nodes. We validate this finding by analysing the logs generated from a Protégé plugin that enables developers to reuse terms from BioPortal. We find most reuse constructs were 2-level subtrees on the higher levels of the class hierarchy. We developed a Web application that visualizes reuse dependencies and overlap among ontologies, and that proposes similar terms from BioPortal for a term of interest. We also identified a set of error patterns that indicate that ontology developers did intend to reuse terms from other ontologies, but that they were using different and sometimes incorrect representations. Our results stipulate the need for semi-automated tools that augment term reuse in the ontology engineering process through personalized recommendations.
Affiliation(s)
- Maulik R. Kamdar
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University
- Mark A. Musen
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University
|
27
|
Ochs C, Geller J, Perl Y, Musen MA. A unified software framework for deriving, visualizing, and exploring abstraction networks for ontologies. J Biomed Inform 2016; 62:90-105. [PMID: 27345947 DOI: 10.1016/j.jbi.2016.06.008] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 04/07/2016] [Revised: 06/02/2016] [Accepted: 06/22/2016] [Indexed: 11/27/2022]
Abstract
Software tools play a critical role in the development and maintenance of biomedical ontologies. One important task that is difficult without software tools is ontology quality assurance. In previous work, we have introduced different kinds of abstraction networks to provide a theoretical foundation for ontology quality assurance tools. Abstraction networks summarize the structure and content of ontologies. One kind of abstraction network that we have used repeatedly to support ontology quality assurance is the partial-area taxonomy. It summarizes structurally and semantically similar concepts within an ontology. However, the use of partial-area taxonomies was ad hoc and not generalizable. In this paper, we describe the Ontology Abstraction Framework (OAF), a unified framework and software system for deriving, visualizing, and exploring partial-area taxonomy abstraction networks. The OAF includes support for various ontology representations (e.g., OWL and SNOMED CT's relational format). A Protégé plugin for deriving "live partial-area taxonomies" is demonstrated.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- James Geller
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA
|
28
|
Tu SW, Nyulas CI, Tudorache T, Musen MA. A Method to Compare ICF and SNOMED CT for Coverage of U.S. Social Security Administration's Disability Listing Criteria. AMIA Annu Symp Proc 2015; 2015:1224-1233. [PMID: 26958262 PMCID: PMC4765666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
We developed a method to evaluate the extent to which the International Classification of Functioning, Disability and Health (ICF) and SNOMED CT cover concepts used in the disability listing criteria of the U.S. Social Security Administration's "Blue Book." First, we decomposed the criteria into their constituent concepts and relationships. We defined different types of mappings and manually mapped the recognized concepts and relationships to either ICF or SNOMED CT. We defined various metrics for measuring the coverage of each terminology, taking into account the effects of inexact matches and frequency of occurrence. We validated our method by mapping the terms in the disability criteria of Adult Listings, Chapter 12 (Mental Disorders). SNOMED CT dominates ICF in almost all the metrics that we have computed. The method is applicable for determining any terminology's coverage of eligibility criteria.
|
29
|
Musen MA, Bean CA, Cheung KH, Dumontier M, Durante KA, Gevaert O, Gonzalez-Beltran A, Khatri P, Kleinstein SH, O'Connor MJ, Pouliot Y, Rocca-Serra P, Sansone SA, Wiser JA. The center for expanded data annotation and retrieval. J Am Med Inform Assoc 2015; 22:1148-52. [PMID: 26112029 PMCID: PMC5009916 DOI: 10.1093/jamia/ocv048] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 02/16/2015] [Revised: 04/07/2015] [Accepted: 04/18/2015] [Indexed: 12/22/2022] Open
Abstract
The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. We take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments.
Affiliation(s)
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Carol A Bean
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Kei-Hoi Cheung
- Interdepartmental Program in Computational Biology and Bioinformatics, Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT USA
- Michel Dumontier
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Kim A Durante
- Stanford University Libraries, Stanford University, Stanford, CA USA
- Olivier Gevaert
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Purvesh Khatri
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA; Stanford Institute for Immunity, Transplantation, and Infection, Stanford, CA USA
- Steven H Kleinstein
- Interdepartmental Program in Computational Biology and Bioinformatics, Departments of Pathology and Immunobiology, Yale University School of Medicine, New Haven, CT USA
- Martin J O'Connor
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
- Yannick Pouliot
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
|
30
|
Lamprecht D, Strohmaier M, Helic D, Nyulas C, Tudorache T, Noy NF, Musen MA. Using ontologies to model human navigation behavior in information networks: A study based on Wikipedia. Semant Web 2015; 6:403-422. [PMID: 26568745 PMCID: PMC4643321 DOI: 10.3233/sw-140143] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
The need to examine the behavior of different user groups is a fundamental requirement when building information systems. In this paper, we present Ontology-based Decentralized Search (OBDS), a novel method to model the navigation behavior of users equipped with different types of background knowledge. Ontology-based Decentralized Search combines decentralized search, an established method for navigation in social networks, and ontologies to model navigation behavior in information networks. The method uses ontologies as an explicit representation of background knowledge to inform the navigation process and guide it towards navigation targets. By using different ontologies, users equipped with different types of background knowledge can be represented. We demonstrate our method using four biomedical ontologies and their associated Wikipedia articles. We compare our simulation results with base line approaches and with results obtained from a user study. We find that our method produces click paths that have properties similar to those originating from human navigators. The results suggest that our method can be used to model human navigation behavior in systems that are based on information networks, such as Wikipedia. This paper makes the following contributions: (i) To the best of our knowledge, this is the first work to demonstrate the utility of ontologies in modeling human navigation and (ii) it yields new insights and understanding about the mechanisms of human navigation in information networks.
Affiliation(s)
- Daniel Lamprecht
- Knowledge Management Institute, Graz University of Technology, Austria
- Markus Strohmaier
- Knowledge Management Institute, Graz University of Technology, Austria
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Denis Helic
- Knowledge Management Institute, Graz University of Technology, Austria
- Csongor Nyulas
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Natalya F. Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Mark A. Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
|
31
|
Kamdar MR, Tudorache T, Musen MA. Investigating Term Reuse and Overlap in Biomedical Ontologies. CEUR Workshop Proc 2015; 1515:http://ceur-ws.org/Vol-1515/regular9.pdf. [PMID: 29636656 PMCID: PMC5889951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
We investigate the current extent of term reuse and overlap among biomedical ontologies. We use the corpus of biomedical ontologies stored in the BioPortal repository, and analyze three types of reuse constructs: (a) explicit term reuse, (b) xref reuse, and (c) Concept Unique Identifier (CUI) reuse. While there is a term label similarity of approximately 14.4% of the total terms, we observed that most ontologies reuse considerably fewer than 5% of their terms from a concise set of a few core ontologies. We developed an interactive visualization to explore reuse dependencies among biomedical ontologies. Moreover, we identified a set of patterns that indicate ontology developers did intend to reuse terms from other ontologies, but they were using different and sometimes incorrect representations. Our results suggest the value of semi-automated tools that augment term reuse in the ontology engineering process through personalized recommendations.
Affiliation(s)
- Maulik R Kamdar
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University
|
32
|
|
33
|
Affiliation(s)
- Vincent Liu
- Kaiser Permanente Division of Research, Oakland, California
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford, California
- Timothy Chou
- Department of Computer Science, Stanford University, Stanford, California
|
34
|
Mortensen JM, Musen MA, Noy NF. An empirically derived taxonomy of errors in SNOMED CT. AMIA Annu Symp Proc 2014; 2014:899-906. [PMID: 25954397 PMCID: PMC4419962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Ontologies underpin methods throughout biomedicine and biomedical informatics. However, as ontologies increase in size and complexity, so does the likelihood that they contain errors. Effective methods that identify errors are typically manual and expert-driven; however, automated methods are essential for the size of modern biomedical ontologies. The effect of ontology errors on their application is unclear, creating a challenge in differentiating salient, relevant errors from those that have no discernible effect. As a first step in understanding the challenge of identifying salient, common errors at a large scale, we asked 5 experts to verify a random subset of complex relations in the SNOMED CT CORE Problem List Subset. The experts found 39 errors that followed several common patterns. Initially, the experts disagreed about errors almost entirely, indicating that ontology verification is very difficult and requires many eyes on the task. It is clear that additional empirically-based, application-focused ontology verification method development is necessary. Toward that end, we developed a taxonomy that can serve as a checklist to consult during ontology quality assurance.
Affiliation(s)
- Jonathan M Mortensen
- Stanford Center for Biomedical Informatics Research Stanford University, Stanford, CA 94305-5479 U.S.A
- Mark A Musen
- Stanford Center for Biomedical Informatics Research Stanford University, Stanford, CA 94305-5479 U.S.A
- Natalya F Noy
- Stanford Center for Biomedical Informatics Research Stanford University, Stanford, CA 94305-5479 U.S.A
|
35
|
Mortensen JM, Minty EP, Januszyk M, Sweeney TE, Rector AL, Noy NF, Musen MA. Using the wisdom of the crowds to find critical errors in biomedical ontologies: a study of SNOMED CT. J Am Med Inform Assoc 2014; 22:640-8. [PMID: 25342179 DOI: 10.1136/amiajnl-2014-002901] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 04/23/2014] [Accepted: 09/15/2014] [Indexed: 01/08/2023] Open
Abstract
OBJECTIVES: The verification of biomedical ontologies is an arduous process that typically involves peer review by subject-matter experts. This work evaluated the ability of crowdsourcing methods to detect errors in SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) and to address the challenges of scalable ontology verification. METHODS: We developed a methodology to crowdsource ontology verification that uses micro-tasking combined with a Bayesian classifier. We then conducted a prospective study in which both the crowd and domain experts verified a subset of SNOMED CT comprising 200 taxonomic relationships. RESULTS: The crowd identified errors as well as any single expert at about one-quarter of the cost. The inter-rater agreement (κ) between the crowd and the experts was 0.58; the inter-rater agreement between experts themselves was 0.59, suggesting that the crowd is nearly indistinguishable from any one expert. Furthermore, the crowd identified 39 previously undiscovered, critical errors in SNOMED CT (eg, 'septic shock is a soft-tissue infection'). DISCUSSION: The results show that the crowd can indeed identify errors in SNOMED CT that experts also find, and the results suggest that our method will likely perform well on similar ontologies. The crowd may be particularly useful in situations where an expert is unavailable, budget is limited, or an ontology is too large for manual error checking. Finally, our results suggest that the online anonymous crowd could successfully complete other domain-specific tasks. CONCLUSIONS: We have demonstrated that the crowd can address the challenges of scalable ontology verification, completing not only intuitive, common-sense tasks, but also expert-level, knowledge-intensive tasks.
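The inter-rater agreement reported in this abstract can be illustrated with a small sketch. The abstract does not say which κ statistic was used, so the following assumes Cohen's kappa for two raters. The function name `cohen_kappa` and the labels are hypothetical toy data invented for illustration, not results from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (assumption): True means "this relationship is judged correct".
crowd = [True, True, False, True, False, False, True, True, False, True]
expert = [True, True, False, True, False, True, True, True, False, True]
kappa = cohen_kappa(crowd, expert)
print(f"kappa = {kappa:.2f}")
```

Kappa discounts the agreement that two raters would reach by chance, which is why it is a stricter measure than the raw fraction of matching labels.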
Affiliation(s)
- Jonathan M Mortensen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA; Biomedical Informatics Training Program, Stanford University, Stanford, California, USA
- Evan P Minty
- Biomedical Informatics Training Program, Stanford University, Stanford, California, USA; Faculty of Medicine, University of Calgary, Calgary, Canada
- Michael Januszyk
- Biomedical Informatics Training Program, Stanford University, Stanford, California, USA
- Timothy E Sweeney
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Alan L Rector
- School of Computer Science, University of Manchester, Manchester, UK
- Natalya F Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA; Google Inc., Mountain View, California, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA; Biomedical Informatics Training Program, Stanford University, Stanford, California, USA
|
36
|
Walk S, Singer P, Strohmaier M, Tudorache T, Musen MA, Noy NF. Discovering beaten paths in collaborative ontology-engineering projects using Markov chains. J Biomed Inform 2014; 51:254-71. [PMID: 24953242 PMCID: PMC4194274 DOI: 10.1016/j.jbi.2014.06.004] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/11/2014] [Revised: 06/04/2014] [Accepted: 06/07/2014] [Indexed: 11/26/2022]
Abstract
Biomedical taxonomies, thesauri and ontologies in the form of the International Classification of Diseases as a taxonomy or the National Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the International Classification of Diseases, which is currently under active development by the World Health Organization contains nearly 50,000 classes representing a vast variety of different diseases and causes of death. This evolution in terms of size was accompanied by an evolution in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners and other stakeholders. Understanding the way these different stakeholders collaborate will enable us to improve editing environments that support such collaborations. In this paper, we uncover how large ontology-engineering projects, such as the International Classification of Diseases in its 11th revision, unfold by analyzing usage logs of five different biomedical ontology-engineering projects of varying sizes and scopes using Markov chains. We discover intriguing interaction patterns (e.g., which properties users frequently change after specific given ones) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. 
From our analysis, we identify commonalities and differences between different projects that have implications for project managers, ontology editors, developers and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.
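The kind of analysis this abstract describes, finding which properties users tend to change after a given one, can be sketched as a first-order Markov chain estimated from sequences of edits. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `transition_probabilities`, the session format, and the property names are all hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Count every consecutive (current, next) pair within one session.
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: c / total for nxt, c in followers.items()}
    return probs


# Hypothetical change-log sessions: each list is the ordered sequence of
# properties one user edited (names are illustrative only).
sessions = [
    ["title", "definition", "synonym"],
    ["title", "definition", "parent"],
    ["title", "synonym"],
]
probs = transition_probabilities(sessions)
for current, followers in probs.items():
    print(current, "->", followers)
```

Each row of the resulting table is a probability distribution over the next edit, so frequently travelled transitions stand out as the largest entries in a row.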
Affiliation(s)
- Simon Walk
- Institute for Information Systems and Computer Media, Graz University of Technology, Austria
- Philipp Singer
- GESIS - Leibniz-Institute for the Social Sciences, Cologne, Germany
- Markus Strohmaier
- GESIS - Leibniz-Institute for the Social Sciences, Cologne, Germany; Dept. of Computer Science, University of Koblenz-Landau, Germany
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
- Natalya F Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, USA
|
37
|
Horridge M, Tudorache T, Nuylas C, Vendetti J, Noy NF, Musen MA. WebProtégé: a collaborative Web-based platform for editing biomedical ontologies. Bioinformatics 2014; 30:2384-5. [PMID: 24771560 DOI: 10.1093/bioinformatics/btu256] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022] Open
Abstract
WebProtégé is an open-source Web application for editing OWL 2 ontologies. It contains several features to aid collaboration, including support for the discussion of issues, change notification and revision-based change tracking. WebProtégé also features a simple user interface, which is geared towards editing the kinds of class descriptions and annotations that are prevalent throughout biomedical ontologies. Moreover, it is possible to configure the user interface using views that are optimized for editing Open Biomedical Ontology (OBO) class descriptions and metadata. Some of these views are shown in the Supplementary Material and can be seen in WebProtégé itself by configuring the project as an OBO project. AVAILABILITY AND IMPLEMENTATION: WebProtégé is freely available for use on the Web at http://webprotege.stanford.edu. It is implemented in Java and JavaScript using the OWL API and the Google Web Toolkit. All major browsers are supported. For users who do not wish to host their ontologies on the Stanford servers, WebProtégé is available as a Web app that can be run locally using a Servlet container such as Tomcat. Binaries, source code and documentation are available under an open-source license at http://protegewiki.stanford.edu/wiki/WebProtege.
Affiliation(s)
- Matthew Horridge
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
- Csongor Nuylas
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
- Jennifer Vendetti
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
- Natalya F Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
|
38
|
Walk S, Pöschko J, Strohmaier M, Andrews K, Tudorache T, Noy NF, Nyulas C, Musen MA. PragmatiX: An Interactive Tool for Visualizing the Creation Process Behind Collaboratively Engineered Ontologies. INT J SEMANT WEB INF 2014; 9:45-78. [PMID: 24465189 DOI: 10.4018/jswis.2013010103] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
Abstract
With the emergence of tools for collaborative ontology engineering, more and more data about the creation process behind collaborative construction of ontologies is becoming available. Today, collaborative ontology engineering tools such as Collaborative Protégé offer rich and structured logs of changes, thereby opening up new challenges and opportunities to study and analyze the creation of collaboratively constructed ontologies. While there exists a plethora of visualization tools for ontologies, they have primarily been built to visualize aspects of the final product (the ontology) and not the collaborative processes behind construction (e.g. the changes made by contributors over time). To the best of our knowledge, there exists no ontology visualization tool today that focuses primarily on visualizing the history behind collaboratively constructed ontologies. Since the ontology engineering processes can influence the quality of the final ontology, we believe that visualizing process data represents an important stepping-stone towards better understanding of managing the collaborative construction of ontologies in the future. In this application paper, we present a tool - PragmatiX - which taps into structured change logs provided by tools such as Collaborative Protégé to visualize various pragmatic aspects of collaborative ontology engineering. The tool is aimed at managers and leaders of collaborative ontology engineering projects to help them in monitoring progress, in exploring issues and problems, and in tracking quality-related issues such as overrides and coordination among contributors. 
The paper makes the following contributions: (i) we present PragmatiX, a tool for visualizing the creation process behind collaboratively constructed ontologies (ii) we illustrate the functionality and generality of the tool by applying it to structured logs of changes of two large collaborative ontology-engineering projects and (iii) we conduct a heuristic evaluation of the tool with domain experts to uncover early design challenges and opportunities for improvement. Finally, we hope that this work sparks a new line of research on visualization tools for collaborative ontology engineering projects.
Affiliation(s)
- Simon Walk
- Knowledge Management Institute, Graz University of Technology, Inffeldgasse 21a/II, 8010 Graz
- Jan Pöschko
- Knowledge Management Institute, Graz University of Technology, Inffeldgasse 21a/II, 8010 Graz
- Markus Strohmaier
- Knowledge Management Institute, Graz University of Technology, Inffeldgasse 21a/II, 8010 Graz
- Keith Andrews
- Institute for Information Systems and Computer Media, Graz University of Technology, Inffeldgasse 16c, 8010 Graz
- Tania Tudorache
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305-5479, USA
- Natalya F Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305-5479, USA
- Csongor Nyulas
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305-5479, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305-5479, USA
|
39
|
Mortensen JM, Musen MA, Noy NF. Crowdsourcing the verification of relationships in biomedical ontologies. AMIA Annu Symp Proc 2013; 2013:1020-1029. [PMID: 24551391 PMCID: PMC3900126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Biomedical ontologies are often large and complex, making ontology development and maintenance a challenge. To address this challenge, scientists use automated techniques to alleviate the difficulty of ontology development. However, for many ontology-engineering tasks, human judgment is still necessary. Microtask crowdsourcing, wherein human workers receive remuneration to complete simple, short tasks, is one method to obtain contributions by humans at a large scale. Previously, we developed and refined an effective method to verify ontology hierarchy using microtask crowdsourcing. In this work, we report on applying this method to find errors in the SNOMED CT CORE subset. By using crowdsourcing via Amazon Mechanical Turk with a Bayesian inference model, we correctly verified 86% of the relations from the CORE subset of SNOMED CT in which Rector and colleagues previously identified errors via manual inspection. Our results demonstrate that an ontology developer could deploy this method in order to audit large-scale ontologies quickly and relatively cheaply.
Affiliation(s)
- Jonathan M Mortensen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305-5479 U.S.A
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305-5479 U.S.A
- Natalya F Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305-5479 U.S.A
|
40
|
Strohmaier M, Walk S, Pöschko J, Lamprecht D, Tudorache T, Nyulas C, Musen MA, Noy NF. How Ontologies are Made: Studying the Hidden Social Dynamics Behind Collaborative Ontology Engineering Projects. Web Semant 2013; 20:10.1016/j.websem.2013.04.001. [PMID: 24311994 PMCID: PMC3845806 DOI: 10.1016/j.websem.2013.04.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 05/28/2023]
Abstract
Traditionally, evaluation methods in the field of semantic technologies have focused on the end result of ontology engineering efforts, mainly, on evaluating ontologies and their corresponding qualities and characteristics. This focus has led to the development of a whole arsenal of ontology-evaluation techniques that investigate the quality of ontologies as a product. In this paper, we aim to shed light on the process of ontology engineering construction by introducing and applying a set of measures to analyze hidden social dynamics. We argue that especially for ontologies which are constructed collaboratively, understanding the social processes that have led to its construction is critical not only in understanding but consequently also in evaluating the ontology. With the work presented in this paper, we aim to expose the texture of collaborative ontology engineering processes that is otherwise left invisible. Using historical change-log data, we unveil qualitative differences and commonalities between different collaborative ontology engineering projects. Explaining and understanding these differences will help us to better comprehend the role and importance of social factors in collaborative ontology engineering projects. We hope that our analysis will spur a new line of evaluation techniques that view ontologies not as the static result of deliberations among domain experts, but as a dynamic, collaborative and iterative process that needs to be understood, evaluated and managed in itself. We believe that advances in this direction would help our community to expand the existing arsenal of ontology evaluation techniques towards more holistic approaches.
Affiliation(s)
- Markus Strohmaier
- Knowledge Management Institute, Graz University of Technology, Austria; Stanford Center for Biomedical Informatics Research, Stanford University, USA
41
Salvadores M, Alexander PR, Musen MA, Noy NF. BioPortal as a Dataset of Linked Biomedical Ontologies and Terminologies in RDF. Semant Web 2013; 4:277-284. [PMID: 25214827 PMCID: PMC4159173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
BioPortal is a repository of biomedical ontologies, the largest such repository, with more than 300 ontologies to date. This set includes ontologies that were developed in OWL, OBO and other formats, as well as a large number of medical terminologies that the US National Library of Medicine distributes in its own proprietary format. We have published the RDF version of all these ontologies at http://sparql.bioontology.org. This dataset contains 190M triples, representing both metadata and content for the 300 ontologies. We use the metadata that the ontology authors provide and simple RDFS reasoning in order to provide dataset users with uniform access to key properties of the ontologies, such as lexical properties for the class names and provenance data. The dataset also contains 9.8M cross-ontology mappings of different types, generated both manually and automatically, which come with their own metadata.
Affiliation(s)
- Manuel Salvadores
- Stanford Center for Biomedical Informatics Research, Stanford University, US
- Paul R. Alexander
- Stanford Center for Biomedical Informatics Research, Stanford University, US
- Mark A. Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, US
- Natalya F. Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, US
42
Abstract
In this paper, we present WebProtégé, a lightweight ontology editor and knowledge acquisition tool for the Web. With the wide adoption of Web 2.0 platforms and the gradual adoption of ontologies and Semantic Web technologies in the real world, we need ontology-development tools that are better suited for the novel ways of interacting, constructing and consuming knowledge. Users today take Web-based content creation and online collaboration for granted. WebProtégé integrates these features as part of the ontology development process itself. We tried to lower the entry barrier to ontology development by providing a tool that is accessible from any Web browser, has extensive support for collaboration, and a highly customizable and pluggable user interface that can be adapted to any level of user expertise. The declarative user interface enabled us to create custom knowledge-acquisition forms tailored for domain experts. We built WebProtégé using the existing Protégé infrastructure, which supports collaboration on the back end side, and the Google Web Toolkit for the front end. The generic and extensible infrastructure allowed us to easily deploy WebProtégé in production settings for several projects. We present the main features of WebProtégé and its architecture and describe briefly some of its uses for real-world projects. WebProtégé is free and open source. An online demo is available at http://webprotege.stanford.edu.
43
Abstract
Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of “significant genes.” One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is widely used to make sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list, and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask “Which biological process is over-represented in my set of interesting genes or proteins?” we can also ask “Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?” For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases (blood coagulation disorders) that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO.
In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis, the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.
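The over-representation question posed above can be illustrated with a minimal Python sketch. It assumes a one-sided hypergeometric test, which is one common choice for enrichment analysis but is not stated in the abstract; the function name `enrichment_p_value`, the gene identifiers, and the counts are all hypothetical and serve only as illustration.

```python
from math import comb


def enrichment_p_value(study_genes, term_genes, background_genes):
    """One-sided hypergeometric p-value that an ontology term is
    over-represented in the study set (illustrative sketch only)."""
    background = set(background_genes)
    study = set(study_genes) & background        # genes deemed significant
    annotated = set(term_genes) & background     # genes annotated with the term
    N = len(background)                          # population size
    K = len(annotated)                           # annotated genes in population
    n = len(study)                               # sample size
    k = len(study & annotated)                   # annotated genes in sample
    # P(X >= k) for X ~ Hypergeometric(N, K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical toy data: 20 background genes, 5 annotated with the term,
# 4 genes in the study set, 3 of which carry the annotation.
background = [f"g{i}" for i in range(20)]
term_genes = ["g0", "g1", "g2", "g3", "g4"]
study_genes = ["g0", "g1", "g2", "g10"]

p = enrichment_p_value(study_genes, term_genes, background)
print(f"p = {p:.4f}")
```

The same function applies unchanged to terms from a disease ontology: only the annotation set passed as `term_genes` differs. In practice the test would be repeated for every term and the resulting p-values corrected for multiple testing.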
Affiliation(s)
- Nigam H Shah
- Center for Biomedical Informatics Research, Stanford University, Stanford, California, United States of America.
44
Mortensen JM, Horridge M, Musen MA, Noy NF. Applications of ontology design patterns in biomedical ontologies. AMIA Annu Symp Proc 2012; 2012:643-652. [PMID: 23304337 PMCID: PMC3540458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Ontology design patterns (ODPs) are a proposed solution to facilitate ontology development, and to help users avoid some of the most frequent modeling mistakes. ODPs originate from similar approaches in software engineering, where software design patterns have become a critical aspect of software development. There is little empirical evidence for ODP prevalence or effectiveness thus far. In this work, we determine the use and applicability of ODPs in a case study of biomedical ontologies. We encoded ontology design patterns from two ODP catalogs. We then searched for these patterns in a set of eight ontologies. We found five of the 69 patterns, and two of the eight ontologies contained them. While ontology design patterns provide a vehicle for capturing formally reoccurring models and best practices in ontology design, we show that today their use in a case study of widely used biomedical ontologies is limited.
Affiliation(s)
- Jonathan M Mortensen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA
45
Kulikowski CA, Shortliffe EH, Currie LM, Elkin PL, Hunter LE, Johnson TR, Kalet IJ, Lenert LA, Musen MA, Ozbolt JG, Smith JW, Tarczy-Hornoch PZ, Williamson JJ. AMIA Board white paper: definition of biomedical informatics and specification of core competencies for graduate education in the discipline. J Am Med Inform Assoc 2012; 19:931-8. [PMID: 22683918 DOI: 10.1136/amiajnl-2012-001053] [Citation(s) in RCA: 126] [Impact Index Per Article: 10.5] [Indexed: 11/04/2022]
Abstract
The AMIA biomedical informatics (BMI) core competencies have been designed to support and guide graduate education in BMI, the core scientific discipline underlying the breadth of the field's research, practice, and education. The core definition of BMI adopted by AMIA specifies that BMI is 'the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving and decision making, motivated by efforts to improve human health.' Application areas range from bioinformatics to clinical and public health informatics and span the spectrum from the molecular to population levels of health and biomedicine. The shared core informatics competencies of BMI draw on the practical experience of many specific informatics sub-disciplines. The AMIA BMI analysis highlights the central shared set of competencies that should guide curriculum design and that graduate students should be expected to master.
Affiliation(s)
- Casimir A Kulikowski
- Department of Computer Science, Rutgers University, New Brunswick, New Jersey, USA
46
Wu ST, Liu H, Li D, Tao C, Musen MA, Chute CG, Shah NH. Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis. J Am Med Inform Assoc 2012; 19:e149-56. [PMID: 22493050 PMCID: PMC3392861 DOI: 10.1136/amiajnl-2011-000744] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 12/03/2022]
Abstract
Objective: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. Design: Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. Results: For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms. Conclusion: The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
Affiliation(s)
- Stephen T Wu
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55905, USA.
47
Abstract
The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.
Affiliation(s)
- Mark A Musen
- Center for Biomedical Informatics Research, Stanford University, Stanford, California 94305-5479, USA.
48
Jonquet C, LePendu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH. NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semant 2011; 9:316-324. [PMID: 21918645 PMCID: PMC3170774 DOI: 10.1016/j.websem.2011.06.005] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Indexed: 05/31/2023]
Abstract
The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will enable us to facilitate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index, a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics "under the hood."
Affiliation(s)
- Clement Jonquet
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
- Laboratory of Informatics, Robotics, and Microelectronics of Montpellier (LIRMM), University of Montpellier, 161 rue Ada, 34095 Montpellier, Cdx 5, France
- Paea LePendu
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
- Sean Falconer
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
- Adrien Coulet
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
- Lorraine Informatics Research and Applications Laboratory (LORIA) – INRIA Nancy - Grand-Est, Campus Scientifique - BP 239, 54506 Vandoeuvre-lès-Nancy Cedex, France
- Natalya F. Noy
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
- Mark A. Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
- Nigam H. Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
49
Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res 2011; 39:W541-5. [PMID: 21672956 PMCID: PMC3125807 DOI: 10.1093/nar/gkr469] [Citation(s) in RCA: 360] [Impact Index Per Article: 27.7] [Indexed: 11/13/2022]
Abstract
The National Center for Biomedical Ontology (NCBO) is one of the National Centers for Biomedical Computing funded under the NIH Roadmap Initiative. Contributing to the national computing infrastructure, NCBO has developed BioPortal, a web portal that provides access to a library of biomedical ontologies and terminologies (http://bioportal.bioontology.org) via the NCBO Web services. BioPortal enables community participation in the evaluation and evolution of ontology content by providing features to add mappings between terms, to add comments linked to specific ontology terms and to provide ontology reviews. The NCBO Web services (http://www.bioontology.org/wiki/index.php/NCBO_REST_services) enable this functionality and provide a uniform mechanism to access ontologies from a variety of knowledge representation formats, such as Web Ontology Language (OWL) and Open Biological and Biomedical Ontologies (OBO) format. The Web services provide multi-layered access to the ontology content, from getting all terms in an ontology to retrieving metadata about a term. Users can easily incorporate the NCBO Web services into software applications to generate semantically aware applications and to facilitate structured data collection.
Affiliation(s)
- Patricia L Whetzel
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA.
50
Coulet A, Garten Y, Dumontier M, Altman RB, Musen MA, Shah NH. Integration and publication of heterogeneous text-mined relationships on the Semantic Web. J Biomed Semantics 2011; 2 Suppl 2:S10. [PMID: 21624156 PMCID: PMC3102890 DOI: 10.1186/2041-1480-2-s2-s10] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Indexed: 02/07/2023]
Abstract
Background: Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships cause the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering. Results: We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. Conclusions: The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at http://purl.bioontology.org/ontology/PHARE.
Affiliation(s)
- Adrien Coulet
- LORIA - INRIA Nancy - Grand-Est, Campus Scientifique - BP 239 - 54506 Vandoeuvre-lès-Nancy Cedex, France.