1
Zheng L, Perl Y, He Y. Big knowledge visualization of the COVID-19 CIDO ontology evolution. BMC Med Inform Decis Mak 2023; 23:88. [PMID: 37161560 PMCID: PMC10169115 DOI: 10.1186/s12911-023-02184-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 04/20/2023] [Indexed: 05/11/2023] Open
Abstract
BACKGROUND The extensive international research for medications and vaccines for the devastating COVID-19 pandemic requires a standard reference ontology. Among the current COVID-19 ontologies, the Coronavirus Infectious Disease Ontology (CIDO) is the largest one. Furthermore, it is updated frequently and keeps growing. Researchers using CIDO as a reference ontology need a quick update about the content added in a recent release to know how relevant the new concepts are to their research needs. Although CIDO is only a medium-sized ontology, it is still a large knowledge base, posing a challenge for a user interested in obtaining the "big picture" of content changes between releases. Both a theoretical framework and a proper visualization are required to provide such a "big picture".
METHODS The child-of-based layout of the weighted aggregate partial-area taxonomy summarization network (WAT) provides a convenient "big picture" visualization of the content of an ontology. In this paper we address the "big picture" of content changes between two releases of an ontology. We introduce a new DIFF framework named Diff Weighted Aggregate Taxonomy (DWAT) to display the differences between the WATs of two releases of an ontology. We use a layered approach that starts with a DWAT of major subjects in CIDO and then drills down into a major subject of interest in the top-level DWAT to obtain a DWAT of secondary subjects and even further refined layers.
RESULTS A visualization of the Diff Weighted Aggregate Taxonomy is demonstrated on the CIDO ontology. The evolution of CIDO between 2020 and 2022 is demonstrated from two perspectives. Drilling down for a DWAT of secondary subject networks is also demonstrated. We illustrate how the DWAT of CIDO provides insight into its evolution.
CONCLUSIONS The new Diff Weighted Aggregate Taxonomy enables a layered approach to view the "big picture" of the changes in the content between two releases of an ontology.
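The idea of diffing two summarized releases can be illustrated with a very small sketch. It assumes each release has already been reduced to a map from a subject node to a weight (for example, a concept count); the subject names and numbers are invented for illustration and are not CIDO data. The sketch does not build partial-area taxonomies or the WAT itself; it only shows the comparison step on toy input.

```python
def diff_weights(old, new):
    """Compare two {subject: weight} maps and label each subject's change."""
    result = {}
    for subject in sorted(set(old) | set(new)):
        before = old.get(subject, 0)
        after = new.get(subject, 0)
        if subject not in old:
            status = "added"
        elif subject not in new:
            status = "removed"
        elif after != before:
            status = "changed"
        else:
            status = "unchanged"
        result[subject] = (status, after - before)
    return result

# Hypothetical summaries of two releases (not real CIDO numbers).
release_a = {"drug": 120, "vaccine": 40, "symptom": 75}
release_b = {"drug": 150, "vaccine": 40, "variant": 60}

summary = diff_weights(release_a, release_b)
for subject, (status, delta) in summary.items():
    print(f"{subject:8s} {status:9s} {delta:+d}")
```

A layered view in the spirit of the abstract could apply the same comparison again to the children of one subject of interest, although how the published method selects and weights those children is not described here.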
Affiliation(s)
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Yongqun He
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, and Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
2
Keloth VK, Zhou S, Lindemann L, Zheng L, Elhanan G, Einstein AJ, Geller J, Perl Y. Mining of EHR for interface terminology concepts for annotating EHRs of COVID patients. BMC Med Inform Decis Mak 2023; 23:40. [PMID: 36829139 PMCID: PMC9951157 DOI: 10.1186/s12911-023-02136-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/22/2022] [Accepted: 02/09/2023] [Indexed: 02/26/2023] Open
Abstract
BACKGROUND Two years into the COVID-19 pandemic and with more than five million deaths worldwide, the healthcare establishment continues to struggle with every new wave of the pandemic resulting from a new coronavirus variant. Research has demonstrated that there are variations in the symptoms, and even in the order of symptom presentations, in COVID-19 patients infected by different SARS-CoV-2 variants (e.g., Alpha and Omicron). Textual data in the form of admission notes and physician notes in the Electronic Health Records (EHRs) is rich in information regarding the symptoms and their orders of presentation. Unstructured EHR data is often underutilized in research due to the lack of annotations that enable automatic extraction of useful information from the available extensive volumes of textual data.
METHODS We present the design of a COVID Interface Terminology (CIT), not just a generic COVID-19 terminology, but one serving a specific purpose of enabling automatic annotation of EHRs of COVID-19 patients. CIT was constructed by integrating existing COVID-related ontologies and mining additional fine granularity concepts from clinical notes. The iterative mining approach utilized the techniques of 'anchoring' and 'concatenation' to identify potential fine granularity concepts to be added to the CIT. We also tested the generalizability of our approach on a hold-out dataset and compared the annotation coverage to the coverage obtained for the dataset used to build the CIT.
RESULTS Our experiments demonstrate that this approach results in higher annotation coverage compared to existing ontologies such as SNOMED CT and Coronavirus Infectious Disease Ontology (CIDO). The final version of CIT achieved about 20% more coverage than SNOMED CT and 50% more coverage than CIDO. In the future, the concepts mined and added into CIT could be used as training data for machine learning models for mining even more concepts into CIT and further increasing the annotation coverage.
CONCLUSION In this paper, we demonstrated the construction of a COVID interface terminology that can be utilized for automatically annotating EHRs of COVID-19 patients. The techniques presented can identify frequently documented fine granularity concepts that are missing in other ontologies thereby increasing the annotation coverage.
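To make the notion of annotation coverage concrete, here is a minimal sketch. It assumes coverage means the fraction of word tokens in a note that fall inside a matched terminology concept, with greedy longest-match annotation; the abstract does not define the metric, so this definition, the function name, and the toy terms are all assumptions for illustration.

```python
def annotation_coverage(text, terminology, max_len=4):
    """Fraction of tokens covered by greedy longest-match concept annotation."""
    tokens = text.lower().split()
    covered = 0
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in terminology:
                match_len = n
                break
        if match_len:
            covered += match_len
            i += match_len
        else:
            i += 1
    return covered / len(tokens) if tokens else 0.0

# Hypothetical note and terminologies (not taken from the study).
note = "patient reports dry cough and shortness of breath"
base = {"cough", "shortness of breath"}
extended = base | {"dry cough"}  # one finer-granularity concept added

print(annotation_coverage(note, base))
print(annotation_coverage(note, extended))
```

Adding the finer-granularity phrase raises the covered share of the note, which is the effect the abstract attributes to mining such concepts from clinical text.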
Affiliation(s)
- Vipina K Keloth
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Shuxin Zhou
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Luke Lindemann
  - School of Medicine and Health Sciences, The George Washington University, Washington (D.C.), USA
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA
- Gai Elhanan
  - Renown Institute for Health Innovation, Desert Research Institute, Reno, NV, USA
- Andrew J Einstein
  - Cardiology Division, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Radiology, Columbia University Irving Medical Center, New York, NY, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
3
He Y, Yu H, Huffman A, Lin AY, Natale DA, Beverley J, Zheng L, Perl Y, Wang Z, Liu Y, Ong E, Wang Y, Huang P, Tran L, Du J, Shah Z, Shah E, Desai R, Huang HH, Tian Y, Merrell E, Duncan WD, Arabandi S, Schriml LM, Zheng J, Masci AM, Wang L, Liu H, Smaili FZ, Hoehndorf R, Pendlington ZM, Roncaglia P, Ye X, Xie J, Tang YW, Yang X, Peng S, Zhang L, Chen L, Hur J, Omenn GS, Athey B, Smith B. A comprehensive update on CIDO: the community-based coronavirus infectious disease ontology. J Biomed Semantics 2022; 13:25. [PMID: 36271389 PMCID: PMC9585694 DOI: 10.1186/s13326-022-00279-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 05/17/2022] [Accepted: 09/13/2022] [Indexed: 11/24/2022] Open
Abstract
Background The current COVID-19 pandemic and the previous SARS/MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that in the interest of developing effective and safe vaccines and drugs and to better understand coronaviruses and associated disease mechanisms it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standards-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020.
Results As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment.
Conclusion CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.
Supplementary Information The online version contains supplementary material available at 10.1186/s13326-022-00279-z.
Affiliation(s)
- Yongqun He
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Hong Yu
  - People's Hospital of Guizhou Province, Guiyang, Guizhou, China
- Asiyah Yu Lin
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - National Center for Ontological Research, Buffalo, NY, USA
- John Beverley
  - National Center for Ontological Research, Buffalo, NY, USA
  - The Johns Hopkins University Applied Physics Laboratory, Laurel, MD, USA
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Zhigang Wang
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & School of Basic Medicine, Peking Union Medical College, Beijing, China
- Yingtong Liu
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Edison Ong
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Yang Wang
  - University of Michigan Medical School, Ann Arbor, MI, USA
  - People's Hospital of Guizhou Province, Guiyang, Guizhou, China
- Philip Huang
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Long Tran
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Jinyang Du
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Zalan Shah
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Easheta Shah
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Roshan Desai
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Hsin-Hui Huang
  - University of Michigan Medical School, Ann Arbor, MI, USA
  - National Yang-Ming University, Taipei, Taiwan
- Yujia Tian
  - Rutgers University, New Brunswick, NJ, USA
- Lynn M Schriml
  - University of Maryland School of Medicine, Baltimore, MD, USA
- Jie Zheng
  - Department of Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Anna Maria Masci
  - Office of Data Science, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Robert Hoehndorf
  - King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Zoë May Pendlington
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Paola Roncaglia
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Xianwei Ye
  - People's Hospital of Guizhou Province, Guiyang, Guizhou, China
- Jiangan Xie
  - School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, China
- Yi-Wei Tang
  - Cepheid, Danaher Diagnostic Platform, Shanghai, China
- Xiaolin Yang
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & School of Basic Medicine, Peking Union Medical College, Beijing, China
- Suyuan Peng
  - National Institute of Health Data Science, Peking University, Beijing, China
- Luxia Zhang
  - National Institute of Health Data Science, Peking University, Beijing, China
- Luonan Chen
  - Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
- Junguk Hur
  - University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND, USA
- Brian Athey
  - University of Michigan Medical School, Ann Arbor, MI, USA
- Barry Smith
  - National Center for Ontological Research, Buffalo, NY, USA
  - University at Buffalo, Buffalo, NY, 14260, USA
4
Liu C, Lee J, Ta C, Soroush A, Rogers JR, Kim JH, Natarajan K, Zucker J, Perl Y, Weng C. Risk Factors Associated With SARS-CoV-2 Breakthrough Infections in Fully mRNA-Vaccinated Individuals: Retrospective Analysis. JMIR Public Health Surveill 2022; 8:e35311. [PMID: 35486806 PMCID: PMC9132195 DOI: 10.2196/35311] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 12/02/2021] [Revised: 03/29/2022] [Accepted: 04/27/2022] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND COVID-19 messenger RNA (mRNA) vaccines have demonstrated efficacy and effectiveness in preventing symptomatic COVID-19, while being relatively safe in trial studies. However, vaccine breakthrough infections have been reported.
OBJECTIVE This study aims to identify risk factors associated with COVID-19 breakthrough infections among fully mRNA-vaccinated individuals.
METHODS We conducted a series of observational retrospective analyses using the electronic health records (EHRs) of the Columbia University Irving Medical Center/New York Presbyterian (CUIMC/NYP) up to September 21, 2021. New York City (NYC) adult residents with at least 1 polymerase chain reaction (PCR) record were included in this analysis. Poisson regression was performed to assess the association between the breakthrough infection rate in vaccinated individuals and multiple risk factors (including vaccine brand, demographics, and underlying conditions) while adjusting for calendar month, prior number of visits, and observational days in the EHR.
RESULTS The overall estimated breakthrough infection rate was 0.16 (95% CI 0.14-0.18). Individuals who were vaccinated with Pfizer/BNT162b2 (incidence rate ratio [IRR] against Moderna/mRNA-1273=1.66, 95% CI 1.17-2.35), were male (IRR against female=1.47, 95% CI 1.11-1.94), or had compromised immune systems (IRR=1.48, 95% CI 1.09-2.00) were at the highest risk for breakthrough infections. Among all underlying conditions, those with primary immunodeficiency, a history of organ transplant, an active tumor, use of immunosuppressant medications, or Alzheimer disease were at the highest risk.
CONCLUSIONS Although we found both mRNA vaccines were effective, Moderna/mRNA-1273 had a lower incidence rate of breakthrough infections. Immunocompromised and male individuals were among the highest risk groups experiencing breakthrough infections. Given the rapidly changing nature of the SARS-CoV-2 pandemic, continued monitoring and a generalizable analysis pipeline are warranted to inform quick updates on vaccine effectiveness in real time.
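For readers unfamiliar with the incidence rate ratio (IRR) reported above, the following sketch computes a crude, unadjusted IRR from event counts and person-time, together with an approximate Wald confidence interval on the log scale. The counts are invented, and the function names are hypothetical; the study itself used Poisson regression with covariate adjustment, which this sketch does not reproduce.

```python
import math

def incidence_rate_ratio(events_a, time_a, events_b, time_b):
    """Crude IRR of group A against group B (events per unit of person-time)."""
    # Cross-multiplied form of (events_a / time_a) / (events_b / time_b).
    return (events_a * time_b) / (events_b * time_a)

def wald_ci(events_a, events_b, irr, z=1.96):
    """Approximate 95% CI using SE(log IRR) = sqrt(1/a + 1/b)."""
    se = math.sqrt(1 / events_a + 1 / events_b)
    return irr * math.exp(-z * se), irr * math.exp(z * se)

# Hypothetical counts: 30 vs 20 infections over equal person-days of follow-up.
irr = incidence_rate_ratio(30, 100_000, 20, 100_000)
low, high = wald_ci(30, 20, irr)
print(f"IRR = {irr:.2f}, 95% CI {low:.2f}-{high:.2f}")
```

An IRR above 1 means group A accumulated infections faster per unit of follow-up than group B; an interval that excludes 1 is the usual informal check for a difference.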
Affiliation(s)
- Cong Liu
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
- Junghwan Lee
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
- Casey Ta
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
- Ali Soroush
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
- James R Rogers
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
- Jae Hyun Kim
  - School of Pharmacy, Jeonbuk National University, Jeonju, Republic of Korea
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
- Jason Zucker
  - Department of Medicine, Columbia University Irving Medical Center, New York, NY, United States
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States
5
Zheng L, Perl Y, He Y, Ochs C, Geller J, Liu H, Keloth VK. Visual comprehension and orientation into the COVID-19 CIDO ontology. J Biomed Inform 2021; 120:103861. [PMID: 34224898 PMCID: PMC8252699 DOI: 10.1016/j.jbi.2021.103861] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/23/2020] [Revised: 05/11/2021] [Accepted: 06/30/2021] [Indexed: 12/12/2022]
Abstract
The current intensive research on potential remedies and vaccinations for COVID-19 would greatly benefit from an ontology of standardized COVID terms. The Coronavirus Infectious Disease Ontology (CIDO) is the largest among several COVID ontologies, and it keeps growing, but it is still a medium sized ontology. Sophisticated CIDO users, who need more than searching for a specific concept, require orientation and comprehension of CIDO. In previous research, we designed a summarization network called "partial-area taxonomy" to support comprehension of ontologies. The partial-area taxonomy for CIDO is of smaller magnitude than CIDO, but is still too large for comprehension. We present here the "weighted aggregate taxonomy" of CIDO, designed to provide compact views at various granularities of our partial-area taxonomy (and the CIDO ontology). Such a compact view provides a "big picture" of the content of an ontology. In previous work, in the visualization patterns used for partial-area taxonomies, the nodes were arranged in levels according to the numbers of relationships of their concepts. Applying this visualization pattern to CIDO's weighted aggregate taxonomy resulted in an overly long and narrow layout that does not support orientation and comprehension since the names of nodes are barely readable. Thus, we introduce in this paper an innovative visualization of the weighted aggregate taxonomy for better orientation and comprehension of CIDO (and other ontologies). A measure for the efficiency of a layout is introduced and is used to demonstrate the advantage of the new layout over the previous one. With this new visualization, the user can "see the forest for the trees" of the ontology. Benefits of this visualization in highlighting insights into CIDO's content are provided. Generality of the new layout is demonstrated.
Affiliation(s)
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Yongqun He
  - Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Hao Liu
  - Columbia University Irving Medical Center, New York, NY, USA
- Vipina K Keloth
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
6
Zheng L, Min H, Chen Y, Keloth V, Geller J, Perl Y, Hripcsak G. Outlier concepts auditing methodology for a large family of biomedical ontologies. BMC Med Inform Decis Mak 2020; 20:296. [PMID: 33319713 PMCID: PMC7737254 DOI: 10.1186/s12911-020-01311-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/26/2020] [Accepted: 10/28/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Summarization networks are compact summaries of ontologies. The "Big Picture" view offered by summarization networks makes it possible to identify sets of concepts that are more likely to have errors than control concepts. For ontologies that have outgoing lateral relationships, we have developed the "partial-area taxonomy" summarization network. Prior research has identified one kind of outlier concept: concepts of small partial-areas within partial-area taxonomies. Previously we have shown that the small partial-area technique works successfully for four ontologies (or their hierarchies).
METHODS To improve the Quality Assurance (QA) scalability, a family-based QA framework, where one QA technique is potentially applicable to a whole family of ontologies with similar structural features, was developed. The 373 ontologies hosted at the NCBO BioPortal in 2015 were classified into a collection of families based on structural features. A meta-ontology represents this family collection, including one family of ontologies having outgoing lateral relationships. The process of updating the current meta-ontology is described. To conclude that one QA technique is applicable for at least half of the members of a family F, this technique should be demonstrated as successful for six out of six ontologies in F. We describe a hypothesis setting the condition required for a technique to be successful for a given ontology. The process of a study to demonstrate such success is described. This paper intends to prove the scalability of the small partial-area technique.
RESULTS We first updated the meta-ontology classifying 566 BioPortal ontologies. There were 371 ontologies in the family with outgoing lateral relationships. We demonstrated the success of the small partial-area technique for two ontology hierarchies which belong to this family, SNOMED CT's Specimen hierarchy and NCIt's Gene hierarchy. Together with the four previous ontologies from the same family, we fulfilled the "six out of six" condition required to show the scalability for the whole family.
CONCLUSIONS We have shown that the small partial-area technique can be potentially successful for the family of ontologies with outgoing lateral relationships in BioPortal, thus improving the scalability of this QA technique.
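One plausible way to read the "six out of six" condition is as a simple binomial argument; the abstract does not spell out the statistics, so this reading is an assumption. Suppose the ontologies are sampled independently from the family and the technique succeeds on each with probability p. If p were below one half, six successes in a row would be rare, as the hypothetical helper below shows.

```python
def prob_all_successes(p, n):
    """Probability that n independent trials all succeed when each succeeds with probability p."""
    return p ** n

# Upper bound on the chance of six straight successes if fewer than half
# of the family's ontologies were amenable to the technique (p < 0.5).
threshold = prob_all_successes(0.5, 6)
print(threshold)  # 0.015625
```

Under these assumptions, observing six successes in six attempts would be unlikely unless the technique works for at least half of the family.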
Affiliation(s)
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, 07764, USA
- Hua Min
  - Department of Health Administration and Policy, George Mason University, Fairfax, VA, 22030, USA
- Yan Chen
  - CIS Department, Borough of Manhattan Community College, CUNY, New York, NY, 10007, USA
- Vipina Keloth
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
7
Zheng L, Chen Y, Min H, Hildebrand PL, Liu H, Halper M, Geller J, de Coronado S, Perl Y. Missing lateral relationships in top-level concepts of an ontology. BMC Med Inform Decis Mak 2020; 20:305. [PMID: 33319709 PMCID: PMC7737264 DOI: 10.1186/s12911-020-01319-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/29/2020] [Accepted: 11/09/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ontologies house various kinds of domain knowledge in formal structures, primarily in the form of concepts and the associative relationships between them. Ontologies have become integral components of many health information processing environments. Hence, quality assurance of the conceptual content of any ontology is critical. Relationships are foundational to the definition of concepts. Missing relationship errors (i.e., unintended omissions of important definitional relationships) can have a deleterious effect on the quality of an ontology. An abstraction network is a structure that overlays an ontology and provides an alternate, summarization view of its contents. One kind of abstraction network is called an area taxonomy, and a variation of it is called a subtaxonomy. A methodology based on these taxonomies for more readily finding missing relationship errors is explored.
METHODS The area taxonomy and the subtaxonomy are deployed to help reveal concepts that have a high likelihood of exhibiting missing relationship errors. A specific top-level grouping unit found within the area taxonomy and subtaxonomy, when deemed to be anomalous, is used as an indicator that missing relationship errors are likely to be found among certain concepts. Two hypotheses pertaining to the effectiveness of our Quality Assurance approach are studied.
RESULTS Our Quality Assurance methodology was applied to the Biological Process hierarchy of the National Cancer Institute thesaurus (NCIt) and SNOMED CT's Eye/vision finding subhierarchy within its Clinical finding hierarchy. Many missing relationship errors were discovered and confirmed in our analysis. For both test-bed hierarchies, our Quality Assurance methodology yielded a statistically significantly higher number of concepts with missing relationship errors in comparison to a control sample of concepts. Two hypotheses are confirmed by these findings.
CONCLUSIONS Quality assurance is a critical part of an ontology's lifecycle, and automated or semi-automated tools for supporting this process are invaluable. We introduced a Quality Assurance methodology targeted at missing relationship errors. Its successful application to the NCIt's Biological Process hierarchy and SNOMED CT's Eye/vision finding subhierarchy indicates that it can be a useful addition to the arsenal of tools available to ontology maintenance personnel.
Affiliation(s)
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, 07764, USA
- Yan Chen
  - CIS Department, Borough of Manhattan Community College, CUNY, New York, NY, 10007, USA
- Hua Min
  - Department of Health Administration and Policy, George Mason University, Fairfax, VA, 22030, USA
- Hao Liu
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Michael Halper
  - Department of Informatics, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Sherri de Coronado
  - National Cancer Institute, Center for Biomedical Informatics and Information Technology, National Institutes of Health, Rockville, MD, 20850, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
8
Liu H, Perl Y, Geller J. Concept placement using BERT trained by transforming and summarizing biomedical ontology structure. J Biomed Inform 2020; 112:103607. [PMID: 33098987 DOI: 10.1016/j.jbi.2020.103607] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/18/2020] [Revised: 09/07/2020] [Accepted: 10/17/2020] [Indexed: 11/17/2022]
Abstract
The comprehensive modeling and hierarchical positioning of a new concept in an ontology heavily relies on its set of proper subsumption relationships (IS-As) to other concepts. Identifying a concept's IS-A relationships is a laborious task requiring curators to have both domain knowledge and terminology skills. In this work, we propose a method to automatically predict the presence of IS-A relationships between a new concept and pre-existing concepts based on the language representation model BERT. This method converts the neighborhood network of a concept into "sentences" and harnesses BERT's Next Sentence Prediction (NSP) capability of predicting the adjacency of two sentences. To augment our method's performance, we refined the training data by employing an ontology summarization technique. We trained our model with the two largest hierarchies of the SNOMED CT 2017 July release and applied it to predicting the parents of new concepts added in the SNOMED CT 2018 January release. The results showed that our method achieved an average F1 score of 0.88, and the average Recall score improves slightly from 0.94 to 0.96 by using the ontology summarization technique.
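The F1 and Recall figures quoted above follow the standard definitions, which can be sketched as below. The counts are invented for illustration: they stand for predicted IS-A parents that are correct (true positives), predicted parents that are wrong (false positives), and actual parents that were missed (false negatives). How the paper averages the scores across concepts is not shown here.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw prediction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Equivalent to 2 * precision * recall / (precision + recall).
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Hypothetical evaluation: 8 correct parents, 2 spurious, 2 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```

Because F1 is the harmonic mean of precision and recall, a high recall alone does not guarantee a high F1; precision has to be comparable as well.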
Affiliation(s)
- Hao Liu
  - Dept of Computer Science, NJIT, Newark, NJ, USA
9
Zheng L, He Z, Wei D, Keloth V, Fan JW, Lindemann L, Zhu X, Cimino JJ, Perl Y. A review of auditing techniques for the Unified Medical Language System. J Am Med Inform Assoc 2020; 27:1625-1638. [PMID: 32766692 PMCID: PMC7566540 DOI: 10.1093/jamia/ocaa108] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/10/2020] [Revised: 05/05/2020] [Accepted: 05/13/2020] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE The study sought to describe the literature related to the development of methods for auditing the Unified Medical Language System (UMLS), with particular attention to identifying errors and inconsistencies of attributes of the concepts in the UMLS Metathesaurus.
MATERIALS AND METHODS We applied the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) approach by searching the MEDLINE database and Google Scholar for studies referencing the UMLS and any of several terms related to auditing, error detection, and quality assurance. A qualitative analysis and summarization of articles that met inclusion criteria were performed.
RESULTS Eighty-three studies were reviewed in detail. We first categorized techniques based on various aspects including concepts, concept names, and synonymy (n = 37), semantic type assignments (n = 36), hierarchical relationships (n = 24), lateral relationships (n = 12), ontology enrichment (n = 8), and ontology alignment (n = 18). We also categorized the methods according to their level of automation (ie, automated systematic, automated heuristic, or manual) and the type of knowledge used (ie, intrinsic or extrinsic knowledge).
CONCLUSIONS This study is a comprehensive review of the published methods for auditing the various conceptual aspects of the UMLS. Categorizing the auditing techniques according to the various aspects will give the curators of the UMLS, as well as researchers, comprehensive and easy access to this wealth of knowledge (eg, for auditing lateral relationships in the UMLS). We also reviewed ontology enrichment and alignment techniques due to their critical use of and impact on the UMLS.
Affiliation(s)
- Ling Zheng
- Department of Computer Science and Software Engineering, Monmouth University, West Long Branch, New Jersey, USA
| | - Zhe He
- School of Information, Florida State University, Tallahassee, Florida, USA
| | - Duo Wei
- School of Business, Stockton University, Galloway, New Jersey, USA
| | - Vipina Keloth
- Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, USA
| | - Jung-Wei Fan
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Luke Lindemann
- Center for Biomedical Data Science, Yale School of Medicine, New Haven, Connecticut, USA
| | - Xinxin Zhu
- Center for Biomedical Data Science, Yale School of Medicine, New Haven, Connecticut, USA
| | - James J Cimino
- Informatics Institute, University of Alabama at Birmingham, Birmingham, Alabama, USA
| | - Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, USA
|
10
|
Liu H, Perl Y, Geller J. Transfer Learning from BERT to Support Insertion of New Concepts into SNOMED CT. AMIA Annu Symp Proc 2020; 2019:1129-1138. [PMID: 32308910 PMCID: PMC7153142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
With advances in Machine Learning (ML), neural network-based methods, such as Convolutional/Recurrent Neural Networks, have been proposed to assist terminology curators in the development and maintenance of terminologies. Bidirectional Encoder Representations from Transformers (BERT), a new language representation model, obtains state-of-the-art results on a wide array of general English NLP tasks. We explore BERT's applicability to medical terminology-related tasks. Utilizing the "next sentence prediction" capability of BERT, we show that the Fine-tuning strategy of Transfer Learning (TL) from the BERTBASE model can address a challenging problem in automatic terminology enrichment - insertion of new concepts. Adding a pre-training strategy enhances the results. We apply our strategies to the two largest hierarchies of SNOMED CT, with one release as training data and the following release as test data. The performance of the combined two proposed TL models achieves an average F1 score of 0.85 and 0.86 for the two hierarchies, respectively.
Affiliation(s)
- Hao Liu
- Dept of Computer Science, NJIT, Newark, NJ, USA
|
11
|
Zheng L, Liu H, Perl Y, Geller J. Training a Convolutional Neural Network with Terminology Summarization Data Improves SNOMED CT Enrichment. AMIA Annu Symp Proc 2020; 2019:972-981. [PMID: 32308894 PMCID: PMC7153126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
As a step toward learning to automatically insert new concepts into a large biomedical ontology, we are studying the easier problem of automatically verifying that an IS-A link should exist between a new child concept and an existing parent concept. We are using a Convolutional Neural Network, a powerful machine learning method. However, results depend on the quality of the training data. We use SNOMED CT (July 2017) for training and the subsequent release for testing. The main problem is to find a good set of negative training data. We experiment with two approaches, based on uncle-nephew (not connected) pairs of concepts. We contrast using the complete Clinical Finding hierarchy of SNOMED CT with using the powerful Area Taxonomy ontology summarization mechanism to constrain the training data. The results for the task of verifying IS-A links are improved by 8.6% when going from the complete hierarchy to the Area Taxonomy.
Affiliation(s)
- Ling Zheng
- CSSE Department, Monmouth University, West Long Branch, NJ, USA
| | - Hao Liu
- SABOC Center, Department of Computer Science, NJIT, Newark, NJ, USA
| | - Yehoshua Perl
- SABOC Center, Department of Computer Science, NJIT, Newark, NJ, USA
| | - James Geller
- SABOC Center, Department of Computer Science, NJIT, Newark, NJ, USA
|
12
|
Zheng L, Liu H, Perl Y, Geller J, Ochs C, Case JT. Overlapping Complex Concepts Have More Commission Errors, Especially in Intensive Terminology Auditing. AMIA Annu Symp Proc 2018; 2018:1157-1166. [PMID: 30815158 PMCID: PMC6371375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
SNOMED CT is a large, complex and widely-used terminology. Auditing is part of the life cycle of terminologies. A review of terminologies' content can identify two error categories: commission errors, such as an incorrect parent or attribute relationship, indicating errors in a concept's modeling, and omission errors, such as missing a parent or attribute relationship, representing incomplete modeling of a concept. According to our experience, terminology curators are mostly interested in commission errors. In recent years, a long-term remodeling project has addressed modeling issues in SNOMED CT's Infectious disease and Congenital disease subhierarchies. In this longitudinal study, we investigated a posteriori the efficacy of complex concepts, called overlapping concepts, to identify commission errors during intensive auditing periods and during maintenance periods over several releases. The algorithmic implication is that when auditing resources are scarce, a methodology of auditing first, or only, the overlapping concepts will obtain a higher auditing yield.
Affiliation(s)
- Ling Zheng
- Monmouth University, West Long Branch, NJ, US
| | - Hao Liu
- New Jersey Institute of Technology, Newark, NJ, US
| | | | - James Geller
- New Jersey Institute of Technology, Newark, NJ, US
|
13
|
Liu H, Geller J, Halper M, Perl Y. Using Convolutional Neural Networks to Support Insertion of New Concepts into SNOMED CT. AMIA Annu Symp Proc 2018; 2018:750-759. [PMID: 30815117 PMCID: PMC6371320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Many major medical ontologies go through a regular (bi-annual, monthly, etc.) release cycle. A new release will contain corrections to the previous release, as well as genuinely new concepts that are the result of either user requests or new developments in the domain. New concepts need to be placed at the correct place in the ontology hierarchy. Traditionally, this is done by an expert modeling a new concept and running a classifier algorithm. We propose an alternative approach that is based on providing only the name of a new concept and using a Convolutional Neural Network-based machine learning method. We first tested this approach within one version of SNOMED CT and achieved an average 88.5% precision and an F1 score of 0.793. In comparing the July 2017 release with the January 2018 release, limiting ourselves to predicting one out of two or more parents, our average F1 score was 0.701.
Affiliation(s)
- Hao Liu
- New Jersey Institute of Technology, Newark, NJ
|
14
|
Zheng L, Chen Y, Elhanan G, Perl Y, Geller J, Ochs C. Complex overlapping concepts: An effective auditing methodology for families of similarly structured BioPortal ontologies. J Biomed Inform 2018; 83:135-149. [PMID: 29852316 DOI: 10.1016/j.jbi.2018.05.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 10/10/2017] [Revised: 05/25/2018] [Accepted: 05/26/2018] [Indexed: 11/30/2022]
Abstract
In previous research, we have demonstrated for a number of ontologies that structurally complex concepts (for different definitions of "complex") in an ontology are more likely to exhibit errors than other concepts. Thus, such complex concepts often become fertile ground for quality assurance (QA) in ontologies. They should be audited first. One example of complex concepts is given by "overlapping concepts" (to be defined below). Historically, a different auditing methodology had to be developed for every single ontology. For better scalability and efficiency, it is desirable to identify family-wide QA methodologies. Each such methodology would be applicable to a whole family of similar ontologies. In past research, we had divided the 685 ontologies of BioPortal into families of structurally similar ontologies. We showed for four ontologies of the same large family in BioPortal that "overlapping concepts" are indeed statistically significantly more likely to exhibit errors. In order to make an authoritative statement concerning the success of "overlapping concepts" as a methodology for a whole family of similar ontologies (or of large subhierarchies of ontologies), it is necessary to show that "overlapping concepts" have a higher likelihood of errors for six out of six ontologies of the family. In this paper, we are demonstrating for two more ontologies that "overlapping concepts" can successfully predict groups of concepts with a higher error rate than concepts from a control group. The fifth ontology is the Neoplasm subhierarchy of the National Cancer Institute thesaurus (NCIt). The sixth ontology is the Infectious Disease subhierarchy of SNOMED CT. We demonstrate quality assurance results for both of them. Furthermore, in this paper we observe two novel, important, and useful phenomena during quality assurance of "overlapping concepts." First, an erroneous "overlapping concept" can help with discovering other erroneous "non-overlapping concepts" in its vicinity. Secondly, correcting erroneous "overlapping concepts" may turn them into "non-overlapping concepts." We demonstrate that this may reduce the complexity of parts of the ontology, which in turn makes the ontology more comprehensible, simplifying maintenance and use of the ontology.
Affiliation(s)
- Ling Zheng
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
| | - Yan Chen
- CIS Department, Borough of Manhattan Community College, CUNY, NY 10007, United States
| | - Gai Elhanan
- Applied Innovation Center, Desert Research Institute, Reno, NV 89512, United States
| | - Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
| | - James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
|
15
|
Abstract
Controlled medical terminologies are increasingly becoming strategic components of various healthcare enterprises. However, the typical medical terminology can be difficult to exploit due to its extensive size and high density. The schema of a medical terminology offered by an object-oriented representation is a valuable tool in providing an abstract view of the terminology, enhancing comprehensibility and making it more usable. However, schemas themselves can be large and unwieldy. We present a methodology for partitioning a medical terminology schema into manageably sized fragments that promote increased comprehension. Our methodology has a refinement process for the subclass hierarchy of the terminology schema. The methodology is carried out by a medical domain expert in conjunction with a computer. The expert is guided by a set of three modeling rules, which guarantee that the resulting partitioned schema consists of a forest of trees. This makes it easier to understand and consequently use the medical terminology. The application of our methodology to the schema of the Medical Entities Dictionary (MED) is presented.
|
16
|
He Z, Perl Y, Elhanan G, Chen Y, Geller J, Bian J. Auditing the Assignments of Top-Level Semantic Types in the UMLS Semantic Network to UMLS Concepts. Proceedings (IEEE Int Conf Bioinformatics Biomed) 2017; 2017:1262-1269. [PMID: 29375930 DOI: 10.1109/bibm.2017.8217840] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
The Unified Medical Language System (UMLS) is an important terminological system. By the policy of its curators, each concept of the UMLS should be assigned the most specific Semantic Types (STs) in the UMLS Semantic Network (SN). Hence, the Semantic Types of most UMLS concepts are assigned at or near the bottom (leaves) of the UMLS Semantic Network. While most ST assignments are correct, some errors do occur. Therefore, Quality Assurance efforts of UMLS curators for ST assignments should concentrate on automatically detected sets of UMLS concepts with higher error rates than random sets. In this paper, we investigate the assignments of top-level semantic types in the UMLS semantic network to concepts, identify potential erroneous assignments, define four categories of errors, and thus provide assistance to curators of the UMLS to avoid these assignment errors. Human experts analyzed samples of concepts assigned 10 of the top-level semantic types and categorized the erroneous ST assignments into these four logical categories. Two thirds of the concepts assigned these 10 top-level semantic types are erroneous. Our results demonstrate that reviewing top-level semantic type assignments to concepts provides an effective way for UMLS quality assurance, compared with reviewing a random selection of semantic type assignments.
Affiliation(s)
- Zhe He
- School of Information, Florida State University, Tallahassee, FL
| | - Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
| | - Gai Elhanan
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
| | - Yan Chen
- Department of Computer Information Systems, BMCC, CUNY, New York, NY
| | - James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
| | - Jiang Bian
- Department of Health Outcomes and Policy, University of Florida, Gainesville, FL
|
17
|
Zheng L, Yumak H, Chen L, Ochs C, Geller J, Kapusnik-Uner J, Perl Y. Quality assurance of chemical ingredient classification for the National Drug File - Reference Terminology. J Biomed Inform 2017; 73:30-42. [PMID: 28723580 DOI: 10.1016/j.jbi.2017.07.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/27/2017] [Revised: 07/13/2017] [Accepted: 07/14/2017] [Indexed: 02/04/2023]
Abstract
The National Drug File - Reference Terminology (NDF-RT) is a large and complex drug terminology consisting of several classification hierarchies on top of an extensive collection of drug concepts. These hierarchies provide important information about clinical drugs, e.g., their chemical ingredients, mechanisms of action, dosage form and physiological effects. Within NDF-RT such information is represented using tens of thousands of roles connecting drugs to classifications. In previous studies, we have introduced various kinds of Abstraction Networks to summarize the content and structure of terminologies in order to facilitate their visual comprehension, and support quality assurance of terminologies. However, these previous kinds of Abstraction Networks are not appropriate for summarizing the NDF-RT classification hierarchies, due to its unique structure. In this paper, we present the novel Ingredient Abstraction Network (IAbN) to summarize, visualize and support the audit of NDF-RT's Chemical Ingredients hierarchy and its associated drugs. A common theme in our quality assurance framework is to use characterizations of sets of concepts, revealed by the Abstraction Network structure, to capture concepts, the modeling of which is more complex than for other concepts. For the IAbN, we characterize drug ingredient concepts as more complex if they belong to IAbN groups with multiple parent groups. We show that such concepts have a statistically significantly higher rate of errors than a control sample and identify two especially common patterns of errors.
Affiliation(s)
- Ling Zheng
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
| | - Hasan Yumak
- BMCC, CUNY, New York, NY 10007, United States.
| | - Ling Chen
- BMCC, CUNY, New York, NY 10007, United States.
| | - Christopher Ochs
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
| | - James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
| | | | - Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
|
18
|
Elhanan G, Ochs C, Mejino JLV, Liu H, Mungall CJ, Perl Y. From SNOMED CT to Uberon: Transferability of evaluation methodology between similarly structured ontologies. Artif Intell Med 2017; 79:9-14. [PMID: 28532962 DOI: 10.1016/j.artmed.2017.05.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/13/2016] [Revised: 05/03/2017] [Accepted: 05/04/2017] [Indexed: 12/29/2022]
Abstract
OBJECTIVE To examine whether disjoint partial-area taxonomy, a semantically-based evaluation methodology that has been successfully tested in SNOMED CT, will perform with similar effectiveness on Uberon, an anatomical ontology that belongs to a structurally similar family of ontologies as SNOMED CT. METHOD A disjoint partial-area taxonomy was generated for Uberon. One hundred randomly selected test concepts that overlap between partial-areas were matched to a same size control sample of non-overlapping concepts. The samples were blindly inspected for non-critical issues and presumptive errors first by a general domain expert whose results were then confirmed or rejected by a highly experienced anatomical ontology domain expert. Reported issues were subsequently reviewed by Uberon's curators. RESULTS Overlapping concepts in Uberon's disjoint partial-area taxonomy exhibited a significantly higher rate of all issues. Clear-cut presumptive errors trended similarly but did not reach statistical significance. A sub-analysis of overlapping concepts with three or more relationship types indicated a much higher rate of issues. CONCLUSIONS Overlapping concepts from Uberon's disjoint abstraction network are quite likely (up to 28.9%) to exhibit issues. The results suggest that the methodology can transfer well between same family ontologies. Although Uberon exhibited relatively few overlapping concepts, the methodology can be combined with other semantic indicators to expand the process to other concepts within the ontology that will generate high yields of discovered issues.
Affiliation(s)
- Gai Elhanan
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA.
| | - Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
| | - Jose L V Mejino
- Department of Biological Structure (Structural Informatics Group), University of Washington, Seattle, WA, USA
| | - Hao Liu
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
| | | | - Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
|
19
|
Min H, Zheng L, Perl Y, Halper M, De Coronado S, Ochs C. Relating Complexity and Error Rates of Ontology Concepts. More Complex NCIt Concepts Have More Errors. Methods Inf Med 2017; 56:200-208. [PMID: 28244549 DOI: 10.3414/me16-01-0085] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 07/13/2016] [Accepted: 01/19/2017] [Indexed: 11/09/2022]
Abstract
OBJECTIVES Ontologies are knowledge structures that lend support to many health-information systems. A study is carried out to assess the quality of ontological concepts based on a measure of their complexity. The results show a relation between complexity of concepts and error rates of concepts. METHODS A measure of lateral complexity defined as the number of exhibited role types is used to distinguish between more complex and simpler concepts. Using a framework called an area taxonomy, a kind of abstraction network that summarizes the structural organization of an ontology, concepts are divided into two groups along these lines. Various concepts from each group are then subjected to a two-phase QA analysis to uncover and verify errors and inconsistencies in their modeling. A hierarchy of the National Cancer Institute thesaurus (NCIt) is used as our test-bed. A hypothesis pertaining to the expected error rates of the complex and simple concepts is tested. RESULTS Our study was done on the NCIt's Biological Process hierarchy. Various errors, including missing roles, incorrect role targets, and incorrectly assigned roles, were discovered and verified in the two phases of our QA analysis. The overall findings confirmed our hypothesis by showing a statistically significant difference between the amounts of errors exhibited by more laterally complex concepts vis-à-vis simpler concepts. CONCLUSIONS QA is an essential part of any ontology's maintenance regimen. In this paper, we reported on the results of a QA study targeting two groups of ontology concepts distinguished by their level of complexity, defined in terms of the number of exhibited role types. The study was carried out on a major component of an important ontology, the NCIt. The findings suggest that more complex concepts tend to have a higher error rate than simpler concepts. These findings can be utilized to guide ongoing efforts in ontology QA.
Affiliation(s)
- Hua Min
- Department of Health Administration and Policy, College of Health and Human Services, George Mason University, MS: 1J3, 4400 University Drive, Fairfax, VA 22030-4444, USA
|
20
|
Ochs C, Case JT, Perl Y. Analyzing structural changes in SNOMED CT's Bacterial infectious diseases using a visual semantic delta. J Biomed Inform 2017; 67:101-116. [PMID: 28215561 DOI: 10.1016/j.jbi.2017.02.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/29/2016] [Revised: 02/08/2017] [Accepted: 02/09/2017] [Indexed: 12/23/2022]
Abstract
Thousands of changes are applied to SNOMED CT's concepts during each release cycle. These changes are the result of efforts to improve or expand the coverage of health domains in the terminology. Understanding which concepts changed, how they changed, and the overall impact of a set of changes is important for editors and end users. Each SNOMED CT release comes with delta files, which identify all of the individual additions and removals of concepts and relationships. These files typically contain tens of thousands of individual entries, overwhelming users. They also do not identify the editorial processes that were applied to individual concepts and they do not capture the overall impact of a set of changes on a subhierarchy of concepts. In this paper we introduce a methodology and accompanying software tool called a SNOMED CT Visual Semantic Delta ("semantic delta" for short) to enable a comprehensive review of changes in SNOMED CT. The semantic delta displays a graphical list of editing operations that provides semantics and context to the additions and removals in the delta files. However, there may still be thousands of editing operations applied to a set of concepts. To address this issue, a semantic delta includes a visual summary of changes that affected sets of structurally and semantically similar concepts. The software tool for creating semantic deltas offers views of various granularities, allowing a user to control how much change information they view. In this tool a user can select a set of structurally and semantically similar concepts and review the editing operations that affected their modeling. The semantic delta methodology is demonstrated on SNOMED CT's Bacterial infectious disease subhierarchy, which has undergone a significant remodeling effort over the last two years.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA.
| | - James T Case
- National Library of Medicine/National Institutes of Health, Bethesda, MD 20894, USA
| | - Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
|
21
|
Ochs C, Case JT, Perl Y. Tracking the Remodeling of SNOMED CT's Bacterial Infectious Diseases. AMIA Annu Symp Proc 2017; 2016:974-983. [PMID: 28269894 PMCID: PMC5333319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
SNOMED CT's content undergoes many changes from one release to the next. Over the last year SNOMED CT's Bacterial infectious disease subhierarchy has undergone significant editing to bring consistent modeling to its concepts. In this paper we analyze the stated and inferred structural modifications that affected the Bacterial infectious disease subhierarchy between the Jan 2015 and Jan 2016 SNOMED CT releases using a two-phased approach. First, we introduce a methodology for creating a human readable list of changes. Next, we utilize partial-area taxonomies, which are compact summaries of SNOMED CT's content and structure, to identify the "big picture" changes that occurred in the subhierarchy. We illustrate how partial-area taxonomies can be used to help identify groups of concepts that were affected by these editing operations and the nature of these changes. Modeling issues identified using our two-phase methodology are discussed.
|
22
|
Zheng L, Perl Y, Elhanan G, Ochs C, Geller J, Halper M. Summarizing an Ontology: A "Big Knowledge" Coverage Approach. Stud Health Technol Inform 2017; 245:978-982. [PMID: 29295246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Maintenance and use of a large ontology, consisting of thousands of knowledge assertions, are hampered by its scope and complexity. It is important to provide tools for summarization of ontology content in order to facilitate user "big picture" comprehension. We present a parameterized methodology for the semi-automatic summarization of major topics in an ontology, based on a compact summary of the ontology, called an "aggregate partial-area taxonomy", followed by manual enhancement. An experiment is presented to test the effectiveness of such summarization measured by coverage of a given list of major topics of the corresponding application domain. SNOMED CT's Specimen hierarchy is the test-bed. A domain-expert provided a list of topics that serves as a gold standard. The enhanced results show that the aggregate taxonomy covers most of the domain's main topics.
Affiliation(s)
- Ling Zheng
- College of Computing, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA
| | - Yehoshua Perl
- College of Computing, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA
| | - Gai Elhanan
- College of Computing, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA
| | - Christopher Ochs
- College of Computing, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA
| | - James Geller
- College of Computing, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA
| | - Michael Halper
- College of Computing, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA
|
23
|
Liu H, Zheng L, Perl Y, Chen Y, Elhanan G. Correcting Ontology Errors Simplifies Visual Complexity. Stud Health Technol Inform 2017; 245:1330. [PMID: 29295411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
In previous research we have shown that hierarchically complex overlapping concepts have a higher rate of errors than control concepts. In this poster we show an example from Neoplasm concepts of the NCI thesaurus (NCIt) demonstrating that erroneous overlapping concepts, reflected in the partial-area units of a partial-area taxonomy, display visual complexity. Furthermore, correcting these erroneous concepts causes visual simplification.
Affiliation(s)
- Hao Liu
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA
| | - Ling Zheng
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA
| | - Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA
| | - Yan Chen
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA
| | - Gai Elhanan
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA
|
24
|
Perl Y, Geller J, Halper M, Ochs C, Zheng L, Kapusnik-Uner J. Introducing the Big Knowledge to Use (BK2U) challenge. Ann N Y Acad Sci 2016; 1387:12-24. [PMID: 27750400 DOI: 10.1111/nyas.13225] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/27/2016] [Revised: 07/07/2016] [Accepted: 08/11/2016] [Indexed: 12/26/2022]
Abstract
The purpose of the Big Data to Knowledge initiative is to develop methods for discovering new knowledge from large amounts of data. However, if the resulting knowledge is so large that it resists comprehension, referred to here as Big Knowledge (BK), how can it be used properly and creatively? We call this secondary challenge, Big Knowledge to Use. Without a high-level mental representation of the kinds of knowledge in a BK knowledgebase, effective or innovative use of the knowledge may be limited. We describe summarization and visualization techniques that capture the big picture of a BK knowledgebase, possibly created from Big Data. In this research, we distinguish between assertion BK and rule-based BK (rule BK) and demonstrate the usefulness of summarization and visualization techniques of assertion BK for clinical phenotyping. As an example, we illustrate how a summary of many intracranial bleeding concepts can improve phenotyping, compared to the traditional approach. We also demonstrate the usefulness of summarization and visualization techniques of rule BK for drug-drug interaction discovery.
Affiliation(s)
- Michael Halper
- Information Technology Department, New Jersey Institute of Technology, Newark, New Jersey
|
25
|
Ochs C, Geller J, Perl Y, Musen MA. A unified software framework for deriving, visualizing, and exploring abstraction networks for ontologies. J Biomed Inform 2016; 62:90-105. [PMID: 27345947 DOI: 10.1016/j.jbi.2016.06.008] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 04/07/2016] [Revised: 06/02/2016] [Accepted: 06/22/2016] [Indexed: 11/27/2022]
Abstract
Software tools play a critical role in the development and maintenance of biomedical ontologies. One important task that is difficult without software tools is ontology quality assurance. In previous work, we have introduced different kinds of abstraction networks to provide a theoretical foundation for ontology quality assurance tools. Abstraction networks summarize the structure and content of ontologies. One kind of abstraction network that we have used repeatedly to support ontology quality assurance is the partial-area taxonomy. It summarizes structurally and semantically similar concepts within an ontology. However, the use of partial-area taxonomies was ad hoc and not generalizable. In this paper, we describe the Ontology Abstraction Framework (OAF), a unified framework and software system for deriving, visualizing, and exploring partial-area taxonomy abstraction networks. The OAF includes support for various ontology representations (e.g., OWL and SNOMED CT's relational format). A Protégé plugin for deriving "live partial-area taxonomies" is demonstrated.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
- James Geller
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA
|
26
|
|
27
|
Abstract
The gene ontology (GO) is used extensively in the field of genomics. Like other large and complex ontologies, quality assurance (QA) efforts for GO's content can be laborious and time consuming. Abstraction networks (AbNs) are summarization networks that reveal and highlight high-level structural and hierarchical aggregation patterns in an ontology. They have been shown to successfully support QA work in the context of various ontologies. Two kinds of AbNs, called the area taxonomy and the partial-area taxonomy, are developed for GO hierarchies and derived specifically for the biological process (BP) hierarchy. Within this framework, several QA heuristics, based on the identification of groups of anomalous terms which exhibit certain taxonomy-defined characteristics, are introduced. Such groups are expected to have higher error rates when compared to other terms. Thus, by focusing QA efforts on anomalous terms one would expect to find relatively more erroneous content. By automatically identifying these potential problem areas within an ontology, time and effort will be saved during manual reviews of GO's content. BP is used as a testbed, with samples of three kinds of anomalous BP terms chosen for a taxonomy-based QA review. Additional heuristics for QA are demonstrated. From the results of this QA effort, it is observed that different kinds of inconsistencies in the modeling of GO can be exposed with the use of the proposed heuristics. For comparison, the results of QA work on a sample of terms chosen from GO's general population are presented.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Michael Halper
- Information Technology Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- James Geller
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Jane Lomax
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
|
28
|
Ochs C, Zheng L, Gu H, Perl Y, Geller J, Kapusnik-Uner J, Zakharchenko A. Drug-drug Interaction Discovery Using Abstraction Networks for "National Drug File - Reference Terminology" Chemical Ingredients. AMIA Annu Symp Proc 2015; 2015:973-982. [PMID: 26958234 PMCID: PMC4765653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
The National Drug File - Reference Terminology (NDF-RT) is a large and complex drug terminology. NDF-RT provides important information about clinical drugs, e.g., their chemical ingredients, mechanisms of action, dosage form and physiological effects. Within NDF-RT such information is represented using tens of thousands of roles. It is difficult to comprehend large, complex terminologies like NDF-RT. In previous studies, we introduced abstraction networks to summarize the content and structure of terminologies. In this paper, we introduce the Ingredient Abstraction Network to summarize NDF-RT's Chemical Ingredients and their associated drugs. Additionally, we introduce the Aggregate Ingredient Abstraction Network, for controlling the granularity of summarization provided by the Ingredient Abstraction Network. The Ingredient Abstraction Network is used to support the discovery of new candidate drug-drug interactions (DDIs) not appearing in First Databank, Inc.'s DDI knowledgebase.
Affiliation(s)
- Ling Zheng
- New Jersey Institute of Technology, Newark, NJ
- Huanying Gu
- New York Institute of Technology, New York, NY
|
29
|
Wei D, Gu HH, Perl Y, Halper M, Ochs C, Elhanan G, Chen Y. Structural measures to track the evolution of SNOMED CT hierarchies. J Biomed Inform 2015; 57:278-87. [PMID: 26260003 DOI: 10.1016/j.jbi.2015.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/31/2015] [Revised: 08/01/2015] [Accepted: 08/01/2015] [Indexed: 11/28/2022]
Abstract
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is an extensive reference terminology with an attendant amount of complexity. It has been updated continuously and revisions have been released semi-annually to meet users' needs and to reflect the results of quality assurance (QA) activities. Two measures based on structural features are proposed to track the effects of both natural terminology growth and QA activities based on aspects of the complexity of SNOMED CT. These two measures, called the structural density measure and accumulated structural measure, are derived based on two abstraction networks, the area taxonomy and the partial-area taxonomy. The measures derive from attribute relationship distributions and various concept groupings that are associated with the abstraction networks. They are used to track the trends in the complexity of structures as SNOMED CT changes over time. The measures were calculated for consecutive releases of five SNOMED CT hierarchies, including the Specimen hierarchy. The structural density measure shows that natural growth tends to move a hierarchy's structure toward a more complex state, whereas the accumulated structural measure shows that QA processes tend to move a hierarchy's structure toward a less complex state. It is also observed that both the structural density and accumulated structural measures are useful tools to track the evolution of an entire SNOMED CT hierarchy and reveal internal concept migration within it.
Affiliation(s)
- Duo Wei
- Computer Science and Information Systems-BUSN, Stockton University, Galloway, NJ 08205, United States.
- Huanying Helen Gu
- Computer Science Dept., New York Institute of Technology, New York, NY 10023, United States
- Yehoshua Perl
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States
- Michael Halper
- Information Technology Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States
- Christopher Ochs
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States
- Gai Elhanan
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States; Halfpenny Technologies Inc., Blue Bell, PA 19422, United States
- Yan Chen
- Computer Information Systems Dept., BMCC, CUNY, New York, NY 10007, United States
|
30
|
Ochs C, Perl Y, Geller J, Haendel M, Brush M, Arabandi S, Tu S. Summarizing and visualizing structural changes during the evolution of biomedical ontologies using a Diff Abstraction Network. J Biomed Inform 2015; 56:127-44. [PMID: 26048076 DOI: 10.1016/j.jbi.2015.05.018] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/25/2014] [Revised: 04/01/2015] [Accepted: 05/27/2015] [Indexed: 10/23/2022]
Abstract
Biomedical ontologies are a critical component in biomedical research and practice. As an ontology evolves, its structure and content change in response to additions, deletions and updates. When editing a biomedical ontology, small local updates may affect large portions of the ontology, leading to unintended and potentially erroneous changes. Such unwanted side effects often go unnoticed since biomedical ontologies are large and complex knowledge structures. Abstraction networks, which provide compact summaries of an ontology's content and structure, have been used to uncover structural irregularities, inconsistencies and errors in ontologies. In this paper, we introduce Diff Abstraction Networks ("Diff AbNs"), compact networks that summarize and visualize global structural changes due to ontology editing operations that result in a new ontology release. A Diff AbN can be used to support curators in identifying unintended and unwanted ontology changes. The derivation of two Diff AbNs, the Diff Area Taxonomy and the Diff Partial-area Taxonomy, is explained and Diff Partial-area Taxonomies are derived and analyzed for the Ontology of Clinical Research, Sleep Domain Ontology, and eagle-i Research Resource Ontology. Diff Taxonomy usage for identifying unintended erroneous consequences of quality assurance and ontology merging are demonstrated.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- James Geller
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Melissa Haendel
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA
- Matthew Brush
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA
- Samson Tu
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA
|
31
|
Halper M, Gu H, Perl Y, Ochs C. Abstraction networks for terminologies: Supporting management of "big knowledge". Artif Intell Med 2015; 64:1-16. [PMID: 25890687 DOI: 10.1016/j.artmed.2015.03.005] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 11/07/2014] [Revised: 02/24/2015] [Accepted: 03/25/2015] [Indexed: 11/16/2022]
Abstract
OBJECTIVE Terminologies and terminological systems have assumed important roles in many medical information processing environments, giving rise to the "big knowledge" challenge when terminological content comprises tens of thousands to millions of concepts arranged in a tangled web of relationships. Use and maintenance of knowledge structures on that scale can be daunting. The notion of abstraction network is presented as a means of facilitating the usability, comprehensibility, visualization, and quality assurance of terminologies. METHODS AND MATERIALS An abstraction network overlays a terminology's underlying network structure at a higher level of abstraction. In particular, it provides a more compact view of the terminology's content, avoiding the display of minutiae. General abstraction network characteristics are discussed. Moreover, the notion of meta-abstraction network, existing at an even higher level of abstraction than a typical abstraction network, is described for cases where even the abstraction network itself represents a case of "big knowledge." Various features in the design of abstraction networks are demonstrated in a methodological survey of some existing abstraction networks previously developed and deployed for a variety of terminologies. RESULTS The applicability of the general abstraction-network framework is shown through use-cases of various terminologies, including the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), the Medical Entities Dictionary (MED), and the Unified Medical Language System (UMLS). Important characteristics of the surveyed abstraction networks are provided, e.g., the magnitude of the respective size reduction referred to as the abstraction ratio. Specific benefits of these alternative terminology-network views, particularly their use in terminology quality assurance, are discussed. Examples of meta-abstraction networks are presented. 
CONCLUSIONS The "big knowledge" challenge constitutes the use and maintenance of terminological structures that comprise tens of thousands to millions of concepts and their attendant complexity. The notion of abstraction network has been introduced as a tool in helping to overcome this challenge, thus enhancing the usefulness of terminologies. Abstraction networks have been shown to be applicable to a variety of existing biomedical terminologies, and these alternative structural views hold promise for future expanded use with additional terminologies.
Affiliation(s)
- Michael Halper
- Information Technology Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
- Huanying Gu
- Computer Science Department, New York Institute of Technology, New York, NY 10023, USA.
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
|
32
|
Ochs C, Geller J, Perl Y, Chen Y, Xu J, Min H, Case JT, Wei Z. Scalable quality assurance for large SNOMED CT hierarchies using subject-based subtaxonomies. J Am Med Inform Assoc 2014; 22:507-18. [PMID: 25336594 DOI: 10.1136/amiajnl-2014-003151] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 07/13/2014] [Accepted: 09/27/2014] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE Standards terminologies may be large and complex, making their quality assurance challenging. Some terminology quality assurance (TQA) methodologies are based on abstraction networks (AbNs), compact terminology summaries. We have tested AbNs and the performance of related TQA methodologies on small terminology hierarchies. However, some standards terminologies, for example, SNOMED, are composed of very large hierarchies. Scaling AbN TQA techniques to such hierarchies poses a significant challenge. We present a scalable subject-based approach for AbN TQA. METHODS An innovative technique is presented for scaling TQA by creating a new kind of subject-based AbN called a subtaxonomy for large hierarchies. New hypotheses about concentrations of erroneous concepts within the AbN are introduced to guide scalable TQA. RESULTS We test the TQA methodology for a subject-based subtaxonomy for the Bleeding subhierarchy in SNOMED's large Clinical finding hierarchy. To test the error concentration hypotheses, three domain experts reviewed a sample of 300 concepts. A consensus-based evaluation identified 87 erroneous concepts. The subtaxonomy-based TQA methodology was shown to uncover statistically significantly more erroneous concepts when compared to a control sample. DISCUSSION The scalability of TQA methodologies is a challenge for large standards systems like SNOMED. We demonstrated innovative subject-based TQA techniques by identifying groups of concepts with a higher likelihood of having errors within the subtaxonomy. Scalability is achieved by reviewing a large hierarchy by subject. CONCLUSIONS An innovative methodology for scaling the derivation of AbNs and a TQA methodology was shown to perform successfully for the largest hierarchy of SNOMED.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- James Geller
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yan Chen
- Computer Information Systems Department, BMCC, CUNY, New York, New York, USA
- Junchuan Xu
- Division of Knowledge Informatics, NYU, New York, New York, USA
- Hua Min
- Department of Health Administration and Policy, George Mason University, Fairfax, Virginia, USA
- Zhi Wei
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
|
33
|
Ochs C, Geller J, Perl Y, Chen Y, Agrawal A, Case JT, Hripcsak G. A tribal abstraction network for SNOMED CT target hierarchies without attribute relationships. J Am Med Inform Assoc 2014; 22:628-39. [PMID: 25332354 DOI: 10.1136/amiajnl-2014-003173] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 07/18/2014] [Accepted: 09/20/2014] [Indexed: 11/03/2022] Open
Abstract
OBJECTIVE Large and complex terminologies, such as Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT), are prone to errors and inconsistencies. Abstraction networks are compact summarizations of the content and structure of a terminology. Abstraction networks have been shown to support terminology quality assurance. In this paper, we introduce an abstraction network derivation methodology which can be applied to SNOMED CT target hierarchies whose classes are defined using only hierarchical relationships (ie, without attribute relationships) and similar description-logic-based terminologies. METHODS We introduce the tribal abstraction network (TAN), based on the notion of a tribe (a subhierarchy rooted at a child of a hierarchy root), assuming only the existence of concepts with multiple parents. The TAN summarizes a hierarchy that does not have attribute relationships using sets of concepts, called tribal units, that belong to exactly the same multiple tribes. Tribal units are further divided into refined tribal units which contain closely related concepts. A quality assurance methodology that utilizes TAN summarizations is introduced. RESULTS A TAN is derived for the Observable entity hierarchy of SNOMED CT, summarizing its content. A TAN-based quality assurance review of the concepts of the hierarchy is performed, and erroneous concepts are shown to appear more frequently in large refined tribal units than in small refined tribal units. Furthermore, more erroneous concepts appear in large refined tribal units of more tribes than of fewer tribes. CONCLUSIONS In this paper we introduce the TAN for summarizing SNOMED CT target hierarchies. A TAN was derived for the Observable entity hierarchy of SNOMED CT. A quality assurance methodology utilizing the TAN was introduced and demonstrated.
Affiliation(s)
- Christopher Ochs
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- James Geller
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yehoshua Perl
- Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yan Chen
- Computer Information Systems Department, BMCC, CUNY, New York, New York, USA
- Ankur Agrawal
- Department of Computer Science, Manhattan College, Riverdale, New York, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
|
34
|
He Z, Ochs C, Agrawal A, Perl Y, Zeginis D, Tarabanis K, Elhanan G, Halper M, Noy N, Geller J. A family-based framework for supporting quality assurance of biomedical ontologies in BioPortal. AMIA Annu Symp Proc 2013; 2013:581-590. [PMID: 24551360 PMCID: PMC3900201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
BioPortal contains over 300 ontologies, for which quality assurance (QA) is critical. Abstraction networks (ANs), compact summarizations of ontology structure and content, have been used in such QA efforts, typically in a "one-off" manner for a single ontology. Ontologies can be characterized, independently of knowledge-content focus, from a structural standpoint, leading to the formulation of ontology families. A family is defined as a set of ontologies satisfying some overarching condition regarding their structural features. Seven such families, comprising 186 ontologies, are identified. To increase efficiency, a new family-based QA framework is introduced in which an automated, uniform AN derivation technique and accompanying semi-automated, uniform QA regimen are applicable to the ontologies of a given family. Specifically, across an entire family, the QA efforts exploit family-wide AN features in the characterization of sets of classes that are more likely to harbor errors. The approach is demonstrated on the Cancer Chemoprevention BioPortal ontology.
Affiliation(s)
- Zhe He
- New Jersey Institute of Technology, Newark, NJ
|
35
|
Agrawal A, Perl Y, Chen Y, Elhanan G, Liu M. Identifying inconsistencies in SNOMED CT problem lists using structural indicators. AMIA Annu Symp Proc 2013; 2013:17-26. [PMID: 24551319 PMCID: PMC3900119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
The National Library of Medicine has published the CORE and the VA/KP problem lists to facilitate the usage of SNOMED CT for encoding diagnoses and clinical data of patients in electronic health records. Therefore, it is essential for the content of the problem lists to be as accurate and consistent as possible. This study assesses the effectiveness of using a concept's word length and number of parents, two structural indicators for measuring concept complexity, to identify inconsistencies with high probability. The method is able to isolate a group of concepts of which over 40% are expected to be erroneous. A structural indicator for concepts which is able to identify 52% of the examined concepts as having errors in synonyms is also presented. The results demonstrate that the concepts in problem lists are not free of inconsistencies and further quality assurance is needed to improve the quality of these concepts.
Affiliation(s)
- Yan Chen
- Borough of Manhattan Community College, New York, NY
- Mei Liu
- New Jersey Institute of Technology, Newark, NJ
|
36
|
Ochs C, Perl Y, Geller J, Halper M, Gu H, Chen Y, Elhanan G. Scalability of abstraction-network-based quality assurance to large SNOMED hierarchies. AMIA Annu Symp Proc 2013; 2013:1071-1080. [PMID: 24551393 PMCID: PMC3900129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Abstraction networks are compact summarizations of terminologies used to support orientation and terminology quality assurance (TQA). Area taxonomies and partial-area taxonomies are abstraction networks that have been successfully employed in support of TQA of small SNOMED CT hierarchies. However, nearly half of SNOMED CT's concepts are in the large Procedure and Clinical Finding hierarchies. Abstraction network derivation methodologies applied to those hierarchies resulted in taxonomies that were too large to effectively support TQA. A methodology for deriving sub-taxonomies from large taxonomies is presented, and the resultant smaller abstraction networks are shown to facilitate TQA, allowing for the scaling of our taxonomy-based TQA regimen to large hierarchies. Specifically, sub-taxonomies are derived for the Procedure hierarchy and a review for errors and inconsistencies is performed. Concepts are divided into groups within the sub-taxonomy framework, and it is shown that small groups are statistically more likely to harbor erroneous and inconsistent concepts than large groups.
Affiliation(s)
- Huanying Gu
- New York Institute of Technology, New York, NY
- Gai Elhanan
- New Jersey Institute of Technology, Newark, NJ
|
37
|
Agrawal A, He Z, Perl Y, Wei D, Halper M, Elhanan G, Chen Y. The readiness of SNOMED problem list concepts for meaningful use of electronic health records. Artif Intell Med 2013; 58:73-80. [PMID: 23602702 DOI: 10.1016/j.artmed.2013.03.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 05/04/2012] [Revised: 03/05/2013] [Accepted: 03/17/2013] [Indexed: 11/24/2022]
Abstract
OBJECTIVE By 2015, SNOMED CT (SCT) will become the USA's standard for encoding diagnoses and problem lists in electronic health records (EHRs). To facilitate this effort, the National Library of Medicine has published the "SCT Clinical Observations Recording and Encoding" and the "Veterans Health Administration and Kaiser Permanente" problem lists (collectively, the "PL"). The PL is studied in regard to its readiness to support meaningful use of EHRs. In particular, we wish to determine if inconsistencies appearing in SCT, in general, occur as frequently in the PL, and whether further quality-assurance (QA) efforts on the PL are required. METHODS AND MATERIALS A study is conducted where two random samples of SCT concepts are compared. The first consists of concepts strictly from the PL and the second contains general SCT concepts distributed proportionally to the PL's in terms of their hierarchies. Each sample is analyzed for its percentage of primitive concepts and for frequency of modeling errors of various severity levels as quality measures. A simple structural indicator, namely, the number of parents, is suggested to locate high likelihood inconsistencies in hierarchical relationships. The effectiveness of this indicator is evaluated. RESULTS PL concepts are found to be slightly better than other concepts in the respective SCT hierarchies with regards to the quality measure of the percentage of primitive concepts and the frequency of modeling errors. There were 58% primitive concepts in the PL sample versus 62% in the control sample. The structural indicator of number of parents is shown to be statistically significant in its ability to identify concepts having a higher likelihood of inconsistencies in their hierarchical relationships. The absolute number of errors in the group of concepts having 1-3 parents was shown to be significantly lower than that for concepts with 4-6 parents and those with 7 or more parents based on Chi-squared analyses. 
CONCLUSION PL concepts suffer from the same issues as general SCT concepts, although to a slightly lesser extent, and do require further QA efforts to promote meaningful use of EHRs. To support such efforts, a structural indicator is shown to effectively ferret out potentially problematic concepts where those QA efforts should be focused.
Affiliation(s)
- Ankur Agrawal
- Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
|
38
|
Agrawal A, Perl Y, Elhanan G. Identifying problematic concepts in SNOMED CT using a lexical approach. Stud Health Technol Inform 2013; 192:773-777. [PMID: 23920662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
SNOMED CT (SCT) has been endorsed as a premier clinical terminology by many organizations with a perceived use within electronic health records and clinical information systems. However, there are indications that, at the moment, SCT is not optimally structured for its intended use by healthcare practitioners. A study is conducted to investigate the extent of inconsistencies among the concepts in SCT. A group auditing technique to improve the quality of SCT is introduced that can help identify problematic concepts with a high probability. Positional similarity sets are defined as groups of lexically similar concepts in which the position of the differing word in the fully specified names of the concepts corresponds across the set. A manual audit of a sample of such sets found that 38% of the sets exhibited one or more inconsistent concepts. Group auditing techniques such as this can thus be very helpful to assure the quality of SCT, which will help expedite its adoption as a reference terminology for clinical purposes.
Affiliation(s)
- Ankur Agrawal
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
|
39
|
Geller J, Ochs C, Perl Y, Xu J. New abstraction networks and a new visualization tool in support of auditing the SNOMED CT content. AMIA Annu Symp Proc 2012; 2012:237-246. [PMID: 23304293 PMCID: PMC3540556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Medical terminologies are large and complex. Frequently, errors are hidden in this complexity. Our objective is to find such errors, which can be aided by deriving abstraction networks from a large terminology. Abstraction networks preserve important features but eliminate many minor details, which are often not useful for identifying errors. Providing visualizations for such abstraction networks aids auditors by allowing them to quickly focus on elements of interest within a terminology. Previously we introduced area taxonomies and partial area taxonomies for SNOMED CT. In this paper, two advanced, novel kinds of abstraction networks, the relationship-constrained partial area subtaxonomy and the root-constrained partial area subtaxonomy are defined and their benefits are demonstrated. We also describe BLUSNO, an innovative software tool for quickly generating and visualizing these SNOMED CT abstraction networks. BLUSNO is a dynamic, interactive system that provides quick access to well organized information about SNOMED CT.
Affiliation(s)
- James Geller
- New Jersey Institute of Technology, Newark, NJ, USA
40
Ochs C, Agrawal A, Perl Y, Halper M, Tu SW, Carini S, Sim I, Noy N, Musen M, Geller J. Deriving an abstraction network to support quality assurance in OCRe. AMIA Annu Symp Proc 2012; 2012:681-689. [PMID: 23304341 PMCID: PMC3540580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
An abstraction network is an auxiliary network of nodes and links that provides a compact, high-level view of an ontology. Such a view lends support to ontology orientation, comprehension, and quality-assurance efforts. A methodology is presented for deriving a kind of abstraction network, called a partial-area taxonomy, for the Ontology of Clinical Research (OCRe). OCRe was selected as a representative of ontologies implemented using the Web Ontology Language (OWL) based on shared domains. The derivation of the partial-area taxonomy for the Entity hierarchy of OCRe is described. Utilizing the visualization of the content and structure of the hierarchy provided by the taxonomy, the Entity hierarchy is audited, and several errors and inconsistencies in OCRe's modeling of its domain are exposed. After appropriate corrections are made to OCRe, a new partial-area taxonomy is derived. The generalizability of the paradigm of the derivation methodology to various families of biomedical ontologies is discussed.
41
Gu HH, Elhanan G, Perl Y, Hripcsak G, Cimino JJ, Xu J, Chen Y, Geller J, Paul Morrey C. A study of terminology auditors' performance for UMLS semantic type assignments. J Biomed Inform 2012; 45:1042-8. [PMID: 22687822 DOI: 10.1016/j.jbi.2012.05.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 02/20/2012] [Revised: 05/26/2012] [Accepted: 05/31/2012] [Indexed: 11/30/2022]
Abstract
Auditing healthcare terminologies for errors requires human experts. In this paper, we present a study of the performance of auditors looking for errors in the semantic type assignments of complex UMLS concepts. In this study, concepts are considered complex whenever they are assigned combinations of semantic types. Past research has shown that complex concepts have a higher likelihood of errors. The results of this study indicate that individual auditors are not reliable when auditing such concepts and their performance is low, according to various metrics. These results confirm the outcomes of an earlier pilot study. They imply that to achieve an acceptable level of reliability and performance, when auditing such concepts of the UMLS, several auditors need to be assigned the same task. A mechanism is then needed to combine the possibly differing opinions of the different auditors into a final determination. In the current study, in contrast to our previous work, we used a majority mechanism for this purpose. For a sample of 232 complex UMLS concepts, the majority opinion was found reliable and its performance for accuracy, recall, precision and the F-measure was found statistically significantly higher than the average performance of individual auditors.
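The majority mechanism and the reported metrics can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's procedure: a strict majority (more than half of the auditors) is assumed, tie handling is not described in the abstract, and the votes and reference standard below are invented.

```python
def majority_vote(votes):
    """Return True when more than half of the auditors flagged an error (assumed rule)."""
    return sum(votes) > len(votes) / 2


def precision_recall_f(predicted, reference):
    """Compute precision, recall and F-measure of predictions against a reference standard."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f_measure


# Invented data: one row per concept, one vote per auditor (True = "error found").
auditor_votes = [
    [True, True, False],
    [False, True, False],
    [True, False, True],
    [False, False, False],
]
reference_standard = [True, True, True, False]

combined = [majority_vote(votes) for votes in auditor_votes]
precision, recall, f_measure = precision_recall_f(combined, reference_standard)
```

The same `precision_recall_f` function can be applied to each auditor's individual votes to compare single-auditor performance with the combined opinion.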
42
Morrey CP, Perl Y, Halper M, Chen L, Gu HH. A chemical specialty semantic network for the Unified Medical Language System. J Cheminform 2012; 4:9. [PMID: 22577759 PMCID: PMC3428652 DOI: 10.1186/1758-2946-4-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/08/2012] [Accepted: 05/11/2012] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Terms representing chemical concepts found in the Unified Medical Language System (UMLS) are used to derive an expanded semantic network with mutually exclusive semantic types. The UMLS Semantic Network (SN) is composed of a collection of broad categories called semantic types (STs) that are assigned to concepts. Within the UMLS's coverage of the chemical domain, we find a great many concepts being assigned more than one ST. This leads to the situation where the extent of a given ST may contain concepts elaborating variegated semantics. A methodology for expanding the chemical subhierarchy of the SN into a finer-grained categorization of mutually exclusive types with semantically uniform extents is presented. We call this network a Chemical Specialty Semantic Network (CSSN). A CSSN is derived automatically from the existing chemical STs and their assignments. The methodology incorporates a threshold value governing the minimum size of a type's extent needed for inclusion in the CSSN. Thus, different CSSNs can be created by choosing different threshold values based on varying requirements. RESULTS A complete CSSN is derived using a threshold value of 300 and having 68 STs. It is used effectively to provide high-level categorizations for a random sample of compounds from the "Chemical Entities of Biological Interest" (ChEBI) ontology. The effect on the size of the CSSN using various threshold parameter values between one and 500 is shown. CONCLUSIONS The methodology has several potential applications, including its use to derive a pre-coordinated guide for ST assignments to new UMLS chemical concepts, as a tool for auditing existing concepts, inter-terminology mapping, and to serve as an upper-level network for ChEBI.
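The role of the threshold can be illustrated with a small Python sketch. It rests on an assumption that is not confirmed by the abstract: that candidate types correspond to the exact combination of STs assigned to a concept, so that their extents are mutually exclusive. The abstract does not say how concepts in below-threshold extents are handled, so the sketch simply leaves them out. All names and data are hypothetical.

```python
from collections import defaultdict


def derive_types(assignments, threshold):
    """Group concepts by their exact ST combination and keep extents of at least `threshold` concepts."""
    extents = defaultdict(set)
    for concept, semantic_types in assignments.items():
        # Concepts with the same combination of STs share one candidate type,
        # so no concept can fall into two extents.
        extents[frozenset(semantic_types)].add(concept)
    return {combo: extent for combo, extent in extents.items() if len(extent) >= threshold}


# Hypothetical assignments: concept -> set of assigned semantic types.
toy_assignments = {
    "c1": {"A"},
    "c2": {"A"},
    "c3": {"A", "B"},
    "c4": {"B"},
    "c5": {"A", "B"},
}
types_at_2 = derive_types(toy_assignments, threshold=2)
types_at_1 = derive_types(toy_assignments, threshold=1)
```

Raising the threshold shrinks the resulting network, which mirrors the abstract's observation that different threshold values yield different CSSNs.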
Affiliation(s)
- C Paul Morrey
- Department of Information Systems and Technology, Utah Valley University, 800 West University Parkway, Orem, UT 84058, USA
- Yehoshua Perl
- Structural Analysis of Biomedical Ontologies Center, Department of Computer Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
- Michael Halper
- Information Technology Program, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
- Ling Chen
- Department of Science, Borough of Manhattan Community College, City University of New York, 199 Chambers Street, New York, NY 10007, USA
- Huanying “Helen” Gu
- Department of Computer Science, New York Institute of Technology, 1855 Broadway, New York, NY 10023, USA
43
Wang Y, Halper M, Wei D, Gu H, Perl Y, Xu J, Elhanan G, Chen Y, Spackman KA, Case JT, Hripcsak G. Auditing complex concepts of SNOMED using a refined hierarchical abstraction network. J Biomed Inform 2012; 45:1-14. [PMID: 21907827 PMCID: PMC3313651 DOI: 10.1016/j.jbi.2011.08.016] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 04/18/2011] [Revised: 08/25/2011] [Accepted: 08/26/2011] [Indexed: 10/17/2022]
Abstract
Auditors of a large terminology, such as SNOMED CT, face a daunting challenge. To aid them in their efforts, it is essential to devise techniques that can automatically identify concepts warranting special attention. "Complex" concepts, which by their very nature are more difficult to model, fall neatly into this category. A special kind of grouping, called a partial-area, is utilized in the characterization of complex concepts. In particular, the complex concepts that are the focus of this work are those appearing in intersections of multiple partial-areas and are thus referred to as overlapping concepts. In a companion paper, an automatic methodology for identifying and partitioning the entire collection of overlapping concepts into disjoint, singly-rooted groups, that are more manageable to work with and comprehend, has been presented. The partitioning methodology formed the foundation for the development of an abstraction network for the overlapping concepts called a disjoint partial-area taxonomy. This new disjoint partial-area taxonomy offers a collection of semantically uniform partial-areas and is exploited herein as the basis for a novel auditing methodology. The review of the overlapping concepts is done in a top-down order within semantically uniform groups. These groups are themselves reviewed in a top-down order, which proceeds from the less complex to the more complex overlapping concepts. The results of applying the methodology to SNOMED's Specimen hierarchy are presented. Hypotheses regarding error ratios for overlapping concepts and between different kinds of overlapping concepts are formulated. Two phases of auditing the Specimen hierarchy for two releases of SNOMED are reported on. With the use of the double bootstrap and Fisher's exact test (two-tailed), the auditing of concepts and especially roots of overlapping partial-areas is shown to yield a statistically significant higher proportion of errors.
Affiliation(s)
- Yue Wang
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Michael Halper
- Information Technology Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Duo Wei
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Huanying Gu
- Computer Science Dept., New York Institute of Technology, New York, NY 10023 USA
- Yehoshua Perl
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Junchuan Xu
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Gai Elhanan
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Halfpenny Technologies, Inc., Blue Bell, PA 19422 USA
- Yan Chen
- Computer Information Systems Dept., BMCC, CUNY, New York, NY 10007 USA
- George Hripcsak
- Dept. of Biomedical Informatics, Columbia University, New York, NY 10032 USA
44
Chen Y, Gu H, Perl Y, Geller J. Overcoming an obstacle in expanding a UMLS semantic type extent. J Biomed Inform 2012; 45:61-70. [PMID: 21925287 PMCID: PMC3272131 DOI: 10.1016/j.jbi.2011.08.021] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/28/2011] [Revised: 08/30/2011] [Accepted: 08/31/2011] [Indexed: 11/25/2022]
Abstract
This paper strives to overcome a major problem encountered by a previous expansion methodology for discovering concepts highly likely to be missing a specific semantic type assignment in the UMLS. This methodology is the basis for an algorithm that presents the discovered concepts to a human auditor for review and possible correction. We analyzed the problem of the previous expansion methodology and discovered that it was due to an obstacle constituted by one or more concepts assigned the UMLS Semantic Network semantic type Classification. A new methodology was designed that bypasses such an obstacle without a combinatorial explosion in the number of concepts presented to the human auditor for review. The new expansion methodology with obstacle avoidance was tested with the semantic type Experimental Model of Disease and found over 500 concepts missed by the previous methodology that are in need of this semantic type assignment. Furthermore, other semantic types suffering from the same major problem were discovered, indicating that the methodology is of more general applicability. The algorithmic discovery of concepts that are likely missing a semantic type assignment is possible even in the face of obstacles, without an explosion in the number of processed concepts.
Affiliation(s)
- Yan Chen
- CIS Department, Borough of Manhattan Community College, CUNY, 199 Chambers Street, New York, NY 10007, USA.
45
Wang Y, Halper M, Wei D, Perl Y, Geller J. Abstraction of complex concepts with a refined partial-area taxonomy of SNOMED. J Biomed Inform 2012; 45:15-29. [PMID: 21878396 PMCID: PMC3313654 DOI: 10.1016/j.jbi.2011.08.013] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Received: 06/20/2010] [Revised: 08/22/2011] [Accepted: 08/23/2011] [Indexed: 10/17/2022]
Abstract
An algorithmically-derived abstraction network, called the partial-area taxonomy, for a SNOMED hierarchy has led to the identification of concepts considered complex. The designation "complex" is arrived at automatically on the basis of structural analyses of overlap among the constituent concept groups of the partial-area taxonomy. Such complex concepts, called overlapping concepts, constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy's content. A new methodology for partitioning the entire collection of overlapping concepts into singly-rooted groups, that are more manageable to work with and comprehend, is presented. Different kinds of overlapping concepts with varying degrees of complexity are identified. This leads to an abstract model of the overlapping concepts called the disjoint partial-area taxonomy, which serves as a vehicle for enhanced, high-level display. The methodology is demonstrated with an application to SNOMED's Specimen hierarchy. Overall, the resulting disjoint partial-area taxonomy offers a refined view of the hierarchy's structural organization and conceptual content that can aid users, such as maintenance personnel, working with SNOMED. The utility of the disjoint partial-area taxonomy as the basis for a SNOMED auditing regimen is presented in a companion paper.
Affiliation(s)
- Yue Wang
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, USA
- Michael Halper
- Information Technology Dept., New Jersey Institute of Technology, Newark, NJ 07102, USA
- Duo Wei
- Computer Science and Information Systems, School of Business, The Richard Stockton College of New Jersey, Galloway, NJ 08205 USA
- Yehoshua Perl
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, USA
- James Geller
- Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, USA
46
He Z, Halper M, Perl Y, Elhanan G. Clinical Clarity versus Terminological Order - The Readiness of SNOMED CT Concept Descriptors for Primary Care. MIXHS 12 (2012) 2012; 2012:1-6. [PMID: 26870837 DOI: 10.1145/2389672.2389674] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]
Abstract
As SNOMED usage becomes more ingrained within applications, its range of concept descriptors, and particularly its synonym adequacy, becomes more important. A simulated clinical scenario involving various term-based concept searches is used to assess whether SNOMED's concept descriptors provide sufficient differentiation to enable possible concept selection between similar terms. Four random samples from different SNOMED concept populations are utilized. Of particular interest are concepts mapped duplicately into UMLS concepts due to shared term patterns. While overall synonym problems are rare (1%), some concept populations exhibited a high rate of potential problems for clinical use (17-62%). The vast majority of issues are due to SNOMED's inherent structure and fine granularity. Many findings hint at a lack of clear delineation between reference and interface terminological qualities. Closer attention should be given to practical clinical use-case scenarios. Reducing SNOMED's structural complexity may alleviate many of the described findings and encourage clinical adoption.
Affiliation(s)
- Zhe He
- Computer Science Dept., NJIT Newark, NJ 07102 1-973-596-2867
- Michael Halper
- Information Technology Department, NJIT Newark, NJ 07102 1-973-596-5752
- Yehoshua Perl
- Computer Science Dept., NJIT Newark, NJ 07102 1-973-596-2867
- Gai Elhanan
- Halfpenny Technologies, Inc. Blue Bell, PA 19422 1-347-443-9741
47
Elhanan G, Perl Y, Geller J. A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality. J Am Med Inform Assoc 2011; 18 Suppl 1:i36-44. [PMID: 21836159 PMCID: PMC3241171 DOI: 10.1136/amiajnl-2011-000341] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 04/27/2011] [Accepted: 07/11/2011] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE Little information exists concerning SNOMED CT (systematized nomenclature of medicine-clinical terms) users. This report describes current impressions and preferences of direct SNOMED CT users regarding coverage, quality, and concept details, and the change request mechanism. DESIGN A 43-question anonymous survey distributed electronically to relevant online communities. MEASUREMENTS Data on user demographic characteristics, modes and purposes of use, means and frequencies of access, satisfaction with SNOMED CT content coverage and quality and with the change request mechanism were recorded. RESULTS The survey was conducted in January 2010 and elicited 215 responses. Details regarding users' profiles, modes of use and access were reported elsewhere. The coverage of SNOMED CT was perceived to be at least 85% complete by 42% of responders, and 60% were at least satisfied with its quality. Various deficiencies were encountered at least 'somewhat often' by 28-61% of responders. Incorrect data were more bothersome than missing data. Users indicated that significant resources should be allocated to more consistent and complete conceptual representations and to further enhance content coverage. Enhanced synonym coverage and the introduction of textual definitions were important to users (54% and 63%, respectively). LIMITATIONS A survey format with limited control over recruitment and selection bias. Lack of information regarding the SNOMED CT version used by responders. CONCLUSION Despite overall satisfaction, direct users indicated a strong desire to improve consistency, quality, and completeness of conceptual representations and concept details, as well as a continued desire to expand coverage. The survey provides much needed data for informed decisions regarding the use and development goals of SNOMED CT. Focused periodical surveys are warranted.
Affiliation(s)
- Gai Elhanan
- Department of Computer Science, New Jersey Institute of Technology (NJIT), Newark, New Jersey 07102-1982, USA.
48
Halper M, Morrey CP, Chen Y, Elhanan G, Hripcsak G, Perl Y. Auditing hierarchical cycles to locate other inconsistencies in the UMLS. AMIA Annu Symp Proc 2011; 2011:529-36. [PMID: 22195107 PMCID: PMC3243212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
A cycle in the parent relationship hierarchy of the UMLS is a configuration that effectively makes some concept(s) an ancestor of itself. Such a structural inconsistency can easily be found automatically. A previous strategy for disconnecting cycles is to break them with the deletion of one or more parent relationships, irrespective of the correctness of the deleted relationships. A methodology is introduced for auditing of cycles that seeks to discover and delete erroneous relationships only. Cycles involving three concepts are the primary consideration. Hypotheses about the high probability of locating an erroneous parent relationship in a cycle are proposed and confirmed with statistical confidence and lend credence to the auditing approach. A cycle may serve as an indicator of other non-structural inconsistencies that are otherwise difficult to detect automatically. An extensive auditing example shows how a cycle can indicate further inconsistencies.
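The automatic detection step can be sketched in Python for the three-concept case. This is a minimal illustration, assuming the hierarchy is available as a mapping from each concept to the set of its parents; the function name and the toy data are hypothetical, and the sketch only finds cycles. Deciding which relationship in a cycle is erroneous remains the auditor's task.

```python
def three_concept_cycles(parents):
    """Return the sets of three distinct concepts that form a cycle of parent links."""
    cycles = set()
    for a, a_parents in parents.items():
        for b in a_parents:
            for c in parents.get(b, ()):
                # a -> b -> c -> a closes a cycle when a is a parent of c.
                if a in parents.get(c, ()) and len({a, b, c}) == 3:
                    cycles.add(frozenset((a, b, c)))
    return cycles


# Toy hierarchy: concept -> set of parent concepts.
toy_parents = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A"},
    "D": {"A"},
}
found_cycles = three_concept_cycles(toy_parents)
```

Here "D" is a descendant of the cycle but not part of it, so it is not reported.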
49
Morrey CP, Chen L, Halper M, Perl Y. Resolution of redundant semantic type assignments for organic chemicals in the UMLS. Artif Intell Med 2011; 52:141-51. [PMID: 21646001 DOI: 10.1016/j.artmed.2011.05.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 10/29/2009] [Revised: 05/03/2011] [Accepted: 05/09/2011] [Indexed: 11/27/2022]
Abstract
OBJECTIVE The Unified Medical Language System (UMLS) integrates terms from different sources into concepts and supplements these with the assignment of one or more high-level semantic types (STs) from its Semantic Network (SN). For a composite organic chemical concept, multiple assignments of organic chemical STs often serve to enumerate the types of the composite's underlying chemical constituents. This practice sometimes leads to the introduction of a forbidden redundant ST assignment, where both an ST and one of its descendants are assigned to the same concept. A methodology for resolving redundant ST assignments for organic chemicals, better capturing the essence of such composite chemicals than the typical omission of the more general ST, is presented. MATERIALS AND METHODS The typical SN resolution of a redundant ST assignment is to retain only the more specific ST assignment and omit the more general one. However, with organic chemicals, that is not always the correct strategy. A methodology for properly dealing with the redundancy based on the relative sizes of the chemical components is presented. It is more accurate to use the ST of the larger chemical component for capturing the category of the concept, even if that means using the more general ST. RESULTS A sample of 254 chemical concepts having redundant ST assignments in older UMLS releases was audited to analyze the accuracy of current ST assignments. For 81 (32%) of them, our chemical analysis-based approach yielded a different recommendation from the UMLS (2009AA). New UMLS usage notes capturing rules of this methodology are proffered. CONCLUSIONS Redundant ST assignments have typically arisen for organic composite chemical concepts. A methodology for dealing with this kind of erroneous configuration, capturing the proper category for a composite chemical, is presented and demonstrated.
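The definition of a redundant ST assignment (an ST and one of its descendants assigned to the same concept) lends itself to a simple check, sketched here in Python. The sketch assumes a tree-shaped, acyclic is-a hierarchy given as a child-to-parent mapping; the type names are placeholders, not actual SN semantic types. It only detects the redundancy and does not implement the paper's size-based resolution.

```python
def ancestors(semantic_type, parent_of):
    """Return all strict ancestors of a type in a tree-shaped is-a hierarchy."""
    found = set()
    while semantic_type in parent_of:
        semantic_type = parent_of[semantic_type]
        found.add(semantic_type)
    return found


def redundant_pairs(assigned, parent_of):
    """Return (general, specific) pairs where both types are assigned to one concept."""
    return sorted(
        (general, specific)
        for specific in assigned
        for general in ancestors(specific, parent_of)
        if general in assigned
    )


# Placeholder hierarchy: child type -> parent type.
toy_parent_of = {
    "T_specific": "T_general",
    "T_general": "T_root",
    "T_other": "T_root",
}
flagged = redundant_pairs({"T_general", "T_specific"}, toy_parent_of)
not_flagged = redundant_pairs({"T_specific", "T_other"}, toy_parent_of)
```

A flagged pair marks a concept for review; whether to keep the general or the specific type is the judgment the abstract's methodology addresses.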
Affiliation(s)
- C Paul Morrey
- Information Systems, New York-Presbyterian Hospital, NY 10030, United States.
50
Huang KC, Geller J, Elhanan G, Perl Y, Halper M. Auditing SNOMED Integration into the UMLS for Duplicate Concepts. AMIA Annu Symp Proc 2010; 2010:321-325. [PMID: 21346993 PMCID: PMC3041353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
The UMLS contains terms from many sources. Every update of a source requires reintegration. Each new term needs to be assigned to a preexisting UMLS concept, or a new concept must be created. Whenever the integration process unnecessarily creates a new concept, this is undesirable. We report on a method to detect such undesirable duplicate concepts. Terms are removed from the UMLS and reintegrated using "piecewise synonym generation." The concept of the reintegrated term is programmatically compared to the initial concept of the term (before removal). If they are different, this indicates an error, either in the integration process or in the initial concept. Thus, such a term-concept pair is deemed suspicious. A study of five hierarchies of the SNOMED found 7.7% suspicious matches. A human expert needs to evaluate the correctness of suspicious concepts. In a sample of 149 of those, 19% of concepts were found to be duplicates.