1
|
Li J, Li Y, Pan Y, Guo J, Sun Z, Li F, He Y, Tao C. Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models. J Biomed Semantics 2024; 15:14. [PMID: 39123237 PMCID: PMC11316402 DOI: 10.1186/s13326-024-00318-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2023] [Accepted: 07/31/2024] [Indexed: 08/12/2024] Open
Abstract
BACKGROUND Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. CLINICALTRIALS gov is a valuable repository of clinical trial information, but the vaccine data in them lacks standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. RESULTS In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the Top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% on top 1 candidate's accuracy and 90.0% on top 10 candidate's accuracy. CONCLUSION This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
Collapse
Affiliation(s)
- Jianfu Li
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
| | - Yiming Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
| | - Yuanyi Pan
- Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
| | - Jinjing Guo
- Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
| | - Zenan Sun
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
| | - Fang Li
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
| | - Yongqun He
- Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA.
| | - Cui Tao
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA.
| |
Collapse
|
2
|
Shyu R, Weng C. Enabling Semantic Topic Modeling on Twitter Using MetaMap. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2024; 2024:670-678. [PMID: 38827089 PMCID: PMC11141808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/04/2024]
Abstract
Topic modeling performs poorly on short phrases or sentences and ever-changing slang, which are common in social media, such as X, formerly known as Twitter. This study investigates whether concept annotation tools such as MetaMap can enable topic modeling at the semantic level. Using tweets mentioning "hydroxychloroquine" for a case study, we extracted 56,017 posted between 03/01/2020-12/31/2021. The tweets were run through MetaMap to encode concepts with UMLS Concept Unique Identifiers (CUIs) and then we used Latent Dirichlet Allocation (LDA) to identify the optimal model for two datasets: 1) tweets with the original text and 2) tweets with the replaced CUIs. We found that the MetaMap LDA models outperformed the non-MetaMap models in terms of coherence and representativeness and identified topics timely relevant to social and political discussions. We concluded that integrating MetaMap to standardize tweets through UMLS concepts improved semantic topic modeling performance amidst noise in the text.
Collapse
Affiliation(s)
- Rebecca Shyu
- Department of Biomedical Informatics, Columbia University, New York, New York
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York
| |
Collapse
|
3
|
Riepenhausen S, Blumenstock M, Niklas C, Hegselmann S, Neuhaus P, Meidt A, Püttmann C, Storck M, Ganzinger M, Varghese J, Dugas M. Europe's Largest Research Infrastructure for Curated Medical Data Models with Semantic Annotations. Methods Inf Med 2024. [PMID: 38740374 DOI: 10.1055/s-0044-1786839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/16/2024]
Abstract
BACKGROUND Structural metadata from the majority of clinical studies and routine health care systems is currently not yet available to the scientific community. OBJECTIVE To provide an overview of available contents in the Portal of Medical Data Models (MDM Portal). METHODS The MDM Portal is a registered European information infrastructure for research and health care, and its contents are curated and semantically annotated by medical experts. It enables users to search, view, discuss, and download existing medical data models. RESULTS The most frequent keyword is "clinical trial" (n = 18,777), and the most frequent disease-specific keyword is "breast neoplasms" (n = 1,943). Most data items are available in English (n = 545,749) and German (n = 109,267). Manually curated semantic annotations are available for 805,308 elements (554,352 items, 58,101 item groups, and 192,855 code list items), which were derived from 25,257 data models. In total, 1,609,225 Unified Medical Language System (UMLS) codes have been assigned, with 66,373 unique UMLS codes. CONCLUSION To our knowledge, the MDM Portal constitutes Europe's largest collection of medical data models with semantically annotated elements. As such, it can be used to increase compatibility of medical datasets and can be utilized as a large expert-annotated medical text corpus for natural language processing.
Collapse
Affiliation(s)
- Sarah Riepenhausen
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Max Blumenstock
- Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
| | - Christian Niklas
- Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
| | - Stefan Hegselmann
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Philipp Neuhaus
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Alexandra Meidt
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Cornelia Püttmann
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Michael Storck
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Matthias Ganzinger
- Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
| | - Julian Varghese
- Institute of Medical Informatics, University of Münster, Münster, Nordrhein-Westfalen, Germany
| | - Martin Dugas
- Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
- European Research Center for Information Systems (ERCIS), Münster, Nordrhein-Westfalen, Germany
| |
Collapse
|
4
|
Zhang X, Feng Y, Li F, Ding J, Tahseen D, Hinojosa E, Chen Y, Tao C. Evaluating MedDRA-to-ICD terminology mappings. BMC Med Inform Decis Mak 2024; 23:299. [PMID: 38326827 PMCID: PMC10851449 DOI: 10.1186/s12911-023-02375-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2022] [Accepted: 11/14/2023] [Indexed: 02/09/2024] Open
Abstract
BACKGROUND In this era of big data, data harmonization is an important step to ensure reproducible, scalable, and collaborative research. Thus, terminology mapping is a necessary step to harmonize heterogeneous data. Take the Medical Dictionary for Regulatory Activities (MedDRA) and International Classification of Diseases (ICD) for example, the mapping between them is essential for drug safety and pharmacovigilance research. Our main objective is to provide a quantitative and qualitative analysis of the mapping status between MedDRA and ICD. We focus on evaluating the current mapping status between MedDRA and ICD through the Unified Medical Language System (UMLS) and Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). We summarized the current mapping statistics and evaluated the quality of the current MedDRA-ICD mapping; for unmapped terms, we used our self-developed algorithm to rank the best possible mapping candidates for additional mapping coverage. RESULTS The identified MedDRA-ICD mapped pairs cover 27.23% of the overall MedDRA preferred terms (PT). The systematic quality analysis demonstrated that, among the mapped pairs provided by UMLS, only 51.44% are considered an exact match. For the 2400 sampled unmapped terms, 56 of the 2400 MedDRA Preferred Terms (PT) could have exact match terms from ICD. CONCLUSION Some of the mapped pairs between MedDRA and ICD are not exact matches due to differences in granularity and focus. For 72% of the unmapped PT terms, the identified exact match pairs illustrate the possibility of identifying additional mapped pairs. Referring to its own mapping standard, some of the unmapped terms should qualify for the expansion of MedDRA to ICD mapping in UMLS.
Collapse
Affiliation(s)
- Xinyuan Zhang
- McWilliam School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Yixue Feng
- School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
| | - Fang Li
- McWilliam School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Jin Ding
- McWilliam School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Danyal Tahseen
- McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Ezekiel Hinojosa
- McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Yong Chen
- The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Cui Tao
- McWilliam School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, USA.
| |
Collapse
|
5
|
Hao X, Abeysinghe R, Shi J, Cui L. A GCN-based approach to uncover misaligned synonymous terms in the UMLS Metathesaurus. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:977-986. [PMID: 38222357 PMCID: PMC10785861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/16/2024]
Abstract
The Unified Medical Language System (UMLS), a large repository of biomedical vocabularies, has been used for supporting various biomedical applications. Ensuring the quality of the UMLS is critical to maintain both the accuracy of its content and the reliability of downstream applications. In this work, we present a Graph Convolutional Network (GCN)-based approach to identify misaligned synonymous terms organized under different UMLS concepts. We used synonymous terms grouped under the same concept as positive samples and top lexically similar terms as negative samples to train the GCN model. We applied the model to a test set and suggested those negative samples predicted to be synonymous as potentially misaligned synonymous terms. A total of 147,625 suggestions were made. A human expert evaluated 100 randomly selected suggestions and agreed with 60 of them. The results indicate that our GCN-based approach shows promise to help improve the synonymy grouping in the UMLS.
Collapse
Affiliation(s)
- Xubing Hao
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
| | - Rashmie Abeysinghe
- Department of Neurology, The University of Texas Health Science Center at Houston, Houston, TX
| | - Jay Shi
- Intermountain Healthcare, Denver, CO
| | - Licong Cui
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
| |
Collapse
|
6
|
Mao Y, Miller RA, Bodenreider O, Nguyen V, Fung KW. Two complementary AI approaches for predicting UMLS semantic group assignment: heuristic reasoning and deep learning. J Am Med Inform Assoc 2023; 30:1887-1894. [PMID: 37528056 PMCID: PMC10654847 DOI: 10.1093/jamia/ocad152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Revised: 06/17/2023] [Accepted: 07/20/2023] [Indexed: 08/03/2023] Open
Abstract
OBJECTIVE Use heuristic, deep learning (DL), and hybrid AI methods to predict semantic group (SG) assignments for new UMLS Metathesaurus atoms, with target accuracy ≥95%. MATERIALS AND METHODS We used train-test datasets from successive 2020AA-2022AB UMLS Metathesaurus releases. Our heuristic "waterfall" approach employed a sequence of 7 different SG prediction methods. Atoms not qualifying for a method were passed on to the next method. The DL approach generated BioWordVec and SapBERT embeddings for atom names, BioWordVec embeddings for source vocabulary names, and BioWordVec embeddings for atom names of the second-to-top nodes of an atom's source hierarchy. We fed a concatenation of the 4 embeddings into a fully connected multilayer neural network with an output layer of 15 nodes (one for each SG). For both approaches, we developed methods to estimate the probability that their predicted SG for an atom would be correct. Based on these estimations, we developed 2 hybrid SG prediction methods combining the strengths of heuristic and DL methods. RESULTS The heuristic waterfall approach accurately predicted 94.3% of SGs for 1 563 692 new unseen atoms. The DL accuracy on the same dataset was also 94.3%. The hybrid approaches achieved an average accuracy of 96.5%. CONCLUSION Our study demonstrated that AI methods can predict SG assignments for new UMLS atoms with sufficient accuracy to be potentially useful as an intermediate step in the time-consuming task of assigning new atoms to UMLS concepts. We showed that for SG prediction, combining heuristic methods and DL methods can produce better results than either alone.
Collapse
Affiliation(s)
- Yuqing Mao
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Randolph A Miller
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Olivier Bodenreider
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Vinh Nguyen
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Kin Wah Fung
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| |
Collapse
|
7
|
Gholipour M, Khajouei R, Amiri P, Hajesmaeel Gohari S, Ahmadian L. Extracting cancer concepts from clinical notes using natural language processing: a systematic review. BMC Bioinformatics 2023; 24:405. [PMID: 37898795 PMCID: PMC10613366 DOI: 10.1186/s12859-023-05480-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Accepted: 09/13/2023] [Indexed: 10/30/2023] Open
Abstract
BACKGROUND Extracting information from free texts using natural language processing (NLP) can save time and reduce the hassle of manually extracting large quantities of data from incredibly complex clinical notes of cancer patients. This study aimed to systematically review studies that used NLP methods to identify cancer concepts from clinical notes automatically. METHODS PubMed, Scopus, Web of Science, and Embase were searched for English language papers using a combination of the terms concerning "Cancer", "NLP", "Coding", and "Registries" until June 29, 2021. Two reviewers independently assessed the eligibility of papers for inclusion in the review. RESULTS Most of the software programs used for concept extraction reported were developed by the researchers (n = 7). Rule-based algorithms were the most frequently used algorithms for developing these programs. In most articles, the criteria of accuracy (n = 14) and sensitivity (n = 12) were used to evaluate the algorithms. In addition, Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) and Unified Medical Language System (UMLS) were the most commonly used terminologies to identify concepts. Most studies focused on breast cancer (n = 4, 19%) and lung cancer (n = 4, 19%). CONCLUSION The use of NLP for extracting the concepts and symptoms of cancer has increased in recent years. The rule-based algorithms are well-liked algorithms by developers. Due to these algorithms' high accuracy and sensitivity in identifying and extracting cancer concepts, we suggested that future studies use these algorithms to extract the concepts of other diseases as well.
Collapse
Affiliation(s)
- Maryam Gholipour
- Student Research Committee, Kerman University of Medical Sciences, Kerman, Iran
| | - Reza Khajouei
- Department of Health Information Sciences, Faculty of Management and Medical Information Sciences, Kerman University of Medical Sciences, Kerman, Iran
| | - Parastoo Amiri
- Student Research Committee, Kerman University of Medical Sciences, Kerman, Iran
| | - Sadrieh Hajesmaeel Gohari
- Medical Informatics Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran
| | - Leila Ahmadian
- Department of Health Information Sciences, Faculty of Management and Medical Information Sciences, Kerman University of Medical Sciences, Kerman, Iran.
| |
Collapse
|
8
|
Li J, Li Y, Pan Y, Guo J, Sun Z, Li F, He Y, Tao C. Mapping Vaccine Names in Clinical Trials to Vaccine Ontology using Cascaded Fine-Tuned Domain-Specific Language Models. RESEARCH SQUARE 2023:rs.3.rs-3362256. [PMID: 37841880 PMCID: PMC10571639 DOI: 10.21203/rs.3.rs-3362256/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/17/2023]
Abstract
Background Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but the vaccine data in them lacks standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. Results In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the Top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% on top 1 candidate's accuracy and 90.0% on top 10 candidate's accuracy. Conclusion This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
Collapse
Affiliation(s)
- Jianfu Li
- The University of Texas Health Science Center at Houston
| | - Yiming Li
- The University of Texas Health Science Center at Houston
| | | | | | - Zenan Sun
- The University of Texas Health Science Center at Houston
| | - Fang Li
- The University of Texas Health Science Center at Houston
| | | | - Cui Tao
- The University of Texas Health Science Center at Houston
| |
Collapse
|
9
|
Abstract
Laboratory clinical decision support (CDS) typically relies on data from the electronic health record (EHR). The implementation of a sustainable, effective laboratory CDS program requires a commitment to standardization and harmonization of key EHR data elements that are the foundation of laboratory CDS. The direct use of artificial intelligence algorithms in CDS programs will be limited unless key elements of the EHR are structured. The identification, curation, maintenance, and preprocessing steps necessary to implement robust laboratory-based algorithms must account for the heterogeneity of data present in a typical EHR.
Collapse
|
10
|
SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
11
|
Havrilla JM, Liu C, Dong X, Weng C, Wang K. PhenCards: a data resource linking human phenotype information to biomedical knowledge. Genome Med 2021; 13:91. [PMID: 34034817 PMCID: PMC8147460 DOI: 10.1186/s13073-021-00909-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2021] [Accepted: 05/13/2021] [Indexed: 02/07/2023] Open
Abstract
We present PhenCards ( https://phencards.org ), a database and web server intended as a one-stop shop for previously disconnected biomedical knowledge related to human clinical phenotypes. Users can query human phenotype terms or clinical notes. PhenCards obtains relevant disease/phenotype prevalence and co-occurrence, drug, procedural, pathway, literature, grant, and collaborator data. PhenCards recommends the most probable genetic diseases and candidate genes based on phenotype terms from clinical notes. PhenCards facilitates exploration of phenotype, e.g., which drugs cause or are prescribed for patient symptoms, which genes likely cause specific symptoms, and which comorbidities co-occur with phenotypes.
Collapse
Affiliation(s)
- James M Havrilla
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
| | - Xiangchen Dong
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA. .,Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, 19104, USA.
| |
Collapse
|
12
|
Hegselmann S, Storck M, Gessner S, Neuhaus P, Varghese J, Bruland P, Meidt A, Mertens C, Riepenhausen S, Baier S, Stöcker B, Henke J, Schmidt CO, Dugas M. Pragmatic MDR: a metadata repository with bottom-up standardization of medical metadata through reuse. BMC Med Inform Decis Mak 2021; 21:160. [PMID: 34001121 PMCID: PMC8130274 DOI: 10.1186/s12911-021-01524-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2020] [Accepted: 05/09/2021] [Indexed: 11/27/2022] Open
Abstract
Background The variety of medical documentation often leads to incompatible data elements that impede data integration between institutions. A common approach to standardize and distribute metadata definitions are ISO/IEC 11179 norm-compliant metadata repositories with top-down standardization. To the best of our knowledge, however, it is not yet common practice to reuse the content of publicly accessible metadata repositories for creation of case report forms or routine documentation. We suggest an alternative concept called pragmatic metadata repository, which enables a community-driven bottom-up approach for agreeing on data collection models. A pragmatic metadata repository collects real-world documentation and considers frequent metadata definitions as high quality with potential for reuse. Methods We implemented a pragmatic metadata repository proof of concept application and filled it with medical forms from the Portal of Medical Data Models. We applied this prototype in two use cases to demonstrate its capabilities for reusing metadata: first, integration into a study editor for the suggestion of data elements and, second, metadata synchronization between two institutions. Moreover, we evaluated the emergence of bottom-up standards in the prototype and two medical data managers assessed their quality for 24 medical concepts. Results The resulting prototype contained 466,569 unique metadata definitions. Integration into the study editor led to a reuse of 1836 items and item groups. During the metadata synchronization, semantic codes of 4608 data elements were transferred. Our evaluation revealed that for less complex medical concepts weak bottom-up standards could be established. However, more diverse disease-related concepts showed no convergence of data elements due to an enormous heterogeneity of metadata. The survey showed fair agreement (Kalpha = 0.50, 95% CI 0.43–0.56) for good item quality of bottom-up standards. Conclusions We demonstrated the feasibility of the pragmatic metadata repository concept for medical documentation. Applications of the prototype in two use cases suggest that it facilitates the reuse of data elements. Our evaluation showed that bottom-up standardization based on a large collection of real-world metadata can yield useful results. The proposed concept shall not replace existing top-down approaches, rather it complements them by showing what is commonly used in the community to guide other researchers. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01524-8.
Collapse
Affiliation(s)
- Stefan Hegselmann
- Institute of Medical Informatics, University of Münster, Münster, Germany.
| | - Michael Storck
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Sophia Gessner
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Philipp Neuhaus
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Julian Varghese
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Philipp Bruland
- University of Applied Sciences Ostwestfalen-Lippe, Lemgo, Germany
| | - Alexandra Meidt
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Cornelia Mertens
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Sarah Riepenhausen
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Sonja Baier
- Centre for Clinical Trials, University of Münster, Münster, Germany
| | - Benedikt Stöcker
- Centre for Clinical Trials, University of Münster, Münster, Germany
| | - Jörg Henke
- Institute of Community Medicine, University Medicine of Greifswald, Greifswald, Germany
| | | | - Martin Dugas
- Institute of Medical Informatics, University of Münster, Münster, Germany
| |
Collapse
|
13
|
He Z, Tao C, Bian J, Zhang R. Selected articles from the Fourth International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). BMC Med Inform Decis Mak 2020; 20:315. [PMID: 33317524 PMCID: PMC7734704 DOI: 10.1186/s12911-020-01292-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
In this introduction, we first summarize the Fourth International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019) held on October 26, 2019 in conjunction with the 18th International Semantic Web Conference (ISWC 2019) in Auckland, New Zealand, and then briefly introduce seven research articles included in this supplement issue, covering the topics on Knowledge Graph, Ontology-Powered Analytics, and Deep Learning.
Collapse
Affiliation(s)
- Zhe He
- School of Information, College of Communication and Information, Florida State University, 142 Collegiate Loop, Tallahassee, FL, 32306-2100, USA.
| | - Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center At Houston, Houston, TX, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
| | - Rui Zhang
- Institute for Health Informatics and College of Pharmacy, University of Minnesota, Minneapolis, MN, USA
| |
Collapse
|
14
|
Abstract
The Unified Medical Language System (UMLS) is an internationally recognized medical vocabulary that enables semantic interoperability across various biomedical terminologies. To use its knowledge, the users must understand its complex knowledge structure, a structure that is not interoperable or is not compliant with any known biomedical and healthcare standard. Further, the users also need to have good technical skills to understand its inner working and interact with UMLS in general. These barriers might cause UMLS usage concerns among inter-disciplinary users in biomedical and healthcare informatics. Currently, there exists no terminology service that normalizes UMLS’s complex knowledge structure to a widely accepted interoperable healthcare standard and allows easy access to its knowledge, thus hiding its workings. The objective of this research is to design and implement a light-weight terminology service that allows easy access to UMLS knowledge structured using the fast health interoperability resources (FHIR) standard, a widely accepted interoperability healthcare standard. The developed terminology service, named UMLS FHIR, leverages FHIR resources and features, and can easily be integrated into any application to consume UMLS knowledge in the FHIR format without the need to understand UMLS’s native knowledge structure and its internal working.
Collapse
|
15
|
Humphreys BL, Del Fiol G, Xu H. The UMLS knowledge sources at 30: indispensable to current research and applications in biomedical informatics. J Am Med Inform Assoc 2020; 27:1499-1501. [PMID: 33059366 DOI: 10.1093/jamia/ocaa208] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2020] [Indexed: 01/22/2023] Open
Affiliation(s)
| | - Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| |
Collapse
|