1
Ahmadi N, Zoch M, Guengoeze O, Facchinello C, Mondorf A, Stratmann K, Musleh K, Erasmus HP, Tchertov J, Gebler R, Schaaf J, Frischen LS, Nasirian A, Dai J, Henke E, Tremblay D, Srisuwananukorn A, Bornhäuser M, Röllig C, Eckardt JN, Middeke JM, Wolfien M, Sedlmayr M. How to customize common data models for rare diseases: an OMOP-based implementation and lessons learned. Orphanet J Rare Dis 2024; 19:298. [PMID: 39143600 PMCID: PMC11325822 DOI: 10.1186/s13023-024-03312-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 08/06/2024] [Indexed: 08/16/2024]
Abstract
BACKGROUND Given the geographical sparsity of rare diseases (RDs), assembling a cohort is often a challenging task. Common data models (CDMs) can harmonize disparate sources of data that can be the basis of decision support systems and artificial intelligence-based studies, leading to new insights in the field. This work seeks to support the design of large-scale multi-center studies for rare diseases. METHODS In an interdisciplinary group, we derived a list of elements of RDs in three medical domains (endocrinology, gastroenterology, and pneumonology) according to specialist knowledge and clinical guidelines in an iterative process. We then defined an RD data structure that matched all our data elements and built Extract, Transform, Load (ETL) processes to transfer the structure to a joint CDM. To ensure interoperability of our developed CDM and its subsequent usage for further RD domains, we ultimately mapped it to the Observational Medical Outcomes Partnership (OMOP) CDM. We then included a fourth domain, hematology, as a proof-of-concept and mapped an acute myeloid leukemia (AML) dataset to the developed CDM. RESULTS We have developed an OMOP-based rare diseases common data model (RD-CDM) using data elements from the three domains (endocrinology, gastroenterology, and pneumonology) and tested the CDM using data from the hematology domain. The total study cohort included 61,697 patients. After aligning our modules with those of the Medical Informatics Initiative (MII) Core Dataset (CDS), we leveraged its ETL process. This facilitated the seamless transfer of demographic information, diagnoses, procedures, laboratory results, and medication modules from our RD-CDM to the OMOP CDM. For the phenotypes and genotypes, we developed a second ETL process. We finally derived lessons learned for customizing our RD-CDM for different RDs. DISCUSSION This work can serve as a blueprint for other domains as its modularized structure could be extended towards novel data types. An interdisciplinary group of stakeholders who are actively supporting the project's progress is necessary to reach a comprehensive CDM. CONCLUSION The customized data structure related to our RD-CDM can be used to perform multi-center studies to test data-driven hypotheses on a larger scale and take advantage of the analytical tools offered by the OHDSI community.
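As a rough illustration of the kind of ETL step the abstract describes, the sketch below maps one source diagnosis record onto a row shaped like the OMOP `condition_occurrence` table. It is not the authors' pipeline: the `SourceDiagnosis` type, the `CODE_TO_CONCEPT` lookup, and the concept ID `999001` are hypothetical placeholders, and a real ETL would resolve codes against the OMOP vocabulary tables instead. The only conventions assumed from OMOP are the column names and the use of concept ID 0 for unmapped codes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourceDiagnosis:
    """Hypothetical source record; field names are assumptions for illustration."""
    patient_id: int
    code: str          # source diagnosis code, e.g. an ICD-10 code
    recorded_on: date


# Placeholder lookup from source code to a standard concept ID.
# 999001 is a made-up ID, not a real OMOP concept.
CODE_TO_CONCEPT = {"C92.0": 999001}


def to_condition_occurrence(row_id, dx, type_concept_id=0):
    """Build a dict shaped like an OMOP condition_occurrence row.

    Codes without a mapping get concept ID 0, and the original code is
    always kept in condition_source_value so nothing is lost.
    type_concept_id defaults to 0 here as a placeholder.
    """
    return {
        "condition_occurrence_id": row_id,
        "person_id": dx.patient_id,
        "condition_concept_id": CODE_TO_CONCEPT.get(dx.code, 0),
        "condition_start_date": dx.recorded_on,
        "condition_type_concept_id": type_concept_id,
        "condition_source_value": dx.code,
    }


mapped = to_condition_occurrence(10, SourceDiagnosis(1, "C92.0", date(2020, 1, 15)))
unmapped = to_condition_occurrence(11, SourceDiagnosis(2, "X99.9", date(2021, 3, 1)))
```

Keeping the source value next to the mapped concept is what lets unmapped rare-disease codes be reviewed and mapped later without re-extracting the source data.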
Affiliation(s)
- Najia Ahmadi
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
- Michele Zoch
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
- Oya Guengoeze
  - Department of Internal Medicine I, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Carlo Facchinello
  - Department of Internal Medicine I, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Antonia Mondorf
  - Department of Internal Medicine I, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Katharina Stratmann
  - Department of Internal Medicine I, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Khader Musleh
  - Department of Internal Medicine I, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Hans-Peter Erasmus
  - Department of Internal Medicine I, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Jana Tchertov
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
- Richard Gebler
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
- Jannik Schaaf
  - Goethe University Frankfurt, University Hospital, Institute of Medical Informatics, Frankfurt, Germany
- Lena S Frischen
  - University Hospital Frankfurt, Goethe University, Executive Department for Medical IT-Systems and Digitalization, Frankfurt, Germany
- Azadeh Nasirian
  - Center of Medical Informatics, University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Jiabin Dai
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
- Elisa Henke
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
- Douglas Tremblay
  - Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Martin Bornhäuser
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Christoph Röllig
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Jan-Niklas Eckardt
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
  - Else-Kroener-Fresenius-Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Jan Moritz Middeke
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
  - Else-Kroener-Fresenius-Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Markus Wolfien
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Dresden, Germany
- Martin Sedlmayr
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, TUD Dresden University of Technology, Fetscherstraße 74, 01307, Dresden, Germany
2
Wang A, Liu C, Yang J, Weng C. Fine-tuning Large Language Models for Rare Disease Concept Normalization. bioRxiv 2024:2023.12.28.573586. [PMID: 38234802 PMCID: PMC10793431 DOI: 10.1101/2023.12.28.573586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Objective We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). Methods We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. Results When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ~20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Conclusion Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLM to identify named medical entities from the clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
Affiliation(s)
- Andy Wang
  - Peddie School, Hightstown, NJ, USA
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Jingye Yang
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
3
Wang A, Liu C, Yang J, Weng C. Fine-tuning large language models for rare disease concept normalization. J Am Med Inform Assoc 2024:ocae133. [PMID: 38829731 DOI: 10.1093/jamia/ocae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 05/20/2024] [Accepted: 05/22/2024] [Indexed: 06/05/2024]
Abstract
OBJECTIVE We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). METHODS We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. RESULTS When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ∼20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. CONCLUSION Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
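To make the corpus construction in this abstract concrete, here is a minimal sketch of a template-based generator for the two corpora (NAME and NAME+SYN), together with a single-character typo injector of the kind used in the robustness evaluation. The sentence templates, function names, and the toy concept record are assumptions for illustration only. The abstract does not give the authors' templates, how they selected half of the synonyms, or how typos were generated, so random selection with rounding down and single-letter substitution are assumed here.

```python
import random

# Hypothetical sentence templates; the authors' actual templates are not given.
TEMPLATES = [
    "The Human Phenotype Ontology identifier of {term} is {hpo_id}.",
    "{term} is normalized to {hpo_id}.",
]


def build_corpus(concepts, include_synonyms=False, seed=0):
    """Return fine-tuning sentences pairing each term with its HPO identifier.

    include_synonyms=False gives a NAME-style corpus (standard names only).
    include_synonyms=True gives a NAME+SYN-style corpus that adds a randomly
    chosen half (rounded down) of each concept's synonyms.
    """
    rng = random.Random(seed)
    sentences = []
    for concept in concepts:
        terms = [concept["name"]]
        if include_synonyms:
            synonyms = list(concept.get("synonyms", []))
            rng.shuffle(synonyms)
            terms += synonyms[: len(synonyms) // 2]
        for term in terms:
            for template in TEMPLATES:
                sentences.append(template.format(term=term, hpo_id=concept["id"]))
    return sentences


def add_single_char_typo(term, seed=0):
    """Replace exactly one alphabetic character with a different letter."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(term) if ch.isalpha()]
    if not positions:
        return term
    i = rng.choice(positions)
    choices = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != term[i].lower()]
    return term[:i] + rng.choice(choices) + term[i + 1:]


# Toy concept record for illustration; check identifiers against HPO before use.
concepts = [{"id": "HP:0001250", "name": "Seizure",
             "synonyms": ["Seizures", "Epileptic seizure"]}]
name_corpus = build_corpus(concepts)
name_syn_corpus = build_corpus(concepts, include_synonyms=True)
typo_term = add_single_char_typo("Seizure")
```

Holding back half of the synonyms is what makes the "unseen synonyms" evaluation possible: the withheld half never appears in the fine-tuning sentences and can serve as test terms.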
Affiliation(s)
- Andy Wang
  - Peddie School, Hightstown, NJ 08520, United States
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Jingye Yang
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
4
Tarride JE, Okoh A, Aryal K, Prada C, Milinkovic D, Keepanasseril A, Iorio A. Scoping review of the recommendations and guidance for improving the quality of rare disease registries. Orphanet J Rare Dis 2024; 19:187. [PMID: 38711103 PMCID: PMC11075280 DOI: 10.1186/s13023-024-03193-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 04/19/2024] [Indexed: 05/08/2024]
Abstract
BACKGROUND Rare disease registries (RDRs) are valuable tools for improving clinical care and advancing research. However, they often vary qualitatively, structurally, and operationally in ways that can determine their potential utility as a source of evidence to support decision-making regarding the approval and funding of new treatments for rare diseases. OBJECTIVES The goal of this research project was to review the literature on rare disease registries and identify best practices to improve the quality of RDRs. METHODS In this scoping review, we searched MEDLINE and EMBASE as well as the websites of regulatory bodies and health technology assessment agencies from 2010 to April 2023 for literature offering guidance or recommendations to ensure, improve, or maintain quality RDRs. RESULTS The search yielded 1,175 unique references, of which 64 met the inclusion criteria. The characteristics of RDRs deemed to be relevant to their quality align with three main domains and several sub-domains considered to be best practices for quality RDRs: (1) governance (registry purpose and description; governance structure; stakeholder engagement; sustainability; ethics/legal/privacy; data governance; documentation; and training and support); (2) data (standardized disease classification; common data elements; data dictionary; data collection; data quality and assurance; and data analysis and reporting); and (3) information technology (IT) infrastructure (physical and virtual infrastructure; and software infrastructure guided by the FAIR principles (Findability; Accessibility; Interoperability; and Reusability)). CONCLUSIONS Although RDRs face numerous challenges due to their small and dispersed populations, RDRs can generate quality data to support healthcare decision-making through the use of standards and principles on strong governance, quality data practices, and IT infrastructure.
Affiliation(s)
- J E Tarride
  - Department of Health Research Methods, Evidence and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Canada
  - Centre for Health Economics and Policy Analysis (CHEPA), McMaster University, Hamilton, Canada
  - Programs for the Assessment of Technologies in Health (PATH), The Research Institute of St. Joe's Hamilton, St. Joseph's Healthcare Hamilton, Hamilton, ON, Canada
- A Okoh
  - Department of Health Research Methods, Evidence and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- K Aryal
  - Department of Health Research Methods, Evidence and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- C Prada
  - Department of Health Research Methods, Evidence and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- Deborah Milinkovic
  - Centre for Health Economics and Policy Analysis (CHEPA), McMaster University, Hamilton, Canada
- A Keepanasseril
  - Department of Health Research Methods, Evidence and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- A Iorio
  - Department of Health Research Methods, Evidence and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Canada
5
Na R, Bae JB, Jung SH, Kim KW. Clinical Data Interchange Standards in Clinical Trials on Alzheimer's Disease. Psychiatry Investig 2022; 19:814-823. [PMID: 36327961 PMCID: PMC9633174 DOI: 10.30773/pi.2022.0149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2022] [Accepted: 08/04/2022] [Indexed: 11/27/2022]
Abstract
OBJECTIVE The Clinical Data Interchange Standards Consortium (CDISC) proposed outcome measures for clinical trials on Alzheimer's disease (AD) in the Therapeutic Area User Guide for AD (TAUG-AD). We investigated how well the clinical trials on AD registered in ClinicalTrials.gov complied with the CDISC recommendations on outcome measures. METHODS We compared the outcome measures proposed in the TAUG-AD version 2.0.1 with those employed in the protocols of clinical trials on AD registered in ClinicalTrials.gov. RESULTS We analyzed 101 outcome measures from 305 protocols. The TAUG-AD listed ten scales for outcome measures of clinical trials on AD. The scales for cognition, activities of daily living, behavioral and psychological symptoms of dementia, and global severity listed in TAUG-AD were most frequently employed in the clinical trials on AD. However, TAUG-AD did not include any scale on quality of life. Also, several scales not listed in the TAUG-AD, such as the Montreal Cognitive Assessment, the Alzheimer's Disease Cooperative Study-Activities of Daily Living, and the Cohen-Mansfield Agitation Inventory, were commonly employed in the clinical trials on AD and changed over time. CONCLUSION To properly standardize the data from clinical trials on AD, the gap between the TAUG-AD and the measures employed in real-world clinical trials should be filled.
Affiliation(s)
- Riyoung Na
  - Republic of Korea National Institute of Dementia, Seoul, Republic of Korea
- Jong Bin Bae
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Sue Hyun Jung
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Ki Woong Kim
  - Republic of Korea National Institute of Dementia, Seoul, Republic of Korea
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
  - Department of Psychiatry, Seoul National University College of Medicine, Seoul, Republic of Korea
  - Department of Brain and Cognitive Science, Seoul National University College of Natural Sciences, Seoul, Republic of Korea
6
Turner EC, Gantman EC, Sampaio C, Sivakumaran S. Huntington's Disease Regulatory Science Consortium: Accelerating Medical Product Development. J Huntingtons Dis 2022; 11:97-104. [PMID: 35466945 DOI: 10.3233/jhd-220533] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/24/2022]
Abstract
Huntington's disease (HD) is a devastating neurodegenerative disorder that urgently needs disease-modifying therapeutics. To this end, collaboration to standardize clinical research practices in the field and drive progress in addressing drug development challenges is paramount. At a meeting in 2017 organized by CHDI Foundation and the Critical Path Institute, stakeholders across the pharmaceutical industry, academia, regulatory agencies, and patient advocacy groups discussed the need for and potential impact of a consortium dedicated to HD regulatory science. Consequently, the Huntington's Disease Regulatory Science Consortium (HD-RSC) was formed, a precompetitive consortium that is dedicated to building a regulatory strategy to expedite the approval of HD therapeutics.
7
Kinnunen KM, Mullin AP, Pustina D, Turner EC, Burton J, Gordon MF, Scahill RI, Gantman EC, Noble S, Romero K, Georgiou-Karistianis N, Schwarz AJ. Recommendations to Optimize the Use of Volumetric MRI in Huntington's Disease Clinical Trials. Front Neurol 2021; 12:712565. [PMID: 34744964 PMCID: PMC8569234 DOI: 10.3389/fneur.2021.712565] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/20/2021] [Accepted: 09/21/2021] [Indexed: 12/12/2022]
Abstract
Volumetric magnetic resonance imaging (vMRI) has been widely studied in Huntington's disease (HD) and is commonly used to assess treatment effects on brain atrophy in interventional trials. Global and regional trajectories of brain atrophy in HD, with early involvement of striatal regions, are becoming increasingly understood. However, there remains heterogeneity in the methods used and a lack of widely-accessible multisite, longitudinal, normative datasets in HD. Consensus for standardized practices for data acquisition, analysis, sharing, and reporting will strengthen the interpretation of vMRI results and facilitate their adoption as part of a pathobiological disease staging system. The Huntington's Disease Regulatory Science Consortium (HD-RSC) currently comprises 37 member organizations and is dedicated to building a regulatory science strategy to expedite the approval of HD therapeutics. Here, we propose four recommendations to address vMRI standardization in HD research: (1) a checklist of standardized practices for the use of vMRI in clinical research and for reporting results; (2) targeted research projects to evaluate advanced vMRI methodologies in HD; (3) the definition of standard MRI-based anatomical boundaries for key brain structures in HD, plus the creation of a standard reference dataset to benchmark vMRI data analysis methods; and (4) broad access to raw images and derived data from both observational studies and interventional trials, coded to protect participant identity. In concert, these recommendations will enable a better understanding of disease progression and increase confidence in the use of vMRI for drug development.
Affiliation(s)
- Ariana P Mullin
  - Critical Path Institute, Tucson, AZ, United States
  - Wave Life Sciences, Ltd., Cambridge, MA, United States
- Dorian Pustina
  - CHDI Management/CHDI Foundation, Princeton, NJ, United States
- Mark F Gordon
  - Teva Pharmaceuticals, West Chester, PA, United States
- Rachael I Scahill
  - Huntington's Disease Research Centre, UCL Institute of Neurology, London, United Kingdom
- Emily C Gantman
  - CHDI Management/CHDI Foundation, Princeton, NJ, United States
- Simon Noble
  - CHDI Management/CHDI Foundation, Princeton, NJ, United States
- Klaus Romero
  - Critical Path Institute, Tucson, AZ, United States
- Nellie Georgiou-Karistianis
  - School of Psychological Sciences and Turner Institute for Brain and Mental Health, Monash University, Melbourne, VIC, Australia
- Adam J Schwarz
  - Takeda Pharmaceuticals, Ltd., Cambridge, MA, United States