1
Wang A, Liu C, Yang J, Weng C. Fine-tuning large language models for rare disease concept normalization. J Am Med Inform Assoc 2024; 31:2076-2083. [PMID: 38829731 PMCID: PMC11339522 DOI: 10.1093/jamia/ocae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 05/20/2024] [Accepted: 05/22/2024] [Indexed: 06/05/2024]
Abstract
OBJECTIVE We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). METHODS We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. RESULTS When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ∼20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. CONCLUSION Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
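The template-based corpus generation and single-character typo perturbation described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' in-house script: the sentence templates, the function names `build_corpus` and `add_typo`, the record layout, and the toy concept record are all assumptions, since the abstract does not give them.

```python
import random

# Hypothetical sentence templates; the paper's actual templates are not
# stated in the abstract.
TEMPLATES = [
    "The Human Phenotype Ontology term {term} is identified by HPO ID {hpo_id}.",
    "The HPO ID of {term} is {hpo_id}.",
]

# Toy record for illustration only; check names and synonyms against an
# actual HPO release before relying on them.
CONCEPTS = [
    {
        "id": "HP:0001250",
        "name": "Seizure",
        "synonyms": ["Seizures", "Epileptic seizure", "Fits", "Convulsions"],
    }
]


def build_corpus(concepts, include_synonyms=False, seed=0):
    """Return template sentences pairing each term with its identifier.

    include_synonyms=False mimics the NAME corpus (standard names only).
    include_synonyms=True mimics NAME+SYN: names plus a random half of each
    concept's synonyms, leaving the other half unseen for evaluation.
    """
    rng = random.Random(seed)
    sentences = []
    for concept in concepts:
        terms = [concept["name"]]
        if include_synonyms:
            synonyms = list(concept["synonyms"])
            rng.shuffle(synonyms)
            terms += synonyms[: len(synonyms) // 2]
        for term in terms:
            for template in TEMPLATES:
                sentences.append(template.format(term=term, hpo_id=concept["id"]))
    return sentences


def add_typo(term, rng):
    """Substitute one alphabetic character with a different lowercase letter."""
    positions = [i for i, ch in enumerate(term) if ch.isalpha()]
    if not positions:
        return term  # nothing to perturb
    i = rng.choice(positions)
    candidates = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != term[i].lower()]
    return term[:i] + rng.choice(candidates) + term[i + 1:]


if __name__ == "__main__":
    for sentence in build_corpus(CONCEPTS, include_synonyms=True):
        print(sentence)
    print(add_typo("Seizure", random.Random(1)))
```

Holding out half of the synonyms in this sketch mirrors the evaluation on unseen synonyms reported in the abstract; substitution is only one possible kind of single-character typo, and the study may also have used insertions or deletions.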
Affiliation(s)
- Andy Wang
  - Peddie School, Hightstown, NJ 08520, United States
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Jingye Yang
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
2
Wang A, Liu C, Yang J, Weng C. Fine-tuning Large Language Models for Rare Disease Concept Normalization. bioRxiv 2024:2023.12.28.573586. [PMID: 38234802 PMCID: PMC10793431 DOI: 10.1101/2023.12.28.573586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Objective We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). Methods We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. Results When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ~20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Conclusion Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLM to identify named medical entities from the clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
Affiliation(s)
- Andy Wang
  - Peddie School, Hightstown, NJ, USA
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Jingye Yang
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
3
Uribe-Carretero E, Rey V, Fuentes JM, Tamargo-Gómez I. Lysosomal Dysfunction: Connecting the Dots in the Landscape of Human Diseases. Biology 2024; 13:34. [PMID: 38248465 PMCID: PMC10813815 DOI: 10.3390/biology13010034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 12/22/2023] [Accepted: 01/02/2024] [Indexed: 01/23/2024]
Abstract
Lysosomes are the main organelles responsible for the degradation of macromolecules in eukaryotic cells. Beyond their fundamental role in degradation, lysosomes are involved in different physiological processes such as autophagy, nutrient sensing, and intracellular signaling. In some circumstances, lysosomal abnormalities underlie several human pathologies with different etiologies known as lysosomal storage disorders (LSDs). These disorders can result from deficiencies in primary lysosomal enzymes, dysfunction of lysosomal enzyme activators, alterations in modifiers that impact lysosomal function, or changes in membrane-associated proteins, among other factors. The clinical phenotype observed in affected patients hinges on the type and location of the accumulating substrate, influenced by genetic mutations and residual enzyme activity. In this context, the scientific community is dedicated to exploring potential therapeutic approaches, striving not only to extend lifespan but also to enhance the overall quality of life for individuals afflicted with LSDs. This review provides insights into lysosomal dysfunction from a molecular perspective, particularly in the context of human diseases, and highlights recent advancements and breakthroughs in this field.
Affiliation(s)
- Elisabet Uribe-Carretero
  - Departamento de Bioquímica y Biología Molecular y Genética, Facultad de Enfermería y Terapia Ocupacional, Universidad de Extremadura, 10003 Caceres, Spain
  - Centro de Investigación Biomédica en Red en Enfermedades Neurodegenerativa, Instituto de Salud Carlos III (CIBER-CIBERNED-ISCIII), 28029 Madrid, Spain
  - Instituto Universitario de Investigación Biosanitaria de Extremadura (INUBE), 10003 Caceres, Spain
- Verónica Rey
  - Instituto de Investigación Sanitaria del Principado de Asturias (ISPA), 33011 Oviedo, Spain
- Jose Manuel Fuentes
  - Departamento de Bioquímica y Biología Molecular y Genética, Facultad de Enfermería y Terapia Ocupacional, Universidad de Extremadura, 10003 Caceres, Spain
  - Centro de Investigación Biomédica en Red en Enfermedades Neurodegenerativa, Instituto de Salud Carlos III (CIBER-CIBERNED-ISCIII), 28029 Madrid, Spain
  - Instituto Universitario de Investigación Biosanitaria de Extremadura (INUBE), 10003 Caceres, Spain
- Isaac Tamargo-Gómez
  - Instituto de Investigación Sanitaria del Principado de Asturias (ISPA), 33011 Oviedo, Spain
4
Almenabawy N, Ramadan M, Kamel M, Mahmoud IG, Amer F, Shaheen Y, Elnaggar W, Selim L. Clinical, biochemical, and molecular characterization of mucopolysaccharidosis type III in 34 Egyptian patients. Am J Med Genet A 2023; 191:2354-2363. [PMID: 37596900 DOI: 10.1002/ajmg.a.63342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 06/09/2023] [Accepted: 06/20/2023] [Indexed: 08/21/2023]
Abstract
Mucopolysaccharidosis type III (MPS III) is a rare autosomal recessive lysosomal storage disorder characterized by progressive neurocognitive deterioration. There are four MPS III subtypes (A, B, C, and D) that are clinically indistinguishable with variable rates of progression. A retrospective analysis was carried out on 34 patients with MPS III types at Cairo University Children's Hospital. We described the clinical, biochemical, and molecular spectrum of MPS III patients. Of 34 patients, 22 patients had MPS IIIB, 7/34 had MPS IIIC, 4/34 had MPS IIIA, and only 1 had MPS IIID. All patients presented with developmental delay/intellectual disability, and speech delay. Ataxia was reported in a patient with MPS IIIC, and cerebellar atrophy in a patient with MPS IIIA. We reported 25 variants in the 4 MPS III genes, 11 of which were not previously reported. This is the first study to analyze the clinical and genetic spectrum of MPS III patients in Egypt. This study explores the genetic map of MPS III in the Egyptian population. It will pave the way for a national registry for rare diseases in Egypt, a country with a high rate of consanguineous marriage and consequently a high rate of autosomal recessive disorders.
Affiliation(s)
- Nihal Almenabawy
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
- Manal Ramadan
  - Pediatric Department, Ahmed Maher Teaching Hospital, Cairo, Egypt
- Mona Kamel
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
- Iman G Mahmoud
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
- Fawzia Amer
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
- Yara Shaheen
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
- Walaa Elnaggar
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
- Laila Selim
  - Pediatric Department, Pediatric Neurology and Metabolic Division, Cairo University Children's Hospital, Cairo, Egypt
5
Deep Phenotyping on Electronic Health Records Facilitates Genetic Diagnosis by Clinical Exomes. Am J Hum Genet 2018; 103:58-73. [PMID: 29961570 DOI: 10.1016/j.ajhg.2018.05.010] [Citation(s) in RCA: 81] [Impact Index Per Article: 13.5] [Received: 03/22/2018] [Accepted: 05/24/2018] [Indexed: 01/17/2023]
Abstract
Integration of detailed phenotype information with genetic data is well established to facilitate accurate diagnosis of hereditary disorders. As a rich source of phenotype information, electronic health records (EHRs) promise to empower diagnostic variant interpretation. However, how to accurately and efficiently extract phenotypes from heterogeneous EHR narratives remains a challenge. Here, we present EHR-Phenolyzer, a high-throughput EHR framework for extracting and analyzing phenotypes. EHR-Phenolyzer extracts and normalizes Human Phenotype Ontology (HPO) concepts from EHR narratives and then prioritizes genes with causal variants on the basis of the HPO-coded phenotype manifestations. We assessed EHR-Phenolyzer on 28 pediatric individuals with confirmed diagnoses of monogenic diseases and found that the genes with causal variants were ranked among the top 100 genes selected by EHR-Phenolyzer for 16/28 individuals (p < 2.2 × 10⁻¹⁶), supporting the value of phenotype-driven gene prioritization in diagnostic sequence interpretation. To assess the generalizability, we replicated this finding on an independent EHR dataset of ten individuals with a positive diagnosis from a different institution. We then assessed the broader utility by examining two additional EHR datasets, including 31 individuals who were suspected of having a Mendelian disease and underwent different types of genetic testing and 20 individuals with positive diagnoses of specific Mendelian etiologies of chronic kidney disease from exome sequencing. Finally, through several retrospective case studies, we demonstrated how combined analyses of genotype data and deep phenotype data from EHRs can expedite genetic diagnoses. In summary, EHR-Phenolyzer leverages EHR narratives to automate phenotype-driven analysis of clinical exomes or genomes, facilitating the broader implementation of genomic medicine.
6
SeqMule: automated pipeline for analysis of human exome/genome sequencing data. Sci Rep 2015; 5:14283. [PMID: 26381817 PMCID: PMC4585643 DOI: 10.1038/srep14283] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 02/25/2015] [Accepted: 08/21/2015] [Indexed: 11/16/2022]
Abstract
Next-generation sequencing (NGS) technology has greatly helped us identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and no access to high-performance computing facility. Discrepancies exist among aligners and variant callers. We developed a computational pipeline, SeqMule, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate consensus set with high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts various combinations all by one-line command, therefore allowing highly flexible yet fully automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48 GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. SeqMule supports Sun Grid Engine for parallel processing, offers turn-key solution for deployment on Amazon Web Services, allows quality check, Mendelian error check, consistency evaluation, HTML-based reports. SeqMule is available at http://seqmule.openbioinformatics.org.