1
Mao X, Huang Y, Jin Y, Wang L, Chen X, Liu H, Yang X, Xu H, Luan X, Xiao Y, Feng S, Zhu J, Zhang X, Jiang R, Zhang S, Chen T. A phenotype-based AI pipeline outperforms human experts in differentially diagnosing rare diseases using EHRs. NPJ Digit Med 2025; 8:68. [PMID: 39875532; PMCID: PMC11775211; DOI: 10.1038/s41746-025-01452-1] [Received: 09/06/2023] [Accepted: 01/15/2025] [Indexed: 01/30/2025] Open
Abstract
Rare diseases, affecting ~350 million people worldwide, pose significant challenges in clinical diagnosis due to the lack of experienced physicians and the complexity of differentiating between numerous rare diseases. To address these challenges, we introduce PhenoBrain, a fully automated artificial intelligence pipeline. PhenoBrain utilizes a BERT-based natural language processing model to extract phenotypes from clinical texts in EHRs and employs five new diagnostic models for differential diagnoses of rare diseases. The AI system was developed and evaluated on diverse, multi-country rare disease datasets, comprising 2271 cases with 431 rare diseases. In 1936 test cases, PhenoBrain achieved an average predicted top-3 recall of 0.513 and a top-10 recall of 0.654, surpassing 13 leading prediction methods. In a human-computer study with 75 cases, PhenoBrain exhibited exceptional performance with a top-3 recall of 0.613 and a top-10 recall of 0.813, surpassing the performance of 50 specialist physicians and large language models like ChatGPT and GPT-4. Combining PhenoBrain's predictions with specialists increased the top-3 recall to 0.768, demonstrating its potential to enhance diagnostic accuracy in clinical workflows.
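As a reading aid for the recall figures above, here is a minimal sketch of how top-k recall is commonly computed. It assumes one confirmed diagnosis per case and one ranked candidate list per case. The function name `top_k_recall`, the disease labels, and the toy data are hypothetical and are not taken from PhenoBrain.

```python
from typing import Sequence


def top_k_recall(ranked_predictions: Sequence[Sequence[str]],
                 true_diagnoses: Sequence[str],
                 k: int) -> float:
    """Fraction of cases whose true diagnosis is among the first k ranked candidates."""
    if len(ranked_predictions) != len(true_diagnoses):
        raise ValueError("one ranked list is required per case")
    if not true_diagnoses:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_diagnoses)
        if truth in ranked[:k]
    )
    return hits / len(true_diagnoses)


# Hypothetical toy data: four cases, each with a ranked list of candidate diseases.
cases = [
    ["disease_A", "disease_B", "disease_C", "disease_D"],
    ["disease_B", "disease_D", "disease_A", "disease_C"],
    ["disease_C", "disease_A", "disease_D", "disease_B"],
    ["disease_D", "disease_C", "disease_B", "disease_A"],
]
truths = ["disease_A", "disease_A", "disease_B", "disease_C"]

print(top_k_recall(cases, truths, k=1))  # 0.25
print(top_k_recall(cases, truths, k=3))  # 0.75
```

Under this definition, a top-3 recall of 0.613 on 75 cases would correspond to 46 cases with the correct diagnosis ranked in the first three positions.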
Affiliation(s)
- Xiaohao Mao
  - Department of Computer Science and Technology & Institute for Artificial Intelligence & BNRist, Tsinghua University, Beijing, China
- Yu Huang
  - Department of Computer Science and Technology & Institute for Artificial Intelligence & BNRist, Tsinghua University, Beijing, China
  - Tencent Jarvis Lab, Shenzhen, China
- Ye Jin
  - Medical Research Center, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Lun Wang
  - Department of Internal Medicine, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Xuanzhong Chen
  - Department of Computer Science and Technology & Institute for Artificial Intelligence & BNRist, Tsinghua University, Beijing, China
- Honghong Liu
  - Department of Internal Medicine, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Xinglin Yang
  - Department of Cardiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Haopeng Xu
  - Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Xiaodong Luan
  - State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Ying Xiao
  - Department of Geriatrics, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Siqin Feng
  - Department of Cardiology, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Jiahao Zhu
  - Department of Computer Science and Technology & Institute for Artificial Intelligence & BNRist, Tsinghua University, Beijing, China
- Xuegong Zhang
  - Department of Automation & BNRist, Tsinghua University, Beijing, China
- Rui Jiang
  - Department of Automation & BNRist, Tsinghua University, Beijing, China
- Shuyang Zhang
  - Department of Cardiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Ting Chen
  - Department of Computer Science and Technology & Institute for Artificial Intelligence & BNRist, Tsinghua University, Beijing, China
2
Albayrak A, Xiao Y, Mukherjee P, Barnett SS, Marcou CA, Hart SN. Enhancing human phenotype ontology term extraction through synthetic case reports and embedding-based retrieval: A novel approach for improved biomedical data annotation. J Pathol Inform 2025; 16:100409. [PMID: 39720417; PMCID: PMC11667693; DOI: 10.1016/j.jpi.2024.100409] [Received: 09/19/2024] [Revised: 11/04/2024] [Accepted: 11/12/2024] [Indexed: 12/26/2024] Open
Abstract
With the increasing utilization of exome and genome sequencing in clinical and research genetics, accurate and automated extraction of human phenotype ontology (HPO) terms from clinical texts has become imperative. Traditional methods for HPO term extraction, such as PhenoTagger, often face limitations in coverage and precision. In this study, we propose a novel approach that leverages large language models (LLMs) to generate synthetic sentences with clinical context, which are then semantically encoded into vector embeddings. These embeddings are linked to HPO terms, creating a robust knowledgebase that facilitates precise information retrieval. Our method circumvents the known issue of LLM hallucinations by storing and querying these embeddings within a true database, ensuring accurate context matching without the need for a predictive model. We evaluated the performance of three different embedding models, all of which demonstrated substantial improvements over PhenoTagger. The top recall (sensitivity), precision (positive predictive value, PPV), and F1 were 0.64, 0.64, and 0.64, respectively, which were 31%, 10%, and 21% better than PhenoTagger. Furthermore, optimal performance was achieved when we combined the best-performing embedding model with PhenoTagger (the Fused model), resulting in recall (sensitivity), precision (PPV), and F1 values of 0.7, 0.7, and 0.7, respectively, which are 10%, 10%, and 10% better than the best embedding models. Our findings underscore the potential of this integrated approach to enhance the precision and reliability of HPO term extraction, offering a scalable and effective solution for biomedical data annotation.
Affiliation(s)
- Abdulkadir Albayrak
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, United States of America
- Yao Xiao
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States of America
- Piyush Mukherjee
  - Center for Digital Health, Mayo Clinic, Rochester, MN, United States of America
- Sarah S. Barnett
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, United States of America
- Cherisse A. Marcou
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, United States of America
- Steven N. Hart
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, United States of America
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States of America
3
Tammen I, Mather M, Leeb T, Nicholas FW. Online Mendelian Inheritance in Animals (OMIA): a genetic resource for vertebrate animals. Mamm Genome 2024; 35:556-564. [PMID: 39143381; PMCID: PMC11522177; DOI: 10.1007/s00335-024-10059-y] [Received: 06/23/2024] [Accepted: 08/01/2024] [Indexed: 08/16/2024]
Abstract
Online Mendelian Inheritance in Animals (OMIA) is a freely available curated knowledgebase that contains information and facilitates research on inherited traits and diseases in animals. For the past 29 years, OMIA has been used by animal geneticists, breeders, and veterinarians worldwide as a definitive source of information. Recent increases in curation capacity and funding for software engineering support have resulted in software upgrades and commencement of several initiatives, which include the enhancement of variant information and links to human data resources, and the introduction of ontology-based breed information and categories. We provide an overview of current information and recent enhancements to OMIA and discuss how we are expanding the integration of OMIA into other resources and databases via the use of ontologies and the adaptation of tools used in human genetics.
Affiliation(s)
- Imke Tammen
  - Sydney School of Veterinary Science, The University of Sydney, Sydney, NSW, 2006, Australia
- Marius Mather
  - Sydney Informatics Hub, The University of Sydney, Sydney, NSW, 2006, Australia
- Tosso Leeb
  - Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern, 3001, Switzerland
- Frank W Nicholas
  - Sydney School of Veterinary Science, The University of Sydney, Sydney, NSW, 2006, Australia
4
Chen F, Ahimaz P, Nguyen QM, Lewis R, Chung WK, Ta CN, Szigety KM, Sheppard SE, Campbell IM, Wang K, Weng C, Liu C. Phenotype driven molecular genetic test recommendation for diagnosing pediatric rare disorders. NPJ Digit Med 2024; 7:333. [PMID: 39572625; PMCID: PMC11582592; DOI: 10.1038/s41746-024-01331-1] [Received: 11/10/2023] [Accepted: 11/07/2024] [Indexed: 11/24/2024] Open
Abstract
Patients with rare diseases often experience prolonged diagnostic delays. Ordering appropriate genetic tests is crucial yet challenging, especially for general pediatricians without genetic expertise. Recent American College of Medical Genetics (ACMG) guidelines embrace early use of exome sequencing (ES) or genome sequencing (GS) for conditions like congenital anomalies or developmental delays while still recommending gene panels for patients exhibiting strong manifestations of a specific disease. Recognizing the difficulty in navigating these options, we developed a machine learning model trained on 1005 patient records from Columbia University Irving Medical Center to recommend appropriate genetic tests based on the phenotype information. The model achieved an AUROC of 0.823 and an AUPRC of 0.918, aligning closely with decisions made by genetic specialists, and demonstrated strong generalizability (AUROC: 0.77, AUPRC: 0.816) in an external cohort, indicating its potential value for general pediatricians to expedite rare disease diagnosis by enhancing genetic test ordering.
Affiliation(s)
- Fangyi Chen
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Priyanka Ahimaz
  - Department of Pediatrics, Columbia University, New York, NY, USA
  - Institute of Genomic Medicine, Columbia University, New York, NY, USA
- Quan M Nguyen
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
- Rachel Lewis
  - Department of Pediatrics, Columbia University, New York, NY, USA
- Wendy K Chung
  - Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Casey N Ta
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Katherine M Szigety
  - Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Sarah E Sheppard
  - Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Ian M Campbell
  - Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Cong Liu
  - Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
5
Dohi E, Takatsuki T, Tateisi Y, Fujiwara T, Yamamoto Y. Examining HPO by organ and system to facilitate practical use by clinicians. Genomics Inform 2024; 22:23. [PMID: 39533429; PMCID: PMC11559069; DOI: 10.1186/s44342-024-00024-1] [Received: 05/31/2024] [Accepted: 09/29/2024] [Indexed: 11/16/2024] Open
Abstract
The Human Phenotype Ontology (HPO) is widely used for annotating clinical text data, and sufficient annotation is crucial for the effective utilization of clinical texts. It is known that LLMs can successfully extract symptoms and findings but cannot annotate them with the HPO. We hypothesized that one potential cause of this is the lack of appropriate terms in the HPO. Therefore, during the Biomedical Linked Annotation Hackathon 8 (BLAH8), we attempted the following two tasks in order to grasp the overall picture of the HPO: (1) extract all HPO terms for each of the 23 HPO subclasses (defined as categories) directly under the HPO term "Phenotypic abnormality" and (2) search for major attributes in each of the 23 categories. We employed an LLM for these two tasks and, at the same time, found that the LLM did not work well without ingenuity for tasks that lacked sentences and context. A manual search for terms within each category revealed that the HPO contains a mix of terms with four major attributes: (1) Disease Name, (2) Condition, (3) Test Data, and (4) Symptoms and Findings. Manual curation showed that the ratio of symptoms and findings varied from 0 to 93.1% across categories. For clinicians, who are end users of medical terminologies including the HPO, ontologies are difficult to understand. However, a good-quality ontology is important for good-quality data, and clinicians' help is essential. It is also important to make the overall picture and limitations of ontologies easy to understand in order to bring out the explanatory power of LLMs and artificial intelligence.
Affiliation(s)
- Eisuke Dohi
  - National Center of Neurology and Psychiatry, National Institute of Neuroscience, Kodaira, Tokyo, Japan
- Terue Takatsuki
  - Database Center for Life Science, ROIS-DS, Kashiwa, Chiba, Japan
6
Tekumalla R, Banda JM. Towards automated phenotype definition extraction using large language models. Genomics Inform 2024; 22:21. [PMID: 39482749; PMCID: PMC11529293; DOI: 10.1186/s44342-024-00023-2] [Received: 05/31/2024] [Accepted: 09/29/2024] [Indexed: 11/03/2024] Open
Abstract
Electronic phenotyping involves a detailed analysis of both structured and unstructured data, employing rule-based methods, machine learning, natural language processing, and hybrid approaches. Currently, the development of accurate phenotype definitions demands extensive literature reviews and clinical experts, rendering the process time-consuming and inherently unscalable. Large language models offer a promising avenue for automating phenotype definition extraction but come with significant drawbacks, including reliability issues, the tendency to generate non-factual data ("hallucinations"), misleading results, and potential harm. To address these challenges, our study pursued two key objectives: (1) defining a standard evaluation set to ensure that large language model outputs are both useful and reliable and (2) evaluating various prompting approaches to extract phenotype definitions from large language models, assessing them with our established evaluation task. Our findings reveal promising results that still require human evaluation and validation for this task. However, enhanced phenotype extraction is possible, reducing the amount of time spent in literature review and evaluation.
Affiliation(s)
- Juan M Banda
  - Stanford Health Care, Stanford, CA, USA
  - Observational Health Data Sciences and Informatics, New York, NY, USA
7
Bazalar-Montoya J, Cornejo-Olivas M, Duenas-Roque MM, Purizaca-Rosillo N, Rodriguez RS, Milla-Neyra K, De La Torre-Hernandez CA, Sarapura-Castro E, Galarreta Aima CI, Manassero-Morales G, Chávez-Pasco G, Celis-García L, La Serna-Infantes JE, Chekalin E, Thorpe E, Taft RJ. Clinical genome sequencing in patients with suspected rare genetic disease in Peru. NPJ Genom Med 2024; 9:51. [PMID: 39468051; PMCID: PMC11519459; DOI: 10.1038/s41525-024-00434-8] [Received: 01/05/2024] [Accepted: 09/24/2024] [Indexed: 10/30/2024] Open
Abstract
There is limited access to molecular genetic testing in most low- and middle-income countries. The iHope program provides clinical genome sequencing (cGS) to underserved individuals with signs or symptoms of rare genetic diseases and limited or no access to molecular genetic testing. Here we describe the performance and impact of cGS in 247 patients from three clinics in Peru. Although most patients had at least one genetic test prior to cGS (70.9%), the most frequent was karyotyping (53.4%). The diagnostic yield of cGS was 54.3%, with candidate variants reported in an additional 22.3% of patients. Clinical GS results impacted clinician diagnostic evaluation in 85.0% and genetic counseling in 72.1% of cases. Changes in management were reported in 71.3%, inclusive of referrals (64.7%), therapeutics (26.3%), laboratory or physiological testing (25.5%), imaging (19%), and palliative care (17.4%), suggesting that increased availability of genomic testing in Peru would enable improved patient management.
Affiliation(s)
- Jeny Bazalar-Montoya
  - Instituto Nacional de Salud del Niño San Borja, Lima, Peru
  - School of Public Health and Administration, Universidad Peruana Cayetano Heredia, Lima, Peru
- Mario Cornejo-Olivas
  - Neurogenetics Working Group, Universidad Cientifica del Sur, Lima, Peru
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurológicas, Lima, Peru
- Richard S Rodriguez
  - School of Public Health and Administration, Universidad Peruana Cayetano Heredia, Lima, Peru
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurológicas, Lima, Peru
- Karina Milla-Neyra
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurológicas, Lima, Peru
- Elison Sarapura-Castro
  - Neurogenetics Working Group, Universidad Cientifica del Sur, Lima, Peru
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurológicas, Lima, Peru
- Jorge E La Serna-Infantes
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurológicas, Lima, Peru
  - Instituto de Investigaciones en Ciencias Biomédicas (INICIB), Facultad de Medicina, Universidad Ricardo Palma, Lima, Peru
- Erin Thorpe
  - Illumina Inc, San Diego, CA, USA
  - Genetic Alliance, Damascus, MD, USA
- Ryan J Taft
  - Illumina Inc, San Diego, CA, USA
  - Genetic Alliance, Damascus, MD, USA
8
Soysal E, Roberts K. PheNormGPT: a framework for extraction and normalization of key medical findings. Database (Oxford) 2024; 2024:baae103. [PMID: 39444329; PMCID: PMC11498178; DOI: 10.1093/database/baae103] [Received: 03/31/2024] [Revised: 07/31/2024] [Accepted: 08/27/2024] [Indexed: 10/25/2024]
Abstract
This manuscript presents PheNormGPT, a framework for extraction and normalization of key findings in clinical text. PheNormGPT relies on an innovative approach, leveraging large language models to extract key findings and phenotypic data in unstructured clinical text and map them to Human Phenotype Ontology concepts. It utilizes OpenAI's GPT-3.5 Turbo and GPT-4 models with fine-tuning and few-shot learning strategies, including a novel few-shot learning strategy for custom-tailored few-shot example selection per request. PheNormGPT was evaluated in the BioCreative VIII Track 3: Genetic Phenotype Extraction from Dysmorphology Physical Examination Entries shared task. PheNormGPT achieved an F1 score of 0.82 for standard matching and 0.72 for exact matching, securing first place for this shared task.
Affiliation(s)
- Ekin Soysal
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St #600, Houston, TX 77030, United States
- Kirk Roberts
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St #600, Houston, TX 77030, United States
9
Wu D, Yang J, Wang K. Exploring the reversal curse and other deductive logical reasoning in BERT and GPT-based large language models. Patterns (N Y) 2024; 5:101030. [PMID: 39568650; PMCID: PMC11573886; DOI: 10.1016/j.patter.2024.101030] [Received: 03/04/2024] [Revised: 04/11/2024] [Accepted: 07/01/2024] [Indexed: 11/22/2024]
Abstract
The "Reversal Curse" describes the inability of autoregressive decoder large language models (LLMs) to deduce "B is A" from "A is B," assuming that B and A are distinct and can be uniquely identified from each other. This logical failure suggests limitations in using generative pretrained transformer (GPT) models for tasks like constructing knowledge graphs. Our study revealed that a bidirectional LLM, bidirectional encoder representations from transformers (BERT), does not suffer from this issue. To investigate further, we focused on more complex deductive reasoning by training encoder and decoder LLMs to perform union and intersection operations on sets. While both types of models managed tasks involving two sets, they struggled with operations involving three sets. Our findings underscore the differences between encoder and decoder models in handling logical reasoning. Thus, selecting BERT or GPT should depend on the task's specific needs, utilizing BERT's bidirectional context comprehension or GPT's sequence prediction strengths.
Affiliation(s)
- Da Wu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jingye Yang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
10
Adams DR, van Karnebeek CDM, Agulló SB, Faùndes V, Jamuar SS, Lynch SA, Pintos-Morell G, Puri RD, Shai R, Steward CA, Tumiene B, Verloes A. Addressing diagnostic gaps and priorities of the global rare diseases community: Recommendations from the IRDiRC diagnostics scientific committee. Eur J Med Genet 2024; 70:104951. [PMID: 38848991; DOI: 10.1016/j.ejmg.2024.104951] [Received: 03/08/2024] [Accepted: 06/05/2024] [Indexed: 06/09/2024]
Abstract
The International Rare Diseases Research Consortium (IRDiRC) Diagnostic Scientific Committee (DSC) is charged with discussion and contribution to progress on diagnostic aspects of the IRDiRC core mission. Specifically, IRDiRC goals include timely diagnosis, use of globally coordinated diagnostic pipelines, and assessing the impact of rare diseases on affected individuals. As part of this mission, the DSC endeavored to create a list of research priorities to achieve these goals. We present a discussion of those priorities along with aspects of current, global rare disease needs and opportunities that support our prioritization. In support of this discussion, we also provide clinical vignettes illustrating real-world examples of diagnostic challenges.
Affiliation(s)
- David R Adams
  - National Human Genome Research Institute, National Institutes of Health, USA
- Clara D M van Karnebeek
  - Departments of Pediatrics and Human Genetics, Emma Center for Personalized Medicine, Amsterdam Gastro-enterology Endocrinology Metabolism, Amsterdam University Medical Centers, the Netherlands
- Sergi Beltran Agulló
  - Centre Nacional d'Anàlisi Genòmica (CNAG), Spain; Departament de Genètica, Microbiologia i Estadística, Facultat de Biologia, Universitat de Barcelona (UB), Spain
- Víctor Faùndes
  - Laboratorio de Genética y Enfermedades Metabólicas, Instituto de Nutrición y Tecnología de los Alimentos, Universidad de Chile, Chile
- Saumya Shekhar Jamuar
  - Genetics Service, KK Women's and Children's Hospital and Paediatrics ACP, Duke-NUS Medical School, Singapore; Singhealth Duke-NUS Institute of Precision Medicine, Singapore
- Guillem Pintos-Morell
  - Vall d'Hebron Research Institute (VHIR), Vall d'Hebron Barcelona Hospital, Spain; MPS-Spain Patient Advocacy Organization, Spain
- Ratna Dua Puri
  - Institute of Medical Genetics and Genomics, Sir Ganga Ram Hospital, India
- Ruty Shai
  - Pediatric Cancer Molecular Lab, Sheba Medical Center, Israel
- Biruté Tumiene
  - Vilnius University, Faculty of Medicine, Institute of Biomedical Sciences, Lithuania
- Alain Verloes
  - Département de Génétique, CHU Paris - Hôpital Robert Debré, France
11
Mullin S, McDougal R, Cheung KH, Kilicoglu H, Beck A, Zeiss CJ. Chemical entity normalization for successful translational development of Alzheimer's disease and dementia therapeutics. J Biomed Semantics 2024; 15:13. [PMID: 39080729; PMCID: PMC11290083; DOI: 10.1186/s13326-024-00314-1] [Received: 02/03/2023] [Accepted: 06/09/2024] [Indexed: 08/02/2024] Open
Abstract
BACKGROUND: Identifying chemical mentions within the Alzheimer's and dementia literature can provide a powerful tool to further therapeutic research. Leveraging the Chemical Entities of Biological Interest (ChEBI) ontology, which is rich in hierarchical and other relationship types, for entity normalization can provide an advantage for future downstream applications. We provide a reproducible hybrid approach that combines an ontology-enhanced PubMedBERT model for disambiguation with a dictionary-based method for candidate selection. RESULTS: There were 56,553 chemical mentions in the titles of 44,812 unique PubMed article abstracts. Based on our gold standard, our method of disambiguation improved entity normalization by 25.3 percentage points compared to using only the dictionary-based approach with fuzzy-string matching for disambiguation. For the CRAFT corpus, our method outperformed baselines (maximum 78.4%) with a 91.17% accuracy. For our Alzheimer's and dementia cohort, we were able to add 47.1% more potential mappings between MeSH and ChEBI when compared to BioPortal. CONCLUSION: Use of natural language models like PubMedBERT and resources such as ChEBI and PubChem provides a beneficial way to link entity mentions to ontology terms, while further supporting downstream tasks like filtering ChEBI mentions based on roles and assertions to find beneficial therapies for Alzheimer's and dementia.
Affiliation(s)
- Sarah Mullin
  - Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
- Amanda Beck
  - Albert Einstein College of Medicine, Bronx, NY, USA
12
Thorpe E, Williams T, Shaw C, Chekalin E, Ortega J, Robinson K, Button J, Jones MC, Campo MD, Basel D, McCarrier J, Keppen LD, Royer E, Foster-Bonds R, Duenas-Roque MM, Urraca N, Bosfield K, Brown CW, Lydigsen H, Mroczkowski HJ, Ward J, Sirchia F, Giorgio E, Vaux K, Salguero HP, Lumaka A, Mubungu G, Makay P, Ngole M, Lukusa PT, Vanderver A, Muirhead K, Sherbini O, Lah MD, Anderson K, Bazalar-Montoya J, Rodriguez RS, Cornejo-Olivas M, Milla-Neyra K, Shinawi M, Magoulas P, Henry D, Gibson K, Wiafe S, Jayakar P, Salyakina D, Masser-Frye D, Serize A, Perez JE, Taylor A, Shenbagam S, Abou Tayoun A, Malhotra A, Bennett M, Rajan V, Avecilla J, Warren A, Arseneault M, Kalista T, Crawford A, Ajay SS, Perry DL, Belmont J, Taft RJ. The impact of clinical genome sequencing in a global population with suspected rare genetic disease. Am J Hum Genet 2024; 111:1271-1281. [PMID: 38843839; PMCID: PMC11267518; DOI: 10.1016/j.ajhg.2024.05.006] [Received: 11/27/2023] [Revised: 05/03/2024] [Accepted: 05/06/2024] [Indexed: 07/03/2024] Open
Abstract
There is mounting evidence of the value of clinical genome sequencing (cGS) in individuals with suspected rare genetic disease (RGD), but cGS performance and impact on clinical care in a diverse population drawn from both high-income countries (HICs) and low- and middle-income countries (LMICs) has not been investigated. The iHope program, a philanthropic cGS initiative, established a network of 24 clinical sites in eight countries through which it provided cGS to individuals with signs or symptoms of an RGD and constrained access to molecular testing. A total of 1,004 individuals (median age, 6.5 years; 53.5% male) with diverse ancestral backgrounds (51.8% non-majority European) were assessed from June 2016 to September 2021. The diagnostic yield of cGS was 41.4% (416/1,004), with individuals from LMIC sites 1.7 times more likely to receive a positive test result compared to HIC sites (LMIC 56.5% [195/345] vs. HIC 33.5% [221/659], OR 2.6, 95% CI 1.9-3.4, p < 0.0001). A change in diagnostic evaluation occurred in 76.9% (514/668) of individuals. Change of management, inclusive of specialty referrals, imaging and testing, therapeutic interventions, and palliative care, was reported in 41.4% (285/694) of individuals, which increased to 69.2% (480/694) when genetic counseling and avoidance of additional testing were also included. Individuals from LMIC sites were as likely as their HIC counterparts to experience a change in diagnostic evaluation (OR 6.1, 95% CI 1.1-∞, p = 0.05) and change of management (OR 0.9, 95% CI 0.5-1.3, p = 0.49). Increased access to genomic testing may support diagnostic equity and the reduction of global health care disparities.
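The abstract above reports both "1.7 times more likely" and "OR 2.6" for the same LMIC versus HIC comparison. The first figure matches a ratio of proportions (risk ratio) and the second a ratio of odds. The following is a minimal sketch, assuming unadjusted ratios computed directly from the counts quoted in the abstract; the authors' actual statistical procedure is not described here, and the function names are illustrative.

```python
def risk_ratio(events_a: int, total_a: int, events_b: int, total_b: int) -> float:
    """Ratio of the event proportions in group A and group B."""
    return (events_a / total_a) / (events_b / total_b)


def odds_ratio(events_a: int, total_a: int, events_b: int, total_b: int) -> float:
    """Ratio of the event odds (events / non-events) in group A and group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Counts quoted in the abstract: positive cGS results per site group.
lmic_positive, lmic_total = 195, 345
hic_positive, hic_total = 221, 659

rr = risk_ratio(lmic_positive, lmic_total, hic_positive, hic_total)
oratio = odds_ratio(lmic_positive, lmic_total, hic_positive, hic_total)

print(f"risk ratio: {rr:.2f}")      # risk ratio: 1.69
print(f"odds ratio: {oratio:.2f}")  # odds ratio: 2.58
```

Rounded to one decimal place, these values agree with the 1.7 and 2.6 given in the abstract. The two group counts also sum to the overall yield of 416/1,004.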
Affiliation(s)
- Chad Shaw
  - Genetic and Genomic Services PBC, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA; Department of Statistics, Rice University, Houston, TX, USA
- Julia Ortega
  - Illumina Inc, San Diego, CA, USA; C2N Diagnostics, St. Louis, MO, USA
- Marilyn C Jones
  - Rady Children's Hospital, San Diego, CA, USA; University of California, San Diego, San Diego, CA, USA
- Miguel Del Campo
  - Rady Children's Hospital, San Diego, CA, USA; University of California, San Diego, San Diego, CA, USA
- Donald Basel
  - Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
- Julie McCarrier
  - Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
- Erin Royer
  - Sanford Children's Specialty Clinics at Sanford Health, USD Sanford School of Medicine, Sioux Falls, SD, USA
- Nora Urraca
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Kerri Bosfield
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Chester W Brown
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Holly Lydigsen
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Henry J Mroczkowski
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Jewell Ward
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Fabio Sirchia
  - Department of Molecular Medicine, University of Pavia, Pavia, Italy; Medical Genetics Unit, IRCCS San Matteo Foundation, Pavia, Italy
- Elisa Giorgio
  - Department of Molecular Medicine, University of Pavia, Pavia, Italy; Medical Genetics Unit, IRCCS Mondino Foundation, Pavia, Italy
- Keith Vaux
  - Point Loma Pediatrics, San Diego, CA, USA
- Aimé Lumaka
  - Centre de Genetique Humaine, Universite de Kinshasa, Kinshasa, Democratic Republic of the Congo; Center for Human Genetics, Centre Hospitalier Universitaire, Liège, Belgium
- Gerrye Mubungu
  - Centre de Genetique Humaine, Universite de Kinshasa, Kinshasa, Democratic Republic of the Congo
- Prince Makay
  - Centre de Genetique Humaine, Universite de Kinshasa, Kinshasa, Democratic Republic of the Congo
- Mamy Ngole
  - Centre de Genetique Humaine, Universite de Kinshasa, Kinshasa, Democratic Republic of the Congo
- Prosper Tshilobo Lukusa
  - Centre de Genetique Humaine, Universite de Kinshasa, Kinshasa, Democratic Republic of the Congo
- Adeline Vanderver
  - Division of Neurology, Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Neurology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Omar Sherbini
  - Division of Neurology, Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Melissa D Lah
  - Indiana University School of Medicine, Indianapolis, IN, USA
- Mario Cornejo-Olivas
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurologicas, Lima, Peru; Neurogenetics Working Group, Universidad Científica del Sur, Lima, Peru
- Karina Milla-Neyra
  - Neurogenetics Research Center, Instituto Nacional de Ciencias Neurologicas, Lima, Peru
- Marwan Shinawi
  - Washington University, St. Louis, MO, USA; St. Louis Children's Hospital, St. Louis, MO, USA
- Duncan Henry
  - UCSF Benioff Children's Hospitals, San Francisco, CA, USA
- Kate Gibson
  - Canterbury District Health Board, Canterbury, New Zealand
- Diane Masser-Frye
  - Rady Children's Hospital, San Diego, CA, USA; San Diego-Imperial Counties Developmental Services, Inc., San Diego, CA, USA
- Alan Taylor
  - Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, United Arab Emirates
- Shruti Shenbagam
  - Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, United Arab Emirates
- Ahmad Abou Tayoun
  - Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, United Arab Emirates; Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
- Vani Rajan
  - Illumina Inc, San Diego, CA, USA; Veracyte, San Diego, CA, USA
- John Belmont
  - Genetic and Genomic Services PBC, Houston, TX, USA
13
Groza T, Gration D, Baynam G, Robinson PN. FastHPOCR: pragmatic, fast, and accurate concept recognition using the human phenotype ontology. Bioinformatics 2024; 40:btae406. [PMID: 38913850 PMCID: PMC11227366 DOI: 10.1093/bioinformatics/btae406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 05/18/2024] [Accepted: 06/19/2024] [Indexed: 06/26/2024] Open
Abstract
MOTIVATION: Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype CR lifecycle. Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data. RESULTS: We developed a dictionary-based approach that uses a pre-built large collection of clusters of morphologically equivalent tokens to address lexical variability, together with a more effective CR step that restricts entity boundary detection strictly to candidates consisting of tokens belonging to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10 000 publication abstracts in 5 s. AVAILABILITY AND IMPLEMENTATION: FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.
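The dictionary-based idea described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FastHPOCR package or its API: the three-term `LEXICON`, the hand-written `VARIANTS` table (standing in for the pre-built clusters of morphologically equivalent tokens), and the function names are all hypothetical.

```python
import re

# Toy lexicon for illustration only; a real run would load a full HPO release.
LEXICON = {
    ("seizure",): "HP:0001250",
    ("microcephaly",): "HP:0000252",
    ("global", "developmental", "delay"): "HP:0001263",
}

# Hand-written stand-in for clusters of morphologically equivalent tokens.
VARIANTS = {"seizures": "seizure", "delays": "delay", "delayed": "delay"}

VOCAB = {token for label in LEXICON for token in label}
MAX_LEN = max(len(label) for label in LEXICON)


def normalize(token):
    """Lower-case a token and map known variants to one canonical form."""
    token = token.lower()
    return VARIANTS.get(token, token)


def recognize(text):
    """Return (matched tokens, HPO id) pairs, preferring the longest match."""
    tokens = [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]
    hits, i = [], 0
    while i < len(tokens):
        step = 1
        # Only tokens that occur in some ontology label can start a candidate.
        if tokens[i] in VOCAB:
            for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if span in LEXICON:
                    hits.append((" ".join(span), LEXICON[span]))
                    step = n
                    break
        i += step
    return hits


print(recognize("Global developmental delays with seizures and microcephaly."))
```

Because the lexicon is a plain dictionary, refreshing it after an ontology release only means rebuilding `LEXICON`, which is the kind of cheap re-analysis the abstract argues for.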
Affiliation(s)
- Tudor Groza
  - Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia
  - Telethon Kids Institute, Nedlands, WA 6009, Australia
  - School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Bentley, WA 6102, Australia
  - SingHealth Duke-NUS Institute of Precision Medicine, Singapore 169609, Singapore
- Dylan Gration
  - Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, Subiaco, WA 6008, Australia
- Gareth Baynam
  - Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia
  - Telethon Kids Institute, Nedlands, WA 6009, Australia
  - Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, Subiaco, WA 6008, Australia
  - Faculty of Health and Medical Sciences, University of Western Australia, Crawley, WA 6009, Australia
- Peter N Robinson
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
14
Yao X, Ouyang S, Lian Y, Peng Q, Zhou X, Huang F, Hu X, Shi F, Xia J. PheSeq, a Bayesian deep learning model to enhance and interpret the gene-disease association studies. Genome Med 2024; 16:56. [PMID: 38627848 PMCID: PMC11020195 DOI: 10.1186/s13073-024-01330-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 04/02/2024] [Indexed: 04/19/2024] Open
Abstract
Despite the abundance of genotype-phenotype association studies, the resulting associations often lack robustness and interpretability. To address these challenges, we introduce PheSeq, a Bayesian deep learning model that enhances and interprets association studies through the integration and perception of phenotype descriptions. By implementing the PheSeq model in three case studies on Alzheimer's disease, breast cancer, and lung cancer, we identify 1024 priority genes for Alzheimer's disease and 818 and 566 genes for breast cancer and lung cancer, respectively. Benefiting from data fusion, these findings show moderate positive rates, high recall rates, and interpretability in gene-disease association studies.
Affiliation(s)
- Xinzhi Yao
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Sizhuo Ouyang
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Yulong Lian
  - College of Science, Huazhong Agricultural University, Wuhan, China
- Qianqian Peng
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Xionghui Zhou
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Feier Huang
  - College of Life Science and Technology, Huazhong Agricultural University, Wuhan, China
- Xuehai Hu
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Feng Shi
  - College of Science, Huazhong Agricultural University, Wuhan, China
- Jingbo Xia
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
15
Yao X, He Z, Liu Y, Wang Y, Ouyang S, Xia J. Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer. Sci Data 2024; 11:265. [PMID: 38431735 PMCID: PMC10908799 DOI: 10.1038/s41597-024-03083-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 02/20/2024] [Indexed: 03/05/2024] Open
Abstract
It is vital to investigate the complex mechanisms underlying tumors to better understand cancer and develop effective treatments. Metabolic abnormalities and clinical phenotypes can serve as essential biomarkers for diagnosing this challenging disease. Additionally, genetic alterations provide profound insights into the fundamental aspects of cancer. This study introduces Cancer-Alterome, a literature-mined dataset that focuses on the regulatory events of an organism's biological processes or clinical phenotypes caused by genetic alterations. By proposing and leveraging a text-mining pipeline, we identify 16,681K regulatory event records encompassing 21K genes, 157K genetic alterations and 154K downstream bio-concepts, extracted from 4,354K pan-cancer publications. The resulting dataset empowers a multifaceted investigation of cancer pathology, enabling the meticulous tracking of relevant literature support. Its potential applications extend to evidence-based medicine and precision medicine, yielding valuable insights for further advancements in cancer research.
Affiliation(s)
- Xinzhi Yao
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Zhihan He
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yawen Liu
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yuxing Wang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, P.R. China
- Sizhuo Ouyang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Jingbo Xia
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
16
Groza T, Caufield H, Gration D, Baynam G, Haendel MA, Robinson PN, Mungall CJ, Reese JT. An evaluation of GPT models for phenotype concept recognition. BMC Med Inform Decis Mak 2024; 24:30. [PMID: 38297371 PMCID: PMC10829255 DOI: 10.1186/s12911-024-02439-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 01/24/2024] [Indexed: 02/02/2024] Open
Abstract
OBJECTIVE: Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift in the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. MATERIALS AND METHODS: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations. RESULTS: The best run, using in-context learning, achieved a document-level F1 score of 0.58 on publication abstracts and 0.75 on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance is significantly below the existing approaches. CONCLUSION: Our experiments show that gpt-4.0 surpasses state-of-the-art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
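For readers unfamiliar with the metric, a document-level F1 score compares the set of ontology terms predicted for a document with the gold-standard set, ignoring where in the text each term was mentioned. The sketch below assumes that set-based definition; how the paper aggregates scores across documents is not stated in the abstract, and the example identifiers are chosen only for illustration.

```python
def document_f1(gold, predicted):
    """F1 between the gold and predicted sets of HPO ids for one document."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations for a single document (not taken from the paper).
gold = {"HP:0001250", "HP:0000252", "HP:0001263"}
predicted = {"HP:0001250", "HP:0000252", "HP:0004322"}
print(round(document_f1(gold, predicted), 3))  # prints 0.667
```

A mention-level score would instead count each text span separately, so repeated mentions of the same term weigh more heavily.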
Affiliation(s)
- Tudor Groza
  - Rare Care Centre, Perth Children's Hospital, 15 Hospital Avenue, Nedlands, WA, 6009, Australia
  - Telethon Kids Institute, 15 Hospital Avenue, Nedlands, WA, 6009, Australia
  - School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Kent St, Bentley, WA, 6102, Australia
  - SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore, 169609, Singapore
- Harry Caufield
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Dylan Gration
  - Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA, 6008, Australia
- Gareth Baynam
  - Rare Care Centre, Perth Children's Hospital, 15 Hospital Avenue, Nedlands, WA, 6009, Australia
  - Telethon Kids Institute, 15 Hospital Avenue, Nedlands, WA, 6009, Australia
  - Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA, 6008, Australia
  - Faculty of Health and Medical Sciences, University of Western Australia, 35 Stirling Hwy, Crawley, WA, 6009, Australia
- Melissa A Haendel
  - University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT, 06032, USA
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Justin T Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
17
Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns (N Y) 2024; 5:100887. [PMID: 38264716 PMCID: PMC10801236 DOI: 10.1016/j.patter.2023.100887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 10/25/2023] [Accepted: 11/06/2023] [Indexed: 01/25/2024]
Abstract
To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models, PhenoBCBERT and PhenoGPT, for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models to automate the detection of phenotype terms, including those not in the current HPO. We compared these models with PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also show strong performance in case studies on biomedical literature. We evaluate the strengths and weaknesses of BERT- and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.
Affiliation(s)
- Jingye Yang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Wendy Deng
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Da Wu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Yunyun Zhou
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
18
Weissenbacher D, Rawal S, Zhao X, Priestley JRC, Szigety KM, Schmidt SF, Higgins MJ, Magge A, O'Connor K, Gonzalez-Hernandez G, Campbell IM. PhenoID, a language model normalizer of physical examinations from genetics clinical notes. medRxiv 2024:2023.10.16.23296894. [PMID: 37904943 PMCID: PMC10614999 DOI: 10.1101/2023.10.16.23296894] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/02/2023]
Abstract
Background: Phenotypes identified during dysmorphology physical examinations are critical to genetic diagnosis and nearly universally documented as free-text in the electronic health record (EHR). Variation in how phenotypes are recorded in free-text makes large-scale computational analysis extremely challenging. Existing natural language processing (NLP) approaches to address phenotype extraction are trained largely on the biomedical literature or on case vignettes rather than actual EHR data. Methods: We implemented a tailored system at the Children's Hospital of Philadelphia that allows clinicians to document dysmorphology physical exam findings. From the underlying data, we manually annotated a corpus of 3136 organ system observations using the Human Phenotype Ontology (HPO). We provide this corpus publicly. We trained a transformer-based NLP system to identify HPO terms from exam observations. The pipeline includes an extractor, which identifies tokens in the sentence expected to contain an HPO term, and a normalizer, which uses those tokens together with the original observation to determine the specific term mentioned. Findings: We find that our labeler and normalizer NLP pipeline, which we call PhenoID, achieves state-of-the-art performance for the dysmorphology physical exam phenotype extraction task. PhenoID's performance on the test set was 0.717, compared to the nearest baseline system (PhenoTagger) performance of 0.633. An analysis of our system's normalization errors shows possible imperfections in the HPO terminology itself but also reveals a lack of semantic understanding by our transformer models. Interpretation: Transformer-based NLP models are a promising approach to genetic phenotype extraction and, with recent development of larger pre-trained causal language models, may improve semantic understanding in the future. We believe our results also have direct applicability to more general extraction of medical signs and symptoms. Funding: US National Institutes of Health.
19
Groza T, Wu H, Dinger ME, Danis D, Hilton C, Bagley A, Davids JR, Luo L, Lu Z, Robinson PN. Term-BLAST-like alignment tool for concept recognition in noisy clinical texts. Bioinformatics 2023; 39:btad716. [PMID: 38001031 PMCID: PMC10710372 DOI: 10.1093/bioinformatics/btad716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 10/20/2023] [Accepted: 11/23/2023] [Indexed: 11/26/2023] Open
Abstract
MOTIVATION: Methods for concept recognition (CR) in clinical texts have largely been tested on abstracts or articles from the medical literature. However, texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other nonstandard ways of representing clinical concepts. RESULTS: Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT), leverages a gold standard corpus for typographical errors to implement a sequence alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely used tools. Experimental results show a 10% increase in recall on scientific publications and a 20% increase in recall on EHR records (when compared against the next best method), hence supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches. AVAILABILITY AND IMPLEMENTATION: Fenominal is a Java library that implements TBLAT for named CR of Human Phenotype Ontology terms and is available at https://github.com/monarch-initiative/fenominal under the GNU General Public License v3.0.
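The k-mer screening step mentioned in this abstract can be illustrated with a small sketch. It is a simplification under stated assumptions: it compares sets of character 3-mers rather than k-mer counts, omits the spelling-error scoring that TBLAT derives from clinical notes, and uses hypothetical names (`kmers`, `screen`, `TERMS`) that are not part of the Fenominal library.

```python
def kmers(text, k=3):
    """Return the set of overlapping character k-mers in a lower-cased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def screen(query, terms, k=3, min_overlap=0.5):
    """Keep terms whose k-mers are at least min_overlap covered by the query."""
    query_kmers = kmers(query, k)
    candidates = []
    for term in terms:
        term_kmers = kmers(term, k)
        if not term_kmers:
            continue  # term shorter than k characters
        overlap = len(query_kmers & term_kmers) / len(term_kmers)
        if overlap >= min_overlap:
            candidates.append((term, round(overlap, 2)))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Illustrative term list; "scoliosos" is a deliberately misspelled query.
TERMS = ["scoliosis", "seizure", "spasticity"]
print(screen("scoliosos", TERMS))  # prints [('scoliosis', 0.71)]
```

The misspelled query still shares five of the seven 3-mers of "scoliosis", so the correct term survives the screen while unrelated terms are dropped before any more expensive scoring.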
Affiliation(s)
- Tudor Groza
  - Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia
  - Genetics and Rare Diseases Program, Telethon Kids Institute, Nedlands, WA 6009, Australia
- Honghan Wu
  - Institute of Health Informatics, University College London, London WC1E 6BT, United Kingdom
- Marcel E Dinger
  - Pryzm Health, Sydney, NSW 2089, Australia
  - School of Life and Environmental Sciences, Faculty of Science, University of Sydney, NSW 2006, Australia
- Daniel Danis
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
- Coleman Hilton
  - Shriners Children’s Corporate Headquarters, Tampa, FL 33607, United States
- Anita Bagley
  - Shriners Children's Northern California, Sacramento, CA 95817, United States
- Jon R Davids
  - Shriners Children's Northern California, Sacramento, CA 95817, United States
- Ling Luo
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, United States
20
Aradhya S, Facio FM, Metz H, Manders T, Colavin A, Kobayashi Y, Nykamp K, Johnson B, Nussbaum RL. Applications of artificial intelligence in clinical laboratory genomics. Am J Med Genet C Semin Med Genet 2023; 193:e32057. [PMID: 37507620 DOI: 10.1002/ajmg.c.32057] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/15/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 07/30/2023]
Abstract
The transition from analog to digital technologies in clinical laboratory genomics is ushering in an era of "big data" in ways that will exceed human capacity to rapidly and reproducibly analyze those data using conventional approaches. Accurately evaluating complex molecular data to facilitate timely diagnosis and management of genomic disorders will require supportive artificial intelligence methods. These are already being introduced into clinical laboratory genomics to identify variants in DNA sequencing data, predict the effects of DNA variants on protein structure and function to inform clinical interpretation of pathogenicity, link phenotype ontologies to genetic variants identified through exome or genome sequencing to help clinicians reach diagnostic answers faster, correlate genomic data with tumor staging and treatment approaches, utilize natural language processing to identify critical published medical literature during analysis of genomic data, and use interactive chatbots to identify individuals who qualify for genetic testing or to provide pre-test and post-test education. With careful and ethical development and validation of artificial intelligence for clinical laboratory genomics, these advances are expected to significantly enhance the abilities of geneticists to translate complex data into clearly synthesized information for clinicians to use in managing the care of their patients at scale.
Affiliation(s)
- Swaroop Aradhya
  - Invitae Corporation, San Francisco, California, USA
  - Adjunct Clinical Faculty, Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Hillery Metz
  - Invitae Corporation, San Francisco, California, USA
- Toby Manders
  - Invitae Corporation, San Francisco, California, USA
- Keith Nykamp
  - Invitae Corporation, San Francisco, California, USA
- Robert L Nussbaum
  - Invitae Corporation, San Francisco, California, USA
  - Volunteer Faculty, School of Medicine, University of California San Francisco, San Francisco, California, USA
21
Feng Y, Qi L, Tian W. PhenoBERT: A Combined Deep Learning Method for Automated Recognition of Human Phenotype Ontology. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1269-1277. [PMID: 35471885 DOI: 10.1109/tcbb.2022.3170301] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 05/04/2023]
Abstract
Automated recognition of Human Phenotype Ontology (HPO) terms from clinical texts is of significant interest to the field of clinical data mining. In this study, we develop a combined deep learning method named PhenoBERT for this purpose. PhenoBERT uses BERT, currently the state-of-the-art NLP model, as its core model for evaluating whether a clinically relevant text segment (CTS) could be represented by an HPO term. However, to avoid unnecessary comparison of a CTS with each of ∼14,000 HPO terms using BERT, we introduce a two-level CNN module consisting of a series of CNN models organized at two levels in PhenoBERT. For a given CTS, the CNN module produces only a short list of candidate HPO terms for BERT to evaluate, significantly improving the computational efficiency. In addition, BERT is able to assign an ancestor HPO term to a CTS when recognition of the direct HPO term is not successful, mimicking the process of HPO term assignment by humans. In two benchmark tests, PhenoBERT outperforms four traditional dictionary-based methods and two recently developed deep learning-based methods, and its advantage is more obvious when the recognition task is more challenging. As such, PhenoBERT is of great use for assisting in the mining of clinical text data.
22
Bhalla D, Steijaert MN, Poppelaars ES, Teunis M, van der Voet M, Corradi M, Dévière E, Noothout L, Tomassen W, Rooseboom M, Currie RA, Krul C, Pieters R, van Noort V, Wildwater M. DARTpaths, an in silico platform to investigate molecular mechanisms of compounds. Bioinformatics 2023; 39:6883905. [PMID: 36477801 PMCID: PMC9825785 DOI: 10.1093/bioinformatics/btac767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 10/11/2022] [Accepted: 12/07/2022] [Indexed: 12/13/2022] Open
Abstract
SUMMARY: Xpaths is a collection of algorithms that predicts compound-induced molecular mechanisms of action by integrating phenotypic endpoints from different species, and that proposes follow-up tests in model organisms to validate these pathway predictions. The Xpaths algorithms are applied to predict developmental and reproductive toxicity (DART) and are implemented in an in silico platform called DARTpaths. AVAILABILITY AND IMPLEMENTATION: All code is available on GitHub at https://github.com/Xpaths/dartpaths-app under the Apache License 2.0; a detailed overview with a demo is available at https://www.vivaltes.com/dartpaths/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marie Corradi
- Innovative Testing in Life Sciences & Chemistry, Utrecht University of Applied Sciences, Utrecht 3584 CH, The Netherlands
- Martijn Rooseboom
- Toxicology Group, Shell Global Solutions International B.V., The Hague 2596 HR, the Netherlands
- Richard A Currie
- Predictive and Computational Toxicology, Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- Cyrille Krul
- Innovative Testing in Life Sciences & Chemistry, Utrecht University of Applied Sciences, Utrecht 3584 CH, The Netherlands
- Raymond Pieters
- Innovative Testing in Life Sciences & Chemistry, Utrecht University of Applied Sciences, Utrecht 3584 CH, The Netherlands
- Utrecht University, Institute for Risk Assessment Sciences, Utrecht 3584 CM, The Netherlands
23
Liu C, Ta CN, Havrilla JM, Nestor JG, Spotnitz ME, Geneslaw AS, Hu Y, Chung WK, Wang K, Weng C. OARD: Open annotations for rare diseases and their phenotypes based on real-world data. Am J Hum Genet 2022; 109:1591-1604. [PMID: 35998640 PMCID: PMC9502051 DOI: 10.1016/j.ajhg.2022.08.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/03/2022] [Accepted: 08/01/2022] [Indexed: 11/23/2022]
Abstract
Diagnosis of rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalence derived by OARD is highly correlated with that annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% were confirmed as true associations via manual review of a list of sampled pairs. Compared to the manually curated annotation, OARD is 100% data driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.
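The abstract does not say which association statistic OARD uses. As a hedged illustration of how a disease-phenotype association can be computed from aggregated counts alone (without patient-level records), the Python sketch below uses an odds ratio over a 2x2 table. The `PairCounts` type, the function name, and the 0.5 pseudocount (a Haldane-Anscombe style correction) are assumptions for this example, not part of OARD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCounts:
    """Aggregated patient counts for one disease-phenotype pair (hypothetical schema)."""
    both: int            # patients with the disease and the phenotype
    disease_only: int    # patients with the disease but not the phenotype
    phenotype_only: int  # patients with the phenotype but not the disease
    neither: int         # patients with neither


def odds_ratio(counts: PairCounts, pseudocount: float = 0.5) -> float:
    """Odds ratio of the 2x2 table; the pseudocount guards against empty cells."""
    a = counts.both + pseudocount
    b = counts.disease_only + pseudocount
    c = counts.phenotype_only + pseudocount
    d = counts.neither + pseudocount
    return (a * d) / (b * c)


# Made-up counts: the phenotype is far more common among patients with the disease.
example = PairCounts(both=30, disease_only=70, phenotype_only=100, neither=9800)
uncorrected = odds_ratio(example, pseudocount=0.0)
```

An odds ratio near 1 indicates no association in the counts; larger values indicate that the phenotype is enriched among patients with the disease.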
Affiliation(s)
- Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Casey N Ta
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Jim M Havrilla
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Jordan G Nestor
- Division of Nephrology, Department of Medicine, Columbia University, New York, NY 10032, USA
- Matthew E Spotnitz
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Andrew S Geneslaw
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Yu Hu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Wendy K Chung
- Department of Pediatrics, Columbia University Irving Medical Center, New York, NY 10032, USA
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA.
24
Mathew MT, Antoniou A, Ramesh N, Hu M, Gaither J, Mouhlas D, Hashimoto S, Humphrey M, Matthews T, Hunter JM, Reshmi S, Schultz M, Lee K, Pfau R, Cottrell C, McBride KL, Navin NE, Chaudhari BP, Leung ML. A Decade's Experience in Pediatric Chromosomal Microarray Reveals Distinct Characteristics Across Ordering Specialties. J Mol Diagn 2022; 24:1031-1040. [PMID: 35718094 DOI: 10.1016/j.jmoldx.2022.06.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/12/2022] [Revised: 05/04/2022] [Accepted: 06/02/2022] [Indexed: 11/27/2022]
Abstract
Chromosomal microarray (CMA) is a testing modality frequently used in pediatric patients; however, previously published data on its utilization are limited to the genetic setting. Herein, we performed a database search for all CMA testing performed from 2010 to 2020, and delineated the diagnostic yield based on patient characteristics, including sex, age, clinical specialty of providers, indication for testing, and pathogenic finding. The indications for testing were further categorized into Human Phenotype Ontology categories for analysis. This study included a cohort of 14,541 patients from 29 different medical specialties, of whom 30% were from the genetics clinic. The clinical indications for testing suggested that neonatology patients demonstrated the greatest involvement of multiorgan systems, involving the most Human Phenotype Ontology categories, whereas developmental behavioral pediatrics and neurology patients involved the fewest. The top pathogenic findings for each specialty differed, likely due to the varying clinical features and indications for testing. Deletions involving the 22q11.21 locus were the top pathogenic findings for patients presenting to genetics, neonatology, cardiology, and surgery. Our data represent the largest pediatric cohort published to date. This study is the first to demonstrate the diagnostic utility of this assay for patients seen in the setting of different specialties, and it provides normative data of CMA results among a general pediatric population referred for testing because of variable clinical presentations.
Affiliation(s)
- Mariam T Mathew
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Austin Antoniou
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio
- Naveen Ramesh
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Min Hu
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Jeffrey Gaither
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio
- Danielle Mouhlas
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Sayaka Hashimoto
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Maggie Humphrey
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Theodora Matthews
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Jesse M Hunter
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Shalini Reshmi
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Matthew Schultz
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Kristy Lee
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Ruthann Pfau
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Catherine Cottrell
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio
- Kim L McBride
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio; Division of Genetics and Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Center for Cardiovascular Research, Nationwide Children's Hospital, Columbus, Ohio
- Nicholas E Navin
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas; Department of Bioinformatics, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Bimal P Chaudhari
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio; Division of Genetics and Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Division of Neonatology, Nationwide Children's Hospital, Columbus, Ohio
- Marco L Leung
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio; Department of Pathology, The Ohio State University College of Medicine, Columbus, Ohio; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio.
25
Yates T, Lain A, Campbell J, FitzPatrick DR, Simpson TI. Creation and evaluation of full-text literature-derived, feature-weighted disease models of genetically determined developmental disorders. Database (Oxford) 2022; 2022:baac038. [PMID: 35670729 PMCID: PMC9216525 DOI: 10.1093/database/baac038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Revised: 03/26/2022] [Accepted: 05/25/2022] [Indexed: 11/24/2022]
Abstract
There are >2500 different genetically determined developmental disorders (DD), which, as a group, show very high levels of both locus and allelic heterogeneity. This has led to the widespread use of evidence-based filtering of genome-wide sequence data as a diagnostic tool in DD. Determining whether the association of a filtered variant at a specific locus is a plausible explanation of the phenotype in the proband is crucial and commonly requires extensive manual literature review by both clinical scientists and clinicians. Access to a database of weighted clinical features extracted from rigorously curated literature would increase the efficiency of this process and facilitate the development of robust phenotypic similarity metrics. However, given the large and rapidly increasing volume of published information, conventional biocuration approaches are becoming impractical. Here, we present a scalable, automated method for the extraction of categorical phenotypic descriptors from the full-text literature. Papers identified through literature review were downloaded and parsed using the Cadmus custom retrieval package. Human Phenotype Ontology terms were extracted using MetaMap, with 76-84% precision and 65-73% recall. Mean terms per paper increased from 9 using title and abstract to 68 using full text. We demonstrate that these literature-derived disease models plausibly reflect true disease expressivity more accurately than widely used manually curated models, through comparison with prospectively gathered data from the Deciphering Developmental Disorders study. The area under the curve for receiver operating characteristic (ROC) curves increased by 5-10% through the use of literature-derived models. This work shows that scalable automated literature curation increases performance and adds weight to the need for this strategy to be integrated into informatic variant analysis pipelines. Database URL: https://doi.org/10.1093/database/baac038.
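For readers unfamiliar with how precision and recall figures like those above are obtained for term extraction, the following Python sketch compares a set of extracted terms against a manually annotated gold set for one paper. The term IDs are invented placeholders and the function name is an assumption; the abstract does not describe the authors' evaluation code.

```python
from __future__ import annotations


def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision = correct / extracted; recall = correct / gold-standard terms."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented placeholder IDs: six gold terms, five extracted, four of them correct.
gold_terms = {"TOY:1", "TOY:2", "TOY:3", "TOY:4", "TOY:5", "TOY:6"}
extracted_terms = {"TOY:1", "TOY:2", "TOY:3", "TOY:4", "TOY:9"}
precision, recall = precision_recall(extracted_terms, gold_terms)
```

Here four of five extracted terms are correct (precision 0.8) and four of six gold terms are recovered (recall of about 0.67).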
Affiliation(s)
- T M Yates
- MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
- Transforming Genetic Medicine Initiative, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- A Lain
- Institute for Adaptive and Neural Computation, Informatics Forum, The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
- J Campbell
- MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
- Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
- D R FitzPatrick
- MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
- Transforming Genetic Medicine Initiative, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
- T I Simpson
- Institute for Adaptive and Neural Computation, Informatics Forum, The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
- Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
26
Yan S, Luo L, Lai PT, Veltri D, Oler AJ, Xirasagar S, Ghosh R, Similuk M, Robinson PN, Lu Z. PhenoRerank: A re-ranking model for phenotypic concept recognition pre-trained on human phenotype ontology. J Biomed Inform 2022; 129:104059. [PMID: 35351638 PMCID: PMC11040548 DOI: 10.1016/j.jbi.2022.104059] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/20/2021] [Revised: 02/23/2022] [Accepted: 03/22/2022] [Indexed: 11/29/2022]
Abstract
The study aims to develop a neural network model to improve the performance of Human Phenotype Ontology (HPO) concept recognition tools. We used the terms, definitions, and comments about the phenotypic concepts in the HPO database to train our model. The document to be analyzed is first split into sentences and annotated with a base method to generate candidate concepts. The sentences, along with the candidate concepts, are then fed into the pre-trained model for re-ranking. Our model comprises the pre-trained BlueBERT and a feature selection module, followed by a contrastive loss. We re-ranked the results generated by three robust HPO annotation tools and compared the performance against most of the existing approaches. The experimental results show that our model can improve the performance of the existing methods. Notably, it improved the F1 score by 3.0% and 5.6% on the two evaluated datasets compared with the base methods. It removed more than 80% of the false positives predicted by the base methods, resulting in up to 18% improvement in precision. Our model utilizes the descriptive data in the ontology and the contextual information in the sentences for re-ranking. The results indicate that the additional information and the re-ranking model can significantly enhance the precision of HPO concept recognition compared with the base method.
Affiliation(s)
- Shankai Yan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Daniel Veltri
- Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Andrew J Oler
- Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Sandhya Xirasagar
- Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Rajarshi Ghosh
- Centralized Sequencing Program, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Morgan Similuk
- Centralized Sequencing Program, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.