1
|
Thayer DS, Mumtaz S, Elmessary MA, Scanlon I, Zinnurov A, Coldea AI, Scanlon J, Chapman M, Curcin V, John A, DelPozo-Banos M, Davies H, Karwath A, Gkoutos GV, Fitzpatrick NK, Quint JK, Varma S, Milner C, Oliveira C, Parkinson H, Denaxas S, Hemingway H, Jefferson E. Creating a next-generation phenotype library: the health data research UK Phenotype Library. JAMIA Open 2024; 7:ooae049. [PMID: 38895652 PMCID: PMC11182945 DOI: 10.1093/jamiaopen/ooae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2023] [Revised: 02/12/2024] [Accepted: 05/20/2024] [Indexed: 06/21/2024] Open
Abstract
Objective To enable reproducible research at scale by creating a platform that enables health data users to find, access, curate, and re-use electronic health record phenotyping algorithms. Materials and Methods We undertook a structured approach to identifying requirements for a phenotype algorithm platform by engaging with key stakeholders. User experience analysis was used to inform the design, which we implemented as a web application featuring a novel metadata standard for defining phenotyping algorithms, access via Application Programming Interface (API), support for computable data flows, and version control. The application has creation and editing functionality, enabling researchers to submit phenotypes directly. Results We created and launched the Phenotype Library in October 2021. The platform currently hosts 1049 phenotype definitions defined against 40 health data sources and >200K terms across 16 medical ontologies. We present several case studies demonstrating its utility for supporting and enabling research: the library hosts curated phenotype collections for the BREATHE respiratory health research hub and the Adolescent Mental Health Data Platform, and it is supporting the development of an informatics tool to generate clinical evidence for clinical guideline development groups. Discussion This platform makes an impact by being open to all health data users and accepting all appropriate content, as well as implementing key features that have not been widely available, including managing structured metadata, access via an API, and support for computable phenotypes. Conclusions We have created the first openly available, programmatically accessible resource enabling the global health research community to store and manage phenotyping algorithms. Removing barriers to describing, sharing, and computing phenotypes will help unleash the potential benefit of health data for patients and the public.
Collapse
Affiliation(s)
- Daniel S Thayer
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Shahzad Mumtaz
- Health Informatics Centre, School of Medicine, University of Dundee, Dundee, DD1 9SY, United Kingdom
- School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, AB24 3UE, United Kingdom
| | - Muhammad A Elmessary
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Ieuan Scanlon
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Artur Zinnurov
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Alex-Ioan Coldea
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Jack Scanlon
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Martin Chapman
- Department of Population Health Sciences, King’s College London, London, SE1 1UL, United Kingdom
| | - Vasa Curcin
- Department of Population Health Sciences, King’s College London, London, SE1 1UL, United Kingdom
| | - Ann John
- Adolescent Mental Health Data Platform and DATAMIND, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Marcos DelPozo-Banos
- Adolescent Mental Health Data Platform and DATAMIND, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Hannah Davies
- SAIL Databank, Medical School, Swansea University, Swansea, SA2 8PP, United Kingdom
| | - Andreas Karwath
- Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
| | - Georgios V Gkoutos
- Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
| | - Natalie K Fitzpatrick
- Institute of Health Informatics, University College London, London, NW1 2DA, United Kingdom
| | - Jennifer K Quint
- School of Public Health and National Heart and Lung Institute, Imperial College London, London, W12 0BZ, United Kingdom
| | - Susheel Varma
- Health Data Research United Kingdom, London, NW1 2BE, United Kingdom
| | - Chris Milner
- Health Data Research United Kingdom, London, NW1 2BE, United Kingdom
| | - Carla Oliveira
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Welcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Welcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Spiros Denaxas
- Institute of Health Informatics, University College London, London, NW1 2DA, United Kingdom
- University College London Hospitals National Institute of Health Research Biomedical Research Centre, London, NW1 2BU, United Kingdom
- British Heart Foundation Data Science Center, Health Data Research United Kingdom, London, NW1 2BE, United Kingdom
| | - Harry Hemingway
- Institute of Health Informatics, University College London, London, NW1 2DA, United Kingdom
- University College London Hospitals National Institute of Health Research Biomedical Research Centre, London, NW1 2BU, United Kingdom
| | - Emily Jefferson
- Health Informatics Centre, School of Medicine, University of Dundee, Dundee, DD1 9SY, United Kingdom
- Health Data Research United Kingdom, London, NW1 2BE, United Kingdom
| |
Collapse
|
2
|
Li Y, Yang AY, Marelli A, Li Y. MixEHR-SurG: A joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records. J Biomed Inform 2024; 153:104638. [PMID: 38631461 DOI: 10.1016/j.jbi.2024.104638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Revised: 03/07/2024] [Accepted: 04/03/2024] [Indexed: 04/19/2024]
Abstract
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as mortality or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8211 subjects with 75,187 outpatient claim records of 1767 unique ICD codes; the MIMIC-III consisting of 1458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
Collapse
Affiliation(s)
- Yixuan Li
- Department of Mathematics and Statistics, McGill University, Montreal, Canada; Mila - Quebec AI institute, Montreal, Canada
| | - Archer Y Yang
- Department of Mathematics and Statistics, McGill University, Montreal, Canada; Mila - Quebec AI institute, Montreal, Canada; School of Computer Science, McGill University, Montreal, Canada.
| | - Ariane Marelli
- McGill Adult Unit for Congenital Heart Disease (MAUDE Unit), McGill University of Health Centre, Montreal, Canada.
| | - Yue Li
- Mila - Quebec AI institute, Montreal, Canada; School of Computer Science, McGill University, Montreal, Canada.
| |
Collapse
|
4
|
Swapna LS, Huang M, Li Y. GTM-decon: guided-topic modeling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes. Genome Biol 2023; 24:190. [PMID: 37596691 PMCID: PMC10436670 DOI: 10.1186/s13059-023-03034-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2022] [Accepted: 08/09/2023] [Indexed: 08/20/2023] Open
Abstract
Cell-type composition is an important indicator of health. We present Guided Topic Model for deconvolution (GTM-decon) to automatically infer cell-type-specific gene topic distributions from single-cell RNA-seq data for deconvolving bulk transcriptomes. GTM-decon performs competitively on deconvolving simulated and real bulk data compared with the state-of-the-art methods. Moreover, as demonstrated in deconvolving disease transcriptomes, GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels from single-cell or bulk data to infer phenotype-specific gene distributions. In a nested-guided design, GTM-decon identified cell-type-specific differentially expressed genes from bulk breast cancer transcriptomes.
Collapse
Affiliation(s)
| | - Michael Huang
- School of Computer Science, McGill University, Montreal, QC, Canada
| | - Yue Li
- School of Computer Science, McGill University, Montreal, QC, Canada.
| |
Collapse
|
5
|
Zhang Y, Jiang X, Mentzer AJ, McVean G, Lunter G. Topic modeling identifies novel genetic loci associated with multimorbidities in UK Biobank. CELL GENOMICS 2023; 3:100371. [PMID: 37601973 PMCID: PMC10435382 DOI: 10.1016/j.xgen.2023.100371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/11/2022] [Revised: 05/04/2023] [Accepted: 07/07/2023] [Indexed: 08/22/2023]
Abstract
Many diseases show patterns of co-occurrence, possibly driven by systemic dysregulation of underlying processes affecting multiple traits. We have developed a method (treeLFA) for identifying such multimorbidities from routine health-care data, which combines topic modeling with an informative prior derived from medical ontology. We apply treeLFA to UK Biobank data and identify a variety of topics representing multimorbidity clusters, including a healthy topic. We find that loci identified using topic weights as traits in a genome-wide association study (GWAS) analysis, which we validated with a range of approaches, only partially overlap with loci from GWASs on constituent single diseases. We also show that treeLFA improves upon existing methods like latent Dirichlet allocation in various ways. Overall, our findings indicate that topic models can characterize multimorbidity patterns and that genetic analysis of these patterns can provide insight into the etiology of complex traits that cannot be determined from the analysis of constituent traits alone.
Collapse
Affiliation(s)
- Yidong Zhang
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Chinese Academy of Medical Sciences Oxford Institute, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7BN, UK
- Department of Radiation Oncology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100006, China
| | - Xilin Jiang
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7BN, UK
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge CB2 0SR, UK
- Heart and Lung Research Institute, University of Cambridge, Cambridge CB2 0BB, UK
| | - Alexander J. Mentzer
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7BN, UK
| | - Gil McVean
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
| | - Gerton Lunter
- MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford OX3 9DS, UK
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen 9700 RB, the Netherlands
| |
Collapse
|
6
|
Zou Y, Pesaranghader A, Song Z, Verma A, Buckeridge DL, Li Y. Modeling electronic health record data using an end-to-end knowledge-graph-informed topic model. Sci Rep 2022; 12:17868. [PMID: 36284225 PMCID: PMC9596500 DOI: 10.1038/s41598-022-22956-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Accepted: 10/21/2022] [Indexed: 01/20/2023] Open
Abstract
The rapid growth of electronic health record (EHR) datasets opens up promising opportunities to understand human diseases in a systematic way. However, effective extraction of clinical knowledge from EHR data has been hindered by the sparse and noisy information. We present Graph ATtention-Embedded Topic Model (GAT-ETM), an end-to-end taxonomy-knowledge-graph-based multimodal embedded topic model. GAT-ETM distills latent disease topics from EHR data by learning the embedding from a constructed medical knowledge graph. We applied GAT-ETM to a large-scale EHR dataset consisting of over 1 million patients. We evaluated its performance based on topic quality, drug imputation, and disease diagnosis prediction. GAT-ETM demonstrated superior performance over the alternative methods on all tasks. Moreover, GAT-ETM learned clinically meaningful graph-informed embedding of the EHR codes and discovered interpretable and accurate patient representations for patient stratification and drug recommendations. GAT-ETM code is available at https://github.com/li-lab-mcgill/GAT-ETM .
Collapse
Affiliation(s)
- Yuesong Zou
- grid.14709.3b0000 0004 1936 8649School of Computer Science, McGill University, Montreal, Canada
| | - Ahmad Pesaranghader
- grid.14709.3b0000 0004 1936 8649School of Computer Science, McGill University, Montreal, Canada
| | - Ziyang Song
- grid.14709.3b0000 0004 1936 8649School of Computer Science, McGill University, Montreal, Canada
| | - Aman Verma
- grid.14709.3b0000 0004 1936 8649School of Population and Global Health, McGill University, Montreal, Canada
| | - David L. Buckeridge
- grid.14709.3b0000 0004 1936 8649School of Population and Global Health, McGill University, Montreal, Canada
| | - Yue Li
- grid.14709.3b0000 0004 1936 8649School of Computer Science, McGill University, Montreal, Canada
| |
Collapse
|