1. Jordan DM, Vy HMT, Do R. A deep learning transformer model predicts high rates of undiagnosed rare disease in large electronic health systems. medRxiv [Preprint] 2023:2023.12.21.23300393. [PMID: 38196638] [PMCID: PMC10775679] [DOI: 10.1101/2023.12.21.23300393]
Abstract
It is estimated that as many as 1 in 16 people worldwide suffer from rare diseases. Rare disease patients face difficulty finding diagnosis and treatment for their conditions, including long diagnostic odysseys, multiple incorrect diagnoses, and unavailable or prohibitively expensive treatments. As a result, it is likely that large electronic health record (EHR) systems include high numbers of participants suffering from undiagnosed rare disease. While this has been shown in detail for specific diseases, such studies are expensive and time consuming and have been feasible for only a handful of the thousands of known rare diseases. The bulk of these undiagnosed cases are effectively hidden, with no straightforward way to differentiate them from healthy controls. The ability to access them at scale would enormously expand our capacity to study and develop drugs for rare diseases, adding to tools aimed at increasing the availability of study cohorts for rare disease. In this study, we train a deep learning transformer algorithm, RarePT (Rare-Phenotype Prediction Transformer), to impute undiagnosed rare disease from EHR diagnosis codes in 436,407 UK Biobank participants and validate it on an independent cohort of 3,333,560 individuals from the Mount Sinai Health System. We applied our model to 155 rare diagnosis codes with fewer than 250 cases each in the UK Biobank and predicted participants with elevated risk for each diagnosis, with the number of participants predicted to be at risk ranging from 85 to 22,000 across diagnoses. These risk predictions are significantly associated with increased mortality for 65% of diagnoses, with disease burden expressed as disability-adjusted life years (DALY) for 73% of diagnoses, and with 72% of available disease-specific diagnostic tests. They are also highly enriched for known rare diagnoses in patients not included in the training set, with an odds ratio (OR) of 48.0 in cross-validation cohorts of the UK Biobank and an OR of 30.6 in the independent Mount Sinai Health System cohort. Most importantly, RarePT successfully screens for undiagnosed patients in 32 rare diseases with available diagnostic tests in the UK Biobank. Using the trained model to estimate the prevalence of undiagnosed disease in the UK Biobank for these 32 rare phenotypes, we find that at least 50% of patients remain undiagnosed for 20 of the 32 diseases. These estimates provide empirical evidence of a high prevalence of undiagnosed rare disease and demonstrate the enormous potential benefit of using RarePT to screen for undiagnosed rare disease patients in large electronic health systems.
Affiliation(s)
- Daniel M. Jordan
- Center for Genomic Data Analytics, Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ha My T. Vy
- Center for Genomic Data Analytics, Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ron Do
- Center for Genomic Data Analytics, Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
2. Stellmach C, Sass J, Auber B, Boeker M, Wienker T, Heidel AJ, Benary M, Schumacher S, Ossowski S, Klauschen F, Möller Y, Schmutzler R, Ustjanzew A, Werner P, Tomczak A, Hölter T, Thun S. Creation of a structured molecular genomics report for Germany as a local adaption of HL7's Genomic Reporting Implementation Guide. J Am Med Inform Assoc 2023;30:1179-1189. [PMID: 37080557] [DOI: 10.1093/jamia/ocad061]
Abstract
OBJECTIVE The objective was to develop a dataset definition, information model, and FHIR® specification for key data elements contained in a German molecular genomics (MolGen) report to facilitate genomic and phenotype integration in electronic health records. MATERIALS AND METHODS A dedicated expert group participating in the German Medical Informatics Initiative reviewed information contained in MolGen reports, determined the key elements, and formulated a dataset definition. HL7's Genomics Reporting Implementation Guide (IG) was adopted as a basis for the FHIR specification, which was subjected to a public ballot. In addition, elements in the MolGen dataset were mapped to the fields defined in the ISO/TS 20428:2017 standard to evaluate compliance. RESULTS A core dataset of 76 data elements, clustered into 6 categories, was created to represent all key information of German MolGen reports. Based on this, a FHIR specification with 16 profiles was developed: 14 derived from HL7's Genomics Reporting IG and 2 additional profiles (of the FamilyMemberHistory and RiskAssessment resources). Five example resource bundles show how our adaptation of an international standard can be used to model MolGen report data requested following oncological or rare disease indications. Furthermore, mapping the MolGen report data elements to the fields defined by the ISO/TS 20428:2017 standard confirmed the presence of the majority of required fields. CONCLUSIONS Our report serves as a template for other research initiatives attempting to create a standard format for unstructured genomic report data. Use of standard formats facilitates integration of genomic data into electronic health records for clinical decision support.
Affiliation(s)
- Caroline Stellmach
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Julian Sass
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Bernd Auber
- Department of Human Genetics, Hannover Medical School, Hannover, Germany
- Martin Boeker
- Fakultät für Medizin, Technische Universität München, Munich, Germany
- Thomas Wienker
- Emeritus Ropers, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Manuela Benary
- Core Unit Bioinformatics, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Simon Schumacher
- Medical Data Integration Center (MeDIC), Universitätsklinikum Köln, Cologne, Germany
- Stephan Ossowski
- Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
- Frederick Klauschen
- Institut für Pathologie, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Pathologisches Institut, Ludwig-Maximilians-Universität München, Munich, Germany
- Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
- Yvonne Möller
- Center for Personalized Medicine (ZPM), Universitätsklinikum Tübingen, Tübingen, Germany
- Rita Schmutzler
- Center for Familial Breast and Ovarian Cancer, National Center of Familial Tumor Diseases and Center of Integrated Oncology, Universitätsklinikum Köln, Cologne, Germany
- Arsenij Ustjanzew
- Institut für Medizinische Biometrie, Epidemiologie und Informatik, Universitätsmedizin der Johannes Gutenberg-Universität Mainz, Mainz, Germany
- Aurelie Tomczak
- Liver Cancer Centre Heidelberg, Institute of Pathology, Heidelberg University Hospital, Heidelberg, Germany
- Thimo Hölter
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Sylvia Thun
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
3. Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023;30:367-381. [PMID: 36413056] [PMCID: PMC9846699] [DOI: 10.1093/jamia/ocac216]
Abstract
OBJECTIVE Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
4. Hall AG, Davlyatov GK, Orewa GN, Mehta TS, Feldman SS. Multiple Electronic Health Record-Based Measures of Social Determinants of Health to Predict Return to the Emergency Department Following Discharge. Popul Health Manag 2022;25:771-780. [PMID: 36315199] [DOI: 10.1089/pop.2022.0088]
Abstract
Health care systems continue to struggle with preventing 30-day readmissions to their institutions. Social determinants of health (SDOH) are important predictors of repeat visits to the hospital. In many health systems, SDOH data are limited to the variables most relevant to care delivery or payment (eg, race, gender, insurance status). Despite calls for integrating a more robust set of measures (eg, measures of health behaviors and living conditions) into the electronic health record (EHR), these data often have missing values, necessitating the use of imputation to build a comprehensive picture of patients who are likely to return to the health system. Using logistic regression analyses and imputation of missing data from 2017 to 2018, this study uses measures found in the EHR (eg, tobacco use, living situation, problems at home, education) to assess which SDOH might predict a return to the emergency department within 30 days of discharge from a health system. In both imputed and raw data, the total number of recorded health conditions was the most important predictor, and collectively the SDOH variables made a relatively small contribution to determining the likelihood of a return to the hospital. Although SDOH variables might be important in the design of programs aimed at preventing readmissions, they may not be useful in readmission prediction models.
Affiliation(s)
- Allyson G Hall
- Department of Health Services Administration, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Ganisher K Davlyatov
- Department of Health Administration and Policy, University of Oklahoma Health Sciences Center, Norman, Oklahoma, USA
- Gregory N Orewa
- Department of Health Services Administration, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Tapan S Mehta
- Department of Family and Community Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Sue S Feldman
- Department of Health Services Administration, University of Alabama at Birmingham, Birmingham, Alabama, USA
5. Ruddle RA, Adnan M, Hall M. Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data. BMJ Open 2022;12:e064887. [PMID: 36410820] [PMCID: PMC9680176] [DOI: 10.1136/bmjopen-2022-064887]
Abstract
OBJECTIVES Missing data is the most common data quality issue in electronic health records (EHRs). Missing data checks implemented in common analytical software are typically limited to counting the number of missing values in individual fields, but researchers and organisations also need to understand multifield missing data patterns to better inform advanced missing data strategies, for which counts or numerical summaries are poorly suited. This study shows how set-based visualisation enables multifield missing data patterns to be discovered and investigated. DESIGN Development and evaluation of interactive set visualisation techniques to find patterns of missing data and generate actionable insights. The visualisations comprised easily interpretable bar charts for sets, heatmaps for set intersections and histograms for distributions of both sets and intersections. SETTING AND PARTICIPANTS Anonymised admitted patient care health records for National Health Service (NHS) hospitals and independent sector providers in England. The visualisation and data mining software was run over 16 million records and 86 fields in the dataset. RESULTS The dataset contained 960 million missing values. Set visualisation bar charts showed how those values were distributed across the fields, including several fields that, unexpectedly, were not complete. Set intersection heatmaps revealed unexpected gaps in diagnosis, operation and date fields because diagnosis and operation fields were not filled in sequentially and some operations did not have corresponding dates. Information gain ratio and entropy calculations allowed us to identify the origin of each unexpected pattern in terms of the values of other fields. CONCLUSIONS Our findings show how set visualisation reveals important insights about multifield missing data patterns in large EHR datasets. The study revealed both rare and widespread data quality issues that were previously unknown, and allowed a particular part of a specific hospital to be pinpointed as the origin of rare issues that NHS Digital did not know existed.
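The multifield pattern analysis described in this abstract can be approximated in a few lines of pandas. This is only an illustrative sketch on invented toy data, not the authors' software; the field names (`diag_1`, `diag_2`, `op_date`) are hypothetical:

```python
import pandas as pd
from collections import Counter

# Toy EHR extract; the real HES data has millions of rows and ~86 fields.
df = pd.DataFrame({
    "diag_1": ["A10", None, "B20", "C30"],
    "diag_2": [None, None, "B21", None],
    "op_date": [None, "2020-01-01", None, None],
})

# Each row's missingness "set": which fields are absent together.
patterns = [tuple(df.columns[mask]) for mask in df.isna().to_numpy()]

# Pattern frequencies: the counts behind a set-intersection heatmap
# such as the ones described in the study.
counts = Counter(patterns)
for pattern, n in counts.most_common():
    print(pattern, n)
```

Ranking patterns by frequency is what separates widespread data quality issues from the rare ones that single-field missingness counts cannot surface.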
Affiliation(s)
- Roy A Ruddle
- School of Computing and Leeds Institute for Data Analytics, University of Leeds, Leeds, UK
- Muhammad Adnan
- Computer Science, Higher Colleges of Technology, Sharjah, UAE
- Marlous Hall
- Leeds Institute of Cardiovascular & Metabolic Medicine and Leeds Institute for Data Analytics, University of Leeds, Leeds, UK
6. Sun Y, Liu F, Zhang Y, Lu Y, Su Z, Ji H, Cheng Y, Song W, Hidru TH, Yang X, Jiang Y. The relationship of endothelial function and arterial stiffness with subclinical target organ damage in essential hypertension. J Clin Hypertens (Greenwich) 2022;24:418-429. [PMID: 35238151] [PMCID: PMC8989756] [DOI: 10.1111/jch.14447]
Abstract
This study aimed to explore whether brachial-ankle pulse wave velocity (baPWV) and brachial artery flow-mediated dilation (FMD), or the interaction of both parameters, are associated with subclinical target organ damage (STOD) indices in patients with essential hypertension. A total of 4618 patients registered from January 2015 to October 2020 were included. baPWV and FMD were measured to evaluate arterial stiffness and endothelial dysfunction, while left ventricular hypertrophy (LVH), urine albumin-creatinine ratio (UACR), and carotid intima-media thickness (CIMT) were obtained as STOD indicators. On multivariable logistic regression analysis adjusted for potential confounders, higher quartiles of baPWV and FMD were significantly associated with an increased risk of STOD. In patients <65 years of age, the odds ratios (OR) of LVH, UACR, and CIMT ≥0.9 mm for the fourth versus the first quartile of baPWV were 1.765 (1.390-2.240), 2.832 (2.014-3.813), and 3.075 (2.315-4.084), respectively. In interaction analysis, increasing baPWV showed a progressively higher risk of STOD across the quartiles of FMD. Also, the estimated absolute risks of LVH, UACR, and CIMT ≥0.9 mm for the first to fourth quartile of baPWV increased from 1.88 to 2.75, 2.35 to 4.44, and 3.10 to 6.10, respectively, in patients grouped by FMD quartiles. The addition of baPWV to FMD slightly improved risk prediction for STOD. Both baPWV and FMD were independently associated with an increased risk of STOD in patients with essential hypertension, especially among patients <65 years of age. Patients with elevated baPWV and decreased FMD are at increased risk of STOD.
Affiliation(s)
- Yancui Sun
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Fei Liu
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Ying Zhang
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Yan Lu
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Zhuolin Su
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Haizhe Ji
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Yunpeng Cheng
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Wei Song
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Tesfaldet H Hidru
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Xiaolei Yang
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
- Yinong Jiang
- Department of Cardiology, First Affiliated Hospital of Dalian Medical University, Dalian, Liaoning Province, China
7. Gianfrancesco MA, Goldstein ND. A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol 2021;21:234. [PMID: 34706667] [PMCID: PMC8549408] [DOI: 10.1186/s12874-021-01416-5]
Abstract
Electronic health records (EHRs) are widely used in epidemiological research, but the validity of the results depends on the assumptions made about the healthcare system, the patient, and the provider. In this review, we identify four overarching challenges in using EHR-based data for epidemiological analysis, with a particular emphasis on threats to validity: the representativeness of the EHR to a target population, the availability of clinical and non-clinical data, the interpretability of those data, and missing data at both the variable and observation levels. Each challenge reveals layers of assumptions that the epidemiologist is required to make, from the point of patient entry into the healthcare system, to the provider documenting the results of the clinical exam and following the patient longitudinally, all with the potential to bias the results of analyses of these data. Understanding the extent of potential biases, as well as remediating them, requires a variety of methodological approaches, from traditional sensitivity analyses and validation studies to newer techniques such as natural language processing. Beyond methods to address these challenges, it will remain crucial for epidemiologists to engage with clinicians and informaticians at their institutions to ensure data quality and accessibility by forming multidisciplinary teams around specific research projects.
Affiliation(s)
- Milena A Gianfrancesco
- Division of Rheumatology, University of California School of Medicine, San Francisco, CA, USA
- Neal D Goldstein
- Department of Epidemiology and Biostatistics, Drexel University Dornsife School of Public Health, 3215 Market St., Philadelphia, PA, 19104, USA
8. Tan Q, Ye M, Ma AJ, Yip TCF, Wong GLH, Yuen PC. Importance-aware personalized learning for early risk prediction using static and dynamic health data. J Am Med Inform Assoc 2021;28:713-726. [PMID: 33496786] [DOI: 10.1093/jamia/ocaa306]
Abstract
OBJECTIVE Accurate risk prediction is important for evaluating early medical treatment effects and improving health care quality. Existing methods are usually designed for dynamic medical data, which require long-term observations. Meanwhile, important personalized static information is ignored due to its underlying uncertainty and unquantifiable ambiguity. An early risk prediction method that can adaptively integrate both static and dynamic health data is urgently needed. MATERIALS AND METHODS Data were from 6367 patients with peptic ulcer bleeding between 2007 and 2016. This article develops a novel End-to-end Importance-Aware Personalized Deep Learning Approach (eiPDLA) to achieve accurate early clinical risk prediction. Specifically, eiPDLA introduces a long short-term memory network with temporal attention to learn sequential dependencies from time-stamped records, while simultaneously incorporating a residual network with correlation attention to capture their influencing relationship with static medical data. Furthermore, a new multi-residual multi-scale network with an importance-aware mechanism is designed to adaptively fuse the learned multisource features, automatically assigning larger weights to important features while weakening the influence of less important ones. RESULTS Extensive experimental results on a real-world dataset illustrate that our method significantly outperforms the state-of-the-art for early risk prediction under various settings (eg, achieving an AUC score of 0.944 when predicting risk 1 year in advance). Case studies indicate that the achieved prediction results are highly interpretable. CONCLUSION These results reflect the importance of combining static and dynamic health data, mining their influencing relationship, and incorporating an importance-aware mechanism to automatically identify important features. The resulting accurate early risk predictions give doctors more time to design effective treatments and improve clinical outcomes.
Affiliation(s)
- Qingxiong Tan
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong
- Mang Ye
- School of Computer Science, Wuhan University, Wuhan, China
- Andy Jinhua Ma
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
- Terry Cheuk-Fung Yip
- Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Grace Lai-Hung Wong
- Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Pong C Yuen
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong
9. Trinder M, Brunham LR. Polygenic scores for dyslipidemia: the emerging genomic model of plasma lipoprotein trait inheritance. Curr Opin Lipidol 2021;32:103-111. [PMID: 33395106] [DOI: 10.1097/mol.0000000000000737]
Abstract
PURPOSE OF REVIEW Contemporary polygenic scores, which summarize the cumulative contribution of millions of common single-nucleotide variants to a phenotypic trait, can have effects comparable to monogenic mutations. This review focuses on the emerging use of 'genome-wide' polygenic scores for plasma lipoproteins to define the etiology of clinical dyslipidemia, modify the severity of monogenic disease, and inform therapeutic options. RECENT FINDINGS Polygenic scores for low-density lipoprotein cholesterol (LDL-C), triglycerides, and high-density lipoprotein cholesterol are associated with severe hypercholesterolemia, hypertriglyceridemia, and hypoalphalipoproteinemia, respectively. These polygenic scores for LDL-C or triglycerides associate with risk of incident coronary artery disease (CAD) independent of polygenic scores designed specifically for CAD, and may identify individuals who benefit most from lipid-lowering medication. Additionally, the severity of hypercholesterolemia and CAD associated with familial hypercholesterolemia, a common monogenic disorder, is modified by these polygenic factors. The current focus of polygenic scores for dyslipidemia is designing predictive scores for diverse populations and determining how these scores could be implemented and standardized for use in the clinic. SUMMARY Polygenic scores have shown early promise for the management of dyslipidemias, but several challenges need to be addressed before widespread clinical implementation to ensure that potential benefits are robust, reproducible, equitable, and cost-effective.
Affiliation(s)
- Mark Trinder
- Centre for Heart Lung Innovation, University of British Columbia
- Experimental Medicine Program, University of British Columbia
- Liam R Brunham
- Centre for Heart Lung Innovation, University of British Columbia
- Experimental Medicine Program, University of British Columbia
- Department of Medicine, University of British Columbia
- Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
10. Haneuse S, Arterburn D, Daniels MJ. Assessing Missing Data Assumptions in EHR-Based Studies: A Complex and Underappreciated Task. JAMA Netw Open 2021;4:e210184. [PMID: 33635321] [DOI: 10.1001/jamanetworkopen.2021.0184]
Affiliation(s)
- Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Statistical Editor, JAMA Network Open
- David Arterburn
- Kaiser Permanente Washington Health Research Institute, Seattle
11. Increasing the Density of Laboratory Measures for Machine Learning Applications. J Clin Med 2020;10(1):103. [PMID: 33396741] [PMCID: PMC7795258] [DOI: 10.3390/jcm10010103]
Abstract
Background. The imputation of missingness is a key step in Electronic Health Records (EHR) mining, as it can significantly affect the conclusions derived from the downstream analysis in translational medicine. The missingness of laboratory values in EHR is not at random, yet imputation techniques tend to disregard this key distinction. Consequently, the development of an adaptive imputation strategy designed specifically for EHR is an important step in improving the data imbalance and enhancing the predictive power of modeling tools for healthcare applications. Methods. We analyzed the laboratory measures derived from Geisinger's EHR on patients in three distinct cohorts: patients tested for Clostridioides difficile (Cdiff) infection, patients with a diagnosis of inflammatory bowel disease (IBD), and patients with a diagnosis of hip or knee osteoarthritis (OA). We extracted Logical Observation Identifiers Names and Codes (LOINC), from which we excluded those with 75% or more missingness. The comorbidities, primary or secondary diagnoses, as well as active problem lists, were also extracted. The adaptive imputation strategy was designed as a hybrid approach: the comorbidity patterns of patients were transformed into latent patterns and then clustered, and imputation was performed on each cluster of patients for each cohort independently to show the generalizability of the method. The results were compared with imputation applied to the complete dataset without incorporating the information from comorbidity patterns. Results. We analyzed a total of 67,445 patients (11,230 IBD patients, 10,000 OA patients, and 46,215 patients tested for C. difficile infection). We extracted 495 LOINC and 11,230 diagnosis codes for the IBD cohort, 8160 diagnosis codes for the Cdiff cohort, and 2042 diagnosis codes for the OA cohort based on the primary/secondary diagnosis and active problem list in the EHR. Overall, the most improvement from this strategy was observed when the laboratory measures had a higher level of missingness. The best root mean square error (RMSE) difference for each dataset was recorded as -35.5 for the Cdiff, -8.3 for the IBD, and -11.3 for the OA dataset. Conclusions. An adaptive imputation strategy designed specifically for EHR that uses complementary information from the clinical profile of the patient can improve the imputation of missing laboratory values, especially when laboratory codes with high levels of missingness are included in the analysis.
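The cluster-then-impute idea in this abstract can be sketched with scikit-learn. This is a rough illustration, not the authors' implementation: the data are synthetic, plain k-means stands in for the latent comorbidity-pattern clustering, and simple mean imputation stands in for whatever imputer is applied within each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic stand-ins: binary comorbidity indicators and lab values with gaps.
comorbidities = rng.integers(0, 2, size=(200, 10)).astype(float)
labs = rng.normal(size=(200, 5))
labs[rng.random(labs.shape) < 0.3] = np.nan  # ~30% missingness

# Step 1: group patients by comorbidity profile.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(comorbidities)

# Step 2: impute lab values within each cluster independently, so fill-in
# values come from clinically similar patients rather than the whole cohort.
imputed = labs.copy()
for k in np.unique(clusters):
    rows = clusters == k
    imputed[rows] = SimpleImputer(strategy="mean").fit_transform(labs[rows])
```

Per-cluster fill-in values differ from the global ones whenever the clusters capture real clinical structure, which is where RMSE gains like those reported above would come from.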
12. Li R, Chen Y, Ritchie MD, Moore JH. Electronic health records and polygenic risk scores for predicting disease risk. Nat Rev Genet 2020;21:493-502. [PMID: 32235907] [DOI: 10.1038/s41576-020-0224-1]
Abstract
Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction.
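At its core, a polygenic risk score of the kind this review discusses is a weighted sum of risk-allele dosages across variants. The numbers below are invented for illustration; real scores use up to millions of variants:

```python
import numpy as np

# Genotype dosages (0, 1, or 2 copies of the risk allele):
# 3 individuals x 4 variants.
dosages = np.array([
    [0, 1, 2, 1],
    [2, 2, 1, 0],
    [1, 0, 0, 1],
])

# Per-variant effect sizes, e.g. log odds ratios from GWAS summary statistics.
weights = np.array([0.10, -0.05, 0.20, 0.08])

# PRS_i = sum over variants j of dosage[i, j] * weight[j]: one score per person.
prs = dosages @ weights
print(prs)
```

In practice the raw score is then standardized against an ancestry-matched reference distribution before any clinical interpretation, one of the implementation challenges the review highlights.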
Affiliation(s)
- Ruowang Li
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Yong Chen
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Marylyn D Ritchie
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Jason H Moore
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA