1
McCaw ZR, Gao J, Lin X, Gronsbell J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat Genet 2024:10.1038/s41588-024-01793-9. [PMID: 38872030 DOI: 10.1038/s41588-024-01793-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2023] [Accepted: 05/08/2024] [Indexed: 06/15/2024]
Abstract
Within population biobanks, incomplete measurement of certain traits limits the power for genetic discovery. Machine learning is increasingly used to impute the missing values from the available data. However, performing genome-wide association studies (GWAS) on imputed traits can introduce spurious associations, identifying genetic variants that are not associated with the original trait. Here we introduce a new method, synthetic surrogate (SynSurr) analysis, which makes GWAS on imputed phenotypes robust to imputation errors. Rather than replacing missing values, SynSurr jointly analyzes the original and imputed traits. We show that SynSurr estimates the same genetic effect as standard GWAS and improves power in proportion to the quality of the imputations. SynSurr requires a commonly made missing-at-random assumption but relaxes the requirements of existing imputation methods by not requiring correct model specification. We present extensive simulations and ablation analyses to validate SynSurr and apply it to empower the GWAS of dual-energy X-ray absorptiometry traits within the UK Biobank.
Affiliation(s)
- Zachary R McCaw
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Statistics, Harvard University, Cambridge, MA, USA
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
2
Ranapurwala SI, Alam IZ, Pence BW, Carey TS, Christensen S, Clark M, Chelminski PR, Wu LT, Greenblatt LH, Korte JE, Wolfson M, Douglas HE, Bowlby LA, Capata M, Marshall SW. Development and validation of an electronic health records-based opioid use disorder algorithm by expert clinical adjudication among patients with prescribed opioids. Pharmacoepidemiol Drug Saf 2023; 32:577-585. [PMID: 36585827 PMCID: PMC10073250 DOI: 10.1002/pds.5591] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2022] [Revised: 12/05/2022] [Accepted: 12/22/2022] [Indexed: 01/01/2023]
Abstract
BACKGROUND: In the US, over 200 lives are lost from opioid overdoses each day. Accurate and prompt diagnosis of opioid use disorders (OUD) may help prevent overdose deaths. However, International Classification of Diseases (ICD) codes for OUD are known to underestimate prevalence, and their specificity and sensitivity are unknown. We developed and validated algorithms to identify OUD in electronic health records (EHR) and examined the validity of OUD ICD codes.
METHODS: Through four iterations, we developed EHR-based OUD identification algorithms among patients who were prescribed opioids from 2014 to 2017. The algorithms and OUD ICD codes were validated against 169 independent "gold standard" EHR chart reviews conducted by an expert adjudication panel across four healthcare systems. After using 2014-2020 EHR for validating iteration 1, the experts were advised to use 2014-2017 EHR thereafter.
RESULTS: Of the 169 EHR charts, 81 (48%) were reviewed by more than one expert and exhibited 85% expert agreement. The experts identified 54 OUD cases. The experts endorsed all 11 OUD criteria from the Diagnostic and Statistical Manual of Mental Disorders-5, including craving (72%), tolerance (65%), withdrawal (56%), and recurrent use in physically hazardous conditions (50%). The OUD ICD codes had 10% sensitivity and 99% specificity, underscoring large underestimation. In comparison, our algorithm identified OUD with 23% sensitivity and 98% specificity.
CONCLUSIONS AND RELEVANCE: This is the first study to estimate the validity of OUD ICD codes and develop validated EHR-based OUD identification algorithms. This work will inform future research on early intervention and prevention of OUD.
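To make the reported percentages concrete, the following back-of-envelope sketch converts them into approximate chart counts. It assumes, purely for illustration, that the reported sensitivity and specificity apply to all 169 reviewed charts and all 54 expert-identified cases; the abstract does not report the underlying counts, so the rounded numbers are implied approximations rather than published results. The function name `implied_counts` is hypothetical.

```python
# Figures taken from the abstract.
N_CHARTS = 169                    # expert-reviewed charts
N_CASES = 54                      # charts the experts judged to be OUD
N_NONCASES = N_CHARTS - N_CASES   # 115 charts judged not to be OUD


def implied_counts(sensitivity: float, specificity: float) -> dict[str, int]:
    """Approximate the 2x2 counts implied by a sensitivity/specificity pair.

    ASSUMPTION: the percentages apply to the full set of charts, and the
    rounding recovers the nearest whole chart. Real counts may differ.
    """
    tp = round(sensitivity * N_CASES)       # cases flagged
    tn = round(specificity * N_NONCASES)    # non-cases correctly not flagged
    return {"tp": tp, "fn": N_CASES - tp, "tn": tn, "fp": N_NONCASES - tn}


icd = implied_counts(0.10, 0.99)    # OUD ICD codes
algo = implied_counts(0.23, 0.98)   # EHR-based algorithm

print("ICD codes:", icd)
print("Algorithm:", algo)
```

Under these assumptions, ICD codes would flag only about 5 of the 54 expert-identified cases and the algorithm about 12, which is the scale of underestimation the abstract describes.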
Affiliation(s)
- Shabbar I. Ranapurwala
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, North Carolina, USA
  - Injury Prevention Research Center, UNC, Chapel Hill, North Carolina, USA
- Ishrat Z. Alam
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, North Carolina, USA
  - Injury Prevention Research Center, UNC, Chapel Hill, North Carolina, USA
- Brian W. Pence
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, North Carolina, USA
  - Injury Prevention Research Center, UNC, Chapel Hill, North Carolina, USA
- Timothy S. Carey
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, North Carolina, USA
  - North Carolina Translational and Clinical Sciences Institute, School of Medicine, University of North Carolina at Chapel Hill, North Carolina, USA
  - Department of Medicine, School of Medicine, University of North Carolina at Chapel Hill, North Carolina, USA
- Sean Christensen
  - Department of Psychiatry and Behavioral Sciences, Medical University of South Carolina, Charleston, South Carolina, USA
- Marshall Clark
  - North Carolina Translational and Clinical Sciences Institute, School of Medicine, University of North Carolina at Chapel Hill, North Carolina, USA
- Paul R. Chelminski
  - Division of General Internal Medicine and Clinical Epidemiology, Department of Medicine, School of Medicine, University of North Carolina at Chapel Hill, North Carolina, USA
- Li-Tzy Wu
  - Department of Psychiatry and Behavioral Sciences, School of Medicine, Duke University, Durham, North Carolina, USA
  - Department of Medicine, School of Medicine, Duke University, Durham, North Carolina, USA
- Lawrence H. Greenblatt
  - Department of Medicine, School of Medicine, Duke University, Durham, North Carolina, USA
- Jeffrey E. Korte
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, USA
- Mark Wolfson
  - Department of Social Medicine, Population, and Public Health, School of Medicine, University of California, Riverside, California, USA
- Heather E. Douglas
  - Department of Psychiatry and Behavioral Medicine, School of Medicine, Wake Forest University, Winston-Salem, North Carolina, USA
- Lynn A. Bowlby
  - Department of Medicine, School of Medicine, Duke University, Durham, North Carolina, USA
- Michael Capata
  - Department of Psychiatry and Behavioral Sciences, Medical University of South Carolina, Charleston, South Carolina, USA
- Stephen W. Marshall
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, North Carolina, USA
  - Injury Prevention Research Center, UNC, Chapel Hill, North Carolina, USA
3
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022]
Abstract
OBJECTIVE: Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used.
MATERIALS AND METHODS: We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies.
RESULTS: Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions.
DISCUSSION: Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released.
CONCLUSION: Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
4
Weinstein EJ, Ritchey ME, Lo Re V. Core concepts in pharmacoepidemiology: Validation of health outcomes of interest within real-world healthcare databases. Pharmacoepidemiol Drug Saf 2023; 32:1-8. [PMID: 36057777 PMCID: PMC9772105 DOI: 10.1002/pds.5537] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 02/17/2022] [Revised: 08/09/2022] [Accepted: 08/19/2022] [Indexed: 02/06/2023]
Abstract
Real-world healthcare data, including administrative and electronic medical record databases, provide a rich source of data for the conduct of pharmacoepidemiologic studies but carry the potential for misclassification of health outcomes of interest (HOIs). Validation studies are important ways to quantify the degree of error associated with case-identifying algorithms for HOIs and are crucial for interpreting study findings within real-world data. This review provides a rationale, framework, and step-by-step approach to validating case-identifying algorithms for HOIs within healthcare databases. Key steps in validating a case-identifying algorithm within a healthcare database include: (1) selecting the appropriate health outcome; (2) determining the reference standard against which to validate the algorithm; (3) developing the algorithm using diagnosis codes, diagnostic tests or their results, procedures, drug therapies, patient-reported symptoms or diagnoses, or some combination of these parameters; (4) selecting patients and sample sizes for validation; (5) collecting data to confirm the HOI; (6) confirming the HOI; and (7) assessing the algorithm's performance. Additional strategies for algorithm refinement and methods to correct for bias due to misclassification of outcomes are discussed. The review concludes by discussing factors affecting the transportability of case-identifying algorithms and the need for ongoing validation as data elements within healthcare databases, such as diagnosis codes, change over time or as new variables, such as patient-generated health data, are included in these data sources.
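Step (7), assessing the algorithm's performance, is typically summarized with sensitivity, specificity, and predictive values computed against the reference standard. The sketch below shows one way this could be done; it is a minimal illustration rather than the review's own procedure, the names `ValidationCounts` and `performance` are hypothetical, and the counts are made-up example values that do not come from any study listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationCounts:
    """2x2 table of algorithm result versus reference standard."""
    tp: int  # algorithm positive, reference standard confirms the HOI
    fp: int  # algorithm positive, reference standard rules out the HOI
    fn: int  # algorithm negative, reference standard confirms the HOI
    tn: int  # algorithm negative, reference standard rules out the HOI


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a margin of the table is empty.
    return numerator / denominator if denominator else float("nan")


def performance(c: ValidationCounts) -> dict[str, float]:
    """Standard validation metrics for a case-identifying algorithm."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Hypothetical counts, for illustration only.
counts = ValidationCounts(tp=40, fp=10, fn=20, tn=130)
metrics = performance(counts)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the predictive values depend on how patients were sampled for validation (step 4): if algorithm-positive patients are oversampled, PPV can be estimated directly from the sample, but sensitivity and NPV would need reweighting to the source population.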
Affiliation(s)
- Erica J Weinstein
  - Division of Infectious Diseases, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Center for Pharmacoepidemiology Research and Training, Center for Clinical Epidemiology and Biostatistics, and Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Mary Elizabeth Ritchey
  - Med Tech Epi, LLC, Philadelphia, PA, USA
  - Center for Pharmacoepidemiology and Treatment Science, Rutgers University, New Brunswick, New Jersey, USA
- Vincent Lo Re
  - Division of Infectious Diseases, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Center for Pharmacoepidemiology Research and Training, Center for Clinical Epidemiology and Biostatistics, and Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
5
Ng DQ, Dang E, Chen L, Nguyen MT, Nguyen MKN, Samman S, Nguyen TMT, Cadiz CL, Nguyen L, Chan A. Current and recommended practices for evaluating adverse drug events using electronic health records: A systematic review. J Am Coll Clin Pharm 2021. [DOI: 10.1002/jac5.1524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Ding Quan Ng
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Emily Dang
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Lijie Chen
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Mary Thuy Nguyen
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Michael Ky Nguyen Nguyen
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Sarah Samman
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Tiffany Mai Thy Nguyen
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Christine Luu Cadiz
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Lee Nguyen
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
- Alexandre Chan
  - School of Pharmacy & Pharmaceutical Sciences, University of California, Irvine, Irvine, California, USA
6
Holmes JH, Beinlich J, Boland MR, Bowles KH, Chen Y, Cook TS, Demiris G, Draugelis M, Fluharty L, Gabriel PE, Grundmeier R, Hanson CW, Herman DS, Himes BE, Hubbard RA, Kahn CE, Kim D, Koppel R, Long Q, Mirkovic N, Morris JS, Mowery DL, Ritchie MD, Urbanowicz R, Moore JH. Why Is the Electronic Health Record So Challenging for Research and Clinical Care? Methods Inf Med 2021; 60:32-48. [PMID: 34282602 DOI: 10.1055/s-0041-1731784] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 10/20/2022]
Abstract
BACKGROUND: The electronic health record (EHR) has become increasingly ubiquitous. At the same time, health professionals have been turning to this resource for access to data that is needed for the delivery of health care and for clinical research. There is little doubt that the EHR has made both of these functions easier than in earlier days when we relied on paper-based clinical records. Coupled with modern database and data warehouse systems, high-speed networks, and the ability to share clinical data with others are a large number of challenges that arguably limit the optimal use of the EHR.
OBJECTIVES: Our goal was to provide an exhaustive reference for those who use the EHR in clinical and research contexts, but also for health information systems professionals as they design, implement, and maintain EHR systems.
METHODS: This study includes a panel of 24 biomedical informatics researchers, information technology professionals, and clinicians, all of whom have extensive experience in design, implementation, and maintenance of EHR systems, or in using the EHR as clinicians or researchers. All members of the panel are affiliated with Penn Medicine at the University of Pennsylvania and have experience with a variety of different EHR platforms and systems and how they have evolved over time.
RESULTS: Each of the authors has shared their knowledge and experience in using the EHR in a suite of 20 short essays, each representing a specific challenge and classified according to a functional hierarchy of interlocking facets such as usability and usefulness, data quality, standards, governance, data integration, clinical care, and clinical research.
CONCLUSION: We provide here a set of perspectives on the challenges posed by the EHR to clinical and research users.
Affiliation(s)
- John H Holmes
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- James Beinlich
  - Information Technology Entity Services and Corporate Information Services, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Mary R Boland
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Kathryn H Bowles
  - Department of Biobehavioral Health Sciences, University of Pennsylvania School of Nursing, Philadelphia, Pennsylvania, United States
- Yong Chen
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Tessa S Cook
  - Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- George Demiris
  - Department of Biobehavioral Health Sciences, University of Pennsylvania School of Nursing, Philadelphia, Pennsylvania, United States
- Michael Draugelis
  - Department of Predictive Health Care, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Laura Fluharty
  - Clinical Research Operations, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Peter E Gabriel
  - Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Robert Grundmeier
  - Department of Pediatrics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States
- C William Hanson
  - Department of Anesthesiology and Critical Care, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Daniel S Herman
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Blanca E Himes
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Rebecca A Hubbard
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Charles E Kahn
  - Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Dokyoon Kim
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Ross Koppel
  - Department of Sociology, University of Pennsylvania, Philadelphia, Pennsylvania, United States
- Qi Long
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Nebojsa Mirkovic
  - Department of Research Analytics, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Jeffrey S Morris
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Danielle L Mowery
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Marylyn D Ritchie
  - Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Ryan Urbanowicz
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Jason H Moore
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
7
Studying pediatric health outcomes with electronic health records using Bayesian clustering and trajectory analysis. J Biomed Inform 2020; 113:103654. [PMID: 33309993 DOI: 10.1016/j.jbi.2020.103654] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/06/2020] [Revised: 11/03/2020] [Accepted: 12/06/2020] [Indexed: 11/21/2022]
Abstract
Use of routinely collected data from electronic health records (EHR) can expedite longitudinal studies that investigate childhood exposures and rare pediatric health outcomes. For instance, characteristics of the body mass index (BMI) trajectory early in life may be associated with subsequent development of type 2 diabetes. Past studies investigating these relationships have used longitudinal cohort data collected over the course of many years to investigate the connection between BMI trajectory and subsequent development of diabetes. In contrast, EHR data from routine clinical care can provide longitudinal information on early-life BMI trajectories as well as subsequent health outcomes without requiring any additional data collection. In this study, we introduce a Bayesian joint phenotyping and BMI trajectory model to address data quality challenges in an EHR-based study of early-life BMI and type 2 diabetes in adolescence. We compared this joint modeling approach to traditional approaches using a computable phenotype for type 2 diabetes or separately estimated BMI trajectories and type 2 diabetes phenotypes. In a sample of 49,062 children derived from the PEDSnet consortium of pediatric healthcare systems, a median 8 (interquartile range [IQR] 5-13) BMI measurements were available to characterize the early-life BMI trajectory. The joint modeling and computable phenotype approaches found that age at adiposity rebound between 5 and 9 years was associated with higher odds of type 2 diabetes in adolescence compared to age at adiposity rebound between 2 and 5 years (joint model odds ratio [OR] = 1.77; computable phenotype OR = 1.88) and that BMI in excess of 140% of the 95th percentile for age and sex at age 9 years was associated with higher odds of type 2 diabetes in adolescence relative to children with BMI from 100 to 120% of the 95th percentile (joint model OR = 6.22; computable phenotype OR = 13.25). Estimates from the separate phenotyping and trajectory model were substantially attenuated towards the null. These results demonstrate that EHR data coupled with modern methodologic approaches can improve efficiency and timeliness of studies of childhood exposures and rare health outcomes.