1
Lu Y, Tong J, Chubak J, Lumley T, Hubbard RA, Xu H, Chen Y. Leveraging error-prone algorithm-derived phenotypes: Enhancing association studies for risk factors in EHR data. J Biomed Inform 2024; 157:104690. [PMID: 39004110 DOI: 10.1016/j.jbi.2024.104690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 06/01/2024] [Accepted: 07/10/2024] [Indexed: 07/16/2024]
Abstract
OBJECTIVES It has become increasingly common for multiple computable phenotypes from electronic health records (EHR) to be developed for a given phenotype. However, EHR-based association studies often focus on a single phenotype. In this paper, we develop a method aiming to simultaneously make use of multiple EHR-derived phenotypes for reduction of bias due to phenotyping error and improved efficiency of phenotype/exposure associations. MATERIALS AND METHODS The proposed method combines multiple algorithm-derived phenotypes with a small set of validated outcomes to reduce bias and improve estimation accuracy and efficiency. The performance of our method was evaluated through simulation studies and real-world application to an analysis of colon cancer recurrence using EHR data from Kaiser Permanente Washington. RESULTS In settings where there was no single surrogate performing uniformly better than all others in terms of both sensitivity and specificity, our method achieved substantial bias reduction compared to using a single algorithm-derived phenotype. Our method also led to higher estimation efficiency by up to 30% compared to an estimator that used only one algorithm-derived phenotype. DISCUSSION Simulation studies and application to real-world data demonstrated the effectiveness of our method in integrating multiple phenotypes, thereby enhancing bias reduction, statistical accuracy and efficiency. CONCLUSIONS Our method combines information across multiple surrogates using a statistically efficient seemingly unrelated regression framework. Our method provides a robust alternative to single-surrogate-based bias correction, especially in contexts lacking information on which surrogate is superior.
Affiliation(s)
- Yiwen Lu
  - Center for Health AI and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Jiayi Tong
  - Center for Health AI and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Jessica Chubak
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Thomas Lumley
  - Department of Statistics, Faculty of Science, University of Auckland, Auckland, New Zealand
- Rebecca A Hubbard
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA
- Hua Xu
  - Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
- Yong Chen
  - Center for Health AI and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, Philadelphia, PA, USA; Penn Medicine Center for Evidence-based Practice (CEP), Philadelphia, PA, USA.
2
Zhou Q, Wong KY. Improving estimation efficiency of case-cohort studies with interval-censored failure time data. Stat Methods Med Res 2024:9622802241268601. [PMID: 39105419 DOI: 10.1177/09622802241268601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/07/2024]
Abstract
The case-cohort design is a commonly used cost-effective sampling strategy for large cohort studies, where some covariates are expensive to measure or obtain. In this paper, we consider regression analysis under a case-cohort study with interval-censored failure time data, where the failure time is only known to fall within an interval instead of being exactly observed. A common approach to analyzing data from a case-cohort study is the inverse probability weighting approach, where only subjects in the case-cohort sample are used in estimation, and the subjects are weighted based on the probability of inclusion into the case-cohort sample. This approach, though consistent, is generally inefficient as it does not incorporate information outside the case-cohort sample. To improve efficiency, we first develop a sieve maximum weighted likelihood estimator under the Cox model based on the case-cohort sample and then propose a procedure to update this estimator by using information in the full cohort. We show that the update estimator is consistent, asymptotically normal, and at least as efficient as the original estimator. The proposed method can flexibly incorporate auxiliary variables to improve estimation efficiency. A weighted bootstrap procedure is employed for variance estimation. Simulation results indicate that the proposed method works well in practical situations. An application to a Phase 3 HIV vaccine efficacy trial is provided for illustration.
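The inverse probability weighting idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' sieve estimator or their update procedure. It assumes the simplest case-cohort scheme, in which every case is sampled with probability 1 and non-cases enter only through a simple random subcohort; the function names and the `subcohort_fraction` parameter are hypothetical.

```python
def case_cohort_weights(is_case, in_subcohort, subcohort_fraction):
    """Inverse-probability weights for each member of the full cohort.

    Assumption (not from the paper): all cases are sampled with
    probability 1, and non-cases are sampled only via a simple random
    subcohort drawn with probability `subcohort_fraction`.
    Subjects outside the case-cohort sample receive weight 0.
    """
    if not 0.0 < subcohort_fraction <= 1.0:
        raise ValueError("subcohort_fraction must be in (0, 1]")
    weights = []
    for case, sub in zip(is_case, in_subcohort):
        if case:
            weights.append(1.0)                       # inclusion probability 1
        elif sub:
            weights.append(1.0 / subcohort_fraction)  # sampled non-case
        else:
            weights.append(0.0)                       # covariates not measured
    return weights


def weighted_mean(values, weights):
    """Weighted average using only subjects with positive weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(v * w for v, w in zip(values, weights)) / total
```

Each sampled non-case stands in for `1 / subcohort_fraction` cohort members, which is why subjects outside the sample contribute nothing and why the approach can lose efficiency relative to methods that also use full-cohort information.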
Affiliation(s)
- Qingning Zhou
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, USA
- Kin Yau Wong
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
3
Kundu R, Shi X, Morrison J, Barrett J, Mukherjee B. A framework for understanding selection bias in real-world healthcare data. Journal of the Royal Statistical Society. Series A (Statistics in Society) 2024; 187:606-635. [PMID: 39281782 PMCID: PMC11393555 DOI: 10.1093/jrsssa/qnae039] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/31/2023] [Revised: 01/27/2024] [Accepted: 03/31/2024] [Indexed: 09/18/2024]
Abstract
Using administrative patient-care data such as Electronic Health Records (EHR) and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Of these multiple sources of biases, in this paper, we focus on understanding selection bias. We present an analytic framework using directed acyclic graphs for guiding applied researchers to dissect how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias with accompanying variance formulae. We demonstrate through a simulation study when they can rescue us in practice with analysis of real-world data. We compare these methods using a data example where our goal is to estimate the well-known association of cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R codes to implement these weighted methods with associated inference.
Affiliation(s)
- Ritoban Kundu
  - Department of Biostatistics, University of Michigan, Ann Arbor, USA
- Xu Shi
  - Department of Biostatistics, University of Michigan, Ann Arbor, USA
- Jean Morrison
  - Department of Biostatistics, University of Michigan, Ann Arbor, USA
- Jessica Barrett
  - MRC Investigator, Biostatistics Unit, Medical Research Council, University of Cambridge, Cambridge, UK
- Bhramar Mukherjee
  - Department of Biostatistics and Epidemiology, University of Michigan, Ann Arbor, USA
4
Furner B, Cheng A, Desai AV, Benedetti DJ, Friedman DL, Wyatt KD, Watkins M, Volchenboum SL, Cohn SL. Extracting Electronic Health Record Neuroblastoma Treatment Data With High Fidelity Using the REDCap Clinical Data Interoperability Services Module. JCO Clin Cancer Inform 2024; 8:e2400009. [PMID: 38815188 PMCID: PMC11371086 DOI: 10.1200/cci.24.00009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Revised: 03/20/2024] [Accepted: 04/12/2024] [Indexed: 06/01/2024] Open
Abstract
PURPOSE Although the International Neuroblastoma Risk Group Data Commons (INRGdc) has enabled seminal large cohort studies, the research is limited by the lack of real-world, electronic health record (EHR) treatment data. To address this limitation, we evaluated the feasibility of extracting treatment data directly from EHRs using the REDCap Clinical Data Interoperability Services (CDIS) module for future submission to the INRGdc. METHODS Patients enrolled on the Children's Oncology Group neuroblastoma biology study ANBL00B1 (ClinicalTrials.gov identifier: NCT00904241) who received care at the University of Chicago (UChicago) or the Vanderbilt University Medical Center (VUMC) after the go-live dates for the Fast Healthcare Interoperability Resources (FHIR)-compliant EHRs were identified. Antineoplastic drug orders were extracted using the CDIS module. To validate the CDIS output, antineoplastic agents extracted through FHIR were compared with those queried through EHR relational databases (UChicago's Clinical Research Data Warehouse and VUMC's Epic Clarity database) and manual chart review. RESULTS The analytic cohort consisted of 41 patients at UChicago and 32 VUMC patients. Antineoplastic drug orders were identified in the extracted EHR records of 39 (95.1%) UChicago patients and 26 (81.3%) VUMC patients. Manual chart review confirmed that patients with missing (n = 8) or discontinued (n = 1) orders in the CDIS output did not receive antineoplastic agents during the timeframe of the study. More than 99% of the antineoplastic drug orders in the EHR relational databases were identified in the corresponding CDIS output. CONCLUSION Our results demonstrate the feasibility of extracting EHR treatment data with high fidelity using HL7-FHIR via REDCap CDIS for future submission to the INRGdc.
Affiliation(s)
- Brian Furner
  - Department of Pediatrics, Section of Hematology/Oncology, The University of Chicago, Chicago, IL
- Alex Cheng
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Ami V. Desai
  - Department of Pediatrics, Section of Hematology/Oncology, The University of Chicago, Chicago, IL
- Daniel J. Benedetti
  - Department of Pediatrics, Division of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, TN
- Debra L. Friedman
  - Department of Pediatrics, Division of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, TN
- Kirk D. Wyatt
  - Department of Pediatric Hematology/Oncology, Roger Maris Cancer Center, Sanford Health, Fargo, ND
- Michael Watkins
  - Department of Pediatrics, Section of Hematology/Oncology, The University of Chicago, Chicago, IL
- Samuel L. Volchenboum
  - Department of Pediatrics, Section of Hematology/Oncology, The University of Chicago, Chicago, IL
- Susan L. Cohn
  - Department of Pediatrics, Section of Hematology/Oncology, The University of Chicago, Chicago, IL
5
Gao J, Bonzel CL, Hong C, Varghese P, Zakir K, Gronsbell J. Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms. J Am Med Inform Assoc 2024; 31:640-650. [PMID: 38128118 PMCID: PMC10873838 DOI: 10.1093/jamia/ocad226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 09/22/2023] [Accepted: 11/20/2023] [Indexed: 12/23/2023] Open
Abstract
OBJECTIVE High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly-supervised learning methods have been proposed. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimation of the receiver operating characteristic (ROC) parameters of PAs (eg, sensitivity, specificity). MATERIALS AND METHODS ssROC uses a small labeled dataset to nonparametrically impute missing labels. The imputations are then used for ROC parameter estimation to yield more precise estimates of PA performance relative to classical supervised ROC analysis (supROC) using only labeled data. We evaluated ssROC with synthetic, semi-synthetic, and EHR data from Mass General Brigham (MGB). RESULTS ssROC produced ROC parameter estimates with minimal bias and significantly lower variance than supROC in the simulated and semi-synthetic data. For the 5 PAs from MGB, the estimates from ssROC are 30% to 60% less variable than supROC on average. DISCUSSION ssROC enables precise evaluation of PA performance without demanding large volumes of labeled data. ssROC is also easily implementable in open-source R software. CONCLUSION When used in conjunction with weakly-supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.
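For orientation, the supervised baseline that the abstract calls supROC estimates ROC parameters from the labeled records alone. The sketch below shows that baseline at a single threshold; it is not the ssROC imputation procedure, and the function name, the `>=` cutoff convention, and the 0/1 label coding are assumptions made for illustration.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Empirical sensitivity and specificity of a phenotyping score.

    `scores` are algorithm outputs, `labels` are gold-standard statuses
    coded 1 (case) or 0 (non-case). A record is called positive when its
    score is at or above `threshold` (an assumed convention).
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one labeled case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)
```

With only a handful of labeled records, each estimate rests on very few counts, which is the variance problem the semi-supervised approach is meant to reduce.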
Affiliation(s)
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Clara-Lea Bonzel
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States
- Paul Varghese
  - Health Informatics, Verily Life Sciences, Cambridge, MA, United States
- Karim Zakir
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
  - Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
6
Ostropolets A, Hripcsak G, Husain SA, Richter LR, Spotnitz M, Elhussein A, Ryan PB. Scalable and interpretable alternative to chart review for phenotype evaluation using standardized structured data from electronic health records. J Am Med Inform Assoc 2023; 31:119-129. [PMID: 37847668 PMCID: PMC10746303 DOI: 10.1093/jamia/ocad202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Revised: 09/23/2023] [Accepted: 10/02/2023] [Indexed: 10/19/2023] Open
Abstract
OBJECTIVES Chart review as the current gold standard for phenotype evaluation cannot support observational research on electronic health records and claims data sources at scale. We aimed to evaluate the ability of structured data to support efficient and interpretable phenotype evaluation as an alternative to chart review. MATERIALS AND METHODS We developed Knowledge-Enhanced Electronic Profile Review (KEEPER) as a phenotype evaluation tool that extracts patient's structured data elements relevant to a phenotype and presents them in a standardized fashion following clinical reasoning principles. We evaluated its performance (interrater agreement, intermethod agreement, accuracy, and review time) compared to manual chart review for 4 conditions using randomized 2-period, 2-sequence crossover design. RESULTS Case ascertainment with KEEPER was twice as fast compared to manual chart review. 88.1% of the patients were classified concordantly using charts and KEEPER, but agreement varied depending on the condition. Missing data and differences in interpretation accounted for most of the discrepancies. Pairs of clinicians agreed in case ascertainment in 91.2% of the cases when using KEEPER compared to 76.3% when using charts. Patient classification aligned with the gold standard in 88.1% and 86.9% of the cases respectively. CONCLUSION Structured data can be used for efficient and interpretable phenotype evaluation if they are limited to relevant subset and organized according to the clinical reasoning principles. A system that implements these principles can achieve noninferior performance compared to chart review at a fraction of time.
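The agreement figures quoted in this abstract are proportions of concordant classifications. The sketch below shows how such a proportion could be computed for two reviewers with binary case/non-case calls. Cohen's kappa is included as a common chance-corrected companion, although the abstract does not say whether the authors report it; both function names are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Share of patients on which two reviewers give the same 0/1 call."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for binary ratings: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    # Marginal probability that each rater calls a patient a case.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    # Agreement expected by chance if the raters were independent.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)
```

Raw agreement can look high simply because one class dominates, so a chance-corrected statistic is often read alongside it.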
Affiliation(s)
- Anna Ostropolets
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
  - Medical Informatics Services, New York-Presbyterian Hospital, New York, NY 10032, United States
- Syed A Husain
  - Division of Nephrology, Department of Medicine, Vagelos College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, NY 10032, United States
- Lauren R Richter
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
- Matthew Spotnitz
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
- Ahmed Elhussein
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
- Patrick B Ryan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
  - Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ 08560, United States
7
Lee SH, Ma Y, Wei Y, Chen J. Optimal sampling for positive only electronic health record data. Biometrics 2023; 79:2974-2986. [PMID: 36632649 PMCID: PMC10333453 DOI: 10.1111/biom.13824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 01/04/2023] [Indexed: 01/13/2023]
Abstract
Identifying a patient's disease/health status from electronic medical records is a frequently encountered task in electronic health records (EHR) related research, and estimation of a classification model often requires a benchmark training data with patients' known phenotype statuses. However, assessing a patient's phenotype is costly and labor intensive, hence a proper selection of EHR records as a training set is desired. We propose a procedure to tailor the best training subsample with limited sample size for a classification model, minimizing its mean-squared phenotyping/classification error (MSE). Our approach incorporates "positive only" information, an approximation of the true disease status without false alarm, when it is available. In addition, our sampling procedure is applicable for training a chosen classification model which can be misspecified. We provide theoretical justification on its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real-data example, and is found often satisfactory under criteria beyond MSE.
Affiliation(s)
- Seong-ho Lee
  - Department of Statistics, Pennsylvania State University
- Yanyuan Ma
  - Department of Statistics, Pennsylvania State University
- Ying Wei
  - Department of Biostatistics, Columbia University
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
8
Liu X, Chubak J, Hubbard RA, Chen Y. SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies. J Am Med Inform Assoc 2021; 29:918-927. [PMID: 34962283 PMCID: PMC9714591 DOI: 10.1093/jamia/ocab267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Revised: 10/16/2021] [Accepted: 11/23/2021] [Indexed: 12/30/2022] Open
Abstract
OBJECTIVES Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies solely relying on potentially error-prone EHR-derived phenotypes (ie, surrogates) are subject to bias. Analyses of low prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to simultaneously address both issues by developing new sampling methods to select an optimal subsample to collect gold standard phenotypes for improving the accuracy of association estimation. MATERIALS AND METHODS We develop a surrogate-assisted two-wave (SAT) sampling method, where a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated from A-optimality criterion (OSMAC) are employed sequentially, to select a subsample for outcome validation through manual chart review subject to budget constraints. A model is then fitted based on the subsample with the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT. RESULTS We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association. CONCLUSIONS The proposed approach can handle the problem brought by the rarity of cases and misclassification of the surrogate in phenotype-absent EHR-based association studies. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.
Affiliation(s)
- Xiaokang Liu
  - Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Jessica Chubak
  - Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA, Department of Epidemiology, University of Washington, Seattle, Washington, USA
- Rebecca A Hubbard
  - Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Yong Chen
  - Corresponding Author: Yong Chen, PhD, Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, 423 Guardian Drive, Philadelphia, PA 19104, USA