1
|
Wen J, Hou J, Bonzel CL, Zhao Y, Castro VM, Gainer VS, Weisenfeld D, Cai T, Ho YL, Panickan VA, Costa L, Hong C, Gaziano JM, Liao KP, Lu J, Cho K, Cai T. LATTE: Label-efficient incident phenotyping from longitudinal electronic health records. Patterns (N Y) 2024; 5:100906. [PMID: 38264714 PMCID: PMC10801250 DOI: 10.1016/j.patter.2023.100906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 09/06/2023] [Accepted: 12/01/2023] [Indexed: 01/25/2024]
Abstract
Electronic health record (EHR) data are increasingly used to support real-world evidence studies but are limited by the lack of precise timings of clinical events. Here, we propose a label-efficient incident phenotyping (LATTE) algorithm to accurately annotate the timing of clinical events from longitudinal EHR data. By leveraging the pre-trained semantic embeddings, LATTE selects predictive features and compresses their information into longitudinal visit embeddings through visit attention learning. LATTE models the sequential dependency between the target event and visit embeddings to derive the timings. To improve label efficiency, LATTE constructs longitudinal silver-standard labels from unlabeled patients to perform semi-supervised training. LATTE is evaluated on the onset of type 2 diabetes, heart failure, and relapses of multiple sclerosis. LATTE consistently achieves substantial improvements over benchmark methods while providing high prediction interpretability. The event timings are shown to help discover risk factors of heart failure among patients with rheumatoid arthritis.
Affiliation(s)
- Jun Wen
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
| | - Jue Hou
- University of Minnesota, Minneapolis, MN, USA
| | - Clara-Lea Bonzel
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
| | | | | | | | | | - Tianrun Cai
- VA Boston Healthcare System, Boston, MA, USA
- Mass General Brigham, Boston, MA, USA
| | - Yuk-Lam Ho
- VA Boston Healthcare System, Boston, MA, USA
| | - Vidul A. Panickan
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
| | | | | | - J. Michael Gaziano
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Brigham and Women’s Hospital, Boston, MA, USA
| | - Katherine P. Liao
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Brigham and Women’s Hospital, Boston, MA, USA
| | - Junwei Lu
- VA Boston Healthcare System, Boston, MA, USA
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Kelly Cho
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Brigham and Women’s Hospital, Boston, MA, USA
| | - Tianxi Cai
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| |
|
2
|
Hong C, Liang L, Yuan Q, Cho K, Liao KP, Pencina MJ, Christiani DC, Cai T. Semi-supervised calibration of noisy event risk (SCANER) with electronic health records. J Biomed Inform 2023; 144:104425. [PMID: 37331495 PMCID: PMC10478159 DOI: 10.1016/j.jbi.2023.104425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2022] [Revised: 05/05/2023] [Accepted: 05/19/2023] [Indexed: 06/20/2023]
Abstract
OBJECTIVE Electronic health records (EHR), containing detailed longitudinal clinical information on a large number of patients and covering broad patient populations, open opportunities for comprehensive predictive modeling of disease progression and treatment response. However, since EHRs were originally constructed for administrative purposes not for research, in the EHR-linked studies, it is often not feasible to capture reliable information for analytical variables, especially in the survival setting, when both accurate event status and event times are needed for model building. For example, progression-free survival (PFS), a commonly used survival outcome for cancer patients, often involves complex information embedded in free-text clinical notes and cannot be extracted reliably. Proxies of PFS time such as time to the first mention of progression in the notes are at best good approximations to the true event time. This leads to difficulty in efficiently estimating event rates for an EHR patient cohort. Estimating survival rates based on error-prone outcome definitions can lead to biased results and hamper the power in the downstream analysis. On the other hand, extracting accurate event time information via manual annotation is time and resource intensive. The objective of this study is to develop a calibrated survival rate estimator using noisy outcomes from EHR data. MATERIALS AND METHODS In this paper, we propose a two-stage semi-supervised calibration of noisy event rate (SCANER) estimator that can effectively overcome censoring induced dependency and attains more robust performance (i.e., not sensitive to misspecification of the imputation model) by fully utilizing both a small-labeled set of gold-standard survival outcomes annotated via manual chart review and a set of proxy features automatically captured via EHR in the unlabeled set. 
We validate the SCANER estimator by estimating the PFS rates for a virtual cohort of lung cancer patients from one large tertiary care center and the ICU-free survival rates for COVID patients from two large tertiary care centers. RESULTS In terms of survival rate estimates, the SCANER had very similar point estimates compared to the complete-case Kaplan Meier estimator. On the other hand, other benchmark methods for comparison, which fail to account for the induced dependency between event time and the censoring time conditioning on surrogate outcomes, produced biased results across all three case studies. In terms of standard errors, the SCANER estimator was more efficient than the KM estimator, with up to 50% efficiency gain. CONCLUSION The SCANER estimator achieves more efficient, robust, and accurate survival rate estimates compared to existing approaches. This promising new approach can also improve the resolution (i.e., granularity of event time) by using labels conditioning on multiple surrogates, particularly among less common or poorly coded conditions.
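The complete-case Kaplan-Meier estimator that the abstract above uses as its comparison point can be sketched in a few lines. This is a minimal illustration on made-up toy data; it is not the SCANER estimator and not the authors' code, and the function name `kaplan_meier` and the sample values are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times  -- follow-up time for each patient
    events -- 1 if the event was observed at that time, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    exit_counts = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


# Hypothetical toy cohort (months of follow-up; 0 marks a censored patient).
toy_times = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients contribute to the risk set until their censoring time but never trigger a drop in the curve, which is why an estimator restricted to the small labeled subset is unbiased but has wide standard errors.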
Affiliation(s)
- Chuan Hong
- Duke University, Durham, NC, USA; Harvard Medical School, Boston, MA, USA
| | - Liang Liang
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Qianyu Yuan
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Kelly Cho
- Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
| | - Katherine P Liao
- Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
| | | | - David C Christiani
- Harvard T.H. Chan School of Public Health, Boston, MA, USA; Massachusetts General Hospital, Boston, MA, USA
| | - Tianxi Cai
- Harvard T.H. Chan School of Public Health, Boston, MA, USA; Massachusetts General Hospital, Boston, MA, USA.
| |
|
3
|
Ahuja Y, Liang L, Zhou D, Huang S, Cai T. Semisupervised Calibration of Risk with Noisy Event Times (SCORNET) using electronic health record data. Biostatistics 2023; 24:760-775. [PMID: 35166342 PMCID: PMC10544799 DOI: 10.1093/biostatistics/kxac003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2021] [Revised: 01/18/2022] [Accepted: 01/24/2022] [Indexed: 01/19/2023]
Abstract
Leveraging large-scale electronic health record (EHR) data to estimate survival curves for clinical events can enable more powerful risk estimation and comparative effectiveness research. However, use of EHR data is hindered by a lack of direct event time observations. Occurrence times of relevant diagnostic codes or target disease mentions in clinical notes are at best a good approximation of the true disease onset time. On the other hand, extracting precise information on the exact event time requires laborious manual chart review and is sometimes altogether infeasible due to a lack of detailed documentation. Current status labels (binary indicators of phenotype status during follow-up) are significantly more efficient and feasible to compile, enabling more precise survival curve estimation given limited resources. Existing survival analysis methods using current status labels focus almost entirely on supervised estimation, and naive incorporation of unlabeled data into these methods may lead to biased estimates. In this article, we propose Semisupervised Calibration of Risk with Noisy Event Times (SCORNET), which yields a consistent and efficient survival function estimator by leveraging a small set of current status labels and a large set of informative features. In addition to providing theoretical justification of SCORNET, we demonstrate in both simulation and real-world EHR settings that SCORNET achieves efficiency akin to the parametric Weibull regression model, while also exhibiting semi-nonparametric flexibility and relatively low empirical bias in a variety of generative settings.
Affiliation(s)
- Yuri Ahuja
- Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA
| | - Liang Liang
- Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA
| | - Doudou Zhou
- Department of Statistics, University of California Davis, 1 Shields Avenue, Davis, CA 95616, USA
| | - Sicong Huang
- Department of Rheumatology, Immunology, and Allergy, Brigham and Women’s Hospital, 75 Francis Street, Boston, MA 02115, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA and Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA 02115, USA
| |
|
4
|
Hou J, Chan SF, Wang X, Cai T. Risk prediction with imperfect survival outcome information from electronic health records. Biometrics 2023; 79:190-202. [PMID: 34747010 PMCID: PMC9741856 DOI: 10.1111/biom.13599] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/04/2020] [Revised: 10/28/2021] [Accepted: 10/29/2021] [Indexed: 12/14/2022]
Abstract
Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses are based on poor proxies. Due to the lack of detailed documentation and labor intensiveness of manual annotation, it is often only feasible to ascertain for a small subset the current status of the disease by a follow-up time rather than the exact time. In this paper, we aim to develop risk prediction models for the onset time efficiently leveraging both a small number of labels on the current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error model for proxy onset time, we propose the semisupervised risk prediction method by combining information from proxies and limited labels efficiently. Starting from an initial estimator based solely on the labeled subset, we perform a one-step correction with the full data, augmenting against a mean zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semisupervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in a finite sample. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from Mass General Brigham Healthcare Biobank.
Affiliation(s)
- Jue Hou
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
| | - Stephanie F. Chan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
| | - Xuan Wang
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| |
|
5
|
Beyrer J, Nelson DR, Sheffield KM, Huang YJ, Lau YK, Hincapie AL. Development and Validation of Coding Algorithms to Identify Patients with Incident Non-Small Cell Lung Cancer in United States Healthcare Claims Data. Clin Epidemiol 2023; 15:73-89. [PMID: 36659903 PMCID: PMC9842515 DOI: 10.2147/clep.s389824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Accepted: 12/23/2022] [Indexed: 01/13/2023]
Abstract
Purpose We sought to develop and validate an incident non-small cell lung cancer (NSCLC) algorithm for United States (US) healthcare claims data. Diagnoses and procedures, but not medications, were incorporated to support longer-term relevance and reliability. Methods Patients with newly diagnosed NSCLC per Surveillance, Epidemiology, and End Results (SEER) served as cases. Controls included newly diagnosed small-cell lung cancer and other lung cancers, and two 5% random samples for other cancer and without cancer. Algorithms derived from logistic regression and machine learning methods used the entire sample (Approach A) or started with a previous algorithm for those with lung cancer (Approach B). Sensitivity, specificity, positive predictive values (PPV), negative predictive values, and F-scores (compared for 1000 bootstrap samples) were calculated. Misclassification was evaluated by calculating the odds of selection by the algorithm among true positives and true negatives. Results The best performing algorithm utilized neural networks (Approach B). A 10-variable point-score algorithm was derived from logistic regression (Approach B); sensitivity was 77.69% and PPV = 67.61% (F-score = 72.30%). This algorithm was less sensitive for patients ≥80 years old, with Medicare follow-up time <3 months, or missing SEER data on stage, laterality, or site and less specific for patients with SEER primary site of main bronchus, SEER summary stage 2000 regional by direct extension only, or pre-index chronic pulmonary disease. Conclusion Our study developed and validated a practical, 10-variable, point-based algorithm for identifying incident NSCLC cases in a US claims database based on a previously validated incident lung cancer algorithm.
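The F-score reported in the abstract above is the harmonic mean of sensitivity and PPV, so it can be checked directly from the two published figures. The snippet below is only a sanity-check sketch; the function name `f_score` is a hypothetical helper, not anything from the study.

```python
def f_score(sensitivity, ppv):
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Values reported for the 10-variable point-score algorithm.
reported_f = f_score(sensitivity=0.7769, ppv=0.6761)
print(f"F-score = {reported_f:.2%}")  # prints "F-score = 72.30%"
```

Because the harmonic mean is pulled toward the smaller of the two inputs, the F-score sits closer to the 67.61% PPV than a simple average would.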
Affiliation(s)
- Julie Beyrer
- Eli Lilly and Company, Indianapolis, IN, USA. Correspondence: Julie Beyrer, Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN, 46285, USA, Tel +1 317 651 8236
| | | | | | | | | | - Ana L Hincapie
- University of Cincinnati James L. Winkle College of Pharmacy, Cincinnati, OH, USA
| |
|
6
|
Rasmussen LA, Christensen NL, Winther-Larsen A, Dalton SO, Virgilsen LF, Jensen H, Vedsted P. A Validated Register-Based Algorithm to Identify Patients Diagnosed with Recurrence of Surgically Treated Stage I Lung Cancer in Denmark. Clin Epidemiol 2023; 15:251-261. [PMID: 36890800 PMCID: PMC9986467 DOI: 10.2147/clep.s396738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 02/15/2023] [Indexed: 03/04/2023]
Abstract
Introduction Recurrence of cancer is not routinely registered in Danish national health registers. This study aimed to develop and validate a register-based algorithm to identify patients diagnosed with recurrent lung cancer and to estimate the accuracy of the identified diagnosis date. Material and Methods Patients with early-stage lung cancer treated with surgery were included in the study. Recurrence indicators were diagnosis and procedure codes recorded in the Danish National Patient Register and pathology results recorded in the Danish National Pathology Register. Information from CT scans and medical records served as the gold standard to assess the accuracy of the algorithm. Results The final population consisted of 217 patients; 72 (33%) had recurrence according to the gold standard. The median follow-up time since primary lung cancer diagnosis was 29 months (interquartile interval: 18-46). The algorithm for identifying a recurrence reached a sensitivity of 83.3% (95% CI: 72.7-91.1), a specificity of 93.8% (95% CI: 88.5-97.1), and a positive predictive value of 87.0% (95% CI: 76.7-93.9). The algorithm identified 70% of the recurrences within 60 days of the recurrence date registered by the gold standard method. The positive predictive value of the algorithm decreased to 70% when the algorithm was simulated in a population with a recurrence rate of 15%. Conclusion The proposed algorithm demonstrated good performance in a population with 33% recurrences over a median of 29 months. It can be used to identify patients diagnosed with recurrent lung cancer, and it may be a valuable tool for future research in this field. However, a lower positive predictive value is seen when applying the algorithm in populations with low recurrence rates.
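The drop in positive predictive value at lower recurrence rates follows from Bayes' rule applied to the reported sensitivity and specificity. The sketch below reproduces the two figures quoted in the abstract above; it assumes sensitivity and specificity stay fixed across populations, and the function name `ppv_at_prevalence` is hypothetical rather than part of the study.

```python
def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


SENS, SPEC = 0.833, 0.938  # reported point estimates

ppv_study = ppv_at_prevalence(SENS, SPEC, prevalence=72 / 217)  # study cohort, about 33%
ppv_low = ppv_at_prevalence(SENS, SPEC, prevalence=0.15)        # simulated low-recurrence cohort

print(f"PPV at 33% recurrence: {ppv_study:.1%}")
print(f"PPV at 15% recurrence: {ppv_low:.1%}")
```

With fewer true recurrences in the population, the same false-positive rate is spread over a larger group of recurrence-free patients, so false positives make up a larger share of all flagged cases.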
Affiliation(s)
| | | | - Anne Winther-Larsen
- Department of Clinical Biochemistry, Aarhus University Hospital, Aarhus, Denmark
| | - Susanne Oksbjerg Dalton
- Survivorship and Inequality in Cancer, Danish Cancer Society Research Center, Copenhagen, Denmark.,Department of Clinical Oncology & Palliative Care, Zealand University Hospital, Næstved, Denmark
| | | | - Henry Jensen
- Research Unit for General Practice, Aarhus, Denmark
| | | |
|
7
|
Ahuja Y, Wen J, Hong C, Xia Z, Huang S, Cai T. A semi-supervised adaptive Markov Gaussian embedding process (SAMGEP) for prediction of phenotype event times using the electronic health record. Sci Rep 2022; 12:17737. [PMID: 36273240 PMCID: PMC9588081 DOI: 10.1038/s41598-022-22585-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/27/2021] [Accepted: 10/17/2022] [Indexed: 01/18/2023]
Abstract
While there exist numerous methods to identify binary phenotypes (e.g., COPD) using electronic health record (EHR) data, few exist to ascertain the timings of phenotype events (e.g., COPD onset or exacerbations). Estimating event times could enable more powerful use of EHR data for longitudinal risk modeling, including survival analysis. Here we introduce Semi-supervised Adaptive Markov Gaussian Embedding Process (SAMGEP), a semi-supervised machine learning algorithm to estimate phenotype event times using EHR data with limited observed labels, which require resource-intensive chart review to obtain. SAMGEP models latent phenotype states as a binary Markov process, and it employs an adaptive weighting strategy to map timestamped EHR features to an embedding function that it models as a state-dependent Gaussian process. SAMGEP's feature weighting achieves meaningful feature selection, and its predictions significantly improve AUCs and F1 scores over existing approaches in diverse simulations and real-world settings. It is particularly adept at predicting cumulative risk and event counting process functions, and is robust to diverse generative model parameters. Moreover, it achieves high accuracy with few (50-100) labels, efficiently leveraging unlabeled EHR data to maximize information gain from costly-to-obtain event time labels. SAMGEP can be used to estimate accurate phenotype state functions for risk modeling research.
Affiliation(s)
- Yuri Ahuja
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, 02115, USA; Harvard Medical School, Boston, MA, USA; Department of Medicine, NYU Langone Health, New York, NY, USA
| | - Jun Wen
- Harvard Medical School, Boston, MA, USA
| | - Chuan Hong
- Harvard Medical School, Boston, MA, USA
| | - Zongqi Xia
- Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Sicong Huang
- Harvard Medical School, Boston, MA, USA; Division of Rheumatology, Inflammation, and Immunity, Brigham and Women’s Hospital, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA; Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA
| |
|
8
|
Liang L, Hou J, Uno H, Cho K, Ma Y, Cai T. Semi-supervised approach to event time annotation using longitudinal electronic health records. Lifetime Data Anal 2022; 28:428-491. [PMID: 35753014 PMCID: PMC10044535 DOI: 10.1007/s10985-022-09557-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Accepted: 05/13/2022] [Indexed: 06/15/2023]
Abstract
Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes such as time to cancer progression is not readily available in these databases. The true clinical event times typically cannot be approximated well based on simple extracts of billing or procedure codes. Meanwhile, annotating event times manually is prohibitively time- and resource-intensive. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In step I, we employ a functional principal component analysis approach to estimate the underlying intensity functions based on observed point processes from the unlabeled patients. In step II, we fit a penalized proportional odds model to the event time outcomes with features derived in step I in the labeled data, where the non-parametric baseline function is approximated using B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach relative to existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.
Affiliation(s)
- Liang Liang
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Jue Hou
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Hajime Uno
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Kelly Cho
- Massachusetts Veterans Epidemiology Research and Information Center, US Department of Veteran Affairs, Boston, MA, USA
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Yanyuan Ma
- Department of Statistics, Penn State University, University Park, PA, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
|
9
|
Khair S, Dort JC, Quan ML, Cheung WY, Sauro KM, Nakoneshny SC, Popowich BL, Liu P, Wu G, Xu Y. Validated algorithms for identifying timing of second event of oropharyngeal squamous cell carcinoma using real-world data. Head Neck 2022; 44:1909-1917. [PMID: 35653151 DOI: 10.1002/hed.27109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Revised: 04/29/2022] [Accepted: 05/18/2022] [Indexed: 11/07/2022]
Abstract
BACKGROUND Understanding the occurrence and timing of second events (recurrence and second primary cancer) is essential for cancer-specific survival analysis. However, this information is not readily available in administrative data. METHODS Alberta Cancer Registry, physician claims, and other administrative data were used. Timing of second event was estimated based on our developed algorithm. For validation, we calculated the difference in days between the algorithm-estimated and the chart-reviewed timing of second event. Further, the result of Cox regression modeling of cancer-free survival was compared to chart review data. RESULTS The majority (74.3%) of patients had a difference between the chart-reviewed and algorithm-estimated timing of second event falling within the 0-60 days window. Kaplan-Meier curves generated from the estimated data and chart review data were comparable, with a 5-year second-event-free survival rate of 75.4% versus 72.5%. CONCLUSION The algorithm provided an estimated timing of second event similar to that of the chart review.
Affiliation(s)
- Shahreen Khair
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
| | - Joseph C Dort
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.,Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada
| | - May Lynn Quan
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.,Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada.,Department of Oncology, Cumming School of Medicine, University of Calgary, Tom Baker, Cancer Centre, Calgary, Alberta, Canada
| | - Winson Y Cheung
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.,Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada
| | - Khara M Sauro
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.,Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada.,Department of Oncology, Cumming School of Medicine, University of Calgary, Tom Baker, Cancer Centre, Calgary, Alberta, Canada
| | - Steven C Nakoneshny
- The Ohlson Research Initiative, Arnie Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
| | - Brittany Lynn Popowich
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Teaching Research and Wellness (TRW), Calgary, Alberta, Canada
| | - Ping Liu
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
| | - Guosong Wu
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.,Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Teaching Research and Wellness (TRW), Calgary, Alberta, Canada
| | - Yuan Xu
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.,Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada.,Department of Oncology, Cumming School of Medicine, University of Calgary, Tom Baker, Cancer Centre, Calgary, Alberta, Canada.,Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Teaching Research and Wellness (TRW), Calgary, Alberta, Canada
| |
|
10
|
Ritzwoller DP, Hassett MJ, Uno H. Regarding the Utility of Unstructured Data and Natural Language Processing for Identification of Breast Cancer Recurrence. JCO Clin Cancer Inform 2021; 5:1024-1025. [PMID: 34637320 PMCID: PMC9848577 DOI: 10.1200/cci.21.00091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2021] [Accepted: 08/20/2021] [Indexed: 01/23/2023]
Affiliation(s)
- Debra P. Ritzwoller
- Institute for Health Research, Kaiser Permanente Colorado, Aurora, CO
| | - Michael J. Hassett
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA; Harvard Medical School, Boston, MA
| | - Hajime Uno
- Harvard Medical School, Boston, MA
| |
|
11
|
Caswell-Jin JL, Callahan A, Purington N, Han SS, Itakura H, John EM, Blayney DW, Sledge GW, Shah NH, Kurian AW. Treatment and Monitoring Variability in US Metastatic Breast Cancer Care. JCO Clin Cancer Inform 2021; 5:600-614. [PMID: 34043432 DOI: 10.1200/cci.21.00031] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]
Abstract
PURPOSE Treatment and monitoring options for patients with metastatic breast cancer (MBC) are increasing, but little is known about variability in care. We sought to improve understanding of MBC care and its correlates by analyzing real-world claims data using a search engine with a novel query language to enable temporal electronic phenotyping. METHODS Using the Advanced Cohort Engine, we identified 6,180 women who met criteria for having estrogen receptor-positive, human epidermal growth factor receptor 2-negative MBC from IBM MarketScan US insurance claims (2007-2014). We characterized treatment, monitoring, and hospice usage, along with clinical and nonclinical factors affecting care. RESULTS We observed wide variability in treatment modality and monitoring across patients and geography. Most women received first-recorded therapy with endocrine (67%) versus chemotherapy, underwent more computed tomography (CT) (76%) than positron emission tomography-CT, and were monitored using tumor markers (58%). Nearly half (46%) met criteria for aggressive disease, which were associated with receiving chemotherapy first, monitoring primarily with CT, and more frequent imaging. Older age was associated with endocrine therapy first, less frequent imaging, and less use of tumor markers. After controlling for clinical factors, care strategies varied significantly by nonclinical factors (median regional income with first-recorded therapy and imaging type, geographic region with these and with imaging frequency and use of tumor markers; P < .0001). CONCLUSION Variability in US MBC care is explained by patient and disease factors and by nonclinical factors such as geographic region, suggesting that treatment decisions are influenced by local practice patterns and/or resources. 
A search engine designed to express complex electronic phenotypes from longitudinal patient records enables the identification of variability in patient care, helping to define disparities and areas for improvement.
Affiliation(s)
- Alison Callahan
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- Natasha Purington
- Department of Medicine, Stanford University School of Medicine, Stanford, CA; Department of Neurosurgery, Stanford University School of Medicine, Stanford, CA
- Summer S Han
- Department of Medicine, Stanford University School of Medicine, Stanford, CA; Department of Neurosurgery, Stanford University School of Medicine, Stanford, CA
- Haruka Itakura
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- Esther M John
- Department of Medicine, Stanford University School of Medicine, Stanford, CA; Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, CA
- Douglas W Blayney
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- George W Sledge
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- Nigam H Shah
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- Allison W Kurian
- Department of Medicine, Stanford University School of Medicine, Stanford, CA; Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, CA
12
Izci H, Tambuyzer T, Tuand K, Depoorter V, Laenen A, Wildiers H, Vergote I, Van Eycken L, De Schutter H, Verdoodt F, Neven P. A Systematic Review of Estimating Breast Cancer Recurrence at the Population Level With Administrative Data. J Natl Cancer Inst 2021; 112:979-988. [PMID: 32259259 DOI: 10.1093/jnci/djaa050] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/21/2020] [Revised: 03/20/2020] [Accepted: 03/31/2020] [Indexed: 12/18/2022]
Abstract
BACKGROUND Exact numbers of breast cancer recurrences are currently unknown at the population level, because they are challenging to actively collect. Previously, real-world data such as administrative claims have been used within expert- or data-driven (machine learning) algorithms for estimating cancer recurrence. We present the first systematic review and meta-analysis, to our knowledge, of publications estimating breast cancer recurrence at the population level using algorithms based on administrative data. METHODS The systematic literature search followed Preferred Reporting Items for Systematic Reviews and Meta-Analysis guidelines. We evaluated and compared sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy of algorithms. A random-effects meta-analysis was performed using a generalized linear mixed model to obtain a pooled estimate of accuracy. RESULTS Seventeen articles met the inclusion criteria. Most articles used information from medical files as the gold standard, defined as any recurrence. Two studies included bone metastases only in the definition of recurrence. Fewer studies used a model-based approach (decision trees or logistic regression) (41.2%) compared with studies using detection rules without specified model (58.8%). The generalized linear mixed model for all recurrence types reported an accuracy of 92.2% (95% confidence interval = 88.4% to 94.8%). CONCLUSIONS Publications reporting algorithms for detecting breast cancer recurrence are limited in number and heterogeneous. A thorough analysis of the existing algorithms demonstrated the need for more standardization and validation. The meta-analysis reported a high accuracy overall, which indicates algorithms as promising tools to identify breast cancer recurrence at the population level. The rule-based approach combined with emerging machine learning algorithms could be interesting to explore in the future.
Affiliation(s)
- Hava Izci
- Department of Oncology, KU Leuven - University of Leuven, Leuven, Belgium
- Tim Tambuyzer
- Research Department, Belgian Cancer Registry, Brussels, Belgium
- Krizia Tuand
- KU Leuven Libraries - 2Bergen - Learning Centre Désiré Collen, Leuven, Belgium
- Victoria Depoorter
- Department of Oncology, KU Leuven - University of Leuven, Leuven, Belgium
- Annouschka Laenen
- Interuniversity Centre for Biostatistics and Statistical Bioinformatics, Leuven, Belgium
- Hans Wildiers
- Department of Oncology, KU Leuven - University of Leuven, Leuven, Belgium; Department of General Medical Oncology, University Hospitals Leuven, Leuven, Belgium
- Ignace Vergote
- Department of Oncology, KU Leuven - University of Leuven, Leuven, Belgium; Department of Gynaecological Oncology, University Hospitals Leuven, Leuven, Belgium
- Freija Verdoodt
- Research Department, Belgian Cancer Registry, Brussels, Belgium
- Patrick Neven
- Department of Oncology, KU Leuven - University of Leuven, Leuven, Belgium; Department of Gynaecological Oncology, University Hospitals Leuven, Leuven, Belgium
13
Grabner M, Molife C, Wang L, Winfree KB, Cui ZL, Cuyun Carter G, Hess LM. Data Integration to Improve Real-world Health Outcomes Research for Non-Small Cell Lung Cancer in the United States: Descriptive and Qualitative Exploration. JMIR Cancer 2021; 7:e23161. [PMID: 33843600 PMCID: PMC8076987 DOI: 10.2196/23161] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/04/2020] [Revised: 01/29/2021] [Accepted: 02/01/2021] [Indexed: 12/20/2022]
Abstract
Background The integration of data from disparate sources could help alleviate data insufficiency in real-world studies and compensate for the inadequacies of single data sources and short-duration, small sample size studies while improving the utility of data for research. Objective This study aims to describe and evaluate a process of integrating data from several complementary sources to conduct health outcomes research in patients with non–small cell lung cancer (NSCLC). The integrated data set is also used to describe patient demographics, clinical characteristics, treatment patterns, and mortality rates. Methods This retrospective cohort study integrated data from 4 sources: administrative claims from the HealthCore Integrated Research Database, clinical data from a Cancer Care Quality Program (CCQP), clinical data from abstracted medical records (MRs), and mortality data from the US Social Security Administration. Patients with lung cancer who initiated second-line (2L) therapy between November 01, 2015, and April 13, 2018, were identified in the claims and CCQP data. Eligible patients were 18 years or older and received atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab in the 2L setting. The main analysis cohort included patients with claims data and data from at least one additional data source (CCQP or MR). Patients without integrated data (claims only) were reported separately. Descriptive and univariate statistics were reported. Results Data integration resulted in a main analysis cohort of 2195 patients with NSCLC; 2106 patients had CCQP and 407 patients had MR data. The claims-only cohort included 931 eligible patients. For the main analysis cohort, the mean age was 62.1 (SD 9.27) years, 48.56% (1066/2195) were female, the median length of follow-up was 6.8 months, and for 37.77% (829/2195), death was observed. 
For the claims-only cohort, the mean age was 66.6 (SD 12.69) years, 52.1% (485/931) were female, the median length of follow-up was 8.6 months, and for 29.3% (273/931), death was observed. The most frequent 2L treatment was immunotherapy (1094/2195, 49.84%), followed by platinum-based regimens (472/2195, 21.50%) and single-agent chemotherapy (441/2195, 20.09%); mean duration of 2L therapy was 5.6 (SD 4.9, median 4) months. We describe challenges and learnings from the data integration process, and the benefits of the integrated data set, which includes a richer set of clinical and outcome data to supplement the utilization metrics available in administrative claims. Conclusions The management of patients with NSCLC requires care from a multidisciplinary team, leading to a lack of a single aggregated data source in real-world settings. The availability of integrated clinical data from MRs, health plan claims, and other sources of clinical care may improve the ability to assess emerging treatments.
Affiliation(s)
- Cliff Molife
- Eli Lilly and Company, Indianapolis, IN, United States
- Liya Wang
- HealthCore Inc, Wilmington, DE, United States
- Lisa M Hess
- Eli Lilly and Company, Indianapolis, IN, United States
14
Rasmussen LA, Jensen H, Virgilsen LF, Jeppesen MM, Blaakaer J, Hansen DG, Jensen PT, Mogensen O, Vedsted P. Identification of endometrial cancer recurrence - a validated algorithm based on nationwide Danish registries. Acta Oncol 2021; 60:452-458. [PMID: 33306454 DOI: 10.1080/0284186x.2020.1859133] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/22/2022]
Abstract
INTRODUCTION Recurrence of endometrial cancer is not routinely registered in the Danish national health registers. The aim of this study was to develop and validate a register-based algorithm to identify women diagnosed with endometrial cancer recurrence in Denmark to facilitate register-based research in this field. MATERIAL AND METHODS We conducted a cohort study based on data from Danish health registers. The algorithm was designed to identify women with recurrence and estimate the accompanying diagnosis date, which was based on information from the Danish National Patient Registry and the Danish National Pathology Registry. Indicators of recurrence were pathology registrations and procedure or diagnosis codes suggesting recurrence and related treatment. The gold standard for endometrial cancer recurrence originated from a Danish nationwide study of 2612 women diagnosed with endometrial cancer, FIGO stage I-II during 2005-2009. Recurrence was suspected in 308 women based on pathology reports, and recurrence suspicion was confirmed or rejected in the 308 women based on reviews of the medical records. The algorithm was validated by comparing the recurrence status identified by the algorithm and the recurrence status in the gold standard. RESULTS After relevant exclusions, the final study population consisted of 268 women, hereof 160 (60%) with recurrence according to the gold standard. The algorithm displayed a sensitivity of 91.3% (95% confidence interval (CI): 85.8-95.1), a specificity of 91.7% (95% CI: 84.8-96.1) and a positive predictive value of 94.2% (95% CI: 89.3-97.3). The algorithm estimated the recurrence date within 30 days of the gold standard in 86% and within 60 days of the gold standard in 94% of the identified patients. DISCUSSION The algorithm demonstrated good performance; it could be a valuable tool for future research in endometrial cancer recurrence and may facilitate studies with potential impact on clinical practice.
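The accuracy figures reported in this abstract can be reproduced from a 2x2 confusion table. The sketch below is illustrative only. The abstract gives the study population (268 women, 160 with recurrence) and the three percentages, but not the individual cell counts; the counts used here (146 true positives, 9 false positives, 14 false negatives, 99 true negatives) are back-calculated assumptions that are consistent with those figures, and `ConfusionCounts` and `validation_metrics` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2x2 table: algorithm result versus gold standard."""
    tp: int  # algorithm positive, gold standard positive
    fp: int  # algorithm positive, gold standard negative
    fn: int  # algorithm negative, gold standard positive
    tn: int  # algorithm negative, gold standard negative


def validation_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, and positive predictive value."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
    }


# ASSUMPTION: cell counts are back-calculated, not taken from the paper.
# They sum to 268 women, of whom 160 have gold-standard recurrence.
counts = ConfusionCounts(tp=146, fp=9, fn=14, tn=99)
metrics = validation_metrics(counts)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Under these assumed counts the three proportions are 146/160, 99/108, and 146/155, which agree with the reported 91.3%, 91.7%, and 94.2% to within rounding.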
Affiliation(s)
- Linda A. Rasmussen
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
- Henry Jensen
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
- Line F. Virgilsen
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
- Mette M. Jeppesen
- Department of Gynaecology and Obstetrics, Odense University Hospital, Odense, Denmark
- Jan Blaakaer
- Department of Gynaecology and Obstetrics, Odense University Hospital, Odense, Denmark
- Dorte G. Hansen
- Research Unit of General Practice, University of Southern Denmark, Odense, Denmark
- Pernille T. Jensen
- Department of Gynaecology and Obstetrics, Aarhus University Hospital, Aarhus, Denmark
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Ole Mogensen
- Department of Gynaecology and Obstetrics, Aarhus University Hospital, Aarhus, Denmark
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Peter Vedsted
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
15
Rasmussen LA, Jensen H, Virgilsen LF, Hölmich LR, Vedsted P. A Validated Register-Based Algorithm to Identify Patients Diagnosed with Recurrence of Malignant Melanoma in Denmark. Clin Epidemiol 2021; 13:207-214. [PMID: 33758549 PMCID: PMC7979354 DOI: 10.2147/clep.s295844] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/04/2020] [Accepted: 02/18/2021] [Indexed: 11/23/2022]
Abstract
Purpose Information on cancer recurrence is rarely available outside clinical trials. Wide exclusion criteria used in clinical trials tend to limit the generalizability of findings to the entire population of people living beyond a cancer disease. Therefore, population-level evidence is needed. The aim of this study was to develop and validate a register-based algorithm to identify patients diagnosed with recurrence after curative treatment of malignant melanoma. Patients and Methods Indicators of recurrence were diagnosis and procedure codes recorded in the Danish National Patient Register and pathology results recorded in the Danish National Pathology Register. Medical records on recurrence status and recurrence date in the Danish Melanoma Database served as the gold standard to assess the accuracy of the algorithm. Results The study included 1747 patients diagnosed with malignant melanoma; 95 (5.4%) were diagnosed with recurrence of malignant melanoma according to the gold standard. The algorithm reached a sensitivity of 93.7% (95% confidence interval (CI) 86.8–97.6), a specificity of 99.2% (95% CI: 98.6–99.5), a positive predictive value of 86.4% (95% CI: 78.2–92.4), and negative predictive value of 99.6% (95% CI: 99.2–99.9). Lin’s concordance correlation coefficient was 0.992 (95% CI: 0.989–0.996) for the agreement between the recurrence dates generated by the algorithm and by the gold standard. Conclusion The algorithm can be used to identify patients diagnosed with recurrence of malignant melanoma and to establish the timing of recurrence. This can generate population-level evidence on disease-free survival and diagnostic pathways for recurrence of malignant melanoma.
Affiliation(s)
- Linda Aagaard Rasmussen
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
- Henry Jensen
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
- Line Flytkjaer Virgilsen
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
- Lisbet Rosenkrantz Hölmich
- Department of Plastic Surgery, Herlev and Gentofte Hospital, Herlev, Denmark; Department of Clinical Medicine, Copenhagen University, Copenhagen, Denmark
- Peter Vedsted
- Research Centre for Cancer Diagnosis in Primary Care (CaP), Research Unit for General Practice, Aarhus, Denmark
16
Kohane IS, Aronow BJ, Avillach P, Beaulieu-Jones BK, Bellazzi R, Bradford RL, Brat GA, Cannataro M, Cimino JJ, García-Barrio N, Gehlenborg N, Ghassemi M, Gutiérrez-Sacristán A, Hanauer DA, Holmes JH, Hong C, Klann JG, Loh NHW, Luo Y, Mandl KD, Daniar M, Moore JH, Murphy SN, Neuraz A, Ngiam KY, Omenn GS, Palmer N, Patel LP, Pedrera-Jiménez M, Sliz P, South AM, Tan ALM, Taylor DM, Taylor BW, Torti C, Vallejos AK, Wagholikar KB, Weber GM, Cai T. What Every Reader Should Know About Studies Using Electronic Health Record Data but May Be Afraid to Ask. J Med Internet Res 2021; 23:e22219. [PMID: 33600347 PMCID: PMC7927948 DOI: 10.2196/22219] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 07/13/2020] [Revised: 09/14/2020] [Accepted: 01/10/2021] [Indexed: 12/13/2022]
Abstract
Coincident with the tsunami of COVID-19–related publications, there has been a surge of studies using real-world data, including those obtained from the electronic health record (EHR). Unfortunately, several of these high-profile publications were retracted because of concerns regarding the soundness and quality of the studies and the EHR data they purported to analyze. These retractions highlight that although a small community of EHR informatics experts can readily identify strengths and flaws in EHR-derived studies, many medical editorial teams and otherwise sophisticated medical readers lack the framework to fully critically appraise these studies. In addition, conventional statistical analyses cannot overcome the need for an understanding of the opportunities and limitations of EHR-derived studies. We distill here from the broader informatics literature six key considerations that are crucial for appraising studies utilizing EHR data: data completeness, data collection and handling (eg, transformation), data type (ie, codified, textual), robustness of methods against EHR variability (within and across institutions, countries, and time), transparency of data and analytic code, and the multidisciplinary approach. These considerations will inform researchers, clinicians, and other stakeholders as to the recommended best practices in reviewing manuscripts, grants, and other outputs from EHR-data derived studies, and thereby promote and foster rigor, quality, and reliability of this rapidly growing field.
Affiliation(s)
- Isaac S Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Bruce J Aronow
- Biomedical Informatics, Cincinnati Children's Hospital Medical Center, University of Cincinnati, Cincinnati, OH, United States
- Paul Avillach
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; ICS Maugeri, Pavia, Italy
- Robert L Bradford
- North Carolina Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Gabriel A Brat
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Mario Cannataro
- Data Analytics Research Center, University Magna Graecia of Catanzaro, Catanzaro, Italy; Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, Italy
- James J Cimino
- Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, United States
- Nils Gehlenborg
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Marzyeh Ghassemi
- Department of Computer Science and Medicine, University of Toronto, Toronto, ON, Canada
- David A Hanauer
- Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, United States
- John H Holmes
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Jeffrey G Klann
- Department of Medicine, Harvard Medical School, Boston, MA, United States; Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA, United States
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, Chicago, IL, United States
- Kenneth D Mandl
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
- Mohamad Daniar
- Clinical Research Informatics, Boston Children's Hospital, Boston, MA, United States
- Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States
- Shawn N Murphy
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States; Department of Neurology, Massachusetts General Hospital, Boston, MA, United States
- Antoine Neuraz
- Department of Biomedical Informatics, Necker-Enfant Malades Hospital, Assistance Publique - Hôpitaux de Paris, Paris, France; Centre de Recherche des Cordeliers, INSERM UMRS 1138 Team 22, Université de Paris, Paris, France
- Kee Yuan Ngiam
- National University Health Systems, Singapore, Singapore
- Gilbert S Omenn
- Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, United States
- Nathan Palmer
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Lav P Patel
- Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, Kansas City, KS, United States
- Piotr Sliz
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
- Andrew M South
- Section of Nephrology, Department of Pediatrics, Brenner Children's Hospital, Wake Forest School of Medicine, Winston Salem, NC, United States
- Amelia Li Min Tan
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States; Department of Biomedical Informatics, National University of Singapore, Singapore, Singapore
- Deanne M Taylor
- Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, United States; Department of Pediatrics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, United States
- Bradley W Taylor
- Clinical and Translational Science Institute, Medical College of Wisconsin, Milwaukee, WI, United States
- Carlo Torti
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, Italy
- Andrew K Vallejos
- Clinical and Translational Science Institute, Medical College of Wisconsin, Milwaukee, WI, United States
- Kavishwar B Wagholikar
- Department of Medicine, Harvard Medical School, Boston, MA, United States; Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA, United States
- Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
17
Beyrer J, Nelson DR, Sheffield KM, Huang YJ, Ellington T, Hincapie AL. Development and validation of coding algorithms to identify patients with incident lung cancer in United States healthcare claims data. Pharmacoepidemiol Drug Saf 2020; 29:1465-1479. [PMID: 33012044 DOI: 10.1002/pds.5137] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/22/2020] [Revised: 09/01/2020] [Accepted: 09/09/2020] [Indexed: 11/11/2022]
Abstract
PURPOSE Our aim was to develop and validate a practical US healthcare claims algorithm for identifying incident lung cancer that improves on positive predictive value (PPV) and sensitivity observed in past studies. METHODS Patients newly diagnosed with lung cancer in Surveillance, Epidemiology, and End Results (SEER) (gold standard) were linked with Medicare claims. A 5% Medicare "other cancer" sample and noncancer sample served as controls. A split-sample validation approach was used. Rules-based, regression, and machine learning models for developing algorithms were explored. Algorithms were developed in the model building subset. Rules-based algorithms and those with the highest F scores were evaluated in the validation subset. F scores were compared for 1000 bootstrap samples. Misclassification was evaluated by calculating the odds of selection by the algorithm among true positives and true negatives. RESULTS A practical single-score algorithm derived from a logistic regression model had sensitivity = 78.22% and PPV = 78.50% (F score: 78.36). The algorithm was most likely to misclassify older patients (ages ≥80 years) or with missing data in the SEER registry, shorter follow-up time in Medicare (<3 months), insurance through Veterans Affairs, >1 cancer in SEER, or certain Charlson comorbidities (dementia, chronic pulmonary disease, liver disease, or myocardial infarction). CONCLUSION In this dataset, a practical point-based algorithm for identifying incident lung cancer demonstrated significant and substantial improvement (7.9% and 23.9% absolute improvement in sensitivity and PPV, respectively) compared with a current standard.
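The F score quoted in this abstract combines sensitivity and PPV into one number. The abstract does not define the score, so the sketch below assumes it is the balanced F1 score, that is, the harmonic mean of sensitivity (recall) and PPV (precision); `f_score` is a hypothetical helper name.

```python
def f_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision).

    ASSUMPTION: the abstract's "F score" is the balanced F1 score.
    Inputs and output share the same scale (percent in, percent out).
    """
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Values reported for the single-score logistic regression algorithm.
reported_sensitivity = 78.22
reported_ppv = 78.50

f = f_score(reported_sensitivity, reported_ppv)
print(f"F score: {f:.2f}")
```

With the reported inputs, the harmonic mean rounds to the 78.36 given in the abstract, which is consistent with the F1 assumption.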
Affiliation(s)
- Julie Beyrer
- Eli Lilly and Company, Indianapolis, Indiana, USA
- Ana L Hincapie
- University of Cincinnati James L. Winkle College of Pharmacy, Cincinnati, Ohio, USA
18
Carroll NM, Ritzwoller DP, Banegas MP, O'Keeffe-Rosetti M, Cronin AM, Uno H, Hornbrook MC, Hassett MJ. Performance of Cancer Recurrence Algorithms After Coding Scheme Switch From International Classification of Diseases 9th Revision to International Classification of Diseases 10th Revision. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 30869998 DOI: 10.1200/cci.18.00113] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/20/2022]
Abstract
PURPOSE We previously developed and validated informatic algorithms that used International Classification of Diseases 9th revision (ICD9)-based diagnostic and procedure codes to detect the presence and timing of cancer recurrence (the RECUR Algorithms). In 2015, ICD10 replaced ICD9 as the worldwide coding standard. To understand the impact of this transition, we evaluated the performance of the RECUR Algorithms after incorporating ICD10 codes. METHODS Using publicly available translation tables along with clinician and other expertise, we updated the algorithms to include ICD10 codes as additional input variables. We evaluated the performance of the algorithms using gold standard recurrence measures associated with a contemporary cohort of patients with stage I to III breast, colorectal, and lung (excluding IIIB) cancer and derived performance measures, including the area under the receiver operating curve, average absolute prediction error, and correct classification rate. These values were compared with the performance measures derived from the validation of the original algorithms. RESULTS A total of 659 colorectal, 280 lung, and 2,053 breast cancer cases were identified. Area under the receiver operating curve derived from the updated algorithms was 89.0% (95% CI, 82.3% to 95.7%), 88.9% (95% CI, 79.3% to 98.2%), and 80.5% (95% CI, 72.8% to 88.2%) for the colorectal, lung, and breast cancer algorithms, respectively. Average absolute prediction errors for recurrence timing were 2.7 (SE, 11.3%), 2.4 (SE, 10.4%), and 5.6 months (SE, 21.8%), respectively, and timing estimates were within 6 months of actual recurrence for more than 80% of colorectal, more than 90% of lung, and more than 50% of breast cancer cases using the updated algorithm. CONCLUSION Performance measures derived from the updated and original algorithms had overlapping confidence intervals, suggesting that the ICD9 to ICD10 transition did not affect the RECUR Algorithm performance.
Affiliation(s)
- Hajime Uno
- Dana-Farber Cancer Institute, Boston, MA; Harvard Medical School, Boston, MA
- Michael J Hassett
- Dana-Farber Cancer Institute, Boston, MA; Harvard Medical School, Boston, MA
19
Kehl KL, Hassett MJ, Schrag D. Patterns of care for older patients with stage IV non-small cell lung cancer in the immunotherapy era. Cancer Med 2020; 9:2019-2029. [PMID: 31989786 PMCID: PMC7064091 DOI: 10.1002/cam4.2854] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/07/2019] [Revised: 12/19/2019] [Accepted: 01/05/2020] [Indexed: 12/26/2022]
Abstract
BACKGROUND Historically, older patients with advanced lung cancer have often received no systemic treatment. Immunotherapy has improved outcomes in clinical trials, but its dissemination and implementation at the population level is not well-understood. METHODS A retrospective cohort study of patients with stage IV non-small cell lung cancer (NSCLC) diagnosed age 66 or older from 2012 to 2015 was conducted using SEER-Medicare. Treatment patterns within one year of diagnosis were ascertained. Outcomes included delivery of (a) any systemic therapy; (b) any second-line infusional therapy, following first-line infusional therapy; and (c) any second-line immunotherapy, following first-line infusional therapy. Trends in care patterns associated with second-line immunotherapy approvals in 2015 were assessed using generalized additive models. Sociodemographic and clinical predictors of treatment were explored using logistic regression. RESULTS Among 10 303 patients, 5173 (50.2%) received first-line systemic therapy, with little change between the years 2012 (47.5%) and 2015 (50.3%). Among 3943 patients completing first-line infusional therapy, the proportion starting second-line infusional treatment remained stable from 2012 (30.5%) through 2014 (32.9%), before increasing in 2015 (42.4%) concurrent with second-line immunotherapy approvals. Factors associated with decreased utilization of any therapy included age, black race, Medicaid eligibility, residence in a high-poverty area, nonadenocarcinoma histology, and comorbidity; factors associated with increased utilization of any therapy included Asian race and Hispanic ethnicity. Among patients who received first-line infusional therapy, factors associated with decreased utilization of second-line infusional therapy included age, Medicaid eligibility, nonadenocarcinoma histology, and comorbidity; Asian race was associated with increased utilization of second-line infusional therapy. 
CONCLUSION United States Food and Drug Administration (FDA) approvals of immunotherapy for the second-line treatment of advanced NSCLC in 2015 were associated with increased rates of any second-line treatment, but disparities based on social determinants of health persisted.
MESH Headings
- Aged
- Aged, 80 and over
- Antineoplastic Agents, Immunological/administration & dosage
- Antineoplastic Agents, Immunological/standards
- Antineoplastic Combined Chemotherapy Protocols/administration & dosage
- Antineoplastic Combined Chemotherapy Protocols/standards
- Carcinoma, Non-Small-Cell Lung/diagnosis
- Carcinoma, Non-Small-Cell Lung/drug therapy
- Carcinoma, Non-Small-Cell Lung/immunology
- Carcinoma, Non-Small-Cell Lung/mortality
- Drug Approval
- Female
- Humans
- Infusions, Intravenous
- Lung/immunology
- Lung/pathology
- Lung Neoplasms/diagnosis
- Lung Neoplasms/drug therapy
- Lung Neoplasms/immunology
- Lung Neoplasms/mortality
- Male
- Medicare/statistics & numerical data
- Neoplasm Staging
- Practice Patterns, Physicians'/standards
- Practice Patterns, Physicians'/statistics & numerical data
- Practice Patterns, Physicians'/trends
- Retrospective Studies
- SEER Program/statistics & numerical data
- United States/epidemiology
- United States Food and Drug Administration/standards
Affiliation(s)
- Kenneth L. Kehl
- Division of Population Sciences, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA, USA
- Michael J. Hassett
- Division of Population Sciences, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA, USA
- Deborah Schrag
- Division of Population Sciences, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA, USA