1. Gao J, Bonzel CL, Hong C, Varghese P, Zakir K, Gronsbell J. Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms. J Am Med Inform Assoc 2024;31:640-650. PMID: 38128118; PMCID: PMC10873838; DOI: 10.1093/jamia/ocad226.
Abstract
OBJECTIVE: High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly supervised learning methods have been proposed. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when only a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimating the receiver operating characteristic (ROC) parameters of PAs (eg, sensitivity, specificity).

MATERIALS AND METHODS: ssROC uses a small labeled dataset to nonparametrically impute missing labels. The imputations are then used for ROC parameter estimation, yielding more precise estimates of PA performance than classical supervised ROC analysis (supROC) using only the labeled data. We evaluated ssROC with synthetic, semi-synthetic, and EHR data from Mass General Brigham (MGB).

RESULTS: ssROC produced ROC parameter estimates with minimal bias and significantly lower variance than supROC in the simulated and semi-synthetic data. For the 5 PAs from MGB, the estimates from ssROC were on average 30% to 60% less variable than those from supROC.

DISCUSSION: ssROC enables precise evaluation of PA performance without demanding large volumes of labeled data. ssROC is also easily implementable in open-source R software.

CONCLUSION: When used in conjunction with weakly supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.
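The impute-then-estimate recipe described in the abstract can be sketched in a few lines. This is a hypothetical Python illustration of the general idea, not the authors' ssROC package (which is distributed as R software); the Gaussian kernel, the bandwidth, and all function names are our own assumptions.

```python
import numpy as np

def impute_labels(scores_all, scores_lab, y_lab, bandwidth=0.1):
    """Nonparametric (Nadaraya-Watson) estimate of P(Y=1 | score),
    fit on the small labeled subset, evaluated on every record."""
    z = (scores_all[:, None] - scores_lab[None, :]) / bandwidth
    w = np.exp(-0.5 * z**2)          # Gaussian kernel weights
    return (w @ y_lab) / w.sum(axis=1)

def roc_params(scores, labels, cutoff):
    """Sensitivity and specificity at a cutoff. `labels` may be soft
    imputations, in which case each record contributes fractionally."""
    pred = (scores >= cutoff).astype(float)
    sens = (labels * pred).sum() / labels.sum()
    spec = ((1 - labels) * (1 - pred)).sum() / (1 - labels).sum()
    return sens, spec

# Toy example: a phenotyping-algorithm score for 5000 records, of which
# only 150 carry a chart-reviewed gold-standard label.
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, 5000)                                 # latent phenotype
s = np.clip(0.35 + 0.3 * y + rng.normal(0, 0.15, 5000), 0, 1)  # PA score
lab = rng.choice(5000, 150, replace=False)                     # labeled subset
y_hat = impute_labels(s, s[lab], y[lab].astype(float))
sens_ss, spec_ss = roc_params(s, y_hat, cutoff=0.5)            # uses all 5000 records
```

Because the final estimates average over all records rather than only the 150 labeled ones, their variability is driven by the full sample size, which is the intuition behind the precision gains reported in the abstract.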
Affiliation(s)
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Clara-Lea Bonzel
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States
- Paul Varghese
  - Health Informatics, Verily Life Sciences, Cambridge, MA, United States
- Karim Zakir
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
  - Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
2. Lotspeich SC, Amorim GGC, Shaw PA, Tao R, Shepherd BE. Optimal multiwave validation of secondary use data with outcome and exposure misclassification. Can J Stat 2023. DOI: 10.1002/cjs.11772.
3. Lee RY, Kross EK, Torrence J, Li KS, Sibley J, Cohen T, Lober WB, Engelberg RA, Curtis JR. Assessment of Natural Language Processing of Electronic Health Records to Measure Goals-of-Care Discussions as a Clinical Trial Outcome. JAMA Netw Open 2023;6:e231204. PMID: 36862411; PMCID: PMC9982698; DOI: 10.1001/jamanetworkopen.2023.1204.
Abstract
IMPORTANCE: Many clinical trial outcomes are documented in free-text electronic health records (EHRs), making manual data collection costly and infeasible at scale. Natural language processing (NLP) is a promising approach for measuring such outcomes efficiently, but ignoring NLP-related misclassification may lead to underpowered studies.

OBJECTIVE: To evaluate the performance, feasibility, and power implications of using NLP to measure the primary outcome of EHR-documented goals-of-care discussions in a pragmatic randomized clinical trial of a communication intervention.

DESIGN, SETTING, AND PARTICIPANTS: This diagnostic study compared the performance, feasibility, and power implications of measuring EHR-documented goals-of-care discussions using 3 approaches: (1) deep-learning NLP, (2) NLP-screened human abstraction (manual verification of NLP-positive records), and (3) conventional manual abstraction. The study included hospitalized patients aged 55 years or older with serious illness enrolled between April 23, 2020, and March 26, 2021, in a pragmatic randomized clinical trial of a communication intervention in a multihospital US academic health system.

MAIN OUTCOMES AND MEASURES: Main outcomes were NLP performance characteristics, human abstractor-hours, and misclassification-adjusted statistical power of methods of measuring clinician-documented goals-of-care discussions. NLP performance was evaluated with receiver operating characteristic (ROC) curves and precision-recall (PR) analyses, and the effects of misclassification on power were examined using mathematical substitution and Monte Carlo simulation.

RESULTS: A total of 2512 trial participants (mean [SD] age, 71.7 [10.8] years; 1456 [58%] female) amassed 44 324 clinical notes during 30-day follow-up. In a validation sample of 159 participants, deep-learning NLP trained on a separate training dataset identified patients with documented goals-of-care discussions with moderate accuracy (maximal F1 score, 0.82; area under the ROC curve, 0.924; area under the PR curve, 0.879). Manual abstraction of the outcome from the trial dataset would require an estimated 2000 abstractor-hours and would power the trial to detect a risk difference of 5.4% (assuming 33.5% control-arm prevalence, 80% power, and 2-sided α = .05). Measuring the outcome by NLP alone would power the trial to detect a risk difference of 7.6%. Measuring the outcome by NLP-screened human abstraction would require 34.3 abstractor-hours to achieve an estimated sensitivity of 92.6% and would power the trial to detect a risk difference of 5.7%. Monte Carlo simulations corroborated the misclassification-adjusted power calculations.

CONCLUSIONS AND RELEVANCE: In this diagnostic study, deep-learning NLP and NLP-screened human abstraction had favorable characteristics for measuring an EHR outcome at scale. Adjusted power calculations accurately quantified power loss from NLP-related misclassification, suggesting that incorporating this approach into the design of studies using NLP would be beneficial.
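The "mathematical substitution" behind misclassification-adjusted power can be sketched as follows. A misclassified binary outcome shifts the observed prevalence and attenuates the true risk difference by roughly (sensitivity + specificity - 1); substituting the observed quantities into a standard two-proportion formula shows how the minimum detectable difference grows. This is a simplified, hypothetical illustration: the equal-variance approximation, the example sensitivity/specificity values, and the function names are our assumptions, not the trial's actual procedure.

```python
from statistics import NormalDist

def observed_prevalence(p, sens, spec):
    # Prevalence of the outcome as *measured* by an imperfect classifier.
    return sens * p + (1 - spec) * (1 - p)

def min_detectable_difference(n_per_arm, p_control, sens=1.0, spec=1.0,
                              alpha=0.05, power=0.80):
    # Two-sided two-proportion formula (equal-variance approximation),
    # solved on the observed scale and mapped back to the true scale
    # through the attenuation factor (sens + spec - 1).
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_obs = observed_prevalence(p_control, sens, spec)
    delta_obs = z * (2 * p_obs * (1 - p_obs) / n_per_arm) ** 0.5
    return delta_obs / (sens + spec - 1)

# Perfect measurement vs a hypothetical imperfect NLP classifier:
# misclassification inflates the smallest detectable risk difference.
perfect = min_detectable_difference(1256, 0.335)
with_nlp = min_detectable_difference(1256, 0.335, sens=0.90, spec=0.95)
```

With roughly 1256 participants per arm and 33.5% control-arm prevalence this approximation lands near the 5.4% figure quoted above for perfect measurement, but only the qualitative behavior, not the exact trial numbers, should be read from this sketch.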
Affiliation(s)
- Robert Y. Lee
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of Washington, Seattle
- Erin K. Kross
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of Washington, Seattle
- Janaki Torrence
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of Washington, Seattle
- Kevin S. Li
  - Division of Biomedical and Health Informatics, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
- James Sibley
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Department of Biobehavioral Nursing and Health Informatics, University of Washington, Seattle
- Trevor Cohen
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Biomedical and Health Informatics, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
- William B. Lober
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Biomedical and Health Informatics, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
  - Department of Biobehavioral Nursing and Health Informatics, University of Washington, Seattle
  - Department of Global Health, University of Washington, Seattle
- Ruth A. Engelberg
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of Washington, Seattle
- J. Randall Curtis
  - Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of Washington, Seattle
  - Department of Biobehavioral Nursing and Health Informatics, University of Washington, Seattle
  - Department of Health Systems and Population Health, University of Washington, Seattle
4. Estevez M, Benedum CM, Jiang C, Cohen AB, Phadke S, Sarkar S, Bozkurt S. Considerations for the Use of Machine Learning Extracted Real-World Data to Support Evidence Generation: A Research-Centric Evaluation Framework. Cancers (Basel) 2022;14(13):3063. PMID: 35804834; PMCID: PMC9264846; DOI: 10.3390/cancers14133063.
Abstract
A vast amount of real-world data, such as pathology reports and clinical notes, is captured as unstructured text in electronic health records (EHRs). This information is both difficult and costly to extract through human abstraction, especially when scaling to large datasets is needed. Fortunately, Natural Language Processing (NLP) and Machine Learning (ML) techniques provide promising solutions for a variety of information extraction tasks, such as identifying patients who have a specific diagnosis, share common characteristics, or show progression of a disease. However, using these ML-extracted data for research introduces unique challenges in assessing validity and generalizability to different cohorts of interest. To enable effective and accurate use of ML-extracted real-world data (RWD) to support research and real-world evidence generation, we propose a research-centric evaluation framework for model developers, users of ML-extracted data, and other RWD stakeholders. This framework covers the fundamentals of evaluating RWD produced using ML methods in order to maximize the use of EHR data for research purposes.
Affiliation(s)
- Melissa Estevez
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
- Corey M. Benedum
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
- Chengsheng Jiang
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
- Aaron B. Cohen
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
  - Department of Medicine, NYU Grossman School of Medicine, New York, NY 10016, USA
- Sharang Phadke
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
- Somnath Sarkar
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
- Selen Bozkurt
  - Flatiron Health, Inc., 233 Spring Street, New York, NY 10013, USA
5. Casey A, Davidson E, Poon M, Dong H, Duma D, Grivas A, Grover C, Suárez-Paniagua V, Tobin R, Whiteley W, Wu H, Alex B. A systematic review of natural language processing applied to radiology reports. BMC Med Inform Decis Mak 2021;21:179. PMID: 34082729; PMCID: PMC8176715; DOI: 10.1186/s12911-021-01533-7.
Abstract
BACKGROUND: Natural language processing (NLP) has a significant role in advancing healthcare and is key to extracting structured information from radiology reports. Understanding recent developments in the application of NLP to radiology is important, but recent reviews on the topic are limited. This study systematically assesses and quantifies the recent literature on NLP applied to radiology reports.

METHODS: We conducted an automated literature search yielding 4836 results, combining automated filtering, metadata-enrichment steps, and citation search with manual review. Our analysis is based on 21 variables, including radiology characteristics, NLP methodology, performance, and study and clinical application characteristics.

RESULTS: We present a comprehensive analysis of the 164 publications retrieved, with publications in 2019 almost triple those in 2015. Each publication is categorized into one of 6 clinical application categories. Deep learning use increased over the period, but conventional machine learning approaches remained prevalent. Deep learning is still challenged when data are scarce, and there is little evidence of adoption into clinical practice. Although 17% of studies reported F1 scores greater than 0.85, it is hard to evaluate these approaches comparatively given that most use different datasets. Only 14 studies made their data available and 15 their code, with 10 externally validating their results.

CONCLUSIONS: Automated understanding of the clinical narratives in radiology reports has the potential to enhance the healthcare process, and we show that research in this field continues to grow. Reproducibility and explainability of models are important if the domain is to move applications into clinical use. More could be done to share code, enabling validation of methods on data from different institutions, and to reduce heterogeneity in the reporting of study properties, allowing inter-study comparisons. Our results are significant for researchers in the field, providing a systematic synthesis of existing work to build on, identifying gaps and opportunities for collaboration, and helping avoid duplication.
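For readers weighing the F1 scores mentioned above, it helps to recall how the metric is computed: because it ignores true negatives entirely, it shifts with a dataset's class balance, which is part of why scores from different datasets are not directly comparable. The counts in this sketch are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: share of predicted positives that are correct.
    # Recall: share of actual positives that are found.
    # F1: their harmonic mean; true negatives never enter, so the score
    # moves with class balance across datasets.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 85 true positives, 10 false positives, 15 misses.
p, r, f1 = precision_recall_f1(85, 10, 15)
```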
Affiliation(s)
- Arlene Casey
  - School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland
- Emma Davidson
  - Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland
- Michael Poon
  - Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland
- Hang Dong
  - Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, Scotland
  - Health Data Research UK, London, UK
- Daniel Duma
  - School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland
- Andreas Grivas
  - Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
- Claire Grover
  - Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
- Víctor Suárez-Paniagua
  - Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, Scotland
  - Health Data Research UK, London, UK
- Richard Tobin
  - Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
- William Whiteley
  - Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Honghan Wu
  - Health Data Research UK, London, UK
  - Institute of Health Informatics, University College London, London, UK
- Beatrice Alex
  - School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland
  - Edinburgh Futures Institute, University of Edinburgh, Edinburgh, Scotland