1
Kehl KL, Jee J, Pichotta K, Paul MA, Trukhanov P, Fong C, Waters M, Bakouny Z, Xu W, Choueiri TK, Nichols C, Schrag D, Schultz N. Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research. Nat Commun 2024; 15:9787. [PMID: 39532885 PMCID: PMC11557593 DOI: 10.1038/s41467-024-54071-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Accepted: 10/31/2024] [Indexed: 11/16/2024] Open
Abstract
Databases that link molecular data to clinical outcomes can inform precision cancer research into novel prognostic and predictive biomarkers. However, outside of clinical trials, cancer outcomes are typically recorded only in text form within electronic health records (EHRs). Artificial intelligence (AI) models have been trained to extract outcomes from individual EHRs. However, patient privacy restrictions have historically precluded dissemination of these models beyond the centers at which they were trained. In this study, the vulnerability of text classification models trained directly on protected health information to membership inference attacks is confirmed. A teacher-student distillation approach is applied to develop shareable models for annotating outcomes from imaging reports and medical oncologist notes. 'Teacher' models trained on EHR data from Dana-Farber Cancer Institute (DFCI) are used to label imaging reports and discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. 'Student' models are trained to use these MIMIC documents to predict the labels assigned by teacher models and sent to Memorial Sloan Kettering (MSK) for evaluation. The student models exhibit high discrimination across outcomes in both the DFCI and MSK test sets. Leveraging private labeling of public datasets to distill publishable clinical AI models from academic centers could facilitate deployment of machine learning to accelerate precision oncology research.
Affiliation(s)
- Kenneth L Kehl
- Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
- Justin Jee
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Karl Pichotta
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Morgan A Paul
- Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
- Pavel Trukhanov
- Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
- Christopher Fong
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Michele Waters
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Ziad Bakouny
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Wenxin Xu
- Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
- Toni K Choueiri
- Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
- Chelsea Nichols
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Deborah Schrag
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
- Nikolaus Schultz
- Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, USA
2
Tavabi N, Pruneski J, Golchin S, Singh M, Sanborn R, Heyworth B, Landschaft A, Kimia A, Kiapour A. Building large-scale registries from unstructured clinical notes using a low-resource natural language processing pipeline. Artif Intell Med 2024; 151:102847. [PMID: 38658131 DOI: 10.1016/j.artmed.2024.102847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 02/06/2024] [Accepted: 03/19/2024] [Indexed: 04/26/2024]
Abstract
Building clinical registries is an important step in clinical research and improvement of patient care quality. Natural Language Processing (NLP) methods have shown promising results in extracting valuable information from unstructured clinical notes. However, the structure and nature of clinical notes are very different from the regular text that state-of-the-art NLP models are trained and tested on, and they have their own set of challenges. In this study, we propose Sentence Extractor with Keywords (SE-K), an efficient and interpretable classification approach for extracting information from clinical notes, and show that it outperforms more computationally expensive methods in text classification. Following Institutional Review Board (IRB) approval, we used SE-K and two embedding-based NLP approaches (Sentence Extractor with Embeddings (SE-E) and Bidirectional Encoder Representations from Transformers (BERT)) to develop a comprehensive registry of anterior cruciate ligament surgeries from 20 years of unstructured clinical data at a multi-site tertiary-care regional children's hospital. The low-resource approach (SE-K) had better performance (average AUROC of 0.94 ± 0.04) than the embedding-based approaches (SE-E: 0.93 ± 0.04 and BERT: 0.87 ± 0.09) for out-of-sample validation, in addition to a minimal performance drop between test and out-of-sample validation. Moreover, the SE-K approach was at least six times faster (on CPU) than SE-E (on CPU) and BERT (on GPU) and provides interpretability. Our proposed approach, SE-K, can be effectively used to extract relevant variables from clinical notes to build large-scale registries, with consistently better performance compared to the more resource-intensive approaches (e.g., BERT). Such approaches can facilitate information extraction from unstructured notes for registry building, quality improvement and adverse event monitoring.
Affiliation(s)
- Nazgol Tavabi
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- James Pruneski
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Shahriar Golchin
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA
- Mallika Singh
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA
- Ryan Sanborn
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA
- Benton Heyworth
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Assaf Landschaft
- Division of Emergency Medicine, Boston Children's Hospital, Boston, MA, USA
- Amir Kimia
- Harvard Medical School, Boston, MA, USA; Division of Emergency Medicine, Boston Children's Hospital, Boston, MA, USA
- Ata Kiapour
- Department of Orthopaedic Surgery and Sports Medicine, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
3
Li W, Gou F, Wu J. Artificial intelligence auxiliary diagnosis and treatment system for breast cancer in developing countries. Journal of X-Ray Science and Technology 2024; 32:395-413. [PMID: 38189731 DOI: 10.3233/xst-230194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
BACKGROUND: In many developing countries, a significant number of breast cancer patients are unable to receive timely treatment due to a large population base, high patient numbers, and limited medical resources. OBJECTIVE: This paper proposes a breast cancer assisted diagnosis system based on electronic medical records. The goal of this system is to address the limitations of existing systems, which primarily rely on structured electronic records and may miss crucial information stored in unstructured records. METHODS: The proposed approach is a breast cancer assisted diagnosis system based on electronic medical records. The system utilizes breast cancer enhanced convolutional neural networks with semantic initialization filters (BC-INIT-CNN). It extracts highly relevant tumor markers from unstructured medical records to aid in breast cancer staging diagnosis and effectively utilizes the important information present in unstructured records. RESULTS: The model's performance is assessed using various evaluation metrics, such as accuracy, ROC curves, and Precision-Recall curves. Comparative analysis demonstrates that the BC-INIT-CNN model outperforms several existing methods in terms of accuracy and computational efficiency. CONCLUSIONS: The proposed breast cancer assisted diagnosis system based on BC-INIT-CNN showcases the potential to address the challenges faced by developing countries in providing timely treatment to breast cancer patients. By leveraging unstructured medical records and extracting relevant tumor markers, the system enables accurate staging diagnosis and enhances the utilization of valuable information.
Affiliation(s)
- Wenxiu Li
- State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China
- Fangfang Gou
- State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China
- Jia Wu
- State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China
- Research Center for Artificial Intelligence, Monash University, Melbourne, Clayton VIC, Australia
4
Mummaneni PV, Bydon M. Clinical Databases in Spine Surgery: Strength in Numbers. Neurosurgery 2023; 93:1-3. [PMID: 37318222 DOI: 10.1227/neu.0000000000002465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023] Open
Affiliation(s)
- Praveen V Mummaneni
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, California, USA
- Mohamad Bydon
- Department of Neurologic Surgery, Mayo Clinic, Rochester, Minnesota, USA
5
Sousa S, Kern R. How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10204-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Deep learning (DL) models for natural language processing (NLP) tasks often handle private data, demanding protection against breaches and disclosures. Data protection laws, such as the European Union’s General Data Protection Regulation (GDPR), thereby enforce the need for privacy. Although many privacy-preserving NLP methods have been proposed in recent years, no categories to organize them have been introduced yet, making it hard to follow the progress of the literature. To close this gap, this article systematically reviews over sixty DL methods for privacy-preserving NLP published between 2016 and 2020, covering theoretical foundations, privacy-enhancing technologies, and analysis of their suitability for real-world scenarios. First, we introduce a novel taxonomy for classifying the existing methods into three categories: data safeguarding methods, trusted methods, and verification methods. Second, we present an extensive summary of privacy threats, datasets for applications, and metrics for privacy evaluation. Third, throughout the review, we describe privacy issues in the NLP pipeline in a holistic view. Further, we discuss open challenges in privacy-preserving NLP regarding data traceability, computation overhead, dataset size, the prevalence of human biases in embeddings, and the privacy-utility tradeoff. Finally, this review presents future research directions to guide successive research and development of privacy-preserving NLP models.