1
Kenney RC, Chen X, Shintani K, Gagnon C, Liu J, DaCosta Byfield S, Ochs L, Currie AM. Validation of Non-Small Cell Lung Cancer Clinical Insights Using a Generalized Oncology Natural Language Processing Model. JCO Clin Cancer Inform 2024; 8:e2300099. [PMID: 39230200 DOI: 10.1200/cci.23.00099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Revised: 04/02/2024] [Accepted: 06/18/2024] [Indexed: 09/05/2024] Open
Abstract
PURPOSE Limited studies have used natural language processing (NLP) in the context of non-small cell lung cancer (NSCLC). This study aimed to validate the application of an NLP model to an NSCLC cohort by extracting NSCLC concepts from free-text medical notes and converting them to structured, interpretable data. METHODS Patients with a lung neoplasm, NSCLC histology, and treatment information in their notes were selected from a repository of over 27 million patients. From these, 200 were randomly selected for this study with the longest and the most recent note included for each patient. An NLP model developed and validated on a large solid and blood cancer oncology cohort was applied to this NSCLC cohort. Two certified tumor registrars and a curator abstracted concepts from the notes: neoplasm, histology, stage, TNM values, and metastasis sites. This manually abstracted gold standard was compared with the NLP model output. Precision and recall scores were calculated. RESULTS The NLP model extracted the NSCLC concepts with excellent precision and recall with the following scores, respectively: Lung neoplasm 100% and 100%, NSCLC histology 99% and 88%, histology correctly linked to neoplasm 98% and 79%, stage value 98.8% and 92%, stage TNM value 93% and 98%, and metastasis site 97% and 89%. High precision is related to a low number of false positives, and therefore, extracted concepts are likely accurate. High recall indicates that the model captured most of the desired concepts. CONCLUSION This study validates that Optum's oncology NLP model has high precision and recall with clinical real-world data and is a reliable model to support research studies and clinical trials. This validation study shows that our nonspecific solid tumor and blood cancer oncology model is generalizable to successfully extract clinical information from specific cancer cohorts.
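The comparison of a manually abstracted gold standard with model output can be sketched as a set comparison. This is a minimal illustration, not the study's evaluation code: the tuple layout `(patient_id, concept, value)` and the toy records are assumptions made for the example.

```python
def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Compare extracted concepts with a gold standard.

    Both arguments are sets of hashable records, here hypothetical
    (patient_id, concept, value) tuples.
    """
    true_pos = len(gold & predicted)    # extracted and in the gold standard
    false_pos = len(predicted - gold)   # extracted but not in the gold standard
    false_neg = len(gold - predicted)   # in the gold standard but missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, recall


# Toy data for illustration only (not taken from the study).
gold = {("pt1", "stage", "IIIA"), ("pt2", "stage", "IV"), ("pt3", "stage", "IB")}
predicted = {("pt1", "stage", "IIIA"), ("pt2", "stage", "IV"), ("pt4", "stage", "II")}
precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Few false positives push precision toward 1, and few missed concepts push recall toward 1, which is the interpretation the abstract gives.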
Affiliation(s)
- Rachel C Kenney
  - Optum Insight, Optum, Eden Prairie, MN
  - Departments of Neurology and Population Health, New York University Grossman School of Medicine, New York, NY
- John Liu
  - Optum Insight, Optum, Eden Prairie, MN
2
Wi S, Goldhoff PE, Fuller LA, Grewal K, Wentzensen N, Clarke MA, Lorey TS. Using Natural Language Processing to Improve Discrete Data Capture From Interpretive Cervical Biopsy Diagnoses at a Large Health Care Organization. Arch Pathol Lab Med 2023; 147:222-226. [PMID: 35390126 DOI: 10.5858/arpa.2021-0410-oa] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/15/2021] [Indexed: 02/05/2023]
Abstract
CONTEXT.— The terminology used by pathologists to describe and grade dysplasia and premalignant changes of the cervical epithelium has evolved over time. Unfortunately, coexistence of different classification systems combined with nonstandardized interpretive text has created multiple layers of interpretive ambiguity. OBJECTIVE.— To use natural language processing (NLP) to automate and expedite translation of interpretive text to a single most severe, and thus actionable, cervical intraepithelial neoplasia (CIN) diagnosis. DESIGN.— We developed and applied NLP algorithms to 35 847 unstructured cervical pathology reports and assessed NLP performance in identifying the most severe diagnosis, compared to expert manual review. NLP performance was determined by calculating precision, recall, and F score. RESULTS.— The NLP algorithms yielded a precision of 0.957, a recall of 0.925, and an F score of 0.94. Additionally, we estimated that the time to evaluate each monthly biopsy file was significantly reduced, from 30 hours to 0.5 hours. CONCLUSIONS.— A set of validated NLP algorithms applied to pathology reports can rapidly and efficiently assign a discrete, actionable diagnosis using CIN classification to assist with clinical management of cervical pathology and disease. Moreover, discrete diagnostic data encoded as CIN terminology can enhance the efficiency of clinical research.
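The reported F score can be checked against the reported precision and recall. The abstract does not say which F measure was used; the sketch below assumes the balanced F1 (harmonic mean), with `beta` exposed only as an illustrative parameter.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
print(round(f_score(0.957, 0.925), 2))  # 0.94
```

Under that assumption the result rounds to the 0.94 stated in the abstract.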
Affiliation(s)
- Soora Wi
  - From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Patricia E Goldhoff
  - From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Laurie A Fuller
  - From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Kiranjit Grewal
  - From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Nicolas Wentzensen
  - From the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland (Wentzensen, Clarke)
- Megan A Clarke
  - From the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland (Wentzensen, Clarke)
- Thomas S Lorey
  - From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
3
Wang L, Fu S, Wen A, Ruan X, He H, Liu S, Moon S, Mai M, Riaz IB, Wang N, Yang P, Xu H, Warner JL, Liu H. Assessment of Electronic Health Record for Cancer Research and Patient Care Through a Scoping Review of Cancer Natural Language Processing. JCO Clin Cancer Inform 2022; 6:e2200006. [PMID: 35917480 PMCID: PMC9470142 DOI: 10.1200/cci.22.00006] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 01/19/2022] [Revised: 03/18/2022] [Accepted: 06/15/2022] [Indexed: 11/20/2022] Open
Abstract
PURPOSE The advancement of natural language processing (NLP) has promoted the use of detailed textual data in electronic health records (EHRs) to support cancer research and to facilitate patient care. In this review, we aim to assess EHR for cancer research and patient care by using the Minimal Common Oncology Data Elements (mCODE), which is a community-driven effort to define a minimal set of data elements for cancer research and practice. Specifically, we aim to assess the alignment of NLP-extracted data elements with mCODE and review existing NLP methodologies for extracting said data elements. METHODS Published literature studies were searched to retrieve cancer-related NLP articles that were written in English and published between January 2010 and September 2020 from main literature databases. After the retrieval, articles with EHRs as the data source were manually identified. A charting form was developed for relevant study analysis and used to categorize data including four main topics: metadata, EHR data and targeted cancer types, NLP methodology, and oncology data elements and standards. RESULTS A total of 123 publications were ultimately selected and included in our analysis. We found that cancer research and patient care require some data elements beyond mCODE, as expected. Transparency and reproducibility are not sufficient in NLP methods, and inconsistency in NLP evaluation exists. CONCLUSION We conducted a comprehensive review of cancer NLP for research and patient care using EHR data. Issues and barriers for wide adoption of cancer NLP were identified and discussed.
Affiliation(s)
- Liwei Wang
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sunyang Fu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Andrew Wen
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Xiaoyang Ruan
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Huan He
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sijia Liu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sungrim Moon
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Michelle Mai
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Irbaz B. Riaz
  - Department of Hematology/Oncology, Mayo Clinic, Scottsdale, AZ
- Nan Wang
  - Department of Computer Science and Engineering, College of Science and Engineering, University of Minnesota, Minneapolis, MN
- Ping Yang
  - Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, AZ
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Jeremy L. Warner
  - Departments of Medicine (Hematology/Oncology), Vanderbilt University, Nashville, TN
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Hongfang Liu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
4
Zhou S, Wang N, Wang L, Liu H, Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J Am Med Inform Assoc 2022; 29:1208-1216. [PMID: 35333345 PMCID: PMC9196678 DOI: 10.1093/jamia/ocac040] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 12/15/2021] [Revised: 03/06/2022] [Accepted: 03/09/2022] [Indexed: 11/16/2022] Open
Abstract
OBJECTIVE Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models. MATERIALS AND METHODS A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We kept pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task. RESULTS All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873-0.879) and 0.904 (95% CI, 0.902-0.906) for exact match and lenient match, respectively. CONCLUSIONS The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performances of domain specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.
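The difference between exact-match and lenient-match macro F1 can be illustrated with a small sketch. The abstract does not define the matching criteria, so the convention here is an assumption: exact match requires identical span boundaries and label, lenient match requires the same label and any character overlap. The span layout `(start, end, label)` and the toy labels are hypothetical.

```python
Span = tuple[int, int, str]  # (start, end_exclusive, label); hypothetical layout


def _matches(a: Span, b: Span, lenient: bool) -> bool:
    """Same label, and either overlapping (lenient) or identical (exact) offsets."""
    if a[2] != b[2]:
        return False
    if lenient:
        return a[0] < b[1] and b[0] < a[1]
    return a[0] == b[0] and a[1] == b[1]


def macro_f1(gold: list[Span], pred: list[Span], lenient: bool = False) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted({s[2] for s in gold} | {s[2] for s in pred})
    scores = []
    for label in labels:
        g = [s for s in gold if s[2] == label]
        p = [s for s in pred if s[2] == label]
        # Predicted spans that hit a gold span drive precision;
        # gold spans that were hit by a prediction drive recall.
        hit_pred = sum(any(_matches(x, y, lenient) for y in g) for x in p)
        hit_gold = sum(any(_matches(y, x, lenient) for x in p) for y in g)
        precision = hit_pred / len(p) if p else 0.0
        recall = hit_gold / len(g) if g else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy spans for illustration only: the "site" prediction is off by two characters.
gold = [(0, 5, "grade"), (10, 20, "site")]
pred = [(0, 5, "grade"), (12, 20, "site")]
print(macro_f1(gold, pred))                # 0.5
print(macro_f1(gold, pred, lenient=True))  # 1.0
```

A boundary error counts as a miss under exact matching but as a hit under lenient matching, which is why lenient scores are typically at least as high as exact scores.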
Affiliation(s)
- Sicheng Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Nan Wang
  - School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA
- Liwei Wang
  - Department of AI and Informatics Research, Mayo Clinic, Rochester, Minnesota, USA
- Hongfang Liu
  - Department of AI and Informatics Research, Mayo Clinic, Rochester, Minnesota, USA
- Rui Zhang
  - Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
  - Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
5
Yoo S, Yoon E, Boo D, Kim B, Kim S, Paeng JC, Yoo IR, Choi IY, Kim K, Ryoo HG, Lee SJ, Song E, Joo YH, Kim J, Lee HY. Transforming Thyroid Cancer Diagnosis and Staging Information from Unstructured Reports to the Observational Medical Outcome Partnership Common Data Model. Appl Clin Inform 2022; 13:521-531. [PMID: 35705182 PMCID: PMC9200482 DOI: 10.1055/s-0042-1748144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022] Open
Abstract
BACKGROUND Cancer staging information is an essential component of cancer research. However, the information is primarily stored as either a full or semistructured free-text clinical document which is limiting the data use. By transforming the cancer-specific data to the Observational Medical Outcome Partnership Common Data Model (OMOP CDM), the information can contribute to establish multicenter observational cancer studies. To the best of our knowledge, there have been no studies on OMOP CDM transformation and natural language processing (NLP) for thyroid cancer to date. OBJECTIVE We aimed to demonstrate the applicability of the OMOP CDM oncology extension module for thyroid cancer diagnosis and cancer stage information by processing free-text medical reports. METHODS Thyroid cancer diagnosis and stage-related modifiers were extracted with rule-based NLP from 63,795 thyroid cancer pathology reports and 56,239 Iodine whole-body scan reports from three medical institutions in the Observational Health Data Sciences and Informatics data network. The data were converted into the OMOP CDM v6.0 according to the OMOP CDM oncology extension module. The cancer staging group was derived and populated using the transformed CDM data. RESULTS The extracted thyroid cancer data were completely converted into the OMOP CDM. The distributions of histopathological types of thyroid cancer were approximately 95.3 to 98.8% of papillary carcinoma, 0.9 to 3.7% of follicular carcinoma, 0.04 to 0.54% of adenocarcinoma, 0.17 to 0.81% of medullary carcinoma, and 0 to 0.3% of anaplastic carcinoma. Regarding cancer staging, stage-I thyroid cancer accounted for 55 to 64% of the cases, while stage III accounted for 24 to 26% of the cases. Stage-II and -IV thyroid cancers were detected at a low rate of 2 to 6%. 
CONCLUSION As a first study on OMOP CDM transformation and NLP for thyroid cancer, this study will help other institutions to standardize thyroid cancer-specific data for retrospective observational research and participate in multicenter studies.
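Rule-based extraction of stage-related modifiers from free text is often done with pattern matching. The sketch below is a hypothetical illustration of that general technique, not the rules used in the study: the pattern `TNM_PATTERN`, the function `extract_tnm`, and the sample sentences are all assumptions, and the pattern covers only a simplified subset of TNM notation.

```python
import re

# Hypothetical, simplified TNM pattern: optional c/p/y prefix, then T, N and
# an optional M category (for example "pT1aN0M0" or "T3 N1b").
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[cpy])?"
    r"T(?P<t>[0-4][ab]?|X)\s*"
    r"N(?P<n>[0-3][ab]?|X)"
    r"(?:\s*M(?P<m>[01X]))?\b"
)


def extract_tnm(text: str) -> list[dict]:
    """Return one dict per TNM expression found in the text."""
    found = []
    for match in TNM_PATTERN.finditer(text):
        found.append(
            {
                "prefix": match.group("prefix") or "",
                "T": match.group("t"),
                "N": match.group("n"),
                "M": match.group("m"),  # None when no M category is written
            }
        )
    return found


# Made-up report sentences for illustration only.
print(extract_tnm("Papillary carcinoma, pT1aN0M0."))
print(extract_tnm("Staging: T3 N1b"))
```

Structured fields like these would still have to be mapped to the target data model's vocabulary, a step this sketch does not attempt.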
Affiliation(s)
- Sooyoung Yoo
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Eunsil Yoon
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Dachung Boo
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Borham Kim
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Seok Kim
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Jin Chul Paeng
  - Department of Nuclear Medicine, Seoul National University, College of Medicine, Seoul, South Korea
- Ie Ryung Yoo
  - Division of Nuclear Medicine, Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, South Korea
- In Young Choi
  - Department of Medical Informatics, The Catholic University of Korea, College of Medicine, Seoul, South Korea
  - Department of Biomedicine and Health Sciences, The Catholic University of Korea, College of Medicine, Seoul, South Korea
- Kwangsoo Kim
  - Transdisciplinary Department of Medicine and Advanced Technology, Seoul National University Hospital, Seoul, South Korea
- Hyun Gee Ryoo
  - Department of Nuclear Medicine, Seoul National University Hospital, Seoul, South Korea
  - Department of Nuclear Medicine, Seoul National University Bundang Hospital, Seongnam, South Korea
- Sun Jung Lee
  - Department of Medical Informatics, The Catholic University of Korea, College of Medicine, Seoul, South Korea
  - Department of Biomedicine and Health Sciences, The Catholic University of Korea, College of Medicine, Seoul, South Korea
- Eunhye Song
  - Department of Data Science Research, Innovative Medical Technology Research Institute, Seoul National University Hospital, Seoul, South Korea
- Young-Hwan Joo
  - Biomedical Research Institute, Seoul National University Hospital, Seoul, South Korea
- Junmo Kim
  - Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, South Korea
- Ho-Young Lee
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
  - Department of Nuclear Medicine, Seoul National University, College of Medicine, Seoul, South Korea
6
Bernstam EV, Shireman PK, Meric‐Bernstam F, Zozus MN, Jiang X, Brimhall BB, Windham AK, Schmidt S, Visweswaran S, Ye Y, Goodrum H, Ling Y, Barapatre S, Becich MJ. Artificial intelligence in clinical and translational science: Successes, challenges and opportunities. Clin Transl Sci 2022; 15:309-321. [PMID: 34706145 PMCID: PMC8841416 DOI: 10.1111/cts.13175] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/10/2021] [Accepted: 10/01/2021] [Indexed: 01/12/2023] Open
Abstract
Artificial intelligence (AI) is transforming many domains, including finance, agriculture, defense, and biomedicine. In this paper, we focus on the role of AI in clinical and translational research (CTR), including preclinical research (T1), clinical research (T2), clinical implementation (T3), and public (or population) health (T4). Given the rapid evolution of AI in CTR, we present three complementary perspectives: (1) scoping literature review, (2) survey, and (3) analysis of federally funded projects. For each CTR phase, we addressed challenges, successes, failures, and opportunities for AI. We surveyed Clinical and Translational Science Award (CTSA) hubs regarding AI projects at their institutions. Nineteen of 63 CTSA hubs (30%) responded to the survey. The most common funding source (48.5%) was the federal government. The most common translational phase was T2 (clinical research, 40.2%). Clinicians were the intended users in 44.6% of projects and researchers in 32.3% of projects. The most common computational approaches were supervised machine learning (38.6%) and deep learning (34.2%). The number of projects steadily increased from 2012 to 2020. Finally, we analyzed 2604 AI projects at CTSA hubs using the National Institutes of Health Research Portfolio Online Reporting Tools (RePORTER) database for 2011-2019. We mapped available abstracts to medical subject headings and found that nervous system (16.3%) and mental disorders (16.2%) were the most common topics addressed. From a computational perspective, big data (32.3%) and deep learning (30.0%) were most common. This work represents a snapshot in time of the role of AI in the CTSA program.
Affiliation(s)
- Elmer V. Bernstam
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
  - Division of General Internal Medicine, Department of Internal Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Paula K. Shireman
  - Departments of Surgery and Microbiology, Immunology & Molecular Genetics, University of Texas Health San Antonio, San Antonio, Texas, USA
  - University Health, San Antonio, Texas, USA
  - South Texas Veterans Health Care System, San Antonio, Texas, USA
- Funda Meric‐Bernstam
  - Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Meredith N. Zozus
  - Division of Clinical Research Informatics, Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, Texas, USA
- Xiaoqian Jiang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Bradley B. Brimhall
  - University Health, San Antonio, Texas, USA
  - Department of Pathology, University of Texas Health San Antonio, San Antonio, Texas, USA
- Ashley K. Windham
  - University Health, San Antonio, Texas, USA
  - Department of Pathology, University of Texas Health San Antonio, San Antonio, Texas, USA
- Susanne Schmidt
  - Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, Texas, USA
- Shyam Visweswaran
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA
- Ye Ye
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA
- Heath Goodrum
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yaobin Ling
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Seemran Barapatre
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA
- Michael J. Becich
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA
7
Zhan X, Long H, Gou F, Duan X, Kong G, Wu J. A Convolutional Neural Network-Based Intelligent Medical System with Sensors for Assistive Diagnosis and Decision-Making in Non-Small Cell Lung Cancer. SENSORS 2021; 21:s21237996. [PMID: 34884000 PMCID: PMC8659811 DOI: 10.3390/s21237996] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 10/25/2021] [Revised: 11/26/2021] [Accepted: 11/28/2021] [Indexed: 12/15/2022]
Abstract
In many regions of the world, early diagnosis of non-small cell lung cancer (NSCLC) is a major challenge due to the large population and lack of medical resources, which is difficult to effectively address via limited physician manpower alone. Therefore, we developed a convolutional neural network (CNN)-based assisted diagnosis and decision-making intelligent medical system with sensors. This system analyzes NSCLC patients' medical records using sensors to assist staging a diagnosis and provides recommended treatment plans to physicians. To address the problem of unbalanced case samples across pathological stages, we used transfer learning and dynamic sampling techniques to reconstruct and iteratively train the model to improve the accuracy of the prediction system. In this paper, all data for training and testing the system were obtained from the medical records of 2,789,675 patients with NSCLC, which were recorded in three hospitals in China over a five-year period. When the number of case samples reached 8000, the system achieved an accuracy rate of 0.84, which is already close to that of the doctors (accuracy: 0.86). The experimental results proved that the system can quickly and accurately analyze patient data and provide decision information support for physicians.
Affiliation(s)
- Xiangbing Zhan
  - State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
- Huiyun Long
  - State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  - Correspondence: (H.L.); (J.W.)
- Fangfang Gou
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xun Duan
  - State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
- Guangqian Kong
  - State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
- Jia Wu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Research Center for Artificial Intelligence, Monash University, Clayton, VIC 3800, Australia
  - Correspondence: (H.L.); (J.W.)
8
Gianfrancesco MA, Goldstein ND. A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol 2021; 21:234. [PMID: 34706667 PMCID: PMC8549408 DOI: 10.1186/s12874-021-01416-5] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 07/02/2021] [Accepted: 09/28/2021] [Indexed: 11/10/2022] Open
Abstract
Electronic health records (EHRs) are widely used in epidemiological research, but the validity of the results is dependent upon the assumptions made about the healthcare system, the patient, and the provider. In this review, we identify four overarching challenges in using EHR-based data for epidemiological analysis, with a particular emphasis on threats to validity. These challenges include representativeness of the EHR to a target population, the availability and interpretability of clinical and non-clinical data, and missing data at both the variable and observation levels. Each challenge reveals layers of assumptions that the epidemiologist is required to make, from the point of patient entry into the healthcare system, to the provider documenting the results of the clinical exam and follow-up of the patient longitudinally; all with the potential to bias the results of analysis of these data. Understanding the extent of as well as remediating potential biases requires a variety of methodological approaches, from traditional sensitivity analyses and validation studies, to newer techniques such as natural language processing. Beyond methods to address these challenges, it will remain crucial for epidemiologists to engage with clinicians and informaticians at their institutions to ensure data quality and accessibility by forming multidisciplinary teams around specific research projects.
Affiliation(s)
- Milena A Gianfrancesco
  - Division of Rheumatology, University of California School of Medicine, San Francisco, CA, USA
- Neal D Goldstein
  - Department of Epidemiology and Biostatistics, Drexel University Dornsife School of Public Health, 3215 Market St., Philadelphia, PA, 19104, USA
9
Bitterman DS, Miller TA, Mak RH, Savova GK. Clinical Natural Language Processing for Radiation Oncology: A Review and Practical Primer. Int J Radiat Oncol Biol Phys 2021; 110:641-655. [PMID: 33545300 DOI: 10.1016/j.ijrobp.2021.01.044] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/25/2020] [Revised: 12/22/2020] [Accepted: 01/23/2021] [Indexed: 02/07/2023]
Abstract
Natural language processing (NLP), which aims to convert human language into expressions that can be analyzed by computers, is one of the most rapidly developing and widely used technologies in the field of artificial intelligence. Natural language processing algorithms convert unstructured free text data into structured data that can be extracted and analyzed at scale. In medicine, this unlocking of the rich, expressive data within clinical free text in electronic medical records will help untap the full potential of big data for research and clinical purposes. Recent major NLP algorithmic advances have significantly improved the performance of these algorithms, leading to a surge in academic and industry interest in developing tools to automate information extraction and phenotyping from clinical texts. Thus, these technologies are poised to transform medical research and alter clinical practices in the future. Radiation oncology stands to benefit from NLP algorithms if they are appropriately developed and deployed, as they may enable advances such as automated inclusion of radiation therapy details into cancer registries, discovery of novel insights about cancer care, and improved patient data curation and presentation at the point of care. However, challenges remain before the full value of NLP is realized, such as the plethora of jargon specific to radiation oncology, nonstandard nomenclature, a lack of publicly available labeled data for model development, and interoperability limitations between radiation oncology data silos. Successful development and implementation of high quality and high value NLP models for radiation oncology will require close collaboration between computer scientists and the radiation oncology community. 
Here, we present a primer on artificial intelligence algorithms in general and NLP algorithms in particular; provide guidance on how to assess the performance of such algorithms; review prior research on NLP algorithms for oncology; and describe future avenues for NLP in radiation oncology research and clinics.
Affiliation(s)
- Danielle S Bitterman
  - Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, Massachusetts
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
  - Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, Massachusetts
- Timothy A Miller
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
- Raymond H Mak
  - Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, Massachusetts
  - Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, Massachusetts
- Guergana K Savova
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
10
Mensa E, Colla D, Dalmasso M, Giustini M, Mamo C, Pitidis A, Radicioni DP. Violence detection explanation via semantic roles embeddings. BMC Med Inform Decis Mak 2020; 20:263. [PMID: 33059690 PMCID: PMC7559980 DOI: 10.1186/s12911-020-01237-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/11/2020] [Accepted: 09/02/2020] [Indexed: 11/22/2022] Open
Abstract
Background Emergency room reports pose specific challenges to natural language processing techniques. In this setting, violence episodes on women, elderly and children are often under-reported. Categorizing textual descriptions as containing violence-related injuries (V) vs. non-violence-related injuries (NV) is thus a relevant task to the ends of devising alerting mechanisms to track (and prevent) violence episodes. Methods We present ViDeS (so dubbed after Violence Detection System), a system to detect episodes of violence from narrative texts in emergency room reports. It employs a deep neural network for categorizing textual ER reports data, and complements such output by making explicit which elements corroborate the interpretation of the record as reporting about violence-related injuries. To these ends we designed a novel hybrid technique for filling semantic frames that employs distributed representations of terms herein, along with syntactic and semantic information. The system has been validated on real data annotated with two sorts of information: about the presence vs. absence of violence-related injuries, and about some semantic roles that can be interpreted as major cues for violent episodes, such as the agent that committed violence, the victim, the body district involved, etc. The employed dataset contains over 150K records annotated with class (V, NV) information, and 200 records with finer-grained information on the aforementioned semantic roles. Results We used data coming from an Italian branch of the EU-Injury Database (EU-IDB) project, compiled by hospital staff. Categorization figures approach full precision and recall for negative cases and .97 precision and .94 recall on positive cases. As regards the recognition of semantic roles, we recorded an accuracy varying from .28 to .90 according to the semantic roles involved. Moreover, the system allowed unveiling annotation errors committed by hospital staff.
Conclusions Explaining systems’ results, so as to make their output more comprehensible and convincing, is today necessary for AI systems. Our proposal is to combine distributed and symbolic (frame-like) representations as a possible answer to such a pressing request for interpretability. Although presently focused on the medical domain, the proposed methodology is general and, in principle, it can be extended to further application areas and categorization tasks.
Affiliation(s)
- Enrico Mensa
  - Department of Computer Science, University of Turin, Corso Svizzera 185, Turin, 10149, Italy
- Davide Colla
  - Department of Computer Science, University of Turin, Corso Svizzera 185, Turin, 10149, Italy
- Marco Dalmasso
  - Servizio sovrazonale di Epidemiologia dell'ASL TO3 della Regione Piemonte, Via Sabaudia 164, Grugliasco (TO), 10095, Italy
- Marco Giustini
  - Reparto Epidemiologia ambientale e sociale Dipartimento Ambiente e Salute (DAMSA) Istituto Superiore di Sanità, Viale Regina Elena, 299, Roma, 00161, Italy
- Carlo Mamo
  - Servizio sovrazonale di Epidemiologia dell'ASL TO3 della Regione Piemonte, Via Sabaudia 164, Grugliasco (TO), 10095, Italy
- Alessio Pitidis
  - Reparto Epidemiologia ambientale e sociale Dipartimento Ambiente e Salute (DAMSA) Istituto Superiore di Sanità, Viale Regina Elena, 299, Roma, 00161, Italy
  - Data Analysis Services, B2C Innovation Inc. - Digital Services, Corso Magenta 69/A, Milan, PO Box 20123, Italy
- Daniele P Radicioni
  - Department of Computer Science, University of Turin, Corso Svizzera 185, Turin, 10149, Italy
11
Wang Y, Xu H, Uzuner O. Editorial: The second international workshop on health natural language processing (HealthNLP 2019). BMC Med Inform Decis Mak 2019; 19:233. [PMID: 31801516 PMCID: PMC6894102 DOI: 10.1186/s12911-019-0930-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022] Open
Affiliation(s)
- Yanshan Wang
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ozlem Uzuner
  - Information Sciences and Technology, George Mason University, Fairfax, VA, USA