1
Diab KM, Deng J, Wu Y, Yesha Y, Collado-Mesa F, Nguyen P. Natural Language Processing for Breast Imaging: A Systematic Review. Diagnostics (Basel) 2023; 13:diagnostics13081420. [PMID: 37189521 DOI: 10.3390/diagnostics13081420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 04/05/2023] [Accepted: 04/11/2023] [Indexed: 05/17/2023]
Abstract
Natural Language Processing (NLP) has gained prominence in diagnostic radiology, offering a promising tool for improving breast imaging triage, diagnosis, lesion characterization, and treatment management in breast cancer and other breast diseases. This review provides a comprehensive overview of recent advances in NLP for breast imaging, covering the main techniques and applications in this field. Specifically, we discuss various NLP methods used to extract relevant information from clinical notes, radiology reports, and pathology reports and their potential impact on the accuracy and efficiency of breast imaging. In addition, we reviewed the state-of-the-art in NLP-based decision support systems for breast imaging, highlighting the challenges and opportunities of NLP applications for breast imaging in the future. Overall, this review underscores the potential of NLP in enhancing breast imaging care and offers insights for clinicians and researchers interested in this exciting and rapidly evolving field.
Affiliation(s)
- Kareem Mahmoud Diab
  - Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
- Jamie Deng
  - Department of Computer Science, University of Miami, Miami, FL 33146, USA
- Yusen Wu
  - Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
- Yelena Yesha
  - Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
  - Department of Computer Science, University of Miami, Miami, FL 33146, USA
  - Department of Radiology, Miller School of Medicine, University of Miami, Miami, FL 33146, USA
- Fernando Collado-Mesa
  - Department of Radiology, Miller School of Medicine, University of Miami, Miami, FL 33146, USA
- Phuong Nguyen
  - Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
  - Department of Computer Science, University of Miami, Miami, FL 33146, USA
  - OpenKnect Inc., Halethorpe, MD 21227, USA
2
Schiappa R, Contu S, Culie D, Thamphya B, Chateau Y, Gal J, Bailleux C, Haudebourg J, Ferrero JM, Barranger E, Chamorey E. RUBY: Natural Language Processing of French Electronic Medical Records for Breast Cancer Research. JCO Clin Cancer Inform 2022; 6:e2100199. [PMID: 35960900 PMCID: PMC9470144 DOI: 10.1200/cci.21.00199] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/13/2021] [Revised: 05/06/2022] [Accepted: 07/08/2022] [Indexed: 11/20/2022]
Abstract
PURPOSE Electronic medical records are a valuable source of information about patients' clinical status but are often free-text documents that require laborious manual review to be exploited. Techniques from computer science have been investigated, but the literature has marginally focused on non-English language texts. We developed RUBY, a tool designed in collaboration with IBM-France to automatically structure clinical information from French medical records of patients with breast cancer. MATERIALS AND METHODS RUBY, which exploits state-of-the-art Named Entity Recognition models combined with keyword extraction and postprocessing rules, was applied on clinical texts. We investigated the precision of RUBY in extracting the target information. RESULTS RUBY has an average precision of 92.8% for the Surgery report, 92.7% for the Pathology report, 98.1% for the Biopsy report, and 81.8% for the Consultation report. CONCLUSION These results show that the automatic approach has the potential to effectively extract clinical knowledge from an extensive set of electronic medical records, reducing the manual effort required and saving a significant amount of time. A deeper semantic analysis and further understanding of the context in the text, as well as training on a larger and more recent set of reports, including those containing highly variable entities and the use of ontologies, could further improve the results.
Affiliation(s)
- Renaud Schiappa
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Sara Contu
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Dorian Culie
  - Cervico-facial Oncology Surgical Department, University Institute of Face and Neck, University of Côte d'Azur, Nice, France
- Brice Thamphya
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Yann Chateau
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Jocelyn Gal
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Caroline Bailleux
  - Department of Medical Oncology, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Juliette Haudebourg
  - Anatomy and Pathological Cytology Laboratory, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Jean-Marc Ferrero
  - Anatomy and Pathological Cytology Laboratory, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Emmanuel Barranger
  - Department of Medical Oncology, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Emmanuel Chamorey
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
3
Wang L, Fu S, Wen A, Ruan X, He H, Liu S, Moon S, Mai M, Riaz IB, Wang N, Yang P, Xu H, Warner JL, Liu H. Assessment of Electronic Health Record for Cancer Research and Patient Care Through a Scoping Review of Cancer Natural Language Processing. JCO Clin Cancer Inform 2022; 6:e2200006. [PMID: 35917480 PMCID: PMC9470142 DOI: 10.1200/cci.22.00006] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 01/19/2022] [Revised: 03/18/2022] [Accepted: 06/15/2022] [Indexed: 11/20/2022]
Abstract
PURPOSE The advancement of natural language processing (NLP) has promoted the use of detailed textual data in electronic health records (EHRs) to support cancer research and to facilitate patient care. In this review, we aim to assess EHR for cancer research and patient care by using the Minimal Common Oncology Data Elements (mCODE), which is a community-driven effort to define a minimal set of data elements for cancer research and practice. Specifically, we aim to assess the alignment of NLP-extracted data elements with mCODE and review existing NLP methodologies for extracting said data elements. METHODS Published literature studies were searched to retrieve cancer-related NLP articles that were written in English and published between January 2010 and September 2020 from main literature databases. After the retrieval, articles with EHRs as the data source were manually identified. A charting form was developed for relevant study analysis and used to categorize data including four main topics: metadata, EHR data and targeted cancer types, NLP methodology, and oncology data elements and standards. RESULTS A total of 123 publications were selected finally and included in our analysis. We found that cancer research and patient care require some data elements beyond mCODE as expected. Transparency and reproductivity are not sufficient in NLP methods, and inconsistency in NLP evaluation exists. CONCLUSION We conducted a comprehensive review of cancer NLP for research and patient care using EHRs data. Issues and barriers for wide adoption of cancer NLP were identified and discussed.
Affiliation(s)
- Liwei Wang
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sunyang Fu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Andrew Wen
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Xiaoyang Ruan
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Huan He
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sijia Liu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sungrim Moon
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Michelle Mai
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Irbaz B. Riaz
  - Department of Hematology/Oncology, Mayo Clinic, Scottsdale, AZ
- Nan Wang
  - Department of Computer Science and Engineering, College of Science and Engineering, University of Minnesota, Minneapolis, MN
- Ping Yang
  - Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, AZ
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Jeremy L. Warner
  - Department of Medicine (Hematology/Oncology), Vanderbilt University, Nashville, TN
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Hongfang Liu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
4
Yoo S, Yoon E, Boo D, Kim B, Kim S, Paeng JC, Yoo IR, Choi IY, Kim K, Ryoo HG, Lee SJ, Song E, Joo YH, Kim J, Lee HY. Transforming Thyroid Cancer Diagnosis and Staging Information from Unstructured Reports to the Observational Medical Outcome Partnership Common Data Model. Appl Clin Inform 2022; 13:521-531. [PMID: 35705182 PMCID: PMC9200482 DOI: 10.1055/s-0042-1748144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
BACKGROUND Cancer staging information is an essential component of cancer research. However, the information is primarily stored as either a full or semistructured free-text clinical document which is limiting the data use. By transforming the cancer-specific data to the Observational Medical Outcome Partnership Common Data Model (OMOP CDM), the information can contribute to establish multicenter observational cancer studies. To the best of our knowledge, there have been no studies on OMOP CDM transformation and natural language processing (NLP) for thyroid cancer to date. OBJECTIVE We aimed to demonstrate the applicability of the OMOP CDM oncology extension module for thyroid cancer diagnosis and cancer stage information by processing free-text medical reports. METHODS Thyroid cancer diagnosis and stage-related modifiers were extracted with rule-based NLP from 63,795 thyroid cancer pathology reports and 56,239 Iodine whole-body scan reports from three medical institutions in the Observational Health Data Sciences and Informatics data network. The data were converted into the OMOP CDM v6.0 according to the OMOP CDM oncology extension module. The cancer staging group was derived and populated using the transformed CDM data. RESULTS The extracted thyroid cancer data were completely converted into the OMOP CDM. The distributions of histopathological types of thyroid cancer were approximately 95.3 to 98.8% of papillary carcinoma, 0.9 to 3.7% of follicular carcinoma, 0.04 to 0.54% of adenocarcinoma, 0.17 to 0.81% of medullary carcinoma, and 0 to 0.3% of anaplastic carcinoma. Regarding cancer staging, stage-I thyroid cancer accounted for 55 to 64% of the cases, while stage III accounted for 24 to 26% of the cases. Stage-II and -IV thyroid cancers were detected at a low rate of 2 to 6%. 
CONCLUSION As a first study on OMOP CDM transformation and NLP for thyroid cancer, this study will help other institutions to standardize thyroid cancer-specific data for retrospective observational research and participate in multicenter studies.
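The rule-based extraction step mentioned in the abstract above could look roughly like the following. This is only an illustrative sketch: the abstract does not give the actual rules, so the pattern `TNM_PATTERN` and the helper `extract_tnm` are hypothetical and cover only simple, space-separated English TNM tokens.

```python
import re

# Hypothetical pattern (not from the study): matches tokens such as
# "pT1a", "N0", or "M1" when they stand alone as words.
TNM_PATTERN = re.compile(r"\bp?(T[0-4][ab]?|N[0-3][ab]?|M[01])\b")


def extract_tnm(report_text):
    """Return the first T, N, and M modifier found in a free-text report."""
    found = {}
    for match in TNM_PATTERN.finditer(report_text):
        token = match.group(1)
        # Key by category letter (T, N, or M); keep only the first hit.
        found.setdefault(token[0], token)
    return found


example = "Papillary carcinoma, pT1a N0 M0, no extrathyroidal extension."
print(extract_tnm(example))
```

A production rule set would also need to handle concatenated forms (for example "pT1aN0M0"), negation, and non-English report text before the extracted modifiers could be mapped to a common data model.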
Affiliation(s)
- Sooyoung Yoo
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Eunsil Yoon
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Dachung Boo
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Borham Kim
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Seok Kim
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
- Jin Chul Paeng
  - Department of Nuclear Medicine, Seoul National University, College of Medicine, Seoul, South Korea
- Ie Ryung Yoo
  - Division of Nuclear Medicine, Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, South Korea
- In Young Choi
  - Department of Medical Informatics, The Catholic University of Korea, College of Medicine, Seoul, South Korea
  - Department of Biomedicine and Health Sciences, The Catholic University of Korea, College of Medicine, Seoul, South Korea
- Kwangsoo Kim
  - Transdisciplinary Department of Medicine and Advanced Technology, Seoul National University Hospital, Seoul, South Korea
- Hyun Gee Ryoo
  - Department of Nuclear Medicine, Seoul National University Hospital, Seoul, South Korea
  - Department of Nuclear Medicine, Seoul National University Bundang Hospital, Seongnam, South Korea
- Sun Jung Lee
  - Department of Medical Informatics, The Catholic University of Korea, College of Medicine, Seoul, South Korea
  - Department of Biomedicine and Health Sciences, The Catholic University of Korea, College of Medicine, Seoul, South Korea
- Eunhye Song
  - Department of Data Science Research, Innovative Medical Technology Research Institute, Seoul National University Hospital, Seoul, South Korea
- Young-Hwan Joo
  - Biomedical Research Institute, Seoul National University Hospital, Seoul, South Korea
- Junmo Kim
  - Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, South Korea
- Ho-Young Lee
  - Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
  - Department of Nuclear Medicine, Seoul National University, College of Medicine, Seoul, South Korea
5
Santos T, Tariq A, Gichoya JW, Trivedi H, Banerjee I. Automatic Classification of Cancer Pathology Reports: A Systematic Review. J Pathol Inform 2022; 13:100003. [PMID: 35242443 PMCID: PMC8860734 DOI: 10.1016/j.jpi.2022.100003] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/10/2021] [Accepted: 11/12/2021] [Indexed: 11/30/2022]
Abstract
Pathology reports primarily consist of unstructured free text and thus the clinical information contained in the reports is not trivial to access or query. Multiple natural language processing (NLP) techniques have been proposed to automate the coding of pathology reports via text classification. In this systematic review, we follow the guidelines proposed by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; Page et al., 2020: BMJ.) to identify the NLP systems for classifying pathology reports published between the years of 2010 and 2021. Based on our search criteria, a total of 3445 records were retrieved, and 25 articles met the final review criteria. We benchmarked the systems based on methodology, complexity of the prediction task and core types of NLP models: i) Rule-based and Intelligent systems, ii) statistical machine learning, and iii) deep learning. While certain tasks are well addressed by these models, many others have limitations and remain as open challenges, such as, extraction of many cancer characteristics (size, shape, type of cancer, others) from pathology reports. We investigated the final set of papers (25) and addressed their potential as well as their limitations. We hope that this systematic review helps researchers prioritize the development of innovated approaches to tackle the current limitations and help the advancement of cancer research.
Affiliation(s)
- Thiago Santos
  - Department of Computer Science, Emory University, Atlanta, GA, USA
  - Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
  - Corresponding author.
- Amara Tariq
  - Department of Radiology, Mayo Clinic, Phoenix, AZ, USA
- Judy Wawira Gichoya
  - Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
  - Department of Radiology, Emory School of Medicine, Atlanta, GA, USA
- Hari Trivedi
  - Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
  - Department of Radiology, Emory School of Medicine, Atlanta, GA, USA
- Imon Banerjee
  - Department of Radiology, Mayo Clinic, Phoenix, AZ, USA
  - Department of Computer Engineering, Arizona State University, AZ, USA
6
Wahab N, Miligy IM, Dodd K, Sahota H, Toss M, Lu W, Jahanifar M, Bilal M, Graham S, Park Y, Hadjigeorghiou G, Bhalerao A, Lashen AG, Ibrahim AY, Katayama A, Ebili HO, Parkin M, Sorell T, Raza SEA, Hero E, Eldaly H, Tsang YW, Gopalakrishnan K, Snead D, Rakha E, Rajpoot N, Minhas F. Semantic annotation for computational pathology: multidisciplinary experience and best practice recommendations. J Pathol Clin Res 2022; 8:116-128. [PMID: 35014198 PMCID: PMC8822374 DOI: 10.1002/cjp2.256] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 08/25/2021] [Revised: 11/25/2021] [Accepted: 12/10/2021] [Indexed: 02/06/2023]
Abstract
Recent advances in whole-slide imaging (WSI) technology have led to the development of a myriad of computer vision and artificial intelligence-based diagnostic, prognostic, and predictive algorithms. Computational Pathology (CPath) offers an integrated solution to utilise information embedded in pathology WSIs beyond what can be obtained through visual assessment. For automated analysis of WSIs and validation of machine learning (ML) models, annotations at the slide, tissue, and cellular levels are required. The annotation of important visual constructs in pathology images is an important component of CPath projects. Improper annotations can result in algorithms that are hard to interpret and can potentially produce inaccurate and inconsistent results. Despite the crucial role of annotations in CPath projects, there are no well-defined guidelines or best practices on how annotations should be carried out. In this paper, we address this shortcoming by presenting the experience and best practices acquired during the execution of a large-scale annotation exercise involving a multidisciplinary team of pathologists, ML experts, and researchers as part of the Pathology image data Lake for Analytics, Knowledge and Education (PathLAKE) consortium. We present a real-world case study along with examples of different types of annotations, diagnostic algorithm, annotation data dictionary, and annotation constructs. The analyses reported in this work highlight best practice recommendations that can be used as annotation guidelines over the lifecycle of a CPath project.
Affiliation(s)
- Noorul Wahab
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Islam M Miligy
  - Pathology, University of Nottingham, Nottingham, UK
  - Department of Pathology, Faculty of Medicine, Menoufia University, Shebin El‐Kom, Egypt
- Katherine Dodd
  - Histopathology, University Hospital Coventry and Warwickshire, Coventry, UK
- Harvir Sahota
  - Histopathology, University Hospital Coventry and Warwickshire, Coventry, UK
- Wenqi Lu
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Mohsin Bilal
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Simon Graham
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Young Park
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Abhir Bhalerao
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Ayaka Katayama
  - Graduate School of Medicine, Gunma University, Maebashi, Japan
- Tom Sorell
  - Department of Politics and International Studies, University of Warwick, Coventry, UK
- Emily Hero
  - Histopathology, University Hospital Coventry and Warwickshire, Coventry, UK
  - Leicester Royal Infirmary, Histopathology, University Hospitals Leicester, Leicester, UK
- Hesham Eldaly
  - Histopathology, University Hospital Coventry and Warwickshire, Coventry, UK
- Yee Wah Tsang
  - Histopathology, University Hospital Coventry and Warwickshire, Coventry, UK
- David Snead
  - Histopathology, University Hospital Coventry and Warwickshire, Coventry, UK
- Emad Rakha
  - Pathology, University of Nottingham, Nottingham, UK
- Nasir Rajpoot
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
- Fayyaz Minhas
  - Tissue Image Analytics Centre, University of Warwick, Coventry, UK
7
Torous VF, Simpson RW, Balani JP, Baras AS, Berman MA, Birdsong GG, Giannico GA, Paner GP, Pettus JR, Sessions Z, Sirintrapun SJ, Srigley JR, Spencer S. College of American Pathologists Cancer Protocols: From Optimizing Cancer Patient Care to Facilitating Interoperable Reporting and Downstream Data Use. JCO Clin Cancer Inform 2021; 5:47-55. [PMID: 33439728 PMCID: PMC8140812 DOI: 10.1200/cci.20.00104] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 01/21/2023]
Abstract
The College of American Pathologists Cancer Protocols have offered guidance to pathologists for standard cancer pathology reporting for more than 35 years. The adoption of computer readable versions of these protocols by electronic health record and laboratory information system (LIS) vendors has provided a mechanism for pathologists to report within their LIS workflow, in addition to enabling standardized structured data capture and reporting to downstream consumers of these data such as the cancer surveillance community. This paper reviews the history of the Cancer Protocols and electronic Cancer Checklists, outlines the current use of these critically important cancer case reporting tools, and examines future directions, including plans to help improve the integration of the Cancer Protocols into clinical, public health, research, and other workflows.
Affiliation(s)
- Jyoti P Balani
  - University of Texas Southwestern Medical Center, Dallas, TX
- Michael A Berman
  - Jefferson Hospital, Allegheny Health Network, Jefferson Hills, PA
8
Holtzapple E, Telmer CA, Miskov-Zivanov N. FLUTE: Fast and reliable knowledge retrieval from biomedical literature. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5880277. [PMID: 32761077 PMCID: PMC7408180 DOI: 10.1093/database/baaa056] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/25/2019] [Revised: 05/21/2020] [Accepted: 07/02/2020] [Indexed: 12/14/2022]
Abstract
State-of-the-art machine reading methods extract, in hours, hundreds of thousands of events from the biomedical literature. However, many of the extracted biomolecular interactions are incorrect or not relevant for computational modeling of a system of interest. Therefore, rapid, automated methods are required to filter and select accurate and useful information. The FiLter for Understanding True Events (FLUTE) tool uses public protein interaction databases to filter interactions that have been extracted by machines from databases such as PubMed and score them for accuracy. Confidence in the interactions allows for rapid and accurate model assembly. As our results show, FLUTE can reliably determine the confidence in the biomolecular interactions extracted by fast machine readers and at the same time provide a speedup in interaction filtering by three orders of magnitude. Database URL: https://bitbucket.org/biodesignlab/flute.
Affiliation(s)
- Emilee Holtzapple
  - Department of Computational and Systems Biology, University of Pittsburgh, 3501 Fifth Ave, Pittsburgh, Pennsylvania 15213, USA
- Cheryl A Telmer
  - Molecular Biosensor and Imaging Center, Carnegie Mellon University, 4400 Fifth Ave, Pittsburgh, Pennsylvania 15213, USA
- Natasa Miskov-Zivanov
  - Department of Computational and Systems Biology, University of Pittsburgh, 3501 Fifth Ave, Pittsburgh, Pennsylvania 15213, USA
  - Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara St, Pittsburgh, Pennsylvania 15261, USA
  - Department of Bioengineering, University of Pittsburgh, 300 Technology Dr, Pittsburgh 15213, USA
9
Senders JT, Cho LD, Calvachi P, McNulty JJ, Ashby JL, Schulte IS, Almekkawi AK, Mehrtash A, Gormley WB, Smith TR, Broekman MLD, Arnaout O. Automating Clinical Chart Review: An Open-Source Natural Language Processing Pipeline Developed on Free-Text Radiology Reports From Patients With Glioblastoma. JCO Clin Cancer Inform 2021; 4:25-34. [PMID: 31977252 DOI: 10.1200/cci.19.00060] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/05/2023]
Abstract
PURPOSE The aim of this study was to develop an open-source natural language processing (NLP) pipeline for text mining of medical information from clinical reports. We also aimed to provide insight into why certain variables or reports are more suitable for clinical text mining than others. MATERIALS AND METHODS Various NLP models were developed to extract 15 radiologic characteristics from free-text radiology reports for patients with glioblastoma. Ten-fold cross-validation was used to optimize the hyperparameter settings and estimate model performance. We examined how model performance was associated with quantitative attributes of the radiologic characteristics and reports. RESULTS In total, 562 unique brain magnetic resonance imaging reports were retrieved. NLP extracted 15 radiologic characteristics with high to excellent discrimination (area under the curve, 0.82 to 0.98) and accuracy (78.6% to 96.6%). Model performance was correlated with the inter-rater agreement of the manually provided labels (ρ = 0.904; P < .001) but not with the frequency distribution of the variables of interest (ρ = 0.179; P = .52). All variables labeled with a near perfect inter-rater agreement were classified with excellent performance (area under the curve > 0.95). Excellent performance could be achieved for variables with only 50 to 100 observations in the minority group and class imbalances up to a 9:1 ratio. Report-level classification accuracy was not associated with the number of words or the vocabulary size in the distinct text documents. CONCLUSION This study provides an open-source NLP pipeline that allows for text mining of narratively written clinical reports. Small sample sizes and class imbalance should not be considered as absolute contraindications for text mining in clinical research. 
However, future studies should report measures of inter-rater agreement whenever ground truth is based on a consensus label and use this measure to identify clinical variables eligible for text mining.
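The evaluation scheme named in the abstract above (ten-fold cross-validation scored by area under the curve) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' pipeline: the helpers `k_fold_indices` and `auc` are hypothetical names, the folds are taken in order without the shuffling or stratification a real study would normally apply, and the AUC is computed by its pairwise-ranking definition rather than with a library routine.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: 562 reports (the count given in the abstract) split into ten folds.
folds = list(k_fold_indices(562, k=10))
print(len(folds), [len(test) for _, test in folds])
```

In each fold a model would be fitted on the training indices and its predicted scores on the test indices passed to `auc`; the per-fold values are then averaged to estimate performance.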
Affiliation(s)
- Joeky T Senders
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
  - Department of Neurosurgery, Leiden University Medical Center, Leiden, the Netherlands
- Logan D Cho
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
  - Department of Neuroscience, Brown University, Providence, RI
- Paola Calvachi
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- John J McNulty
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
  - Vagelos College of Physicians and Surgeons, Columbia University, New York, NY
- Joanna L Ashby
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- Isabelle S Schulte
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- Ahmad Kareem Almekkawi
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- Alireza Mehrtash
  - Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- William B Gormley
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- Timothy R Smith
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
- Marike L D Broekman
  - Department of Neurosurgery, Leiden University Medical Center, Leiden, the Netherlands
  - Department of Neurosurgery, Haaglanden Medical Center, The Hague, the Netherlands
- Omar Arnaout
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
10
Santus E, Li C, Yala A, Peck D, Soomro R, Faridi N, Mamshad I, Tang R, Lanahan CR, Barzilay R, Hughes K. Do Neural Information Extraction Algorithms Generalize Across Institutions? JCO Clin Cancer Inform 2020; 3:1-8. [PMID: 31310566 DOI: 10.1200/cci.18.00160] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]
Abstract
PURPOSE Natural language processing (NLP) techniques have been adopted to reduce the curation costs of electronic health records. However, studies have questioned whether such techniques can be applied to data from previously unseen institutions. We investigated the performance of a common neural NLP algorithm on data from both known and heldout (ie, institutions whose data were withheld from the training set and only used for testing) hospitals. We also explored how diversity in the training data affects the system's generalization ability. METHODS We collected 24,881 breast pathology reports from seven hospitals and manually annotated them with nine key attributes that describe types of atypia and cancer. We trained a convolutional neural network (CNN) on annotations from either only one (CNN1), only two (CNN2), or only four (CNN4) hospitals. The trained systems were tested on data from five organizations, including both known and heldout ones. For every setting, we provide the accuracy scores as well as the learning curves that show how much data are necessary to achieve good performance and generalizability. RESULTS The system achieved a cross-institutional accuracy of 93.87% when trained on reports from only one hospital (CNN1). Performance improved to 95.7% and 96%, respectively, when the system was trained on reports from two (CNN2) and four (CNN4) hospitals. The introduction of diversity during training did not lead to improvements on the known institutions, but it boosted performance on the heldout institutions. When tested on reports from heldout hospitals, CNN4 outperformed CNN1 and CNN2 by 2.13% and 0.3%, respectively. CONCLUSION Real-world scenarios require that neural NLP approaches scale to data from previously unseen institutions. We show that a common neural NLP algorithm for information extraction can achieve this goal, especially when diverse data are used during training.
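The known versus heldout evaluation design described in the abstract above amounts to partitioning reports by institution before training. A minimal sketch follows, assuming each report is a dictionary with a `hospital` field; that field name, the helper names, and the toy records are all hypothetical and are not taken from the study.

```python
def split_by_institution(reports, train_hospitals):
    """Separate reports into known (training) and heldout institutions."""
    known, heldout = [], []
    for report in reports:
        target = known if report["hospital"] in train_hospitals else heldout
        target.append(report)
    return known, heldout


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold annotations."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy records, invented for illustration only.
reports = [
    {"hospital": "A", "text": "report 1", "label": "atypia"},
    {"hospital": "A", "text": "report 2", "label": "benign"},
    {"hospital": "B", "text": "report 3", "label": "cancer"},
    {"hospital": "C", "text": "report 4", "label": "benign"},
]

# Train on one hospital only (analogous to the CNN1 setting), test on the rest.
known, heldout = split_by_institution(reports, train_hospitals={"A"})
print(len(known), len(heldout))
```

Growing `train_hospitals` from one to two to four institutions while keeping the heldout set fixed mirrors the comparison of training diversity reported in the abstract.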
Affiliation(s)
- Enrico Santus
- Massachusetts Institute of Technology, Cambridge, MA
- Clara Li
- Massachusetts Institute of Technology, Cambridge, MA
- Adam Yala
- Massachusetts Institute of Technology, Cambridge, MA
- Donald Peck
- Henry Ford Health System, Detroit, MI; Michigan Technological University, Houghton, MI
- Rufina Soomro
- Liaquat National Hospital & Medical College, Karachi, Pakistan
- Naveen Faridi
- Liaquat National Hospital & Medical College, Karachi, Pakistan
- Isra Mamshad
- Liaquat National Hospital & Medical College, Karachi, Pakistan
- Rong Tang
- Rochester General Hospital, Rochester, NY
|
11
|
Zhao B. Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31577448 PMCID: PMC6874014 DOI: 10.1200/cci.19.00057] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/20/2022] Open
Abstract
PURPOSE A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with non-Latin alphabets (such as Slavic languages with a Cyrillic alphabet) present numerous linguistic challenges. We developed deep-learning–based natural language processing algorithms for automatically extracting biomarker status of patients with breast cancer from three oncology centers in Bulgaria. METHODS We used dual embeddings for English and Bulgarian languages, encoding both syntactic and polarity information for the words. The embeddings were subsequently aligned so that they were in the same vector space. The embeddings were used as input to convolutional or recurrent neural networks to derive the biomarker status of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2. RESULTS We showed that we can resolve ambiguity in highly variable medical text containing both Latin and Cyrillic text. Final models incorporating both English and Bulgarian syntax and polarity embeddings achieved F1 scores of 0.90 or higher for all estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 biomarkers. The models were robust against human errors originally found in the training set. In addition, such models can be extended for analyzing text containing words not seen during training. CONCLUSION By using several techniques that incorporate dual-word embeddings encoding syntactic and polarity information in two languages followed by deep neural network architectures, we show that researchers can extract and normalize parameters within medical data. The principles described here can be used to analyze Cyrillic or Latin mixed medical text and extract other parameters.
|
12
|
A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. J Biomed Inform 2019; 100:103301. [PMID: 31589927 DOI: 10.1016/j.jbi.2019.103301] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 04/02/2019] [Revised: 09/04/2019] [Accepted: 10/03/2019] [Indexed: 02/07/2023]
Abstract
OBJECTIVE There is a lot of information about cancer in Electronic Health Record (EHR) notes that can be useful for biomedical research provided natural language processing (NLP) methods are available to extract and structure this information. In this paper, we present a scoping review of existing clinical NLP literature for cancer. METHODS We identified studies describing an NLP method to extract specific cancer-related information from EHR sources from PubMed, Google Scholar, ACL Anthology, and existing reviews. Two exclusion criteria were used in this study. We excluded articles where the extraction techniques used were too broad to be represented as frames (e.g., document classification) and also where very low-level extraction methods were used (e.g. simply identifying clinical concepts). 78 articles were included in the final review. We organized this information according to frame semantic principles to help identify common areas of overlap and potential gaps. RESULTS Frames were created from the reviewed articles pertaining to cancer information such as cancer diagnosis, tumor description, cancer procedure, breast cancer diagnosis, prostate cancer diagnosis and pain in prostate cancer patients. These frames included both a definition as well as specific frame elements (i.e. extractable attributes). We found that cancer diagnosis was the most common frame among the reviewed papers (36 out of 78), with recent work focusing on extracting information related to treatment and breast cancer diagnosis. CONCLUSION The list of common frames described in this paper identifies important cancer-related information extracted by existing NLP techniques and serves as a useful resource for future researchers requiring cancer information extracted from EHR notes. We also argue, due to the heavy duplication of cancer NLP systems, that a general purpose resource of annotated cancer frames and corresponding NLP tools would be valuable.
|
13
|
Jain NM, Culley A, Knoop T, Micheel C, Osterman T, Levy M. Conceptual Framework to Support Clinical Trial Optimization and End-to-End Enrollment Workflow. JCO Clin Cancer Inform 2019; 3:1-10. [PMID: 31225983 PMCID: PMC6873934 DOI: 10.1200/cci.19.00033] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Accepted: 05/02/2019] [Indexed: 12/19/2022] Open
Abstract
In this work, we present a conceptual framework to support clinical trial optimization and enrollment workflows and review the current state, limitations, and future trends in this space. This framework includes knowledge representation of clinical trials, clinical trial optimization, clinical trial design, enrollment workflows for prospective clinical trial matching, waitlist management, and, finally, evaluation strategies for assessing improvement.
Affiliation(s)
- Neha M. Jain
- Vanderbilt University Medical Center, Nashville, TN
- Teresa Knoop
- Vanderbilt University Medical Center, Nashville, TN
- Mia Levy
- Vanderbilt University Medical Center, Nashville, TN
- Rush University Medical Center, Chicago, IL
|
14
|
Shortreed SM, Cook AJ, Coley RY, Bobb JF, Nelson JC. Challenges and Opportunities for Using Big Health Care Data to Advance Medical Science and Public Health. Am J Epidemiol 2019; 188:851-861. [PMID: 30877288 DOI: 10.1093/aje/kwy292] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 10/04/2018] [Accepted: 12/20/2018] [Indexed: 12/14/2022] Open
Abstract
Methodological advancements in epidemiology, biostatistics, and data science have strengthened the research world's ability to use data captured from electronic health records (EHRs) to address pressing medical questions, but gaps remain. We describe methods investments that are needed to curate EHR data toward research quality and to integrate complementary data sources when EHR data alone are insufficient for research goals. We highlight new methods and directions for improving the integrity of medical evidence generated from pragmatic trials, observational studies, and predictive modeling. We also discuss needed methods contributions to further ease data sharing across multisite EHR data networks. Throughout, we identify opportunities for training and for bolstering collaboration among subject matter experts, methodologists, practicing clinicians, and health system leaders to help ensure that methods problems are identified and resulting advances are translated into mainstream research practice more quickly.
Affiliation(s)
- Susan M Shortreed
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute, Seattle, Washington
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
- Andrea J Cook
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute, Seattle, Washington
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
- R Yates Coley
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute, Seattle, Washington
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
- Jennifer F Bobb
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute, Seattle, Washington
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
- Jennifer C Nelson
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute, Seattle, Washington
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
|
15
|
Brentnall AR, Cuzick J, Buist DSM, Bowles EJA. Long-term Accuracy of Breast Cancer Risk Assessment Combining Classic Risk Factors and Breast Density. JAMA Oncol 2018; 4:e180174. [PMID: 29621362 PMCID: PMC6143016 DOI: 10.1001/jamaoncol.2018.0174] [Citation(s) in RCA: 129] [Impact Index Per Article: 21.5] [Received: 11/14/2017] [Accepted: 01/16/2018] [Indexed: 12/23/2022]
Abstract
Importance Accurate long-term breast cancer risk assessment for women attending routine screening could help reduce the disease burden and intervention-associated harms by personalizing screening recommendations and preventive interventions. Objective To report the accuracy of risk assessment for breast cancer during a period of 19 years. Design, Setting, and Participants This cohort study of the Kaiser Permanente Washington breast imaging registry included women without previous breast cancer, aged 40 to 73 years, who attended screening from January 1, 1996, through December 31, 2013. Follow-up was completed on December 31, 2014, and data were analyzed from March 2, 2016, through November 13, 2017. Exposures Risk factors from a questionnaire and breast density from the Breast Imaging and Reporting Data System at entry; primary risk was assessed using the Tyrer-Cuzick model. Main Outcomes and Measures Incidence of invasive breast cancer was estimated with and without breast density. Follow-up began 6 months after the entry mammogram and extended to the earliest diagnosis of invasive breast cancer, censoring at 75 years of age, 2014, diagnosis of ductal carcinoma in situ, death, or health plan disenrollment. Observed divided by expected (O/E) numbers of cancer cases were compared using exact Poisson 95% CIs. Hazard ratios for the top decile of 10-year risk relative to the middle 80% of the study population were estimated. Constancy of relative risk calibration during follow-up was tested using a time-dependent proportional hazards effect. Results In this cohort study of 132 139 women (median age at entry, 50 years; interquartile range, 44-58 years), 2699 invasive breast cancers were subsequently diagnosed after a median 5.2 years of follow-up (interquartile range, 2.4-11.1 years; maximum follow-up, 19 years; annual incidence rate [IR] per 1000 women, 2.9). 
Observed number of cancer diagnoses was close to the expected number (O/E for the Tyrer-Cuzick model, 1.02 [95% CI, 0.98-1.06]; O/E for the Tyrer-Cuzick model with density, 0.98 [95% CI, 0.94-1.02]). The Tyrer-Cuzick model estimated 2554 women (1.9%) to be at high risk (10-year risk of ≥8%), of whom 147 subsequently developed invasive breast cancer (O/E, 0.79; 95% CI, 0.67-0.93; IR per 1000 women, 8.7). The Tyrer-Cuzick model with density estimated more women to be at high risk (4645 [3.5%]; 273 cancers [10.1%]; O/E, 0.78; 95% CI, 0.69-0.88; IR per 1000 women, 9.2). The hazard ratio for the highest risk decile compared with the middle 80% was 2.22 (95% CI, 2.02-2.45) for the Tyrer-Cuzick model and 2.55 (95% CI, 2.33-2.80) for the Tyrer-Cuzick model with density. Little evidence was found for a decrease in relative risk calibration throughout follow-up for the Tyrer-Cuzick model (age-adjusted slope, -0.003; 95% CI, -0.018 to 0.012) or the Tyrer-Cuzick model with density (age-adjusted slope, -0.008; 95% CI, -0.020 to 0.004). Conclusions and Relevance Breast cancer risk assessment combining classic risk factors with mammographic density may provide useful data for 10 years or more and could be used to guide long-term, systematic, risk-adapted screening and prevention strategies.
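The observed-over-expected (O/E) comparison with exact Poisson 95% CIs mentioned in this abstract can be illustrated with a short sketch. It is not the authors' code: it assumes the standard exact (Garwood) interval for a single Poisson count, found here by bisection on the Poisson CDF using only the standard library, and it treats the expected count as fixed. The function names and the example counts are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu); terms are computed in log space."""
    if k < 0:
        return 0.0
    if mu <= 0.0:
        return 1.0
    total = sum(math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))
                for i in range(k + 1))
    return min(1.0, total)


def _bisect(g, lo, hi, iterations=200):
    """Root of an increasing function g with g(lo) < 0 < g(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_poisson_ci(observed, alpha=0.05):
    """Exact (Garwood) two-sided interval for a Poisson mean given one count."""
    hi_start = observed + 10.0 * math.sqrt(observed + 1.0) + 10.0
    if observed == 0:
        lower = 0.0
    else:
        # Mean at which P(X >= observed) equals alpha / 2.
        lower = _bisect(
            lambda mu: (1.0 - poisson_cdf(observed - 1, mu)) - alpha / 2.0,
            0.0, float(observed))
    # Mean at which P(X <= observed) equals alpha / 2.
    upper = _bisect(
        lambda mu: alpha / 2.0 - poisson_cdf(observed, mu),
        float(observed), hi_start)
    return lower, upper


def observed_over_expected(observed, expected, alpha=0.05):
    """O/E ratio with its interval, treating the expected count as fixed."""
    lower, upper = exact_poisson_ci(observed, alpha)
    return observed / expected, lower / expected, upper / expected


# Hypothetical counts, not taken from the study.
ratio, lo, hi = observed_over_expected(observed=10, expected=12.5)
```

An interval that includes 1 indicates that the observed count is compatible with the model's expected count; an interval entirely below 1 indicates that the model overestimated risk in that group.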
Affiliation(s)
- Adam R. Brentnall
- Centre for Cancer Prevention, Wolfson Institute of Preventive Medicine, Queen Mary University of London, London, England
- Jack Cuzick
- Centre for Cancer Prevention, Wolfson Institute of Preventive Medicine, Queen Mary University of London, London, England
- Diana S. M. Buist
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington
|
16
|
Tang R, Ouyang L, Li C, He Y, Griffin M, Taghian A, Smith B, Yala A, Barzilay R, Hughes K. Machine learning to parse breast pathology reports in Chinese. Breast Cancer Res Treat 2018; 169:243-250. [PMID: 29380208 DOI: 10.1007/s10549-018-4668-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 01/08/2018] [Accepted: 01/11/2018] [Indexed: 02/06/2023]
Abstract
INTRODUCTION Large structured databases of pathology findings are valuable in deriving new clinical insights. However, they are labor intensive to create and generally require manual annotation. There has been some work in the bioinformatics community to support automating this work via machine learning in English. Our contribution is to provide an automated approach to construct such structured databases in Chinese, and to set the stage for extraction from other languages. METHODS We collected 2104 de-identified Chinese benign and malignant breast pathology reports from Hunan Cancer Hospital. Physicians with native Chinese proficiency reviewed the reports and annotated a variety of binary and numerical pathologic entities. After excluding 78 cases with a bilateral lesion in the same report, 1216 cases were used as a training set for the algorithm, which was then refined using 405 development cases. The natural language processing algorithm was tested on the remaining 405 cases to evaluate the machine learning outcome. The model was used to extract 13 binary entities and 8 numerical entities. RESULTS When compared to physicians with native Chinese proficiency, the model showed a per-entity accuracy from 91 to 100% for all common diagnoses on the test set. The overall accuracy of binary entities was 98% and of numerical entities was 95%. In a per-report evaluation for binary entities with more than 100 training cases, 85% of all the testing reports were completely correct and 11% had an error in 1 out of 22 entities. CONCLUSION We have demonstrated that Chinese breast pathology reports can be automatically parsed into structured data using standard machine learning approaches. The results of our study demonstrate that techniques effective in parsing English reports can be scaled to other languages.
Affiliation(s)
- Rong Tang
- Division of Surgical Oncology, MGH, Boston, USA
- Lizhi Ouyang
- Department of Breast Surgery, Hunan Cancer Hospital, Changsha, Hunan, China
- Clara Li
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA
- Yue He
- Department of Breast Surgery, Hunan Cancer Hospital, Changsha, Hunan, China
- Adam Yala
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA
- Regina Barzilay
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA
|
17
|
Boguslav M, Cohen KB, Baumgartner WA, Hunter LE. Improving precision in concept normalization. Pacific Symposium on Biocomputing 2018; 23:566-577. [PMID: 29218915 PMCID: PMC5730334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Most natural language processing applications exhibit a trade-off between precision and recall. In some use cases for natural language processing, there are reasons to prefer to tilt that trade-off toward high precision. Relying on the Zipfian distribution of false positive results, we describe a strategy for increasing precision, using a variety of both pre-processing and post-processing methods. They draw on both knowledge-based and frequentist approaches to modeling language. Based on an existing high-performance biomedical concept recognition pipeline and a previously published manually annotated corpus, we apply this hybrid rationalist/empiricist strategy to concept normalization for eight different ontologies. Which approaches did and did not improve precision varied widely between the ontologies.
Affiliation(s)
- Mayla Boguslav
- Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO 80045, USA (compbio.ucdenver.edu)
|
18
|
Kreimeyer K, Foster M, Pandey A, Arya N, Halford G, Jones SF, Forshee R, Walderhaug M, Botsis T. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. J Biomed Inform 2017; 73:14-29. [PMID: 28729030 DOI: 10.1016/j.jbi.2017.07.012] [Citation(s) in RCA: 290] [Impact Index Per Article: 41.4] [Received: 03/14/2017] [Revised: 06/07/2017] [Accepted: 07/14/2017] [Indexed: 12/24/2022]
Abstract
We followed a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses to identify existing clinical natural language processing (NLP) systems that generate structured information from unstructured free text. Seven literature databases were searched with a query combining the concepts of natural language processing and structured data capture. Two reviewers screened all records for relevance during two screening phases, and information about clinical NLP systems was collected from the final set of papers. A total of 7149 records (after removing duplicates) were retrieved and screened, and 86 were determined to fit the review criteria. These papers contained information about 71 different clinical NLP systems, which were then analyzed. The NLP systems address a wide variety of important clinical and research tasks. Certain tasks are well addressed by the existing systems, while others remain as open challenges that only a small number of systems attempt, such as extraction of temporal information or normalization of concepts to standard terminologies. This review has identified many NLP systems capable of processing clinical free text and generating structured output, and the information collected and evaluated here will be important for prioritizing development of new approaches for clinical NLP.
Affiliation(s)
- Kory Kreimeyer
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
- Matthew Foster
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
- Abhishek Pandey
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
- Nina Arya
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
- Gwendolyn Halford
- FDA Library, US Food and Drug Administration, Silver Spring, MD, United States
- Sandra F Jones
- Cancer Surveillance Branch, Division of Cancer Prevention and Control, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, Atlanta, GA, United States
- Richard Forshee
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
- Mark Walderhaug
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
- Taxiarchis Botsis
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, United States
|
19
|
Effect of an Automated Tracking Registry on the Rate of Tracking Failure in Incidental Pulmonary Nodules. J Am Coll Radiol 2017; 14:773-777. [PMID: 28434846 DOI: 10.1016/j.jacr.2017.02.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/28/2016] [Revised: 02/02/2017] [Accepted: 02/07/2017] [Indexed: 12/21/2022]
Abstract
OBJECTIVE Following incidental lung nodules with interval CT scanning is an accepted method to detect early lung cancer, but delayed tracking or failure to track is reported in up to 40% of patients. METHODS Our institution developed and implemented an automated lung nodule registry tracking system. This system uses a code at the time that a suspicious nodule is discovered to populate the registry. Suspicious nodules were defined as any nodule, solid or ground glass, <3 cm that the radiologist recorded as a potential malignancy or recommended for follow-up imaging. We exported the system to eight other Veterans Administration Medical Centers (VAMCs) with over 10,000 patients enrolled. We retrospectively reviewed 200 sequential CT scan reports containing incidental nodule(s) from two tertiary care university-affiliated VAMCs, both before and after the implementation of the registry tracking system. The primary outcome was the rate of tracking failure, defined as suspicious nodules that had no follow-up imaging or whose follow-up was delayed when compared with published guidelines. Secondary outcomes were predictors of tracking failure and reasons for tracking failure. RESULTS After implementation of the registry tracking system in the two VAMCs, we found a significant decrease in tracking failure, from a preimplementation rate of 74% to a postimplementation rate of 10% (P < .001). We found that age, nodule size, number, and nodule characteristics were significant predictors. CONCLUSIONS The automated lung nodule registry tracking system can be exported to other health care facilities and significantly reduces the rate of tracking failure.
|
20
|
Chen W, Huang Y, Boyle B, Lin S. The utility of including pathology reports in improving the computational identification of patients. J Pathol Inform 2016; 7:46. [PMID: 27994938 PMCID: PMC5139449 DOI: 10.4103/2153-3539.194838] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 01/22/2016] [Accepted: 10/08/2016] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND Celiac disease (CD) is a common autoimmune disorder. Efficient identification of patients may improve chronic management of the disease. Prior studies have shown that searching International Classification of Diseases-9 (ICD-9) codes alone is inaccurate for identifying patients with CD. In this study, we developed automated classification algorithms leveraging pathology reports and other clinical data in Electronic Health Records (EHRs) to refine the subset population preselected using the ICD-9 code (579.0). MATERIALS AND METHODS EHRs were searched for the established ICD-9 code (579.0) suggesting CD, based on which an initial identification of cases was obtained. In addition, laboratory results for tissue transglutaminase were extracted. Using natural language processing, we analyzed pathology reports from upper endoscopy. Twelve machine learning classifiers using different combinations of variables related to ICD-9 CD status, laboratory result status, and pathology reports were evaluated to find the best possible CD classifier. Ten-fold cross-validation was used to assess the results. RESULTS A total of 1498 patient records were used, including 363 confirmed cases and 1135 false positive cases that served as controls. A logistic model based on both clinical and pathology report features produced the best results: Kappa of 0.78, F1 of 0.92, and area under the curve (AUC) of 0.94, whereas using ICD-9 only generated poor results: Kappa of 0.28, F1 of 0.75, and AUC of 0.63. CONCLUSION Our automated classification system presented an efficient and reliable way to improve the performance of CD patient identification.
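Two of the metrics reported in this abstract, Cohen's kappa and F1, can be computed from a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the counts are hypothetical and are not taken from the study. AUC is omitted because it requires ranked scores, not a single confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Observed agreement between classifier and reference labels.
    observed = (tp + tn) / total
    # Agreement expected by chance from the marginal label frequencies.
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((fn + tn) / total) * ((fp + tn) / total)
    chance = chance_pos + chance_neg
    kappa = (observed - chance) / (1 - chance) if chance < 1 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Kappa discounts the agreement that the marginal class frequencies alone would produce, which is why it can be much lower than F1 for the same classifier.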
Affiliation(s)
- Wei Chen
- Department of Research and Development, Research Information Solutions and Innovation, Nationwide Children's Hospital, 575 Children's Crossroad, Columbus, Ohio 43215, USA
- Yungui Huang
- Department of Research and Development, Research Information Solutions and Innovation, Nationwide Children's Hospital, 575 Children's Crossroad, Columbus, Ohio 43215, USA
- Brendan Boyle
- Department of Gastroenterology, Nationwide Children's Hospital, 700 Children's Dr, Columbus, Ohio 43205, USA
- Simon Lin
- Department of Research and Development, Research Information Solutions and Innovation, Nationwide Children's Hospital, 575 Children's Crossroad, Columbus, Ohio 43215, USA
|
21
|
Yala A, Barzilay R, Salama L, Griffin M, Sollender G, Bardia A, Lehman C, Buckley JM, Coopey SB, Polubriaginof F, Garber JE, Smith BL, Gadd MA, Specht MC, Gudewicz TM, Guidi AJ, Taghian A, Hughes KS. Using machine learning to parse breast pathology reports. Breast Cancer Res Treat 2016; 161:203-211. [DOI: 10.1007/s10549-016-4035-1] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Received: 10/13/2016] [Accepted: 10/21/2016] [Indexed: 10/20/2022]
|
22
|
Ye JJ. Pathology report data extraction from relational database using R, with extraction from reports on melanoma of skin as an example. J Pathol Inform 2016; 7:44. [PMID: 28066684 PMCID: PMC5100200 DOI: 10.4103/2153-3539.192822] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 06/30/2016] [Accepted: 09/27/2016] [Indexed: 12/05/2022] Open
Abstract
Background: Different methods have been described for data extraction from pathology reports with varying degrees of success. Here a technique for directly extracting data from a relational database is described. Methods: Our department uses synoptic reports modified from College of American Pathologists (CAP) Cancer Protocol Templates to report most of our cancer diagnoses. Choosing the melanoma of skin synoptic report as an example, the R scripting language extended with the RODBC package was used to query the pathology information system database. Reports containing the melanoma of skin synoptic report in the past 4 and a half years were retrieved and individual data elements were extracted. Using the retrieved list of the cases, the database was queried a second time to retrieve/extract the lymph node staging information in the subsequent reports from the same patients. Results: 426 synoptic reports corresponding to unique lesions of melanoma of skin were retrieved, and data elements of interest were extracted into an R data frame. The distribution of Breslow depth of melanomas grouped by year is used as an example of intra-report data extraction and analysis. When the new pN staging information was present in the subsequent reports, 82% (77/94) was precisely retrieved (pN0, pN1, pN2 and pN3). An additional 15% (14/94) was retrieved with certain ambiguity (positive or knowing there was an update). The specificity was 100% for both. The relationship between Breslow depth and lymph node status was graphed as an example of lesion-specific multi-report data extraction and analysis. Conclusions: R extended with the RODBC package is a simple and versatile approach well-suited for the above tasks. The success or failure of the retrieval and extraction depended largely on whether the reports were formatted and whether the contents of the elements were consistently phrased. This approach can be easily modified and adopted for other pathology information systems that use a relational database for data management.
Affiliation(s)
- Jay J Ye
- Dahl-Chase Pathology Associates, Bangor, Maine, USA
|
23
|
Bozkurt S, Gimenez F, Burnside ES, Gulkesen KH, Rubin DL. Using automatically extracted information from mammography reports for decision-support. J Biomed Inform 2016; 62:224-31. [PMID: 27388877 DOI: 10.1016/j.jbi.2016.07.001] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 03/20/2016] [Revised: 06/22/2016] [Accepted: 07/02/2016] [Indexed: 02/07/2023]
Abstract
OBJECTIVE To evaluate a system we developed that connects natural language processing (NLP) for information extraction from narrative text mammography reports with a Bayesian network for decision-support about breast cancer diagnosis. The ultimate goal of this system is to provide decision support as part of the workflow of producing the radiology report. MATERIALS AND METHODS We built a system that uses an NLP information extraction system (which extracts BI-RADS descriptors and clinical information from mammography reports) to provide the necessary inputs to a Bayesian network (BN) decision support system (DSS) that estimates lesion malignancy from BI-RADS descriptors. We used this integrated system to predict diagnosis of breast cancer from radiology text reports and evaluated it with a reference standard of 300 mammography reports. We collected two different outputs from the DSS: (1) the probability of malignancy and (2) the BI-RADS final assessment category. Since NLP may produce imperfect inputs to the DSS, we compared the difference between using perfect ("reference standard") structured inputs to the DSS ("RS-DSS") vs NLP-derived inputs ("NLP-DSS") on the output of the DSS using the concordance correlation coefficient. We measured the classification accuracy of the BI-RADS final assessment category when using NLP-DSS, compared with the ground truth category established by the radiologist. RESULTS The NLP-DSS and RS-DSS had closely matched probabilities, with a mean paired difference of 0.004 ± 0.025. The concordance correlation of these paired measures was 0.95. The accuracy of the NLP-DSS to predict the correct BI-RADS final assessment category was 97.58%. CONCLUSION The accuracy of the information extracted from mammography reports using the NLP system was sufficient to provide accurate DSS results. We believe our system could ultimately reduce the variation in practice in mammography related to assessment of malignant lesions and improve management decisions.
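The concordance correlation coefficient used in this abstract to compare paired probabilities can be sketched as follows. The code assumes Lin's standard definition with population (1/n) moments; the function name and the example values are hypothetical and are not taken from the study.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for two paired sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired sequences of length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalizes systematic offset between raters.
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("both sequences are constant and identical")
    return 2 * cov / denom


# Hypothetical paired malignancy probabilities, for illustration only.
reference_probs = [0.10, 0.35, 0.60, 0.85]
nlp_probs = [0.12, 0.33, 0.61, 0.80]
ccc = concordance_correlation(reference_probs, nlp_probs)
```

Unlike Pearson correlation, this coefficient drops below 1 when one set of values is consistently shifted or scaled relative to the other, so it measures agreement and not merely linear association.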
Affiliation(s)
- Selen Bozkurt
- Akdeniz University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Antalya, Turkey
- Francisco Gimenez
- Department of Radiology and Medicine (Biomedical Informatics Research), Stanford University, Richard M. Lucas Center, 1201 Welch Road, Office P285, Stanford, CA 94305-5488, United States
- Kemal H Gulkesen
- Akdeniz University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Antalya, Turkey
- Daniel L Rubin
- Department of Radiology and Medicine (Biomedical Informatics Research), Stanford University, Richard M. Lucas Center, 1201 Welch Road, Office P285, Stanford, CA 94305-5488, United States
|
24
|
The SYNODOS Project: System for the Normalization and Organization of Textual Medical Data for Observation in Healthcare. Ing Rech Biomed 2016. [DOI: 10.1016/j.irbm.2016.03.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/24/2022]
|