1
Luo Y, Mao C, Sanchez‐Pinto LN, Ahmad FS, Naidech A, Rasmussen L, Pacheco JA, Schneider D, Mithal LB, Dresden S, Holmes K, Carson M, Shah SJ, Khan S, Clare S, Wunderink RG, Liu H, Walunas T, Cooper L, Yue F, Wehbe F, Fang D, Liebovitz DM, Markl M, Michelson KN, McColley SA, Green M, Starren J, Ackermann RT, D'Aquila RT, Adams J, Lloyd‐Jones D, Chisholm RL, Kho A. Northwestern University resource and education development initiatives to advance collaborative artificial intelligence across the learning health system. Learn Health Syst 2024; 8:e10417. [PMID: 39036530 PMCID: PMC11257059 DOI: 10.1002/lrh2.10417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 02/22/2024] [Accepted: 02/26/2024] [Indexed: 07/23/2024]
Abstract
Introduction: The rapid development of artificial intelligence (AI) in healthcare has exposed the unmet need for growing a multidisciplinary workforce that can collaborate effectively in learning health systems. Maximizing the synergy among multiple teams is critical for Collaborative AI in Healthcare.
Methods: We have developed a series of data, tools, and educational resources for cultivating the next generation of multidisciplinary workforce for Collaborative AI in Healthcare. We built bulk natural language processing pipelines to extract structured information from clinical notes and stored it in common data models. We developed multimodal AI/machine learning (ML) tools and tutorials to enrich the toolbox of the multidisciplinary workforce to analyze multimodal healthcare data. We have created a fertile ground to cross-pollinate clinicians and AI scientists and train the next generation of AI health workforce to collaborate effectively.
Results: Our work has democratized access to unstructured health information, AI/ML tools and resources for healthcare, and collaborative education resources. From 2017 to 2022, this has enabled studies in multiple clinical specialties resulting in 68 peer-reviewed publications. In 2022, our cross-discipline efforts converged and were institutionalized as the Center for Collaborative AI in Healthcare.
Conclusions: Our Collaborative AI in Healthcare initiatives have created valuable educational and practical resources. They have enabled more clinicians, scientists, and hospital administrators to successfully apply AI methods in their daily research and practice, develop closer collaborations, and advance the institution-level learning health system.
Affiliation(s)
- Yuan Luo
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Chengsheng Mao
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Lazaro N. Sanchez‐Pinto
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Division of Critical Care, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Stanley Manne Children's Research Institute, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois, USA
- Faraz S. Ahmad
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Andrew Naidech
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Neurocritical Care, Department of Neurology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Luke Rasmussen
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Jennifer A. Pacheco
  - Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Daniel Schneider
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
- Leena B. Mithal
  - Stanley Manne Children's Research Institute, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois, USA
  - Division of Infectious Diseases, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Scott Dresden
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Emergency Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Kristi Holmes
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Galter Health Sciences Library, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Matthew Carson
  - Galter Health Sciences Library, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Sanjiv J. Shah
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Seema Khan
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Susan Clare
  - Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Richard G. Wunderink
  - Division of Critical Care, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Pulmonary and Critical Care Division, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Huiping Liu
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Pharmacology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Division of Hematology and Oncology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Theresa Walunas
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Division of General Internal Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Center for Health Information Partnerships, Institute for Public Health and Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Microbiology‐Immunology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Lee Cooper
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Department of Pathology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Feng Yue
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Pathology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Department of Biochemistry and Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Firas Wehbe
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Deyu Fang
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Pathology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- David M. Liebovitz
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Division of General Internal Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Center for Health Information Partnerships, Institute for Public Health and Medicine, Northwestern University, Chicago, Illinois, USA
- Michael Markl
  - Department of Radiology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Kelly N. Michelson
  - Division of Critical Care, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Stanley Manne Children's Research Institute, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois, USA
  - Center for Bioethics and Medical Humanities, Institute for Public Health and Medicine, Northwestern University, Chicago, Illinois, USA
- Susanna A. McColley
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Stanley Manne Children's Research Institute, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois, USA
  - Division of Pulmonary and Sleep Medicine, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Marianne Green
  - Division of General Internal Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Justin Starren
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Ronald T. Ackermann
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Division of General Internal Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Institute for Public Health and Medicine, Northwestern University, Chicago, Illinois, USA
- Richard T. D'Aquila
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Division of Infectious Diseases, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- James Adams
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Emergency Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Donald Lloyd‐Jones
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Epidemiology, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Rex L. Chisholm
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Center for Health Information Partnerships, Institute for Public Health and Medicine, Northwestern University, Chicago, Illinois, USA
- Abel Kho
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, Illinois, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, Illinois, USA
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Division of General Internal Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  - Center for Health Information Partnerships, Institute for Public Health and Medicine, Northwestern University, Chicago, Illinois, USA
2
Lam H, Nguyen F, Wang X, Stock A, Lenskaya V, Kooshesh M, Li P, Qazi M, Wang S, Dehghan M, Qian X, Si Q, Polydorides AD. An accessible, efficient, and accurate natural language processing method for extracting diagnostic data from pathology reports. J Pathol Inform 2022; 13:100154. [PMID: 36605108 PMCID: PMC9808011 DOI: 10.1016/j.jpi.2022.100154] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/29/2022] [Revised: 10/09/2022] [Accepted: 11/02/2022] [Indexed: 11/09/2022]
Abstract
Context: Analysis of diagnostic information in pathology reports for the purposes of clinical or translational research and quality assessment/control often requires manual data extraction, which can be laborious, time-consuming, and subject to mistakes.
Objective: We sought to develop, employ, and evaluate a simple, dictionary- and rule-based natural language processing (NLP) algorithm for generating searchable information on various types of parameters from diverse surgical pathology reports.
Design: Data were exported from the pathology laboratory information system (LIS) into extensible markup language (XML) documents, which were parsed by NLP-based Python code into desired data points and delivered to Excel spreadsheets. Accuracy and efficiency were compared to a manual data extraction method with concordance measured by Cohen's κ coefficient and corresponding P values.
Results: The automated method was highly concordant (90%-100%, P<.001) with excellent inter-observer reliability (Cohen's κ: 0.86-1.0) compared to the manual method in 3 clinicopathological research scenarios, including squamous dysplasia presence and grade in anal biopsies, epithelial dysplasia grade and location in colonoscopic surveillance biopsies, and adenocarcinoma grade and amount in prostate core biopsies. Significantly, the automated method was 24-39 times faster and inherently contained links for each diagnosis to additional variables such as patient age and location, which would otherwise require additional manual processing time.
Conclusions: A simple, flexible, and scalable NLP-based platform can be used to correctly, safely, and quickly extract and deliver linked data from pathology reports into searchable spreadsheets for clinical and research purposes.
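The workflow described in this abstract (XML export, rule-based parsing, agreement measured with Cohen's κ) can be illustrated with a minimal Python sketch. It is not the authors' code: the XML element and attribute names, the sample reports, and the three regular-expression rules are hypothetical, chosen only to show the general shape of a dictionary- and rule-based extractor.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML export; element and attribute names are illustrative only.
SAMPLE_XML = """
<reports>
  <report id="A1" age="54"><diagnosis>Anal biopsy: high-grade squamous dysplasia.</diagnosis></report>
  <report id="A2" age="61"><diagnosis>Anal biopsy: negative for dysplasia.</diagnosis></report>
  <report id="A3" age="47"><diagnosis>Anal biopsy: low grade squamous dysplasia.</diagnosis></report>
</reports>
""".strip()

# Ordered rules (assumed, not from the paper): the first match wins,
# so the negation rule is checked before the grade rules.
RULES = [
    ("negative", re.compile(r"\b(negative for|no)\s+(\w+\s+)?dysplasia", re.I)),
    ("high-grade", re.compile(r"\bhigh[- ]grade\b.*\bdysplasia", re.I)),
    ("low-grade", re.compile(r"\blow[- ]grade\b.*\bdysplasia", re.I)),
]


def classify(text):
    """Return the label of the first rule whose pattern matches the text."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unclassified"


def extract(xml_text):
    """Parse the XML and return one row per report, keeping linked variables."""
    root = ET.fromstring(xml_text)
    rows = []
    for report in root.iter("report"):
        text = report.findtext("diagnosis", default="")
        rows.append({
            "id": report.get("id"),
            "age": int(report.get("age")),
            "grade": classify(text),
        })
    return rows


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


rows = extract(SAMPLE_XML)
automated = [row["grade"] for row in rows]
manual = ["high-grade", "negative", "low-grade"]  # hypothetical manual labels
print(rows)
print("kappa:", cohen_kappa(automated, manual))
```

Because each row keeps the report identifier and age next to the extracted grade, the linked variables mentioned in the Results come along without extra manual work; writing the rows to a spreadsheet would be a separate step.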
Affiliation(s)
- Alexandros D. Polydorides
  - Corresponding author at: Department of Pathology, Molecular and Cell Based Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1194, New York, NY 10029, USA.
3
Li Y, Wu X, Yang P, Jiang G, Luo Y. Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis. Genomics, Proteomics & Bioinformatics 2022; 20:850-866. [PMID: 36462630 PMCID: PMC10025752 DOI: 10.1016/j.gpb.2022.11.003] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 03/04/2022] [Revised: 10/03/2022] [Accepted: 11/17/2022] [Indexed: 12/03/2022]
Abstract
The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
Affiliation(s)
- Yawei Li
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Xin Wu
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
- Ping Yang
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905 / Scottsdale, AZ 85259, USA
- Guoqian Jiang
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA
- Yuan Luo
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
4
Entity understanding with hierarchical graph learning for enhanced text classification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108576] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
5
A novel differential diagnosis algorithm for chronic lymphocytic leukemia using immunophenotyping with flow cytometry. Hematol Transfus Cell Ther 2021:S2531-1379(21)01317-1. [PMID: 35216960 DOI: 10.1016/j.htct.2021.08.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/04/2021] [Revised: 07/13/2021] [Accepted: 08/10/2021] [Indexed: 12/11/2022]
Abstract
INTRODUCTION: The availability of a clinical decision algorithm for diagnosis of chronic lymphocytic leukemia (CLL) may greatly contribute to the diagnosis of CLL, particularly in cases with ambiguous immunophenotypes. Herein we propose a novel differential diagnosis algorithm for CLL using immunophenotyping with flow cytometry.
METHODS: The hierarchical logistic regression model (Backward LR) was used to build a predictive algorithm for the diagnosis of CLL, differentiated from other lymphoproliferative disorders (LPDs).
RESULTS: A total of 302 patients, of whom 220 (72.8%) had CLL and 82 (27.2%) had B-cell lymphoproliferative disorders other than CLL, were included in the study. The Backward LR model comprised the variables CD5, CD43, CD81, ROR1, CD23, CD79b, FMC7, sIg and CD200 in the model development process. Weak expression of CD81 and increased expression intensity of the markers CD5, CD23 and CD200 increased the probability of CLL diagnosis (p < 0.05). The odds ratios for CD5, CD23, CD200 and CD81 were 1.088 (1.050 - 1.126), 1.044 (1.012 - 1.077), 1.039 (1.007 - 1.072) and 0.946 (0.921 - 0.970) [95% C.I.], respectively. Our model provided a novel diagnostic algorithm with 95.27% sensitivity and 91.46% specificity. The model predicted CLL for 97.3% (214) of the 220 patients diagnosed with CLL and "others" for 91.5% (75) of the 82 patients diagnosed with an LPD other than CLL. Overall, 95.7% of the cases were correctly classified as CLL or others.
CONCLUSIONS: Our model highlighting 4 markers (CD81, CD5, CD23 and CD200) provided high sensitivity and specificity in the diagnosis of CLL and in distinguishing CLL from other LPDs.
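The classification figures in this abstract can be recomputed from the patient counts it reports. The sketch below is a plain arithmetic check, not the authors' analysis; the function name `diagnostic_metrics` is an illustrative choice, and the false-negative and false-positive counts are derived by subtraction (220 - 214 and 82 - 75). Note that these counts give a sensitivity of about 97.3%, which agrees with the per-class figure in the abstract rather than with the separately stated 95.27%.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true CLL cases predicted as CLL
    specificity = tn / (tn + fp)  # share of non-CLL cases predicted as "others"
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Counts taken from the abstract: 214 of 220 CLL and 75 of 82 non-CLL correct.
sens, spec, acc = diagnostic_metrics(tp=214, fn=220 - 214, tn=75, fp=82 - 75)
print(f"sensitivity = {sens:.2%}")
print(f"specificity = {spec:.2%}")
print(f"accuracy    = {acc:.2%}")
```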
6
Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9:e20675. [PMID: 34236337 PMCID: PMC8433943 DOI: 10.2196/20675] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/25/2020] [Revised: 11/25/2020] [Accepted: 07/02/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND: The Unified Medical Language System (UMLS) has been a critical tool in biomedical and health informatics, and the year 2021 marks its 30th anniversary. The UMLS brings together many broadly used vocabularies and standards in the biomedical field to facilitate interoperability among different computer systems and applications.
OBJECTIVE: Despite its longevity, there is no comprehensive publication analysis of the use of the UMLS. This review and analysis therefore provides an overview of the UMLS, with the objective of giving a comprehensive understanding of how it has been used in English-language peer-reviewed publications over the last 30 years.
METHODS: PubMed, ACM Digital Library, and the Nursing & Allied Health Database were used to search for studies. The primary search strategy was as follows: UMLS was used as a Medical Subject Headings term or a keyword or appeared in the title or abstract. Only English-language publications were considered. The publications were screened first, then coded and categorized iteratively, following grounded theory. The review process followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines.
RESULTS: A total of 943 publications were included in the final analysis. Because 32 publications were assigned to 2 categories each, the total count before removing duplicates is 975. After analysis and categorization of the publications, UMLS was found to be used in the following emerging themes or areas (the number of publications and their respective percentages are given in parentheses): natural language processing (230/975, 23.6%), information retrieval (125/975, 12.8%), terminology study (90/975, 9.2%), ontology and modeling (80/975, 8.2%), medical subdomains (76/975, 7.8%), other language studies (53/975, 5.4%), artificial intelligence tools and applications (46/975, 4.7%), patient care (35/975, 3.6%), data mining and knowledge discovery (25/975, 2.6%), medical education (20/975, 2.1%), degree-related theses (13/975, 1.3%), digital library (5/975, 0.5%), and the UMLS itself (150/975, 15.4%), as well as the UMLS for other purposes (27/975, 2.8%).
CONCLUSIONS: The UMLS has been used successfully in patient care, medical education, digital libraries, and software development, as originally planned, as well as in degree-related theses, the building of artificial intelligence tools, data mining and knowledge discovery, foundational work in methodology, and middle layers that may lead to advanced products. Natural language processing, the UMLS itself, and information retrieval are the 3 most common themes that emerged among the included publications. The results, although largely related to academia, demonstrate that the UMLS achieves its intended uses successfully, in addition to achieving uses broadly beyond its original intentions.
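The theme counts reported in this abstract can be checked with a few lines of Python. This is only an arithmetic sketch over the numbers quoted above; the dictionary name `THEME_COUNTS` and the variable names are illustrative. It confirms that the counts sum to 975 (943 unique publications plus 32 counted twice) and recovers the three most common themes named in the Conclusions.

```python
# Publication counts per theme, copied from the abstract.
THEME_COUNTS = {
    "natural language processing": 230,
    "information retrieval": 125,
    "terminology study": 90,
    "ontology and modeling": 80,
    "medical subdomains": 76,
    "other language studies": 53,
    "artificial intelligence tools and applications": 46,
    "patient care": 35,
    "data mining and knowledge discovery": 25,
    "medical education": 20,
    "degree-related theses": 13,
    "digital library": 5,
    "the UMLS itself": 150,
    "the UMLS for other purposes": 27,
}

total = sum(THEME_COUNTS.values())
percentages = {theme: round(100 * n / total, 1) for theme, n in THEME_COUNTS.items()}
top_three = sorted(THEME_COUNTS, key=THEME_COUNTS.get, reverse=True)[:3]

print("total categorized publications:", total)
print("unique publications:", total - 32)  # 32 publications fall in 2 categories
for theme in top_three:
    print(f"{theme}: {THEME_COUNTS[theme]}/{total} ({percentages[theme]}%)")
```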
Affiliation(s)
- Xia Jing
  - Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC, United States
7
Liu S, Luo Y, Stone D, Zong N, Wen A, Yu Y, Rasmussen LV, Wang F, Pathak J, Liu H, Jiang G. Integration of NLP2FHIR Representation with Deep Learning Models for EHR Phenotyping: A Pilot Study on Obesity Datasets. AMIA Joint Summits on Translational Science Proceedings 2021; 2021:410-419. [PMID: 34457156 PMCID: PMC8378603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
HL7 Fast Healthcare Interoperability Resources (FHIR) is one of the current data standards for enabling electronic healthcare information exchange. Previous studies have shown that FHIR is capable of modeling both structured and unstructured data from electronic health records (EHRs). However, the capability of FHIR in enabling clinical data analytics has not been well investigated. The objective of the study is to demonstrate how FHIR-based representation of unstructured EHR data can be ported to deep learning models for text classification in clinical phenotyping. We leverage and extend the NLP2FHIR clinical data normalization pipeline and conduct a case study with two obesity datasets. We tested several deep learning-based text classifiers, such as convolutional neural networks, gated recurrent units, and text graph convolutional networks, on both raw text and NLP2FHIR inputs. We found that the combination of NLP2FHIR input and text graph convolutional networks has the highest F1 score. Therefore, FHIR-based deep learning methods have the potential to be leveraged in supporting EHR phenotyping, making the phenotyping algorithms more portable across EHR systems and institutions.
Affiliation(s)
- Yuan Luo
  - Northwestern University, Chicago, IL
- Yue Yu
  - Mayo Clinic, Rochester, MN
- Fei Wang
  - Weill Cornell Medicine, New York, NY
8
Validation of deep learning natural language processing algorithm for keyword extraction from pathology reports in electronic health records. Sci Rep 2020; 10:20265. [PMID: 33219276 PMCID: PMC7679382 DOI: 10.1038/s41598-020-77258-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/24/2020] [Accepted: 11/05/2020] [Indexed: 11/20/2022]
Abstract
Pathology reports contain essential data for both clinical and research purposes. However, the extraction of meaningful, qualitative data from the original document is difficult due to the narrative and complex nature of such reports. Keyword extraction for pathology reports is necessary to summarize the informative text and reduce the time needed to process it. In this study, we employed a deep learning model for natural language processing to extract keywords from pathology reports and presented the supervised keyword extraction algorithm. We considered three types of pathological keywords, namely specimen, procedure, and pathology types. We compared the performance of the present algorithm with conventional keyword extraction methods on 3115 pathology reports that were manually labeled by professional pathologists. Additionally, we applied the present algorithm to 36,014 unlabeled pathology reports and analysed the extracted keywords with biomedical vocabulary sets. The results demonstrated the suitability of our model for practical application in extracting important data from pathology reports.
9
Kersloot MG, van Putten FJP, Abu-Hanna A, Cornet R, Arts DL. Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies. J Biomed Semantics 2020; 11:14. [PMID: 33198814 PMCID: PMC7670625 DOI: 10.1186/s13326-020-00231-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 07/17/2020] [Accepted: 11/03/2020] [Indexed: 12/23/2022]
Abstract
BACKGROUND: Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations.
METHODS: Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies' objectives were categorized by way of induction. These results were used to define recommendations.
RESULTS: Two thousand three hundred fifty-five unique studies were identified. Two hundred fifty-six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed.
CONCLUSION: We found many heterogeneous approaches to the reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.
Affiliation(s)
- Martijn G. Kersloot
  - Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
  - Castor EDC, Amsterdam, The Netherlands
- Florentien J. P. van Putten
  - Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Ameen Abu-Hanna
  - Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Ronald Cornet
  - Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Derk L. Arts
  - Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
  - Castor EDC, Amsterdam, The Netherlands
10
|
Abdulkadhar S, Bhasuran B, Natarajan J. Multiscale Laplacian graph kernel combined with lexico-syntactic patterns for biomedical event extraction from literature. Knowl Inf Syst 2020. [DOI: 10.1007/s10115-020-01514-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
|
11
|
Development of a Novel Tool for the Retrieval and Analysis of Hormone Receptor Expression Characteristics in Metastatic Breast Cancer via Data Mining on Pathology Reports. BIOMED RESEARCH INTERNATIONAL 2020; 2020:2654815. [PMID: 32566676 PMCID: PMC7273481 DOI: 10.1155/2020/2654815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/14/2020] [Accepted: 04/30/2020] [Indexed: 12/24/2022]
Abstract
Information about the expression status of hormone receptors such as estrogen receptor (ER), progesterone receptor (PR), and Her-2 is crucial in the management and prognosis of breast cancer. Therefore, the retrieval and analysis of hormone receptor expression characteristics in metastatic breast cancer may be valuable in breast cancer study. Herein, we report a text mining tool based on word/phrase matching that retrieves hormone receptor expression data of regional or distant metastatic breast cancer from pathology reports. It was tested on pathology reports at the China Medical University Hospital from 2013 to 2018. The tool showed specificities of 91.6% and 63.3% for the detection of regional lymph node metastasis and distant metastasis, respectively. Sensitivity in immunohistochemical study result extraction in these cases was 98.6% for distant metastasis and 78.3% for regional lymph node metastasis. Statistical analysis of these retrieved data showed significant differences in PR and Her-2 expression between regional and metastatic breast cancer, which is compatible with previous studies. In conclusion, our study shows that metastatic breast cancer hormone receptor expression characteristics can be retrieved by text mining. The algorithm designed in this study may be useful in future studies about text mining in pathology reports.
|
12
|
Abstract
Electronic Health Records (EHRs) are a rich repository of valuable clinical information that exists in primary and secondary care databases. In order to utilize EHRs for medical observational research, a range of algorithms for automatically identifying individuals with a specific phenotype have been developed. This review summarizes and critically evaluates the literature on the development of EHR phenotyping systems. It describes phenotyping systems and techniques based on structured and unstructured EHR data. Articles published on PubMed and Google Scholar between 2013 and 2017 have been reviewed, using search terms derived from Medical Subject Headings (MeSH). The popularity of using Natural Language Processing (NLP) techniques in extracting features from narrative text has increased. This increased attention is due to the availability of open source NLP algorithms, combined with accuracy improvement. In this review, concept extraction is the most popular NLP technique, since it has been used by more than 50% of the reviewed papers to extract features from EHRs. High-throughput phenotyping systems using unsupervised machine learning techniques have gained more popularity due to their ability to efficiently and automatically extract a phenotype with minimal human effort.
|
13
|
Moraes LO, Pedreira CE, Barrena S, Lopez A, Orfao A. A decision-tree approach for the differential diagnosis of chronic lymphoid leukemias and peripheral B-cell lymphomas. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 178:85-90. [PMID: 31416565 DOI: 10.1016/j.cmpb.2019.06.014] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/04/2018] [Revised: 06/07/2019] [Accepted: 06/12/2019] [Indexed: 06/10/2023]
Abstract
BACKGROUND AND OBJECTIVE Here we propose a decision-tree approach for the differential diagnosis of distinct WHO categories of B-cell chronic lymphoproliferative disorders using flow cytometry data. Flow cytometry is the preferred method for the immunophenotypic characterization of leukemia and lymphoma, being able to process and register multiparametric data about tens of thousands of cells per second. METHODS The proposed decision tree is composed of logistic function nodes that branch throughout the tree into sets of (possible) distinct leukemia/lymphoma diagnoses. To avoid overfitting, regularization via the Lasso algorithm was used. The code can be run online at https://codeocean.com/2018/03/08/a-decision-tree-approach-for-the-differential-diagnosis-of-chronic-lymphoid-leukemias-and-peripheral-b-cell-lymphomas/ or downloaded from https://github.com/lauramoraes/bioinformatics-sourcecode to be executed in Matlab. RESULTS The proposed approach was validated in diagnostic peripheral blood and bone marrow samples from 283 patients with mature lymphoid leukemias/lymphomas. The proposed approach achieved 95% correctness in the cross-validation test phase (100% in-sample), with 61% giving a single diagnosis and 34% giving (possible) multiple disease diagnoses. Similar results were obtained in an out-of-sample validation dataset. The generated tree reached the final diagnoses after up to seven decision nodes. CONCLUSIONS Here we propose a decision-tree approach for the differential diagnosis of mature lymphoid leukemias/lymphomas which proved to be accurate during out-of-sample validation. The full process is accomplished through seven transparent binary decision nodes.
Affiliation(s)
- L O Moraes
- Rua Horacio Macedo 2030, Rio de Janeiro/RJ, CEP: 21941-914, Brazil.
- C E Pedreira
- Rua Horacio Macedo 2030, Rio de Janeiro/RJ, CEP: 21941-914, Brazil.
- S Barrena
- Lab 11, Centro de Investigacion del Cancer, Paseo de la Universidad de Coimbra, Campus Miguel Unamuno, 37002 Salamanca, España.
- A Lopez
- Lab 11, Centro de Investigacion del Cancer, Paseo de la Universidad de Coimbra, Campus Miguel Unamuno, 37002 Salamanca, España.
- A Orfao
- Lab 11, Centro de Investigacion del Cancer, Paseo de la Universidad de Coimbra, Campus Miguel Unamuno, 37002 Salamanca, España.
|
14
|
Zeng Z, Zhao Y, Sun M, Vo AH, Starren J, Luo Y. Rich Text Formatted EHR Narratives: A Hidden and Ignored Trove. Stud Health Technol Inform 2019; 264:472-476. [PMID: 31437968 PMCID: PMC8060951 DOI: 10.3233/shti190266] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/05/2023]
Abstract
This study presents an approach for mining structured information from clinical narratives in Electronic Health Records (EHRs) by using Rich Text Formatted (RTF) records. RTF is adopted by many medical information management systems. There is rich structural information in these files which can be extracted and interpreted, yet such information is largely ignored. We investigate multiple types of EHR narratives in the Enterprise Data Warehouse from a large multisite healthcare chain consisting of both an academic medical center and community hospitals. We focus on the RTF constructs related to tables and sections that are not available in plain text EHR narratives. We show how to parse these RTF constructs and analyze their prevalence and characteristics in the context of multiple types of EHR narratives. Our case study demonstrates the additional utility of the features derived from RTF constructs over plain text oriented NLP.
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Yuan Zhao
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Mengxin Sun
- Hospital Medicine, Northwestern Memorial Hospital, Chicago, IL, USA
- Andy H Vo
- Committee on Developmental Biology and Regenerative Medicine, The University of Chicago, Chicago, IL, USA
- Justin Starren
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
|
15
|
Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review. JMIR Med Inform 2019; 7:e12239. [PMID: 31066697 PMCID: PMC6528438 DOI: 10.2196/12239] [Citation(s) in RCA: 230] [Impact Index Per Article: 46.0] [Received: 09/17/2018] [Revised: 03/04/2019] [Accepted: 03/24/2019] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions on the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Methods based on machine learning to process EHRs are resulting in improved understanding of patient clinical trajectories and chronic disease risk prediction, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical histories remains locked behind clinical narratives in free-form text. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods to automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset. OBJECTIVE The goal of the research was to provide a comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases, including the investigation of challenges faced by NLP methodologies in understanding clinical narratives. METHODS Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed and searches were conducted in 5 databases using "clinical notes," "natural language processing," and "chronic disease" and their variations as keywords to maximize coverage of the articles. RESULTS Of the 2652 articles considered, 106 met the inclusion criteria. Review of the included papers resulted in identification of 43 chronic diseases, which were then further classified into 10 disease categories using the International Classification of Diseases, 10th Revision. The majority of studies focused on diseases of the circulatory system (n=38) while endocrine and metabolic diseases were fewest (n=14). 
This was due to the structure of clinical records related to metabolic diseases, which typically contain much more structured data, compared with medical records for diseases of the circulatory system, which focus more on unstructured data and consequently have seen a stronger focus of NLP. The review has shown that there is a significant increase in the use of machine learning methods compared to rule-based approaches; however, deep learning methods remain emergent (n=3). Consequently, the majority of works focus on classification of disease phenotype with only a handful of papers addressing extraction of comorbidities from the free text or integration of clinical notes with structured data. There is a notable use of relatively simple methods, such as shallow classifiers (or combination with rule-based methods), due to the interpretability of predictions, which still represents a significant issue for more complex methods. Finally, scarcity of publicly available data may also have contributed to insufficient development of more advanced methods, such as extraction of word embeddings from clinical notes. CONCLUSIONS Efforts are still required to improve (1) progression of clinical NLP methods from extraction toward understanding; (2) recognition of relations among entities rather than entities in isolation; (3) temporal extraction to understand past, current, and future clinical events; (4) exploitation of alternative sources of clinical knowledge; and (5) availability of large-scale, de-identified clinical corpora.
Affiliation(s)
- Seyedmostafa Sheikhalishahi
- eHealth Research Group, Fondazione Bruno Kessler Research Institute, Trento, Italy
- Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
- Riccardo Miotto
- Institute for Next Generation Healthcare, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Joel T Dudley
- Institute for Next Generation Healthcare, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Alberto Lavelli
- NLP Research Group, Fondazione Bruno Kessler Research Institute, Trento, Italy
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Venet Osmani
- eHealth Research Group, Fondazione Bruno Kessler Research Institute, Trento, Italy
|
16
|
Zeng Z, Yao L, Roy A, Li X, Espino S, Clare SE, Khan SA, Luo Y. Identifying Breast Cancer Distant Recurrences from Electronic Health Records Using Machine Learning. JOURNAL OF HEALTHCARE INFORMATICS RESEARCH 2019; 3:283-299. [PMID: 33225204 DOI: 10.1007/s41666-019-00046-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/28/2022]
Abstract
Accurately identifying distant recurrences in breast cancer from the Electronic Health Records (EHR) is important for both clinical care and secondary analysis. Although multiple applications have been developed for computational phenotyping in breast cancer, distant recurrence identification still relies heavily on manual chart review. In this study, we aim to develop a model that identifies distant recurrences in breast cancer using clinical narratives and structured data from EHR. We applied MetaMap to extract features from clinical narratives and also retrieved structured clinical data from EHR. Using these features, we trained a support vector machine model to identify distant recurrences in breast cancer patients. We trained the model using 1,396 double-annotated subjects and validated the model using 599 double-annotated subjects. In addition, we validated the model on a set of 4,904 single-annotated subjects as a generalization test. In the held-out test and generalization test, we obtained F-measure scores of 0.78 and 0.74, area under curve (AUC) scores of 0.95 and 0.93, respectively. To explore the representation learning utility of deep neural networks, we designed multiple convolutional neural networks and multilayer neural networks to identify distant recurrences. Using the same test set and generalizability test set, we obtained F-measure scores of 0.79 ± 0.02 and 0.74 ± 0.004, AUC scores of 0.95 ± 0.002 and 0.95 ± 0.01, respectively. Our model can accurately and efficiently identify distant recurrences in breast cancer by combining features extracted from unstructured clinical narratives and structured clinical data.
Affiliation(s)
- Zexian Zeng
- Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Liang Yao
- Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Ankita Roy
- Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Xiaoyu Li
- Social and Behavioral Sciences Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sasa Espino
- Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Susan E Clare
- Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Seema A Khan
- Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Yuan Luo
- Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
|
17
|
Chang KP, Chu YW, Wang J. Analysis of Hormone Receptor Status in Primary and Recurrent Breast Cancer Via Data Mining Pathology Reports. Open Med (Wars) 2019; 14:91-98. [PMID: 30847396 PMCID: PMC6401490 DOI: 10.1515/med-2019-0013] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/16/2018] [Accepted: 12/05/2018] [Indexed: 11/15/2022] Open
Abstract
Background Hormone receptors of breast cancer, such as estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (Her-2), are important prognostic factors for breast cancer. Objective The current study aimed to develop a method to retrieve the statistics of hormone receptor expression status, documented in pathology reports, given their importance in research for primary and recurrent breast cancer, and quality management of pathology laboratories. Method A two-stage text mining approach via regular expression-based word/phrase matching, was developed to retrieve the data. Results The method achieved a sensitivity of 98.8%, 98.7% and 98.4% for extraction of ER, PR, and Her-2 results. The hormone expression status from 3679 primary and 44 recurrent breast cancer cases was successfully retrieved with the method. Statistical analysis of these data showed that the recurrent disease had a significantly lower positivity rate for ER (54.5% vs 76.5%, p=0.001278) than primary breast cancer and a higher positivity rate for Her-2 (48.8% vs 16.2%, p=9.79e-8). These results corroborated the previous literature. Conclusion Text mining on pathology reports using the developed method may benefit research of primary and recurrent breast cancer.
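As a rough illustration of the two-stage regular-expression matching described in this abstract, the sketch below first locates sentences that mention a receptor and then reads a positive/negative status from the same sentence. The patterns, the sample report wording, and the function name `extract_receptor_status` are assumptions made for this example; they are not the rules used by the authors.

```python
import re

# Stage 1 (hypothetical patterns): does a sentence mention a given receptor?
RECEPTOR_PATTERNS = {
    "ER": re.compile(r"\b(?:estrogen receptor|ER)\b", re.IGNORECASE),
    "PR": re.compile(r"\b(?:progesterone receptor|PR)\b", re.IGNORECASE),
    "Her-2": re.compile(r"\bHER-?2(?:/neu)?\b", re.IGNORECASE),
}

# Stage 2 (hypothetical pattern): read the reported status from that sentence.
STATUS_PATTERN = re.compile(r"\b(positive|negative)\b", re.IGNORECASE)


def extract_receptor_status(report: str) -> dict:
    """Return the first status found for each receptor mentioned in the report."""
    results = {}
    # Naive sentence split; real pathology reports would need a sturdier splitter.
    for sentence in re.split(r"[.;\n]+", report):
        for receptor, pattern in RECEPTOR_PATTERNS.items():
            if receptor in results or not pattern.search(sentence):
                continue
            status = STATUS_PATTERN.search(sentence)
            if status:
                results[receptor] = status.group(1).lower()
    return results


# Invented report text, used only to exercise the sketch.
example = (
    "Estrogen receptor: positive (90%). "
    "Progesterone receptor: negative. "
    "HER2/neu: negative (score 1+)."
)
statuses = extract_receptor_status(example)
print(statuses)
```

Sensitivity figures such as those reported in the abstract would come from comparing output like `statuses` against manually reviewed reports, which this sketch does not attempt.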
Affiliation(s)
- Kai-Po Chang
- Department of Pathology, China Medical University Hospital, Taichung 404, Taiwan; Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung 402, Taiwan
- Yen-Wei Chu
- Biotechnology Center, Agricultural Biotechnology Center, Institute of Molecular Biology, National Chung Hsing University, Taichung 402, Taiwan; Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung 402, Taiwan
- John Wang
- Department of Pathology, China Medical University Hospital, Taichung 404, Taiwan
|
18
|
Zeng Z, Deng Y, Li X, Naumann T, Luo Y. Natural Language Processing for EHR-Based Computational Phenotyping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:139-153. [PMID: 29994486 PMCID: PMC6388621 DOI: 10.1109/tcbb.2018.2849968] [Citation(s) in RCA: 90] [Impact Index Per Article: 18.0] [Indexed: 06/08/2023]
Abstract
This article reviews recent advances in applying natural language processing (NLP) to Electronic Health Records (EHRs) for computational phenotyping. NLP-based computational phenotyping has numerous applications including diagnosis categorization, novel phenotype discovery, clinical trial screening, pharmacogenomics, drug-drug interaction (DDI), and adverse drug event (ADE) detection, as well as genome-wide and phenome-wide association studies. Significant progress has been made in algorithm development and resource construction for computational phenotyping. Among the surveyed methods, well-designed keyword search and rule-based systems often achieve good performance. However, the construction of keyword and rule lists requires significant manual effort, which is difficult to scale. Supervised machine learning models have been favored because they are capable of acquiring both classification patterns and structures from data. Recently, deep learning and unsupervised learning have received growing attention, with the former favored for its performance and the latter for its ability to find novel phenotypes. Integrating heterogeneous data sources has become increasingly important and has shown promise in improving model performance. Often, better performance is achieved by combining multiple modalities of information. Despite these many advances, challenges and opportunities remain for NLP-based computational phenotyping, including better model interpretability and generalizability, and proper characterization of feature relations in clinical narratives.
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611.
- Yu Deng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611.
- Xiaoyu Li
- Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA 02115.
- Tristan Naumann
- Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139.
- Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611.
|
19
|
Zeng Z, Espino S, Roy A, Li X, Khan SA, Clare SE, Jiang X, Neapolitan R, Luo Y. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinformatics 2018; 19:498. [PMID: 30591037 PMCID: PMC6309052 DOI: 10.1186/s12859-018-2466-x] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. Developing a model using natural language processing and machine learning to identify local recurrences in breast cancer patients can reduce the time-consuming work of a manual chart review. METHODS We design a novel concept-based filter and a prediction model to detect local recurrences using EHRs. In the training dataset, we manually review a development corpus of 50 progress notes and extract partial sentences that indicate breast cancer local recurrence. We process these partial sentences to obtain a set of Unified Medical Language System (UMLS) concepts using MetaMap, and we call it positive concept set. We apply MetaMap on patients' progress notes and retain only the concepts that fall within the positive concept set. These features combined with the number of pathology reports recorded for each patient are used to train a support vector machine to identify local recurrences. RESULTS We compared our model with three baseline classifiers using either full MetaMap concepts, filtered MetaMap concepts, or bag of words. Our model achieved the best AUC (0.93 in cross-validation, 0.87 in held-out testing). CONCLUSIONS Compared to a labor-intensive chart review, our model provides an automated way to identify breast cancer local recurrences. We expect that by minimally adapting the positive concept set, this study has the potential to be replicated at other institutions with a moderately sized training dataset.
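The concept-based filter in this abstract can be pictured with the following minimal sketch: concepts extracted from a patient's notes are kept only if they belong to the positive concept set, and the resulting counts are combined with the pathology report count into one feature vector. MetaMap itself is not called here; the concept identifiers are placeholders rather than real UMLS CUIs, and the function names are hypothetical.

```python
from collections import Counter


def filter_concepts(note_concepts, positive_concepts):
    """Count only the concepts that fall within the positive concept set."""
    return Counter(c for c in note_concepts if c in positive_concepts)


def build_feature_vector(note_concepts, positive_concepts, n_pathology_reports):
    """One count per positive concept (in sorted order), then the report count."""
    counts = filter_concepts(note_concepts, positive_concepts)
    ordered = sorted(positive_concepts)
    return [counts[c] for c in ordered] + [n_pathology_reports]


# Placeholder identifiers standing in for concepts an extractor would return.
positive = {"CUI_A", "CUI_B", "CUI_C"}
note = ["CUI_B", "CUI_X", "CUI_A", "CUI_B"]

features = build_feature_vector(note, positive, n_pathology_reports=2)
print(features)  # [1, 2, 0, 2]
```

Vectors of this shape could then be passed to any support vector machine implementation; the choice of library and kernel is left open here because the abstract does not specify it.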
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Sasa Espino
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Ankita Roy
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Xiaoyu Li
- Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Seema A Khan
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Susan E Clare
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Xia Jiang
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Richard Neapolitan
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.
|
20
|
Luo Y, Szolovits P. Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2018; 2018:461-466. [PMID: 33376623 PMCID: PMC7769694 DOI: 10.1109/bibm.2018.8621521] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/12/2023]
Abstract
This paper presents a Lisp architecture for a portable NLP system, termed LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard, customized and in-house developed NLP tools. Our system facilitates portability across different institutions and data systems by incorporating an enriched Common Data Model (CDM) to standardize necessary data elements. It utilizes UMLS to perform domain adaptation when integrating generic domain NLP tools. It also features stand-off annotations that are specified by positional reference to the original document. We built an interval tree based search engine to efficiently query and retrieve the stand-off annotations by specifying positional requirements. We also developed a utility to convert an inline annotation format to stand-off annotations to enable the reuse of clinical text datasets with in-line annotations. We experimented with our system on several NLP facilitated tasks including computational phenotyping for lymphoma patients and semantic relation extraction for clinical notes. These experiments showcased the broader applicability and utility of LAPNLP.
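To make the idea of stand-off annotations concrete, here is a small sketch in which annotations are stored apart from the text and refer to it only by character offsets. It is written in Python rather than Lisp, and it answers positional queries with a plain linear scan instead of the interval tree the abstract describes, so it illustrates the data model rather than LAPNLP's search engine. All names (`Annotation`, `annotate`, `overlapping`) and the sample sentence are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset into the original document
    end: int    # exclusive character offset
    label: str


def annotate(text, phrase, label):
    """Create a stand-off annotation for the first occurrence of phrase."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), label)


def overlapping(annotations, start, end):
    """Return annotations whose span overlaps [start, end).

    Linear scan used as a simple stand-in for an interval tree query.
    """
    return [a for a in annotations if a.start < end and start < a.end]


text = "Patient has diffuse large B-cell lymphoma."
annotations = [
    annotate(text, "Patient", "person"),
    annotate(text, "diffuse large B-cell lymphoma", "problem"),
    annotate(text, "lymphoma", "disease"),
]

# The document text is never modified; spans are recovered by slicing.
hits = overlapping(annotations, 30, 35)
print([(a.label, text[a.start:a.end]) for a in hits])
```

An interval tree would return the same results while avoiding a scan over every annotation, which matters once a corpus carries many annotations per note.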
Affiliation(s)
- Yuan Luo
- Dept. of Preventive Medicine, Northwestern University, Chicago, USA
|
21
|
Zeng Z, Li X, Espino S, Roy A, Kitsch K, Clare S, Khan S, Luo Y. Contralateral Breast Cancer Event Detection Using Nature Language Processing. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2017:1885-1892. [PMID: 29854260 PMCID: PMC5977664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
To facilitate the identification of contralateral breast cancer events for large cohort studies, we proposed and implemented a new method based on features extracted from narrative text in progress notes and features from the numbers of pathology reports for each side of breast cancer. Our method collects medical concepts and their combinations to detect contralateral events in progress notes. In addition, the numbers of pathology reports generated for either the left or right side of breast cancer were derived as additional features. We experimented with a support vector machine using the derived features to detect contralateral events. In the cross-validation and held-out tests, the area under curve scores are 0.93 and 0.89, respectively. This method can be replicated due to the simplicity of feature generation.
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Xiaoyu Li
- Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sasa Espino
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Ankita Roy
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Kristen Kitsch
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Susan Clare
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Seema Khan
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
|
22
|
Lossio-Ventura JA, Hogan W, Modave F, Guo Y, He Z, Hicks A, Bian J. OC-2-KB: A software pipeline to build an evidence-based obesity and cancer knowledge base. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2018; 2017:1284-1287. [PMID: 29629236 DOI: 10.1109/bibm.2017.8217845] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
Obesity has been linked to several types of cancer. Access to adequate health information activates people's participation in managing their own health, which ultimately improves their health outcomes. Nevertheless, the existing online information about the relationship between obesity and cancer is heterogeneous and poorly organized. A formal knowledge representation can help better organize and deliver quality health information. Currently, there are several efforts in the biomedical domain to convert unstructured data to structured data and store them in Semantic Web knowledge bases (KB). In this demo paper, we present OC-2-KB (Obesity and Cancer to Knowledge Base), a system that is tailored to guide the automatic KB construction for managing obesity and cancer knowledge from free-text scientific literature (i.e., PubMed abstracts) in a systematic way. OC-2-KB has two important modules, which perform the acquisition of entities and the extraction and subsequent classification of relationships among these entities. We tested the OC-2-KB system on a data set with 23 manually annotated obesity and cancer PubMed abstracts and created a preliminary KB with 765 triples. We conducted a preliminary evaluation on this sample of triples and reported our evaluation results.
Affiliation(s)
- William Hogan
- Health Outcomes & Policy, College of Medicine, University of Florida, Gainesville, Florida, USA
- François Modave
- Health Outcomes & Policy, College of Medicine, University of Florida, Gainesville, Florida, USA
- Yi Guo
- Health Outcomes & Policy, College of Medicine, University of Florida, Gainesville, Florida, USA
- Zhe He
- School of Information, Florida State University, Tallahassee, Florida, USA
- Amanda Hicks
- Health Outcomes & Policy, College of Medicine, University of Florida, Gainesville, Florida, USA
- Jiang Bian
- Health Outcomes & Policy, College of Medicine, University of Florida, Gainesville, Florida, USA
|
23
|
Discriminant document embeddings with an extreme learning machine for classifying clinical narratives. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.01.117] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 11/20/2022]
|
24
|
Luo Y, Cheng Y, Uzuner Ö, Szolovits P, Starren J. Segment convolutional neural networks (Seg-CNNs) for classifying relations in clinical notes. J Am Med Inform Assoc 2018; 25:93-98. [PMID: 29025149 PMCID: PMC6381760 DOI: 10.1093/jamia/ocx090] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 04/21/2017] [Revised: 07/28/2017] [Accepted: 08/05/2017] [Indexed: 11/13/2022] Open
Abstract
We propose Segment Convolutional Neural Networks (Seg-CNNs) for classifying relations from clinical notes. Seg-CNNs use only word-embedding features without manual feature engineering. Unlike typical CNN models, relations between 2 concepts are identified by simultaneously learning separate representations for text segments in a sentence: preceding, concept1, middle, concept2, and succeeding. We evaluate Seg-CNN on the i2b2/VA relation classification challenge dataset. We show that Seg-CNN achieves a state-of-the-art micro-average F-measure of 0.742 for overall evaluation, 0.686 for classifying medical problem-treatment relations, 0.820 for medical problem-test relations, and 0.702 for medical problem-medical problem relations. We demonstrate the benefits of learning segment-level representations. We show that medical domain word embeddings help improve relation classification. Seg-CNNs can be trained quickly for the i2b2/VA dataset on a graphics processing unit (GPU) platform. These results support the use of CNNs computed over segments of text for classifying medical relations, as they show state-of-the-art performance while requiring no manual feature engineering.
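The five-segment split described in this abstract can be illustrated with a small sketch. It is not the authors' code: the function name `split_segments`, the token-offset convention (end exclusive), and the example sentence are assumptions made for illustration. In a Seg-CNN-style model, each returned segment would then be embedded and passed through its own convolutional unit.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # token offsets; the end offset is exclusive (assumed convention)


def split_segments(tokens: List[str], concept1: Span, concept2: Span) -> Dict[str, List[str]]:
    """Split a tokenized sentence into the five segments named in the abstract."""
    # Order the two concept spans by position so "concept1" is the earlier one.
    (s1, e1), (s2, e2) = sorted([concept1, concept2])
    if not (0 <= s1 < e1 <= s2 < e2 <= len(tokens)):
        raise ValueError("concept spans must be non-empty, non-overlapping, and in range")
    return {
        "preceding": tokens[:s1],
        "concept1": tokens[s1:e1],
        "middle": tokens[e1:s2],
        "concept2": tokens[s2:e2],
        "succeeding": tokens[e2:],
    }


# Hypothetical example sentence with a treatment and a problem concept.
tokens = "The patient was given aspirin for chest pain yesterday".split()
segments = split_segments(tokens, concept1=(4, 5), concept2=(6, 8))
```

Here `segments["concept1"]` holds the treatment mention and `segments["concept2"]` the problem mention; the remaining three entries are the context segments.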
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
- Yu Cheng
- AI Foundations, IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA
- Özlem Uzuner
- Department of Computer Science, State University of New York at Albany, Albany, NY, USA
- Peter Szolovits
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Justin Starren
- Department of Preventive Medicine and Medical Social Science, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
|
25
|
Sulieman L, Gilmore D, French C, Cronin RM, Jackson GP, Russell M, Fabbri D. Classifying patient portal messages using Convolutional Neural Networks. J Biomed Inform 2017; 74:59-70. [DOI: 10.1016/j.jbi.2017.08.014] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 03/29/2017] [Revised: 08/02/2017] [Accepted: 08/28/2017] [Indexed: 12/31/2022]
|
26
|
Luo Y. Recurrent neural networks for classifying relations in clinical notes. J Biomed Inform 2017; 72:85-95. [PMID: 28694119 PMCID: PMC6657689 DOI: 10.1016/j.jbi.2017.07.006] [Citation(s) in RCA: 85] [Impact Index Per Article: 12.1] [Received: 03/15/2017] [Revised: 06/13/2017] [Accepted: 07/06/2017] [Indexed: 01/16/2023]
Abstract
We proposed the first models based on recurrent neural networks (more specifically Long Short-Term Memory - LSTM) for classifying relations from clinical notes. We tested our models on the i2b2/VA relation classification challenge dataset. We showed that our segment LSTM model, with only word-embedding features and no manual feature engineering, achieved a micro-averaged F-measure of 0.661 for classifying medical problem-treatment relations, 0.800 for medical problem-test relations, and 0.683 for medical problem-medical problem relations. These results are comparable to those of the state-of-the-art systems on the i2b2/VA relation classification challenge. We compared the segment LSTM model with the sentence LSTM model, and demonstrated the benefits of exploring the difference between concept text and context text, and between different contextual parts in the sentence. We also evaluated the impact of word embeddings on the performance of LSTM models and showed that medical domain word embeddings help improve relation classification. These results support the use of LSTM models for classifying relations between medical concepts, as they show comparable performance to previously published systems while requiring no manual feature engineering.
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Division of Health and Biomedical Informatics, Northwestern University, Chicago, IL, United States.
|
27
|
|
28
|
Luo Y, Ahmad FS, Shah SJ. Tensor Factorization for Precision Medicine in Heart Failure with Preserved Ejection Fraction. J Cardiovasc Transl Res 2017; 10:305-312. [PMID: 28116551 PMCID: PMC5515683 DOI: 10.1007/s12265-016-9727-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 11/21/2016] [Accepted: 12/23/2016] [Indexed: 02/07/2023]
Abstract
Heart failure with preserved ejection fraction (HFpEF) is a heterogeneous clinical syndrome that may benefit from improved subtyping in order to better characterize its pathophysiology and to develop novel targeted therapies. The United States Precision Medicine Initiative comes amid the rapid growth in quantity and modality of clinical data for HFpEF patients ranging from deep phenotypic to trans-omic data. Tensor factorization, a form of machine learning, allows for the integration of multiple data modalities to derive clinically relevant HFpEF subtypes that may have significant differences in underlying pathophysiology and differential response to therapies. Tensor factorization also allows for better interpretability by supporting dimensionality reduction and identifying latent groups of data for meaningful summarization of both features and disease outcomes. In this narrative review, we analyze the modest literature on the application of tensor factorization to related biomedical fields including genotyping and phenotyping. Based on the cited work including work of our own, we suggest multiple tensor factorization formulations capable of integrating the deep phenotypic and trans-omic modalities of data for HFpEF, or accounting for interactions between genetic variants at different omic hierarchies. We encourage extensive experimental studies to tackle challenges in applying tensor factorization for precision medicine in HFpEF, including effectively incorporating existing medical knowledge, properly accounting for uncertainty, and efficiently enforcing sparsity for better interpretability.
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, 11th Floor, Arthur Rubloff Building, 750 N. Lake Shore Drive, Chicago, IL, 60611, USA.
- Faraz S Ahmad
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, 11th Floor, Arthur Rubloff Building, 750 N. Lake Shore Drive, Chicago, IL, 60611, USA
- Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Sanjiv J Shah
- Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
|
29
|
Luo Y, Wang F, Szolovits P. Tensor factorization toward precision medicine. Brief Bioinform 2017; 18:511-514. [PMID: 26994614 PMCID: PMC6078180 DOI: 10.1093/bib/bbw026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 12/07/2015] [Revised: 01/08/2016] [Indexed: 11/13/2022] Open
Abstract
Precision medicine initiatives come amid the rapid growth in quantity and variety of biomedical data, which exceeds the capacity of matrix-oriented data representations and many current analysis algorithms. Tensor factorizations extend the matrix view to multiple modalities and support dimensionality reduction methods that identify latent groups of data for meaningful summarization of both features and instances. In this opinion article, we analyze the modest literature on applying tensor factorization to various biomedical fields including genotyping and phenotyping. Based on the cited work including work of our own, we suggest that tensor applications could serve as an effective tool to enable frequent updating of medical knowledge based on the continually growing scientific and clinical evidence. We encourage extensive experimental studies to tackle challenges including design choice of factorizations, integrating temporality and algorithm scalability.
|
30
|
Roberts K, Rodriguez L, Shooshan SE, Demner-Fushman D. Resource Classification for Medical Questions. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:1040-1049. [PMID: 28269901 PMCID: PMC5333297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
We present an approach for manually and automatically classifying the resource type of medical questions. Three types of resources are considered: patient-specific, general knowledge, and research. Using this approach, an automatic question answering system could select the best type of resource from which to consider answers. We first describe our methodology for manually annotating resource type on four different question corpora totaling over 5,000 questions. We then describe our approach for automatically identifying the appropriate type of resource. A supervised machine learning approach is used with lexical, syntactic, semantic, and topic-based feature types. This approach is able to achieve accuracies in the range of 80.9% to 92.8% across four datasets. Finally, we discuss the difficulties encountered in both manual and automatic classification of this challenging task.
Affiliation(s)
- Kirk Roberts
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX
- Laritza Rodriguez
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
- Sonya E Shooshan
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
|
31
|
Luo Y, Uzuner Ö, Szolovits P. Bridging semantics and syntax with graph algorithms-state-of-the-art of extracting biomedical relations. Brief Bioinform 2017; 18:160-178. [PMID: 26851224 PMCID: PMC5221425 DOI: 10.1093/bib/bbw001] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 09/24/2015] [Revised: 11/29/2015] [Indexed: 01/18/2023] Open
Abstract
Research on extracting biomedical relations has received growing attention recently, with numerous biological and clinical applications including those in pharmacogenomics, clinical trial screening and adverse drug reaction detection. The ability to accurately capture both semantic and syntactic structures in text expressing these relations becomes increasingly critical to enable deep understanding of scientific papers and clinical narratives. Shared task challenges have been organized by both bioinformatics and clinical informatics communities to assess and advance the state-of-the-art research. Significant progress has been made in algorithm development and resource construction. In particular, graph-based approaches bridge semantics and syntax, often achieving the best performance in shared tasks. However, a number of problems at the frontiers of biomedical relation extraction continue to pose interesting challenges and present opportunities for great improvement and fruitful research. In this article, we place biomedical relation extraction against the backdrop of its versatile applications, present a gentle introduction to its general pipeline and shared resources, review the current state-of-the-art in methodology advancement, discuss limitations and point out several promising future directions.
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, 11th Floor, Arthur Rubloff Building, 750 N. Lake Shore Drive, Chicago, IL, USA
- Özlem Uzuner
- Department of Information Studies, State University of New York at Albany, New York, USA
- Peter Szolovits
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Massachusetts, USA
|
32
|
Demner-Fushman D, Elhadad N. Aspiring to Unintended Consequences of Natural Language Processing: A Review of Recent Developments in Clinical and Consumer-Generated Text Processing. Yearb Med Inform 2016; 25:224-233. [PMID: 27830255 PMCID: PMC5171557 DOI: 10.15265/iy-2016-017] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 12/23/2022] Open
Abstract
OBJECTIVES This paper reviews work over the past two years in Natural Language Processing (NLP) applied to clinical and consumer-generated texts. METHODS We included any application or methodological publication that leverages text to facilitate healthcare and address the health-related needs of consumers and populations. RESULTS Many important developments in clinical text processing, both foundational and task-oriented, were addressed in community-wide evaluations and discussed in corresponding special issues that are referenced in this review. These focused issues and in-depth reviews of several other active research areas, such as pharmacovigilance and summarization, allowed us to discuss in greater depth disease modeling and predictive analytics using clinical texts, text analysis in social media for healthcare quality assessment, trends towards online interventions based on rapid analysis of health-related posts, and consumer health question answering, among other issues. CONCLUSIONS Our analysis shows that although clinical NLP continues to advance towards practical applications and more NLP methods are used in large-scale live health information applications, more needs to be done to make NLP use in clinical applications a routine widespread reality. Progress in clinical NLP is mirrored by developments in social media text analysis: the research is moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to the standard healthcare quality evaluation tools.
Affiliation(s)
- D Demner-Fushman
- Dina Demner-Fushman, National Library of Medicine, National Institutes of Health, Bldg. 38A, Room 10S-1022, 8600 Rockville Pike MSC-3824, Bethesda, MD 20894, USA, Tel: +1 301 435 5320, Fax: +1 301 402 0341, E-mail:
|
33
|
Hoogendoorn M, Berger T, Schulz A, Stolz T, Szolovits P. Predicting Social Anxiety Treatment Outcome Based on Therapeutic Email Conversations. IEEE J Biomed Health Inform 2016; 21:1449-1459. [PMID: 27542187 DOI: 10.1109/jbhi.2016.2601123] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 11/05/2022]
Abstract
Predicting therapeutic outcome in the mental health domain is of utmost importance to enable therapists to provide the most effective treatment to a patient. Using information from the writings of a patient can potentially be a valuable source of information, especially now that more and more treatments involve computer-based exercises or electronic conversations between patient and therapist. In this paper, we study predictive modeling using writings of patients under treatment for a social anxiety disorder. We extract a wealth of information from the text written by patients including their usage of words, the topics they talk about, the sentiment of the messages, and the style of writing. In addition, we study trends over time with respect to those measures. We then apply machine learning algorithms to generate the predictive models. Based on a dataset of 69 patients, we are able to show that we can predict therapy outcome with an area under the curve of 0.83 halfway through the therapy and with a precision of 0.78 when using the full data (i.e., the entire treatment period). Due to the limited number of participants, it is hard to generalize the results, but they do show great potential in this type of information.
|
34
|
Burger G, Abu-Hanna A, de Keizer N, Cornet R. Natural language processing in pathology: a scoping review. J Clin Pathol 2016; 69:jclinpath-2016-203872. [PMID: 27451435 DOI: 10.1136/jclinpath-2016-203872] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 05/16/2016] [Accepted: 06/27/2016] [Indexed: 01/24/2023]
Abstract
BACKGROUND Encoded pathology data are key for medical registries and analyses, but pathology information is often expressed as free text. OBJECTIVE We reviewed and assessed the use of NLP (natural language processing) for encoding pathology documents. MATERIALS AND METHODS Papers addressing NLP in pathology were retrieved from PubMed, Association for Computing Machinery (ACM) Digital Library and Association for Computational Linguistics (ACL) Anthology. We reviewed and summarised the study objectives; NLP methods used and their validation; software implementations; the performance on the dataset used and any reported use in practice. RESULTS The main objectives of the 38 included papers were encoding and extraction of clinically relevant information from pathology reports. Common approaches were word/phrase matching, probabilistic machine learning and rule-based systems. Five papers (13%) compared different methods on the same dataset. Four papers did not specify the method(s) used. 18 of the 26 studies that reported F-measure, recall or precision reported values of over 0.9. Proprietary software was the most frequently mentioned category (14 studies); General Architecture for Text Engineering (GATE) was the most applied architecture overall. Practical system use was reported in four papers. Most papers used expert annotation validation. CONCLUSIONS Different methods are used in NLP research in pathology, and good performances, that is, high precision and recall, high retrieval/removal rates, are reported for all of these. Lack of validation and of shared datasets precludes performance comparison. More comparative analysis and validation are needed to provide better insight into the performance and merits of these methods.
Affiliation(s)
- Gerard Burger
- Symbiant Pathology Expert Centre, Hoorn, The Netherlands; Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
- Ameen Abu-Hanna
- Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
- Nicolette de Keizer
- Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
- Ronald Cornet
- Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands; Department of Biomedical Engineering, Linköping University, Linköping, Sweden
|
35
|
Luo Y, Szolovits P. Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records. BIOMEDICAL INFORMATICS INSIGHTS 2016; 8:29-38. [PMID: 27478379 PMCID: PMC4954589 DOI: 10.4137/bii.s38916] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 05/04/2016] [Revised: 06/13/2016] [Accepted: 06/22/2016] [Indexed: 11/07/2022]
Abstract
In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing on narrative clinical notes in electronic medical records (EMRs) and efficiently retrieving such annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem into the interval query problem, for which optimal query/update time is in general logarithmic. We next perform a tight time complexity analysis on the basic interval tree query algorithm and show its nonoptimality when being applied to a collection of 13 query types from Allen’s interval algebra. We then study two closely related state-of-the-art interval query algorithms, proposed query reformulations, and augmentations to the second algorithm. Our proposed algorithm achieves logarithmic time stabbing-max query time complexity and solves the stabbing-interval query tasks on all of Allen’s relations in logarithmic time, attaining the theoretic lower bound. Updating time is kept logarithmic and the space requirement is kept linear at the same time. We also discuss interval management in external memory models and higher dimensions.
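To make the stand-off data model concrete, the following minimal sketch stores annotations as offset pairs kept apart from the text and answers two position-constrained queries. It deliberately uses a linear scan, so it does not reproduce the logarithmic-time interval-tree algorithm the abstract describes. The class `Annotation`, the helper names, and the example note are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the annotation begins (inclusive)
    end: int     # character offset where it ends (exclusive)
    label: str   # annotation content, stored separately from the text


def stabbing_query(annotations: List[Annotation], position: int) -> List[Annotation]:
    """Return every annotation whose span covers `position` (O(n) scan, not a tree)."""
    return [a for a in annotations if a.start <= position < a.end]


def during(annotations: List[Annotation], outer: Annotation) -> List[Annotation]:
    """Allen's 'during' relation: annotations lying strictly inside `outer`."""
    return [a for a in annotations if outer.start < a.start and a.end < outer.end]


# Hypothetical clinical note; the text itself is never modified by annotation.
text = "No fever. Chest pain improved."
annotations = [
    Annotation(0, 9, "sentence"),
    Annotation(3, 8, "problem"),     # "fever"
    Annotation(10, 30, "sentence"),
    Annotation(10, 20, "problem"),   # "Chest pain"
]

covering = stabbing_query(annotations, 17)      # annotations covering the "a" in "pain"
inside_first = during(annotations, annotations[0])
```

Note that the "Chest pain" annotation is not `during` its sentence, because both begin at offset 10; in Allen's algebra that pair falls under the separate "starts" relation, which is one reason a full system needs to handle all 13 relations.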
Affiliation(s)
- Yuan Luo
- Assistant Professor, Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
- Peter Szolovits
- Professor, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
|
36
|
Tamang S, Patel MI, Blayney DW, Kuznetsov J, Finlayson SG, Vetteth Y, Shah N. Detecting unplanned care from clinician notes in electronic health records. J Oncol Pract 2016; 11:e313-9. [PMID: 25980019 DOI: 10.1200/jop.2014.002741] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 02/01/2023] Open
Abstract
PURPOSE Reductions in unplanned episodes of care, such as emergency department visits and unplanned hospitalizations, are important quality outcome measures. However, many events are only documented in free-text clinician notes and are labor intensive to detect by manual medical record review. METHODS We studied 308,096 free-text machine-readable documents linked to individual entries in our electronic health records, representing care for patients with breast, GI, or thoracic cancer, whose treatment was initiated at one academic medical center, Stanford Health Care (SHC). Using a clinical text-mining tool, we detected unplanned episodes documented in clinician notes (for non-SHC visits) or in coded encounter data for SHC-delivered care and the most frequent symptoms documented in emergency department (ED) notes. RESULTS Combined reporting increased the identification of patients with one or more unplanned care visits by 32% (15% using coded data; 20% using all the data) among patients with 3 months of follow-up and by 21% (23% using coded data; 28% using all the data) among those with 1 year of follow-up. Based on the textual analysis of SHC ED notes, pain (75%), followed by nausea (54%), vomiting (47%), infection (36%), fever (28%), and anemia (27%), were the most frequent symptoms mentioned. Pain, nausea, and vomiting co-occur in 35% of all ED encounter notes. CONCLUSION The text-mining methods we describe can be applied to automatically review free-text clinician notes to detect unplanned episodes of care mentioned in these notes. These methods have broad application for quality improvement efforts in which events of interest occur outside of a network that allows for patient data sharing.
Affiliation(s)
- Suzanne Tamang
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
- Manali I Patel
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
- Douglas W Blayney
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
- Julie Kuznetsov
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
- Samuel G Finlayson
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
- Yohan Vetteth
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
- Nigam Shah
- Stanford University School of Medicine; Stanford Health Care, Stanford, CA; and Harvard Medical School, Boston, MA
|
37
|
Hoogendoorn M, Szolovits P, Moons LMG, Numans ME. Utilizing uncoded consultation notes from electronic medical records for predictive modeling of colorectal cancer. Artif Intell Med 2016; 69:53-61. [PMID: 27085847 DOI: 10.1016/j.artmed.2016.03.003] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/06/2015] [Accepted: 03/23/2016] [Indexed: 12/15/2022]
Abstract
OBJECTIVE Machine learning techniques can be used to extract predictive models for diseases from electronic medical records (EMRs). However, the nature of EMRs makes it difficult to apply off-the-shelf machine learning techniques while still exploiting the rich content of the EMRs. In this paper, we explore the usage of a range of natural language processing (NLP) techniques to extract valuable predictors from uncoded consultation notes and study whether they can help to improve predictive performance. METHODS We study a number of existing techniques for the extraction of predictors from the consultation notes, namely a bag-of-words-based approach and topic modeling. In addition, we develop a dedicated technique to match the uncoded consultation notes with a medical ontology. We apply these techniques as an extension to an existing pipeline to extract predictors from EMRs. We evaluate them in the context of predictive modeling for colorectal cancer (CRC), a disease known to be difficult to diagnose before performing an endoscopy. RESULTS Our results show that we are able to extract useful information from the consultation notes. The predictive performance of the ontology-based extraction method moves significantly beyond the benchmark of age and gender alone (area under the receiver operating characteristic curve (AUC) of 0.870 versus 0.831). We also observe more accurate predictive models by adding features derived from processing the consultation notes compared to solely using coded data (AUC of 0.896 versus 0.882), although the difference is not significant. The extracted features from the notes are shown to be equally predictive (i.e. there is no significant difference in performance) compared to the coded data of the consultations. CONCLUSION It is possible to extract useful predictors from uncoded consultation notes that improve predictive performance. Techniques linking text to concepts in medical ontologies to derive these predictors are shown to perform best for predicting CRC in our EMR dataset.
Affiliation(s)
- Mark Hoogendoorn
- Department of Computer Science, VU University Amsterdam, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands; Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.
- Peter Szolovits
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.
- Leon M G Moons
- Department of Gastroenterology and Hepatology, Utrecht University Medical Center, Heidelberglaan 100, 3584 CX Utrecht, The Netherlands.
- Mattijs E Numans
- Department of Public Health and Primary Care, Leiden University Medical Center, Hippocratespad 21, 2333 ZD Leiden, The Netherlands.
|
38
|
Han D, Wang S, Jiang C, Jiang X, Kim HE, Sun J, Ohno-Machado L. Trends in biomedical informatics: automated topic analysis of JAMIA articles. J Am Med Inform Assoc 2015; 22:1153-63. [PMID: 26555018 PMCID: PMC5009912 DOI: 10.1093/jamia/ocv157] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/31/2015] [Revised: 09/08/2015] [Accepted: 09/14/2015] [Indexed: 01/26/2023] Open
Abstract
Biomedical Informatics is a growing interdisciplinary field in which research topics and citation trends have been evolving rapidly in recent years. To analyze these data in a fast, reproducible manner, automation of certain processes is needed. JAMIA is a "generalist" journal for biomedical informatics. Its articles reflect the wide range of topics in informatics. In this study, we retrieved Medical Subject Headings (MeSH) terms and citations of JAMIA articles published between 2009 and 2014. We use tensors (i.e., multidimensional arrays) to represent the interaction among topics, time and citations, and applied tensor decomposition to automate the analysis. The trends represented by tensors were then carefully interpreted and the results were compared with previous findings based on manual topic analysis. A list of most cited JAMIA articles, their topics, and publication trends over recent years is presented. The analyses confirmed previous studies and showed that, from 2012 to 2014, the number of articles related to MeSH terms Methods, Organization & Administration, and Algorithms increased significantly both in number of publications and citations. Citation trends varied widely by topic, with Natural Language Processing having a large number of citations in particular years, and Medical Record Systems, Computerized remaining a very popular topic in all years.
Affiliation(s)
- Dong Han
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA; School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135, USA
- Shuang Wang
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Chao Jiang
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA; School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135, USA
- Xiaoqian Jiang
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Hyeon-Eui Kim
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Jimeng Sun
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, S30313, USA
- Lucila Ohno-Machado
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
|
39
|
Luo Y, Xin Y, Hochberg E, Joshi R, Uzuner O, Szolovits P. Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text. J Am Med Inform Assoc 2015; 22:1009-19. [PMID: 25862765 PMCID: PMC4986663 DOI: 10.1093/jamia/ocv016] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 08/21/2014] [Revised: 01/18/2015] [Accepted: 02/16/2015] [Indexed: 02/04/2023] Open
Abstract
OBJECTIVE Extracting medical knowledge from electronic medical records requires automated approaches to combat scalability limitations and selection biases. However, existing machine learning approaches are often regarded by clinicians as black boxes. Moreover, training data for these automated approaches are often sparsely annotated at best. The authors target unsupervised learning for modeling clinical narrative text, aiming at improving both accuracy and interpretability. METHODS The authors introduce a novel framework named subgraph augmented non-negative tensor factorization (SANTF). In addition to relying on atomic features (e.g., words in clinical narrative text), SANTF automatically mines higher-order features (e.g., relations of lymphoid cells expressing antigens) from clinical narrative text by converting sentences into a graph representation and identifying important subgraphs. The authors compose a tensor using patients, higher-order features, and atomic features as its respective modes. The authors then apply non-negative tensor factorization to cluster patients and simultaneously identify latent groups of higher-order features that link to patient clusters, as in clinical guidelines where a panel of immunophenotypic features and laboratory results are used to specify diagnostic criteria. RESULTS AND CONCLUSION SANTF demonstrated over 10% improvement in averaged F-measure on patient clustering compared to widely used non-negative matrix factorization (NMF) and k-means clustering methods. Multiple baselines were established by modeling patient data using patient-by-features matrices with different feature configurations and then performing NMF or k-means to cluster patients. Feature analysis identified latent groups of higher-order features that lead to medical insights. The authors also found that the latent groups of atomic features help to better correlate the latent groups of higher-order features.
Affiliation(s)
- Yuan Luo: Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
- Yu Xin: Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
- Ephraim Hochberg: Center for Lymphoma, Massachusetts General Hospital and Department of Medicine, Harvard Medical School
- Rohit Joshi: Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
- Ozlem Uzuner: Department of Information Studies, State University of New York at Albany
- Peter Szolovits: Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
|
40
|
Luo Y, Riedlinger G, Szolovits P. Text mining in cancer gene and pathway prioritization. Cancer Inform 2014; 13:69-79. [PMID: 25392685 PMCID: PMC4216063 DOI: 10.4137/cin.s13874] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 04/01/2014] [Revised: 05/18/2014] [Accepted: 05/18/2014] [Indexed: 12/18/2022] Open
Abstract
Prioritization of cancer-implicated genes has received growing attention as an effective way to reduce wet-lab cost through computational analysis that ranks candidate genes according to the likelihood that experimental verification will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing pathways may be more desirable than prioritizing only genes.
Affiliation(s)
- Yuan Luo: Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- Gregory Riedlinger: Department of Pathology, Massachusetts General Hospital, Boston, MA, USA
- Peter Szolovits: Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
|