1. Liu J, Wong ZSY. Utilizing active learning strategies in machine-assisted annotation for clinical named entity recognition: a comprehensive analysis considering annotation costs and target effectiveness. J Am Med Inform Assoc 2024; 31:2632-2640. PMID: 39081233; PMCID: PMC11491619; DOI: 10.1093/jamia/ocae197.
Abstract
OBJECTIVES Active learning (AL) has rarely integrated diversity-based and uncertainty-based strategies into a dynamic sampling framework for clinical named entity recognition (NER). Machine-assisted annotation is becoming popular for creating gold-standard labels. This study investigated the effectiveness of dynamic AL strategies under simulated machine-assisted annotation scenarios for clinical NER. MATERIALS AND METHODS We proposed 3 new AL strategies: a diversity-based strategy (CLUSTER) based on Sentence-BERT and 2 dynamic strategies (CLC and CNBSE) capable of switching from diversity-based to uncertainty-based strategies. Using BioClinicalBERT as the foundational NER model, we conducted simulation experiments on 3 medication-related clinical NER datasets independently: i2b2 2009, n2c2 2018 (Track 2), and MADE 1.0. We compared the proposed strategies with uncertainty-based (LC and NBSE) and passive-learning (RANDOM) strategies. Performance was primarily measured by the number of edits made by the annotators to achieve a desired target effectiveness evaluated on independent test sets. RESULTS When aiming for 98% overall target effectiveness, on average, CLUSTER required the fewest edits. When aiming for 99% overall target effectiveness, CNBSE required 20.4% fewer edits than NBSE did. CLUSTER and RANDOM could not achieve such a high target under the pool-based simulation experiment. For high-difficulty entities, CNBSE required 22.5% fewer edits than NBSE to achieve 99% target effectiveness, whereas neither CLUSTER nor RANDOM achieved 93% target effectiveness. DISCUSSION AND CONCLUSION When the target effectiveness was set high, the proposed dynamic strategy CNBSE exhibited both strong learning capabilities and low annotation costs in machine-assisted annotation. CLUSTER required the fewest edits when the target effectiveness was set low.
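A minimal sketch of the diversity-based selection idea described above (one representative sentence per cluster of Sentence-BERT embeddings). The encoder checkpoint and the "closest to centroid" selection rule are illustrative assumptions, not the paper's exact CLUSTER strategy:

```python
# Sketch only: diversity-based batch selection by clustering Sentence-BERT
# embeddings of unlabeled sentences and picking one representative per cluster.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def select_diverse_batch(unlabeled_sentences, batch_size):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder checkpoint
    embeddings = encoder.encode(unlabeled_sentences)
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for c in range(batch_size):
        members = np.where(km.labels_ == c)[0]
        # pick the sentence nearest the cluster centroid as its representative
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return chosen  # indices of sentences to send to the annotator next
```

A dynamic strategy in the spirit of CLC/CNBSE would start batches with a selector like this and later switch to an uncertainty-based criterion once the model's confidence estimates become informative.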
Affiliation(s)
- Jiaxing Liu
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, Hubei 430073, China
- Zoie S Y Wong
- Graduate School of Public Health, St Luke’s International University, OMURA Susumu & Mieko Memorial St Luke’s Center for Clinical Academia, Chuo-ku, Tokyo 104-0045, Japan
- The Kirby Institute, University of New South Wales, Sydney, NSW 2052, Australia
- School of Medical Sciences, The University of Sydney, Camperdown, NSW 2050, Australia
2. Van den Eynde J, Lachmann M, Laugwitz KL, Manlhiot C, Kutty S. Successfully Implemented Artificial Intelligence and Machine Learning Applications in Cardiology: State-of-the-Art Review. Trends Cardiovasc Med 2022:S1050-1738(22)00012-3. DOI: 10.1016/j.tcm.2022.01.010.
3. Manlhiot C, van den Eynde J, Kutty S, Ross HJ. A Primer on the Present State and Future Prospects for Machine Learning and Artificial Intelligence Applications in Cardiology. Can J Cardiol 2021; 38:169-184. PMID: 34838700; DOI: 10.1016/j.cjca.2021.11.009.
Abstract
The artificial intelligence (AI) revolution is well underway, including in the medical field, and has dramatically transformed our lives. An understanding of the basics of AI applications, their development, and the challenges to their clinical implementation is important for clinicians to fully appreciate the possibilities of AI. Such a foundation would ensure that clinicians have a good grasp of, and realistic expectations for, AI in medicine and prevent discrepancies between the promised and real-world impact. When quantifying the track record for AI applications in cardiology, we found that a substantial number of AI systems are never deployed in clinical practice, although there certainly are many success stories. Successful implementations shared the following: they came from clinical areas where a large amount of training data was available; they were deployable into a single diagnostic modality; their prediction models generally had high performance on external validation; and most were developed as part of collaborations with medical device manufacturers who had substantial experience with the implementation of new technology. When looking into the current processes used for developing AI-based systems, we suggest that expanding the analytic framework to address potential deployment and implementation issues at project outset will improve the rate of successful implementation and will be a necessary next step for AI to achieve its full potential in cardiovascular medicine.
Affiliation(s)
- Cedric Manlhiot
- Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Jef van den Eynde
- Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA; Department of Cardiovascular Sciences, KU Leuven, Leuven, Belgium
- Shelby Kutty
- Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Heather J Ross
- Ted Rogers Centre for Heart Research, Peter Munk Cardiac Centre, University Health Network, Department of Medicine, University of Toronto, Toronto, Ontario, Canada
4. Jonnagaddala J, Chen A, Batongbacal S, Nekkantti C. The OpenDeID corpus for patient de-identification. Sci Rep 2021; 11:19973. PMID: 34620985; PMCID: PMC8497517; DOI: 10.1038/s41598-021-99554-9.
Abstract
For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist the development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings, and the quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotation approach is not reliable in terms of quality when compared to serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
Affiliation(s)
- Aipeng Chen
- School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Sean Batongbacal
- School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
5. Zhang H, Hu D, Duan H, Li S, Wu N, Lu X. A novel deep learning approach to extract Chinese clinical entities for lung cancer screening and staging. BMC Med Inform Decis Mak 2021; 21:214. PMID: 34330277; PMCID: PMC8323233; DOI: 10.1186/s12911-021-01575-x.
Abstract
BACKGROUND Computed tomography (CT) reports record a large volume of valuable information about patients' conditions and the interpretations of radiology images from radiologists, which can be used for clinical decision-making and further academic study. However, the free-text nature of clinical reports is a critical barrier to using these data more effectively. In this study, we investigate a novel deep learning method to extract entities from Chinese CT reports for lung cancer screening and TNM staging. METHODS The proposed approach presents a new named entity recognition algorithm, namely the BERT-based-BiLSTM-Transformer network (BERT-BTN) with pre-training, to extract clinical entities for lung cancer screening and staging. Specifically, instead of traditional word embedding methods, BERT is applied to learn the deep semantic representations of characters. Following the long short-term memory layer, a Transformer layer is added to capture the global dependencies between characters. In addition, a pre-training technique is employed to alleviate the problem of insufficient labeled data. RESULTS We verify the effectiveness of the proposed approach on a clinical dataset containing 359 CT reports collected from the Department of Thoracic Surgery II of Peking University Cancer Hospital. The experimental results show that the proposed approach achieves an 85.96% macro-F1 score under the exact match scheme, which improves performance by 1.38%, 1.84%, 3.81%, 4.29%, 5.12%, 5.29%, and 8.84% compared to BERT-BTN, BERT-LSTM, BERT-fine-tune, BERT-Transformer, FastText-BTN, FastText-BiLSTM, and FastText-Transformer, respectively. CONCLUSIONS In this study, we developed a novel deep learning method, i.e., BERT-BTN with pre-training, to extract clinical entities from Chinese CT reports. The experimental results indicate that the proposed approach can efficiently recognize various clinical entities related to lung cancer screening and staging, which shows its potential for further clinical decision-making and academic research.
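A minimal sketch of the BERT -> BiLSTM -> Transformer -> classifier stack named above (BERT-BTN). The checkpoint name, hidden sizes, head count, and label count are assumptions for illustration; the paper's exact configuration and its pre-training step are not reproduced here:

```python
# Sketch of a BERT-BiLSTM-Transformer token classifier for character-level NER.
import torch.nn as nn
from transformers import AutoModel

class BertBTN(nn.Module):
    def __init__(self, num_labels, bert_name="bert-base-chinese"):  # assumed checkpoint
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.bilstm(x)      # local sequential context over characters
        x = self.transformer(x)    # global dependencies between characters
        return self.classifier(x)  # per-character entity label logits
```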
Affiliation(s)
- Huanyao Zhang
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Zheda Road, Hangzhou, China
- Danqing Hu
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Zheda Road, Hangzhou, China
- Huilong Duan
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Zheda Road, Hangzhou, China
- Shaolei Li
- Department of Thoracic Surgery II, Peking University Cancer Hospital & Institute, Beijing, China
- Nan Wu
- Department of Thoracic Surgery II, Peking University Cancer Hospital & Institute, Beijing, China
- Xudong Lu
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Zheda Road, Hangzhou, China
6. Dobbie S, Strafford H, Pickrell WO, Fonferko-Shadrach B, Jones C, Akbari A, Thompson S, Lacey A. Markup: A Web-Based Annotation Tool Powered by Active Learning. Front Digit Health 2021; 3:598916. PMID: 34713086; PMCID: PMC8521860; DOI: 10.3389/fdgth.2021.598916.
Abstract
Across various domains, such as health and social care, law, news, and social media, increasing quantities of unstructured texts are being produced. These potential data sources often contain rich information that could be used for domain-specific and research purposes. However, the unstructured nature of free-text data poses a significant challenge for its utilisation due to the necessity of substantial manual intervention from domain experts to label embedded information. Annotation tools can assist with this process by providing functionality that enables the accurate capture and transformation of unstructured texts into structured annotations, which can be used individually or as part of larger Natural Language Processing (NLP) pipelines. We present Markup (https://www.getmarkup.com/), an open-source, web-based annotation tool that is undergoing continued development for use across all domains. Markup incorporates NLP and Active Learning (AL) technologies to enable rapid and accurate annotation using custom user configurations, predictive annotation suggestions, and automated mapping suggestions to both domain-specific ontologies, such as the Unified Medical Language System (UMLS), and custom, user-defined ontologies. We demonstrate a real-world use case of how Markup has been used in a healthcare setting to annotate structured information from unstructured clinic letters, where captured annotations were used to build and test NLP applications.
Affiliation(s)
- Samuel Dobbie
- Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Huw Strafford
- Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
- W. Owen Pickrell
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Neurology Department, Morriston Hospital, Swansea Bay University Health Board, Swansea, United Kingdom
- Carys Jones
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Ashley Akbari
- Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Simon Thompson
- Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Arron Lacey
- Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Swansea University Medical School, Swansea University, Swansea, United Kingdom
7. Li J, Zhou Y, Jiang X, Natarajan K, Pakhomov SV, Liu H, Xu H. Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition. J Am Med Inform Assoc 2021; 28:2193-2201. PMID: 34272955; DOI: 10.1093/jamia/ocab112.
Abstract
OBJECTIVE Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. MATERIALS AND METHODS We implemented 4 state-of-the-art text generation models, namely CharRNN, SegGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities for 500 randomly selected History and Present Illness notes generated by the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models on all 3 corpora and evaluated their performance on 2 independent natural corpora. RESULTS Our evaluation shows GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (with a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on 2 independent corpora: strict F1 scores of 0.709 and 0.748, respectively, compared with the NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we also demonstrated that an augmented method that combines both natural and synthetic corpora achieved better performance than using the natural corpus alone. CONCLUSIONS Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology in practice.
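A minimal sketch of the corpus-level BLEU-2 scoring used above to compare the generators; the toy tokenized sentences are made-up examples, not data from the study:

```python
# Sketch: corpus-level BLEU-2 between synthetic and natural note text.
from nltk.translate.bleu_score import corpus_bleu

references = [[["patient", "presents", "with", "chest", "pain"]]]       # natural reference
hypotheses = [["patient", "presents", "with", "mild", "chest", "pain"]]  # synthetic candidate
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5))  # unigram + bigram only
print(round(bleu2, 2))
```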
Affiliation(s)
- Jianfu Li
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yujia Zhou
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Karthik Natarajan
- Department of Biomedical Informatics, Columbia University, New York, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
8. Newman-Griffis D, Lehman JF, Rosé C, Hochheiser H. Translational NLP: A New Paradigm and General Principles for Natural Language Processing Research. Proc Conf Assoc Comput Linguist North Am Chapter Meeting 2021; 2021:4125-4138. PMID: 34179899; PMCID: PMC8223521.
Abstract
Natural language processing (NLP) research combines the study of universal principles, through basic science, with applied science targeting specific use cases and settings. However, the process of exchange between basic NLP and applications is often assumed to emerge naturally, resulting in many innovations going unapplied and many important questions left unstudied. We describe a new paradigm of Translational NLP, which aims to structure and facilitate the processes by which basic and applied NLP research inform one another. Translational NLP thus presents a third research paradigm, focused on understanding the challenges posed by application needs and how these challenges can drive innovation in basic science and technology design. We show that many significant advances in NLP research have emerged from the intersection of basic principles with application needs, and present a conceptual framework outlining the stakeholders and key questions in translational research. Our framework provides a roadmap for developing Translational NLP as a dedicated research area, and identifies general translational principles to facilitate exchange between basic and applied research.
Affiliation(s)
- Jill Fain Lehman
- Human-Computer Interaction Institute, Carnegie Mellon University, USA
- Carolyn Rosé
- Language Technologies Institute, Carnegie Mellon University, USA
- Harry Hochheiser
- Department of Biomedical Informatics, University of Pittsburgh, USA
9. Geva A, Stedman JP, Manzi SF, Lin C, Savova GK, Avillach P, Mandl KD. Adverse drug event presentation and tracking (ADEPT): semiautomated, high throughput pharmacovigilance using real-world data. JAMIA Open 2020; 3:413-421. PMID: 33215076; PMCID: PMC7660953; DOI: 10.1093/jamiaopen/ooaa031.
Abstract
Objective To advance the use of real-world data (RWD) for pharmacovigilance, we sought to integrate a high-sensitivity natural language processing (NLP) pipeline for detecting potential adverse drug events (ADEs) with easily interpretable output for high-efficiency human review and adjudication of true ADEs. Materials and methods The adverse drug event presentation and tracking (ADEPT) system employs an open source NLP pipeline to identify in clinical notes mentions of medications and signs and symptoms potentially indicative of ADEs. ADEPT presents the output to human reviewers by highlighting these drug-event pairs within the context of the clinical note. To measure the incidence of seizures associated with sildenafil, we applied ADEPT to 149,029 notes for 982 patients with pediatric pulmonary hypertension. Results Of 416 patients identified as taking sildenafil, NLP found 72 [17%, 95% confidence interval (CI) 14–21] with seizures as a potential ADE. Upon human review and adjudication, only 4 (0.96%, 95% CI 0.37–2.4) patients with seizures were determined to have true ADEs. Reviewers using ADEPT required a median of 89 s (interquartile range 57–142 s) per patient to review potential ADEs. Discussion ADEPT combines high-throughput NLP, to increase the sensitivity of ADE detection, with human review, to increase specificity by differentiating true ADEs from signs and symptoms related to comorbidities, effects of other medications, or other confounders. Conclusion ADEPT is a promising tool for creating gold standard, patient-level labels for advancing NLP-based pharmacovigilance, and a potentially time-saving platform for computer-assisted pharmacovigilance based on RWD.
Affiliation(s)
- Alon Geva
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA; Division of Critical Care Medicine, Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children's Hospital, Boston, Massachusetts, USA; Department of Anaesthesia, Harvard Medical School, Boston, Massachusetts, USA
- Jason P Stedman
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Shannon F Manzi
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA; Clinical Pharmacogenomics Service, Division of Genetics & Genomics and Department of Pharmacy, Boston Children's Hospital, Boston, Massachusetts, USA; Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
- Chen Lin
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA
- Guergana K Savova
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA; Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
- Paul Avillach
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Kenneth D Mandl
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA; Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
10. Trivedi G, Dadashzadeh ER, Handzel RM, Chapman WW, Visweswaran S, Hochheiser H. Interactive NLP in Clinical Care: Identifying Incidental Findings in Radiology Reports. Appl Clin Inform 2019; 10:655-669. PMID: 31486057; DOI: 10.1055/s-0039-1695791.
Abstract
BACKGROUND Despite advances in natural language processing (NLP), extracting information from clinical text is expensive. Interactive tools that are capable of easing the construction, review, and revision of NLP models can reduce this cost and improve the utility of clinical reports for clinical and secondary use. OBJECTIVES We present the design and implementation of an interactive NLP tool for identifying incidental findings in radiology reports, along with a user study evaluating the performance and usability of the tool. METHODS Expert reviewers provided gold standard annotations for 130 patient encounters (694 reports) at sentence, section, and report levels. We performed a user study with 15 physicians to evaluate the accuracy and usability of our tool. Participants reviewed encounters split into intervention (with predictions) and control conditions (no predictions). We measured changes in model performance, the time spent, and the number of user actions needed. The System Usability Scale (SUS) and an open-ended questionnaire were used to assess usability. RESULTS Starting from bootstrapped models trained on 6 patient encounters, we observed an average increase in F1 score from 0.31 to 0.75 for reports, from 0.32 to 0.68 for sections, and from 0.22 to 0.60 for sentences on a held-out test data set, over an hour-long study session. We found that the tool helped significantly reduce the time spent in reviewing encounters (134.30 vs. 148.44 seconds in the intervention and control conditions, respectively), while maintaining the overall quality of labels as measured against the gold standard. The tool was well received by the study participants, with a very good overall SUS score of 78.67. CONCLUSION The user study demonstrated successful use of the tool by physicians for identifying incidental findings. These results support the viability of adopting interactive NLP tools in clinical care settings for a wider range of clinical applications.
Affiliation(s)
- Gaurav Trivedi
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
- Esmaeel R Dadashzadeh
- Department of Surgery and Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
- Robert M Handzel
- Department of Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
- Wendy W Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, United States
- Shyam Visweswaran
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
- Harry Hochheiser
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
11. Wagholikar KB, Fischer CM, Goodson A, Herrick CD, Rees M, Toscano E, MacRae CA, Scirica BM, Desai AS, Murphy SN. Extraction of Ejection Fraction from Echocardiography Notes for Constructing a Cohort of Patients having Heart Failure with reduced Ejection Fraction (HFrEF). J Med Syst 2018; 42:209. PMID: 30255347; PMCID: PMC6153777; DOI: 10.1007/s10916-018-1066-7.
Abstract
Left ventricular ejection fraction (LVEF) is an important prognostic indicator of cardiovascular outcomes. It is used clinically to determine the indication for several therapeutic interventions. LVEF is most commonly derived using in-line tools and some manual assessment by cardiologists from standardized echocardiographic views. LVEF is typically documented in free-text reports, and variation in LVEF documentation poses a challenge for the extraction and utilization of LVEF in computer-based clinical workflows. To address this problem, we developed a computerized algorithm to extract LVEF from echocardiography reports for the identification of patients having heart failure with reduced ejection fraction (HFrEF) for therapeutic intervention at a large healthcare system. We processed echocardiogram reports for 57,158 patients with a coded diagnosis of heart failure who visited the healthcare system over a two-year period. Our algorithm identified a total of 3910 patients with reduced ejection fraction. Of the 46,634 echocardiography reports processed, 97% included a mention of LVEF. Of these reports, 85% contained numerical ejection fraction values, 9% contained ranges, and the remaining 6% contained qualitative descriptions. Overall, 18% of extracted numerical LVEFs were ≤ 40%. Furthermore, manual validation for a sample of 339 reports yielded an accuracy of 1.0. Our study demonstrates that a regular expression-based approach can accurately extract LVEF from echocardiograms and is useful for delineating heart failure patients with reduced ejection fraction.
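A minimal sketch in the spirit of the regular expression-based extraction described above. The pattern, the 20-character gap limit, and the midpoint rule for ranges are illustrative assumptions, not the authors' full rule set:

```python
# Sketch: regex extraction of numeric LVEF values (including simple ranges)
# from free-text echocardiography reports.
import re

LVEF_PATTERN = re.compile(
    r"(?:LVEF|ejection\s+fraction|EF)\D{0,20}?(\d{1,2})\s*(?:-|to)?\s*(\d{1,2})?\s*%",
    re.IGNORECASE,
)

def extract_lvef(report_text):
    values = []
    for match in LVEF_PATTERN.finditer(report_text):
        low = int(match.group(1))
        high = int(match.group(2)) if match.group(2) else low
        values.append((low + high) / 2)  # midpoint when a range is documented
    return values

print(extract_lvef("LVEF is estimated at 35-40%."))  # -> [37.5]
```

Values at or below a chosen threshold (40% in the study) could then flag candidate HFrEF patients for review.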
Affiliation(s)
- Kavishwar B Wagholikar
- Harvard Medical School, Boston, MA, USA; Massachusetts General Hospital, Boston, MA, USA
- Calum A MacRae
- Harvard Medical School, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Benjamin M Scirica
- Harvard Medical School, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Akshay S Desai
- Harvard Medical School, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Shawn N Murphy
- Harvard Medical School, Boston, MA, USA; Massachusetts General Hospital, Boston, MA, USA; Partners Healthcare, Boston, MA, USA
12. Tapi Nzali MD, Aze J, Bringay S, Lavergne C, Mollevi C, Opitz T. Reconciliation of patient/doctor vocabulary in a structured resource. Health Informatics J 2018; 25:1219-1231. PMID: 29332530; DOI: 10.1177/1460458217751014.
Abstract
Today, social media is increasingly used by patients to openly discuss their health. Automatically mining such data is a challenging task because of the unstructured nature of the text and the use of many abbreviations and slang terms. Our goal is to use Patient Authored Text to build a French Consumer Health Vocabulary in the breast cancer field, by collecting various kinds of non-experts' expressions that are related to their diseases and then comparing them to the biomedical terms used by health care professionals. We combine several methods from the literature, based on linguistic and statistical approaches, to extract candidate terms used by non-experts and to link them to expert terms. We use messages extracted from the forum 'cancerdusein.org' and a vocabulary dedicated to breast cancer elaborated by the Institut National du Cancer. We have built an efficient vocabulary composed of 192 validated relationships and formalized it in a Simple Knowledge Organization System (SKOS) ontology.
13. Trivedi G, Pham P, Chapman WW, Hwa R, Wiebe J, Hochheiser H. NLPReViz: an interactive tool for natural language processing on clinical text. J Am Med Inform Assoc 2018; 25:81-87. PMID: 29016825; PMCID: PMC6381768; DOI: 10.1093/jamia/ocx070.
Abstract
The gap between domain experts and natural language processing expertise is a barrier to extracting understanding from clinical text. We describe a prototype tool for interactive review and revision of natural language processing models of binary concepts extracted from clinical notes. We evaluated our prototype in a user study involving 9 physicians, who used our tool to build and revise models for 2 colonoscopy quality variables. We report changes in performance relative to the quantity of feedback. Using initial training sets as small as 10 documents, expert review led to final F1 scores for the "appendiceal-orifice" variable between 0.78 and 0.91 (with improvements ranging from 13.26% to 29.90%). F1 for "biopsy" ranged between 0.88 and 0.94 (improvements of -1.52% to 11.74%). The average System Usability Scale score was 70.56. Subjective feedback also suggests possible design improvements.
Affiliation(s)
- Gaurav Trivedi
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Phuong Pham
- Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
- Wendy W Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Rebecca Hwa
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
- Janyce Wiebe
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
- Harry Hochheiser
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
14. Kholghi M, De Vine L, Sitbon L, Zuccon G, Nguyen A. Clinical information extraction using small data: An active learning approach based on sequence representations and word embeddings. J Assoc Inf Sci Technol 2017. DOI: 10.1002/asi.23936.
Affiliation(s)
- Mahnoosh Kholghi
- Queensland University of Technology, Brisbane, Queensland 4000, Australia
- Lance De Vine
- Queensland University of Technology, Brisbane, Queensland 4000, Australia
- Laurianne Sitbon
- Queensland University of Technology, Brisbane, Queensland 4000, Australia
- Guido Zuccon
- Queensland University of Technology, Brisbane, Queensland 4000, Australia
- Anthony Nguyen
- The Australian e-Health Research Centre, CSIRO, Brisbane, Queensland 4029, Australia
15. Kholghi M, Sitbon L, Zuccon G, Nguyen A. Active learning reduces annotation time for clinical concept extraction. Int J Med Inform 2017; 106:25-31. DOI: 10.1016/j.ijmedinf.2017.08.001.
16. Névéol A, Zweigenbaum P. Clinical Natural Language Processing in 2014: Foundational Methods Supporting Efficient Healthcare. Yearb Med Inform 2017; 10:194-8. PMID: 26293868; DOI: 10.15265/iy-2015-035.
Abstract
OBJECTIVE To summarize recent research and present a selection of the best papers published in 2014 in the field of clinical Natural Language Processing (NLP). METHOD A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. RESULTS The clinical NLP best paper selection shows that the field is tackling text analysis methods of increasing depth. The full review process highlighted five papers addressing foundational methods in clinical NLP using clinically relevant texts from online forums or encyclopedias, clinical texts from Electronic Health Records, and included studies specifically aiming at a practical clinical outcome. The increased access to clinical data that was made possible with the recent progress of de-identification paved the way for the scientific community to address complex NLP problems such as word sense disambiguation, negation, temporal analysis and specific information nugget extraction. These advances in turn allowed for efficient application of NLP to clinical problems such as cancer patient triage. Another line of research investigates online clinically relevant texts and brings interesting insight on communication strategies to convey health-related information. CONCLUSIONS The field of clinical NLP is thriving through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques for concrete healthcare purposes. Clinical NLP is becoming mature for practical applications with a significant clinical impact.
Affiliation(s)
- Aurélie Névéol
- LIMSI CNRS UPR 3251, Rue John von Neumann, Campus Universitaire d'Orsay, 91405 Orsay cedex, France
17. Zheng S, Lu JJ, Ghasemzadeh N, Hayek SS, Quyyumi AA, Wang F. Effective Information Extraction Framework for Heterogeneous Clinical Reports Using Online Machine Learning and Controlled Vocabularies. JMIR Med Inform 2017; 5:e12. PMID: 28487265; PMCID: PMC5442348; DOI: 10.2196/medinform.7235.
Abstract
BACKGROUND Extracting structured data from narrated medical reports is challenged by the complexity of heterogeneous structures and vocabularies and often requires significant manual effort. Traditional machine-based approaches lack the capability to take user feedback for improving the extraction algorithm in real time. OBJECTIVE Our goal was to provide a generic information extraction framework that can support diverse clinical reports and enables a dynamic interaction between a human and a machine that produces highly accurate results. METHODS A clinical information extraction system, IDEAL-X, has been built on top of online machine learning. It processes one document at a time, and user interactions are recorded as feedback to update the learning model in real time. The updated model is used to predict values for extraction in subsequent documents. Once prediction accuracy reaches a user-acceptable threshold, the remaining documents may be batch processed. A customizable controlled vocabulary may be used to support extraction. RESULTS Three datasets were used for experiments based on report styles: 100 cardiac catheterization procedure reports, 100 coronary angiographic reports, and 100 integrated reports, each combining a history and physical report, discharge summary, outpatient clinic notes, outpatient clinic letter, and inpatient discharge medication report. Data extraction was performed by 3 methods: online machine learning, controlled vocabularies, and a combination of these. The system delivers results with F1 scores greater than 95%. CONCLUSIONS IDEAL-X adopts a unique online machine learning-based approach combined with controlled vocabularies to support data extraction for clinical reports. The system can quickly learn and improve, and is thus highly adaptable.
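A minimal sketch of the document-by-document online-learning loop described above: each human-reviewed value immediately updates the model used to pre-fill the next report. The hashing features, classifier choice, and label set are illustrative assumptions, not IDEAL-X internals:

```python
# Sketch: online learning with per-document human feedback.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**16)
model = SGDClassifier()
LABELS = ["normal", "abnormal"]  # assumed values of one extracted field

def review_loop(reports, ask_reviewer):
    for i, report in enumerate(reports):
        X = vectorizer.transform([report])
        if i > 0:
            suggestion = model.predict(X)[0]      # pre-fill shown to the reviewer
        label = ask_reviewer(report)              # reviewer confirms or corrects the value
        model.partial_fit(X, [label], classes=LABELS)  # model updated in real time
```

Once the suggestions become reliable enough, the remaining documents can be scored in batch without per-document review, as the abstract describes.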
Affiliation(s)
- Shuai Zheng
- Department of Biomedical Informatics, Emory University, Atlanta, GA, United States
- James J Lu
- Department of Mathematics and Computer Science, Emory University, Atlanta, GA, United States
- Nima Ghasemzadeh
- Division of Cardiology, Emory School of Medicine, Emory University, Atlanta, GA, United States
- Salim S Hayek
- Division of Cardiology, Emory School of Medicine, Emory University, Atlanta, GA, United States
- Arshed A Quyyumi
- Division of Cardiology, Emory School of Medicine, Emory University, Atlanta, GA, United States
- Fusheng Wang
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, United States
18. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc 2017; 24:596-606. PMID: 28040687; PMCID: PMC7787254; DOI: 10.1093/jamia/ocw156.
Abstract
OBJECTIVE Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that need to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value. MATERIALS AND METHODS We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and which is twice as large as the i2b2 2014 dataset. RESULTS Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21. CONCLUSION Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering.
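A minimal sketch of a feature-free neural sequence tagger over tokens, in the spirit of the ANN de-identifier described above; the embedding and hidden sizes are assumptions, and the paper's character-level components and exact recurrent architecture are not reproduced here:

```python
# Sketch: recurrent token tagger that predicts a PHI label (e.g., BIO tags)
# for every token, using only learned embeddings rather than handcrafted features.
import torch.nn as nn

class PHITagger(nn.Module):
    def __init__(self, vocab_size, num_phi_labels, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_phi_labels)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # learned token embeddings
        x, _ = self.bilstm(x)       # bidirectional context over the whole note
        return self.out(x)          # per-token PHI label logits
```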
Affiliation(s)
- Franck Dernoncourt
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ji Young Lee
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ozlem Uzuner
- Computer Science Department, University at Albany, SUNY, Albany, NY, USA
- Peter Szolovits
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
19. Kim Y, Garvin JH, Goldstein MK, Hwang TS, Redd A, Bolton D, Heidenreich PA, Meystre SM. Extraction of left ventricular ejection fraction information from various types of clinical reports. J Biomed Inform 2017; 67:42-48. PMID: 28163196; DOI: 10.1016/j.jbi.2017.01.017.
Abstract
Efforts to improve the treatment of congestive heart failure, a common and serious medical condition, include the use of quality measures to assess guideline-concordant care. The goal of this study is to identify left ventricular ejection fraction (LVEF) information from various types of clinical notes, and to then use this information for heart failure quality measurement. We analyzed the annotation differences between a new corpus of clinical notes from the Echocardiography, Radiology, and Text Integrated Utility package and other corpora annotated for natural language processing (NLP) research in the Department of Veterans Affairs. These reports contain varying degrees of structure. To examine whether existing LVEF extraction modules we developed in prior research improve the accuracy of LVEF information extraction from the new corpus, we created two sequence-tagging NLP modules trained with a new data set, with or without predictions from the existing LVEF extraction modules. We also conducted a set of experiments to examine the impact of training data size on information extraction accuracy. We found that less training data is needed when reports are highly structured, and that combining predictions from existing LVEF extraction modules improves information extraction when reports have less structured formats and a rich set of vocabulary.
Affiliation(s)
- Youngjun Kim
- School of Computing, University of Utah, Salt Lake City, UT, USA; VA Health Care System, Salt Lake City, UT, USA
- Jennifer H Garvin
- VA Health Care System, Salt Lake City, UT, USA; Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Mary K Goldstein
- VA Palo Alto Health Care System, Palo Alto, CA, USA; Stanford University, Stanford, CA, USA
- Andrew Redd
- VA Health Care System, Salt Lake City, UT, USA; Division of Epidemiology, University of Utah, Salt Lake City, UT, USA
- Dan Bolton
- VA Health Care System, Salt Lake City, UT, USA; Division of Epidemiology, University of Utah, Salt Lake City, UT, USA
- Paul A Heidenreich
- VA Palo Alto Health Care System, Palo Alto, CA, USA; Stanford University, Stanford, CA, USA
- Stéphane M Meystre
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; Medical University of South Carolina, Charleston, SC, USA
20. Hochheiser H, Ning Y, Hernandez A, Horn JR, Jacobson R, Boyce RD. Using Nonexperts for Annotating Pharmacokinetic Drug-Drug Interaction Mentions in Product Labeling: A Feasibility Study. JMIR Res Protoc 2016; 5:e40. PMID: 27066806; PMCID: PMC4844909; DOI: 10.2196/resprot.5028.
Abstract
BACKGROUND Because vital details of potential pharmacokinetic drug-drug interactions are often described in free-text structured product labels, manual curation is a necessary but expensive step in the development of electronic drug-drug interaction information resources. The use of nonexperts to annotate potential drug-drug interaction (PDDI) mentions in drug product labels may be a means of lessening the burden of manual curation. OBJECTIVE Our goal was to explore the practicality of using nonexpert participants to annotate drug-drug interaction descriptions from structured product labels. By presenting annotation tasks to both pharmacy experts and relatively naïve participants, we hoped to demonstrate the feasibility of using nonexpert annotators for drug-drug information annotation. We were also interested in exploring whether and to what extent natural language processing (NLP) preannotation helped improve task completion time, accuracy, and subjective satisfaction. METHODS Two experts and 4 nonexperts were asked to annotate 208 structured product label sections under 4 conditions completed sequentially: (1) no NLP assistance, (2) preannotation of drug mentions, (3) preannotation of drug mentions and PDDIs, and (4) a repeat of the no-assistance condition. Results were evaluated within the 2 groups and relative to an existing gold standard. Participants were asked to report the time required to complete tasks and their perceptions of task difficulty. RESULTS One of the experts and 3 of the nonexperts completed all tasks. Annotation results from the nonexpert group were relatively strong in every scenario and better than the performance of the NLP pipeline. The expert and 2 of the nonexperts were able to complete most tasks in less than 3 hours. Usability perceptions were generally positive (3.67 for the expert, mean of 3.33 for the nonexperts). CONCLUSIONS The results suggest that nonexpert annotation might be a feasible option for comprehensive annotation of PDDIs across a broader range of drug product labels. Preannotation of drug mentions may ease the annotation task. However, preannotation of PDDIs, as operationalized in this study, presented the participants with difficulties. Future work should test whether these issues can be addressed by the use of better-performing NLP and a different approach to presenting the PDDI preannotations to users during the annotation workflow.
Affiliation(s)
- Harry Hochheiser
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
21. Han D, Wang S, Jiang C, Jiang X, Kim HE, Sun J, Ohno-Machado L. Trends in biomedical informatics: automated topic analysis of JAMIA articles. J Am Med Inform Assoc 2015; 22:1153-63. PMID: 26555018; PMCID: PMC5009912; DOI: 10.1093/jamia/ocv157.
Abstract
Biomedical Informatics is a growing interdisciplinary field in which research topics and citation trends have been evolving rapidly in recent years. To analyze these data in a fast, reproducible manner, automation of certain processes is needed. JAMIA is a "generalist" journal for biomedical informatics, and its articles reflect the wide range of topics in informatics. In this study, we retrieved Medical Subject Headings (MeSH) terms and citations of JAMIA articles published between 2009 and 2014. We used tensors (i.e., multidimensional arrays) to represent the interaction among topics, time, and citations, and applied tensor decomposition to automate the analysis. The trends represented by the tensors were then carefully interpreted, and the results were compared with previous findings based on manual topic analysis. A list of the most cited JAMIA articles, their topics, and publication trends over recent years is presented. The analyses confirmed previous studies and showed that, from 2012 to 2014, the number of articles related to the MeSH terms Methods, Organization & Administration, and Algorithms increased significantly in both number of publications and citations. Citation trends varied widely by topic, with Natural Language Processing having a large number of citations in particular years and "Medical Record Systems, Computerized" remaining a very popular topic in all years.
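A minimal sketch of factorizing a (MeSH topic x year x citation-count bin) tensor with CP decomposition to surface trends, in the spirit of the analysis above; the tensor contents here are random and the rank and dimensions are assumptions:

```python
# Sketch: CP (PARAFAC) decomposition of a topic x year x citation-bin count tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

counts = np.random.poisson(3.0, size=(50, 6, 4)).astype(float)  # topics x years x bins (toy data)
cp = parafac(tl.tensor(counts), rank=5)                         # rank is an assumed choice
topic_factors, year_factors, citation_factors = cp.factors
print(topic_factors.shape, year_factors.shape, citation_factors.shape)  # (50, 5) (6, 5) (4, 5)
```

Each of the five components couples a group of topics with a temporal profile and a citation profile, which is what makes the trend interpretation largely automatic.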
Affiliation(s)
- Dong Han
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA; School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135, USA
- Shuang Wang
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Chao Jiang
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA; School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135, USA
- Xiaoqian Jiang
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Hyeon-Eui Kim
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Jimeng Sun
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30313, USA
- Lucila Ohno-Machado
- Health System Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
22. Velupillai S, Mowery D, South BR, Kvist M, Dalianis H. Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis. Yearb Med Inform 2015; 10:183-93. PMID: 26293867; PMCID: PMC4587060; DOI: 10.15265/iy-2015-009.
Abstract
OBJECTIVES We present a review of recent advances in clinical Natural Language Processing (NLP), with a focus on semantic analysis and key subtasks that support such analysis. METHODS We conducted a literature review of clinical NLP research from 2008 to 2014, emphasizing recent publications (2012-2014), based on PubMed and ACL proceedings as well as relevant referenced publications from the included papers. RESULTS Significant articles published within this time-span were included and are discussed from the perspective of semantic analysis. Three key clinical NLP subtasks that enable such analysis were identified: 1) developing more efficient methods for corpus creation (annotation and de-identification), 2) generating building blocks for extracting meaning (morphological, syntactic, and semantic subtasks), and 3) leveraging NLP for clinical utility (NLP applications and infrastructure for clinical use cases). Finally, we provide a reflection upon most recent developments and potential areas of future NLP development and applications. CONCLUSIONS There has been an increase of advances within key NLP subtasks that support semantic analysis. Performance of NLP semantic analysis is, in many cases, close to that of agreement between humans. The creation and release of corpora annotated with complex semantic information models has greatly supported the development of new tools and approaches. Research on non-English languages is continuously growing. NLP methods have sometimes been successfully employed in real-world clinical tasks. However, there is still a gap between the development of advanced resources and their utilization in clinical settings. A plethora of new clinical use cases are emerging due to established health care initiatives and additional patient-generated sources through the extensive use of social media and other devices.
Affiliation(s)
- Sumithra Velupillai
- Department of Computer and Systems Sciences, Stockholm University, Postbox 7003, 164 07 Kista, Sweden