1
Argüello-González G, Aquino-Esperanza J, Salvador D, Bretón-Romero R, Del Río-Bermudez C, Tello J, Menke S. Negation recognition in clinical natural language processing using a combination of the NegEx algorithm and a convolutional neural network. BMC Med Inform Decis Mak 2023; 23:216. PMID: 37833661; PMCID: PMC10576331; DOI: 10.1186/s12911-023-02301-5.
Abstract
BACKGROUND Important clinical information about patients resides in the unstructured free-text fields of Electronic Health Records (EHRs). While this information can be extracted using clinical Natural Language Processing (cNLP), recognizing negation modifiers remains an important challenge. A wide range of cNLP applications has been developed to detect the negation of medical entities in clinical free text; however, effective solutions for languages other than English are scarce. This study aimed to develop a solution for negation recognition in Spanish EHRs based on a combination of a customized rule-based NegEx layer and a convolutional neural network (CNN). METHODS Based on our previous experience in real-world evidence (RWE) studies using information embedded in EHRs, negation recognition was simplified into a binary problem ('affirmative' vs. 'non-affirmative'). For the NegEx layer, negation rules were obtained from a publicly available Spanish corpus and enriched with custom ones; the CNN binary classifier was trained on EHRs annotated for clinical named entities (cNEs) and negation markers by medical doctors. RESULTS The proposed negation recognition pipeline obtained precision, recall, and F1-score of 0.93, 0.94, and 0.94 for the 'affirmative' class and 0.86, 0.84, and 0.85 for the 'non-affirmative' class, respectively. To validate the generalization capability of our methodology, we applied the pipeline to EHRs (6,710 cNEs) drawn from a different data distribution than the training corpus and obtained consistent performance for both classes (precision, recall, and F1-score of 0.95, 0.97, and 0.96 for 'affirmative'; 0.90, 0.83, and 0.86 for 'non-affirmative'). Lastly, we evaluated the pipeline against two publicly available Spanish negation corpora, IULA and NUBes, obtaining state-of-the-art metrics (1.00, 0.99, and 0.99; and 1.00, 0.93, and 0.96 for precision, recall, and F1-score, respectively). CONCLUSION Negation is a source of low precision in the retrieval of cNEs from EHR free text. Combining a customized rule-based NegEx layer with a CNN binary classifier outperformed many current approaches. RWE studies benefit greatly from correct negation recognition, as it reduces false-positive cNE detections that would otherwise undermine the credibility of cNLP systems.
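The rule-based layer in this pipeline builds on the NegEx idea of matching negation trigger phrases within a token window around a clinical entity. A minimal sketch of that idea, using the paper's binary 'affirmative'/'non-affirmative' scheme (the Spanish triggers and window size below are illustrative assumptions, not the authors' actual rule set):

```python
import re

# Illustrative Spanish pre-negation triggers; the paper derives its rules
# from a public Spanish corpus plus custom additions.
PRE_NEG_TRIGGERS = ["no", "sin", "niega", "descarta", "ausencia de"]
WINDOW = 5  # tokens before the entity considered to be in scope

def negex_label(tokens, entity_idx):
    """Return 'non-affirmative' if a trigger occurs within WINDOW tokens
    before the entity, else 'affirmative' (binary scheme as in the paper)."""
    start = max(0, entity_idx - WINDOW)
    context = " ".join(tokens[start:entity_idx]).lower()
    for trigger in PRE_NEG_TRIGGERS:
        if re.search(r"\b" + re.escape(trigger) + r"\b", context):
            return "non-affirmative"
    return "affirmative"

tokens = "el paciente niega dolor toracico".split()
print(negex_label(tokens, tokens.index("dolor")))  # non-affirmative
```

In the paper's design, cases this rule layer cannot resolve are handed to the CNN classifier rather than decided by rules alone.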
Affiliation(s)
- Guillermo Argüello-González
- MedSavana SL, Madrid, 28004, Spain
- Statistics and Operations Research, University of Oviedo, Oviedo, 33003, Spain
- José Aquino-Esperanza
- MedSavana SL, Madrid, 28004, Spain
- Faculty of Medicine and Health Sciences, University of Barcelona, Barcelona, 08007, Spain
2
Zhao G, Gu W, Cai W, Zhao Z, Zhang X, Liu J. MLEE: A method for extracting object-level medical knowledge graph entities from Chinese clinical records. Front Genet 2022; 13:900242. PMID: 35938002; PMCID: PMC9354090; DOI: 10.3389/fgene.2022.900242.
Abstract
As a typical knowledge-intensive industry, medicine uses knowledge graph technology to support causal inference, for example over "symptom-disease", "laboratory examination/imaging examination-disease", and "disease-treatment method" relations. The continuous expansion of electronic clinical records provides an opportunity to learn medical knowledge through machine learning. In this process, how to extract entities with a medical logic structure, and how to make entity extraction consistent with the logic of the text in electronic clinical records, are two key issues in building a high-quality medical knowledge graph. In this work, we describe a method for extracting medical entities from real Chinese electronic clinical records. We define a computational architecture named MLEE to extract object-level entities with "object-attribute" dependencies. We conducted experiments on randomly selected electronic clinical records of 1,000 patients from Shengjing Hospital of China Medical University to verify the effectiveness of the method.
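The "object-attribute" dependency described above can be pictured with a simple data model in which attribute entities are attached to their governing object entity; the field names and attribute types here are hypothetical, not MLEE's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectEntity:
    """An object-level entity grouping its dependent attribute entities,
    e.g. a symptom together with its body site and duration."""
    text: str
    etype: str  # e.g. "symptom", "examination", "treatment"
    attributes: dict = field(default_factory=dict)

# Flat NER would yield "chest pain", "chest", and "3 days" as unrelated
# entities; object-level extraction attaches the attributes to the object.
chest_pain = ObjectEntity("chest pain", "symptom",
                          {"body_site": "chest", "duration": "3 days"})
print(chest_pain.attributes["duration"])  # 3 days
```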
Affiliation(s)
- Genghong Zhao
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Neusoft Research of Intelligent Healthcare Technology, Shenyang, China
- *Correspondence: Genghong Zhao, ; Xia Zhang, ; Jiren Liu,
- Wenjian Gu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Wei Cai
- Neusoft Research of Intelligent Healthcare Technology, Shenyang, China
- Zhiying Zhao
- Department of Clinical Epidemiology, Shengjing Hospital of China Medical University, Shenyang, China
- Xia Zhang
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Neusoft Research of Intelligent Healthcare Technology, Shenyang, China
- Jiren Liu
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Neusoft Corporation, Shenyang, China
3
Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications. Appl Sci (Basel) 2022. DOI: 10.3390/app12105209.
Abstract
Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications such as opinion mining and information retrieval, especially on biomedical data. In this article, we review corpora annotated with negation and speculation across natural languages and domains. Furthermore, we discuss ongoing research into recent rule-based, supervised, and transfer learning techniques for detecting negated and speculative content. Many English corpora for various domains are now annotated with negation and speculation, and the availability of annotated corpora in other languages has started to increase; however, this growth is insufficient to address these important phenomena in low-resource languages. Cross-lingual models and translation from well-resourced languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of existing techniques, and we suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate limitations of existing techniques, such as cue ambiguity and the detection of discontinuous scopes. In some NLP applications, including a negation- and speculation-aware component improves performance, yet this aspect is still often not addressed or not considered an essential step.
4
Enhanced sentimental analysis using visual geometry group network-based deep learning approach. Soft Comput 2021. DOI: 10.1007/s00500-021-05890-3.
5
Yang S, Zheng X, Xiao Y, Yin X, Pang J, Mao H, Wei W, Zhang W, Yang Y, Xu H, Li M, Zhao D. Improving Chinese electronic medical record retrieval by field weight assignment, negation detection, and re-ranking. J Biomed Inform 2021; 119:103836. PMID: 34116253; DOI: 10.1016/j.jbi.2021.103836.
Abstract
Information retrieval techniques are widely used in electronic medical record (EMR) systems. However, most existing methods do not consider the structure and language features of Chinese EMRs, which limits retrieval performance. To improve accuracy and comprehensiveness, we propose an improved algorithm for Chinese EMR retrieval. First, the weights of fields in Chinese EMRs are assigned according to their importance in clinical applications. Second, negative relations in EMRs are detected, and the retrieval scores of negated terms are adjusted accordingly. Third, the retrieval results are re-ranked using expansion terms and time information to enhance recall without decreasing precision. Experimental results show that the improved algorithm increases precision and recall significantly, indicating that it takes full account of the characteristics of Chinese EMRs and fits the needs of clinical applications.
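The first two steps of the algorithm, weighting fields by clinical importance and down-weighting negated matches, can be sketched as a scoring function; the field names, weights, and negation penalty below are illustrative placeholders, not the values tuned in the paper:

```python
# Illustrative field weights: fields more important in clinical practice
# (e.g. chief complaint) score higher than general history.
FIELD_WEIGHTS = {"chief_complaint": 3.0, "diagnosis": 2.0, "history": 1.0}
NEGATION_PENALTY = -1.0  # matches on negated terms are down-weighted

def score_record(record, query_terms, negated_terms):
    """Sum weighted per-field matches, penalizing terms flagged as negated.

    record: {field_name: text}; negated_terms: {field_name: set of terms}.
    """
    score = 0.0
    for fname, text in record.items():
        weight = FIELD_WEIGHTS.get(fname, 1.0)
        for term in query_terms:
            if term in text:
                negated = term in negated_terms.get(fname, set())
                score += NEGATION_PENALTY if negated else weight
    return score

rec = {"chief_complaint": "fever and cough", "history": "denies chest pain"}
# "fever" matches the heavily weighted chief complaint (+3.0);
# "chest pain" matches only as a negated mention in history (-1.0).
print(score_record(rec, ["fever", "chest pain"], {"history": {"chest pain"}}))  # 2.0
```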
Affiliation(s)
- Songchun Yang
- Academy of Military Medical Sciences, Beijing 100850, China.
- Xiangwen Zheng
- Academy of Military Medical Sciences, Beijing 100850, China.
- Yu Xiao
- Academy of Military Medical Sciences, Beijing 100850, China.
- Xiangfei Yin
- Academy of Military Medical Sciences, Beijing 100850, China; Sansha People's Hospital, Sansha 573199, China.
- Jianfei Pang
- Academy of Military Medical Sciences, Beijing 100850, China.
- Huajian Mao
- Academy of Military Medical Sciences, Beijing 100850, China.
- Wei Wei
- PLA 960th Hospital, Jinan 250031, China.
- Yu Yang
- Academy of Military Medical Sciences, Beijing 100850, China.
- Haifeng Xu
- Academy of Military Medical Sciences, Beijing 100850, China; General Hospital of Xinjiang Military Region, Urumchi 830000, China.
- Mei Li
- China Stroke Data Center, Beijing 100101, China.
- Dongsheng Zhao
- Academy of Military Medical Sciences, Beijing 100850, China.
6
Yang Y, Huo H, Jiang J, Sun X, Guan Y, Guo X, Wan X, Liu S. Clinical decision-making framework against over-testing based on modeling implicit evaluation criteria. J Biomed Inform 2021; 119:103823. PMID: 34044155; DOI: 10.1016/j.jbi.2021.103823.
Abstract
Various statistical methods encode subjective criteria that can help prevent over-testing. However, no unified framework with generalized, objective criteria across diseases is available for determining the appropriateness of diagnostic tests recommended by doctors. We present a clinical decision-making framework against over-testing based on modeling implicit evaluation criteria (CDFO-MIEC). CDFO-MIEC quantifies the subjective evaluation process using statistics-based methods to identify over-testing, and determines a test's appropriateness from entities obtained via named entity recognition and entity alignment. More specifically, implicit evaluation criteria are defined, namely the correlation among diagnostic tests, symptoms, and diseases, a confirmation function, and an exclusion function. Four evaluation strategies are implemented to model these criteria using statistical methods, including the multi-label k-nearest neighbor and conditional probability algorithms. Finally, the strategies are combined in a classification and regression tree to make the final decision. CDFO-MIEC also provides interpretability through the decision conditions supporting each over-testing decision. We tested CDFO-MIEC on 2,860 clinical texts from a single respiratory medicine department in China, with appropriateness confirmed by physicians; the dataset was supplemented with random inappropriate tests. The proposed framework outperformed the best competing text classification methods with a Mean_F1 of 0.9167 in classifying tests as appropriate or inappropriate. The four evaluation strategies captured the relevant features effectively and proved essential. The proposed CDFO-MIEC is therefore feasible: it exhibits high performance and can help prevent over-testing.
Affiliation(s)
- Yang Yang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Hongxing Huo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Jingchi Jiang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Xuemei Sun
- Hospital of Harbin Institute of Technology, Harbin 150003, China
- Yi Guan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.
- Xitong Guo
- School of Management, Harbin Institute of Technology, Harbin 150001, China
- Xiang Wan
- Shenzhen Research Institute of Big Data, Shenzhen 518000, China
- Shengping Liu
- Unisound AI Technology Co., Ltd, Beijing 100083, China
7
Rivera Zavala R, Martinez P. The Impact of Pretrained Language Models on Negation and Speculation Detection in Cross-Lingual Medical Text: Comparative Study. JMIR Med Inform 2020; 8:e18953. PMID: 33270027; PMCID: PMC7746498; DOI: 10.2196/18953.
Abstract
Background: Negation and speculation are critical elements in natural language processing (NLP) tasks such as information extraction, as these phenomena change the truth value of a proposition. In informal clinical narrative, these linguistic devices are used extensively to indicate hypotheses, impressions, or negative findings. Previous state-of-the-art approaches addressed negation and speculation detection with rule-based methods, but in recent years models based on machine learning and deep learning, exploiting morphological, syntactic, and semantic features represented as sparse and dense vectors, have emerged. However, although such named entity recognition (NER) methods employ a broad set of features, they are limited by the pretrained models available for a specific domain or language. Objective: As a fundamental subsystem of any information extraction pipeline, a system for cross-lingual and domain-independent negation and speculation detection was introduced, with a special focus on the biomedical scientific literature and clinical narrative. Detection of negation and speculation was treated as a sequence-labeling task in which cues and the scopes of both phenomena are recognized as a sequence of nested labels in a single step. Methods: We proposed two approaches for negation and speculation detection: (1) a bidirectional long short-term memory (Bi-LSTM) network with a conditional random field, using character, word, and sense embeddings to extract semantic, syntactic, and contextual patterns, and (2) bidirectional encoder representations from transformers (BERT) fine-tuned for NER.
Results: The approach was evaluated for English and Spanish on biomedical and review text, in particular the BioScope corpus, the IULA corpus, and the SFU Spanish Review corpus, with F-measures of 86.6%, 85.0%, and 88.1%, respectively, for NeuroNER and 86.4%, 80.8%, and 91.7%, respectively, for BERT. Conclusions: These results show that these architectures perform considerably better than previous rule-based and conventional machine learning-based systems. Moreover, our analysis shows that pretrained word embeddings, and particularly contextualized embeddings for biomedical corpora, help capture the complexities inherent to biomedical text.
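The single-step sequence labeling described above, in which cues and scopes are recognized jointly, can be illustrated with BIO-style tags that encode both roles; the tag names and toy sentence are illustrative, not the paper's actual label set:

```python
# One label per token; negation cue and scope are recognized in one pass.
tokens = ["No", "signs", "of", "pneumonia", "."]
labels = ["B-NEG-CUE", "B-NEG-SCOPE", "I-NEG-SCOPE", "I-NEG-SCOPE", "O"]

def extract_spans(tokens, labels, kind):
    """Collect contiguous B-/I- spans of the given label kind."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == f"B-{kind}":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == f"I-{kind}" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

print(extract_spans(tokens, labels, "NEG-SCOPE"))  # ['signs of pneumonia']
print(extract_spans(tokens, labels, "NEG-CUE"))    # ['No']
```

A Bi-LSTM-CRF or fine-tuned BERT tagger would predict such a label sequence per token; the span decoding step is the same either way.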
Affiliation(s)
- Renzo Rivera Zavala
- Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain; Department of Computer Science and Engineering, Universidad Católica de Santa Maria, Arequipa, Peru
- Paloma Martinez
- Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain
8
Santiso S, Pérez A, Casillas A, Oronoz M. Neural negated entity recognition in Spanish electronic health records. J Biomed Inform 2020; 105:103419. PMID: 32298847; DOI: 10.1016/j.jbi.2020.103419.
Abstract
This work deals with negation detection in clinical texts. Negation detection is a key task for decision support systems, since detecting negated events (the absence of certain events) helps ascertain current medical conditions. For artificial intelligence, negation detection is valuable because negation can invert the meaning of part of a text and thereby influence other tasks, such as medical dosage adjustment and the detection of adverse drug reactions or hospital-acquired diseases. We focus on negated medical events such as disorders, findings, and allergies; following Natural Language Processing (NLP) terminology, we refer to them as negated medical entities. A novelty of this work is that we approached the task as Named Entity Recognition (NER) with the restriction that only negated medical entities must be recognized (in an attempt to help distinguish them from non-negated ones). Our study is driven by Electronic Health Records (EHRs) written in Spanish, where lexical variability (alternative medical forms, abbreviations, etc.) is a challenge to cope with. To this end, we employed a deep learning approach: the system combines character embeddings to cope with out-of-vocabulary (OOV) words, Long Short-Term Memory (LSTM) networks to model contextual representations, and Conditional Random Fields (CRF) to classify each medical entity as negated or not given the contextual dense representation. Moreover, we explored embeddings created from both words and lemmas; the best results were obtained with the lemmatized embeddings, which apparently reinforced the capability of the LSTMs to cope with the high lexical variability. The F-measure was 65.1 for exact match and 82.4 for partial match.
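The exact-match and partial-match figures reported above correspond to two span-matching criteria commonly used in NER evaluation. A minimal sketch of the difference, using character-offset spans and made-up data:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def match_counts(gold, pred, partial=False):
    """Precision and recall of predicted (start, end) spans against gold
    spans, under exact equality or, if partial=True, any character overlap."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    def hit(p):
        return any(overlaps(p, g) if partial else p == g for g in gold)
    tp = sum(1 for p in pred if hit(p))
    return tp / len(pred), tp / len(gold)

gold = [(0, 10), (20, 28)]
pred = [(0, 10), (21, 28)]  # second prediction misses one character
print(f1(*match_counts(gold, pred)))                # exact match: 0.5
print(f1(*match_counts(gold, pred, partial=True)))  # partial match: 1.0
```

The gap between the paper's exact-match (65.1) and partial-match (82.4) scores indicates that many predicted entities overlap the gold span without matching its boundaries exactly.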
Affiliation(s)
- Sara Santiso
- IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain.
- Alicia Pérez
- IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain.
- Arantza Casillas
- IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain.
- Maite Oronoz
- IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain.
9
Su J, Hu J, Jiang J, Xie J, Yang Y, He B, Yang J, Guan Y. Extraction of risk factors for cardiovascular diseases from Chinese electronic medical records. Comput Methods Programs Biomed 2019; 172:1-10. PMID: 30902121; DOI: 10.1016/j.cmpb.2019.01.007.
Abstract
BACKGROUND AND OBJECTIVE Early prevention of cardiovascular diseases (CVDs) can effectively avert later loss of health, and detecting CVD risk factors is a simple way to achieve early prevention. Personal health records play a prominent role in health information extraction because of their factuality and reliability. This study describes how to extract risk factors for CVDs from Chinese electronic medical records (CEMRs). METHODS The extraction process involves two tasks: (a) CVD risk factor recognition and (b) risk factor time and assertion classification. We treated risk factor recognition as a named entity recognition (NER) task and time and assertion classification as a text classification task, and developed an information extraction pipeline consisting of NER and text classification modules built with machine learning models. In the risk factor recognition module, a bidirectional long short-term memory (BLSTM) network with additional risk factor textual features was built; for time and assertion classification, convolutional neural networks (CNNs) with risk factor type and section label inputs, as well as a support vector machine (SVM), were built. RESULTS We achieved F1 values of 0.9609 for risk factor recognition and 0.9812 and 0.9612 for time and assertion classification, respectively. The experimental results show that our system achieves high performance and can extract risk factors from CEMRs efficiently. CONCLUSIONS The proposed system is the first for extracting CVD risk factors from CEMRs and is competitive with risk factor extraction systems developed on English EMRs. Its good performance should have a strong influence on CVD prevention.
Affiliation(s)
- Jia Su
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jinpeng Hu
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jingchi Jiang
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jing Xie
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Yang Yang
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Bin He
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jinfeng Yang
- School of Software, Harbin University of Science and Technology, Harbin, Heilongjiang, China
- Yi Guan
- Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China.
10
Santiso S, Casillas A, Pérez A, Oronoz M. Word embeddings for negation detection in health records written in Spanish. Soft Comput 2018. DOI: 10.1007/s00500-018-3650-7.
11
Névéol A, Zweigenbaum P. Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook. Yearb Med Inform 2018; 27:193-198. PMID: 30157523; PMCID: PMC6115241; DOI: 10.1055/s-0038-1667080.
Abstract
Objectives:
To summarize recent research and present a selection of the best papers published in 2017 in the field of clinical Natural Language Processing (NLP).
Methods:
A survey of the literature was performed by the two editors of the NLP section of the International Medical Informatics Association (IMIA) Yearbook. Bibliographic databases PubMed and Association of Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed based on title and abstract. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to come to the three best clinical NLP papers for 2017.
Results:
Clinical NLP best papers provide contributions ranging from methodological studies to the application of research results in practical clinical settings. They draw on text genres as diverse as clinical narratives across hospitals and languages, as well as social media.
Conclusions:
Clinical NLP continued to thrive in 2017, with an increasing number of contributions towards applications compared to fundamental methods. Methodological work explores deep learning and system adaptation across language variants. Research results continue to translate into freely available tools and corpora, mainly for the English language.
12
Manimaran J, Velmurugan T. Evaluation of lexicon- and syntax-based negation detection algorithms using clinical text data. Bio-Algorithms and Med-Systems 2017. DOI: 10.1515/bams-2017-0016.
Abstract
Background: Clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing (NLP) system. In recent cTAKES modules, a negation detection (ND) algorithm is used to improve annotation capabilities and simplify the automatic identification of negative context in large clinical documents. In this research, two types of ND algorithms, lexicon-based and syntax-based, are analyzed using a database made openly available by the National Center for Biomedical Computing. The aim of this analysis is to find the pros and cons of these algorithms. Methods: Patient medical reports were collected from the three institutions included in the 2010 i2b2/VA Clinical NLP Challenge; this database, which includes patient discharge summaries and progress notes, is the input data for the analysis. The patient data are fed into five ND algorithms: NegEx, ConText, pyConTextNLP, DEEPEN, and Negation Resolution (NR). NegEx, ConText, and pyConTextNLP are lexicon-based, whereas DEEPEN and NR are syntax-based. The results from the five ND algorithms are post-processed and compared with the annotated data. Finally, the performance of the ND algorithms is evaluated by computing standard measures, including F-measure, kappa statistics, and ROC, among others, as well as the execution time of each algorithm. Results: The research is validated through practical implementation, based on the accuracy of each algorithm's results and its computational time, to identify a robust and reliable ND algorithm. Conclusions: The performance of the chosen ND algorithms is analyzed based on the results produced by this research approach; the time and accuracy of each algorithm are calculated and compared to suggest the best method.
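A comparison like the one described, scoring each detector on both accuracy and execution time, can be sketched with a small harness; the stand-in detector and toy data below are illustrative, not the actual systems (NegEx, ConText, pyConTextNLP, DEEPEN, NR) or the i2b2/VA corpus:

```python
import time

def negex_like(sentence):
    """Stand-in lexicon-based detector: flags a sentence as containing
    negation if any cue phrase appears (illustrative cue list)."""
    return any(cue in sentence.lower() for cue in ("no ", "denies", "without"))

def evaluate(detector, data):
    """Return (F-measure, elapsed seconds) for a sentence-level detector
    over (sentence, is_negated) pairs."""
    start = time.perf_counter()
    tp = fp = fn = 0
    for sentence, negated in data:
        pred = detector(sentence)
        tp += pred and negated
        fp += pred and not negated
        fn += (not pred) and negated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f, time.perf_counter() - start

data = [("Patient denies fever.", True), ("Cough present.", False)]
f, seconds = evaluate(negex_like, data)
print(round(f, 2))  # 1.0
```

Running the same `evaluate` over each of the five detectors on the shared corpus would yield the accuracy/runtime table the paper compares.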
13
Jian Z, Guo X, Liu S, Ma H, Zhang S, Zhang R, Lei J. A cascaded approach for Chinese clinical text de-identification with less annotation effort. J Biomed Inform 2017; 73:76-83. PMID: 28756160; PMCID: PMC5583002; DOI: 10.1016/j.jbi.2017.07.017.
Abstract
With the rapid adoption of Electronic Health Records (EHRs) in China, an increasing amount of clinical data has become available to support clinical research. Secondary use of clinical data usually requires de-identification of personal information to protect patient privacy. Since manual de-identification of free clinical text requires a significant amount of human work, developing an automated de-identification system is necessary. While many de-identification systems are available for English clinical text, designing one for Chinese clinical text faces challenges such as the unavailability of necessary lexical resources and the sparsity of patient health information (PHI) in Chinese clinical text. In this paper, we designed a de-identification pipeline taking advantage of both rule-based and machine learning techniques. In particular, our method can effectively construct a dataset with dense PHI, which significantly reduces annotation time for subsequent supervised learning. We experimented on a dataset of 3,000 heterogeneous clinical documents to evaluate the annotation cost and the de-identification performance. Our approach increases the efficiency of the annotation effort by over 60% while reaching performance over 90% as measured by F-score, demonstrating that combining rule-based and machine learning techniques is an effective way to reduce annotation cost and achieve high performance in Chinese clinical text de-identification.
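The rule-based half of such a cascade is typically a set of pattern matchers for high-density, regular PHI types, leaving sparser, context-dependent PHI to the learned model. The regular expressions below are illustrative examples, not the paper's rule set:

```python
import re

# Illustrative rules for two dense PHI types; sparser, context-dependent
# PHI would be left to the machine learning stage of the cascade.
PHI_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[- ]?\d{4}[- ]?\d{4}\b"),
    "id_number": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-char ID, illustrative
}

def deidentify(text):
    """Replace rule-matched PHI with bracketed type placeholders."""
    for phi_type, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{phi_type.upper()}]", text)
    return text

print(deidentify("Contact: 138-1234-5678"))  # Contact: [PHONE]
```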
Affiliation(s)
- Zhe Jian
- Department of Medical Informatics, Harbin Medical University, Harbin, China
- Xusheng Guo
- Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Shijian Liu
- Shanghai Children's Medical Center, Shanghai, China
- Handong Ma
- Synyi Co. Ltd., Shanghai, China; Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China
- Shaodian Zhang
- Synyi Co. Ltd., Shanghai, China; Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China
- Rui Zhang
- Institute for Health Informatics, College of Pharmacy, University of Minnesota, Minneapolis, MN, USA
- Jianbo Lei
- Center for Medical Informatics, Peking University, Beijing, China; School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, Sichuan, PR China.
14
A Novel Approach towards Medical Entity Recognition in Chinese Clinical Text. J Healthc Eng 2017; 2017:4898963. PMID: 29065612; PMCID: PMC5516712; DOI: 10.1155/2017/4898963.
Abstract
Medical entity recognition, a basic task in the language processing of clinical data, has been extensively studied for admission notes in alphabetic languages such as English. However, much less work has been done on unstructured texts written in Chinese, or on differentiating Chinese drug names between traditional Chinese medicine and Western medicine. Here, we propose a novel cascade-type Chinese medication entity recognition approach that integrates a sentence category classifier based on a support vector machine with conditional random field-based medication entity recognition. We hypothesized that this approach could avoid the side effects of abundant negative samples and improve the performance of named entity recognition from admission notes written in Chinese. We applied the approach to a test set of 324 Chinese-written admission notes manually annotated by medical experts. It achieved 94.2% precision, 92.8% recall, and 93.5% F-measure for recognizing traditional Chinese medicine drug names, and 91.2% precision, 92.6% recall, and 91.7% F-measure for recognizing Western medicine drug names. The differences in F-measure were significant compared with the baseline systems.