1. Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023; 143:104405. PMID: 37270143; PMCID: PMC10528083; DOI: 10.1016/j.jbi.2023.104405. Received 11/08/2022; revised 05/18/2023; accepted 05/21/2023.
Abstract
BACKGROUND Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, enabling topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge base yet exists to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition.

RESULTS We present the first ignorance-base, a knowledge base created by combining classifiers that recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health uncovered three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements; these were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). Researchers could look to the field of neuroscience for potential answers to the ignorance statements.

CONCLUSION Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
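The enrichment search described above boils down to asking whether a concept appears among ignorance statements more often than chance would predict. A minimal sketch of such a test, a one-sided hypergeometric test over sentence counts with invented argument names (the paper's exact statistic may differ):

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k): seeing k or more
    sentences mentioning a concept among n ignorance statements, given K
    mentions across all N sentences in the corpus. Illustrative only."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)
```

A small p-value suggests the concept is over-represented in ignorance statements relative to the corpus as a whole.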
Affiliations
- Mayla R Boguslav: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA
- Nourah M Salem: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA
- Elizabeth K White: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, CO 80206, USA
- Katherine J Sullivan: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA
- Michael Bada: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA
- Teri L Hernandez: College of Nursing, Department of Medicine/Division of Endocrinology, Metabolism, & Diabetes, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA
- Sonia M Leach: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, CO 80206, USA
- Lawrence E Hunter: Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, CO 80045, USA
2. Zhao Y, Ren B, Yu W, Zhang H, Zhao D, Lv J, Xie Z, Jiang K, Shang L, Yao H, Xu Y, Zhao G. Construction of an Assisted Model Based on Natural Language Processing for Automatic Early Diagnosis of Autoimmune Encephalitis. Neurol Ther 2022; 11:1117-1134. PMID: 35543808; PMCID: PMC9338198; DOI: 10.1007/s40120-022-00355-7. Received 03/09/2022; accepted 04/07/2022. Open access.
Abstract
Introduction Early diagnosis and etiological treatment can effectively improve the prognosis of patients with autoimmune encephalitis (AE). However, the anti-neuronal antibody tests that provide the definitive diagnosis take time and are not always positive. Using natural language processing (NLP), our study proposes an assisted diagnostic method for early clinical diagnosis of AE and compares its sensitivity with that of previously established criteria.

Methods Our model is based on text classification trained on the history of present illness (HPI) in electronic medical records (EMRs) with a definite diagnosis of AE or infectious encephalitis (IE). The definitive diagnosis of IE was based on the results of traditional etiological examinations. The definitive diagnosis of AE was based on the results of neuronal antibody tests, and the diagnostic criteria for definite autoimmune limbic encephalitis proposed by Graus et al. were used as the reference standard for antibody-negative AE. First, we automatically recognized and extracted symptoms from all HPI texts in the EMRs by training on a dataset of 552 cases. Second, four text classification models trained on a dataset of 199 cases were established for differential diagnosis of AE and IE, based on a post-structured text dataset of every HPI built from the extracted symptoms, rendered in English after synonym normalization. The optimal model was identified by evaluating and comparing the performance of the four models. Finally, combining three typical symptoms and the results of standard paraclinical tests such as cerebrospinal fluid (CSF) analysis, magnetic resonance imaging (MRI), or electroencephalography (EEG) from the Graus criteria, an assisted early diagnostic model for AE was built on top of the best-performing text classification model.

Results On the independent testing dataset, the naïve Bayesian classifier with bag of words achieved the best performance of the four models, with an area under the receiver operating characteristic curve of 0.85, accuracy of 84.5% (95% confidence interval [CI] 74.0–92.0%), sensitivity of 86.7% (95% CI 69.3–96.2%), and specificity of 82.9% (95% CI 67.9–92.8%). Compared with previously proposed diagnostic criteria, the early diagnostic sensitivity for possible AE using the assisted diagnostic model on the independent testing dataset improved from 73.3% (95% CI 54.1–87.7%) to 86.7% (95% CI 69.3–96.2%).

Conclusions Compared with previous diagnostic criteria, the assisted diagnostic model effectively increased early diagnostic sensitivity for AE. It can assist physicians in establishing the diagnosis of AE automatically after they input the HPI and the results of standard paraclinical tests according to their own narrative habits for describing symptoms, avoiding misdiagnosis and allowing prompt initiation of specific treatment.

Supplementary Information The online version contains supplementary material available at 10.1007/s40120-022-00355-7.
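The best-performing model, a naïve Bayes classifier over bag-of-words features, is simple enough to sketch from scratch. The toy symptom texts and class labels below are invented for illustration and are not the study's data:

```python
from collections import Counter
from math import log

class BagOfWordsNB:
    """Tiny multinomial naive Bayes over whitespace-tokenized text with
    Laplace smoothing; an illustrative stand-in for the paper's classifier."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, text):
        def log_posterior(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return log(self.prior[c]) + sum(
                log((self.counts[c][w] + 1) / total)
                for w in text.lower().split())
        return max(self.classes, key=log_posterior)
```

Training on a handful of invented HPI-style snippets labeled "AE" or "IE" and calling `predict` on a new snippet returns the class with the higher smoothed log-posterior.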
Affiliations
- Yunsong Zhao: Department of Neurology, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Bin Ren: Department of Information, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Wenjin Yu: Department of Neurology, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Haijun Zhang: Department of Neurology, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Di Zhao: Department of Neurology, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Junchao Lv: Department of Neurology, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Zhen Xie: College of Life Sciences and Medicine, Northwest University, Xi'an, China
- Kun Jiang: Department of Information, Xijing Hospital, Fourth Military Medical University, Xi'an, China
- Lei Shang: Department of Health Statistics, Fourth Military Medical University, Xi'an, China
- Han Yao: Department of Neurobiology, School of Basic Medicine, Fourth Military Medical University, Xi'an, China
- Yongyong Xu: College of Life Sciences and Medicine, Northwest University, Xi'an, China
- Gang Zhao: Department of Neurology, Xijing Hospital, Fourth Military Medical University, Xi'an, China; College of Life Sciences and Medicine, Northwest University, Xi'an, China
3. Solarte Pabón O, Montenegro O, Torrente M, Rodríguez González A, Provencio M, Menasalvas E. Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach. PeerJ Comput Sci 2022; 8:e913. PMID: 35494817; PMCID: PMC9044225; DOI: 10.7717/peerj-cs.913. Received 10/26/2021; accepted 02/10/2022.
Abstract
Detecting negation and uncertainty is crucial for medical text mining applications; otherwise, extracted information can be incorrectly presented as real or factual. Although several approaches have been proposed to detect negation and uncertainty in clinical texts, most efforts have focused on English. Most proposals developed for Spanish focus mainly on negation detection and do not deal with uncertainty. In this paper, we propose a deep learning-based approach for both negation and uncertainty detection in clinical texts written in Spanish. The approach explores two deep learning methods: (i) bidirectional long short-term memory with a conditional random field layer (BiLSTM-CRF) and (ii) Bidirectional Encoder Representations from Transformers (BERT). The approach was evaluated on NUBES and IULA, two public corpora for the Spanish language, and obtained F-scores of 92% and 80% on the scope recognition task for negation and uncertainty, respectively. We also present the results of a validation conducted on a real-life annotated dataset of clinical notes from cancer patients. The proposed approach shows the feasibility of deep learning-based methods for detecting negation and uncertainty in Spanish clinical texts. Experiments also highlighted that this approach improves performance on the scope recognition task compared to other proposals in the biomedical domain.
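Scope recognition with a BiLSTM-CRF or BERT tagger typically ends with a decoding step that turns per-token BIO labels into text spans. A minimal sketch of that step (the tag names are illustrative, not the NUBES/IULA tag set):

```python
def decode_scopes(tokens, tags):
    """Turn token-level BIO predictions (e.g. from a BiLSTM-CRF or BERT
    tagger) into (label, text) tuples, one per contiguous scope."""
    scopes, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                scopes.append((label, " ".join(tokens[start:i])))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return scopes
```

Given the tokens of "no evidence of tumor, possible infection" and the corresponding NEG/UNC tags, this yields one negation scope and one uncertainty scope.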
Affiliations
- Oswaldo Solarte Pabón: Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain; Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Cali, Colombia
- Orlando Montenegro: Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Cali, Colombia
- Ernestina Menasalvas: Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
4. Narayanan S, Achan P, Rangan PV, Rajan SP. Unified concept and assertion detection using contextual multi-task learning in a clinical decision support system. J Biomed Inform 2021; 122:103898. PMID: 34455090; DOI: 10.1016/j.jbi.2021.103898. Received 11/12/2020; revised 06/17/2021; accepted 08/23/2021.
Abstract
Assertions, such as negation and speculation, alter the meaning of clinical findings ('concepts') in electronic health records. Accurate assertion detection is vital to the identification of target findings in clinical decision support systems. Diverse clinical concepts and assertion modifiers embedded within longer sentences add to the challenge of error-free detection. Recent approaches leveraging biomedical contextual embeddings lead to standalone concept and assertion models that do not effectively utilize inter-task knowledge transfer. We propose a novel neural model integrating task-specific fine-tuning and multi-task learning in a coherent framework based on the hierarchical relationship between the tasks. We show that such a unified framework enhances both tasks on several real-world clinical note datasets (n2c2 2010, n2c2 2012, NegEx). Concept task performance improved by +1.69 F1 on n2c2 2010 and +2.96 F1 on n2c2 2012 over standalone baselines. Assertion recognition improved by +2.89 F1 and +3.77 F1, respectively. Negation detection under low-resource settings increased significantly (+2.4 F1, p-value = 3.11E-05, McNemar's test), demonstrating the impact of inter-task knowledge transfer. The integrated architecture also enhanced the generalization performance of speculation detection (+2.09 F1). To the best of our knowledge, this model is the first demonstration of a contextual multi-task system for unified detection of concepts and assertions in clinical decision support applications.
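The significance claim above rests on McNemar's test, which compares two classifiers on the same examples using only their discordant predictions. A small sketch (the counts in the test below are invented, not the paper's):

```python
from math import erfc, sqrt

def mcnemar(b, c, corrected=True):
    """McNemar's test from the discordant cells of a paired 2x2 table:
    b = examples only the first model got right, c = examples only the
    second model got right. Returns (chi-square statistic, two-sided p)."""
    num = (abs(b - c) - 1) ** 2 if corrected else (b - c) ** 2
    stat = num / (b + c)
    # p-value from the chi-square(1) survival function
    return stat, erfc(sqrt(stat / 2))
```

The larger the imbalance between the two discordant counts, the stronger the evidence that one model genuinely outperforms the other on the paired test set.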
Affiliations
- Sankaran Narayanan: Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amritapuri, India
- Pradeep Achan: Amrita Medical Solutions LLC, 10200 Crow Canyon Road, Castro Valley, CA, USA
- P Venkat Rangan: Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amritapuri, India
- Sreeranga P Rajan: Department of Computer Science, Stanford University, 353 Jane Stanford Way, Stanford, CA 94305, USA
5. Chen Q, Zhou X, Wu J, Zhou Y. Structuring electronic dental records through deep learning for a clinical decision support system. Health Informatics J 2021; 27:1460458220980036. PMID: 33446032; DOI: 10.1177/1460458220980036.
Abstract
Extracting information from unstructured clinical text is a fundamental and challenging task in medical informatics. Our study aims to construct a natural language processing (NLP) workflow to extract information from Chinese electronic dental records (EDRs) for clinical decision support systems (CDSSs). We extracted attributes, attribute values, and tooth positions from EDRs based on an existing ontology. A workflow integrating deep learning with keywords was constructed, in which the vectors representing texts were learned without supervision: we implemented Sentence2vec to learn sentence vectors and Word2vec to learn word vectors. For attribute recognition, we calculated similarity values among sentence vectors and extracted attributes based on our selection strategy. For attribute value recognition, we expanded the keyword database by calculating similarity values among word vectors to select keywords. The performance of our workflow with the hybrid method was evaluated and compared with the keyword-based method and the deep learning method. In both attribute and value recognition, the hybrid method outperforms the other two, achieving high precision (0.94, 0.94), recall (0.74, 0.82), and F-score (0.83, 0.88). Our NLP workflow can efficiently structure narrative text from EDRs, providing accurate input information and a solid foundation for further data-based CDSSs.
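The keyword-expansion step, selecting candidate words whose Word2vec embedding is close to a seed keyword's vector, can be sketched in a few lines. The vectors and threshold below are illustrative stand-ins; the authors' actual selection strategy may differ:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def expand_keywords(seed_vec, candidates, threshold=0.8):
    """Keep candidate words whose embedding is close enough to the seed
    keyword's vector. `candidates` maps word -> embedding; the threshold
    is an assumed cutoff, not the paper's tuned value."""
    return [w for w, v in candidates.items() if cosine(seed_vec, v) >= threshold]
```

With real Word2vec vectors, words passing the cutoff would be added to the keyword database used for attribute value recognition.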
Affiliations
- Qingxiao Chen: Peking University School and Hospital of Stomatology; National Engineering Laboratory for Digital and Material Technology of Stomatology, Peking University; National Clinical Research Center for Oral Diseases, PR China
- Ji Wu: Tsinghua University, PR China
- Yongsheng Zhou: Peking University School and Hospital of Stomatology, PR China
6. Enhanced sentimental analysis using visual geometry group network-based deep learning approach. Soft Comput 2021. DOI: 10.1007/s00500-021-05890-3.
7. Wang Y, Sun Y, Ma Z, Gao L, Xu Y. A Hybrid Model for Named Entity Recognition on Chinese Electronic Medical Records. ACM Trans Asian Low-Resour Lang Inf Process 2021. DOI: 10.1145/3436819.
Abstract
Electronic medical records (EMRs) contain valuable information about patients, such as clinical symptoms, diagnostic results, and medications. Named entity recognition (NER) aims to recognize entities in unstructured text, which is the initial step toward semantic understanding of EMRs. Extracting medical information from Chinese EMRs can be a more complicated task because of the differences between English and Chinese. Some researchers have noticed the importance of Chinese NER and used recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for this task. However, it is worth asking whether performance could be improved if the advantages of RNNs and CNNs were both utilized. Moreover, RoBERTa-WWM, as a pre-trained model, can generate embeddings with word-level features, which is more suitable for Chinese NER than Word2Vec. In this article, we propose a hybrid model that first obtains the entities identified by a bidirectional long short-term memory network and a CNN, respectively, and then uses two hybrid strategies to produce the final results from these entities. We also conduct experiments on raw medical records from real hospitals, a dataset provided by the China Conference on Knowledge Graph and Semantic Computing 2019 (CCKS 2019). Results demonstrate that the hybrid model improves performance significantly.
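A hybrid strategy for combining the two models' entity outputs could, for example, keep every span both models agree on and then add non-conflicting spans found by only one model. This is an assumption about one plausible rule, not the paper's exact strategy; spans are (start, end, type) tuples over invented offsets:

```python
def merge_entities(bilstm_spans, cnn_spans):
    """Merge two models' entity predictions: agreed spans first, then
    singleton spans that do not overlap anything already kept."""
    merged = sorted(set(bilstm_spans) & set(cnn_spans))
    for span in sorted(set(bilstm_spans) ^ set(cnn_spans)):
        if all(span[1] <= s or span[0] >= e for s, e, _ in merged):
            merged.append(span)
    return sorted(merged)
```

Agreed spans win ties, so a disputed boundary (two overlapping candidates for the same mention) is resolved in favor of whichever span was confirmed first.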
Affiliations
- Yu Wang: Institute of Intelligent Machines and University of Science and Technology of China, Hefei City, Anhui Province, China
- Yining Sun: Institute of Intelligent Machines and University of Science and Technology of China, Hefei City, Anhui Province, China
- Yang Xu: Institute of Intelligent Machines, Anhui Province, China
8. Constructing fine-grained entity recognition corpora based on clinical records of traditional Chinese medicine. BMC Med Inform Decis Mak 2020; 20:64. PMID: 32252745; PMCID: PMC7132896; DOI: 10.1186/s12911-020-1079-2. Received 01/08/2020; accepted 03/25/2020. Open access.
Abstract
Background In this study, we focus on building a fine-grained entity annotation corpus, with a corresponding annotation guideline, of traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for future fine-grained corpus construction from TCM clinical records.

Methods We developed a four-step approach suitable for the construction of the TCM medical records in our corpus. First, we determined the entity types included in this study through sample annotation. Then, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to existing guidelines. We iteratively updated the guideline until the inter-annotator agreement (IAA) exceeded a Cohen's kappa value of 0.9, and then performed comprehensive annotation while keeping the IAA above 0.9.

Results We annotated 10,197 clinical records in five rounds, using four entity categories covering 13 entity types. The final fine-grained annotated entity corpus consists of 1104 entities and 67,799 tokens. The final IAA is 0.936 on average (over three annotators), indicating that the fine-grained entity recognition corpus is of high quality.

Conclusions These results provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain.
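Cohen's kappa, the IAA measure used above, corrects raw agreement for the agreement expected by chance. A sketch for two annotators (the study averages over three; the label sequences below are invented):

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences of equal
    length: (observed agreement - chance agreement) / (1 - chance)."""
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
             for l in labels)
    return (po - pe) / (1 - pe)
```

Values above 0.9, as reported, indicate near-perfect agreement on most scales of interpretation.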
9. Su J, Hu J, Jiang J, Xie J, Yang Y, He B, Yang J, Guan Y. Extraction of risk factors for cardiovascular diseases from Chinese electronic medical records. Comput Methods Programs Biomed 2019; 172:1-10. PMID: 30902121; DOI: 10.1016/j.cmpb.2019.01.007. Received 08/07/2018; revised 12/17/2018; accepted 01/15/2019.
Abstract
BACKGROUND AND OBJECTIVE Early prevention of cardiovascular diseases (CVDs) can effectively prevent later loss of health, and the detection of CVD risk factors is a simple route to early prevention. Personal health records play a prominent role in the field of health information extraction because of their factuality and reliability. The present study describes how to extract risk factors for CVDs from Chinese electronic medical records (CEMRs).

METHODS The extraction process involves two tasks: (a) CVD risk factor recognition and (b) risk factor time and assertion classification. We treated risk factor recognition as a named entity recognition (NER) task and time and assertion classification as a text classification task. An information extraction pipeline consisting of NER and text classification modules with machine learning models was developed. In the risk factor recognition module, a bidirectional long short-term memory (BLSTM) network with additional risk factor textual feature input was built, while convolutional neural networks (CNNs) with risk factor type and section label inputs and a support vector machine (SVM) were built for time and assertion classification.

RESULTS Our system achieved an F1 of 0.9609 for risk factor recognition, and F1 values of 0.9812 and 0.9612 for time and assertion classification, respectively. The experimental results show that our system achieves high performance and can extract risk factors from CEMRs efficiently.

CONCLUSIONS The proposed system is the first for CVD risk factor extraction from CEMRs and is competitive with risk factor extraction systems developed on English EMRs. Its good performance should support CVD prevention.
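NER modules like the one above are conventionally scored with exact-match entity-level precision, recall, and F1, which is how figures such as the 0.9609 reported here are usually computed. A sketch of that computation (the spans in the test are illustrative):

```python
def entity_prf(gold, predicted):
    """Exact-match entity-level precision, recall, and F1.
    Entities are (start, end, type) tuples; a prediction counts only if
    both boundaries and the type match a gold entity exactly."""
    tp = len(set(gold) & set(predicted))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Partial-boundary variants relax the exact-match condition to any overlap; the function above implements only the strict case.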
Affiliations
- Jia Su: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jinpeng Hu: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jingchi Jiang: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jing Xie: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Yang Yang: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Bin He: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
- Jinfeng Yang: School of Software, Harbin University of Science and Technology, Harbin, Heilongjiang, China
- Yi Guan: Language Technology Research Center, Harbin Institute of Technology, Integrated Building Room 803, 92 West Dazhi Street, Harbin 150001, Heilongjiang, China
10. Ning W, Chan S, Beam A, Yu M, Geva A, Liao K, Mullen M, Mandl KD, Kohane I, Cai T, Yu S. Feature extraction for phenotyping from semantic and knowledge resources. J Biomed Inform 2019; 91:103122. PMID: 30738949; PMCID: PMC6424621; DOI: 10.1016/j.jbi.2019.103122.
Abstract
OBJECTIVE Phenotyping algorithms can efficiently and accurately identify patients with a specific disease phenotype and construct electronic health records (EHR)-based cohorts for subsequent clinical or genomic studies. Previous studies have introduced unsupervised EHR-based feature selection methods that yielded algorithms with high accuracy. However, those selection methods still require expert intervention to tweak the parameter settings according to the EHR data distribution for each phenotype. To further accelerate the development of phenotyping algorithms, we propose a fully automated and robust unsupervised feature selection method that leverages only publicly available medical knowledge sources, instead of EHR data.

METHODS SEmantics-Driven Feature Extraction (SEDFE) collects medical concepts from online knowledge sources as candidate features and gives them vector-form distributional semantic representations derived with neural word embedding and the Unified Medical Language System Metathesaurus. A number of features that are semantically closest and that sufficiently characterize the target phenotype are determined by a linear decomposition criterion and are selected for the final classification algorithm.

RESULTS SEDFE was compared with the EHR-based SAFE algorithm and domain experts on feature selection for the classification of five phenotypes, including coronary artery disease, rheumatoid arthritis, Crohn's disease, ulcerative colitis, and pediatric pulmonary arterial hypertension, using both supervised and unsupervised approaches. Algorithms yielded by SEDFE achieved comparable accuracy to those yielded by SAFE and expert-curated features. SEDFE is also robust to the input semantic vectors.

CONCLUSION SEDFE attains satisfying performance in unsupervised feature selection for EHR phenotyping. Both fully automated and EHR-independent, this method promises efficiency and accuracy in developing algorithms for high-throughput phenotyping.
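One way to read the "linear decomposition criterion" is as choosing candidate embeddings that best reconstruct the phenotype's semantic vector. The greedy least-squares sketch below is an approximation of that idea, not SEDFE's published algorithm, and the vectors in the test are toy values:

```python
import numpy as np

def select_features(target, candidates, k):
    """Greedily pick the k candidate embeddings (candidates maps
    name -> vector) that most reduce the least-squares residual when the
    target phenotype vector is decomposed over the selected set."""
    chosen = []

    def residual(selection):
        A = np.array([candidates[n] for n in selection]).T
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.linalg.norm(A @ coef - target)

    for _ in range(k):
        best = min((n for n in candidates if n not in chosen),
                   key=lambda n: residual(chosen + [n]))
        chosen.append(best)
    return chosen
```

With real distributional vectors, the selected concept names would become the feature set fed to the downstream phenotype classifier.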
Affiliations
- Wenxin Ning: Department of Industrial Engineering, Tsinghua University, Beijing, China
- Stephanie Chan: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Andrew Beam: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Ming Yu: Department of Industrial Engineering, Tsinghua University, Beijing, China
- Alon Geva: Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children's Hospital, Boston, MA, USA; Department of Anesthesia, Harvard Medical School, Boston, MA, USA
- Katherine Liao: Department of Medicine, Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Mary Mullen: Department of Cardiology, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Kenneth D Mandl: Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Isaac Kohane: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Tianxi Cai: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Sheng Yu: Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China
11. Chen L, Song L, Shao Y, Li D, Ding K. Using natural language processing to extract clinically useful information from Chinese electronic medical records. Int J Med Inform 2019; 124:6-12. PMID: 30784428; DOI: 10.1016/j.ijmedinf.2019.01.004. Received 05/24/2018; revised 12/29/2018; accepted 01/05/2019.
Abstract
AIMS To develop a natural language processing (NLP)-based algorithm for extracting clinically useful information on patients with hepatocellular carcinoma (HCC) from Chinese electronic medical records (EMRs), and to use these data for the assessment of HCC staging.

MATERIALS AND METHODS Clinical documents, including operation notes and radiology and pathology reports, of 92 HCC patients were collected from Chinese EMRs. We randomly grouped these patients into training (n = 60) and testing (n = 32) datasets. Rule-based and hybrid methods for extracting information were developed on the training set of manually annotated operation notes, and the better-performing method was used to process the other documents. Performance was assessed via precision, recall, and F-score under exact-boundary and partial-boundary matching strategies. The utility of the extracted information for HCC staging was assessed against manual review.

RESULTS For operation notes, the rule-based and hybrid methods achieved precision, recall, and F-score ≥80% on the testing dataset under both the exact-boundary and partial-boundary matching strategies. Using the rule-based method (which outperformed the hybrid method), good performance was also obtained on three other document types. When the extracted clinically useful information was applied to HCC staging, the concordance rate with manual review was 75%.

CONCLUSION An NLP system was developed for clinical information extraction and HCC staging based on EMRs, and the results indicate that Chinese NLP has potential utility in clinical research.
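A rule-based extractor of the kind described is essentially a library of patterns keyed to report sections. The single pattern below (a tumor size pulled from an English rendering of an operation-note sentence) is a hypothetical illustration, not one of the authors' actual rules:

```python
import re

# Hypothetical rule: capture "L x W cm" tumor dimensions. The pattern and
# the output field names are invented for illustration.
SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*cm")

def extract_tumor_size(sentence):
    """Return the first tumor size found in a sentence, or None."""
    m = SIZE.search(sentence)
    if not m:
        return None
    return {"length_cm": float(m.group(1)), "width_cm": float(m.group(2))}
```

A production system would apply many such rules per document type and feed the structured fields into the staging logic.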
Collapse
Affiliation(s)
- Liang Chen
- Department of Hepatobiliary Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, PR China
- Liting Song
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, PR China
- Yue Shao
- Department of Hepatobiliary Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, PR China
- Dewei Li
- Department of Hepatobiliary Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, PR China
- Keyue Ding
- Medical Genetic Institute of Henan Province, Henan Provincial People's Hospital, Henan Key Laboratory of Genetic Diseases and Functional Genomics, Henan Provincial People's Hospital of Henan University, Zhengzhou, Henan Province, PR China
12
Abstract
Background Health professionals and consumers use different terms to express medical events or concerns, which creates communication barriers between them and may bias diagnosis or treatment through misunderstanding or incomplete understanding. To address this issue, consumer health vocabularies have been developed to map consumer-used health terms to the medical terms used by professionals. Methods In this study, we extracted Chinese consumer health terms from both an online health forum and patient education monographs, and manually mapped them to medical terms used by professionals (terms in medical thesauri or medical books). To ensure annotation quality, we developed annotation guidelines. Results We applied our method to extract consumer-used disease terms in endocrinology, cardiology, gastroenterology, and dermatology. We identified 1349 medical mentions from 8436 questions posted in an online health forum and from 1428 patient education articles. After manual annotation and review, we released 1036 Chinese consumer health terms mapped to 480 medical terms. Four annotators performed the manual annotation following the Chinese consumer health term annotation guidelines; their average inter-annotator agreement (IAA) score was 93.91%, ensuring high consistency of the released terms. Conclusions We extracted Chinese consumer health terms from an online forum and patient education monographs, and mapped them to medical terms used by professionals. Our study may contribute to the construction of a Chinese consumer health vocabulary. In addition, our annotated corpus, covering both the contexts of consumer health terms and the consumer-professional term mapping, would be a useful resource for developing automatic methods. The dataset of Chinese consumer health terms (CHT) is publicly available at http://www.phoc.org.cn/cht/.
Affiliation(s)
- Li Hou
- Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020, China
- Hongyu Kang
- Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020, China
- Yan Liu
- Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020, China
- Luqi Li
- Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020, China
- Jiao Li
- Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020, China
13
Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semantics 2018; 9:12. [PMID: 29602312 PMCID: PMC5877394 DOI: 10.1186/s13326-018-0179-8] [Citation(s) in RCA: 95] [Impact Index Per Article: 15.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2017] [Accepted: 02/14/2018] [Indexed: 01/22/2023] Open
Abstract
Background Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main Body We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
Affiliation(s)
- Aurélie Névéol
- LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, Paris, F-91405 Orsay, France
- Sumithra Velupillai
- School of Computer Science and Communication, KTH, Stockholm, Sweden; Institute of Psychiatry, Psychology and Neuroscience, King's College, London, UK
- Guergana Savova
- Children's Hospital Boston and Harvard Medical School, Boston, Massachusetts, USA
- Pierre Zweigenbaum
- LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, Paris, F-91405 Orsay, France
14
Kang T, Zhang S, Tang Y, Hruby GW, Rusanov A, Elhadad N, Weng C. EliIE: An open-source information extraction system for clinical trial eligibility criteria. J Am Med Inform Assoc 2017; 24:1062-1071. [PMID: 28379377 PMCID: PMC6259668 DOI: 10.1093/jamia/ocx019] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2016] [Revised: 01/31/2017] [Accepted: 03/02/2017] [Indexed: 12/22/2022] Open
Abstract
OBJECTIVE To develop an open-source information extraction system called Eligibility Criteria Information Extraction (EliIE) for parsing and formalizing free-text clinical research eligibility criteria (EC) following Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.0. MATERIALS AND METHODS EliIE parses EC in 4 steps: (1) clinical entity and attribute recognition, (2) negation detection, (3) relation extraction, and (4) concept normalization and output structuring. Informaticians and domain experts were recruited to design an annotation guideline and generate a training corpus of annotated EC for 230 Alzheimer's clinical trials, which were represented as queries against the OMOP CDM and included 8008 entities, 3550 attributes, and 3529 relations. A sequence labeling-based method was developed for automatic entity and attribute recognition. Negation detection was supported by NegEx and a set of predefined rules. Relation extraction was achieved by a support vector machine classifier. We further performed terminology-based concept normalization and output structuring. RESULTS In task-specific evaluations, the best F1 score for entity recognition was 0.79, and for relation extraction was 0.89. The accuracy of negation detection was 0.94. The overall accuracy for query formalization was 0.71 in an end-to-end evaluation. CONCLUSIONS This study presents EliIE, an OMOP CDM-based information extraction system for automatic structuring and formalization of free-text EC. According to our evaluation, machine learning-based EliIE outperforms existing systems and shows promise to improve.
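EliIE's negation step is supported by NegEx and predefined rules. The idea behind NegEx-style detection can be sketched as follows: an entity mention is flagged as negated when a trigger phrase appears within a few words before it. The trigger list and window size below are illustrative assumptions, not EliIE's actual rule set.

```python
import re

# Minimal NegEx-style negation check. Triggers and window size are
# illustrative, not the configuration used by EliIE.
NEG_TRIGGERS = ["no", "denies", "without", "no history of", "absence of"]

def is_negated(sentence: str, entity: str, window: int = 5) -> bool:
    s = sentence.lower()
    idx = s.find(entity.lower())
    if idx == -1:
        return False  # entity not mentioned at all
    # Look only at the few words immediately preceding the entity.
    pre_text = " ".join(s[:idx].split()[-window:])
    # Word-boundary match so "no" does not fire inside "normal".
    return any(re.search(r"\b" + re.escape(t) + r"\b", pre_text)
               for t in NEG_TRIGGERS)

print(is_negated("Patient denies chest pain or dyspnea.", "chest pain"))  # True
print(is_negated("Patient reports chest pain.", "chest pain"))            # False
```

The full NegEx algorithm also handles post-entity triggers and pseudo-negation phrases; this sketch covers only the pre-entity case.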
Affiliation(s)
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Youlan Tang
- Institute of Human Nutrition, Columbia University, New York, NY, USA
- Gregory W Hruby
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Alexander Rusanov
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
15
Jian Z, Guo X, Liu S, Ma H, Zhang S, Zhang R, Lei J. A cascaded approach for Chinese clinical text de-identification with less annotation effort. J Biomed Inform 2017; 73:76-83. [PMID: 28756160 PMCID: PMC5583002 DOI: 10.1016/j.jbi.2017.07.017] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2017] [Revised: 07/09/2017] [Accepted: 07/25/2017] [Indexed: 11/28/2022]
Abstract
With the rapid adoption of Electronic Health Records (EHRs) in China, an increasing amount of clinical data has become available to support clinical research. Secondary use of clinical data usually requires de-identification of personal information to protect patient privacy. Since manual de-identification of free clinical text requires a significant amount of human work, developing an automated de-identification system is necessary. While many de-identification systems are available for English clinical text, designing a de-identification system for Chinese clinical text faces challenges such as the unavailability of necessary lexical resources and the sparsity of patient health information (PHI) in Chinese clinical text. In this paper, we designed a de-identification pipeline that takes advantage of both rule-based and machine learning techniques. In particular, our method can effectively construct a dataset with dense PHI, which significantly reduces annotation time for subsequent supervised learning. We experimented on a dataset of 3000 heterogeneous clinical documents to evaluate the annotation cost and the de-identification performance. Our approach increased the efficiency of the annotation effort by over 60% while reaching an F score of over 90%. We demonstrate that combining rule-based and machine learning techniques is an effective way to reduce annotation cost and achieve high performance in the Chinese clinical text de-identification task.
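The rule-based half of such a pipeline can be sketched as a regex pass that replaces structured PHI with category placeholders. The patterns below are illustrative stand-ins for English-style dates, phone numbers, and record IDs; the study's actual rules target Chinese clinical text and cover many more PHI categories.

```python
import re

# Toy rule-based de-identification pass: each regex rule rewrites one
# structured PHI category to a placeholder. Patterns are illustrative
# assumptions, not the rules used in the cited study.
PHI_RULES = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),        # e.g. 2016-03-02
    ("PHONE", re.compile(r"\b\d{3}-\d{4}-\d{4}\b")),       # e.g. 138-1234-5678
    ("ID", re.compile(r"\bMRN\s*\d{6,}\b")),               # e.g. MRN 0012345
]

def deidentify(text: str) -> str:
    """Apply each rule in order, masking matches with [CATEGORY]."""
    for label, pattern in PHI_RULES:
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Admitted 2016-03-02, MRN 0012345, contact 138-1234-5678."
print(deidentify(note))  # Admitted [DATE], [ID], contact [PHONE].
```

In the study's design, a pass like this masks the easy, high-density PHI first, so the annotators and the subsequent machine-learning model can concentrate on the sparser, harder cases.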
Affiliation(s)
- Zhe Jian
- Department of Medical Informatics, Harbin Medical University, Harbin, China
- Xusheng Guo
- Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Shijian Liu
- Shanghai Children's Medical Center, Shanghai, China
- Handong Ma
- Synyi Co. Ltd., Shanghai, China; Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China
- Shaodian Zhang
- Synyi Co. Ltd., Shanghai, China; Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China
- Rui Zhang
- Institute for Health Informatics, College of Pharmacy, University of Minnesota, Minneapolis, MN, USA
- Jianbo Lei
- Center for Medical Informatics, Peking University, Beijing, China; School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, Sichuan, PR China
16
Névéol A, Zweigenbaum P. Making Sense of Big Textual Data for Health Care: Findings from the Section on Clinical Natural Language Processing. Yearb Med Inform 2017; 26:228-234. [PMID: 29063569 PMCID: PMC6239234 DOI: 10.15265/iy-2017-027] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2017] [Indexed: 02/01/2023] Open
Abstract
Objectives: To summarize recent research and present a selection of the best papers published in 2016 in the field of clinical Natural Language Processing (NLP). Method: A survey of the literature was performed by the two section editors of the IMIA Yearbook NLP section. Bibliographic databases were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Papers were automatically ranked and then manually reviewed based on titles and abstracts. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. Results: The five clinical NLP best papers provide contributions that range from emerging original foundational methods to transitioning solid established research results to a practical clinical setting. They offer a framework for abbreviation disambiguation and coreference resolution, a classification method to identify clinically useful sentences, an analysis of counseling conversations to improve support for patients with mental disorders, and a grounding of gradable adjectives. Conclusions: Clinical NLP continued to thrive in 2016, with an increasing number of contributions towards applications compared to fundamental methods. Fundamental work addresses increasingly complex problems such as lexical semantics, coreference resolution, and discourse analysis. Research results translate into freely available tools, mainly for English.
Affiliation(s)
- A. Névéol
- LIMSI, CNRS, Université Paris Saclay, Orsay, France
17
Guo H, Na X, Hou L, Li J. Classifying Chinese Questions Related to Health Care Posted by Consumers Via the Internet. J Med Internet Res 2017. [PMID: 28634156 PMCID: PMC5497072 DOI: 10.2196/jmir.7156] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022] Open
Abstract
Background In question answering (QA) system development, question classification is crucial for identifying information needs and improving the accuracy of returned answers. Although the questions are domain-specific, they are asked by non-professionals, which makes the question classification task more challenging. Objective This study aimed to classify health care-related questions posted on the Internet by the general public (Chinese speakers). Methods A topic-based classification schema for health-related questions was built by manually annotating randomly selected questions. The Kappa statistic was used to measure the interrater reliability of multiple annotation results. Using the resulting corpus, we developed a machine-learning method to automatically classify these questions into one of six classes: Condition Management, Healthy Lifestyle, Diagnosis, Health Provider Choice, Treatment, and Epidemiology. Results The consumer health question schema was developed with four hierarchical levels of specificity, comprising 48 quaternary categories and 35 annotation rules. The 2000 sample questions were coded with 2000 major codes and 607 minor codes. Using natural language processing techniques, we represented the Chinese questions as a set of lexical, grammatical, and semantic features, and selected the effective features to improve classification performance. For the 6-category classification, we achieved an average precision of 91.41%, recall of 89.62%, and F1 score of 90.24%. Conclusions In this study, we developed an automatic method to classify Chinese health care-related questions posted by the general public. It enables artificial intelligence (AI) agents to understand Internet users' information needs on health care.
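A trivial keyword-rule baseline for this kind of topic routing can be sketched as follows. The class names follow the study's schema, but the keywords are invented for illustration, and the study itself used a trained classifier over lexical, grammatical, and semantic features rather than hand-written rules.

```python
# Keyword-rule baseline for consumer health question classification.
# Class labels follow the cited schema; keywords are illustrative only.
CLASS_KEYWORDS = {
    "Diagnosis": ["symptom", "what disease", "is it serious"],
    "Treatment": ["medication", "how to treat", "surgery"],
    "Healthy Lifestyle": ["diet", "exercise", "sleep"],
    "Epidemiology": ["contagious", "how common", "risk factor"],
}

def classify_question(question: str) -> str:
    """Pick the class whose keywords match most often; fall back to
    Condition Management when nothing matches."""
    q = question.lower()
    best, best_hits = "Condition Management", 0
    for label, keywords in CLASS_KEYWORDS.items():
        hits = sum(kw in q for kw in keywords)
        if hits > best_hits:
            best, best_hits = label, hits
    return best

print(classify_question("What medication is used to treat gout?"))  # Treatment
```

A learned classifier replaces the hand-picked keywords with features weighted from annotated data, which is what lifts performance to the precision and recall reported above.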
Affiliation(s)
- Haihong Guo
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences, Beijing, China
- Xu Na
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences, Beijing, China
- Li Hou
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences, Beijing, China
- Jiao Li
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences, Beijing, China
18
He B, Dong B, Guan Y, Yang J, Jiang Z, Yu Q, Cheng J, Qu C. Building a comprehensive syntactic and semantic corpus of Chinese clinical texts. J Biomed Inform 2017; 69:203-217. [DOI: 10.1016/j.jbi.2017.04.006] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2016] [Revised: 04/06/2017] [Accepted: 04/07/2017] [Indexed: 11/25/2022]