1
Zhu Z, Zhao Q, Li J, Ge Y, Ding X, Gu T, Zou J, Lv S, Wang S, Yang JJ. Comparative Analysis of Large Language Models in Chinese Medical Named Entity Recognition. Bioengineering (Basel) 2024; 11:982. [PMID: 39451358 PMCID: PMC11504658 DOI: 10.3390/bioengineering11100982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2024] [Revised: 09/13/2024] [Accepted: 09/23/2024] [Indexed: 10/26/2024] Open
Abstract
The emergence of large language models (LLMs) has provided robust support for application tasks across various domains, such as named entity recognition (NER) in the general domain. However, due to the particularity of the medical domain, research on understanding and improving the effectiveness of LLMs on biomedical named entity recognition (BNER) tasks remains relatively limited, especially in the context of Chinese text. In this study, we extensively evaluate several typical LLMs, including ChatGLM2-6B, GLM-130B, GPT-3.5, and GPT-4, on the Chinese BNER task by leveraging a real-world Chinese electronic medical record (EMR) dataset and a public dataset. The experimental results demonstrate the promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for Chinese BNER tasks. More importantly, instruction fine-tuning significantly enhances the performance of LLMs. The fine-tuned offline ChatGLM2-6B surpassed the performance of the task-specific model BiLSTM+CRF (BC) on the real-world dataset. The best fine-tuned model, GPT-3.5, outperforms all other LLMs on the publicly available CCKS2017 dataset, even surpassing half of the baselines; however, it remains challenging for it to surpass the state-of-the-art task-specific models, i.e., Dictionary-guided Attention Network (DGAN). To our knowledge, this study is the first attempt to evaluate the performance of LLMs on Chinese BNER tasks, which emphasizes the prospective and transformative implications of utilizing LLMs on Chinese BNER tasks. Furthermore, we summarize our findings into a set of actionable guidelines for future researchers on how to effectively leverage LLMs to become experts in specific tasks.
Affiliation(s)
- Zhichao Zhu
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Qing Zhao
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Jianjiang Li
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Yanhu Ge
- Department of Anesthesiology, Beijing Anzhen Hospital, Capital Medical University, Beijing 100013, China
- Xingjian Ding
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Tao Gu
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Jingchen Zou
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Sirui Lv
- College of Computer Science, Beijing University of Technology, Beijing 100124, China
- Sheng Wang
- Department of Anesthesiology, Beijing Anzhen Hospital, Capital Medical University, Beijing 100013, China
- Ji-Jiang Yang
- Department of Automation, Tsinghua University, Beijing 100084, China
2
Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024; 16:333-344. [PMID: 38340264 PMCID: PMC11289304 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]
Abstract
We report a combined manual annotation and deep-learning natural language processing study for accurate entity extraction in hereditary disease related biomedical literature. A total of 400 full articles were manually annotated based on published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The performance of our manual annotations was assessed by comparing our re-annotated results with those publicly available. The overall Jaccard index was calculated to be 0.866 for the four entity types: gene, variant, disease and species. Both a BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested. Due to the limited manually annotated corpus, these NER models were fine-tuned in two phases. The F1-scores of BERT-based NER for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of DistilBERT-based NER are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the entity type of variant has been extracted by a large language model for the first time and a comparable F1-score with the state-of-the-art variant extraction model tmVar has been achieved.
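Note: the abstract above reports an overall Jaccard index for annotation agreement but does not say how it was computed. The sketch below shows one common way to do it, assuming each annotation is a hypothetical (entity type, start, end) triple and that agreement means an exact match. The function name and sample data are illustrative and not taken from the paper.

```python
def jaccard_index(a, b):
    """Jaccard index |A & B| / |A | B| between two collections of annotations."""
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty annotation sets are treated as being in full agreement.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical annotations: (entity_type, start_offset, end_offset).
ours = {("gene", 0, 5), ("variant", 10, 18), ("disease", 30, 42)}
public = {("gene", 0, 5), ("variant", 10, 18), ("species", 50, 55)}

score = jaccard_index(ours, public)
print(score)  # 2 shared annotations out of 4 distinct ones -> 0.5
```

Exact-match agreement is the strictest choice. A partial-overlap criterion would give higher scores on the same data.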
Affiliation(s)
- Dao-Ling Huang
- BGI Research, Shenzhen, 518083, China
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China
- Quanlei Zeng
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yun Xiong
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Shuixia Liu
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Chaoqun Pang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Menglei Xia
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Ting Fang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yanli Ma
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Cuicui Qiang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yi Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yu Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Hong Li
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yuying Yuan
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China
3
Wang L, Zheng Y, Chen Y, Xu H, Li F. Clinical named entity recognition for percutaneous coronary intervention surgical information with hybrid neural network. Rev Sci Instrum 2024; 95:065114. [PMID: 38921058 DOI: 10.1063/5.0174442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Accepted: 05/04/2024] [Indexed: 06/27/2024]
Abstract
Percutaneous coronary intervention (PCI) has become a vital treatment approach for coronary artery disease, but the clinical data of PCI cannot be directly utilized due to its unstructured characteristics. The existing clinical named entity recognition (CNER) has been used to identify specific entities such as body parts, drugs, and diseases, but its specific potential in PCI clinical texts remains largely unexplored. How to effectively use CNER to deeply mine the information in the existing PCI clinical records is worth studying. In this paper, a total of 24 267 corpora are collected from the Cardiovascular Disease Treatment Center of the People's Hospital of Liaoning Province in China. We select three types of clinical record texts of fine-grained PCI surgical information, from which 5.8% of representative surgical records of PCI patients are selected as datasets for labeling. To fully utilize global information and multi-level semantic features, we design a novel character-level vector embedding method and further propose a new hybrid model based on it. Based on the classic Bidirectional Long Short-Term Memory Network (BiLSTM), the model further integrates Convolutional Neural Networks (CNNs) and Bidirectional Encoder Representations from Transformers (BERTs) for feature extraction and representation, and finally uses Conditional Random Field (CRF) for decoding and predicting label sequences. This hybrid model is referred to as BCC-BiLSTM in this paper. In order to verify the performance of the proposed hybrid model for extracting PCI surgical information, we simultaneously compare both representative traditional and intelligent methods. Under the same circumstances, compared with other intelligent methods, the BCC-BiLSTM proposed in this paper reduces the word vector dimension by 15%, and the F1 score reaches 86.2% in named entity recognition of PCI clinical texts, which is 26.4% higher than that of HMM. The improvement is 1.2% higher than BiLSTM + CRF and 0.7% higher than the most popular BERT + BiLSTM + CRF. Compared with the representative models, the hybrid model has better performance and can achieve optimal results faster in the model training process, so it has good clinical application prospects.
Affiliation(s)
- Li Wang
- Dalian Maritime University, Dalian 116026, China
- Yuhang Zheng
- Dalian Maritime University, Dalian 116026, China
- Yi Chen
- Sussex Artificial Intelligence Institute, Zhejiang Gongshang University, Hangzhou 310018, China
- Hongzeng Xu
- Department of Cardiology, The People's Hospital of China Medical University, The People's Hospital of Liaoning Province, Shenyang 110011, China
- Feng Li
- Sussex Artificial Intelligence Institute, Zhejiang Gongshang University, Hangzhou 310018, China
4
Liu Q, Zhang L, Ren G, Zou B. Research on named entity recognition of Traditional Chinese Medicine chest discomfort cases incorporating domain vocabulary features. Comput Biol Med 2023; 166:107466. [PMID: 37742417 DOI: 10.1016/j.compbiomed.2023.107466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 08/20/2023] [Accepted: 09/04/2023] [Indexed: 09/26/2023]
Abstract
OBJECTIVE: To promote research on knowledge extraction and knowledge graph construction of chest discomfort medical cases in Traditional Chinese Medicine (TCM), this paper focuses on their named entity recognition (NER). The recognition task includes six entity types: "syndrome", "symptom", "etiology and pathogenesis", "treatment method", "medication", and "prescription".
METHODS: We annotated data in a BIO (B-begin, I-inside, O-outside) manner. For the characteristics of medical case texts, we proposed a custom dictionary method that can be dynamically updated for word segmentation. To compare the effect of the method on the experimental results, we applied the method in the BiLSTM-CRF model and IDCNN-CRF model, respectively.
RESULTS: The models using custom dictionaries (BiLSTM-CRF-Loaded and IDCNN-CRF-Loaded) outperformed the models without custom dictionaries (BiLSTM-CRF and IDCNN-CRF) in accuracy, precision, recall, and F1 score. The BiLSTM-CRF-Loaded model yielded F1 scores of 92.59% and 93.23% on the test set and validation set, respectively, surpassing the BERT-BiLSTM-CRF model by 3.59% and 4.87%. Furthermore, when analyzing the results for the six entity categories separately, we found that the use of custom dictionaries has a marked impact, with the categories of "etiology and pathogenesis" and "syndrome" demonstrating the most noticeable improvements. By comparing the F1 scores, we observed that the entity category "medication" yielded the highest performance, reaching F1 scores of 96.04% and 96.48% on the test set and validation set, respectively.
CONCLUSION: We propose a word segmentation method based on a dynamically updated custom dictionary. The method is combined with the BiLSTM-CRF and the IDCNN-CRF models, which enhances the models' ability to recognize domain-specific terms and new entities. It can be widely applied in dealing with complex text structures and texts containing domain-specific terms.
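Note: the BIO scheme mentioned in the abstract above can be illustrated with a minimal sketch. The tokens, span format, and function name below are hypothetical and are not the authors' annotation tooling. Only the B/I/O convention and the entity type names come from the abstract.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)           # everything starts outside an entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens inside the entity
    return tags


# Made-up example sentence, already split into tokens.
tokens = ["chest", "tightness", "treated", "with", "herbal", "decoction"]
spans = [(0, 2, "symptom"), (4, 6, "prescription")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

For Chinese text the same tagging is typically applied per character or per segmented word, which is where the custom segmentation dictionary described in the abstract would come in.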
Affiliation(s)
- Qingping Liu
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, China
- Lunlun Zhang
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, China
- Gao Ren
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, China
- Beiji Zou
- School of Computer Science and Engineering, Central South University, Changsha, 410083, Hunan, China
5
Li F, Bi Z, Xu H, Shi Y, Duan N, Li Z. Design and implementation of a smart Internet of Things chest pain center based on deep learning. Math Biosci Eng 2023; 20:18987-19011. [PMID: 38052586 DOI: 10.3934/mbe.2023840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
The data input process for most chest pain centers is not intelligent, requiring a lot of staff to manually input patient information. This leads to problems such as long processing times, high potential for errors, an inability to access patient data in a timely manner and an increasing workload. To address the challenge, an Internet of Things (IoT)-driven chest pain center is designed, which crosses the sensing layer, network layer and application layer. The system enables the construction of intelligent chest pain management through a pre-hospital app, Ultra-Wideband (UWB) positioning, and in-hospital treatment. The pre-hospital app is provided to emergency medical services (EMS) centers, which allows them to record patient information in advance and keep it synchronized with the hospital's database, reducing the time needed for treatment. UWB positioning obtains the patient's hospital information through the zero-dimensional base station and the corresponding calculation engine, and in-hospital treatment involves automatic acquisition of patient information through web and mobile applications. The system also introduces the Bidirectional Long Short-Term Memory (BiLSTM)-Conditional Random Field (CRF)-based algorithm to train electronic medical record information for chest pain patients, extracting the patient's chest pain clinical symptoms. The resulting data are saved in the chest pain patient database and uploaded to the national chest pain center. The system has been used in Liaoning Provincial People's Hospital, and its subsequent assistance to doctors and nurses in collaborative treatment, data feedback and analysis is of great significance.
Affiliation(s)
- Feng Li
- School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
- School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore
- Zhongao Bi
- School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
- Hongzeng Xu
- Department of Cardiology, The People's Hospital of Liaoning Province, Liaoning, Shenyang 110011, China
- Yunqi Shi
- Department of Cardiology, The People's Hospital of Liaoning Province, Liaoning, Shenyang 110011, China
- Na Duan
- Department of Cardiology, The People's Hospital of Liaoning Province, Liaoning, Shenyang 110011, China
- Zhaoyu Li
- Department of Cardiology, The Second Affiliated Hospital Zhejiang University School of Medicine, Hangzhou 310000, China
6
Gajendran S, Manjula D, Sugumaran V, Hema R. Extraction of knowledge graph of Covid-19 through mining of unstructured biomedical corpora. Comput Biol Chem 2023; 102:107808. [PMID: 36621289 PMCID: PMC9807269 DOI: 10.1016/j.compbiolchem.2022.107808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 12/21/2022] [Accepted: 12/29/2022] [Indexed: 01/04/2023]
Abstract
The number of biomedical articles published has been increasing rapidly over the years. Currently there are about 30 million articles in PubMed and over 25 million mentions in Medline. Among the fundamental text-mining tasks, Biomedical Named Entity Recognition (BioNER) and Biomedical Relation Extraction (BioRE) are the most essential in analysing the literature. In the biomedical domain, a Knowledge Graph (KG) is used to visualize the relationships between various entities such as proteins, chemicals and diseases. Scientific publications have increased dramatically as a result of the search for treatments and potential cures for the new Coronavirus, but efficiently analysing, integrating, and utilising related sources of information remains a difficulty. In order to effectively combat the disease during pandemics like COVID-19, literature must be used quickly and effectively. In this paper, we introduce a fully automated framework consisting of a BERT-BiLSTM model, a Knowledge Graph, and a Representation Learning model to extract the top diseases, chemicals, and proteins related to COVID-19 from the literature. The proposed framework uses Named Entity Recognition models for disease recognition, chemical recognition, and protein recognition. The system then uses the Chemical-Disease Relation Extraction and Chemical-Protein Relation Extraction models to extract the entities and relations from the CORD-19 dataset. Next, it creates a Knowledge Graph for the extracted relations and entities. Finally, the system performs Representation Learning on this KG to obtain the embeddings of all entities and retrieve the top related diseases, chemicals, and proteins with respect to COVID-19.
Affiliation(s)
- Sudhakaran Gajendran (corresponding author)
- School of Electronics Engineering, Vellore Institute of Technology, Chennai, India
- D. Manjula
- School of Computer Science Engineering, Vellore Institute of Technology, Chennai, India
- Vijayan Sugumaran
- Center for Data Science and Big Data Analytics, Oakland University, Rochester, MI, USA
- Department of Decision and Information Sciences, School of Business Administration, Oakland University, Rochester, MI, USA
- R. Hema
- Department of Electronics and Communication Engineering, St. Joseph College of Engineering, Chennai, India
7
Ward PJ, Young AM, Slavova S, Liford M, Daniels L, Lucas R, Kavuluru R. Deep Neural Networks for Fine-Grained Surveillance of Overdose Mortality. Am J Epidemiol 2023; 192:257-266. [PMID: 36222700 DOI: 10.1093/aje/kwac180] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/30/2021] [Revised: 08/16/2022] [Accepted: 10/10/2022] [Indexed: 02/07/2023] Open
Abstract
Surveillance of drug overdose deaths relies on death certificates for identification of the substances that caused death. Drugs and drug classes can be identified through the International Classification of Diseases, Tenth Revision (ICD-10), codes present on death certificates. However, ICD-10 codes do not always provide high levels of specificity in drug identification. To achieve more fine-grained identification of substances on death certificates, the free-text cause-of-death section, completed by the medical certifier, must be analyzed. Current methods for analyzing free-text death certificates rely solely on lookup tables for identifying specific substances, which must be frequently updated and maintained. To improve identification of drugs on death certificates, a deep-learning named-entity recognition model was developed, utilizing data from the Kentucky Drug Overdose Fatality Surveillance System (2014-2019), which achieved an F1-score of 99.13%. This model can identify new drug misspellings and novel substances that are not present on current surveillance lookup tables, enhancing the surveillance of drug overdose deaths.
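Note: the limitation of lookup tables described in the abstract above can be seen in a minimal sketch. The table contents, function name, and example strings below are hypothetical and do not reproduce any surveillance system's actual table. The sketch only shows why exact matching misses a misspelling that is not already in the table.

```python
# Hypothetical lookup table of substance names (lowercase).
LOOKUP_TABLE = {"fentanyl", "heroin", "methamphetamine"}


def lookup_substances(text, table=LOOKUP_TABLE):
    """Return the words in a free-text field that exactly match the table."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return [w for w in words if w in table]


found = lookup_substances("Acute fentanyl and heroin toxicity")
missed = lookup_substances("Acute fentanil toxicity")  # misspelled drug name

print(found)   # ['fentanyl', 'heroin']
print(missed)  # [] -- the misspelling is not in the table
```

Every new misspelling or novel substance has to be added to such a table by hand, which is the maintenance burden the abstract says a learned NER model is meant to reduce.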
8
Raza S, Schwartz B. Entity and relation extraction from clinical case reports of COVID-19: a natural language processing approach. BMC Med Inform Decis Mak 2023; 23:20. [PMID: 36703154 PMCID: PMC9879259 DOI: 10.1186/s12911-023-02117-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Accepted: 01/20/2023] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND: Extracting relevant information about infectious diseases is an essential task. However, a significant obstacle in supporting public health research is the lack of methods for effectively mining large amounts of health data.
OBJECTIVE: This study aims to use natural language processing (NLP) to extract the key information (clinical factors, social determinants of health) from published cases in the literature.
METHODS: The proposed framework integrates a data layer for preparing a data cohort from clinical case reports; an NLP layer to find the clinical and demographic named entities and relations in the texts; and an evaluation layer for benchmarking performance and analysis. The focus of this study is to extract valuable information from COVID-19 case reports.
RESULTS: The named entity recognition implementation in the NLP layer achieves a performance gain of about 1-3% compared to benchmark methods. Furthermore, even without extensive data labeling, the relation extraction method outperforms benchmark methods in terms of accuracy (by 1-8%). A thorough examination reveals the disease's presence and symptom prevalence in patients.
CONCLUSIONS: A similar approach can be generalized to other infectious diseases. It is worthwhile to use prior knowledge acquired through transfer learning when researching other infectious diseases.
Affiliation(s)
- Shaina Raza
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Brian Schwartz
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
9
Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022; 14:2761. [PMID: 36560764 PMCID: PMC9781729 DOI: 10.3390/v14122761] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/28/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022] Open
Abstract
The clinical application of detecting COVID-19 factors is a challenging task. The existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical factors, non-clinical factors such as social determinants of health (SDoH) are also important for studying infectious diseases. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1-5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
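Note: the abstract above reports gains in both macro- and micro-average F1. The sketch below shows how the two averages differ, using made-up per-entity-type counts of true positives, false positives, and false negatives. The entity type names and numbers are hypothetical and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps entity type -> (tp, fp, fn). Returns (macro_f1, micro_f1)."""
    # Macro: average the per-type F1 scores, so every type weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent types dominate the score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Hypothetical counts: a frequent, easy type and a rare, hard one.
counts = {"symptom": (90, 10, 10), "sdoh": (5, 5, 15)}
macro, micro = macro_micro_f1(counts)
print(round(macro, 4), round(micro, 4))
```

With these counts the micro average comes out well above the macro average, because the rare type's low score barely affects the pooled counts. This is why reporting both is informative for imbalanced entity sets.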
Affiliation(s)
- Syed Raza Bashir
- Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
- Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
- Urooj Qamar
- Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
10
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND: Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as dependencies and topology of sentences. Therefore, integrating semantic and syntactic features into the BioNER model is an urgent problem.
RESULTS: In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation can introduce more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences, and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities and get the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively.
CONCLUSION: The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task as a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance.
Affiliation(s)
- Xiangwen Zheng
- Academy of Military Medical Sciences, Beijing, 100039, China
- Haijian Du
- Academy of Military Medical Sciences, Beijing, 100039, China
- Xiaowei Luo
- Academy of Military Medical Sciences, Beijing, 100039, China
- Fan Tong
- Academy of Military Medical Sciences, Beijing, 100039, China
- Wei Song
- Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
- Dongsheng Zhao
- Academy of Military Medical Sciences, Beijing, 100039, China
11
Zheng Y, Han Z, Cai Y, Duan X, Sun J, Yang W, Huang H. An imConvNet-based deep learning model for Chinese medical named entity recognition. BMC Med Inform Decis Mak 2022; 22:303. [PMID: 36411432 PMCID: PMC9677659 DOI: 10.1186/s12911-022-02049-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Accepted: 11/15/2022] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND: With the development of current medical technology, information management in the medical field continues to improve. Medical big data analysis is based on a large amount of medical and health data stored in the electronic medical system, such as electronic medical records and medical reports. How to fully exploit the information contained in these medical data has long been a subject of research. The basis for text mining is named entity recognition (NER), which has its particularities in the medical field, where issues such as inadequate text resources and a large number of professional domain terms continue to pose significant challenges for medical NER.
METHODS: We improved the convolutional neural network model (imConvNet) to obtain additional text features. Concurrently, we continue to use the classical BERT pre-training model and BiLSTM model for named entity recognition. We use the imConvNet model to extract additional word vector features and improve named entity recognition accuracy. The proposed model, named BERT-imConvNet-BiLSTM-CRF, is composed of four layers: a BERT embedding layer, which produces the word embedding vectors; an imConvNet layer, which captures the context features of each character; a BiLSTM (Bidirectional Long Short-Term Memory) layer, which captures long-distance dependencies; and a CRF (Conditional Random Field) layer, which labels characters based on their features and transition rules.
RESULTS: The average F1 score on the public medical data set yidu-s4k reached 91.38% when combined with the classical model; when real electronic medical record text on impacted wisdom teeth is used as the experimental object, the model's F1 score is 93.89%. Both results are better than those of the classical models.
CONCLUSIONS: The suggested novel model (imConvNet) significantly improves the recognition accuracy of Chinese medical named entities and applies to various medical corpora.
Affiliation(s)
- Yuchen Zheng
- Medical College, Guizhou University, Guiyang, 550025, Guizhou, China
- Zhenggong Han
- Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang, 550025, Guizhou, China
- Yimin Cai
- Medical College, Guizhou University, Guiyang, 550025, Guizhou, China
- Xubo Duan
- Medical College, Guizhou University, Guiyang, 550025, Guizhou, China
- Jiangling Sun
- Guiyang Hospital of Stomatology, Guiyang, 550002, Guizhou, China
- Wei Yang
- Medical College, Guizhou University, Guiyang, 550025, Guizhou, China
- Haisong Huang
- Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang, 550025, Guizhou, China
12
Nassar M, Rogers AB, Talo' F, Sanchez S, Shafique Z, Finn RD, McEntyre J. A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications. Gigascience 2022; 11:giac077. [PMID: 35950838 PMCID: PMC9366992 DOI: 10.1093/gigascience/giac077] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/16/2022] [Revised: 06/13/2022] [Accepted: 07/12/2022] [Indexed: 11/17/2022] Open
Abstract
Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally/taxonomically), either from a longitudinal study or cross-sectional studies, can provide clues into how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. Nevertheless, these missing metadata may be found in the narrative of publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in European Nucleotide Archive (ENA) and MGnify. Using this framework, a new metagenomics annotations pipeline was developed and integrated into Europe PMC to regularly enrich up-to-date ENA and MGnify metagenomics studies with metadata extracted from research articles. These metadata are now available for researchers to explore and retrieve in the MGnify and Europe PMC websites, as well as Europe PMC annotations API.
Affiliation(s)
- Maaly Nassar
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
  - Current affiliation: SciBite - an Elsevier Company, Wellcome Genome Campus, Hinxton, Cambridge CB10 1DR, UK
- Alexander B Rogers
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Francesco Talo'
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Santiago Sanchez
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Zunaira Shafique
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Robert D Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Johanna McEntyre
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
13
|
Tong Y, Zhuang F, Zhang H, Fang C, Zhao Y, Wang D, Zhu H, Ni B. Improving biomedical named entity recognition by dynamic caching inter-sentence information. Bioinformatics 2022; 38:3976-3983. [PMID: 35758612 DOI: 10.1093/bioinformatics/btac422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 06/03/2022] [Accepted: 06/24/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Biomedical Named Entity Recognition (BioNER) aims to identify biomedical domain-specific entities (e.g. gene, chemical and disease) from unstructured texts. Although deep learning-based methods for BioNER achieve satisfactory results, there is still much room for improvement. Firstly, most existing methods use independent sentences as training units and ignore inter-sentence context, which usually leads to the labeling inconsistency problem. Secondly, previous document-level BioNER works have shown that inter-sentence information is essential, but what information should be regarded as context remains ambiguous. Moreover, there are still few pre-training-based BioNER models that have introduced inter-sentence information. Hence, we propose a cache-based inter-sentence model called BioNER-Cache to alleviate the aforementioned problems. RESULTS We propose a simple but effective dynamic caching module to capture inter-sentence information for BioNER. Specifically, the cache stores recent hidden representations constrained by predefined caching rules, and the model uses a query-and-read mechanism to retrieve similar historical records from the cache as the local context. Then, an attention-based gated network is adopted to generate context-related features with BioBERT. To dynamically update the cache, we design a scoring function and implement a multi-task approach to jointly train our model. We build a comprehensive benchmark on four biomedical datasets to evaluate the model performance fairly. Finally, extensive experiments clearly validate the superiority of our proposed BioNER-Cache compared with various state-of-the-art intra-sentence and inter-sentence baselines. AVAILABILITY AND IMPLEMENTATION Code will be available at https://github.com/zgzjdx/BioNER-Cache. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
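The query-and-read idea in the abstract above can be pictured with a small sketch: a bounded cache of recent vectors from which the records most similar to a query are retrieved. This is only an assumption-laden illustration. The class `RecentCache`, the use of cosine similarity, and the toy two-dimensional vectors are hypothetical, and the paper's caching rules, scoring function and gated network are not reproduced here.

```python
import math
from collections import deque
from typing import Deque, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RecentCache:
    """Keeps the most recent (key, vector) records, evicting the oldest."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest record when full.
        self.records: Deque[Tuple[str, List[float]]] = deque(maxlen=capacity)

    def write(self, key: str, vector: Sequence[float]) -> None:
        self.records.append((key, list(vector)))

    def read(self, query: Sequence[float], top_k: int = 1) -> List[str]:
        """Return the keys of the top_k records most similar to the query."""
        ranked = sorted(
            self.records, key=lambda rec: cosine(query, rec[1]), reverse=True
        )
        return [key for key, _ in ranked[:top_k]]


cache = RecentCache(capacity=2)
cache.write("sentence 1", [1.0, 0.0])
cache.write("sentence 2", [0.0, 1.0])
cache.write("sentence 3", [0.7, 0.7])  # evicts "sentence 1"
print(cache.read([1.0, 0.1]))  # ['sentence 3']
```

Although "sentence 1" would have been the closest match to the query, it has already been evicted, so the read returns the best record still held. That trade-off between recency and relevance is what a cache update policy has to manage.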
Affiliation(s)
- Yiqi Tong
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
- Fuzhen Zhuang
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
  - SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China
- Huajie Zhang
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
- Chuyu Fang
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
- Yu Zhao
  - School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China
- Deqing Wang
  - SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China
- Bin Ni
  - Xiamen Data Intelligence Academy of ICT, CAS, Xiamen 361021, China
|
14
|
Zhang Z, Xiong H, Xu T, Qin C, Zhang L, Chen E. Complex Attributed Network Embedding for medical complication prediction. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-022-01712-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
15
|
Yan J, Chen L, Yu Y, Xu H, Xu Z, Sheng Y, Chen J. Neuroimaging-ITM: A Text Mining Pipeline Combining Deep Adversarial Learning with Interaction Based Topic Modeling for Enabling the FAIR Neuroimaging Study. Neuroinformatics 2022; 20:701-726. [PMID: 35235184 DOI: 10.1007/s12021-022-09571-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/04/2022] [Indexed: 12/31/2022]
Abstract
Sharing of various neuroimaging digital resources has received widespread attention in FAIR (Findable, Accessible, Interoperable and Reusable) neuroscience. In order to support a comprehensive understanding of brain cognition, neuroimaging provenance should be constructed to characterize both research processes and results, and to integrate various digital resources for quick replication and open cooperation. This brings new challenges to neuroimaging text mining, including fragmented information, lack of labelled corpora, and vague topics. This paper proposes a text mining pipeline for enabling the FAIR neuroimaging study. In order to avoid fragmented information, the Brain Informatics provenance model is redesigned based on NIDM (Neuroimaging Data Model) and FAIR facets. It can systematically capture the provenance requests from the FAIR neuroimaging study and then transform them into a group of text mining tasks. A neuroimaging text mining pipeline combining deep adversarial learning with interaction based topic modeling, called neuroimaging interaction topic model (Neuroimaging-ITM), is proposed to automatically extract neuroimaging provenance and identify research topics in the few-shot scenario. Finally, a group of experiments is completed using real data from the journal PLoS One. The experimental results show that Neuroimaging-ITM can systematically and accurately extract provenance information and obtain high-quality research topics from the full text of neuroimaging articles. Most of the mean F1 values of provenance extraction exceed 0.9. The topic coherence and KL (Kullback-Leibler) divergence reach 9.95 and 0.96 respectively. The results are clearly better than those of the baseline methods.
Affiliation(s)
- Jianzhuo Yan
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Engineering Research Center of Digital Community, Beijing University of Technology, Beijing, 100124, China
- Lihong Chen
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Engineering Research Center of Digital Community, Beijing University of Technology, Beijing, 100124, China
- Yongchuan Yu
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Engineering Research Center of Digital Community, Beijing University of Technology, Beijing, 100124, China
- Hongxia Xu
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Engineering Research Center of Digital Community, Beijing University of Technology, Beijing, 100124, China
- Zhe Xu
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
- Ying Sheng
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
- Jianhui Chen
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Beijing International Collaboration Base On Brain Informatics and Wisdom Services, Beijing University of Technology, Beijing, 100124, China
  - Beijing Key Laboratory of MRI and Brain Informatics, Beijing University of Technology, Beijing, 100124, China
|
16
|
Named entity recognition (NER) for Chinese agricultural diseases and pests based on discourse topic and attention mechanism. EVOLUTIONARY INTELLIGENCE 2022. [DOI: 10.1007/s12065-022-00727-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
17
|
Sun J, Liu Y, Cui J, He H. Deep learning-based methods for natural hazard named entity recognition. Sci Rep 2022; 12:4598. [PMID: 35301387 PMCID: PMC8931008 DOI: 10.1038/s41598-022-08667-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/13/2022] [Accepted: 03/09/2022] [Indexed: 12/20/2022] Open
Abstract
Natural hazard named entity recognition is a technique used to recognize natural hazard entities in large volumes of text. It can facilitate the acquisition of natural hazard information and provide a reference for natural hazard mitigation. Named entity recognition faces many challenges, such as the fast change, multiple types and various forms of named entities, which make research on natural hazard named entity recognition difficult. To address these problems, this paper constructs a natural disaster annotated corpus for model training and evaluation, and selects and compares several deep learning methods based on word vector features. A deep learning method for natural hazard named entity recognition can automatically mine text features and reduce the dependence on manual rules. This paper compares and analyzes the deep learning models from three aspects: pretraining, feature extraction and decoding. A natural hazard named entity recognition method based on deep learning is proposed, namely the XLNet-BiLSTM-CRF model. Finally, the research hotspots of natural hazards papers in the past 10 years were obtained through this model. After training, the precision of the XLNet-BiLSTM-CRF model is 92.80%, the recall is 91.74%, and the F1-score is 92.27%. The results show that this method, which is superior to the other methods compared, can effectively recognize natural hazard named entities.
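The three figures reported in the abstract above are linked by the standard definition of the F1-score as the harmonic mean of precision and recall. The short check below is a sketch; the function name `f1_score` is chosen for this example only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract above.
print(round(f1_score(92.80, 91.74), 2))  # 92.27
```

The computed value agrees with the reported F1-score of 92.27%.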
Affiliation(s)
- Junlin Sun
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
- Yanrong Liu
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
- Jing Cui
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
- Handong He
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
|
18
|
Lin S, Xu Z, Sheng Y, Chen L, Chen J. AT-NeuroEAE: A Joint Extraction Model of Events With Attributes for Research Sharing-Oriented Neuroimaging Provenance Construction. Front Neurosci 2022; 15:739535. [PMID: 35321479 PMCID: PMC8936590 DOI: 10.3389/fnins.2021.739535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2021] [Accepted: 12/20/2021] [Indexed: 11/13/2022] Open
Abstract
Provenances are a research focus of neuroimaging resource sharing. A considerable amount of work has been done to construct high-quality neuroimaging provenances in a standardized and convenient way. However, besides existing process-based provenance extraction methods, open research sharing in computational neuroscience still needs a way to extract provenance information from rapidly growing published resources. This paper proposes a literature mining-based approach for research sharing-oriented neuroimaging provenance construction. A group of neuroimaging event-containing attributes is defined to model the whole process of neuroimaging research, and a joint extraction model based on deep adversarial learning, called AT-NeuroEAE, is proposed to realize event extraction in a few-shot learning scenario. Finally, a group of experiments was performed on a real data set from the journal PLOS ONE. Experimental results show that the proposed method provides a practical approach to quickly collect research information for neuroimaging provenance construction oriented to open research sharing.
Affiliation(s)
- Shaofu Lin
  - Faculty of Information Technology, Beijing University of Technology, Beijing, China
  - Beijing Institute of Smart City, Beijing University of Technology, Beijing, China
- Zhe Xu
  - Faculty of Information Technology, Beijing University of Technology, Beijing, China
- Ying Sheng
  - Faculty of Information Technology, Beijing University of Technology, Beijing, China
- Lihong Chen
  - Faculty of Information Technology, Beijing University of Technology, Beijing, China
  - Engineering Research Center of Digital Community, Beijing University of Technology, Beijing, China
- Jianhui Chen
  - Faculty of Information Technology, Beijing University of Technology, Beijing, China
  - Beijing Key Laboratory of Magnetic Resonance Imaging (MRI) and Brain Informatics, Beijing University of Technology, Beijing, China
  - Beijing International Collaboration Base on Brain Informatics and Wisdom Services, Beijing University of Technology, Beijing, China
  - *Correspondence: Jianhui Chen,
|
19
|
Prostate cancer management with lifestyle intervention: From knowledge graph to Chatbot. CLINICAL AND TRANSLATIONAL DISCOVERY 2022. [DOI: 10.1002/ctd2.29] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
|
20
|
Xiong Y, Peng H, Xiang Y, Wong KC, Chen Q, Yan J, Tang B. Leveraging Multi-source Knowledge for Chinese Clinical Named Entity Recognition via Relational Graph Convolutional Network. J Biomed Inform 2022; 128:104035. [PMID: 35217186 DOI: 10.1016/j.jbi.2022.104035] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/22/2021] [Revised: 02/04/2022] [Accepted: 02/18/2022] [Indexed: 11/29/2022]
Abstract
OBJECTIVE External knowledge, such as lexicon of words in Chinese and domain knowledge graph (KG) of concepts, has been recently adopted to improve the performance of machine learning methods for named entity recognition (NER) as it can provide additional information beyond context. However, most existing studies only consider knowledge from one source (i.e., either lexicon or knowledge graph) in different ways and consider lexicon words or KG concepts independently with their boundaries. In this paper, we focus on leveraging multi-source knowledge in a unified manner where lexicon words or KG concepts are well combined with their boundaries for Chinese Clinical NER (CNER). MATERIAL AND METHODS We propose a novel method based on relational graph convolutional network (RGCN), called MKRGCN, to utilize multi-source knowledge in a unified manner for CNER. For any sentence, a relational graph based on words or concepts in each knowledge source is constructed, where lexicon words or KG concepts appearing in the sentence are linked to the containing tokens with the boundary information of the lexicon words or KG concepts. RGCN is used to model all relational graphs constructed from multi-source knowledge, and the representations of tokens from multi-source knowledge are integrated into the context representations of tokens via an attention mechanism. Based on the knowledge-enhanced representations of tokens, we deploy a conditional random field (CRF) layer for named entity label prediction. In this study, a lexicon of words and a medical knowledge graph are used as knowledge sources for Chinese CNER. RESULTS Our proposed method achieves the best performance on CCKS2017 and CCKS2018 in Chinese with F1-scores of 91.88% and 89.91%, respectively, significantly outperforming existing methods. The extended experiments on NCBI-Disease and BC2GM in English also prove the effectiveness of our method when only considering one knowledge source via RGCN. 
CONCLUSION The MKRGCN model can integrate knowledge from the external lexicon and knowledge graph effectively for Chinese CNER and has the potential to be applied to English NER.
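The graph construction step described in the abstract above, in which lexicon words found in a sentence are linked to the tokens they contain together with boundary information, can be sketched as follows. This is a toy illustration under stated assumptions: the function `lexicon_edges`, the role names B/I/E/S, the two-word lexicon and the `max_len` cutoff are all hypothetical, and the RGCN and attention layers of the paper are not shown.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, int, str]  # (matched word, token index, boundary role)


def lexicon_edges(tokens: List[str], lexicon: Set[str], max_len: int = 4) -> List[Edge]:
    """Link every lexicon word found in the sentence to the tokens it covers.

    Roles are hypothetical names: "B" (first token), "E" (last token),
    "I" (inside), and "S" (single-token word).
    """
    edges: List[Edge] = []
    for start in range(len(tokens)):
        # Try every span of up to max_len tokens beginning at `start`.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            word = "".join(tokens[start:end])
            if word not in lexicon:
                continue
            for idx in range(start, end):
                if end - start == 1:
                    role = "S"
                elif idx == start:
                    role = "B"
                elif idx == end - 1:
                    role = "E"
                else:
                    role = "I"
                edges.append((word, idx, role))
    return edges


tokens = list("糖尿病患者")  # character-level tokens
lexicon = {"糖尿病", "患者"}  # toy lexicon, not taken from the paper
for edge in lexicon_edges(tokens, lexicon):
    print(edge)
```

Each edge type (role) could then serve as one relation in a relational graph, so that a token knows both which knowledge entry covers it and where it sits inside that entry.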
Affiliation(s)
- Ying Xiong
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, China
  - Peng Cheng Laboratory, Shenzhen, China
- Hao Peng
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong, China
- Qingcai Chen
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, China
  - Peng Cheng Laboratory, Shenzhen, China
- Jun Yan
  - Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
- Buzhou Tang
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, China
  - Peng Cheng Laboratory, Shenzhen, China
|
21
|
Wang M, Wei Z, Jia M, Chen L, Ji H. Deep learning model for multi-classification of infectious diseases from unstructured electronic medical records. BMC Med Inform Decis Mak 2022; 22:41. [PMID: 35168624 PMCID: PMC8848865 DOI: 10.1186/s12911-022-01776-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/21/2021] [Accepted: 01/28/2022] [Indexed: 01/21/2023] Open
Abstract
Purpose Predictively diagnosing infectious diseases helps in providing better treatment and enhances the prevention and control of such diseases. This study uses actual data from a hospital. A multiple infectious disease diagnostic model (MIDDM) is designed for conducting multi-classification of infectious diseases so as to assist in clinical infectious-disease decision-making. Methods Based on actual hospital medical records of infectious diseases from December 2012 to December 2020, a deep learning model for multi-classification research on infectious diseases is constructed. The data include 20,620 cases covering seven types of infectious diseases, including outpatients and inpatients, of which training data accounted for 80%, i.e., 16,496 cases, and test data accounted for 20%, i.e., 4124 cases. Through the auto-encoder, data normalization and sparse data densification processing are carried out to improve the model training effect. A residual network and attention mechanism are introduced into the MIDDM model to improve the performance of the model. Result MIDDM achieved improved prediction results in diagnosing seven kinds of infectious diseases. In the case of similar disease diagnosis characteristics and similar interference factors, the prediction accuracy of disease classification with more sample data is significantly higher than the prediction accuracy of disease classification with fewer sample data. For instance, the training data for viral hepatitis, influenza, and hand, foot and mouth disease were 2954, 3924, and 3015 cases respectively, and the corresponding test accuracy rates were 99.86%, 98.47%, and 97.31%. There is less training data for syphilis, infectious diarrhea, and measles, i.e., 1208, 575, and 190 cases respectively, and the corresponding test accuracy rates were noticeably lower, i.e., 83.03%, 87.30%, and 42.11%. We also compared the MIDDM model with the models used in other studies.
Using the same input data, taking viral hepatitis as an example, the accuracy of MIDDM is 99.44%, which is significantly higher than that of XGBoost (96.19%), Decision tree (90.13%), Bayesian method (85.19%), and logistic regression (91.26%). Other diseases were also significantly better predicted by MIDDM than by these models. Conclusion The application of the MIDDM model to multi-class diagnosis and prediction of infectious diseases can improve the accuracy of infectious-disease diagnosis. However, these results need to be further confirmed via clinical randomized controlled trials.
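The split sizes and the class imbalance described in the abstract above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted there; the variable names are chosen for this example, and the seventh disease is omitted because the abstract does not give its count.

```python
# Figures quoted in the abstract above.
total_cases = 20_620
train_cases = total_cases * 80 // 100   # 80% training split
test_cases = total_cases - train_cases  # remaining 20%

# Training cases per disease, for the six diseases the abstract lists.
train_per_disease = {
    "viral hepatitis": 2954,
    "influenza": 3924,
    "hand, foot and mouth disease": 3015,
    "syphilis": 1208,
    "infectious diarrhea": 575,
    "measles": 190,
}

# Share of the training set held by each listed disease, in percent.
share = {name: round(100 * n / train_cases, 2) for name, n in train_per_disease.items()}

print(train_cases, test_cases)  # 16496 4124
for name, pct in share.items():
    print(f"{name}: {pct}%")
```

Measles accounts for only about 1.15% of the training cases, which is consistent with the abstract's observation that diseases with fewer samples are predicted less accurately.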
Affiliation(s)
- Mengying Wang
  - Information Management and Big Data Center, Peking University Third Hospital, Beijing, China
- Zhenhao Wei
  - Goodwill Hessian Health Technology Co. Ltd, Beijing, China
- Mo Jia
  - Information Management and Big Data Center, Peking University Third Hospital, Beijing, China
- Lianzhong Chen
  - Goodwill Hessian Health Technology Co. Ltd, Beijing, China
- Hong Ji
  - Information Management and Big Data Center, Peking University Third Hospital, Beijing, China
|
22
|
Li M, Liu F, Zhu J, Zhang R, Qin Y, Gao D. Model-based clinical note entity recognition for rheumatoid arthritis using bidirectional encoder representation from transformers. Quant Imaging Med Surg 2022; 12:184-195. [PMID: 34993070 DOI: 10.21037/qims-21-90] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2021] [Accepted: 06/24/2021] [Indexed: 11/06/2022]
Abstract
Background Rheumatoid arthritis (RA) is a disease of the immune system with a high rate of disability, and there is a large amount of valuable disease diagnosis and treatment information in the clinical notes of the electronic medical record. Artificial intelligence methods can be used to mine useful information in clinical notes effectively. This study aimed to develop an effective method to identify and classify medical entities in the clinical notes relating to RA and use the entity identification results in subsequent studies. Methods In this paper, we introduced the bidirectional encoder representation from transformers (BERT) pre-training model to enhance the semantic representation of word vectors. The generated word vectors were then input into a model composed of a traditional bidirectional long short-term memory neural network and a conditional random field for the named entity recognition of clinical notes, to improve the model's effectiveness. The BERT method takes the combination of token embeddings, segment embeddings, and position embeddings as the model input and fine-tunes the model during training. Results Compared with the traditional Word2vec word vector model, using the BERT pre-training model to obtain word vectors as model input significantly improved performance. The best F1-score of the named entity recognition task after training on a large number of rheumatoid arthritis clinical notes was 0.936. Conclusions This paper confirms the effectiveness of using an advanced artificial intelligence method to carry out named entity recognition tasks on a large corpus of clinical notes; this application is promising in the medical setting. Moreover, the extraction results of this study provide basic data for subsequent tasks, including relation extraction, medical knowledge graph construction, and disease reasoning.
Affiliation(s)
- Meiting Li
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Feifei Liu
  - Department of Ultrasound, Peking University People's Hospital, Beijing, China
- Jia'an Zhu
  - Department of Ultrasound, Peking University People's Hospital, Beijing, China
- Ran Zhang
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Yi Qin
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Dongping Gao
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
|
23
|
Karcioglu AA, Bulut H. The WM-q multiple exact string matching algorithm for DNA sequences. Comput Biol Med 2021; 136:104656. [PMID: 34333228 DOI: 10.1016/j.compbiomed.2021.104656] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/01/2021] [Revised: 07/12/2021] [Accepted: 07/13/2021] [Indexed: 10/20/2022]
Abstract
String matching algorithms are essential in many areas of computer science, such as text search, intrusion detection systems, fraud detection, and sequence search in bioinformatics. Exact string matching algorithms are divided into two classes: single and multiple. Multiple string matching algorithms find the elements of a pattern set P in a given input text T. For DNA sequences, string matching should be performed in a time-efficient manner: as the volume of the text T and the number of search patterns increase, the total runtime increases, so efficient algorithms should be selected to perform these searches as quickly as possible. In this study, the Wu-Manber algorithm, one of the multiple exact string matching algorithms, is improved. Although the Wu-Manber algorithm is effective, it has some limitations, such as hash collisions. In this study, the WM-q algorithm, a version of the Wu-Manber algorithm based on a perfect hash function for DNA sequences, is proposed. String matching is performed using different block lengths provided by the perfect hash function instead of the fixed block length of the traditional Wu-Manber algorithm. The proposed approach has been compared with multiple exact string matching algorithms on the E. coli and Human Chromosome1 datasets, which are frequently used in the literature. The proposed algorithm gives better results for performance metrics such as the average runtime, the average number of character comparisons, and the average number of hash comparisons.
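To make the block-hash and shift-table idea in the abstract above concrete, here is a minimal sketch of a classical Wu-Manber style search over the DNA alphabet. It is not the WM-q algorithm: it uses one fixed block length `q`, and the 2-bit-per-base encoding in `block_hash` is simply one well-known collision-free hash for DNA blocks, assumed here for illustration. The function names and the toy inputs are hypothetical.

```python
from typing import Dict, List, Tuple

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # input must contain only these bases


def block_hash(block: str) -> int:
    """2 bits per base: distinct DNA blocks of equal length never collide."""
    h = 0
    for ch in block:
        h = (h << 2) | CODE[ch]
    return h


def wu_manber_search(text: str, patterns: List[str], q: int = 2) -> List[Tuple[int, str]]:
    """Return (start index, pattern) for every occurrence of every pattern."""
    m = min(len(p) for p in patterns)  # window length = shortest pattern
    if m < q:
        raise ValueError("every pattern must be at least q characters long")
    # SHIFT table: how far the window may safely move for each block value.
    shift = [m - q + 1] * (4 ** q)
    # Buckets: patterns whose first m characters end with a given block.
    bucket: Dict[int, List[str]] = {}
    for p in patterns:
        for end in range(q, m + 1):
            h = block_hash(p[end - q:end])
            shift[h] = min(shift[h], m - end)
        bucket.setdefault(block_hash(p[m - q:m]), []).append(p)

    matches: List[Tuple[int, str]] = []
    pos = m - 1  # index of the last character of the current window
    while pos < len(text):
        h = block_hash(text[pos - q + 1:pos + 1])
        if shift[h] > 0:
            pos += shift[h]
            continue
        start = pos - m + 1
        # Shift 0: verify only the candidate patterns in this bucket.
        for p in bucket.get(h, []):
            if text.startswith(p, start):
                matches.append((start, p))
        pos += 1
    return matches


print(wu_manber_search("ACGTACGTTGCA", ["CGTA", "GTT", "TGCA"]))
# [(1, 'CGTA'), (6, 'GTT'), (8, 'TGCA')]
```

Because the hash is collision-free, a zero shift means the text block really does equal the closing block of at least one pattern prefix, so verification is restricted to genuine candidates rather than patterns that merely share a hash bucket.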
Affiliation(s)
- Hasan Bulut
  - Department of Computer Engineering, Ege University, Izmir, Turkey
|
24
|
Karcioglu AA, Bulut H. Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences. Comput Biol Med 2021; 131:104292. [PMID: 33662682 DOI: 10.1016/j.compbiomed.2021.104292] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/01/2020] [Revised: 02/16/2021] [Accepted: 02/16/2021] [Indexed: 12/22/2022]
Abstract
Exact string matching algorithms involve finding all occurrences of a pattern P in a text T. These algorithms have been extensively studied in computer science, primarily because of their applications in various fields such as text search and computational biology. The main goal of exact string matching algorithms is to find all pattern matches correctly within the shortest possible time frame. Although hash-based string matching algorithms run fast, there are shortcomings, such as hash collisions. In this study, a novel hash function has been proposed that eliminates hash collisions for DNA sequences. It provides us perfect hashing and produces hash values in a time-efficient manner. We have proposed two exact string matching algorithms based on the proposed hash function. In the first approach, we replace the traditional Hash-q algorithm's hash function with the proposed one. In the second approach, we improved the first approach by utilizing the shift size indicated at the (m-1)th entry in the good suffix shift table when an exact matching is found. In these approaches, we eliminate the need to compare the last q characters of the pattern and text. We have included six algorithms from the literature in our evaluations. E. Coli and Human Chromosome1 datasets from the literature and a synthetic dataset produced randomly are utilized for comparisons. The results show that the proposed approaches achieve better performance metrics in terms of the average runtime, the average number of character comparisons, and the average number of hash comparisons.
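The abstract above notes that a collision-free hash removes the need to compare the last q characters of the pattern and the text. The sketch below illustrates that single point with a base-4 encoding of q-grams, which is an assumed stand-in for the paper's hash function. The window advances one position at a time, so the shift tables of the Hash-q family are deliberately left out, and the names `qgram_hash` and `search` are hypothetical.

```python
from itertools import product
from typing import List

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # input must contain only these bases


def qgram_hash(gram: str) -> int:
    """Read the q-gram as a base-4 number, so no two q-grams share a value."""
    h = 0
    for ch in gram:
        h = h * 4 + CODE[ch]
    return h


def search(text: str, pattern: str, q: int = 3) -> List[int]:
    """Return the start index of every occurrence of pattern in text."""
    m = len(pattern)
    if m < q or len(text) < m:
        return []
    target = qgram_hash(pattern[m - q:])
    prefix = pattern[:m - q]
    hits = []
    for start in range(len(text) - m + 1):
        window_hash = qgram_hash(text[start + m - q:start + m])
        # Equal hashes imply equal q-grams, so only the prefix is compared.
        if window_hash == target and text.startswith(prefix, start):
            hits.append(start)
    return hits


# Every one of the 4**3 = 64 possible 3-grams gets a distinct hash value.
all_hashes = {qgram_hash("".join(g)) for g in product("ACGT", repeat=3)}
print(len(all_hashes))                     # 64
print(search("ACGTACGTTACGT", "ACGT"))     # [0, 4, 9]
```

With an ordinary hash, an equal hash value would still require comparing all m characters to rule out a collision. Here the comparison shrinks to the first m - q characters.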
Affiliation(s)
- Hasan Bulut
  - Department of Computer Engineering, Ege University, Izmir, Turkey
|
25
|
Pandey B, Kumar Pandey D, Pratap Mishra B, Rhmann W. A comprehensive survey of deep learning in the field of medical imaging and medical natural language processing: Challenges and research directions. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2021. [DOI: 10.1016/j.jksuci.2021.01.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
|
26
|
Fine-Grained Mechanical Chinese Named Entity Recognition Based on ALBERT-AttBiLSTM-CRF and Transfer Learning. Symmetry (Basel) 2020. [DOI: 10.3390/sym12121986] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/16/2022] Open
Abstract
Manufacturing text often exists as unlabeled data; the entities are fine-grained and extraction is difficult. As a result, the knowledge utilization rate in the manufacturing industry is low. This paper proposes a novel Chinese fine-grained NER (named entity recognition) method based on symmetry lightweight deep multinetwork collaboration (ALBERT-AttBiLSTM-CRF) and model transfer considering active learning (MTAL) to study fine-grained named entity recognition for Chinese text with few labeled data. The method is divided into two stages. In the first stage, ALBERT-AttBiLSTM-CRF was verified on the CLUENER2020 dataset (public dataset) to obtain a pretrained model; the experiments show that the model obtains an F1 score of 0.8962, an improvement of 9.2% over the best baseline algorithm. In the second stage, the pretrained model was transferred to the Manufacturing-NER dataset (our dataset), and an active learning strategy was used to optimize the model. The final F1 score on Manufacturing-NER was 0.8931 after the model transfer, compared with 0.8576 before the transfer, an improvement of 3.55 percentage points. Our method effectively transfers existing knowledge from public source data to scientific target data, addressing the problem of named entity recognition with scarce labeled domain data.
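The 3.55% improvement quoted in the abstract above is the absolute difference between the two F1 scores, not a relative gain. The following short check, with variable names chosen for this example, makes the distinction explicit.

```python
f1_before = 0.8576  # Manufacturing-NER F1 before model transfer
f1_after = 0.8931   # Manufacturing-NER F1 after model transfer

# Absolute gain, expressed in percentage points.
absolute_gain_points = round((f1_after - f1_before) * 100, 2)
# Relative gain with respect to the score before transfer, in percent.
relative_gain_percent = round((f1_after - f1_before) / f1_before * 100, 2)

print(absolute_gain_points)   # 3.55
print(relative_gain_percent)  # 4.14
```

The reported figure matches the absolute difference of 3.55 points; measured relative to the pre-transfer score, the same change is about 4.14%.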
|
27
|
Gajendran S, D M, Sugumaran V. Character level and word level embedding with bidirectional LSTM - Dynamic recurrent neural network for biomedical named entity recognition from literature. J Biomed Inform 2020; 112:103609. [PMID: 33122119 DOI: 10.1016/j.jbi.2020.103609] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/18/2020] [Revised: 10/14/2020] [Accepted: 10/22/2020] [Indexed: 12/22/2022]
Abstract
Named Entity Recognition is the process of identifying different entities in a given context. Biomedical Named Entity Recognition (BNER) is the task of extracting chemical names from biomedical texts to support biomedical and translational research. The aim of the system is to extract useful chemical names from biomedical literature text without a lot of handcrafted engineering features. This approach introduces a novel neural network architecture with the composition of bidirectional long short-term memory (BLSTM), dynamic recurrent neural network (RNN) and conditional random field (CRF) that uses character level and word level embedding as the only features to identify the chemical entities. Using this approach we have achieved the F1 score of 89.98 on BioCreAtIvE II GM corpus and 90.84 on NCBI corpus by outperforming the existing systems. Our system is based on the deep neural architecture that uses both character and word level embedding which captures the morphological and orthographic information eliminating the need for handcrafted engineering features. The proposed system outperforms the existing systems without a lot of handcrafted engineering features. The embedding concept along with the bidirectional LSTM network proved to be an effective method to identify most of the chemical entities.
Affiliation(s)
- Sudhakaran Gajendran
- Department of Computer Science and Engineering, College of Engineering Guindy, Anna University, Chennai, India.
| | - Manjula D
- Department of Computer Science and Engineering, College of Engineering Guindy, Anna University, Chennai, India.
| | - Vijayan Sugumaran
- Center for Data Science and Big Data Analytics, Oakland University, Rochester, MI, USA; Department of Decision and Information Sciences, School of Business Administration, Oakland University, Rochester, MI, USA.
| |
|
28
|
Wen G, Chen H, Li H, Hu Y, Li Y, Wang C. Cross domains adversarial learning for Chinese named entity recognition for online medical consultation. J Biomed Inform 2020; 112:103608. [PMID: 33132138 DOI: 10.1016/j.jbi.2020.103608] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/08/2020] [Revised: 10/19/2020] [Accepted: 10/22/2020] [Indexed: 11/19/2022]
Abstract
Deep learning methods have been applied to Chinese named entity recognition for the online medical consultation. They require a large number of marked samples. However, no such database is available at present. This paper begins with constructing a larger labelled Chinese texts database for the online medical consultation. Second, a basic framework unit is proposed, which is pre-trained by the transfer learning from both Bidirectional language model and Mask language model trained on the larger unlabelled data. Finally, cross domains adversarial learning (CDAL) for Chinese named entity recognition is proposed to further improve the performance, which not only uses the pre-trained basic framework unit, but also uses the adversarial multi-task learning on both electronic medical record texts and online medical consultation texts. Experimental results validate the effectiveness of CDAL.
Affiliation(s)
- Guihua Wen
- School of Computer Science & Engineering, South China University of Technology, Guangzhou, China
| | - Hehong Chen
- School of Computer Science & Engineering, South China University of Technology, Guangzhou, China
| | - Huihui Li
- School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, China.
| | - Yang Hu
- School of Computer Science & Engineering, South China University of Technology, Guangzhou, China
| | - Yanghui Li
- School of Computer Science & Engineering, South China University of Technology, Guangzhou, China
| | - Changjun Wang
- Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
| |
|
29
|
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. [PMID: 31501885 PMCID: PMC7703786 DOI: 10.1093/bioinformatics/btz682] [Citation(s) in RCA: 1132] [Impact Index Per Article: 283.0] [Received: 05/16/2019] [Revised: 07/29/2019] [Accepted: 09/05/2019] [Indexed: 12/15/2022] Open
Abstract
MOTIVATION Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. RESULTS We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. AVAILABILITY AND IMPLEMENTATION We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
Affiliation(s)
- Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
| | - Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
| | - Sungdong Kim
- Clova AI Research, Naver Corp, Seong-Nam 13561, Korea
| | - Donghyeon Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
| | - Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
| | - Chan Ho So
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea; Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
| |
|
30
|
Teng F, Yang W, Chen L, Huang L, Xu Q. Explainable Prediction of Medical Codes With Knowledge Graphs. Front Bioeng Biotechnol 2020; 8:867. [PMID: 32923430 PMCID: PMC7456905 DOI: 10.3389/fbioe.2020.00867] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 04/19/2020] [Accepted: 07/06/2020] [Indexed: 11/23/2022] Open
Abstract
The International Classification of Diseases (ICD) is an authoritative health care classification system of different diseases. It is widely used for disease and health records, assisted medical reimbursement decisions, and collecting morbidity and mortality statistics. Most existing ICD coding models only translate simple diagnosis descriptions into ICD codes, which obscures the reasons and details behind specific diagnoses. Besides, the label (code) distribution is uneven, and there is a dependency between labels. Based on the above considerations, the knowledge graph and attention mechanism were expanded into medical code prediction to improve interpretability. In this study, a new method called G_Coder was presented, which mainly consists of Multi-CNN, graph presentation, attentional matching, and adversarial learning. The medical knowledge graph was constructed by extracting entities related to ICD-9 from Freebase. The ontology contains 5 entity classes, which are disease, symptom, medicine, surgery, and examination. The result of G_Coder on the MIMIC-III dataset showed that the micro-F1 score is 69.2%, surpassing the state of the art. The following conclusions can be obtained through the experiment: G_Coder integrates information across medical records using Multi-CNN and embeds knowledge into ICD codes. Adversarial learning is used to generate the adversarial samples to reconcile the writing styles of doctors. With the knowledge graph and attention mechanism, the most relevant segments of medical codes can be explained. This suggests that the knowledge graph significantly improves the precision of code prediction and reduces the working pressure of the human coders.
Affiliation(s)
- Fei Teng
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
| | - Wei Yang
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
| | - Li Chen
- The Third People's Hospital of Chengdu, Chengdu, China
| | - LuFei Huang
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
- The Third People's Hospital of Chengdu, Chengdu, China
| | - Qiang Xu
- School of Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| |
|
31
|
Abstract
OBJECTIVES We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basic IE tasks, named entity recognition and relation extraction, for two selected semantic classes, diseases and drugs (or medications), and relations between them. METHODS For the time period from 2017 to early 2020, we searched for relevant publications from three major scientific communities: medicine and medical informatics, natural language processing, as well as neural networks and artificial intelligence. RESULTS In the past decade, the field of Natural Language Processing (NLP) has undergone a profound methodological shift from symbolic to distributed representations based on the paradigm of Deep Learning (DL). Meanwhile, this trend is, although with some delay, also reflected in the medical NLP community. In the reporting period, overwhelming experimental evidence has been gathered, as illustrated in this survey for medical IE, that DL-based approaches outperform non-DL ones by often large margins. Still, small-sized and access-limited corpora create intrinsic problems for data-greedy DL as do special linguistic phenomena of medical sublanguages that have to be overcome by adaptive learning strategies. CONCLUSIONS The paradigm shift from (feature-engineered) ML to DNNs changes the fundamental methodological rules of the game for medical NLP. This change is by no means restricted to medical IE but should also deeply influence other areas of medical informatics, either NLP- or non-NLP-based.
Affiliation(s)
- Udo Hahn
- Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
| | - Michel Oleynik
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
| |
|
32
|
Yu G, Yang Y, Wang X, Zhen H, He G, Li Z, Zhao Y, Shu Q, Shu L. Adversarial active learning for the identification of medical concepts and annotation inconsistency. J Biomed Inform 2020; 108:103481. [PMID: 32687985 DOI: 10.1016/j.jbi.2020.103481] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/10/2020] [Revised: 05/08/2020] [Accepted: 06/08/2020] [Indexed: 02/01/2023]
Abstract
OBJECTIVE Named entity recognition (NER) is a principal task in the biomedical field and deep learning-based algorithms have been widely applied to biomedical NER. However, all of these methods that are applied to biomedical corpora use only annotated samples to maximize their performances. Thus, (1) large numbers of unannotated samples are relinquished and their values are overlooked. (2) Compared with other types of active learning (AL) algorithms, generative adversarial learning (GAN)-based AL methods have developed slowly. Furthermore, current diversity-based AL methods only compute similarities between a pair of sentences and cannot evaluate distribution similarities between groups of sentences. Annotation inconsistency is one of the significant challenges in the biomedical annotation field. Most existing methods for addressing this challenge are statistics-based or rule-based methods. (3) They require sufficient expert knowledge and complex designs. To address challenges (1), (2), and (3) simultaneously, we propose innovative algorithms. METHODS GAN is introduced in this paper, and we propose the GAN-bidirectional long short-term memory-conditional random field (GAN-BiLSTM-CRF) and the GAN-bidirectional encoder representations from transformers-conditional random field (GAN-BERT-CRF) models, which can be considered an NER model, an AL model, and a model identifying error labels. BiLSTM-CRF or BERT-CRF is defined as the generator and a convolutional neural network (CNN)-based network is considered the discriminator. (1) The generator employs unannotated samples in addition to annotated samples to maximize NER performance. (2) The outputs of the CRF layer and the discriminator are used to select unlabeled samples for the AL task. (3) The discriminator discriminates the distribution of error labels from that of correct labels, identifies error labels, and addresses the annotation inconsistency challenge.
RESULTS The corpus from the 2010 i2b2/VA NLP challenge and the Chinese CCKS-2017 Task 2 dataset are adopted for experiments. Compared to the baseline BiLSTM-CRF and BERT-CRF, the GAN-BiLSTM-CRF and GAN-BERT-CRF models achieved significant improvements on the precision, recall, and F1 scores in terms of NER performance. Learning curves in AL experiments show the comparative results of the proposed models. Furthermore, the trained discriminator can identify samples with incorrect medical labels in both simulation and real-world experimental environments. CONCLUSION The idea of introducing GAN contributes significant results in terms of NER, active learning, and the ability to identify incorrectly annotated samples. The benefits of GAN will be further studied.
Affiliation(s)
- Gang Yu
- Department of IT Center, the Children's Hospital, Zhejiang University School of Medicine, China; National Clinical Research Center for Child Health, China.
| | - Yiwen Yang
- Department of Artificial Intelligence, Enterprise Institute, Ewell Technology, China.
| | - Xuying Wang
- Department of Artificial Intelligence, Enterprise Institute, Ewell Technology, China.
| | - Huachun Zhen
- Department of Artificial Intelligence, Enterprise Institute, Ewell Technology, China.
| | - Guoping He
- Department of Artificial Intelligence, Enterprise Institute, Ewell Technology, China.
| | - Zheming Li
- Department of IT Center, the Children's Hospital, Zhejiang University School of Medicine, China; National Clinical Research Center for Child Health, China.
| | - Yonggen Zhao
- Department of IT Center, the Children's Hospital, Zhejiang University School of Medicine, China; National Clinical Research Center for Child Health, China.
| | - Qiang Shu
- National Clinical Research Center for Child Health, China.
| | - Liqi Shu
- Department of Neurology, Warren Alpert Medical School of Brown University, United States
| |
|
33
|
Brasil S, Pascoal C, Francisco R, dos Reis Ferreira V, Videira PA, Valadão G. Artificial Intelligence (AI) in Rare Diseases: Is the Future Brighter? Genes (Basel) 2019; 10:genes10120978. [PMID: 31783696 PMCID: PMC6947640 DOI: 10.3390/genes10120978] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 09/30/2019] [Revised: 11/19/2019] [Accepted: 11/20/2019] [Indexed: 02/06/2023] Open
Abstract
The amount of data collected and managed in (bio)medicine is ever-increasing. Thus, there is a need to rapidly and efficiently collect, analyze, and characterize all this information. Artificial intelligence (AI), with an emphasis on deep learning, holds great promise in this area and is already being successfully applied to basic research, diagnosis, drug discovery, and clinical trials. Rare diseases (RDs), which are severely underrepresented in basic and clinical research, can particularly benefit from AI technologies. Of the more than 7000 RDs described worldwide, only 5% have a treatment. The ability of AI technologies to integrate and analyze data from different sources (e.g., multi-omics, patient registries, and so on) can be used to overcome RDs’ challenges (e.g., low diagnostic rates, reduced number of patients, geographical dispersion, and so on). Ultimately, RDs’ AI-mediated knowledge could significantly boost therapy development. Presently, there are AI approaches being used in RDs and this review aims to collect and summarize these advances. A section dedicated to congenital disorders of glycosylation (CDG), a particular group of orphan RDs that can serve as a potential study model for other common diseases and RDs, has also been included.
Affiliation(s)
- Sandra Brasil
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Carlota Pascoal
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Rita Francisco
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Vanessa dos Reis Ferreira
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Paula A. Videira
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Gonçalo Valadão
- Instituto de Telecomunicações, 1049-001 Lisboa, Portugal
- Departamento de Ciências e Tecnologias, Autónoma Techlab–Universidade Autónoma de Lisboa, 1169-023 Lisboa, Portugal
- Electronics, Telecommunications and Computers Engineering Department, Instituto Superior de Engenharia de Lisboa, 1959-007 Lisboa, Portugal
| |
|