1
Han P, Li X, Zhang Z, Zhong Y, Gu L, Hua Y, Li X. CMCN: Chinese medical concept normalization using continual learning and knowledge-enhanced. Artif Intell Med 2024; 157:102965. [PMID: 39241561 DOI: 10.1016/j.artmed.2024.102965] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 05/10/2024] [Accepted: 08/19/2024] [Indexed: 09/09/2024]
Abstract
Medical Concept Normalization (MCN) is a crucial process for deep information extraction and natural language processing tasks, and it plays a vital role in biomedical research. Although research on English MCN has made significant progress, Chinese medical concept normalization (CMCN) remains insufficiently explored due to its complex syntactic structure and the paucity of Chinese medical semantic and ontology resources. In recent years, deep learning has been extensively applied across numerous natural language processing tasks, owing to its robust learning capabilities, adaptability, and transferability. It has proven to be well suited for intricate and specialized knowledge discovery research in the biomedical field. In this study, we conduct research on CMCN through the lens of deep learning. Specifically, our research introduces a model that leverages polymorphic semantic information and knowledge enhancement through multi-task learning and retains more important medical features through continual learning. As the cornerstone of CMCN, disease names are the main focus of this research. We evaluated various methodologies on a Chinese disease dataset that we built, achieving 76.12% on Accuracy@1, 87.20% on Accuracy@5 and 90.02% on Accuracy@10 with our best-performing model, GCBM-BSCL. This research not only advances the fields of knowledge mining and medical concept normalization but also enhances the integration and application of artificial intelligence in the medical and health field. We have published the source code and results at https://github.com/BearLiX/CMCN.
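The Accuracy@1, Accuracy@5 and Accuracy@10 figures reported in this abstract can be read as the share of mentions whose gold concept appears among the top k ranked candidates. The following is a minimal sketch of that metric under this common definition; the function name `accuracy_at_k` and the toy concept IDs are hypothetical and are not taken from the CMCN code base.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Fraction of mentions whose gold concept is among the top-k candidates.

    ranked_candidates: one list of concept IDs per mention, best candidate first.
    gold: the gold concept ID for each mention, in the same order.
    """
    if not gold or len(ranked_candidates) != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)


# Toy data (hypothetical IDs): three mentions with ranked candidate concepts.
ranked = [["C1", "C2", "C3"], ["C9", "C4", "C5"], ["C7", "C8", "C6"]]
gold = ["C1", "C4", "C0"]

print(accuracy_at_k(ranked, gold, 1))  # only the first mention is a top-1 hit
print(accuracy_at_k(ranked, gold, 2))  # the second mention is also hit at k=2
```

Because the top-k window only grows with k, Accuracy@k can never decrease as k increases, which matches the ordering of the three reported scores.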
Affiliation(s)
- Pu Han
- School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China; Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
- Xiong Li
- School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Zhanpeng Zhang
- School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Yule Zhong
- School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Liang Gu
- School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Yingying Hua
- School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Xiaoyan Li
- School of Basic Medical Sciences, Nanjing Medical University, Nanjing 210029, China
2
Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024; 2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
The identification of medical concepts in clinical narratives is of great interest to the biomedical scientific community because of its importance for treatment improvement and drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology overcomes overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important to train a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation of Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51.
The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
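The "combined micro-averaged F1-score" mentioned in this abstract pools true positives, false positives and false negatives over all entity classes before computing a single score. A minimal sketch of that standard computation follows; the function name `micro_f1` and the per-class counts are hypothetical illustrations, not figures or code from the Multi-Head-CRF repository.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class counts.

    counts: dict mapping class name -> dict with integer keys "tp", "fp", "fn".
    Counts are summed over classes first, then one F1 is computed.
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # no predictions and no gold entities
    # Equivalent to the harmonic mean of pooled precision and pooled recall.
    return 2 * tp / denominator


# Hypothetical counts for two of the five entity classes.
example_counts = {
    "disease": {"tp": 8, "fp": 2, "fn": 2},
    "chemical": {"tp": 2, "fp": 3, "fn": 1},
}
print(round(100 * micro_f1(example_counts), 2))  # score on a 0-100 scale
```

Pooling the counts means frequent classes weigh more heavily than rare ones, which is the usual reason a micro-averaged score is reported alongside per-class results.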
Affiliation(s)
- Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
3
French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Inform 2023; 137:104252. [PMID: 36464228 PMCID: PMC9845184 DOI: 10.1016/j.jbi.2022.104252] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/22/2022] [Revised: 09/19/2022] [Accepted: 11/15/2022] [Indexed: 12/04/2022]
Abstract
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing, both for translational information extraction applications and for providing context for downstream tasks like relationship extraction. In this paper, we survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field.
Affiliation(s)
- Evan French
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
4
Almeida T, Antunes R, Silva JF, Almeida JR, Matos S. Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics. Database (Oxford) 2022; 2022:6625810. [PMID: 35776534 PMCID: PMC9248917 DOI: 10.1093/database/baac047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 05/13/2022] [Accepted: 06/06/2022] [Indexed: 11/14/2022]
Abstract
The identification of chemicals in articles has attracted great interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because these contain additional valuable information that must be explored. The manual expert task of indexing Medical Subject Headings (MeSH) terms to these articles later helps researchers find the most relevant publications for their ongoing work. The BioCreative VII NLM-Chem track fostered the development of systems for chemical identification and indexing in PubMed full-text articles. Chemical identification consisted of identifying the chemical mentions and linking these to unique MeSH identifiers. This manuscript describes our participation system and the post-challenge improvements we made. We propose a three-stage pipeline that individually performs chemical mention detection, entity normalization and indexing. Regarding chemical identification, we adopted a deep-learning solution that utilizes the PubMedBERT contextualized embeddings followed by a multilayer perceptron and a conditional random field tagging layer. For the normalization approach, we use sieve-based dictionary filtering followed by a deep-learning similarity search strategy. Finally, for the indexing we developed rules for identifying the more relevant MeSH codes for each article. During the challenge, our system obtained the best official results in the normalization and indexing tasks despite the lower performance in the chemical mention recognition task. In a post-contest phase we boosted our results by improving our named entity recognition model with additional techniques. The final system achieved 0.8731, 0.8275 and 0.4849 in the chemical identification, normalization and indexing tasks, respectively. The code to reproduce our experiments and run the pipeline is publicly available.
Database URL
https://github.com/bioinformatics-ua/biocreativeVII_track2
Affiliation(s)
- Tiago Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- João F. Silva
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- João R Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain
- Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
5
Xu D, Miller T. A simple neural vector space model for medical concept normalization using concept embeddings. J Biomed Inform 2022; 130:104080. [PMID: 35472514 PMCID: PMC9351985 DOI: 10.1016/j.jbi.2022.104080] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/19/2022] [Revised: 04/15/2022] [Accepted: 04/19/2022] [Indexed: 11/24/2022]
Abstract
OBJECTIVE Medical concept normalization (MCN), the task of linking textual mentions to concepts in an ontology, provides a solution to unify different ways of referring to the same concept. In this paper, we present a simple neural MCN model that takes mentions as input and directly predicts concepts. MATERIALS AND METHODS We evaluate our proposed model on clinical datasets from ShARe/CLEF eHealth 2013 shared task and 2019 n2c2/OHNLP shared task track 3. Our neural MCN model consists of an encoder and a normalized temperature-scaled softmax (NT-softmax) layer that maximizes the cosine similarity score of matching the mention to the correct concept. We adopt SAPBERT as the encoder and initialize the weights in the NT-softmax layer with pre-computed concept embeddings from SAPBERT. RESULTS Our proposed neural model achieves competitive performance on ShARe/CLEF 2013 and establishes a new state-of-the-art on 2019-n2c2-MCN. Yet this model is simpler than most prior work: it requires no complex pipelines, no hand-crafted rules, and no preprocessing, making it simpler to apply in new settings. DISCUSSION Analyses of our proposed model show that the NT-softmax is better than the conventional softmax on the MCN task, and both the CUI-less threshold parameter and the initialization of the weight vectors in the NT-softmax layer contribute to the improvements. CONCLUSION We propose a simple neural model for clinical MCN, a one-step approach with simpler inference and more effective performance than prior work. Our analyses demonstrate that future work on MCN may require more effort on unseen concepts.
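The NT-softmax layer described in this abstract scores a mention against each concept by cosine similarity, divides by a temperature, and applies a softmax. The sketch below illustrates that idea in plain Python under stated assumptions: the function names, the toy two-dimensional vectors and the temperature value of 0.1 are all hypothetical, and the CUI-less threshold mentioned in the abstract is not modelled. It is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def nt_softmax(mention_vec, concept_vecs, temperature=0.1):
    """Probabilities over concepts from temperature-scaled cosine similarities.

    temperature is an assumed illustrative value, not one taken from the paper.
    """
    m = l2_normalize(mention_vec)
    logits = []
    for concept in concept_vecs:
        c = l2_normalize(concept)
        cosine = sum(a * b for a, b in zip(m, c))
        logits.append(cosine / temperature)
    # Subtract the maximum logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings: the mention points almost exactly along the first concept.
mention = [1.0, 0.1]
concepts = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
probs = nt_softmax(mention, concepts)
print(probs.index(max(probs)))  # index of the highest-scoring concept
```

Because both sides are length-normalized, rescaling the mention or concept vectors leaves the output unchanged; only their directions and the temperature affect the distribution.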
Affiliation(s)
- Dongfang Xu
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Timothy Miller
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
6
Mitchell JR, Szepietowski P, Howard R, Reisman P, Jones JD, Lewis P, Fridley BL, Rollison DE. A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study. J Med Internet Res 2022; 24:e27210. [PMID: 35319481 PMCID: PMC8987958 DOI: 10.2196/27210] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/16/2021] [Revised: 10/22/2021] [Accepted: 11/10/2021] [Indexed: 11/30/2022] Open
Abstract
Background Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more. Objective The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports. Methods We pursued three specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search for the answers to two predefined questions in each pathology report: What organ contains the tumor? and What is the kind of tumor or carcinoma? This involved supervised training on 8197 pathology reports, each with ground truth answers to these 2 questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. 
The tumor site and histology phrases extracted by the Q&A model were used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes and another to predict histology codes. Our final system includes a network of 3 BERT-based models. We call this CancerBERT network (caBERTnet). We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars. Results caBERTnet’s accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.6% (1993/2042), respectively. The top 5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training data set were 92.95% (1794/1930) and 96.01% (1853/1930), respectively. Conclusions We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
Affiliation(s)
- Joseph Ross Mitchell
- Department of Machine Learning, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States; Department of Medicine, Faculty of Medicine & Dentistry, and the Alberta Machine Intelligence Institute, University of Alberta, Edmonton, AB, Canada; Alberta Health Services, Edmonton, AB, Canada
- Phillip Szepietowski
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Rachel Howard
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Phillip Reisman
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Jennie D Jones
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Patricia Lewis
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Brooke L Fridley
- Department of Biostatistics and Bioinformatics, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
- Dana E Rollison
- Department of Health Data Services, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States
7
8
Lybarger K, Damani A, Gunn M, Uzuner OZ, Yetisgen M. Extracting Radiological Findings With Normalized Anatomical Information Using a Span-Based BERT Relation Extraction Model. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2022:339-348. [PMID: 35854739 PMCID: PMC9285141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/27/2023]
Abstract
Medical imaging is critical to the diagnosis and treatment of numerous medical problems, including many forms of cancer. Medical imaging reports distill the findings and observations of radiologists, creating an unstructured textual representation of unstructured medical images. Large-scale use of this text-encoded information requires converting the unstructured text to a structured, semantic representation. We explore the extraction and normalization of anatomical information in radiology reports that is associated with radiological findings. We investigate this extraction and normalization task using a span-based relation extraction model that jointly extracts entities and relations using BERT. This work examines the factors that influence extraction and normalization performance, including the body part/organ system, frequency of occurrence, span length, and span diversity. It discusses approaches for improving performance and creating high-quality semantic representations of radiological phenomena.
9
Mao J, Akhtar J, Zhang X, Sun L, Guan S, Li X, Chen G, Liu J, Jeon HN, Kim MS, No KT, Wang G. Comprehensive strategies of machine-learning-based quantitative structure-activity relationship models. iScience 2021; 24:103052. [PMID: 34553136 PMCID: PMC8441174 DOI: 10.1016/j.isci.2021.103052] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Indexed: 11/21/2022] Open
Abstract
Early quantitative structure-activity relationship (QSAR) technologies have unsatisfactory versatility and accuracy in fields such as drug discovery because they are based on traditional machine learning and interpretive expert features. The development of Big Data and deep learning technologies has significantly improved the processing of unstructured data and unleashed the great potential of QSAR. Here we discuss the integration of wet experiments (which provide experimental data and reliable verification), molecular dynamics simulation (which provides mechanistic interpretation at the atomic/molecular levels), and machine learning (including deep learning) techniques to improve QSAR models. We first review the history of traditional QSAR and point out its problems. We then propose a better QSAR model characterized by a new iterative framework to integrate machine learning with disparate data input. Finally, we discuss the application of QSAR and machine learning to many practical research fields, including drug development and clinical trials.
Affiliation(s)
- Jiashun Mao
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, Republic of Korea
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
- Javed Akhtar
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
- Xiao Zhang
- Shanghai Rural Commercial Bank Co., Ltd, Shanghai 200002, China
- Liang Sun
- Department of Physics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
- Shenghui Guan
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
- Xinyu Li
- School of Life and Health Sciences and Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China
- Guangming Chen
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
- Jiaxin Liu
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
- Hyeon-Nae Jeon
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
- Min Sung Kim
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
- Kyoung Tai No
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, Republic of Korea
- Guanyu Wang
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
10
Abstract
Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry.
Affiliation(s)
- Bethany Percha
- Department of Medicine and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10025, USA
11
Humphreys BL, Del Fiol G, Xu H. The UMLS knowledge sources at 30: indispensable to current research and applications in biomedical informatics. J Am Med Inform Assoc 2020; 27:1499-1501. [PMID: 33059366 DOI: 10.1093/jamia/ocaa208] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/07/2020] [Indexed: 01/22/2023] Open
Affiliation(s)
- Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA