1
Zhang Y, Yang Z, Yang Y, Lin H, Wang J. Location-enhanced syntactic knowledge for biomedical relation extraction. J Biomed Inform 2024; 156:104676. [PMID: 38876451 DOI: 10.1016/j.jbi.2024.104676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Revised: 06/08/2024] [Accepted: 06/10/2024] [Indexed: 06/16/2024]
Abstract
Biomedical relation extraction has long been considered a challenging task due to the specialization and complexity of biomedical texts. Syntactic knowledge has been widely employed in existing research to enhance relation extraction, providing guidance for the semantic understanding and text representation of models. However, the utilization of syntactic knowledge in most studies is not exhaustive, and there is often a lack of fine-grained noise reduction, leading to confusion in relation classification. In this paper, we propose an attention generator that comprehensively considers both syntactic dependency type information and syntactic position information to distinguish the importance of different dependency connections. Additionally, we integrate positional information, dependency type information, and word representations together to introduce location-enhanced syntactic knowledge for guiding our biomedical relation extraction. Experimental results on three widely used English benchmark datasets in the biomedical domain consistently outperform a range of baseline models, demonstrating that our approach not only makes full use of syntactic knowledge but also effectively reduces the impact of noisy words.
Affiliation(s)
- Yan Zhang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Yumeng Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Hongfei Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Jian Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
2
Luo X, Deng Z, Yang B, Luo MY. Pre-trained language models in medicine: A survey. Artif Intell Med 2024; 154:102904. [PMID: 38917600 DOI: 10.1016/j.artmed.2024.102904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 04/15/2024] [Accepted: 06/03/2024] [Indexed: 06/27/2024]
Abstract
With the rapid progress in Natural Language Processing (NLP), Pre-trained Language Models (PLMs) such as BERT, BioBERT, and ChatGPT have shown great potential in various medical NLP tasks. This paper surveys the cutting-edge achievements in applying PLMs to various medical NLP tasks. Specifically, we first briefly introduce PLMs and outline the research on PLMs in medicine. Next, we categorise and discuss the types of tasks in medical NLP, covering text summarisation, question-answering, machine translation, sentiment analysis, named entity recognition, information extraction, medical education, relation extraction, and text mining. For each type of task, we first provide an overview of the basic concepts, the main methodologies, the advantages of applying PLMs, the basic steps of applying PLMs, the datasets for training and testing, and the metrics for task evaluation. Subsequently, a summary of recent important research findings is presented, analysing their motivations, strengths vs weaknesses, similarities vs differences, and discussing potential limitations. Also, we assess the quality and influence of the research reviewed in this paper by comparing the citation count of the papers reviewed and the reputation and impact of the conferences and journals where they are published. Through these indicators, we further identify the research topics currently of greatest concern. Finally, we look forward to future research directions, including enhancing models' reliability, explainability, and fairness, to promote the application of PLMs in clinical practice. In addition, this survey also collects download links for some model code and the relevant datasets, which are valuable references for researchers applying NLP techniques in medicine and medical professionals seeking to enhance their expertise and healthcare service through AI technology.
Affiliation(s)
- Xudong Luo
  - School of Computer Science and Engineering, Guangxi Normal University, Guilin 541004, China; Guangxi Key Lab of Multi-source Information Mining, Guangxi Normal University, Guilin 541004, China; Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China.
- Zhiqi Deng
  - School of Computer Science and Engineering, Guangxi Normal University, Guilin 541004, China; Guangxi Key Lab of Multi-source Information Mining, Guangxi Normal University, Guilin 541004, China; Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China.
- Binxia Yang
  - School of Computer Science and Engineering, Guangxi Normal University, Guilin 541004, China; Guangxi Key Lab of Multi-source Information Mining, Guangxi Normal University, Guilin 541004, China; Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China.
- Michael Y Luo
  - Emmanuel College, Cambridge University, Cambridge, CB2 3AP, UK.
3
Li Y, Tao W, Li Z, Sun Z, Li F, Fenton S, Xu H, Tao C. Artificial intelligence-powered pharmacovigilance: A review of machine and deep learning in clinical text-based adverse drug event detection for benchmark datasets. J Biomed Inform 2024; 152:104621. [PMID: 38447600 DOI: 10.1016/j.jbi.2024.104621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Revised: 02/19/2024] [Accepted: 03/03/2024] [Indexed: 03/08/2024]
Abstract
OBJECTIVE The primary objective of this review is to investigate the effectiveness of machine learning and deep learning methodologies in the context of extracting adverse drug events (ADEs) from clinical benchmark datasets. We conduct an in-depth analysis, aiming to compare the merits and drawbacks of both machine learning and deep learning techniques, particularly within the framework of named-entity recognition (NER) and relation classification (RC) tasks related to ADE extraction. Additionally, our focus extends to the examination of specific features and their impact on the overall performance of these methodologies. In a broader perspective, our research extends to ADE extraction from various sources, including biomedical literature, social media data, and drug labels, removing the limitation to exclusively machine learning or deep learning methods.
METHODS We conducted an extensive literature review on PubMed using the query "(((machine learning [Medical Subject Headings (MeSH) Terms]) OR (deep learning [MeSH Terms])) AND (adverse drug event [MeSH Terms])) AND (extraction)", and supplemented this with a snowballing approach to review 275 references sourced from retrieved articles.
RESULTS In our analysis, we included twelve articles for review. For the NER task, deep learning models outperformed machine learning models. In the RC task, gradient boosting, multilayer perceptron, and random forest models excelled. The Bidirectional Encoder Representations from Transformers (BERT) model consistently achieved the best performance in the end-to-end task. Future efforts in the end-to-end task should prioritize improving NER accuracy, especially for 'ADE' and 'Reason'.
CONCLUSION These findings hold significant implications for advancing the field of ADE extraction and pharmacovigilance, ultimately contributing to improved drug safety monitoring and healthcare outcomes.
Affiliation(s)
- Yiming Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Wei Tao
  - Department of Biostatistics & Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Zehan Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Zenan Sun
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Fang Li
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL 32224, USA
- Susan Fenton
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, USA
- Cui Tao
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL 32224, USA.
4
Liu C, Sun K, Zhou Q, Duan Y, Shu J, Kan H, Gu Z, Hu J. CPMI-ChatGLM: parameter-efficient fine-tuning ChatGLM with Chinese patent medicine instructions. Sci Rep 2024; 14:6403. [PMID: 38493251 PMCID: PMC10944515 DOI: 10.1038/s41598-024-56874-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 03/12/2024] [Indexed: 03/18/2024] Open
Abstract
Chinese patent medicine (CPM) is a typical type of traditional Chinese medicine (TCM) preparation that uses Chinese herbs as raw materials and is an important means of treating diseases in TCM. Chinese patent medicine instructions (CPMI) serve as a guide for patients to use drugs safely and effectively. In this study, we apply a pre-trained language model to the domain of CPM. We have meticulously assembled, processed, and released the first CPMI dataset and fine-tuned the ChatGLM-6B base model, resulting in the development of CPMI-ChatGLM. We employed consumer-grade graphics cards for parameter-efficient fine-tuning and investigated the impact of LoRA and P-Tuning v2, as well as different data scales and instruction data settings on model performance. We evaluated CPMI-ChatGLM using BLEU, ROUGE, and BARTScore metrics. Our model achieved scores of 0.7641, 0.8188, 0.7738, 0.8107, and -2.4786 on the BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L and BARTScore metrics, respectively. In comparison experiments and human evaluation with four large language models of similar parameter scales, CPMI-ChatGLM demonstrated state-of-the-art performance. CPMI-ChatGLM demonstrates commendable proficiency in CPM recommendations, making it a promising tool for auxiliary diagnosis and treatment. Furthermore, the various attributes in the CPMI dataset can be used for data mining and analysis, providing practical application value and research significance.
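As a rough illustration of why LoRA suits consumer-grade graphics cards, the sketch below works through the parameter arithmetic. It assumes the standard LoRA formulation, in which a frozen weight matrix of shape d x k receives a trainable low-rank update B·A of rank r. The 4096 x 4096 layer size and rank 8 are illustrative values chosen here, not settings reported in the abstract.

```python
def full_params(d: int, k: int) -> int:
    """Number of weights in a dense d x k matrix (all trainable in full fine-tuning)."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable weights in a rank-r LoRA update: B is d x r and A is r x k."""
    return r * (d + k)


def trainable_fraction(d: int, k: int, r: int) -> float:
    """Share of the layer's weights that LoRA actually trains."""
    return lora_params(d, k, r) / full_params(d, k)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d, k, r = 4096, 4096, 8
    print(full_params(d, k))          # dense weights in the layer
    print(lora_params(d, k, r))       # weights trained by LoRA
    print(trainable_fraction(d, k, r))
```

With these assumed values, the low-rank update trains well under 1% of the layer's weights, which is the property that keeps memory requirements modest.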
Affiliation(s)
- Can Liu
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
  - Anhui Computer Application Research Institute of Chinese Medicine, China Academy of Chinese Medical Sciences, Hefei, 230012, China
- Kaijie Sun
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
- Qingqing Zhou
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
- Yuchen Duan
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
- Jianhua Shu
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
- Hongxing Kan
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
  - Anhui Computer Application Research Institute of Chinese Medicine, China Academy of Chinese Medical Sciences, Hefei, 230012, China
- Zongyun Gu
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China
- Jili Hu
  - School of Medical Informatics Engineering, Anhui University of Traditional Chinese Medicine, Hefei, 230012, China.
  - Anhui Computer Application Research Institute of Chinese Medicine, China Academy of Chinese Medical Sciences, Hefei, 230012, China.
5
Lu X, Tong J, Xia S. Entity relationship extraction from Chinese electronic medical records based on feature augmentation and cascade binary tagging framework. Mathematical Biosciences and Engineering: MBE 2024; 21:1342-1355. [PMID: 38303468 DOI: 10.3934/mbe.2024058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]
Abstract
Extracting entity relations from unstructured Chinese electronic medical records is an important task in medical information extraction. However, Chinese electronic medical records mostly have document-level volumes, and existing models are either unable to handle long text sequences or exhibit poor performance. This paper proposes a neural network based on feature augmentation and cascade binary tagging framework. First, we utilize a pre-trained model to tokenize the original text and obtain word embedding vectors. Second, the word vectors are fed into the feature augmentation network and fused with the original features and position features. Finally, the cascade binary tagging decoder generates the results. In the current work, we built a Chinese document-level electronic medical record dataset named VSCMeD, which contains 595 real electronic medical records from vascular surgery patients. The experimental results show that the model achieves a precision of 87.82% and recall of 88.47%. It is also verified on another Chinese medical dataset CMeIE-V2 that the model achieves a precision of 54.51% and recall of 48.63%.
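The abstract reports precision and recall but no F1-score. For readers who want a single comparable figure, the sketch below applies the standard harmonic-mean definition of F1 to the reported numbers. The resulting values are derived arithmetic, not results stated by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract, in percent.
    vscmed = f1_score(87.82, 88.47)    # VSCMeD
    cmeie_v2 = f1_score(54.51, 48.63)  # CMeIE-V2
    print(round(vscmed, 2))    # about 88.14
    print(round(cmeie_v2, 2))  # about 51.40
```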
Affiliation(s)
- Xiaoqing Lu
  - School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou, China
- Jijun Tong
  - School of Information Science and Engineering, Zhejiang Sci-Tech University, Hangzhou, China
- Shudong Xia
  - The Fourth Affiliated Hospital Zhejiang University School of Medicine, Jinhua, China
6
Lenert LA, Lane S, Wehbe R. Could an artificial intelligence approach to prior authorization be more human? J Am Med Inform Assoc 2023; 30:989-994. [PMID: 36809561 PMCID: PMC10114030 DOI: 10.1093/jamia/ocad016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 01/23/2023] [Accepted: 02/02/2023] [Indexed: 02/23/2023] Open
Abstract
Prior authorization (PA) may be a necessary evil within the healthcare system, contributing to physician burnout and delaying necessary care, but also allowing payers to prevent wasting resources on redundant, expensive, and/or ineffective care. PA has become an "informatics issue" with the rise of automated methods for PA review, championed in the Health Level 7 International's (HL7's) DaVinci Project. DaVinci proposes using rule-based methods to automate PA, a time-tested strategy with known limitations. This article proposes an alternative that may be more human-centric, using artificial intelligence (AI) methods for the computation of authorization decisions. We believe that by combining modern approaches for accessing and exchanging existing electronic health data with AI methods tailored to reflect the judgments of expert panels that include patient representatives, and refined with "few shot" learning approaches to prevent bias, we could create a just and efficient process that serves the interests of society as a whole. Efficient simulation of human appropriateness assessments from existing data using AI methods could eliminate burdens and bottlenecks while preserving PA's benefits as a tool to limit inappropriate care.
Affiliation(s)
- Leslie A Lenert
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Steven Lane
  - Health Gorilla, Mountain View, California, USA
- Ramsey Wehbe
  - Department of Cardiology, Medical University of South Carolina, Charleston, South Carolina, USA
7
Serna García G, Al Khalaf R, Invernici F, Ceri S, Bernasconi A. CoVEffect: interactive system for mining the effects of SARS-CoV-2 mutations and variants based on deep learning. Gigascience 2022; 12:giad036. [PMID: 37222749 PMCID: PMC10205000 DOI: 10.1093/gigascience/giad036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 04/11/2023] [Accepted: 04/27/2023] [Indexed: 05/25/2023] Open
Abstract
BACKGROUND Literature about SARS-CoV-2 widely discusses the effects of variations that have spread in the past 3 years. Such information is dispersed in the texts of several research articles, hindering the possibility of practically integrating it with related datasets (e.g., millions of SARS-CoV-2 sequences available to the community). We aim to fill this gap by mining literature abstracts to extract, for each variant/mutation, its related effects (in epidemiological, immunological, clinical, or viral kinetics terms) with labeled higher/lower levels in relation to the nonmutated virus.
RESULTS The proposed framework comprises (i) the provisioning of abstracts from a COVID-19-related big data corpus (CORD-19) and (ii) the identification of mutation/variant effects in abstracts using a GPT2-based prediction model. The above techniques enable the prediction of mutations/variants with their effects and levels in 2 distinct scenarios: (i) the batch annotation of the most relevant CORD-19 abstracts and (ii) the on-demand annotation of any user-selected CORD-19 abstract through the CoVEffect web application (http://gmql.eu/coveffect), which assists expert users with semiautomated data labeling. On the interface, users can inspect the predictions and correct them; user inputs can then extend the training dataset used by the prediction model. Our prototype model was trained through a carefully designed process, using a minimal and highly diversified pool of samples.
CONCLUSIONS The CoVEffect interface serves for the assisted annotation of abstracts, allowing the download of curated datasets for further use in data integration or analysis pipelines. The overall framework can be adapted to resolve similar unstructured-to-structured text translation tasks, which are typical of biomedical domains.
Affiliation(s)
- Giuseppe Serna García
  - Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy
- Ruba Al Khalaf
  - Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy
- Francesco Invernici
  - Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy
- Stefano Ceri
  - Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy
- Anna Bernasconi
  - Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy
8
Sun F, Xu H, Meng Y, Lu Z. A BERT-based model for coupled biological strategies in biomimetic design. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07734-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
9
Li Z, Wang M, Peng D, Liu J, Xie Y, Dai Z, Zou X. Identification of Chemical-Disease Associations Through Integration of Molecular Fingerprint, Gene Ontology and Pathway Information. Interdiscip Sci 2022; 14:683-696. [PMID: 35391615 DOI: 10.1007/s12539-022-00511-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Revised: 03/16/2022] [Accepted: 03/17/2022] [Indexed: 06/14/2023]
Abstract
The identification of chemical-disease association types is helpful not only for discovering lead compounds and studying drug repositioning, but also for treating disease and deciphering pathomechanisms. There is an urgent need to develop computational methods for identifying potential chemical-disease association types, since wet-lab methods are usually expensive, laborious, and time-consuming. In this study, molecular fingerprint, gene ontology, and pathway information are utilized to characterize chemicals and diseases. A novel predictor is proposed to recognize potential chemical-disease associations at the first layer, and to further distinguish whether their relationships are biomarker or therapeutic relations at the second layer. The prediction performance of the current method is assessed on the benchmark dataset using ten-fold cross-validation. The practical prediction accuracies of the first layer and the second layer are 78.47% and 72.07%, respectively. The recognition ability for lead compounds, new drug indications, and potential and true chemical-disease association pairs has also been investigated and confirmed by constructing a variety of datasets and performing a series of experiments. It is anticipated that the current method can serve as a powerful high-throughput virtual screening tool for drug research and development.
Affiliation(s)
- Zhanchao Li
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China.
  - NMPA Key Laboratory for Technology Research and Evaluation of Pharmacovigilance, Guangzhou, 510006, People's Republic of China.
  - Key Laboratory of Digital Quality Evaluation of Chinese Materia Medica of State Administration of Traditional Chinese Medicine, Guangzhou, 510006, People's Republic of China.
- Mengru Wang
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Dongdong Peng
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Jie Liu
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Yun Xie
  - HuiZhou University, Huizhou, 516007, People's Republic of China
- Zong Dai
  - School of Biomedical Engineering, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Xiaoyong Zou
  - School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China.
10
Information Extraction from the Text Data on Traditional Chinese Medicine: A Review on Tasks, Challenges, and Methods from 2010 to 2021. Evidence-Based Complementary and Alternative Medicine 2022; 2022:1679589. [PMID: 35600940 PMCID: PMC9122692 DOI: 10.1155/2022/1679589] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/11/2022] [Revised: 03/31/2022] [Accepted: 04/06/2022] [Indexed: 12/12/2022]
Abstract
Background
The practice of traditional Chinese medicine (TCM) began several thousand years ago, and the knowledge of practitioners is recorded in paper and electronic versions of case notes, manuscripts, and books in multiple languages. Developing a method of information extraction (IE) from these sources to generate a cohesive data set would be a great contribution to the medical field. The goal of this study was to perform a systematic review of the status of IE from TCM sources over the last 10 years.
Methods
We conducted a search of four literature databases for articles published from 2010 to 2021 that focused on the use of natural language processing (NLP) methods to extract information from unstructured TCM text data. Two reviewers and one adjudicator contributed to article search, article selection, data extraction, and synthesis processes.
Results
We retrieved 1234 records, 49 of which met our inclusion criteria. We used the articles to (i) assess the key tasks of IE in the TCM domain, (ii) summarize the challenges to extracting information from TCM text data, and (iii) identify effective frameworks, models, and key findings of TCM IE through classification.
Conclusions
Our analysis showed that IE from TCM text data has improved over the past decade. However, the extraction of TCM text still faces some challenges involving the lack of gold standard corpora, nonstandardized expressions, and multiple types of relations. In the future, IE work should be promoted by extracting more existing entities and relations, constructing gold standard data sets, and exploring IE methods based on a small amount of labeled data. Furthermore, fine-grained and interpretable IE technologies are necessary for further exploration.
11
Zanoli R, Lavelli A, Löffler T, Perez Gonzalez NA, Rinaldi F. An annotated dataset for extracting gene-melanoma relations from scientific literature. J Biomed Semantics 2022; 13:2. [PMID: 35045882 PMCID: PMC8772125 DOI: 10.1186/s13326-021-00251-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2020] [Accepted: 08/27/2021] [Indexed: 11/10/2022] Open
Abstract
Background
Melanoma is one of the least common but the deadliest of skin cancers. This cancer begins when the genes of a cell suffer damage or fail, and identifying the genes involved in melanoma is crucial for understanding the melanoma tumorigenesis. Thousands of publications about human melanoma appear every year. However, while biological curation of data is costly and time-consuming, to date the application of machine learning for gene-melanoma relation extraction from text has been severely limited by the lack of annotated resources.
Results
To overcome this lack of resources for melanoma, we have exploited the information of the Melanoma Gene Database (MGDB, a manually curated database of genes involved in human melanoma) to automatically build an annotated dataset of binary relations between gene and melanoma entities occurring in PubMed abstracts. The entities were automatically annotated by state-of-the-art text-mining tools. Their annotation includes both the mention text spans and normalized concept identifiers. The relations among the entities were annotated at concept- and mention-level. The concept-level annotation was produced using the information of the genes in MGDB to decide if a relation holds between a gene and melanoma concept in the whole abstract. The exploitability of this dataset was tested with both traditional machine learning and neural network-based models like BERT. The models were then used to automatically extract gene-melanoma relations from the biomedical literature. Most of the current models use context-aware representations of the target entities to establish relations between them. To help researchers in their experiments, we generated a mention-level annotation in support of the concept-level annotation. The mention-level annotation was generated by automatically linking gene and melanoma mentions co-occurring within the sentences that in MGDB establish the association of the gene with melanoma.
Conclusions
This paper presents a corpus containing gene-melanoma annotated relations. Additionally, it discusses experiments which show the usefulness of such a corpus for training a system capable of mining gene-melanoma relationships from the literature. Researchers can use the corpus to develop and compare their own models, and produce results which might be integrated with existing structured knowledge databases, which in turn might facilitate medical research.
12
Agnikula Kshatriya BS, Sagheb E, Wi CI, Yoon J, Seol HY, Juhn Y, Sohn S. Identification of asthma control factor in clinical notes using a hybrid deep learning model. BMC Med Inform Decis Mak 2021; 21:272. [PMID: 34753481 PMCID: PMC8579684 DOI: 10.1186/s12911-021-01633-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/13/2021] [Accepted: 09/14/2021] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND There are significant variabilities in guideline-concordant documentation in asthma care. However, assessing clinicians' documentation is not feasible using only structured data but requires labor-intensive chart review of electronic health records (EHRs). Certain guideline elements among asthma control factors, such as review of inhaler techniques, require context understanding to be captured correctly from EHR free text.
METHODS The study data consist of two sets: (1) manually chart-reviewed data: 1039 clinical notes of 300 patients with asthma diagnosis, and (2) weakly labeled data (distant supervision): 27,363 clinical notes from 800 patients with asthma diagnosis. A context-aware language model, Bidirectional Encoder Representations from Transformers (BERT), was developed to identify inhaler techniques in EHR free text. Both original BERT and clinical BioBERT (cBERT) were applied with cost-sensitive training to deal with imbalanced data. Distant supervision using weak labels generated by rules was also incorporated to augment the training set and alleviate the costly manual labeling process in the development of a deep learning algorithm. A hybrid approach using post-hoc rules was also explored to fix BERT model errors. The performance of BERT with/without distant supervision, hybrid, and rule-based models was compared in terms of precision, recall, F-score, and accuracy.
RESULTS The BERT models on the original data performed similarly to a rule-based model in F1-score (0.837, 0.845, and 0.838 for rules, BERT, and cBERT, respectively). The BERT models with distant supervision produced higher performance (0.853 and 0.880 for BERT and cBERT, respectively) than those without distant supervision and the rule-based model. The hybrid models performed best, with F1-scores of 0.877 and 0.904 over the distant supervision on BERT and cBERT, respectively.
CONCLUSIONS The proposed BERT models with distant supervision demonstrated their capability to identify inhaler techniques in EHR free text, and outperformed both the rule-based model and BERT models trained on the original data. With a distant supervision approach, we may alleviate costly manual chart review to generate the large training data required in most deep learning-based models. A hybrid model was able to fix BERT model errors and further improve the performance.
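The abstract mentions cost-sensitive handling of imbalanced data without giving details. One common way to realise this is to weight each class inversely to its frequency, so that errors on the rare class cost more during training. The sketch below shows that heuristic only as an illustration; it is an assumption, not necessarily the weighting the authors used, and the label names and counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


if __name__ == "__main__":
    # Hypothetical label distribution: few notes mention the target element.
    labels = ["mentioned"] * 10 + ["not_mentioned"] * 90
    weights = balanced_class_weights(labels)
    # The rare class receives the larger weight.
    print(weights)
```

Weights of this form can be passed to a weighted cross-entropy loss so that each class contributes comparably to the total loss.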
Collapse
Affiliation(s)
| | - Elham Sagheb
- Department of Artificial Intelligence and Informatics, Mayo Clinic, 200 First St SW, Rochester, MN 55905 USA
| | - Chung-Il Wi
- Precision Population Science Lab, Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, MN USA
| | - Jungwon Yoon
- Department of Pediatrics, Myongji Hospital, Goyang, South Korea
| | - Hee Yun Seol
- Pusan National University, Yangsan Hospital, Yangsan, South Korea
| | - Young Juhn
- Precision Population Science Lab, Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, MN USA
| | - Sunghwan Sohn
- Department of Artificial Intelligence and Informatics, Mayo Clinic, 200 First St SW, Rochester, MN 55905 USA
| |
|
13
|
Alfattni G, Peek N, Nenadic G. Attention-based bidirectional long short-term memory networks for extracting temporal relationships from clinical discharge summaries. J Biomed Inform 2021; 123:103915. [PMID: 34600144 DOI: 10.1016/j.jbi.2021.103915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2021] [Revised: 08/05/2021] [Accepted: 09/09/2021] [Indexed: 10/20/2022]
Abstract
Temporal relation extraction between health-related events is a widely studied task in clinical Natural Language Processing (NLP). The current state-of-the-art methods mostly rely on engineered features (i.e., rule-based modelling) and sequence modelling, which often encodes a source sentence into a single fixed-length context. An obvious disadvantage of this fixed-length context design is its inability to model longer sentences, as important temporal information in the clinical text may appear at different positions. To address this issue, we propose an Attention-based Bidirectional Long Short-Term Memory (Att-BiLSTM) model to enable learning the important semantic information in long source text segments and to better determine which parts of the text are most important. We experimented with two embeddings and compared their performance to traditional state-of-the-art methods that require elaborate linguistic pre-processing and hand-engineered features. The experimental results on the i2b2 2012 temporal relation test corpus show that the proposed method achieves a significant improvement with an F-score of 0.811, which is at least 10% better than the state of the art in the field. We show that the model can be remarkably effective at classifying temporal relations when provided with word embeddings trained on general-domain corpora. Finally, we perform an error analysis to gain insight into the common errors made by the model.
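The idea of replacing a single fixed-length context with attention over all positions can be sketched as follows. This is a minimal illustration, assuming dot-product scoring against a query vector; the abstract does not specify the scoring function used in Att-BiLSTM, and `attention_pool` is a hypothetical name. The hidden states are plain lists standing in for BiLSTM outputs.

```python
import math


def attention_pool(hidden_states, query):
    """Weight each hidden state by softmax(dot(state, query)) and sum them.

    hidden_states: list of T vectors (one per token), each a list of floats.
    query: a vector of the same dimension (assumed to be learned elsewhere).
    Returns (weights, context), where weights sum to 1.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context
```

Because every position contributes to the context vector in proportion to its weight, a temporal cue far from the end of a long sentence can still dominate the representation.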
Affiliation(s)
- Ghada Alfattni
- Department of Computer Science, University of Manchester, Manchester, UK; Department of Computer Science, Jamoum University College, Umm Al-Qura University, Makkah, Saudi Arabia.
| | - Niels Peek
- Centre for Health Informatics, Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, UK; National Institute of Health Research Manchester Biomedical Research Centre, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK; The Alan Turing Institute, UK
| | - Goran Nenadic
- Department of Computer Science, University of Manchester, Manchester, UK; The Alan Turing Institute, UK
| |
|
14
|
Li Z, Chen H, Qi R, Lin H, Chen H. DocR-BERT: Document-level R-BERT for Chemical-induced Disease Relation Extraction via Gaussian Probability Distribution. IEEE J Biomed Health Inform 2021; 26:1341-1352. [PMID: 34591774 DOI: 10.1109/jbhi.2021.3116769] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Abstract
Chemical-induced disease (CID) relation extraction from biomedical articles plays an important role in disease treatment and drug development. Existing methods are insufficient for capturing complete document-level semantic information because they ignore the semantic information of entities in different sentences. In this work, we proposed an effective document-level relation extraction model to automatically extract intra-/inter-sentential CID relations from articles. Firstly, our model employed BERT to generate contextual semantic representations of the title, abstract and shortest dependency paths (SDPs). Secondly, to enhance the semantic representation of the whole document, cross attention with self-attention (named cross2self-attention) between abstract, title and SDPs was proposed to learn the mutual semantic information. Thirdly, to distinguish the importance of the target entity in different sentences, the Gaussian probability distribution was utilized to compute the weights of the co-occurrence sentence and its adjacent entity sentences. More complete semantic information of the target entity is collected from all entities occurring in the document via our presented document-level R-BERT (DocR-BERT). Finally, the related representations were concatenated and fed into the softmax function to extract CIDs. We evaluated the model on the CDR corpus provided by BioCreative V. The proposed model, without external resources, outperforms other state-of-the-art models (it achieves F1-scores of 53.5%, 70%, and 63.7% on the inter-sentential, intra-sentential, and overall CDR dataset, respectively). The experimental results indicate that cross2self-attention, the Gaussian probability distribution and DocR-BERT can effectively improve the CID extraction performance. 
Furthermore, the mutual semantic information learned by the cross self-attention from abstract towards title can significantly influence the extraction performance of document-level biomedical relation extraction tasks.
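The Gaussian weighting of the co-occurrence sentence and its neighbours can be illustrated with a small sketch. It assumes the weight depends only on the sentence's distance from the co-occurrence sentence and that weights are normalized to sum to 1; the function name `gaussian_sentence_weights` and the parameter `sigma` are hypothetical, and the exact formulation in DocR-BERT may differ.

```python
import math


def gaussian_sentence_weights(num_sentences, center_index, sigma=1.0):
    """Assign each sentence a normalized Gaussian weight by distance from the center.

    center_index is the position of the sentence where the target entities
    co-occur; sigma (an assumed hyperparameter) controls how quickly the
    weight decays for sentences further away.
    """
    raw = [
        math.exp(-((i - center_index) ** 2) / (2 * sigma ** 2))
        for i in range(num_sentences)
    ]
    total = sum(raw)
    return [r / total for r in raw]
```

With this scheme the co-occurrence sentence always receives the largest weight, and adjacent sentences contribute progressively less as their distance grows.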
|
15
|
Kanjirangat V, Rinaldi F. Enhancing Biomedical Relation Extraction with Transformer Models using Shortest Dependency Path Features and Triplet Information. J Biomed Inform 2021; 122:103893. [PMID: 34481058 DOI: 10.1016/j.jbi.2021.103893] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/11/2020] [Revised: 08/17/2021] [Accepted: 08/22/2021] [Indexed: 10/20/2022]
Abstract
Entity relation extraction plays an important role in the biomedical, healthcare, and clinical research areas. Recently, pre-trained models based on transformer architectures and their variants have shown remarkable performance in various natural language processing tasks. Most of these variants were based on slight modifications in the architectural components, representation schemes and augmenting data using distant supervision methods. In distantly supervised methods, one of the main challenges is pruning out noisy samples. A similar situation can arise when the training samples are not directly available but need to be constructed from the given dataset. The BioCreative V Chemical Disease Relation (CDR) task provides a dataset that does not explicitly offer mention-level gold annotations and hence replicates the above scenario. Selecting the representative sentences from the given abstract or document text that could convey a potential entity relationship becomes essential. Most of the existing methods in the literature propose to either consider the entire text or all the sentences which contain the entity mentions. This could be a computationally expensive and time-consuming approach. This paper presents a novel approach to handle such scenarios, specifically in biomedical relation extraction. We propose utilizing the Shortest Dependency Path (SDP) features for constructing data samples by pruning out noisy information and selecting the most representative samples for model learning. We also utilize triplet information in model learning using the biomedical variant of BERT, viz., BioBERT. The problem is represented as a sentence pair classification task using the sentence and the entity-relation pair as input. We analyze the approach on both intra-sentential and inter-sentential relations in the CDR dataset. The proposed approach that utilizes the SDP and triplet features presents promising results, specifically on the inter-sentential relation extraction task. 
We make the code used for this work publicly available on GitHub.
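A shortest dependency path between two entity mentions can be found with a breadth-first search over the dependency tree treated as an undirected graph. The sketch below is a generic illustration, not the authors' released code; it assumes the parse is already available as (head, dependent) token pairs, and `shortest_dependency_path` is a hypothetical name.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the token path from source to target, or None if unreachable.

    edges: iterable of (head, dependent) pairs from a dependency parse
    (assumed to be produced by an external parser).
    """
    graph = {}
    for head, dependent in edges:
        # Treat dependency arcs as undirected so the path can go up and down the tree.
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None
```

Tokens off the returned path (modifiers, prepositional attachments, and so on) are the ones an SDP-based sample construction would prune as noise.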
Affiliation(s)
- Vani Kanjirangat
- Istituto Dalle Molle di Studi sull'Intelligenza Artificiale USI/SUPSI, Lugano, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| | - Fabio Rinaldi
- Istituto Dalle Molle di Studi sull'Intelligenza Artificiale USI/SUPSI, Lugano, Switzerland; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| |
|
16
|
Yuan C, Huang H, Feng C, Shi G, Wei X. Document-level relation extraction with Entity-Selection Attention. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.04.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022]
|
17
|
DECAB-LSTM: Deep Contextualized Attentional Bidirectional LSTM for cancer hallmark classification. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106486] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 01/28/2023]
|