Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022;23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open

For:	Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022;23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open

Number

Cited by Other Article(s)

Jia Y, Wang H, Yuan Z, Zhu L, Xiang ZL. Biomedical relation extraction method based on ensemble learning and attention mechanism. BMC Bioinformatics 2024;25:333. [PMID: 39425010 PMCID: PMC11488084 DOI: 10.1186/s12859-024-05951-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2024] [Accepted: 10/07/2024] [Indexed: 10/21/2024] Open

Phan CP, Phan B, Chiang JH. Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini. Database (Oxford) 2024;2024:baae104. [PMID: 39383312 PMCID: PMC11463225 DOI: 10.1093/database/baae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 08/21/2024] [Accepted: 09/04/2024] [Indexed: 10/11/2024]

Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024;159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]

Abstract

OBJECTIVE

Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets.

METHODS

We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset.

RESULTS

We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.

CONCLUSION

This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.

Collapse

Košprdić M, Prodanović N, Ljajić A, Bašaragin B, Milošević N. From zero to hero: Harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts. Artif Intell Med 2024;156:102970. [PMID: 39197375 DOI: 10.1016/j.artmed.2024.102970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2023] [Revised: 08/23/2024] [Accepted: 08/23/2024] [Indexed: 09/01/2024]

Sun J, Zhang C, Xing L, Zhang L, Cai H, Guo M. BAMRE: Joint extraction model of Chinese medical entities and relations based on Biaffine transformation with relation attention. J Biomed Inform 2024;158:104733. [PMID: 39368528 DOI: 10.1016/j.jbi.2024.104733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2024] [Revised: 09/03/2024] [Accepted: 09/28/2024] [Indexed: 10/07/2024]

Abstract

Electronic Health Records (EHRs) contain various valuable medical entities and their relationships. Although the extraction of biomedical relationships has achieved good results in the mining of electronic health records and the construction of biomedical knowledge bases, there are still some problems. There may be implied complex associations between entities and relationships in overlapping triplets, and ignoring these interactions may lead to a decrease in the accuracy of entity extraction. To address this issue, a joint extraction model for medical entity relations based on a relation attention mechanism is proposed. The relation extraction module identifies candidate relationships within a sentence. The attention mechanism based on these relationships assigns weights to contextual words in the sentence that are associated with different relationships. Additionally, it extracts the subject and object entities. Under a specific relationship, entity vector representations are utilized to construct a global entity matching matrix based on Biaffine transformations. This matrix is designed to enhance the semantic dependencies and relational representations between entities, enabling triplet extraction. This allows the two subtasks of named entity recognition and relation extraction to be interrelated, fully utilizing contextual information within the sentence, and effectively addresses the issue of overlapping triplets. Experimental observations from the CMeIE Chinese medical relation extraction dataset and the Baidu2019 Chinese dataset confirm that our approach yields the superior F1 score across all cutting-edge baselines. Moreover, it offers substantial performance improvements in intricate situations involving diverse overlapping patterns, multitudes of triplets, and cross-sentence triplets.

Collapse

Sänger M, Garda S, Wang XD, Weber-Genzel L, Droop P, Fuchs B, Akbik A, Leser U. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools. Bioinformatics 2024;40:btae564. [PMID: 39302686 PMCID: PMC11453098 DOI: 10.1093/bioinformatics/btae564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2024] [Revised: 08/23/2024] [Accepted: 09/17/2024] [Indexed: 09/22/2024] Open

Abstract

MOTIVATION

With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections from moderately to extremely different from those used for training, varying, e.g. in focus, genre or text type. This raises the question whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.

RESULTS

Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central. Our results indicate that users of BTM tools should expect a lower performance than the original published one when applying tools in "the wild" and show that further research is necessary for more robust BTM tools.

AVAILABILITY AND IMPLEMENTATION

All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.

Collapse

Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen LJ. RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature. Database (Oxford) 2024;2024:baae095. [PMID: 39265993 PMCID: PMC11394941 DOI: 10.1093/database/baae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2024] [Revised: 07/31/2024] [Accepted: 08/16/2024] [Indexed: 09/14/2024]

Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024;11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open

Affiliation(s)

Po-Ting Lai National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Elisabeth Coudert Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Lucila Aimo Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Kristian Axelsen Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Lionel Breuza Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Edouard de Castro Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Marc Feuermann Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Anne Morgat Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Lucille Pourcel Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Ivo Pedruzzi Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Sylvain Poux Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Nicole Redaschi Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Catherine Rivoire Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Anastasia Sveshnikova Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Chih-Hsuan Wei National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Robert Leaman National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Ling Luo School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
Alan Bridge Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland.

Collapse

Nie P, Ning J, Lin M, Yang Z, Wang L. SSGU-CD: A combined semantic and structural information graph U-shaped network for document-level Chemical-Disease interaction extraction. J Biomed Inform 2024;157:104719. [PMID: 39214159 DOI: 10.1016/j.jbi.2024.104719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2024] [Revised: 08/20/2024] [Accepted: 08/27/2024] [Indexed: 09/04/2024]

Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc 2024;31:1801-1811. [PMID: 39169867 PMCID: PMC11339542 DOI: 10.1093/jamia/ocae202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2024] [Indexed: 08/23/2024] Open

Li Z, Wei Q, Huang LC, Li J, Hu Y, Chuang YS, He J, Das A, Keloth VK, Yang Y, Diala CS, Roberts KE, Tao C, Jiang X, Zheng WJ, Xu H. Ensemble pretrained language models to extract biomedical knowledge from literature. J Am Med Inform Assoc 2024;31:1904-1911. [PMID: 38520725 PMCID: PMC11339500 DOI: 10.1093/jamia/ocae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Revised: 02/14/2024] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open

Affiliation(s)

Zhao Li McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Qiang Wei McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Liang-Chin Huang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Jianfu Li McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Yan Hu McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Yao-Shun Chuang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Jianping He McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Avisha Das McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Vipina Kuttichi Keloth Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
Yuntao Yang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Chiamaka S Diala McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Kirk E Roberts McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Cui Tao McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Xiaoqian Jiang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
W Jim Zheng McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Hua Xu Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States

Collapse

Zhou H, Li M, Xiao Y, Yang H, Zhang R. LEAP: LLM instruction-example adaptive prompting framework for biomedical relation extraction. J Am Med Inform Assoc 2024;31:2010-2018. [PMID: 38904416 PMCID: PMC11339510 DOI: 10.1093/jamia/ocae147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Revised: 05/26/2024] [Accepted: 06/03/2024] [Indexed: 06/22/2024] Open

Abstract

OBJECTIVE

To investigate the demonstration in large language models (LLMs) for biomedical relation extraction. This study introduces a framework comprising three types of adaptive tuning methods to assess their impacts and effectiveness.

MATERIALS AND METHODS

Our study was conducted in two phases. Initially, we analyzed a range of demonstration components vital for LLMs' biomedical data capabilities, including task descriptions and examples, experimenting with various combinations. Subsequently, we introduced the LLM instruction-example adaptive prompting (LEAP) framework, including instruction adaptive tuning, example adaptive tuning, and instruction-example adaptive tuning methods. This framework aims to systematically investigate both adaptive task descriptions and adaptive examples within the demonstration. We assessed the performance of the LEAP framework on the DDI, ChemProt, and BioRED datasets, employing LLMs such as Llama2-7b, Llama2-13b, and MedLLaMA_13B.

RESULTS

Our findings indicated that Instruction + Options + Example and its expanded form substantially improved F1 scores over the standard Instruction + Options mode for zero-shot LLMs. The LEAP framework, particularly through its example adaptive prompting, demonstrated superior performance over conventional instruction tuning across all models. Notably, the MedLLAMA_13B model achieved an exceptional F1 score of 95.13 on the ChemProt dataset using this method. Significant improvements were also observed in the DDI 2013 and BioRED datasets, confirming the method's robustness in sophisticated data extraction scenarios.

CONCLUSION

The LEAP framework offers a compelling strategy for enhancing LLM training strategies, steering away from extensive fine-tuning towards more dynamic and contextually enriched prompting methodologies, showcasing in biomedical relation extraction.

Collapse

Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024;31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open

Affiliation(s)

Ling Luo School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jinzhong Ning School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yingwen Zhao School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zhijun Wang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zeyuan Ding School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Peng Chen School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Weiru Fu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Qinyu Han School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Guangtao Xu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yunzhi Qiu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Dinghao Pan School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jiru Li School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hao Li School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Wenduo Feng School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Senbo Tu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yuqi Liu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zhihao Yang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jian Wang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yuanyuan Sun School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hongfei Lin School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China

Collapse

Zong H, Wu R, Cha J, Feng W, Wu E, Li J, Shao A, Tao L, Li Z, Tang B, Shen B. Advancing Chinese biomedical text mining with community challenges. J Biomed Inform 2024;157:104716. [PMID: 39197732 DOI: 10.1016/j.jbi.2024.104716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2024] [Revised: 08/22/2024] [Accepted: 08/25/2024] [Indexed: 09/01/2024]

Affiliation(s)

Hui Zong Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Rongrong Wu Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Jiaxue Cha Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Weizhe Feng Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Erman Wu Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Jiakun Li Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China; Department of Urology, West China Hospital, Sichuan University, Chengdu 610041, China
Aibin Shao Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Liang Tao Faculty of Business Information, Shanghai Business School, Shanghai 201400, China
Zuofeng Li Takeda Co. Ltd., Shanghai 200040, China
Buzhou Tang Department of Computer Science, Harbin Institute of Technology, Shenzhen 518055, China
Bairong Shen Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China.

Collapse

Sarol MJ, Hong G, Guerra E, Kilicoglu H. Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach. Database (Oxford) 2024;2024:baae079. [PMID: 39197056 PMCID: PMC11352595 DOI: 10.1093/database/baae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 08/30/2024]

Abstract

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.

Collapse

Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. BIOINFORMATICS ADVANCES 2024;4:vbae116. [PMID: 39411448 PMCID: PMC11474106 DOI: 10.1093/bioadv/vbae116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/10/2024] [Accepted: 08/04/2024] [Indexed: 10/19/2024]

Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024;2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

Collapse

Islamaj R, Lai PT, Wei CH, Luo L, Almeida T, Jonker RAA, Conceição SIR, Sousa DF, Phan CP, Chiang JH, Li J, Pan D, Meesawad W, Tsai RTH, Sarol MJ, Hong G, Valiev A, Tutubalina E, Lee SM, Hsu YY, Li M, Verspoor K, Lu Z. The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford) 2024;2024:baae069. [PMID: 39114977 PMCID: PMC11306928 DOI: 10.1093/database/baae069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 05/27/2024] [Accepted: 07/09/2024] [Indexed: 08/11/2024]

Abstract

The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-score performances achieved for the Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-score performances achieved for the Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/https://codalab.lisn.upsaclay.fr/competitions/13377https://codalab.lisn.upsaclay.fr/competitions/13378.

Collapse

Affiliation(s)

Rezarta Islamaj National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
Po-Ting Lai National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
Chih-Hsuan Wei National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
Ling Luo School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
Tiago Almeida Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
Richard A. A Jonker Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
Sofia I. R Conceição Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
Diana F Sousa Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
Cong-Phuoc Phan Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
Jung-Hsien Chiang Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
Jiru Li School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
Dinghao Pan School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
Wilailack Meesawad Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
Richard Tzong-Han Tsai Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China Research Center for Humanities and Social Sciences, Academia Sinica, No. 128, Section 2, Academia Rd., Nangang District, Taoyuan City 115201, Taiwan, Republic of China
M. Janina Sarol School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
Gibong Hong School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
Airat Valiev Higher School of Economics University, 20 Myasnitskaya St, Moscow 101000, Russia
Elena Tutubalina Artificial Intelligence Research Institute (AIRI), 32 Kutuzovskiy St, Moscow 121170, Russia Kazan Federal University, 18 Kremlevskaya St, Kazan 420008, Russia
Shao-Man Lee Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
Yi-Yu Hsu Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
Mingjie Li School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
Karin Verspoor School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States

Collapse

Garda S, Leser U. BELHD: improving biomedical entity linking with homonym disambiguation. Bioinformatics 2024;40:btae474. [PMID: 39067036 PMCID: PMC11310454 DOI: 10.1093/bioinformatics/btae474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 06/14/2024] [Accepted: 07/25/2024] [Indexed: 07/30/2024] Open

Abstract

MOTIVATION

Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, system identifying the most appropriate name in the KB for a given mention using neural network (either via dense retrieval or autoregressive modeling), achieved remarkable results for the task, without requiring manual tuning or definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large amount of entity mentions (e.g. UMLS and NCBI Gene).

RESULTS

We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach.

AVAILABILITY AND IMPLEMENTATION

The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.

Collapse

Cousins HC, Nayar G, Altman RB. Computational Approaches to Drug Repurposing: Methods, Challenges, and Opportunities. Annu Rev Biomed Data Sci 2024;7:15-29. [PMID: 38598857 DOI: 10.1146/annurev-biodatasci-110123-025333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2024]

Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024;2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]

Abstract

The identification of medical concepts from clinical narratives has a large interest in the biomedical scientific community due to its importance in treatment improvements or drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology overcomes overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important to train a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation of Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51. The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.

Collapse

Almeida T, Jonker RAA, Antunes R, Almeida JR, Matos S. Towards discovery: an end-to-end system for uncovering novel biomedical relations. Database (Oxford) 2024;2024:baae057. [PMID: 38994795 PMCID: PMC11240158 DOI: 10.1093/database/baae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 05/20/2024] [Accepted: 06/19/2024] [Indexed: 07/13/2024]

Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024;52:W540-W546. [PMID: 38572754 PMCID: PMC11223843 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open

Xu T, Gu Y, Xue M, Gu R, Li B, Gu X. Knowledge graph construction for heart failure using large language models with prompt engineering. Front Comput Neurosci 2024;18:1389475. [PMID: 39015745 PMCID: PMC11250484 DOI: 10.3389/fncom.2024.1389475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Accepted: 06/13/2024] [Indexed: 07/18/2024] Open

Yuan J, Zhang F, Qiu Y, Lin H, Zhang Y. Document-level biomedical relation extraction via hierarchical tree graph and relation segmentation module. BIOINFORMATICS (OXFORD, ENGLAND) 2024;40:btae418. [PMID: 38917409 DOI: 10.1093/bioinformatics/btae418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/23/2024] [Revised: 05/27/2024] [Accepted: 06/24/2024] [Indexed: 06/27/2024]

Ming S, Zhang R, Kilicoglu H. Enhancing the coverage of SemRep using a relation classification approach. J Biomed Inform 2024;155:104658. [PMID: 38782169 DOI: 10.1016/j.jbi.2024.104658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Revised: 05/01/2024] [Accepted: 05/18/2024] [Indexed: 05/25/2024]

Abstract

OBJECTIVE

Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB.

METHODS

We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts.

RESULTS

Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on 12K abstracts shows that the model could double the size of SemMedDB, when applied to entire PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct.

CONCLUSION

These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining. Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.

Collapse

Purpura A, Mulligan N, Kartoun U, Koski E, Anand V, Bettencourt-Silva J. Investigating Cross-Domain Binary Relation Classification in Biomedical Natural Language Processing. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2024;2024:384-390. [PMID: 38827064 PMCID: PMC11141837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 06/04/2024]

Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database (Oxford) 2024;2024:baae039. [PMID: 38805753 PMCID: PMC11131422 DOI: 10.1093/database/baae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Revised: 04/17/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]

Abstract

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite the reports in literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE dataset relevant for biomedical curation applications. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.

Collapse

Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024;25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024] Open

Smith E, Paloots R, Giagkos D, Baudis M, Stockinger K. Data-driven information extraction and enrichment of molecular profiling data for cancer cell lines. BIOINFORMATICS ADVANCES 2024;4:vbae045. [PMID: 38560553 PMCID: PMC10978572 DOI: 10.1093/bioadv/vbae045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Revised: 02/12/2024] [Accepted: 03/14/2024] [Indexed: 04/04/2024]

Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023;18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open

Abstract

Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and positively improve bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, are statistically performing better in terms of F1-score, accuracy and precision, than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance at par with training on all data), from 6% to 38%, depending on the data set; with Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.

Collapse

Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. BioKG: a comprehensive, large-scale biomedical knowledge graph for AI-powered, data-driven biomedical research. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.13.562216. [PMID: 38168218 PMCID: PMC10760044 DOI: 10.1101/2023.10.13.562216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/05/2024]

Abstract

To cope with the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have emerged as a powerful data structure for integrating large volumes of heterogeneous data to facilitate accurate and efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured content from scientific literature into KGs has remained a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge to construct a largescale KG using all PubMed abstracts. The quality of the large-scale information extraction rivals that of human expert annotations, signaling a new era of automatic, high-quality database construction from literature. Our extracted information markedly surpasses the amount of content in manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. The comprehensive KG enabled rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and achieved unprecedented results for drug target identification and drug repurposing. Taking lung cancer as an example, we found that 40% of drug targets reported in literature could have been predicted by our algorithm about 15 years ago in a retrospective study, demonstrating that substantial acceleration in scientific discovery could be achieved through automated hypotheses generation and timely dissemination. A cloud-based platform (https://www.biokde.com) was developed for academic users to freely access this rich structured data and associated tools.

Collapse

Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC Bioinformatics 2023;24:412. [PMID: 37915001 PMCID: PMC10619245 DOI: 10.1186/s12859-023-05539-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2023] [Accepted: 10/19/2023] [Indexed: 11/03/2023] Open

Abstract

BACKGROUND

The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: (1) they identify a relationship but not the type of relationship, (2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, (3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or (4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues.

RESULTS

We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches.

CONCLUSIONS

SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.

Collapse

Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023;146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]

Abstract

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

Collapse

Rey CA, Danguilan JL, Mendoza KP, Remolona MF. Transformer-based approach to variable typing. Heliyon 2023;9:e20505. [PMID: 37842594 PMCID: PMC10568320 DOI: 10.1016/j.heliyon.2023.e20505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2022] [Revised: 09/19/2023] [Accepted: 09/27/2023] [Indexed: 10/17/2023] Open

Zhao D, Yang Y, Chen P, Meng J, Sun S, Wang J, Lin H. Biomedical document relation extraction with prompt learning and KNN. J Biomed Inform 2023;145:104459. [PMID: 37531999 DOI: 10.1016/j.jbi.2023.104459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2023] [Revised: 06/26/2023] [Accepted: 07/22/2023] [Indexed: 08/04/2023]

Buonocore TM, Crema C, Redolfi A, Bellazzi R, Parimbelli E. Localizing in-domain adaptation of transformer-based biomedical language models. J Biomed Inform 2023;144:104431. [PMID: 37385327 DOI: 10.1016/j.jbi.2023.104431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2022] [Revised: 06/09/2023] [Accepted: 06/17/2023] [Indexed: 07/01/2023]

Yoon W, Yi S, Jackson R, Kim H, Kim S, Kang J. Biomedical relation extraction with knowledge base-refined weak supervision. Database (Oxford) 2023;2023:baad054. [PMID: 37551911 PMCID: PMC10407973 DOI: 10.1093/database/baad054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 05/13/2023] [Accepted: 07/04/2023] [Indexed: 08/09/2023]

Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) Discovers and Annotates Biomedical Knowledge Using Co-Occurrence and Transformer Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.30.542911. [PMID: 37397987 PMCID: PMC10312590 DOI: 10.1101/2023.05.30.542911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/04/2023]

Abstract

Background

The PubMed database contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: 1) they identify a relationship but not the type of relationship, 2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, 3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or 4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues.

Results

Conclusions

Collapse

Luo L, Wei CH, Lai PT, Leaman R, Chen Q, Lu Z. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 2023;39:btad310. [PMID: 37171899 PMCID: PMC10212279 DOI: 10.1093/bioinformatics/btad310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Revised: 04/12/2023] [Accepted: 05/11/2023] [Indexed: 05/14/2023] Open

Guo M, Zhou Z, Gotz D, Wang Y. GRAFS: Graphical Faceted Search System to Support Conceptual Understanding in Exploratory Search. ACM T INTERACT INTEL 2023. [DOI: 10.1145/3588319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/03/2023]

Wang Z, Gu Y, Zheng S, Yang L, Li J. MGREL: A multi-graph representation learning-based ensemble learning method for gene-disease association prediction. Comput Biol Med 2023;155:106642. [PMID: 36805231 DOI: 10.1016/j.compbiomed.2023.106642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2022] [Revised: 01/15/2023] [Accepted: 02/05/2023] [Indexed: 02/12/2023]

Ivanisenko TV, Demenkov PS, Kolchanov NA, Ivanisenko VA. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int J Mol Sci 2022;23:ijms232314934. [PMID: 36499269 PMCID: PMC9738852 DOI: 10.3390/ijms232314934] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022] Open