1. Jia Y, Wang H, Yuan Z, Zhu L, Xiang ZL. Biomedical relation extraction method based on ensemble learning and attention mechanism. BMC Bioinformatics 2024; 25:333. [PMID: 39425010 PMCID: PMC11488084 DOI: 10.1186/s12859-024-05951-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2024] [Accepted: 10/07/2024] [Indexed: 10/21/2024] Open
Abstract
BACKGROUND Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale.
RESULTS This paper proposes a novel approach, SARE, which combines Stacking-based ensemble learning with attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model.
CONCLUSIONS SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical .
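The stacking idea named in this abstract can be illustrated with a minimal sketch. It is not the SARE implementation: the attention component is omitted, the meta-learner is assumed here to be a plain logistic regression, and `base_outputs` holds made-up positive-class probabilities standing in for three hypothetical base models on a held-out set.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_learner(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner on base-model outputs (the stacking step)."""
    n_features = len(features[0])
    n_samples = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Probability that the relation holds, given one row of base-model outputs."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical held-out probabilities from three base models (illustrative values only).
base_outputs = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.9],
    [0.2, 0.3, 0.1],
    [0.1, 0.4, 0.2],
    [0.7, 0.9, 0.6],
    [0.3, 0.2, 0.4],
]
gold = [1, 1, 0, 0, 1, 0]  # 1 = relation present, 0 = absent

weights, bias = train_meta_learner(base_outputs, gold)
```

In a fuller setup the base-model probabilities would come from cross-validated predictions, so the meta-learner never sees outputs produced on data the base models were trained on.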
Affiliation(s)
- Yaxun Jia
  - Department of Radiation Oncology, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China
- Haoyang Wang
  - Department of Computer College, Beijing Information Science and Technology University, Beijing, China
- Zhu Yuan
  - Department of Information Management, The National Police University for Criminal Justice, Baoding, China
- Lian Zhu
  - Department of Radiation Oncology, Shanghai East Hospital Ji'an hospital, Jian, China
- Zuo-Lin Xiang
  - Department of Radiation Oncology, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China.
  - Department of Radiation Oncology, Shanghai East Hospital Ji'an hospital, Jian, China.
2. Phan CP, Phan B, Chiang JH. Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini. Database (Oxford) 2024; 2024:baae104. [PMID: 39383312 PMCID: PMC11463225 DOI: 10.1093/database/baae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 08/21/2024] [Accepted: 09/04/2024] [Indexed: 10/11/2024]
Abstract
Despite numerous research efforts by teams participating in BioCreative VIII Track 01, which employed various techniques to achieve high accuracy on biomedical relation tasks, overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
Affiliation(s)
- Cong-Phuoc Phan
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
- Ben Phan
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
- Jung-Hsien Chiang
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
3. Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024; 159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets.
METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset.
RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Affiliation(s)
- Yu Yin
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Hyunjae Kim
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Xiao Xiao
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Chih Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Jaewoo Kang
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Hua Xu
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
- Meng Fang
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom.
- Qingyu Chen
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America.
4. Košprdić M, Prodanović N, Ljajić A, Bašaragin B, Milošević N. From zero to hero: Harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts. Artif Intell Med 2024; 156:102970. [PMID: 39197375 DOI: 10.1016/j.artmed.2024.102970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Revised: 08/23/2024] [Accepted: 08/23/2024] [Indexed: 09/01/2024]
Abstract
Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. This paper proposes a method for zero- and few-shot NER in the biomedical domain to address these challenges. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large number of datasets and biomedical entities, which allows the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with a fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or a limited number of examples, outperforming previous transformer-based methods and being comparable to GPT3-based models while using models with over 1000 times fewer parameters. We make models and developed code publicly available.
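The reformulation from multi-class to binary token classification can be sketched as a label transformation. This is an illustration under assumptions, not the authors' code: the BIO tagging scheme, the helper name `to_binary_tasks`, and the toy sentence are all hypothetical.

```python
def to_binary_tasks(tags, entity_types):
    """Split one multi-class BIO tag sequence into one 0/1 sequence per entity type.

    Each entity type becomes its own binary task: a token is labeled 1 if it
    belongs to a mention of that type and 0 otherwise.
    """
    binary = {}
    for entity in entity_types:
        suffix = "-" + entity
        binary[entity] = [1 if tag.endswith(suffix) else 0 for tag in tags]
    return binary


# Toy sentence with multi-class BIO tags (illustrative only).
tokens = ["Aspirin", "reduces", "heart", "attack", "risk"]
tags = ["B-Chemical", "O", "B-Disease", "I-Disease", "O"]

# "Gene" has no mention here; its binary task is all zeros for this sentence.
examples = to_binary_tasks(tags, ["Chemical", "Disease", "Gene"])
```

Because the entity label becomes part of the input rather than a fixed output class, a label unseen in training can still be queried at inference time, which is what makes a zero-shot setting possible in principle.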
Affiliation(s)
- Miloš Košprdić
  - Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia
- Nikola Prodanović
  - Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia
- Adela Ljajić
  - Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia
- Bojana Bašaragin
  - Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia
- Nikola Milošević
  - Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia; Bayer A.G., Research and Development, Mullerstrasse 173, Berlin, 13342, Germany.
5. Sun J, Zhang C, Xing L, Zhang L, Cai H, Guo M. BAMRE: Joint extraction model of Chinese medical entities and relations based on Biaffine transformation with relation attention. J Biomed Inform 2024; 158:104733. [PMID: 39368528 DOI: 10.1016/j.jbi.2024.104733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 09/03/2024] [Accepted: 09/28/2024] [Indexed: 10/07/2024]
Abstract
Electronic Health Records (EHRs) contain various valuable medical entities and their relationships. Although the extraction of biomedical relationships has achieved good results in the mining of electronic health records and the construction of biomedical knowledge bases, there are still some problems. There may be implied complex associations between entities and relationships in overlapping triplets, and ignoring these interactions may lead to a decrease in the accuracy of entity extraction. To address this issue, a joint extraction model for medical entity relations based on a relation attention mechanism is proposed. The relation extraction module identifies candidate relationships within a sentence. The attention mechanism based on these relationships assigns weights to contextual words in the sentence that are associated with different relationships. Additionally, it extracts the subject and object entities. Under a specific relationship, entity vector representations are utilized to construct a global entity matching matrix based on Biaffine transformations. This matrix is designed to enhance the semantic dependencies and relational representations between entities, enabling triplet extraction. This allows the two subtasks of named entity recognition and relation extraction to be interrelated, fully utilizing contextual information within the sentence, and effectively addresses the issue of overlapping triplets. Experimental results on the CMeIE Chinese medical relation extraction dataset and the Baidu2019 Chinese dataset confirm that our approach yields a higher F1 score than all cutting-edge baselines. Moreover, it offers substantial performance improvements in intricate situations involving diverse overlapping patterns, multitudes of triplets, and cross-sentence triplets.
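A biaffine transformation of the kind mentioned here is commonly written as a bilinear term plus a linear term over the concatenated pair. The sketch below uses that common form to build a small subject-object matching matrix. The parameter names `U`, `w`, `b` and all numeric values are assumptions chosen for illustration, not the BAMRE parameterization.

```python
def biaffine_score(subj, obj, U, w, b):
    """Score one (subject, object) pair: subj^T U obj + w . [subj; obj] + b."""
    bilinear = sum(
        subj[i] * U[i][j] * obj[j]
        for i in range(len(subj))
        for j in range(len(obj))
    )
    linear = sum(wk * xk for wk, xk in zip(w, subj + obj))
    return bilinear + linear + b


def matching_matrix(entities, U, w, b):
    """Entry [i][j] scores entity i as subject and entity j as object."""
    return [[biaffine_score(s, o, U, w, b) for o in entities] for s in entities]


# Two toy entity vectors and hand-picked parameters (hypothetical values).
entities = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.0, 2.0], [0.0, 0.0]]   # bilinear weights: favors entity 0 -> entity 1
w = [0.5, 0.0, 0.0, 0.5]       # linear weights over the concatenated pair
b = -1.0                       # bias

scores = matching_matrix(entities, U, w, b)
```

The matrix is not symmetric: swapping subject and object changes the score, which is how a single matrix per relation can encode direction.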
Affiliation(s)
- Jiaqi Sun
  - Computer Science and Technology, Shandong University of Technology, Zibo, 255000, Shandong, China.
- Chen Zhang
  - Computer Science and Technology, Shandong University of Technology, Zibo, 255000, Shandong, China.
- Linlin Xing
  - Computer Science and Technology, Shandong University of Technology, Zibo, 255000, Shandong, China.
- Longbo Zhang
  - Computer Science and Technology, Shandong University of Technology, Zibo, 255000, Shandong, China.
- Hongzhen Cai
  - Agricultural Engineering and Food Science, Shandong University of Technology, Zibo, 255000, Shandong, China.
- Maozu Guo
  - College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, 102612, China.
6. Sänger M, Garda S, Wang XD, Weber-Genzel L, Droop P, Fuchs B, Akbik A, Leser U. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools. Bioinformatics 2024; 40:btae564. [PMID: 39302686 PMCID: PMC11453098 DOI: 10.1093/bioinformatics/btae564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 08/23/2024] [Accepted: 09/17/2024] [Indexed: 09/22/2024] Open
Abstract
MOTIVATION With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections that are moderately to extremely different from those used for training, varying, e.g. in focus, genre or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.
RESULTS Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central. Our results indicate that users of BTM tools should expect a lower performance than the original published one when applying tools "in the wild" and show that further research is necessary for more robust BTM tools.
AVAILABILITY AND IMPLEMENTATION All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.
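The comparison between in-corpus and cross-corpus performance rests on entity-level precision, recall and F1. A minimal sketch follows, assuming exact-match scoring over `(document, start, end, type)` tuples. The helper `span_prf1` and all spans are made up for illustration and are not results or code from the benchmark.

```python
def span_prf1(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over sets of entity spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Gold annotations for a toy evaluation corpus (hypothetical spans).
gold = {
    ("doc1", 0, 7, "Chemical"),
    ("doc1", 16, 28, "Disease"),
    ("doc2", 4, 9, "Gene"),
    ("doc2", 20, 31, "Disease"),
}

# A tool scored on the corpus it was trained on: one boundary error.
in_corpus_pred = {
    ("doc1", 0, 7, "Chemical"),
    ("doc1", 16, 28, "Disease"),
    ("doc2", 4, 9, "Gene"),
    ("doc2", 20, 27, "Disease"),
}

# The same tool scored on an unseen corpus: two boundary errors.
cross_corpus_pred = {
    ("doc1", 0, 7, "Chemical"),
    ("doc1", 16, 28, "Disease"),
    ("doc2", 4, 12, "Gene"),
    ("doc2", 18, 31, "Disease"),
}

in_p, in_r, in_f1 = span_prf1(gold, in_corpus_pred)
cross_p, cross_r, cross_f1 = span_prf1(gold, cross_corpus_pred)
```

Exact-match scoring penalizes every boundary disagreement, so differing annotation guidelines between corpora can lower cross-corpus scores even when the tool finds the right mention.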
Affiliation(s)
- Mario Sänger
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Samuele Garda
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Xing David Wang
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Leon Weber-Genzel
  - Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany
- Pia Droop
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Benedikt Fuchs
  - Research Industrial Systems Engineering (RISE) Forschungs-, Entwicklungs- und Großprojektberatung GmbH, Schwechat 2320, Austria
- Alan Akbik
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
7. Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen LJ. RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature. Database (Oxford) 2024; 2024:baae095. [PMID: 39265993 PMCID: PMC11394941 DOI: 10.1093/database/baae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/31/2024] [Accepted: 08/16/2024] [Indexed: 09/14/2024]
Abstract
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
Affiliation(s)
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark
- Farrokh Mehryary
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Tomoko Ohta
  - Textimi, 1-37-13 Kitazawa, Tokyo, Setagaya-ku 155-0031, Japan
- Jouni Luoma
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark
8. Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024; 11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open
Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
Grants
- U24 HG007822 NHGRI NIH HHS
- NIH Intramural Research Program, National Library of Medicine
- Expert curation and evaluation of EnzChemRED at Swiss-Prot were supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and the National Human Genome Research Institute (NHGRI), Office of Director [OD/DPCPSI/ODSS], National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health [U24HG007822], and by the European Union's Horizon Europe Framework Programme (grant number 101080997), supported in Switzerland through the State Secretariat for Education, Research and Innovation (SERI).
- Fundamental Research Funds for the Central Universities [DUT23RC(3)014 to L.L.]
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Elisabeth Coudert
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucila Aimo
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Kristian Axelsen
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lionel Breuza
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Edouard de Castro
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Marc Feuermann
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anne Morgat
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucille Pourcel
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Ivo Pedruzzi
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Sylvain Poux
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Nicole Redaschi
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Catherine Rivoire
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anastasia Sveshnikova
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
- Alan Bridge
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland.
9. Nie P, Ning J, Lin M, Yang Z, Wang L. SSGU-CD: A combined semantic and structural information graph U-shaped network for document-level Chemical-Disease interaction extraction. J Biomed Inform 2024; 157:104719. [PMID: 39214159 DOI: 10.1016/j.jbi.2024.104719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 08/20/2024] [Accepted: 08/27/2024] [Indexed: 09/04/2024]
Abstract
Document-level Chemical-Disease interaction extraction aims to infer the interaction relations between chemical entities and disease entities across multiple sentences. Compared with sentence-level relation extraction, document-level relation extraction can capture the associations between different entities throughout the entire document, which is found to be more practical for biomedical text information. However, current biomedical extraction methods mainly concentrate on sentence-level relation extraction, making it difficult to access the rich structural information contained in documents in practical application scenarios. We put forward SSGU-CD, a combined Semantic and Structural information Graph U-shaped network for document-level Chemical-Disease interaction extraction. This framework effectively stores document semantic and structure information as graphs and can fuse the original context information of documents. Using the framework, we propose a balanced combination of cross-entropy loss functions to facilitate collaborative optimization among models with the aim of enhancing the ability to extract Chemical-Disease interaction relations. We evaluated SSGU-CD on the document-level relation extraction datasets CDR and BioRED, and the results demonstrate that the framework can significantly improve the extraction performance.
Affiliation(s)
- Pengyuan Nie
  - Academy of Military Medical Sciences, Beijing, 100850, China
- Jinzhong Ning
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Mengxuan Lin
  - Academy of Military Medical Sciences, Beijing, 100850, China
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Lei Wang
  - Academy of Military Medical Sciences, Beijing, 100850, China.
10. Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc 2024; 31:1801-1811. [PMID: 39169867 PMCID: PMC11339542 DOI: 10.1093/jamia/ocae202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Indexed: 08/23/2024] Open
Affiliation(s)
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
- Trevor Cohen
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
- Marzyeh Ghassemi
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Shubo Tian
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
11. Li Z, Wei Q, Huang LC, Li J, Hu Y, Chuang YS, He J, Das A, Keloth VK, Yang Y, Diala CS, Roberts KE, Tao C, Jiang X, Zheng WJ, Xu H. Ensemble pretrained language models to extract biomedical knowledge from literature. J Am Med Inform Assoc 2024; 31:1904-1911. [PMID: 38520725 PMCID: PMC11339500 DOI: 10.1093/jamia/ocae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 02/14/2024] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open
Abstract
OBJECTIVES The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research deficiencies. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate such potential and provides a manually annotated corpus for methodology development and benchmarking.
MATERIALS AND METHODS For the named entity recognition (NER) task, we utilized ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA, devised a rule-driven detection method for cell line and taxonomy names and annotated 70 more abstracts as an additional corpus. We further finetuned the T0pp model, with 11 billion parameters, to boost the performance on relation extraction and leveraged entities' location information (eg, title, background) to enhance novelty prediction performance in relation extraction (RE).
RESULTS Our pioneering NLP system designed for this challenge secured first place in Phase I-NER and second place in Phase II-relation extraction and novelty prediction, outpacing over 200 teams. We tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a Zero-Shot setting using the same test set, revealing that our finetuned model considerably surpasses these broad-spectrum large language models.
DISCUSSION AND CONCLUSION Our outcomes depict a robust NLP system excelling in NER and RE across various biomedical entities, emphasizing that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors like knowledge graph development and hypothesis formulation in biomedical research.
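Merging NER predictions from several models can be sketched with per-token majority voting. The abstract does not state which merging rule was used, so majority voting with a first-model tie-break is an assumption here, and the tag sequences are invented stand-ins for the outputs of three models.

```python
from collections import Counter


def majority_vote(predictions):
    """Merge per-token tags from several models.

    `predictions` is a list of tag sequences, one per model, all of the same
    length. The most frequent tag wins for each token; on a tie, the tag
    proposed by the earliest model in the list is kept.
    """
    merged = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        for tag in token_tags:  # iterate in model order so ties favor earlier models
            if counts[tag] == best:
                merged.append(tag)
                break
    return merged


# Hypothetical tag sequences for a four-token sentence from three models.
model_a = ["B-Gene", "O", "B-Disease", "O"]
model_b = ["B-Gene", "O", "O", "O"]
model_c = ["O", "O", "B-Disease", "I-Disease"]

merged_tags = majority_vote([model_a, model_b, model_c])
```

Token-level voting can produce invalid BIO sequences (for example an `I-` tag without a preceding `B-`), so a practical pipeline would likely vote on whole spans or repair the merged sequence afterwards.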
Affiliation(s)
- Zhao Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Qiang Wei
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Liang-Chin Huang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Jianfu Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Yan Hu
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Yao-Shun Chuang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Jianping He
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Avisha Das
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Vipina Kuttichi Keloth
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
| | - Yuntao Yang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Chiamaka S Diala
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Kirk E Roberts
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Cui Tao
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Xiaoqian Jiang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - W Jim Zheng
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
|
12
|
Zhou H, Li M, Xiao Y, Yang H, Zhang R. LEAP: LLM instruction-example adaptive prompting framework for biomedical relation extraction. J Am Med Inform Assoc 2024; 31:2010-2018. [PMID: 38904416 PMCID: PMC11339510 DOI: 10.1093/jamia/ocae147] [Received: 12/15/2023] [Revised: 05/26/2024] [Accepted: 06/03/2024] [Indexed: 06/22/2024] Open
Abstract
OBJECTIVE To investigate demonstrations in large language models (LLMs) for biomedical relation extraction. This study introduces a framework comprising three types of adaptive tuning methods to assess their impacts and effectiveness. MATERIALS AND METHODS Our study was conducted in two phases. Initially, we analyzed a range of demonstration components vital for LLMs' biomedical data capabilities, including task descriptions and examples, experimenting with various combinations. Subsequently, we introduced the LLM instruction-example adaptive prompting (LEAP) framework, including instruction adaptive tuning, example adaptive tuning, and instruction-example adaptive tuning methods. This framework aims to systematically investigate both adaptive task descriptions and adaptive examples within the demonstration. We assessed the performance of the LEAP framework on the DDI, ChemProt, and BioRED datasets, employing LLMs such as Llama2-7b, Llama2-13b, and MedLLaMA_13B. RESULTS Our findings indicated that Instruction + Options + Example and its expanded form substantially improved F1 scores over the standard Instruction + Options mode for zero-shot LLMs. The LEAP framework, particularly through its example adaptive prompting, demonstrated superior performance over conventional instruction tuning across all models. Notably, the MedLLaMA_13B model achieved an exceptional F1 score of 95.13 on the ChemProt dataset using this method. Significant improvements were also observed in the DDI 2013 and BioRED datasets, confirming the method's robustness in sophisticated data extraction scenarios. CONCLUSION The LEAP framework offers a compelling strategy for enhancing LLM training strategies, steering away from extensive fine-tuning towards more dynamic and contextually enriched prompting methodologies, as showcased in biomedical relation extraction.
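To illustrate the difference between the "Instruction + Options" and "Instruction + Options + Example" modes named in the abstract above, here is a minimal sketch of prompt assembly. The template wording, the function name `build_prompt`, and the sample sentences and labels are all assumptions for illustration; the abstract does not give the actual prompt format.

```python
def build_prompt(instruction, options, examples, sentence):
    """Assemble a relation-extraction prompt from its demonstration components.

    examples: list of (text, label) pairs; an empty list yields the plain
    "Instruction + Options" mode.
    """
    lines = [
        f"Instruction: {instruction}",
        "Options: " + ", ".join(options),
    ]
    for text, label in examples:
        lines.append(f"Example: {text}")
        lines.append(f"Answer: {label}")
    lines.append(f"Input: {sentence}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical task description, label set, and sentences.
instruction = "Classify the interaction between the two marked drugs."
options = ["mechanism", "effect", "advice", "none"]
query = "@DRUG1$ may increase the plasma level of @DRUG2$."

zero_shot_prompt = build_prompt(instruction, options, [], query)
one_shot_prompt = build_prompt(
    instruction,
    options,
    [("@DRUG1$ should not be combined with @DRUG2$.", "advice")],
    query,
)
```

An adaptive variant would choose the `examples` (or the `instruction`) per input rather than fixing them, which is where the selection strategy studied in the paper would plug in.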
Affiliation(s)
- Huixue Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
| | - Mingchen Li
- Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States
| | - Yongkang Xiao
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
| | - Han Yang
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
| | - Rui Zhang
- Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States
|
13
|
Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024; 31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open
Abstract
OBJECTIVE Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of fine-tuned LLMs on diverse biomedical natural language processing (NLP) tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical NLP tasks. MATERIALS AND METHODS We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese datasets) across over 10 task types. Subsequently, these corpora were converted to the instruction data used to fine-tune the general LLM. During the supervised fine-tuning phase, a 2-stage strategy is proposed to optimize the model performance across various tasks. RESULTS Experimental results on 13 test sets, which include named entity recognition, relation extraction, text classification, and question answering tasks, demonstrate that Taiyi achieves superior performance compared to general LLMs. The case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multitasking. CONCLUSION Leveraging rich high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi shows bilingual multitasking capability through supervised fine-tuning. However, tasks such as information extraction that are not generative in nature remain challenging for LLM-based generative approaches, which still underperform conventional discriminative approaches using smaller language models.
Affiliation(s)
- Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jinzhong Ning
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yingwen Zhao
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zhijun Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zeyuan Ding
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Peng Chen
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Weiru Fu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Qinyu Han
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Guangtao Xu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yunzhi Qiu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Dinghao Pan
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jiru Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Hao Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Wenduo Feng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Senbo Tu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yuqi Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yuanyuan Sun
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
|
14
|
Zong H, Wu R, Cha J, Feng W, Wu E, Li J, Shao A, Tao L, Li Z, Tang B, Shen B. Advancing Chinese biomedical text mining with community challenges. J Biomed Inform 2024; 157:104716. [PMID: 39197732 DOI: 10.1016/j.jbi.2024.104716] [Received: 05/06/2024] [Revised: 08/22/2024] [Accepted: 08/25/2024] [Indexed: 09/01/2024]
Abstract
OBJECTIVE This study aims to review the recent advances in community challenges for biomedical text mining in China. METHODS We collected information on evaluation tasks released in community challenges of biomedical text mining, including task description, dataset description, data source, task type and related links. A systematic summary and comparative analysis were conducted on various biomedical natural language processing tasks, such as named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. RESULTS We identified 39 evaluation tasks from 6 community challenges that spanned from 2017 to 2023. Our analysis revealed the diverse range of evaluation task types and data sources in biomedical text mining. We explored the potential clinical applications of these community challenge tasks from a translational biomedical informatics perspective. We compared these tasks with their English counterparts, and discussed the contributions, limitations, lessons and guidelines of these community challenges, while highlighting future directions in the era of large language models. CONCLUSION Community challenge evaluation competitions have played a crucial role in promoting technology innovation and fostering interdisciplinary collaboration in the field of biomedical text mining. These challenges provide valuable platforms for researchers to develop state-of-the-art solutions.
Affiliation(s)
- Hui Zong
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Rongrong Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Jiaxue Cha
- Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Weizhe Feng
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Erman Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Jiakun Li
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China; Department of Urology, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Aibin Shao
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Liang Tao
- Faculty of Business Information, Shanghai Business School, Shanghai 201400, China
| | | | - Buzhou Tang
- Department of Computer Science, Harbin Institute of Technology, Shenzhen 518055, China
| | - Bairong Shen
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China.
|
15
|
Sarol MJ, Hong G, Guerra E, Kilicoglu H. Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach. Database (Oxford) 2024; 2024:baae079. [PMID: 39197056 PMCID: PMC11352595 DOI: 10.1093/database/baae079] [Received: 03/25/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 08/30/2024]
Abstract
Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
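The BIO scheme mentioned in the abstract above labels each token as the Beginning of an entity, Inside one, or Outside any entity. As a minimal sketch (not the authors' code), decoding a BIO tag sequence into entity spans can be done as follows; the function name `bio_to_spans`, the entity type names, and the choice to drop a stray `I-` tag are assumptions.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (start, end_exclusive, entity_type) spans.

    A stray "I-X" tag that does not continue an open span of type X is
    treated like "O" (an assumption; other decoders start a new span).
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the types match.
        if tag.startswith("I-") and etype == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if any.
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


# Hypothetical tag sequence for a six-token sentence.
example_tags = ["B-Chemical", "I-Chemical", "O", "B-Disease", "B-Disease", "I-Chemical"]
example_spans = bio_to_spans(example_tags)
```

A span classification model, the alternative compared in the abstract, would instead score candidate `(start, end)` pairs directly and needs no such decoding step.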
Affiliation(s)
- M Janina Sarol
- Informatics Programs, University of Illinois Urbana-Champaign, 614 E Daniel Street, Champaign, IL 61820, United States
| | - Gibong Hong
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
| | - Evan Guerra
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
| | - Halil Kilicoglu
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
|
16
|
Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. Bioinformatics Advances 2024; 4:vbae116. [PMID: 39411448 PMCID: PMC11474106 DOI: 10.1093/bioadv/vbae116] [Received: 05/29/2024] [Revised: 07/10/2024] [Accepted: 08/04/2024] [Indexed: 10/19/2024]
Abstract
Motivation Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability and implementation All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
Affiliation(s)
- Katerina Nastou
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
| | - Mikaela Koutrouli
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
| | - Sampo Pyysalo
- TurkuNLP Group, Department of Computing, University of Turku, Turku, Finland
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
|
17
|
Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024; 2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]
Abstract
The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provide known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of 400 newly published articles to serve as the test data for the challenge.
All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Chih-Hsuan Wei
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, No.2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China
| | - Cathleen Coss
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Preeti Gokal Kochar
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Nicholas Miliaras
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Oleg Rodionov
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Keiko Sekiya
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Dorothy Trinh
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Deborah Whitman
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Zhiyong Lu
- National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
|
18
|
Islamaj R, Lai PT, Wei CH, Luo L, Almeida T, Jonker RAA, Conceição SIR, Sousa DF, Phan CP, Chiang JH, Li J, Pan D, Meesawad W, Tsai RTH, Sarol MJ, Hong G, Valiev A, Tutubalina E, Lee SM, Hsu YY, Li M, Verspoor K, Lu Z. The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford) 2024; 2024:baae069. [PMID: 39114977 PMCID: PMC11306928 DOI: 10.1093/database/baae069] [Received: 03/19/2024] [Revised: 05/27/2024] [Accepted: 07/09/2024] [Indexed: 08/11/2024]
Abstract
The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-score performances achieved for Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-score performances achieved for Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378.
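The F-scores above drop as more aspects of a relation must match (pair only, then type, then novelty, then all together). The sketch below shows why, using set-based precision, recall, and F-score over relation tuples. It is an illustration under assumed conventions, not the official BioRED scorer; the function names, identifiers, and the tiny gold and predicted sets are hypothetical.

```python
def f_score(gold, pred):
    """Set-based F-score: harmonic mean of precision and recall."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def project(relations, n_fields):
    """Keep only the first n_fields of each (id1, id2, type, novelty) tuple."""
    return [relation[:n_fields] for relation in relations]


# Hypothetical annotations: (entity id 1, entity id 2, relation type, novelty).
gold = [
    ("D1", "G1", "Association", "Novel"),
    ("C1", "G2", "Binding", "No"),
]
pred = [
    ("D1", "G1", "Association", "No"),   # right pair and type, wrong novelty
    ("C1", "G2", "Binding", "No"),       # fully correct
    ("C2", "G3", "Association", "Novel"),  # spurious relation
]

pair_f = f_score(project(gold, 2), project(pred, 2))  # pairs only
type_f = f_score(project(gold, 3), project(pred, 3))  # pairs + relation type
all_f = f_score(gold, pred)                           # pairs + type + novelty
```

With these toy sets the pair-level score is 0.8 while the all-aspects score is 0.4, mirroring the direction (not the magnitude) of the gap reported in the abstract.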
Affiliation(s)
- Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
| | - Tiago Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Richard A. A Jonker
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sofia I. R Conceição
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
| | - Diana F Sousa
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
| | - Cong-Phuoc Phan
- Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
| | - Jung-Hsien Chiang
- Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
| | - Jiru Li
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
| | - Dinghao Pan
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
| | - Wilailack Meesawad
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
| | - Richard Tzong-Han Tsai
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
- Research Center for Humanities and Social Sciences, Academia Sinica, No. 128, Section 2, Academia Rd., Nangang District, Taoyuan City 115201, Taiwan, Republic of China
| | - M. Janina Sarol
- School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
| | - Gibong Hong
- School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
| | - Airat Valiev
- Higher School of Economics University, 20 Myasnitskaya St, Moscow 101000, Russia
| | - Elena Tutubalina
- Artificial Intelligence Research Institute (AIRI), 32 Kutuzovskiy St, Moscow 121170, Russia
- Kazan Federal University, 18 Kremlevskaya St, Kazan 420008, Russia
| | - Shao-Man Lee
- Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
| | - Yi-Yu Hsu
- Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
| | - Mingjie Li
- School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| |
|
19
|
Garda S, Leser U. BELHD: improving biomedical entity linking with homonym disambiguation. Bioinformatics 2024; 40:btae474. [PMID: 39067036 PMCID: PMC11310454 DOI: 10.1093/bioinformatics/btae474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 06/14/2024] [Accepted: 07/25/2024] [Indexed: 07/30/2024] Open
Abstract
MOTIVATION Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, i.e. systems that identify the most appropriate name in the KB for a given mention using a neural network (either via dense retrieval or autoregressive modeling), have achieved remarkable results on the task, without requiring manual tuning or definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene). RESULTS We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach. AVAILABILITY AND IMPLEMENTATION The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.
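The KB pre-processing step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the disambiguating string is constructed, so everything below is an assumption made for illustration: the `(entity_id, name, description)` schema, the function name `disambiguate_homonyms`, and the choice of appending a description in parentheses are all hypothetical and are not taken from the BELHD code.

```python
from collections import defaultdict


def disambiguate_homonyms(kb):
    """Make every name in a toy KB unique.

    kb: list of (entity_id, name, description) tuples (hypothetical schema).
    Names shared by more than one entity are expanded with the description;
    unambiguous names are left untouched.
    """
    # Collect the set of entity identifiers behind each (case-folded) name.
    ids_by_name = defaultdict(set)
    for entity_id, name, _ in kb:
        ids_by_name[name.lower()].add(entity_id)

    expanded = []
    for entity_id, name, description in kb:
        if len(ids_by_name[name.lower()]) > 1:
            # Homonym: append an illustrative disambiguating string.
            name = f"{name} ({description})"
        expanded.append((entity_id, name))
    return expanded


# Toy entries; identifiers and descriptions are made up.
kb = [
    ("GENE:1", "cat", "Homo sapiens"),
    ("GENE:2", "cat", "Mus musculus"),
    ("GENE:3", "tp53", "Homo sapiens"),
]
expanded = disambiguate_homonyms(kb)
for entity_id, name in expanded:
    print(entity_id, name)
```

After this step each name maps to exactly one identifier, so a name-based linker that returns a name also returns a unique entity.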
Affiliation(s)
- Samuele Garda
- Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany
| | - Ulf Leser
- Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany
| |
|
20
|
Cousins HC, Nayar G, Altman RB. Computational Approaches to Drug Repurposing: Methods, Challenges, and Opportunities. Annu Rev Biomed Data Sci 2024; 7:15-29. [PMID: 38598857 DOI: 10.1146/annurev-biodatasci-110123-025333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]
Abstract
Drug repurposing refers to the inference of therapeutic relationships between a clinical indication and existing compounds. As an emerging paradigm in drug development, drug repurposing enables more efficient treatment of rare diseases, stratified patient populations, and urgent threats to public health. However, prioritizing well-suited drug candidates from among a nearly infinite number of repurposing options continues to represent a significant challenge in drug development. Over the past decade, advances in genomic profiling, database curation, and machine learning techniques have enabled more accurate identification of drug repurposing candidates for subsequent clinical evaluation. This review outlines the major methodologic classes that these approaches comprise, which rely on (a) protein structure, (b) genomic signatures, (c) biological networks, and (d) real-world clinical data. We propose that realizing the full impact of drug repurposing methodologies requires a multidisciplinary understanding of each method's advantages and limitations with respect to clinical practice.
Affiliation(s)
- Henry C Cousins
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
| | - Gowri Nayar
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
| | - Russ B Altman
- Departments of Genetics, Medicine, and Bioengineering, Stanford University, Stanford, California, USA
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
| |
|
21
|
Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024; 2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
The identification of medical concepts from clinical narratives is of great interest to the biomedical scientific community because of its importance for improving treatments and for drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology handles overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important for training a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation on Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51.
The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
Affiliation(s)
- Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| |
|
22
|
Almeida T, Jonker RAA, Antunes R, Almeida JR, Matos S. Towards discovery: an end-to-end system for uncovering novel biomedical relations. Database (Oxford) 2024; 2024:baae057. [PMID: 38994795 PMCID: PMC11240158 DOI: 10.1093/database/baae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/20/2024] [Accepted: 06/19/2024] [Indexed: 07/13/2024]
Abstract
Biomedical relation extraction is an ongoing challenge within the natural language processing community. Its application is important for understanding scientific biomedical literature, with many use cases, such as drug discovery, precision medicine, disease diagnosis, treatment optimization and biomedical knowledge graph construction. Therefore, the development of a tool capable of effectively addressing this task holds the potential to improve knowledge discovery by automating the extraction of relations from research manuscripts. The first track in the BioCreative VIII competition extended the scope of this challenge by introducing the detection of novel relations within the literature. This paper first describes our participating system, which initially focused on jointly extracting and classifying novel relations between biomedical entities. We then describe our subsequent advancement to an end-to-end model. Specifically, we enhanced our initial system by incorporating it into a cascading pipeline that includes a tagger and linker module. This integration enables the comprehensive extraction of relations and classification of their novelty directly from raw text. Our experiments yielded promising results, and our tagger module managed to attain state-of-the-art named entity recognition performance, with a micro F1-score of 90.24, while our end-to-end system achieved a competitive novelty F1-score of 24.59. The code to run our system is publicly available at https://github.com/ieeta-pt/BioNExt. Database URL: https://github.com/ieeta-pt/BioNExt.
Affiliation(s)
- Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| |
|
23
|
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024; 52:W540-W546. [PMID: 38572754 PMCID: PMC11223843 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Shubo Tian
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Zhizheng Wang
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| |
|
24
|
Xu T, Gu Y, Xue M, Gu R, Li B, Gu X. Knowledge graph construction for heart failure using large language models with prompt engineering. Front Comput Neurosci 2024; 18:1389475. [PMID: 39015745 PMCID: PMC11250484 DOI: 10.3389/fncom.2024.1389475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Accepted: 06/13/2024] [Indexed: 07/18/2024] Open
Abstract
Introduction Constructing an accurate and comprehensive knowledge graph of specific diseases is critical for practical clinical disease diagnosis and treatment, reasoning and decision support, rehabilitation, and health management. For knowledge graph construction tasks (such as named entity recognition and relation extraction), classical BERT-based methods require a large amount of training data to ensure model performance. However, real-world medical annotation data, especially disease-specific annotation samples, are very limited. In addition, existing models do not perform well in recognizing out-of-distribution entities and relations that are not seen in the training phase. Method In this study, we present a novel and practical pipeline for constructing a heart failure knowledge graph using large language models and medical expert refinement. We apply prompt engineering to the three phases of knowledge graph construction: schema design, information extraction, and knowledge completion. The best performance is achieved by designing task-specific prompt templates combined with the TwoStepChat approach. Results Experiments on two datasets show that the TwoStepChat method outperforms both the Vanilla prompt and the fine-tuned BERT-based baselines. Moreover, our method saves 65% of the time compared to manual annotation and is better suited to extracting out-of-distribution information in the real world.
Affiliation(s)
- Tianhan Xu
- School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu, China
- School of Information Engineering, Yangzhou Polytechnic Institute, Yangzhou, Jiangsu, China
| | - Yixun Gu
- Department of Radiation Oncology, Yangzhou Second People's Hospital, Yangzhou, Jiangsu, China
| | - Mantian Xue
- School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu, China
| | - Renjie Gu
- Department of Cardiovascular, Northern Jiangsu Province People Hospital of Yangzhou University, Yangzhou, Jiangsu, China
| | - Bin Li
- School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu, China
| | - Xiang Gu
- Department of Cardiovascular, Northern Jiangsu Province People Hospital of Yangzhou University, Yangzhou, Jiangsu, China
| |
|
25
|
Yuan J, Zhang F, Qiu Y, Lin H, Zhang Y. Document-level biomedical relation extraction via hierarchical tree graph and relation segmentation module. Bioinformatics 2024; 40:btae418. [PMID: 38917409 DOI: 10.1093/bioinformatics/btae418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2024] [Revised: 05/27/2024] [Accepted: 06/24/2024] [Indexed: 06/27/2024]
Abstract
MOTIVATION Biomedical relation extraction at the document level (Bio-DocRE) involves extracting relation instances from biomedical texts that span multiple sentences, often containing various entity concepts such as genes, diseases, chemicals, and variants. Currently, this task is usually implemented based on graphs or transformers. However, most work maps entity features directly to relation predictions, ignoring the effectiveness of entity pair information as an intermediate state for relation prediction. In this article, we decouple this task into a three-stage process to capture sufficient information for improving relation prediction. RESULTS We propose an innovative framework, HTGRS, for Bio-DocRE, which constructs a hierarchical tree graph (HTG) to integrate key information sources in the document, achieving entity-based relation reasoning. In addition, inspired by the idea of semantic segmentation, we conceptualize the task as a table-filling problem and develop a relation segmentation (RS) module to enhance relation reasoning based on the entity pair. Extensive experiments on three datasets show that the proposed framework outperforms the state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION Our source code is available at https://github.com/passengeryjy/HTGRS.
Affiliation(s)
- Jianyuan Yuan
- School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
| | - Fengyu Zhang
- School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
| | - Yimeng Qiu
- School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yijia Zhang
- School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
| |
|
26
|
Ming S, Zhang R, Kilicoglu H. Enhancing the coverage of SemRep using a relation classification approach. J Biomed Inform 2024; 155:104658. [PMID: 38782169 DOI: 10.1016/j.jbi.2024.104658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 05/01/2024] [Accepted: 05/18/2024] [Indexed: 05/25/2024]
Abstract
OBJECTIVE Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB. METHODS We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts. RESULTS Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on 12K abstracts shows that the model could double the size of SemMedDB when applied to the entire PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct. CONCLUSION These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining.
Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.
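The precision, recall, and F1 figures quoted in this abstract can be related with a short worked check. The sketch below assumes the standard F1 definition (harmonic mean of precision and recall). The SemRep baseline F1 it prints is derived here from the quoted precision and recall and is not a number reported in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract above.
semrep_f1 = f1_score(0.69, 0.42)  # SemRep baseline; F1 derived, not reported
model_f1 = f1_score(0.62, 0.81)   # best relation classification model

print(f"SemRep baseline F1: {semrep_f1:.2f}")
print(f"Best model F1:      {model_f1:.2f}")
```

Under this definition the best model's quoted precision and recall reproduce the reported F1 of 0.70, and the trade-off is visible: precision drops by 0.07 while recall nearly doubles.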
Affiliation(s)
- Shufan Ming
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel St., Champaign, 61820, IL, USA
| | - Rui Zhang
- Division of Computational Health Sciences, Department of Surgery, University of Minnesota, 516 Delaware St SE, Minneapolis, 55455, MN, USA
| | - Halil Kilicoglu
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel St., Champaign, 61820, IL, USA
| |
|
27
|
Purpura A, Mulligan N, Kartoun U, Koski E, Anand V, Bettencourt-Silva J. Investigating Cross-Domain Binary Relation Classification in Biomedical Natural Language Processing. AMIA Jt Summits Transl Sci Proc 2024; 2024:384-390. [PMID: 38827064 PMCID: PMC11141837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
This paper addresses the challenge of binary relation classification in biomedical Natural Language Processing (NLP), focusing on diverse domains including gene-disease associations, compound protein interactions, and social determinants of health (SDOH). We evaluate different approaches, including fine-tuning Bidirectional Encoder Representations from Transformers (BERT) models and generative Large Language Models (LLMs), and examine their performance in zero and few-shot settings. We also introduce a novel dataset of biomedical text annotated with social and clinical entities to facilitate research into relation classification. Our results underscore the continued complexity of this task for both humans and models. BERT-based models trained on domain-specific data excelled in certain domains and achieved comparable performance and generalization power to generative LLMs in others. Despite these encouraging results, these models are still far from achieving human-level performance. We also highlight the significance of high-quality training data and domain-specific fine-tuning on the performance of all the considered models.
|
28
|
Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database (Oxford) 2024; 2024:baae039. [PMID: 38805753 PMCID: PMC11131422 DOI: 10.1093/database/baae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/17/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by helping to find the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
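The gene-variant-gene-variant candidates mentioned in this abstract can be pictured with a small sketch. It is not the DUVEL extraction code: the input format, the function name `digenic_candidates`, and the rule that the two variants must belong to different genes are assumptions made for illustration. The last line only restates the class balance implied by the quoted counts (794 positives out of 8442 fragments).

```python
from itertools import combinations


def digenic_candidates(gene_variants):
    """Enumerate candidate 4-ary (gene, variant, gene, variant) combinations.

    gene_variants: list of (gene, variant) pairs found in one text fragment
    (hypothetical input format). Pairs within the same gene are skipped,
    which is an assumption of this sketch.
    """
    return [
        (g1, v1, g2, v2)
        for (g1, v1), (g2, v2) in combinations(gene_variants, 2)
        if g1 != g2
    ]


# Placeholder mentions; gene and variant names are made up.
mentions = [("GENE_A", "var1"), ("GENE_A", "var2"), ("GENE_B", "var3")]
candidates = digenic_candidates(mentions)
for candidate in candidates:
    print(candidate)

# Share of positive instances implied by the counts in the abstract.
positive_rate = 794 / 8442
print(f"Positive instances: {positive_rate:.1%}")
```

The low share of positives (under 10%) is consistent with the abstract's motivation for using active learning to pick informative fragments rather than labelling candidates at random.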
Affiliation(s)
- Charlotte Nachtegael
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
| | - Jacopo De Stefani
- Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
| | - Anthony Cnudde
- Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Pharmacologie, Pharmacothérapie et Suivi Pharmaceutique, Université Libre de Bruxelles, Boulevard du Triomphe, CP 205, Brussels 1050, Belgium
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium
| |
|
29
|
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024] Open
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Affiliation(s)
- Ming-Siang Huang
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
| | - Jen-Chieh Han
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
| | - Pei-Yen Lin
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
| | - Yu-Ting You
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
| | - Richard Tzong-Han Tsai
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
| |
|
30
|
Smith E, Paloots R, Giagkos D, Baudis M, Stockinger K. Data-driven information extraction and enrichment of molecular profiling data for cancer cell lines. Bioinform Adv 2024; 4:vbae045. [PMID: 38560553 PMCID: PMC10978572 DOI: 10.1093/bioadv/vbae045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 02/12/2024] [Accepted: 03/14/2024] [Indexed: 04/04/2024]
Abstract
Motivation With the proliferation of research means and computational methodologies, published biomedical literature is growing exponentially in numbers and volume. Cancer cell lines are frequently used models in biological and medical research that are currently applied for a wide range of purposes, from studies of cellular mechanisms to drug development, which has led to a wealth of related data and publications. Sifting through large quantities of text to gather relevant information on cell lines of interest is tedious and extremely slow when performed by humans. Hence, novel computational information extraction and correlation mechanisms are required to boost meaningful knowledge extraction. Results In this work, we present the design, implementation, and application of a novel data extraction and exploration system. This system extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data concerning cancer cell lines. We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variants plots with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidences, allowing for deep, yet rapid, literature search, using existing structured data as a springboard. Availability and implementation Our system is publicly available on the web at https://cancercelllines.org.
Affiliation(s)
- Ellery Smith
- Institute for Intelligent Information Systems, Zürich University of Applied Sciences, 8400 Winterthur, Switzerland
| | - Rahel Paloots
- Department of Molecular Life Sciences, University of Zürich, 8057 Zürich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | | | - Michael Baudis
- Department of Molecular Life Sciences, University of Zürich, 8057 Zürich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Kurt Stockinger
- Institute for Intelligent Information Systems, Zürich University of Applied Sciences, 8400 Winterthur, Switzerland
|
31
|
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high-quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance on par with training on all data) from 6% to 38%, depending on the data set, with the Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
Affiliation(s)
- Charlotte Nachtegael
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
| | - Jacopo De Stefani
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
|
32
|
Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. BioKG: a comprehensive, large-scale biomedical knowledge graph for AI-powered, data-driven biomedical research. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.13.562216. [PMID: 38168218 PMCID: PMC10760044 DOI: 10.1101/2023.10.13.562216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
To cope with the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have emerged as a powerful data structure for integrating large volumes of heterogeneous data to facilitate accurate and efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured content from scientific literature into KGs has remained a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge to construct a large-scale KG using all PubMed abstracts. The quality of the large-scale information extraction rivals that of human expert annotations, signaling a new era of automatic, high-quality database construction from literature. Our extracted information markedly surpasses the amount of content in manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. The comprehensive KG enabled rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and achieved unprecedented results for drug target identification and drug repurposing. Taking lung cancer as an example, we found that 40% of drug targets reported in literature could have been predicted by our algorithm about 15 years ago in a retrospective study, demonstrating that substantial acceleration in scientific discovery could be achieved through automated hypotheses generation and timely dissemination. A cloud-based platform (https://www.biokde.com) was developed for academic users to freely access this rich structured data and associated tools.
Affiliation(s)
- Yuan Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
| | - Xin Sui
- Insilicom LLC, Tallahassee, FL 32303
| | - Feng Pan
- Insilicom LLC, Tallahassee, FL 32303
| | | | - Keqiao Li
- Department of Statistics, Florida State University, Tallahassee, FL 32306
| | - Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL 32306
| | | | - Qing Han
- Department of Statistics, Florida State University, Tallahassee, FL 32306
| | - Wanjing Wang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
| | | | - Jian Wang
- 977 Wisteria Ter., Sunnyvale, CA 94086
| | | | | | - Jun Zhou
- Insilicom LLC, Tallahassee, FL 32303
| | - Eric Zhou
- Insilicom LLC, Tallahassee, FL 32303
| | - Ben Lee
- Insilicom LLC, Tallahassee, FL 32303
| | - Peili Zhang
- Forward Informatics, Winchester, Massachusetts, 01890
| | - Xing Qiu
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
| | - Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL 32306
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Insilicom LLC, Tallahassee, FL 32303
|
33
|
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC Bioinformatics 2023; 24:412. [PMID: 37915001 PMCID: PMC10619245 DOI: 10.1186/s12859-023-05539-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 10/19/2023] [Indexed: 11/03/2023] Open
Abstract
BACKGROUND The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: (1) they identify a relationship but not the type of relationship, (2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, (3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or (4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. RESULTS We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. 
Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. CONCLUSIONS SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
Affiliation(s)
| | - Kalpana Raja
- Morgridge Institute for Research, Madison, WI, USA
- Currently at Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
| | - John Steill
- Morgridge Institute for Research, Madison, WI, USA
| | - Cannon Lock
- Morgridge Institute for Research, Madison, WI, USA
| | - Xuancheng Tu
- Morgridge Institute for Research, Madison, WI, USA
| | - Ian Ross
- Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
| | - Lam C Tsoi
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Finn Kuusisto
- Morgridge Institute for Research, Madison, WI, USA
- Currently at Data Science Institute, University of Wisconsin, Madison, WI, USA
| | - Zijian Ni
- Department of Statistics, University of Wisconsin, Madison, WI, USA
- Currently at Amazon, Seattle, WA, USA
| | - Miron Livny
- Morgridge Institute for Research, Madison, WI, USA
- Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
| | | | - James Thomson
- Morgridge Institute for Research, Madison, WI, USA
- Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI, USA
| | - Ron Stewart
- Morgridge Institute for Research, Madison, WI, USA.
|
34
|
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023; 146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Affiliation(s)
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA.
|
35
|
Rey CA, Danguilan JL, Mendoza KP, Remolona MF. Transformer-based approach to variable typing. Heliyon 2023; 9:e20505. [PMID: 37842594 PMCID: PMC10568320 DOI: 10.1016/j.heliyon.2023.e20505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 09/19/2023] [Accepted: 09/27/2023] [Indexed: 10/17/2023] Open
Abstract
The upsurge of multifarious endeavors across scientific fields propelled Big Data in the scientific domain. Despite the advancements in management systems, researchers find that mathematical knowledge remains one of the most challenging types of knowledge to manage due to its inherent heterogeneity. One novel recourse being explored is variable typing, where current works remain preliminary and thus leave wide room for contribution. In this study, an initial attempt to implement the end-to-end Entity Recognition (ER) and Relation Extraction (RE) approach to variable typing was made using the BERT (Bidirectional Encoder Representations from Transformers) model. A micro-dataset was developed for this process. According to our findings, the ER model and RE model, respectively, have Precision of 0.8142 and 0.4919, Recall of 0.7816 and 0.6030, and F1-Scores of 0.7975 and 0.5418. Despite the limited dataset, the models performed on par with values in the literature. This work also discusses the factors affecting this BERT-based approach, giving rise to suggestions for future implementations.
Affiliation(s)
- Charles Arthel Rey
- Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
| | - Jose Lorenzo Danguilan
- Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
| | - Karl Patrick Mendoza
- Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
| | - Miguel Francisco Remolona
- Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
|
36
|
Zhao D, Yang Y, Chen P, Meng J, Sun S, Wang J, Lin H. Biomedical document relation extraction with prompt learning and KNN. J Biomed Inform 2023; 145:104459. [PMID: 37531999 DOI: 10.1016/j.jbi.2023.104459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 06/26/2023] [Accepted: 07/22/2023] [Indexed: 08/04/2023]
Abstract
Document-level relation extraction is designed to recognize connections between entities across sentences or between sentences. The current mainstream document relation extraction model is mainly based on the graph method or combined with the pre-trained language model, which makes the whole workflow relatively complex. In this work, we propose biomedical relation extraction based on prompt learning to avoid complex relation extraction processes and obtain decent performance. In particular, we present a model that combines prompt learning with T5 for document relation extraction by integrating a mask template mechanism into the model. In addition, this work also proposes a few-shot relation extraction method based on the K-nearest neighbor (KNN) algorithm with prompt learning. We select similar semantic labels through KNN, and subsequently conduct the relation extraction. The results acquired from two biomedical document benchmarks indicate that our model can improve the learning of document semantic information, achieving improvements in the relation F1 score of 3.1% on CDR.
Affiliation(s)
- Di Zhao
- School of Computer Science and Engineering, Dalian Minzu University, 116650 Dalian, China
| | - Yumeng Yang
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China.
| | - Peng Chen
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Jiana Meng
- School of Computer Science and Engineering, Dalian Minzu University, 116650 Dalian, China
| | - Shichang Sun
- School of Computer Science and Engineering, Dalian Minzu University, 116650 Dalian, China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
|
37
|
Buonocore TM, Crema C, Redolfi A, Bellazzi R, Parimbelli E. Localizing in-domain adaptation of transformer-based biomedical language models. J Biomed Inform 2023; 144:104431. [PMID: 37385327 DOI: 10.1016/j.jbi.2023.104431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 06/09/2023] [Accepted: 06/17/2023] [Indexed: 07/01/2023]
Abstract
In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use-case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards a solution to build biomedical language models that are generalizable to other less-resourced languages and different domain settings.
Affiliation(s)
- Tommaso Mario Buonocore
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, 27100, Italy.
| | - Claudio Crema
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, 25125, Italy
| | - Alberto Redolfi
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, 25125, Italy
| | - Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, 27100, Italy
| | - Enea Parimbelli
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, 27100, Italy
|
38
|
Yoon W, Yi S, Jackson R, Kim H, Kim S, Kang J. Biomedical relation extraction with knowledge base-refined weak supervision. Database (Oxford) 2023; 2023:baad054. [PMID: 37551911 PMCID: PMC10407973 DOI: 10.1093/database/baad054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 05/13/2023] [Accepted: 07/04/2023] [Indexed: 08/09/2023]
Abstract
Biomedical relation extraction (BioRE) is the task of automatically extracting and classifying relations between two biomedical entities in biomedical literature. Recent advances in BioRE research have largely been powered by supervised learning and large language models (LLMs). However, training of LLMs for BioRE with supervised learning requires human-annotated data, and the annotation process often involves challenging and expensive work. As a result, the quantity and coverage of annotated data are limiting factors for BioRE systems. In this paper, we present our system for the BioCreative VII challenge-DrugProt track, a BioRE system that leverages a language model structure and weak supervision. Our system is trained on weakly labelled data and then fine-tuned using human-labelled data. To create the weakly labelled dataset, we combined two approaches. First, we trained a model on the original dataset to predict labels on external literature, which becomes a model-labelled dataset. Then, we refined the model-labelled dataset using an external knowledge base. Based on our experiment, our approach using refined weak supervision showed a significant performance gain over the model trained using standard human-labelled datasets. Our final model showed outstanding performance at the BioCreative VII challenge, achieving 3rd place (this paper focuses on our participating system in the BioCreative VII challenge). Database URL: http://wonjin.info/biore-yoon-et-al-2022.
Affiliation(s)
- Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
| | - Sean Yi
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
| | - Richard Jackson
- AstraZeneca UK, 1 Francis Crick Ave, Trumpington, Cambridge CB2 0AA, UK
| | - Hyunjae Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
| | - Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
- AIGEN Sciences Inc., 25 Ttukseom-ro 1-gil, Seongdong-gu, Seoul 04778, South Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
- AIGEN Sciences Inc., 25 Ttukseom-ro 1-gil, Seongdong-gu, Seoul 04778, South Korea
|
39
|
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) Discovers and Annotates Biomedical Knowledge Using Co-Occurrence and Transformer Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.30.542911. [PMID: 37397987 PMCID: PMC10312590 DOI: 10.1101/2023.05.30.542911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Background The PubMed database contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: 1) they identify a relationship but not the type of relationship, 2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, 3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or 4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. Results We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM.
Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. Conclusions SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
|
40
|
Luo L, Wei CH, Lai PT, Leaman R, Chen Q, Lu Z. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 2023; 39:btad310. [PMID: 37171899 PMCID: PMC10212279 DOI: 10.1093/bioinformatics/btad310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 04/12/2023] [Accepted: 05/11/2023] [Indexed: 05/14/2023] Open
Abstract
MOTIVATION Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g. gene or disease). RESULTS We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g. the entire PubMed data). AVAILABILITY AND IMPLEMENTATION The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
41
Guo M, Zhou Z, Gotz D, Wang Y. GRAFS: Graphical Faceted Search System to Support Conceptual Understanding in Exploratory Search. ACM T INTERACT INTEL 2023. [DOI: 10.1145/3588319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
Abstract
When people search for information about a new topic within large document collections, they implicitly construct a mental model of the unfamiliar information space to represent what they currently know and guide their exploration into the unknown. Building this mental model can be challenging as it requires not only finding relevant documents, but also synthesizing important concepts and the relationships that connect those concepts both within and across documents. This paper describes a novel interactive approach designed to help users construct a mental model of an unfamiliar information space during exploratory search. We propose a new semantic search system to organize and visualize important concepts and their relations for a set of search results. A user study (n = 20) was conducted to compare the proposed approach against a baseline faceted search system on exploratory literature search tasks. Experimental results show that the proposed approach is more effective in helping users recognize relationships between key concepts, leading to a more sophisticated understanding of the search topic while maintaining similar functionality and usability as a faceted search system.
42
Wang Z, Gu Y, Zheng S, Yang L, Li J. MGREL: A multi-graph representation learning-based ensemble learning method for gene-disease association prediction. Comput Biol Med 2023; 155:106642. [PMID: 36805231 DOI: 10.1016/j.compbiomed.2023.106642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2022] [Revised: 01/15/2023] [Accepted: 02/05/2023] [Indexed: 02/12/2023]
Abstract
The identification of gene-disease associations plays an important role in the exploration of pathogenic mechanisms and therapeutic targets. In recent years, computational methods have been regarded as an effective way to discover potential gene-disease associations. However, most of them ignored the combination of abundant genetic and therapeutic information with gene-disease network topology. To this end, we re-organized the current gene-disease association benchmark dataset by extracting the newest gene-disease associations from the OMIM database. Then, we developed a multi-graph representation learning-based ensemble model, named MGREL, to predict gene-disease associations. MGREL integrated two feature generation channels to extract gene and disease features: a knowledge extraction channel, which learned high-order representations from genetic and therapeutic information, and a graph learning channel, which acquired network topological representations through multiple advanced graph representation learning methods. Then, an ensemble learning method with 5 machine learning models was used as the classifier to predict gene-disease associations. Comprehensive experiments demonstrated the significant performance achieved by MGREL compared to 5 state-of-the-art methods. For the major measurements (AUC = 0.925, AUPR = 0.935), the relative improvements of MGREL over the suboptimal methods are 3.24% and 2.75%, respectively. MGREL also achieved impressive improvements in the challenging tasks of predicting potential associations for unknown genes/diseases. In addition, case studies implied potential applications for MGREL in the discovery of potential therapeutic targets.
Affiliation(s)
- Ziyang Wang
- Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China
- Yaowen Gu
- Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China
- Si Zheng
- Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China; Institute for Artificial Intelligence, Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, 100084, China
- Lin Yang
- Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China
- Jiao Li
- Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China.
43
Ivanisenko TV, Demenkov PS, Kolchanov NA, Ivanisenko VA. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int J Mol Sci 2022; 23:ijms232314934. [PMID: 36499269 PMCID: PMC9738852 DOI: 10.3390/ijms232314934] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/14/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022] Open
Abstract
The body of scientific literature continues to grow annually. Over 1.5 million abstracts of biomedical publications were added to the PubMed database in 2021. Therefore, developing cognitive systems that provide a specialized search for information in scientific publications based on subject area ontology and modern artificial intelligence methods is urgently needed. We previously developed a web-based information retrieval system, ANDDigest, designed to search and analyze information in the PubMed database using a customized domain ontology. This paper presents an improved ANDDigest version that uses fine-tuned PubMedBERT classifiers to enhance the quality of short name recognition for molecular-genetics entities in PubMed abstracts on eight biological object types: cell components, diseases, side effects, genes, proteins, pathways, drugs, and metabolites. This approach increased average short name recognition accuracy by 13%.
Affiliation(s)
- Timofey V. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Correspondence:
- Pavel S. Demenkov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Nikolay A. Kolchanov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
- Vladimir A. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia