1
Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024; 23:1339-1347. [PMID: 38585647 PMCID: PMC10995799 DOI: 10.1016/j.csbj.2024.03.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/24/2024] [Indexed: 04/09/2024]
Abstract
Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning-based framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, the MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, significantly improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed the term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@1 score of 94.5% across five datasets. The MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on the MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9129 entities and 88,482 triplets. Lastly, the MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine out of the top ten predicted drugs have been previously reported as effective for gastric cancer treatment. Finally, an online platform was created for exploration and visualization of MKG-GC at https://www.yanglab-mi.org.cn/MKG-GC/.
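The abstract above reports results as F1 and Hits@1 scores. As a reading aid, the following is a minimal sketch of how these two metrics are conventionally computed. It is not the authors' evaluation code: the function names `hits_at_k` and `f1_score` and the toy concept identifiers are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def hits_at_k(ranked_candidates: Sequence[Sequence[str]],
              gold: Sequence[str], k: int = 1) -> float:
    """Fraction of mentions whose gold concept is among the top-k ranked candidates."""
    if len(ranked_candidates) != len(gold):
        raise ValueError("one ranked candidate list is required per gold label")
    if not gold:
        return 0.0
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical concept identifiers (not from the paper).
rankings = [["C1", "C2"], ["C3", "C4"]]
gold_labels = ["C1", "C4"]
top1 = hits_at_k(rankings, gold_labels, k=1)
top2 = hits_at_k(rankings, gold_labels, k=2)
```

In this toy case only the first mention has its gold concept ranked first, so Hits@1 is lower than Hits@2; entity normalization results such as the one reported above are scored in this way over every mention in a dataset.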
Affiliation(s)
- Yang Yang
  - Computing Science and Artificial Intelligence College, Suzhou City University, Suzhou 215004, China
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Yuwei Lu
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Zixuan Zheng
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Hao Wu
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Yuxin Lin
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Department of Urology, the First Affiliated Hospital of Soochow University, Suzhou 215000, China
- Fuliang Qian
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Medical Center of Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
- Wenying Yan
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
2
Zheng H, Xu L, Xie H, Xie J, Ma Y, Hu Y, Wu L, Chen J, Wang M, Yi Y, Huang Y, Wang D. RIscoper 2.0: A deep learning tool to extract RNA biomedical relation sentences from literature. Comput Struct Biotechnol J 2024; 23:1469-1476. [PMID: 38623560 PMCID: PMC11016866 DOI: 10.1016/j.csbj.2024.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 03/15/2024] [Accepted: 03/21/2024] [Indexed: 04/17/2024]
Abstract
RNA plays an extensive role in a multi-dimensional regulatory system, and its biomedical relationships are scattered across numerous biological studies. However, text mining work dedicated to the extraction of RNA biomedical relations remains limited. In this study, we established a comprehensive and reliable corpus of RNA biomedical relations, comprising over 30,000 sentences manually curated from more than 15,000 biomedical publications. We also updated RIscoper 2.0, a BERT-based deep learning tool to extract RNA biomedical relation sentences from literature. Benefiting from approximately 100,000 annotated named entities, we integrated the text classification and named entity recognition tasks in this tool. RIscoper 2.0 outperformed the original tool in both tasks and can discover new RNA biomedical relations. Additionally, we provided a user-friendly online search tool that enables rapid scanning of RNA biomedical relationships using local and online resources. Both the online tools and data resources of RIscoper 2.0 are available at http://www.rnainter.org/riscoper.
Affiliation(s)
- Hailong Zheng
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Linfu Xu
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Hailong Xie
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Jiajing Xie
  - National Institute for Data Science in Health and Medicine, Xiamen University, 361102 Xiamen, China
- Yapeng Ma
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Yongfei Hu
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Le Wu
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Jia Chen
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Meiyi Wang
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Ying Yi
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Yan Huang
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
- Dong Wang
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, 510515 Guangzhou, China
  - Guangdong Province Key Laboratory of Molecular Tumor Pathology, 510515, Guangzhou, China
3
Chen J, Goudey B, Geard N, Verspoor K. Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation. Bioinformatics 2024; 40:i390-i400. [PMID: 38940182 DOI: 10.1093/bioinformatics/btae246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
MOTIVATION Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process.
RESULTS We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge.
AVAILABILITY AND IMPLEMENTATION https://github.com/jiyuc/de-inconsistency.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
  - Data61, The Commonwealth Scientific and Industrial Research Organisation, Marsfield 2122, NSW, Australia
- Benjamin Goudey
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Nicholas Geard
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Victoria 3000, Australia
4
Zhang J, Jiang Q, Du Z, Geng Y, Hu Y, Tong Q, Song Y, Zhang HY, Yan X, Feng Z. Knowledge graph-derived feed efficiency analysis via pig gut microbiota. Sci Rep 2024; 14:13939. [PMID: 38886444 PMCID: PMC11182767 DOI: 10.1038/s41598-024-64835-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Accepted: 06/13/2024] [Indexed: 06/20/2024]
Abstract
Feed efficiency (FE) is essential for pig production and has been reported to be partially explained by gut microbiota. Despite an extensive body of research literature on this topic, studies regarding the regulation of feed efficiency by gut microbiota remain fragmented and mostly confined to disorganized or semi-structured unrestricted texts. Meanwhile, structured databases for microbiota analysis are available, yet they often lack a comprehensive understanding of the associated biological processes. Therefore, we have devised an approach to construct a comprehensive knowledge graph by combining unstructured textual intelligence with structured database information and applied it to investigate the relationship between pig gut microbes and FE. Firstly, we created the pgmReading knowledge base and the domain ontology of pig gut microbiota by annotating, extracting, and integrating semantic information from 157 scientific publications. Secondly, we created the pgmPubtator by utilizing PubTator to expand the semantic information related to microbiota. Thirdly, we created the pgmDatabase by mapping and combining the ADDAGMA, gutMGene, and KEGG databases based on the ontology. These three knowledge bases were integrated to form the Pig Gut Microbial Knowledge Graph (PGMKG). Additionally, we created five biological query cases to validate the performance of PGMKG. These cases not only allow us to identify microbes with the most significant impact on FE but also provide insights into the metabolites produced by these microbes and the associated metabolic pathways. This study introduces PGMKG, mapping key microbes in pig feed efficiency and guiding microbiota-targeted optimization.
Affiliation(s)
- Junmei Zhang
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Qin Jiang
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
  - Yazhouwan National Laboratory (YNL), Sanya, 572025, China
- Zhihong Du
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Yilin Geng
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Yuren Hu
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Qichang Tong
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Yunfeng Song
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Hong-Yu Zhang
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Xianghua Yan
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Zaiwen Feng
  - National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
5
Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database (Oxford) 2024; 2024:baae039. [PMID: 38805753 PMCID: PMC11131422 DOI: 10.1093/database/baae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/17/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by helping to find the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
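The abstract above describes using active learning to pick the most informative fragments to annotate. One common selection strategy is uncertainty sampling, sketched below for a binary classifier. This is an illustrative assumption only: the abstract does not state which query strategy ALAMBIC applies, and the function name `select_most_uncertain` and the fragment identifiers are hypothetical.

```python
def select_most_uncertain(probabilities: dict[str, float],
                          batch_size: int) -> list[str]:
    """Return the fragment ids whose predicted positive-class probability
    lies closest to 0.5, i.e. where a binary classifier is least certain.

    Ties are broken by fragment id so the selection is deterministic.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    ranked = sorted(
        probabilities,
        key=lambda fid: (abs(probabilities[fid] - 0.5), fid),
    )
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabelled text fragments.
scores = {"frag_a": 0.95, "frag_b": 0.52, "frag_c": 0.10, "frag_d": 0.45}
to_annotate = select_most_uncertain(scores, batch_size=2)
```

In a full loop, the selected fragments would be labelled by an expert, added to the training set, and the model retrained before the next round of selection.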
Affiliation(s)
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Jacopo De Stefani
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Anthony Cnudde
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
  - Pharmacologie, Pharmacothérapie et Suivi Pharmaceutique, Université Libre de Bruxelles, Boulevard du Triomphe, CP 205, Brussels 1050, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium
6
Hramyka D, Sczakiel HL, Zhao MX, Stolpe O, Nieminen M, Adam R, Danyel M, Einicke L, Hägerling R, Knaus A, Mundlos S, Schwartzmann S, Seelow D, Ehmke N, Mensah MA, Boschann F, Beule D, Holtgrewe M. REEV: review, evaluate and explain variants. Nucleic Acids Res 2024:gkae366. [PMID: 38769069 DOI: 10.1093/nar/gkae366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/07/2024] [Accepted: 05/03/2024] [Indexed: 05/22/2024]
Abstract
In the era of high throughput sequencing, special software is required for the clinical evaluation of genetic variants. We developed REEV (Review, Evaluate and Explain Variants), a user-friendly platform for clinicians and researchers in the field of rare disease genetics. Supporting data was aggregated from public data sources. We compared REEV with seven other tools for clinical variant evaluation. REEV (semi-)automatically fills individual ACMG criteria facilitating variant interpretation. REEV can store disease and phenotype data related to a case to use these for phenotype similarity measures. Users can create public permanent links for individual variants that can be saved as browser bookmarks and shared. REEV may help in the fast diagnostic assessment of genetic variants in a clinical as well as in a research context. REEV (https://reev.bihealth.org/) is free and open to all users and there is no login requirement.
Affiliation(s)
- Dzmitry Hramyka
  - Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
- Henrike Lisa Sczakiel
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
  - RG Development & Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Max Xiaohang Zhao
  - Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Berlin Institute of Health, Berlin, Germany
- Oliver Stolpe
  - Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
- Mikko Nieminen
  - Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
- Ronja Adam
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Berlin Institute of Health, Berlin, Germany
- Magdalena Danyel
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Lara Einicke
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Berlin Institute of Health, Berlin, Germany
- René Hägerling
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
  - RG Development & Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Berlin Institute of Health, BIH Center for Regenerative Therapies, Berlin, Germany
- Alexej Knaus
  - Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Germany
- Stefan Mundlos
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - RG Development & Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Sarina Schwartzmann
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Dominik Seelow
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Berlin Institute of Health, Berlin, Germany
- Nadja Ehmke
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Berlin Institute of Health, Berlin, Germany
- Martin Atta Mensah
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - BIH Biomedical Innovation Academy, Digital Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Felix Boschann
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute of Health, Berlin, Germany
- Dieter Beule
  - Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
  - Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Manuel Holtgrewe
  - Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
7
Di Maria A, Bellomo L, Billeci F, Cardillo A, Alaimo S, Ferragina P, Ferro A, Pulvirenti A. NetMe 2.0: a web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph. Bioinformatics 2024; 40:btae194. [PMID: 38597890 DOI: 10.1093/bioinformatics/btae194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Revised: 03/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
MOTIVATION The rapid increase of biomedical literature makes it harder and harder for scientists to keep pace with the discoveries on which they build their studies. Therefore, computational tools have become more widespread, among which network analysis plays a crucial role in several life-science contexts. Nevertheless, building correct and complete networks about some user-defined biomedical topics on top of the available literature is still challenging.
RESULTS We introduce NetMe 2.0, a web-based platform that automatically extracts relevant biomedical entities and their relations from a set of input texts (i.e. full texts or abstracts of PubMed Central papers, free texts, or PDFs uploaded by users) and models them as a BioMedical Knowledge Graph (BKG). NetMe 2.0 also implements an innovative Retrieval Augmented Generation module (Graph-RAG) that works on top of the relationships modeled by the BKG and allows the distilling of well-formed sentences that explain their content. The experimental results show that NetMe 2.0 can infer comprehensive and reliable biological networks with significant Precision-Recall metrics when compared to state-of-the-art approaches.
AVAILABILITY AND IMPLEMENTATION https://netme.click/.
Affiliation(s)
- Antonio Di Maria
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Fabrizio Billeci
  - Department of Computer Science, University of Catania, Catania, 95125, Italy
- Alfio Cardillo
  - Department of Computer Science, University of Catania, Catania, 95125, Italy
- Salvatore Alaimo
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Paolo Ferragina
  - Department of Computer Science, University of Pisa, Pisa, 56126, Italy
- Alfredo Ferro
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Alfredo Pulvirenti
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
8
Fisher JL, Wilk EJ, Oza VH, Gary SE, Howton TC, Flanary VL, Clark AD, Hjelmeland AB, Lasseigne BN. Signature reversion of three disease-associated gene signatures prioritizes cancer drug repurposing candidates. FEBS Open Bio 2024; 14:803-830. [PMID: 38531616 PMCID: PMC11073506 DOI: 10.1002/2211-5463.13796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2024] [Revised: 03/13/2024] [Accepted: 03/14/2024] [Indexed: 03/28/2024]
Abstract
Drug repurposing is promising because approving a drug for a new indication requires fewer resources than approving a new drug. Signature reversion detects drug perturbations most inversely related to the disease-associated gene signature to identify drugs that may reverse that signature. We assessed the performance and biological relevance of three approaches for constructing disease-associated gene signatures (i.e., limma, DESeq2, and MultiPLIER) and prioritized the resulting drug repurposing candidates for four low-survival human cancers. Our results were enriched for candidates that had been used in clinical trials or performed well in the PRISM drug screen. Additionally, we found that pamidronate and nimodipine, drugs predicted to be efficacious against the brain tumor glioblastoma (GBM), inhibited the growth of a GBM cell line and cells isolated from a patient-derived xenograft (PDX). Our results demonstrate that by applying multiple disease-associated gene signature methods, we prioritized several drug repurposing candidates for low-survival cancers.
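The core idea of signature reversion described above, finding drug perturbations most inversely related to a disease signature, can be illustrated with a small sketch. It is a simplification under stated assumptions and not the study's pipeline: signatures are represented as plain gene-to-log-fold-change dictionaries, Pearson correlation stands in for whatever reversal score the authors used, and the names `pearson`, `rank_reversal_candidates`, and the gene and drug labels are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_reversal_candidates(
    disease_signature: dict[str, float],
    drug_signatures: dict[str, dict[str, float]],
) -> list[tuple[str, float]]:
    """Rank drugs from most inverse (best reversal candidate) to most similar."""
    scored = []
    for drug, signature in drug_signatures.items():
        shared = sorted(set(disease_signature) & set(signature))
        if len(shared) < 2:
            continue  # too few shared genes to compare
        score = pearson(
            [disease_signature[g] for g in shared],
            [signature[g] for g in shared],
        )
        scored.append((drug, score))
    return sorted(scored, key=lambda item: item[1])


# Toy log fold changes for hypothetical genes and drugs.
disease = {"G1": 2.0, "G2": -1.0, "G3": 0.5}
drugs = {
    "drug_a": {"G1": -2.0, "G2": 1.0, "G3": -0.5},  # mirrors the disease signature
    "drug_b": {"G1": 2.0, "G2": -1.0, "G3": 0.5},   # mimics the disease signature
}
ranking = rank_reversal_candidates(disease, drugs)
```

Here `drug_a` ranks first because its perturbation is the exact inverse of the toy disease signature; in practice the disease signature would come from a method such as limma, DESeq2, or MultiPLIER, as named in the abstract.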
Affiliation(s)
- Jennifer L. Fisher
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Elizabeth J. Wilk
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Vishal H. Oza
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Sam E. Gary
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Timothy C. Howton
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Victoria L. Flanary
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Amanda D. Clark
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Anita B. Hjelmeland
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
- Brittany N. Lasseigne
  - Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, AL, USA
9
Liu J, Wu H, Robertson DH, Zhang J. Text mining and portal development for gene-specific publications on Alzheimer's disease and other neurodegenerative diseases. BMC Med Inform Decis Mak 2024; 24:98. [PMID: 38632621 PMCID: PMC11025191 DOI: 10.1186/s12911-024-02501-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Accepted: 04/04/2024] [Indexed: 04/19/2024]
Abstract
BACKGROUND Tremendous research efforts have been made in the Alzheimer's disease (AD) field to understand the disease etiology and progression and to discover treatments for AD. Many mechanistic hypotheses, therapeutic targets and treatment strategies have been proposed in the last few decades. Reviewing previous work and staying current on this ever-growing body of AD publications is an essential yet difficult task for AD researchers.
METHODS In this study, we designed and implemented a natural language processing (NLP) pipeline to extract gene-specific neurodegenerative disease (ND)-focused information from the PubMed database. The collected publication information was filtered and cleaned to construct AD-related gene-specific publication profiles. Six categories of AD-related information were extracted from the processed publication data: publication trend by year, dementia type occurrence, brain region occurrence, mouse model information, keyword occurrence, and co-occurring genes. A user-friendly web portal was then developed using the Django framework to provide gene query functions and data visualizations for the generalized and summarized publication information.
RESULTS By implementing the NLP pipeline, we extracted gene-specific ND-related publication information from the abstracts of the publications in the PubMed database. The results are summarized and visualized through an interactive web query portal. Multiple visualization windows display the ND publication trends, mouse models used, dementia types, involved brain regions, keywords related to major AD-related biological processes, and co-occurring genes. Direct links to PubMed sites are provided for all recorded publications on the query result page of the web portal.
CONCLUSION The resulting portal is a valuable tool and data source for quickly querying and displaying AD publications tailored to users' research areas and gene targets of interest, which is especially convenient for users without informatics mining skills. Our study will not only keep AD researchers updated on the progress of AD research and help them conduct preliminary examinations efficiently, but also offer additional support for hypothesis generation and validation, which will contribute significantly to the communication, dissemination, and progress of AD research.
Affiliation(s)
- Jiannan Liu
  - Department of BioHealth Informatics, Indiana University School of Informatics & Computing, Indianapolis, IN, 46202, USA
- Huanmei Wu
  - Department of BioHealth Informatics, Indiana University School of Informatics & Computing, Indianapolis, IN, 46202, USA
  - Health Services Administration & Policy, Temple University College of Public Health, Philadelphia, PA, 19122, USA
- Daniel H Robertson
  - Integrated Data Sciences, Indiana Biosciences Research Institute, Indianapolis, IN, 46202, USA
- Jie Zhang
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, 46202, USA
|
10
|
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024:gkae235. [PMID: 38572754 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
Affiliation(s)
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Alexis Allot
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Shubo Tian
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qiao Jin
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhizheng Wang
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
|
11
|
Arakane K, Imoto H, Ormersbach F, Okada M. Extending BioMASS to construct mathematical models from external knowledge. Bioinformatics Advances 2024; 4:vbae042. [PMID: 38606187 PMCID: PMC11007111 DOI: 10.1093/bioadv/vbae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 02/13/2024] [Accepted: 04/03/2024] [Indexed: 04/13/2024]
Abstract
Motivation Mechanistic modeling based on ordinary differential equations has led to numerous findings in systems biology by integrating prior knowledge and experimental data. However, the manual curation of knowledge necessary when constructing models poses a bottleneck. As the speed of knowledge accumulation continues to grow, there is a demand for a scalable means of constructing executable models. Results We previously introduced BioMASS, an open-source, Python-based framework, to construct, simulate, and analyze mechanistic models of signaling networks. With one of its features, Text2Model, BioMASS allows users to define models in a natural language-like format, thereby facilitating the construction of large-scale models. We demonstrate that Text2Model can serve as a tool for integrating external knowledge for mathematical modeling by generating Text2Model files from a pathway database or through the use of a large language model, and simulating the resulting model dynamics through BioMASS. Our findings reveal the tool's capabilities to encourage exploration from prior knowledge and pave the way for a fully data-driven approach to constructing mathematical models. Availability and implementation The code and documentation for BioMASS are available at https://github.com/biomass-dev/biomass and https://biomass-core.readthedocs.io, respectively. The code used in this article is available at https://github.com/okadalabipr/text2model-from-knowledge.
Affiliation(s)
- Kiwamu Arakane
  - Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Hiroaki Imoto
  - Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Mariko Okada
  - Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
  - Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka 565-0871, Japan
|
12
|
Mateu-Sanz M, Fuenteslópez CV, Uribe-Gomez J, Haugen HJ, Pandit A, Ginebra MP, Hakimi O, Krallinger M, Samara A. Redefining biomaterial biocompatibility: challenges for artificial intelligence and text mining. Trends Biotechnol 2024; 42:402-417. [PMID: 37858386 DOI: 10.1016/j.tibtech.2023.09.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 09/25/2023] [Accepted: 09/26/2023] [Indexed: 10/21/2023]
Abstract
The surge in 'Big data' has significantly influenced biomaterials research and development, with vast data volumes emerging from clinical trials, scientific literature, electronic health records, and other sources. Biocompatibility is essential in developing safe medical devices and biomaterials to perform as intended without provoking adverse reactions. Therefore, establishing an artificial intelligence (AI)-driven biocompatibility definition has become decisive for automating data extraction and profiling safety effectiveness. This definition should both reflect the attributes related to biocompatibility and be compatible with computational data-mining methods. Here, we discuss the need for a comprehensive and contemporary definition of biocompatibility and the challenges in developing one. We also identify the key elements that comprise biocompatibility, and propose an integrated biocompatibility definition that enables data-mining approaches.
Affiliation(s)
- Miguel Mateu-Sanz
  - Biomaterials, Biomechanics, and Tissue Engineering Group, Department of Materials Science and Engineering, Universitat Politècnica de Catalunya, Barcelona 08019, Spain
- Carla V Fuenteslópez
  - Institute of Biomedical Engineering, Botnar Research Centre, Nuffield Orthopaedic Centre, University of Oxford, Oxford OX3 7LD, UK
- Juan Uribe-Gomez
  - CÚRAM, SFI Research Centre for Medical Devices, University of Galway, Galway H92 W2TY, Ireland
- Håvard Jostein Haugen
  - Department of Biomaterials, Center for Functional Tissue Reconstruction, Faculty of Dentistry, University of Oslo, Oslo 0317, Norway
- Abhay Pandit
  - CÚRAM, SFI Research Centre for Medical Devices, University of Galway, Galway H92 W2TY, Ireland
- Maria-Pau Ginebra
  - Biomaterials, Biomechanics, and Tissue Engineering Group, Department of Materials Science and Engineering, Universitat Politècnica de Catalunya, Barcelona 08019, Spain
- Osnat Hakimi
  - aMoon Ventures, Yerushalaim Rd 34, Ra'anana 4350108, Israel
- Athina Samara
  - Department of Biomaterials, Center for Functional Tissue Reconstruction, Faculty of Dentistry, University of Oslo, Oslo 0317, Norway
|
13
|
Wittau J, Celik S, Kacprowski T, Deserno TM, Seifert R. Fake paper identification in the pool of withdrawn and rejected manuscripts submitted to Naunyn-Schmiedeberg's Archives of Pharmacology. Naunyn-Schmiedeberg's Archives of Pharmacology 2024; 397:2171-2181. [PMID: 37796310 PMCID: PMC10933159 DOI: 10.1007/s00210-023-02741-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/18/2023] [Accepted: 09/20/2023] [Indexed: 10/06/2023]
Abstract
Honesty of publications is fundamental in science. Unfortunately, science has an increasing fake paper problem with multiple cases having surfaced in recent years, even in renowned journals. There are companies, the so-called paper mills, which professionally fake research data and papers. However, there is no easy way to systematically identify these papers. Here, we show that scanning for exchanged authors in resubmissions is a simple approach to detect potential fake papers. We investigated 2056 withdrawn or rejected submissions to Naunyn-Schmiedeberg's Archives of Pharmacology (NSAP), 952 of which were subsequently published in other journals. In six cases, the stated authors of the final publications differed by more than two thirds from those named in the submission to NSAP. In four cases, they differed completely. Our results reveal that paper mills take advantage of the fact that journals are unaware of submissions to other journals. Consequently, papers can be submitted multiple times (even simultaneously), and authors can be replaced if they withdraw from their purchased authorship. We suggest that publishers collaborate with each other by sharing titles, authors, and abstracts of their submissions. Doing so would allow the detection of suspicious changes in the authorship of submitted and already published papers. Independently of such collaboration across publishers, every scientific journal can make an important contribution to the integrity of the scientific record by analyzing its own pool of withdrawn and rejected papers versus published papers according to the simple algorithm proposed in the present paper.
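The author-comparison step described in this abstract can be illustrated with a minimal Python sketch. The abstract does not specify how author lists were matched or how the difference was computed, so everything below is an assumption made for illustration: the function names (`normalize`, `author_change_fraction`, `is_suspicious`), the case- and whitespace-insensitive name matching, and the choice to measure the share of published authors who were absent from the original submission. Only the two-thirds cut-off is taken from the abstract.

```python
import re


def normalize(name: str) -> str:
    """Lower-case a name and collapse whitespace (hypothetical matching rule)."""
    return re.sub(r"\s+", " ", name).strip().lower()


def author_change_fraction(submitted, published):
    """Share of published authors who were not named in the submission."""
    sub = {normalize(a) for a in submitted}
    pub = {normalize(a) for a in published}
    if not pub:
        raise ValueError("published author list is empty")
    return len(pub - sub) / len(pub)


def is_suspicious(submitted, published, threshold=2 / 3):
    """Flag a resubmission whose authorship changed by more than the threshold."""
    return author_change_fraction(submitted, published) > threshold


# Illustrative, made-up author lists
submitted = ["A. Smith", "B. Jones", "C. Lee"]
published = ["a. smith", "D. Kim", "E. Park", "F. Chen"]
print(author_change_fraction(submitted, published))  # 0.75
print(is_suspicious(submitted, published))           # True
```

A real screening tool would likely need fuzzier name matching (initials, transliteration variants), which this sketch deliberately leaves out.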
Affiliation(s)
- Jonathan Wittau
  - Institute of Pharmacology, Hannover Medical School, Carl-Neuberg-Straße 1, 30625, Hannover, Germany
- Serkan Celik
  - Braunschweig Integrated Centre of Systems Biology, TU Braunschweig, Braunschweig, Germany
  - Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, 38106, Braunschweig, Germany
- Tim Kacprowski
  - Braunschweig Integrated Centre of Systems Biology, TU Braunschweig, Braunschweig, Germany
  - Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, 38106, Braunschweig, Germany
- Thomas M Deserno
  - Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, 38106, Braunschweig, Germany
- Roland Seifert
  - Institute of Pharmacology, Hannover Medical School, Carl-Neuberg-Straße 1, 30625, Hannover, Germany
|
14
|
Rai P, Jain A, Kumar S, Sharma D, Jha N, Chawla S, Raj A, Gupta A, Poonia S, Majumdar A, Chakraborty T, Ahuja G, Sengupta D. Literature mining discerns latent disease-gene relationships. Bioinformatics 2024; 40:btae185. [PMID: 38608194 PMCID: PMC11060865 DOI: 10.1093/bioinformatics/btae185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Revised: 01/30/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open
Abstract
MOTIVATION Dysregulation of a gene's function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily due to genetic pleiotropy and lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape project, researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates. RESULTS To circumvent this, we mined ∼18 million PubMed abstracts published up to May 2019 and automatically selected ∼4.5 million of them that describe roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained bidirectional encoder representations from transformers (BERT) for language modeling from the domain of natural language processing to learn vector representations of entities such as genes, diseases, tissues, cell-types, etc., in such a way that their relationships are preserved in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in silico synthesis of hypotheses linking different biological entities such as genes and conditions. AVAILABILITY AND IMPLEMENTATION PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model. BioSentVec-based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model. Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap.
Affiliation(s)
- Priyadarshini Rai
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Atishay Jain
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Shivani Kumar
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Divya Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Neha Jha
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Smriti Chawla
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Abhijit Raj
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Apoorva Gupta
  - Department of Biotechnology, Delhi Technological University, Shahbad Daulatpur, Delhi 110042, India
- Sarita Poonia
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Tanmoy Chakraborty
  - Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi 110016, India
  - Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi 110016, India
- Gaurav Ahuja
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
  - Centre for Artificial Intelligence, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
- Debarka Sengupta
  - Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
  - Centre for Artificial Intelligence, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
|
15
|
Richardson R, Tejedor Navarro H, Amaral LAN, Stoeger T. Meta-Research: Understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results. eLife 2024; 12:RP93429. [PMID: 38546716 PMCID: PMC10977968 DOI: 10.7554/elife.93429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/01/2024] Open
Abstract
Present-day publications on human genes primarily feature genes that already appeared in many publications prior to completion of the Human Genome Project in 2003. These patterns persist despite the subsequent adoption of high-throughput technologies, which routinely identify novel genes associated with biological processes and disease. Although several hypotheses for bias in the selection of genes as research targets have been proposed, their explanatory powers have not yet been compared. Our analysis suggests that understudied genes are systematically abandoned in favor of better-studied genes between the completion of -omics experiments and the reporting of results. Understudied genes remain abandoned by studies that cite these -omics experiments. Conversely, we find that publications on understudied genes may even accrue a greater number of citations. Among 45 biological and experimental factors previously proposed to affect which genes are being studied, we find that 33 are significantly associated with the choice of hit genes presented in titles and abstracts of -omics studies. To promote the investigation of understudied genes, we condense our insights into a tool, find my understudied genes (FMUG), that allows scientists to engage with potential bias during the selection of hits. We demonstrate the utility of FMUG through the identification of genes that remain understudied in vertebrate aging. FMUG is developed in Flutter and is available for download at fmug.amaral.northwestern.edu as a MacOS/Windows app.
Affiliation(s)
- Reese Richardson
  - Interdisciplinary Biological Sciences, Northwestern University, Evanston, United States
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, United States
- Heliodoro Tejedor Navarro
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, United States
  - Northwestern Institute on Complex Systems, Northwestern University, Evanston, United States
- Luis A Nunes Amaral
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, United States
  - Northwestern Institute on Complex Systems, Northwestern University, Evanston, United States
  - Department of Molecular Biosciences, Northwestern University, Evanston, United States
  - Department of Physics and Astronomy, Northwestern University, Evanston, United States
- Thomas Stoeger
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, United States
  - The Potocsnak Longevity Institute, Northwestern University, Chicago, United States
  - Simpson Querrey Lung Institute for Translational Science, Northwestern University, Chicago, United States
|
16
|
Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024; 25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large annotated corpora, often manually or semi-manually, which are vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations and integrates automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools. RESULTS We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance. CONCLUSIONS MetaTron stands out as one of the few annotation tools targeting the biomedical domain that supports the annotation of relations and is fully customizable with documents in several formats (PDF included), as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
Affiliation(s)
- Ornella Irrera
  - Department of Information Engineering, University of Padova, Padua, Italy
- Stefano Marchesin
  - Department of Information Engineering, University of Padova, Padua, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padova, Padua, Italy
|
17
|
Yao X, He Z, Liu Y, Wang Y, Ouyang S, Xia J. Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer. Sci Data 2024; 11:265. [PMID: 38431735 PMCID: PMC10908799 DOI: 10.1038/s41597-024-03083-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 02/20/2024] [Indexed: 03/05/2024] Open
Abstract
It is vital to investigate the complex mechanisms underlying tumors to better understand cancer and develop effective treatments. Metabolic abnormalities and clinical phenotypes can serve as essential biomarkers for diagnosing this challenging disease. Additionally, genetic alterations provide profound insights into the fundamental aspects of cancer. This study introduces Cancer-Alterome, a literature-mined dataset that focuses on the regulatory events of an organism's biological processes or clinical phenotypes caused by genetic alterations. By proposing and leveraging a text-mining pipeline, we identify 16,681K regulatory event records encompassing 21K genes, 157K genetic alterations and 154K downstream bio-concepts, extracted from 4,354K pan-cancer publications. The resulting dataset empowers a multifaceted investigation of cancer pathology, enabling the meticulous tracking of relevant literature support. Its potential applications extend to evidence-based medicine and precision medicine, yielding valuable insights for further advancements in cancer research.
Affiliation(s)
- Xinzhi Yao
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Zhihan He
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yawen Liu
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yuxing Wang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, P.R. China
- Sizhuo Ouyang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Jingbo Xia
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
|
18
|
Liu J, Li J, Jin F, Li Q, Zhao G, Wu L, Li X, Xia J, Cheng N. dbCRAF: a curated knowledgebase for regulation of radiation response in human cancer. NAR Cancer 2024; 6:zcae008. [PMID: 38406264 PMCID: PMC10894039 DOI: 10.1093/narcan/zcae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 12/10/2023] [Accepted: 02/15/2024] [Indexed: 02/27/2024] Open
Abstract
Radiation therapy (RT) is one of the primary treatment modalities of cancer, with 40-60% of cancer patients benefiting from RT during their treatment course. The intrinsic radiosensitivity or acquired radioresistance of tumor cells would affect the response to RT and clinical outcomes in patients. Thus, mining the regulatory mechanisms in tumor radiosensitivity or radioresistance that have been verified by biological experiments and computational analysis methods will enhance the overall understanding of RT. Here, we describe a comprehensive database dbCRAF (http://dbCRAF.xialab.info/) to document and annotate the factors (1,677 genes, 49 proteins and 612 radiosensitizers) linked with radiation response, including radiosensitivity, radioresistance in cancer cells and prognosis in cancer patients receiving RT. On the one hand, dbCRAF enables researchers to directly access knowledge for regulation of radiation response in human cancer buried in the vast literature. On the other hand, dbCRAF provides four flexible modules to analyze and visualize the functional relationship between these factors and clinical outcome, KEGG pathway and target genes. In conclusion, dbCRAF serves as a valuable resource for elucidating the regulatory mechanisms of radiation response in human cancers as well as for the improvement of RT options.
Affiliation(s)
- Jie Liu
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
- Jing Li
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
- Fangfang Jin
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
- Qian Li
  - School of Environmental Science and Optoelectronic Technology, University of Science and Technology of China, Hefei, Anhui 230026, China
- Guoping Zhao
  - Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui 230031, China
- Lijun Wu
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
  - Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui 230031, China
- Xiaoyan Li
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
- Junfeng Xia
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
- Na Cheng
  - School of Biomedical Engineering, Anhui Medical University, Hefei, Anhui 230032, China
|
19
|
Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024:10.1007/s12539-024-00605-2. [PMID: 38340264 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]
Abstract
We report a combined manual annotation and deep-learning natural language processing study aimed at accurate entity extraction from biomedical literature on hereditary diseases. A total of 400 full articles were manually annotated based on published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The performance of our manual annotations was assessed by comparing our re-annotated results with those publicly available. The overall Jaccard index was calculated to be 0.866 for the four entity types: gene, variant, disease and species. Both a BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested. Because the manually annotated corpus is limited, these NER models were fine-tuned in two phases. The F1-scores of BERT-based NER for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of DistilBERT-based NER are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the variant entity type has been extracted by a large language model for the first time, and an F1-score comparable to that of the state-of-the-art variant extraction model tmVar has been achieved.
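For readers unfamiliar with the agreement measure mentioned in this abstract, the Jaccard index of two annotation sets A and B is |A ∩ B| / |A ∪ B|. The Python sketch below shows the calculation on made-up data. How the authors represented and matched annotations is not stated in the abstract; the `(start, end, entity_type)` tuples, the exact-match criterion, and the convention of returning 1.0 for two empty sets are all assumptions made here for illustration.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of annotations."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty annotation sets agree perfectly.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical annotations as (start offset, end offset, entity type)
ours = {(0, 5, "gene"), (12, 20, "variant"), (31, 44, "disease")}
reference = {(0, 5, "gene"), (12, 20, "variant"), (50, 55, "species")}

print(jaccard_index(ours, reference))  # 0.5 (2 shared annotations out of 4 distinct ones)
```

Exact matching is the strictest choice; a relaxed variant that counts overlapping spans as matches would yield higher agreement on the same data.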
Affiliation(s)
- Dao-Ling Huang
  - BGI Research, Shenzhen, 518083, China.
  - Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.
- Quanlei Zeng
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yun Xiong
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Shuixia Liu
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Chaoqun Pang
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Menglei Xia
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Ting Fang
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yanli Ma
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Cuicui Qiang
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yi Zhang
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yu Zhang
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Hong Li
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
- Yuying Yuan
  - Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China
20
Tumilovich A, Yablokov E, Mezentsev Y, Ershov P, Basina V, Gnedenko O, Kaluzhskiy L, Tsybruk T, Grabovec I, Kisel M, Shabunya P, Soloveva N, Vavilov N, Gilep A, Ivanov A. The Multienzyme Complex Nature of Dehydroepiandrosterone Sulfate Biosynthesis. Int J Mol Sci 2024; 25:2072. [PMID: 38396748 PMCID: PMC10889563 DOI: 10.3390/ijms25042072] [Received: 12/12/2023] [Revised: 01/16/2024] [Accepted: 01/26/2024] [Indexed: 02/25/2024] Open
Abstract
Dehydroepiandrosterone (DHEA), a precursor of steroid sex hormones, is synthesized by steroid 17-alpha-hydroxylase/17,20-lyase (CYP17A1) with the participation of microsomal cytochrome b5 (CYB5A) and cytochrome P450 reductase (CPR), followed by sulfation by two cytosolic sulfotransferases, SULT1E1 and SULT2A1, for storage and transport to tissues in which its synthesis is not available. The involvement of CYP17A1 and SULTs in these successive reactions led us to consider the possible interaction of SULTs with DHEA-producing CYP17A1 and its redox partners. Text mining analysis, protein-protein network analysis, and gene co-expression analysis were performed to determine the relationships between SULTs and microsomal CYP isoforms. For the first time, using surface plasmon resonance, we detected interactions between CYP17A1 and SULT2A1 or SULT1E1. SULTs also interacted with CYB5A and CPR. The interaction parameters of SULT2A1/CYP17A1 and SULT2A1/CYB5A complexes seemed to be modulated by 3'-phosphoadenosine-5'-phosphosulfate (PAPS). Affinity purification, combined with mass spectrometry (AP-MS), allowed us to identify a spectrum of SULT1E1 potential protein partners, including CYB5A. We showed that the enzymatic activity of SULTs increased in the presence of only CYP17A1 or CYP17A1 and CYB5A mixture. The structures of CYP17A1/SULT1E1 and CYB5A/SULT1E1 complexes were predicted. Our data provide novel fundamental information about the organization of microsomal CYP-dependent macromolecular complexes.
Affiliation(s)
- Anastasiya Tumilovich
  - Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
- Evgeniy Yablokov
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Yuri Mezentsev
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Pavel Ershov
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Viktoriia Basina
  - Research Centre for Medical Genetics, 1 Moskvorechye Street, 115522 Moscow, Russia
- Oksana Gnedenko
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Leonid Kaluzhskiy
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Tatsiana Tsybruk
  - Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
- Irina Grabovec
  - Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
- Maryia Kisel
  - Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
- Polina Shabunya
  - Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
- Natalia Soloveva
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Nikita Vavilov
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Andrei Gilep
  - Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
- Alexis Ivanov
  - Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
21
Kilicoglu H, Ensan F, McInnes B, Wang LL. Semantics-enabled biomedical literature analytics. J Biomed Inform 2024; 150:104588. [PMID: 38244957 DOI: 10.1016/j.jbi.2024.104588] [Received: 01/08/2024] [Accepted: 01/10/2024] [Indexed: 01/22/2024]
Affiliation(s)
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana Champaign, Champaign, IL, USA.
- Faezeh Ensan
  - Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, ON, Canada.
- Bridget McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
- Lucy Lu Wang
  - Information School, University of Washington, Seattle, WA, USA.
22
Reed CJ, Denise R, Hourihan J, Babor J, Jaroch M, Martinelli M, Hutinet G, de Crécy-Lagard V. Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families. Microb Genom 2024; 10:001183. [PMID: 38323604 PMCID: PMC10926702 DOI: 10.1099/mgen.0.001183] [Received: 10/03/2023] [Accepted: 01/08/2024] [Indexed: 02/08/2024] Open
Abstract
Capturing the published corpus of information on all members of a given protein family should be an essential step in any study focusing on specific members of that family. Using a previously gathered dataset of more than 280 references mentioning a member of the DUF34 (NIF3/Ngg1-interacting Factor 3) family, we evaluated the efficiency of different databases and search tools, and devised a workflow that experimentalists can use to capture the most information published on members of a protein family in the least amount of time. To complement this workflow, web-based platforms allowing for the exploration of protein family members across sequenced genomes or for the analysis of gene neighbourhood information were reviewed for their versatility and ease of use. Recommendations that can be used for experimentalist users, as well as educators, are provided and integrated within a customized, publicly accessible Wiki.
Affiliation(s)
- Colbie J. Reed
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Rémi Denise
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - APC Microbiome Ireland, University College Cork, Cork, Ireland
- Jacob Hourihan
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Jill Babor
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Marshall Jaroch
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Maria Martinelli
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, USA
- Valérie de Crécy-Lagard
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - Department of Biology, Haverford College, Haverford, PA, USA
  - UF Genetics Institute, University of Florida, Gainesville, FL, USA
23
Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024; 100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] Open
Abstract
Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming at helping readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: (1) evidence-based medicine; (2) precision medicine and genomics; (3) searching by meaning, including questions; (4) finding related articles with literature recommendation; and (5) discovering hidden associations through literature mining. Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.
Affiliation(s)
- Qiao Jin
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
24
Alqaissi E, Alotaibi F, Sher Ramzan M, Algarni A. Novel graph-based machine-learning technique for viral infectious diseases: application to influenza and hepatitis diseases. Ann Med 2024; 55:2304108. [PMID: 38242107 PMCID: PMC10802812 DOI: 10.1080/07853890.2024.2304108] [Received: 08/30/2023] [Accepted: 12/18/2023] [Indexed: 01/21/2024] Open
Abstract
BACKGROUND: Most infectious diseases are caused by viruses, fungi, bacteria and parasites. Their ability to easily infect humans and trigger large-scale epidemics makes them a public health concern. Methods for early detection of these diseases have been developed; however, they are hindered by the absence of a unified, interoperable and reusable model. This study seeks to create a holistic and real-time model for swift, preliminary detection of infectious diseases using symptoms and additional clinical data. MATERIALS AND METHODS: In this study, we present a medical knowledge graph (MKG) that leverages multiple data sources to analyse connections between different nodes. Medical ontologies were used to enhance the MKG. We applied various graph algorithms to extract key features. The performance of multiple machine-learning (ML) techniques for influenza and hepatitis detection was assessed, selecting multi-layer perceptron (MLP) and random forest (RF) models due to their superior outcomes. The hyperparameters of both graph-based ML models were automatically fine-tuned. RESULTS: Both the graph-based MLP and RF models showcased the least loss and error rates, along with the most specific, accurate recall, precision and F1 scores. Their Matthews correlation coefficients were also optimal. When compared with existing ML techniques and findings from the literature, these graph-based ML models manifested superior detection accuracy. CONCLUSIONS: The graph-based MLP and RF models effectively diagnosed influenza and hepatitis, respectively. This underlines the potential of graph data science in enhancing ML model performance and uncovering concealed relationships in the MKG.
Affiliation(s)
- Eman Alqaissi
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
  - Computer Science and Information Systems, The Applied College, King Khalid University, Abha, Saudi Arabia
- Fahd Alotaibi
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Muhammad Sher Ramzan
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
25
Gao J, Mo S, Wang J, Zhang M, Shi Y, Zhu C, Shang Y, Tang X, Zhang S, Wu X, Xu X, Wang Y, Li Z, Zheng G, Chen Z, Wang Q, Tang K, Cao Z. MACC: a visual interactive knowledgebase of metabolite-associated cell communications. Nucleic Acids Res 2024; 52:D633-D639. [PMID: 37897362 PMCID: PMC10767829 DOI: 10.1093/nar/gkad914] [Received: 08/15/2023] [Revised: 09/21/2023] [Accepted: 10/10/2023] [Indexed: 10/30/2023] Open
Abstract
Metabolite-associated cell communications (MACCs) play critical roles in maintaining normal biological function in humans by coordinating cells, organs and physiological systems. Though substantial information on MACCs has been continuously reported, no relevant database has become available so far. To address this gap, we here developed the first knowledgebase (MACC) to comprehensively describe human metabolite-associated cell communications through curation of the experimental literature. MACC currently contains: (a) 4206 carefully curated metabolite-associated cell communication pairs involving 244 human endogenous metabolites and reported biological effects in vivo and in vitro; (b) 226 comprehensive cell subtypes and 296 disease states, such as cancers, autoimmune diseases, and pathogenic infections; (c) 4508 metabolite-related enzymes and transporters, involving 542 pathways; (d) an interactive tool with a user-friendly interface to visualize networks of multiple metabolite-cell interactions; and (e) an overall expression landscape of metabolite-associated gene sets derived from over 1500 single-cell expression profiles to infer metabolite variations across different cells in the sample. Also, MACC enables cross-links to well-known databases, such as HMDB, DrugBank, TTD and PubMed. In complement to ligand-receptor databases, MACC may give new perspectives of alternative communication between cells via metabolite secretion and adsorption, together with the resulting biological functions. MACC is publicly accessible at: http://macc.badd-cao.net/.
Affiliation(s)
- Jian Gao
  - School of Life Sciences, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
  - Department of Thoracic Surgery and State Key Laboratory of Genetic Engineering, Fudan University Shanghai Cancer Center, Shanghai, China
- Saifeng Mo
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Jun Wang
  - School of Life Sciences, Fudan University, Shanghai, China
- Mou Zhang
  - School of Life Sciences, Fudan University, Shanghai, China
- Yao Shi
  - School of Life Sciences, Fudan University, Shanghai, China
- Chuhan Zhu
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Yuxuan Shang
  - Biological Sciences, University of California Santa Barbara, CA, USA
- Xinyue Tang
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Shiyue Zhang
  - School of Life Sciences, Fudan University, Shanghai, China
- Xinwen Wu
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Xinyan Xu
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Yiheng Wang
  - School of Life Sciences, Fudan University, Shanghai, China
- Zihao Li
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Genhui Zheng
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Zikun Chen
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Qiming Wang
  - School of Life Sciences, Fudan University, Shanghai, China
- Kailin Tang
  - Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Zhiwei Cao
  - School of Life Sciences, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
26
Savage SR, Zhang Y, Jaehnig EJ, Liao Y, Shi Z, Pham HA, Xu H, Zhang B. IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining. Mol Cell Proteomics 2024; 23:100682. [PMID: 37993103 PMCID: PMC10716774 DOI: 10.1016/j.mcpro.2023.100682] [Received: 05/17/2023] [Revised: 10/25/2023] [Accepted: 11/14/2023] [Indexed: 11/24/2023] Open
Abstract
Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.
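The precision and recall figures quoted in this abstract follow the standard set-based definitions, which can be sketched as below. This is only an illustration under stated assumptions: the `(gene_symbol, site)` pair representation and the example pairs are hypothetical stand-ins, and the abstract does not describe the study's actual evaluation code.

```python
from typing import Set, Tuple

# Hypothetical extraction record: (gene symbol, phosphosite position).
SitePair = Tuple[str, str]


def precision_recall(predicted: Set[SitePair], gold: Set[SitePair]) -> Tuple[float, float]:
    """Precision = TP / |predicted|, recall = TP / |gold|; 0.0 when a denominator is empty."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative pairs only, not results from the paper.
predicted = {("AKT1", "S473"), ("AKT1", "T308"), ("TP53", "S15"), ("MAPK1", "Y187")}
gold = {("AKT1", "S473"), ("AKT1", "T308"), ("TP53", "S15"), ("EGFR", "Y1068"), ("RB1", "S780")}

precision, recall = precision_recall(predicted, gold)  # 3 of 4 predicted, 3 of 5 gold
```

Comparing pairs rather than substrates and sites separately means an extraction only counts as correct when both the protein and the position match.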
Affiliation(s)
- Sara R Savage
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Eric J Jaehnig
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Yuxing Liao
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Zhiao Shi
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, Connecticut, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.
27
Jeynes JCG, James T, Corney M. Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls. Methods Mol Biol 2024; 2716:223-240. [PMID: 37702942 DOI: 10.1007/978-1-0716-3449-3_10] [Indexed: 09/14/2023]
Abstract
Building and analyzing knowledge graphs (KGs) to aid drug discovery is a topical area of research. A salient feature of KGs is their ability to combine many heterogeneous data sources in a format that facilitates discovering connections. The utility of KGs has been exemplified in areas such as drug repurposing, with insights made through manual exploration and modeling of the data. In this chapter, we discuss promises and pitfalls of using natural language processing (NLP) to mine "unstructured text", typically from scientific literature, as a data source for KGs. This draws on our experience of initially parsing "structured" data sources, such as ChEMBL, as the basis for data within a KG, and then enriching or expanding upon them using NLP. The fundamental promise of NLP for KGs is the automated extraction of data from millions of documents, a task practically impossible to do via human curation alone. However, there are many potential pitfalls in NLP-KG pipelines, such as incorrect named entity recognition and ontology linking, all of which could ultimately lead to erroneous inferences and conclusions.
Affiliation(s)
- J Charles G Jeynes
  - Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK.
- Tim James
  - Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK.
- Matthew Corney
  - Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK
28
Fuenteslópez CV, McKitrick A, Corvi J, Ginebra MP, Hakimi O. Biomaterials text mining: A hands-on comparative study of methods on polydioxanone biocompatibility. N Biotechnol 2023; 77:161-175. [PMID: 37673372 DOI: 10.1016/j.nbt.2023.09.001] [Received: 06/01/2023] [Revised: 08/14/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Scientific information extraction is fundamental for research and innovation, but is currently mostly a manual, time-consuming process. Text Mining tools (TMTs) enable automated, accurate and quick information extraction from text, but there is little precedent of their use in the biomaterials field. Here, we compare the ability of various TMTs to extract useful information from biomaterials abstracts. Focusing on the biocompatibility of polydioxanone, a biodegradable polymer for which there are relatively few scientific publications, we tested several tools ranging from machine learning approaches and statistical text analysis to MeSH indexing and domain-specific semantic tools for Named Entity Recognition. We also evaluated their output alongside a manual review of systematic reviews and meta-analyses. The findings show that TMTs can be highly efficient and powerful for mapping biomaterials texts and rapidly yield up-to-date information. Here, TMTs enable one to identify dominating themes, see the evolution of specific terms and topics, and learn about key medical applications in biomaterials literature over the years. The analysis also shows that ambiguity around biomaterials nomenclature is a significant challenge in mining biomedical literature that is yet to be tackled. This research showcases the potential value of using Natural Language Processing and domain-specific tools to extract and organize biomaterials data.
Affiliation(s)
- Carla V Fuenteslópez
  - Institute of Biomedical Engineering, Botnar Research Centre, Nuffield Orthopaedic Centre, University of Oxford, Oxford OX3 7LD, UK.
- Austin McKitrick
  - Institute of Social Research, University of Michigan, MI 48104, USA
- Javier Corvi
  - Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- Maria-Pau Ginebra
  - Department of Materials Science and Engineering, Universitat Politècnica de Catalunya, Barcelona 08019, Spain
- Osnat Hakimi
  - Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain; Department of Materials Science and Engineering, Universitat Politècnica de Catalunya, Barcelona 08019, Spain; Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya, Barcelona 08017, Spain.
29
He F, Liu K, Yang Z, Chen Y, Hammer RD, Xu D, Popescu M. pathCLIP: Detection of Genes and Gene Relations from Biological Pathway Figures through Image-Text Contrastive Learning. bioRxiv 2023:2023.10.31.564859. [PMID: 37961680 PMCID: PMC10635012 DOI: 10.1101/2023.10.31.564859] [Indexed: 11/15/2023]
Abstract
In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provide insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. A case study on extracting pathway information from non-small cell lung cancer literature further demonstrates the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.
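As a rough illustration of what "image-text contrastive learning" typically means, the sketch below computes a symmetric, CLIP-style contrastive loss over paired embeddings in plain Python. The abstract does not give pathCLIP's loss or hyperparameters, so the symmetric cross-entropy form, the `temperature` value of 0.07 and the toy vectors are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[Vector]) -> float:
    """Mean of -log softmax(row)[i], treating item i in each row as the true match."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_emb: List[Vector], text_emb: List[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric loss: image-to-text and text-to-image cross-entropy, averaged."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy embeddings: row k of each list is assumed to describe the same gene or relation.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image snippet is most similar to its own text description and large when the pairing is swapped, which is the signal that pulls matched pairs together during training.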
Affiliation(s)
- Fei He
  - School of Information Science and Technology, Northeast Normal University, Changchun 130000, China; Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Kai Liu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
- Zhiyuan Yang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
- Yibo Chen
  - Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Richard D Hammer
  - School of Medicine, University of Missouri, Columbia Missouri, MO 65211 USA
- Dong Xu
  - Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Mihail Popescu
  - School of Medicine, University of Missouri, Columbia Missouri, MO 65211 USA
30
Weber L, Barth F, Lorenz L, Konrath F, Huska K, Wolf J, Leser U. PEDL+: protein-centered relation extraction from PubMed at your fingertip. Bioinformatics 2023; 39:btad603. [PMID: 37950510 PMCID: PMC10660277 DOI: 10.1093/bioinformatics/btad603] [Received: 05/15/2023] [Revised: 08/29/2023] [Accepted: 10/31/2023] [Indexed: 11/12/2023] Open
Abstract
SUMMARY: Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE often is part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful. AVAILABILITY AND IMPLEMENTATION: PEDL+ is freely available at https://github.com/leonweber/pedl.
Affiliation(s)
- Leon Weber
- Center for Information and Language Processing, Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, München 80539, Germany
| | - Fabio Barth
- Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
| | - Leonie Lorenz
- Pathogen Informatics and Modelling, EMBL-EBI, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - Fabian Konrath
- Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin 13125, Germany
| | - Kirsten Huska
- Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin 13125, Germany
| | - Jana Wolf
- Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin 13125, Germany
- Department of Mathematics and Computer Science, Free University Berlin, Berlin, 14195, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
| |
|
31
|
Garda S, Weber-Genzel L, Martin R, Leser U. BELB: a biomedical entity linking benchmark. Bioinformatics 2023; 39:btad698. [PMID: 37975879 PMCID: PMC10681865 DOI: 10.1093/bioinformatics/btad698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 10/30/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage KB UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. RESULTS We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.
Affiliation(s)
- Samuele Garda
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Leon Weber-Genzel
- Center for Information and Language Processing, Ludwig-Maximilians-Universität München, München 80539, Germany
| | - Robert Martin
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| |
|
32
|
Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023; 10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
Affiliation(s)
- Xiao Yang
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Shyamasree Saha
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Aravind Venkatesan
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Santosh Tirunagari
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK.
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Vid Vartak
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Johanna McEntyre
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| |
|
33
|
Wei CH, Luo L, Islamaj R, Lai PT, Lu Z. GNorm2: an improved gene name recognition and normalization system. Bioinformatics 2023; 39:btad599. [PMID: 37878810 PMCID: PMC10612401 DOI: 10.1093/bioinformatics/btad599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 09/06/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION Gene name normalization is an important yet highly complex task in biomedical text mining research, as gene names can be highly ambiguous and may refer to different genes in different species or share similar names with other bioconcepts. This poses a challenge for accurately identifying and linking gene mentions to their corresponding entries in databases such as NCBI Gene or UniProt. While there has been a body of literature on the gene normalization task, few have addressed all of these challenges or make their solutions publicly available to the scientific community. RESULTS Building on the success of GNormPlus, we have created GNorm2: a more advanced tool with optimized functions and improved performance. GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date. Our tool is freely available for download. AVAILABILITY AND IMPLEMENTATION https://github.com/ncbi/GNorm2.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| |
|
34
|
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023; 146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Affiliation(s)
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA.
| |
|
35
|
Marchesin S, Menotti L, Giachelle F, Silvello G, Alonso O. Building a large gene expression-cancer knowledge base with limited human annotations. Database (Oxford) 2023; 2023:baad061. [PMID: 37768281 PMCID: PMC10533344 DOI: 10.1093/database/baad061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Revised: 07/27/2023] [Accepted: 08/25/2023] [Indexed: 09/29/2023]
Abstract
Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assisting medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to organize machine-readable facts into a knowledge base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms and offers a seamless, transparent, modular architecture equipped for large-scale processing. We focus on precision medicine and build the largest KB on 'fine-grained' gene expression-cancer associations, a key to complementing and validating experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB. Database URL: https://zenodo.org/record/7577127.
Affiliation(s)
- Stefano Marchesin
- Department of Information Engineering, University of Padova, Via G. Gradenigo 6b, Padova 35131, Italy
| | - Laura Menotti
- Department of Information Engineering, University of Padova, Via G. Gradenigo 6b, Padova 35131, Italy
| | - Fabio Giachelle
- Department of Information Engineering, University of Padova, Via G. Gradenigo 6b, Padova 35131, Italy
| | - Gianmaria Silvello
- Department of Information Engineering, University of Padova, Via G. Gradenigo 6b, Padova 35131, Italy
| | - Omar Alonso
- Applied Science, Amazon, 3075 Olcott St., Santa Clara, California 95054, USA
| |
|
36
|
Zhang Z, Fang M, Wu R, Zong H, Huang H, Tong Y, Xie Y, Cheng S, Wei Z, Crabbe MJC, Zhang X, Wang Y. Large-Scale Biomedical Relation Extraction Across Diverse Relation Types: Model Development and Usability Study on COVID-19. J Med Internet Res 2023; 25:e48115. [PMID: 37632414 PMCID: PMC10551783 DOI: 10.2196/48115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Revised: 07/03/2023] [Accepted: 08/25/2023] [Indexed: 08/28/2023] Open
Abstract
BACKGROUND Biomedical relation extraction (RE) is of great importance for researchers to conduct systematic biomedical studies. It not only helps knowledge mining, such as knowledge graphs and novel knowledge discovery, but also promotes translational applications, such as clinical diagnosis, decision-making, and precision medicine. However, the relations between biomedical entities are complex and diverse, and comprehensive biomedical RE is not yet well established. OBJECTIVE We aimed to investigate and improve large-scale RE with diverse relation types and conduct usability studies with application scenarios to optimize biomedical text mining. METHODS Data sets containing 125 relation types with different entity semantic levels were constructed to evaluate the impact of entity semantic information on RE, and performance analysis was conducted on different model architectures and domain models. This study also proposed a continued pretraining strategy and integrated models with scripts into a tool. Furthermore, this study applied RE to the COVID-19 corpus with article topics and application scenarios of clinical interest to assess and demonstrate its biological interpretability and usability. RESULTS The performance analysis revealed that RE achieves the best performance when the detailed semantic type is provided. For a single model, PubMedBERT with continued pretraining performed the best, with an F1-score of 0.8998. Usability studies on COVID-19 demonstrated the interpretability and usability of RE, and a relation graph database was constructed, which was used to reveal existing and novel drug paths with edge explanations. The models (including pretrained and fine-tuned models), integrated tool (Docker), and generated data (including the COVID-19 relation graph database and drug paths) have been made publicly available to the biomedical text mining community and clinical researchers. 
CONCLUSIONS This study provided a comprehensive analysis of RE with diverse relation types. Optimized RE models and tools for diverse relation types were developed, which can be widely used in biomedical text mining. Our usability studies provided a proof-of-concept demonstration of how large-scale RE can be leveraged to facilitate novel research.
Affiliation(s)
- Zeyu Zhang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Department of Clinical Laboratory Medicine Center, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Meng Fang
- Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai, China
| | - Rebecca Wu
- University of California, Berkeley, Berkeley, CA, United States
| | - Hui Zong
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Institutes for Systems Genetics, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Honglian Huang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Yuantao Tong
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Yujia Xie
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Shiyang Cheng
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Ziyi Wei
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - M James C Crabbe
- Wolfson College, Oxford University, Oxford, United Kingdom
- Institute of Biomedical and Environmental Science & Technology, University of Bedfordshire, Luton, United Kingdom
- School of Life Sciences, Shanxi University, Taiyuan, China
| | - Xiaoyan Zhang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Ying Wang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Department of Clinical Laboratory Medicine Center, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai, China
| |
|
37
|
Jeynes JCG, Corney M, James T. A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction. PLoS One 2023; 18:e0291142. [PMID: 37682956 PMCID: PMC10490933 DOI: 10.1371/journal.pone.0291142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Accepted: 08/22/2023] [Indexed: 09/10/2023] Open
Abstract
One area of active research is the use of natural language processing (NLP) to mine biomedical texts for sets of triples (subject-predicate-object) for knowledge graph (KG) construction. While statistical methods to mine co-occurrences of entities within sentences are relatively robust, accurate relationship extraction is more challenging. Herein, we evaluate the Global Network of Biomedical Relationships (GNBR), a dataset that uses distributional semantics to model relationships between biomedical entities. The focus of our paper is an evaluation of a subset of the GNBR data: the relationships between chemicals and genes/proteins. We use Evotec's structured 'Nexus' database of >2.76M chemical-protein interactions as a ground truth to compare with GNBR's relationships and find a micro-averaged precision-recall area under the curve (AUC) of 0.50 and a micro-averaged receiver operating characteristic (ROC) curve AUC of 0.71 across the relationship classes 'inhibits', 'binding', 'agonism' and 'antagonism', when a comparison is made on a sentence-by-sentence basis. We conclude that, even though these micro-average scores are modest, using a high threshold on certain relationship classes like 'inhibits' could yield high-fidelity triples that are not reported in structured datasets. We discuss how different methods of processing GNBR data and the factuality of triples could affect the accuracy of NLP data incorporated into knowledge graphs. We provide a GNBR-Nexus (ChEMBL-subset) merged datafile that contains over 20,000 sentences where a protein/gene and a chemical co-occur and includes both the GNBR relationship scores as well as the ChEMBL (manually curated) relationships (e.g., 'agonist', 'inhibitor'); this can be accessed at https://doi.org/10.5281/zenodo.8136752. We envisage this being used to aid curation efforts by the drug discovery community.
Affiliation(s)
- Jonathan C. G. Jeynes
- Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom
| | - Matthew Corney
- Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom
| | - Tim James
- Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom
| |
|
38
|
Neves M, Klippert A, Knöspel F, Rudeck J, Stolz A, Ban Z, Becker M, Diederich K, Grune B, Kahnau P, Ohnesorge N, Pucher J, Schönfelder G, Bert B, Butzke D. Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments. J Biomed Semantics 2023; 14:13. [PMID: 37658458 PMCID: PMC10472567 DOI: 10.1186/s13326-023-00292-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 07/29/2023] [Indexed: 09/03/2023] Open
Abstract
Current animal protection laws require replacement of animal experiments with alternative methods, whenever such methods are suitable to reach the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of an enormously large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools specifically built for this purpose. Such tools are trained (or fine tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the used experimental model according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed at improving the quality of the annotations with disagreements in the first round. Furthermore, we conducted various machine learning experiments based on supervised learning to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, and ranged from 0.42 (for "others") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment used the PubMedBERT pre-trained model with fine-tuning to our corpus, which gained an overall f-score of 0.83. 
We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that our corpus is suitable for training reliable predictive models for automatic classification of biomedical literature according to the used experimental models. Our SMAFIRA - "Smart feature-based interactive" - search tool ( https://smafira.bf3r.de ) will employ this classifier for supporting the retrieval of alternative methods to animal experiments. The corpus is available for download ( https://doi.org/10.5281/zenodo.7152295 ), as well as the source code ( https://github.com/mariananeves/goldhamster ) and the model ( https://huggingface.co/SMAFIRA/goldhamster ).
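The agreement figures in this abstract can be illustrated with a small worked example. The sketch below computes Cohen's kappa for two annotators on a single binary label. It is only an illustration: the abstract says "kappa coefficient" without naming the variant, so the choice of Cohen's kappa, the function name `cohen_kappa`, and the toy label vectors are all assumptions and not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Hypothetical helper for illustration; not taken from the GoldHamster code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (assumed): 1 = article tagged "in vivo", 0 = not tagged.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy data the annotators agree on three of four articles (observed agreement 0.75), while chance agreement is 0.5, so kappa is 0.5. This shows why kappa is lower than raw percentage agreement.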
Affiliation(s)
- Mariana Neves
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany.
| | - Antonina Klippert
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Current affiliation: Nuvisan ICB GmbH, Müllerstraße 178, 13353, Berlin, Germany
| | - Fanny Knöspel
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Juliane Rudeck
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Ailine Stolz
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Zsofia Ban
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Markus Becker
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Kai Diederich
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Barbara Grune
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Pia Kahnau
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Nils Ohnesorge
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Johannes Pucher
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Gilbert Schönfelder
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Institute of Clinical Pharmacology and Toxicology, Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
| | - Bettina Bert
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Daniel Butzke
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| |
|
39
|
Sosa DN, Hintzen R, Xiong B, de Giorgio A, Fauqueur J, Davies M, Lever J, Altman RB. Associating biological context with protein-protein interactions through text mining at PubMed scale. J Biomed Inform 2023; 145:104474. [PMID: 37572825 DOI: 10.1016/j.jbi.2023.104474] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/14/2022] [Revised: 08/03/2023] [Accepted: 08/05/2023] [Indexed: 08/14/2023]
Abstract
Inferring knowledge from known relationships between drugs, proteins, genes, and diseases has great potential for clinical impact, such as predicting which existing drugs could be repurposed to treat rare diseases. Incorporating key biological context such as cell type or tissue of action into representations of extracted biomedical knowledge is essential for principled pharmacological discovery. Existing global, literature-derived knowledge graphs of interactions between drugs, proteins, genes, and diseases lack this essential information. In this study, we frame the task of associating biological context with protein-protein interactions extracted from text as a classification task using syntactic, semantic, and novel meta-discourse features. We introduce the Insider corpora, which are automatically generated PubMed-scale corpora for training classifiers for the context association task. These corpora are created by searching for precise syntactic cues of cell type and tissue relevancy to extracted regulatory relations. We report F1 scores of 0.955 and 0.862 for identifying relevant cell types and tissues, respectively, for our identified relations. By classifying with this framework, we demonstrate that the problem of context association can be addressed using intuitive, interpretable features. We demonstrate the potential of this approach to enrich text-derived knowledge bases with biological detail by incorporating cell type context into a protein-protein network for dengue fever.
Affiliation(s)
- Daniel N Sosa
- Stanford University, Department of Biomedical Data Science, Stanford, CA, USA
| | | | - Betty Xiong
- Stanford University, Department of Biomedical Data Science, Stanford, CA, USA
| | | | | | | | | | - Russ B Altman
- Stanford University, Department of Bioengineering, Stanford, CA, USA; Stanford University, Department of Genetics, Stanford, CA, USA.
| |
|
40
|
Pu Y, Beck D, Verspoor K. Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease. J Biomed Inform 2023; 145:104464. [PMID: 37541406 DOI: 10.1016/j.jbi.2023.104464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/22/2023] [Revised: 07/29/2023] [Accepted: 07/30/2023] [Indexed: 08/06/2023]
Abstract
OBJECTIVE We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. The key link prediction setting of prediction window length is specifically examined in the context of a time-sliced evaluation methodology. METHODS We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed. RESULTS We constructed an AD corpus of over 16 k papers published in 1977-2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ∼11 k nodes and ∼394 k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years. 
A huge improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation. CONCLUSION Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying prediction window length on the time-sliced evaluation methodology leads to hugely different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases. AVAILABILITY Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.
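The time-sliced evaluation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the (node, node, year) edge format, and the toy edges are assumptions made purely for illustration. Edges first reported up to a cutoff year form the training graph, and edges that first appear within the following prediction window form the test set. The density helper assumes an undirected simple graph and is meant to be fed the approximate node and edge counts quoted in the abstract.

```python
def time_sliced_split(edges, cutoff_year, window):
    """Split dated edges into a training graph and a future test set.

    edges: iterable of (u, v, year) tuples (hypothetical format).
    Training edges are those dated up to cutoff_year; test edges are
    those dated within the next `window` years and absent from training.
    """
    train = {frozenset((u, v)) for u, v, year in edges if year <= cutoff_year}
    test = {
        frozenset((u, v))
        for u, v, year in edges
        if cutoff_year < year <= cutoff_year + window
    } - train
    return train, test


def graph_density(num_nodes, num_edges):
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# Toy, made-up edges; node labels and years carry no biomedical meaning.
toy_edges = [
    ("g1", "p1", 2001),
    ("g1", "p2", 2004),
    ("g3", "p3", 2013),
    ("g1", "g3", 2018),
]

# Same cutoff, two prediction windows: a longer window admits more test links.
train, short_test = time_sliced_split(toy_edges, cutoff_year=2010, window=5)
_, long_test = time_sliced_split(toy_edges, cutoff_year=2010, window=10)

print(len(train), len(short_test), len(long_test))
print(f"{graph_density(11_000, 394_000):.4%}")
```

With roughly 11 k nodes and 394 k edges, the density works out to about 0.65%, consistent with the "less than 1%" figure in the abstract. Widening the window only enlarges the set of future links a prediction can match, which is one way to read the reported sensitivity to prediction window length.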
Affiliation(s)
- Yiyuan Pu
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
- Daniel Beck
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia; School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia.
|
41
|
Basereh M, Caputo A, Brennan R. Automatic transparency evaluation for open knowledge extraction systems. J Biomed Semantics 2023; 14:12. [PMID: 37653549 PMCID: PMC10468861 DOI: 10.1186/s13326-023-00293-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Accepted: 07/30/2023] [Indexed: 09/02/2023] Open
Abstract
BACKGROUND This paper proposes Cyrus, a new transparency evaluation framework, for Open Knowledge Extraction (OKE) systems. Cyrus is based on the state-of-the-art transparency models and linked data quality assessment dimensions. It brings together a comprehensive view of transparency dimensions for OKE systems. The Cyrus framework is used to evaluate the transparency of three linked datasets, which are built from the same corpus by three state-of-the-art OKE systems. The evaluation is automatically performed using a combination of three state-of-the-art FAIRness (Findability, Accessibility, Interoperability, Reusability) assessment tools and a linked data quality evaluation framework, called Luzzu. This evaluation includes six Cyrus data transparency dimensions for which existing assessment tools could be identified. OKE systems extract structured knowledge from unstructured or semi-structured text in the form of linked data. These systems are fundamental components of advanced knowledge services. However, due to the lack of a transparency framework for OKE, most OKE systems are not transparent. This means that their processes and outcomes are not understandable and interpretable. A comprehensive framework sheds light on different aspects of transparency, allows comparison between the transparency of different systems by supporting the development of transparency scores, gives insight into the transparency weaknesses of the system, and ways to improve them. Automatic transparency evaluation helps with scalability and facilitates transparency assessment. The transparency problem has been identified as critical by the European Union Trustworthy Artificial Intelligence (AI) guidelines. In this paper, Cyrus provides the first comprehensive view of transparency dimensions for OKE systems by merging the perspectives of the FAccT (Fairness, Accountability, and Transparency), FAIR, and linked data quality research communities. 
RESULTS In Cyrus, data transparency includes ten dimensions which are grouped in two categories. In this paper, six of these dimensions, i.e., provenance, interpretability, understandability, licensing, availability, interlinking have been evaluated automatically for three state-of-the-art OKE systems, using the state-of-the-art metrics and tools. Covid-on-the-Web is identified to have the highest mean transparency. CONCLUSIONS This is the first research to study the transparency of OKE systems that provides a comprehensive set of transparency dimensions spanning ethics, trustworthy AI, and data quality approaches to transparency. It also demonstrates how to perform automated transparency evaluation that combines existing FAIRness and linked data quality assessment tools for the first time. We show that state-of-the-art OKE systems vary in the transparency of the linked data generated and that these differences can be automatically quantified leading to potential applications in trustworthy AI, compliance, data protection, data governance, and future OKE system design and testing.
Affiliation(s)
- Maryam Basereh
  - School of Computing, Dublin City University, Dublin, Ireland.
- Annalina Caputo
  - School of Computing, Dublin City University, Dublin, Ireland
- Rob Brennan
  - ADAPT Centre, School of Computer Science, University College Dublin, Dublin, Ireland
|
42
|
Lee H, Jeon J, Jung D, Won JI, Kim K, Kim YJ, Yoon J. RelCurator: a text mining-based curation system for extracting gene-phenotype relationships specific to neurodegenerative disorders. Genes Genomics 2023; 45:1025-1036. [PMID: 37300788 DOI: 10.1007/s13258-023-01405-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 05/18/2023] [Indexed: 06/12/2023]
Abstract
BACKGROUND The identification of gene-phenotype relationships is important in medical genetics as it serves as a basis for precision medicine. However, most of the gene-phenotype relationship data are buried in the biomedical literature in textual form. OBJECTIVE We propose RelCurator, a curation system that extracts sentences including both gene and phenotype entities related to specific disease categories from PubMed articles, provides rich additional information such as entity taggings, and predictions of gene-phenotype relationships. METHODS We targeted neurodegenerative disorders and developed a deep learning model using Bidirectional Gated Recurrent Unit (BiGRU) networks and BioWordVec word embeddings for predicting gene-phenotype relationships from biomedical texts. The prediction model is trained with more than 130,000 labeled PubMed sentences including gene and phenotype entities, which are related to or unrelated to neurodegenerative disorders. RESULTS We compared the performance of our deep learning model with those of Bidirectional Encoder Representations from Transformers (BERT), Support Vector Machine (SVM), and simple Recurrent Neural Network (simple RNN) models. Our model performed better with an F1-score of 0.96. Furthermore, the evaluation done using a few curation cases in the real scenario showed the effectiveness of our work. Therefore, we conclude that RelCurator can identify not only new causative genes, but also new genes associated with neurodegenerative disorders' phenotype. CONCLUSION RelCurator is a user-friendly method for accessing deep learning-based supporting information and a concise web interface to assist curators while browsing the PubMed articles. Our curation process represents an important and broadly applicable improvement to the state of the art for the curation of gene-phenotype relationships.
Affiliation(s)
- Heonwoo Lee
  - Department of Computer Engineering, Hallym University, Chuncheon, Gangwon-do, 200-702, Republic of Korea
- Junbeom Jeon
  - Department of Computer Engineering, Hallym University, Chuncheon, Gangwon-do, 200-702, Republic of Korea
- Dawoon Jung
  - Department of Computer Engineering, Hallym University, Chuncheon, Gangwon-do, 200-702, Republic of Korea
- Jung-Im Won
  - Center for Innovation in Engineering Education, Hanyang University, Seoul, Republic of Korea
- Kiyong Kim
  - Department of Electronic Engineering, Kyonggi University, Suwon, Republic of Korea
- Yun Joong Kim
  - Department of Neurology, Yonsei University College of Medicine, Seoul, Republic of Korea.
  - Department of Neurology, Yongin Severance Hospital, Yonsei University College of Medicine, Yonsei University Health System, Yongin, Gyeonggi-do, 16995, Republic of Korea.
- Jeehee Yoon
  - Department of Computer Engineering, Hallym University, Chuncheon, Gangwon-do, 200-702, Republic of Korea.
|
43
|
Li X, Dai A, Tran R, Wang J. Text mining-based identification of promising miRNA biomarkers for diabetes mellitus. Front Endocrinol (Lausanne) 2023; 14:1195145. [PMID: 37560309 PMCID: PMC10407569 DOI: 10.3389/fendo.2023.1195145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 07/05/2023] [Indexed: 08/11/2023] Open
Abstract
Introduction MicroRNAs (miRNAs) are small, non-coding RNAs that play a critical role in diabetes development. While individual studies investigating the mechanisms of miRNA in diabetes provide valuable insights, their narrow focus limits their ability to provide a comprehensive understanding of miRNAs' role in diabetes pathogenesis and complications. Methods To reduce potential bias from individual studies, we employed a text mining-based approach to identify the role of miRNAs in diabetes and their potential as biomarker candidates. Abstracts of publications were tokenized, and biomedical terms were extracted for topic modeling. Four machine learning algorithms, Naïve Bayes, Decision Tree, Random Forest, and Support Vector Machines (SVM), were employed for diabetes classification. Feature importance was assessed to construct miRNA-diabetes networks. Results Our analysis identified 13 distinct topics of miRNA studies in the context of diabetes, and miRNAs exhibited a topic-specific pattern. SVM achieved a promising prediction for diabetes with an accuracy score greater than 60%. Notably, miR-146 emerged as one of the critical biomarkers for diabetes prediction, targeting multiple genes and signal pathways implicated in diabetic inflammation and neuropathy. Conclusion This comprehensive approach yields generalizable insights into the miRNA-diabetes network and supports miRNAs' potential as biomarkers for diabetes.
Affiliation(s)
- Xin Li
  - Central Hospital Affiliated to Shandong First Medical University, Ophthalmology Department, Jinan, Shandong, China
- Andrea Dai
  - Oakland University William Beaumont School of Medicine, Rochester, MI, United States
- Richard Tran
  - University of Chicago, Master’s Program in Computer Science, Chicago, IL, United States
- Jie Wang
  - Syracuse University, Applied Data Science Program, Syracuse, NY, United States
  - MDSight, LLC, Brookeville, MD, United States
|
44
|
Kowalski TW, Feira MF, Lord VO, Gomes JDA, Giudicelli GC, Fraga LR, Sanseverino MTV, Recamonde-Mendoza M, Schuler-Faccini L, Vianna FSL. A New Strategy for the Old Challenge of Thalidomide: Systems Biology Prioritization of Potential Immunomodulatory Drug (IMiD)-Targeted Transcription Factors. Int J Mol Sci 2023; 24:11515. [PMID: 37511270 PMCID: PMC10380514 DOI: 10.3390/ijms241411515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 07/06/2023] [Accepted: 07/08/2023] [Indexed: 07/30/2023] Open
Abstract
Several molecular mechanisms of thalidomide embryopathy (TE) have been investigated, from anti-angiogenesis to oxidative stress to cereblon binding. Recently, it was discovered that thalidomide and its analogs, named immunomodulatory drugs (IMiDs), induced the degradation of C2H2 transcription factors (TFs). This mechanism might impact the strict transcriptional regulation of the developing embryo. Hence, this study aims to evaluate the TFs altered by IMiDs, prioritizing the ones associated with embryogenesis through transcriptome and systems biology-allied analyses. This study comprises only the experimental data accessed through bioinformatics databases. First, proteins and genes reported in the literature as altered/affected by the IMiDs were annotated. A protein systems biology network was evaluated. TFs beta-catenin (CTNNB1) and SP1 play more central roles: beta-catenin is an essential protein in the network, while SP1 is a putative C2H2 candidate for IMiD-induced degradation. Separately, the differential expressions of the annotated genes were analyzed through 23 publicly available transcriptomes, presenting 8624 differentially expressed genes (2947 in two or more datasets). Seventeen C2H2 TFs were identified as related to embryonic development but not studied for IMiD exposure; these TFs are potential IMiDs degradation neosubstrates. This is the first study to suggest an integration of IMiD molecular mechanisms through C2H2 TF degradation.
Affiliation(s)
- Thayne Woycinck Kowalski
  - Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
  - Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Biomedical Sciences Course, Centro Universitário CESUCA, Cachoeirinha 94935-630, Brazil
- Mariléa Furtado Feira
  - Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
  - Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
- Vinícius Oliveira Lord
  - Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Biomedical Sciences Course, Centro Universitário CESUCA, Cachoeirinha 94935-630, Brazil
- Julia do Amaral Gomes
  - Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
- Giovanna Câmara Giudicelli
  - Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
  - Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
- Lucas Rosa Fraga
  - Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Post-Graduation Program in Medicine, Medical Sciences, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90035-003, Brazil
  - Department of Morphological Sciences, Institute of Health Sciences, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90010-150, Brazil
- Maria Teresa Vieira Sanseverino
  - Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
  - Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - School of Medicine, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre 90619-900, Brazil
- Mariana Recamonde-Mendoza
  - Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Post-Graduation Program in Computer Science, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
- Lavinia Schuler-Faccini
  - Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
  - Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
- Fernanda Sales Luiz Vianna
  - Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
  - Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
  - Post-Graduation Program in Medicine, Medical Sciences, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90035-003, Brazil
|
45
|
Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023; 143:104405. [PMID: 37270143 PMCID: PMC10528083 DOI: 10.1016/j.jbi.2023.104405] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Revised: 05/18/2023] [Accepted: 05/21/2023] [Indexed: 06/05/2023]
Abstract
BACKGROUND Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge-bases yet exist to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition. RESULTS We present the first ignorance-base, a knowledge-base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge-base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements. These were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). 
The researchers could look to the field of neuroscience for potential answers to the ignorance statements. CONCLUSION Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
Affiliation(s)
- Mayla R Boguslav
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA.
- Nourah M Salem
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Elizabeth K White
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
- Katherine J Sullivan
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Michael Bada
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Teri L Hernandez
  - College of Nursing, Department of Medicine/Division of Endocrinology, Metabolism, & Diabetes, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Sonia M Leach
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
- Lawrence E Hunter
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
|
46
|
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets. ARXIV 2023:arXiv:2306.11189v1. [PMID: 37502629 PMCID: PMC10370213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
|
47
|
Knafou J, Haas Q, Borissov N, Counotte M, Low N, Imeri H, Ipekci AM, Buitrago-Garcia D, Heron L, Amini P, Teodoro D. Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature. Syst Rev 2023; 12:94. [PMID: 37277872 DOI: 10.1186/s13643-023-02247-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 04/24/2023] [Indexed: 06/07/2023] Open
Abstract
BACKGROUND The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process. METHODS In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article. RESULTS The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with an F1-score of up to 97% on a subset of 80% of the collection instead of 93% on the whole dataset. 
CONCLUSION This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting alternative to annotate a subset with higher predictive confidence.
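The unanimity voting rule and the recall@3 metric mentioned in this abstract can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the function names, the label strings, and the toy predictions are assumptions, and the abstract does not specify how the ensemble handles disagreement, so the sketch simply abstains when the models do not all agree.

```python
from collections import Counter


def unanimous_vote(predictions):
    """Return the shared label if every model agrees, otherwise None (abstain)."""
    if not predictions:
        return None
    first = predictions[0]
    return first if all(p == first for p in predictions) else None


def majority_vote(predictions):
    """Return the most frequent label among the model predictions."""
    return Counter(predictions).most_common(1)[0][0]


def recall_at_k(ranked_labels, relevant_labels, k):
    """Fraction of the relevant labels found in the top-k ranked labels."""
    relevant = set(relevant_labels)
    if not relevant:
        return 0.0
    return len(set(ranked_labels[:k]) & relevant) / len(relevant)


# Toy predictions from three hypothetical models for two articles.
per_article = [
    ["original", "original", "original"],  # unanimous
    ["original", "review", "original"],    # disagreement
]

decisions = [unanimous_vote(p) for p in per_article]
# Coverage: share of articles on which the unanimity rule yields a prediction.
coverage = sum(d is not None for d in decisions) / len(decisions)

print(decisions, coverage)
print(recall_at_k(["a", "b", "c", "d"], {"c", "e"}, k=3))
```

Abstaining on disagreement trades coverage for confidence, which mirrors the trade-off the abstract reports: a higher F1-score on the subset of the collection where the rule produces a prediction.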
Affiliation(s)
- Julien Knafou
  - University of Applied Sciences and Arts of Western Switzerland (HES-SO), Rue de la Tambourine 17, 1227, Geneva, Switzerland.
- Nikolay Borissov
  - University of Applied Sciences and Arts of Western Switzerland (HES-SO), Rue de la Tambourine 17, 1227, Geneva, Switzerland
  - CTU Bern, University of Bern, Bern, Switzerland
- Michel Counotte
  - Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland
  - Wageningen Bioveterinary Research, Wageningen University & Research, Wageningen, The Netherlands
- Nicola Low
  - Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland
- Hira Imeri
  - Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland
- Aziz Mert Ipekci
  - Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland
- Leonie Heron
  - Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland
- Poorya Amini
  - Risklick AG, Bern, Switzerland
  - CTU Bern, University of Bern, Bern, Switzerland
- Douglas Teodoro
  - University of Applied Sciences and Arts of Western Switzerland (HES-SO), Rue de la Tambourine 17, 1227, Geneva, Switzerland.
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.
|
48
|
Jeong M, Kang J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023; 39:btad361. [PMID: 37261870 PMCID: PMC10272703 DOI: 10.1093/bioinformatics/btad361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 04/17/2023] [Accepted: 05/31/2023] [Indexed: 06/02/2023] Open
Abstract
SUMMARY Biomedical named entity recognition (NER) plays a crucial role in extracting information from documents in biomedical applications. However, many of these applications require NER models to operate at a document level, rather than just a sentence level. This presents a challenge, as the extension from a sentence model to a document model is not always straightforward. Despite the existence of document NER models that are able to make consistent predictions, they still fall short of meeting the expectations of researchers and practitioners in the field. To address this issue, we have undertaken an investigation into the underlying causes of inconsistent predictions. Our research has led us to believe that the use of adjectives and prepositions within entities may be contributing to low label consistency. In this article, we present our method, ConNER, to enhance the label consistency of modifiers such as adjectives and prepositions. By refining the labels of these modifiers, ConNER is able to improve representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets. On three datasets, we achieve a higher F1 score than the previous state-of-the-art model. Our method shows its efficacy on two datasets, resulting in 7.5%-8.6% absolute improvements in the F1 score. Our findings suggest that our ConNER method is effective on datasets with intrinsically low label consistency. Through qualitative analysis, we demonstrate how our approach helps the NER model generate more consistent predictions. AVAILABILITY AND IMPLEMENTATION Our code and resources are available at https://github.com/dmis-lab/ConNER/.
Affiliation(s)
- Minbyul Jeong
  - Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
- Jaewoo Kang
  - Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
  - Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
  - AIGEN Sciences, Seoul 04778, Republic of Korea
|
49
|
Allot A, Wei CH, Phan L, Hefferon T, Landrum M, Rehm HL, Lu Z. Tracking genetic variants in the biomedical literature using LitVar 2.0. Nat Genet 2023; 55:901-903. [PMID: 37268776 PMCID: PMC11096795 DOI: 10.1038/s41588-023-01414-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 06/04/2023]
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Timothy Hefferon
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Melissa Landrum
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Heidi L Rehm
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
50
Faessler E, Hahn U, Schäuble S. GePI: large-scale text mining, customized retrieval and flexible filtering of gene/protein interactions. Nucleic Acids Res 2023:7177881. [PMID: 37224532 DOI: 10.1093/nar/gkad445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 05/01/2023] [Accepted: 05/11/2023] [Indexed: 05/26/2023] Open
Abstract
We present GePI, a novel Web server for large-scale text mining of molecular interactions from the scientific biomedical literature. GePI leverages natural language processing techniques to identify genes and related entities, interactions between those entities, and biomolecular events involving them. GePI supports rapid retrieval of interactions based on powerful search options to contextualize queries targeting (lists of) genes of interest. Contextualization is enabled by full-text filters constraining the search for interactions to either sentences or paragraphs, with or without pre-defined gene lists. Our knowledge graph is updated several times a week, ensuring that the most recent information is available at all times. The result page provides an overview of the outcome of a search, with accompanying interaction statistics and visualizations. A table (downloadable in Excel format) gives direct access to the retrieved interaction pairs, together with information about the molecular entities, the factual certainty of the interactions (as expressed verbatim by the authors), and a text snippet from the original document that verbalizes each interaction. In summary, our Web application offers free, easy-to-use, and up-to-date monitoring of gene and protein interaction information, together with flexible query formulation and filtering options. GePI is available at https://gepi.coling.uni-jena.de/.
Affiliation(s)
- Erik Faessler
- Jena University Language and Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Fürstengraben 30, 07743 Jena, Germany
- Udo Hahn
- Jena University Language and Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Fürstengraben 30, 07743 Jena, Germany
- Sascha Schäuble
- Jena University Language and Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Fürstengraben 30, 07743 Jena, Germany
- Microbiome Dynamics, Leibniz Institute for Natural Product Research and Infection Biology (Leibniz-HKI), 07745 Jena, Germany