Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2019;47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 188] [Impact Index Per Article: 47.0] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open

Cited by Other Article(s)

Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024;23:1339-1347. [PMID: 38585647 PMCID: PMC10995799 DOI: 10.1016/j.csbj.2024.03.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/24/2024] [Indexed: 04/09/2024] Open

Abstract

Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning-based framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, the MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, significantly improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed the term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@1 score of 94.5% across five datasets. The MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on the MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9,129 entities and 88,482 triplets. Lastly, the MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine out of the top ten predicted drugs have been previously reported as effective for gastric cancer treatment. Finally, an online platform was created for exploration and visualization of MKG-GC at https://www.yanglab-mi.org.cn/MKG-GC/.

Zheng H, Xu L, Xie H, Xie J, Ma Y, Hu Y, Wu L, Chen J, Wang M, Yi Y, Huang Y, Wang D. RIscoper 2.0: A deep learning tool to extract RNA biomedical relation sentences from literature. Comput Struct Biotechnol J 2024;23:1469-1476. [PMID: 38623560 PMCID: PMC11016866 DOI: 10.1016/j.csbj.2024.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 03/15/2024] [Accepted: 03/21/2024] [Indexed: 04/17/2024] Open

Chen J, Goudey B, Geard N, Verspoor K. Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation. Bioinformatics 2024;40:i390-i400. [PMID: 38940182 DOI: 10.1093/bioinformatics/btae246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open

Abstract

MOTIVATION

Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process.

RESULTS

We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies.

This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge.

AVAILABILITY AND IMPLEMENTATION

https://github.com/jiyuc/de-inconsistency.

Zhang J, Jiang Q, Du Z, Geng Y, Hu Y, Tong Q, Song Y, Zhang HY, Yan X, Feng Z. Knowledge graph-derived feed efficiency analysis via pig gut microbiota. Sci Rep 2024;14:13939. [PMID: 38886444 PMCID: PMC11182767 DOI: 10.1038/s41598-024-64835-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Accepted: 06/13/2024] [Indexed: 06/20/2024] Open

Affiliation(s)

Junmei Zhang National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Qin Jiang National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China Yazhouwan National Laboratory (YNL), Sanya, 572025, China
Zhihong Du National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Yilin Geng National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Yuren Hu National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Qichang Tong National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Yunfeng Song National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Hong-Yu Zhang National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Xianghua Yan National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
Zaiwen Feng National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China.

Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database (Oxford) 2024;2024:baae039. [PMID: 38805753 PMCID: PMC11131422 DOI: 10.1093/database/baae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/17/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]

Abstract

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite the reports in literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE dataset relevant for biomedical curation applications. 
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.

Hramyka D, Sczakiel HL, Zhao MX, Stolpe O, Nieminen M, Adam R, Danyel M, Einicke L, Hägerling R, Knaus A, Mundlos S, Schwartzmann S, Seelow D, Ehmke N, Mensah MA, Boschann F, Beule D, Holtgrewe M. REEV: review, evaluate and explain variants. Nucleic Acids Res 2024:gkae366. [PMID: 38769069 DOI: 10.1093/nar/gkae366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/07/2024] [Accepted: 05/03/2024] [Indexed: 05/22/2024] Open

Affiliation(s)

Dzmitry Hramyka Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
Henrike Lisa Sczakiel Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany RG Development & Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
Max Xiaohang Zhao Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany Berlin Institute of Health, Berlin, Germany
Oliver Stolpe Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
Mikko Nieminen Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany
Ronja Adam Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany Berlin Institute of Health, Berlin, Germany
Magdalena Danyel Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
Lara Einicke Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany Berlin Institute of Health, Berlin, Germany
René Hägerling Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany RG Development & Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany Berlin Institute of Health , BIH Center for Regenerative Therapies, Berlin, Germany
Alexej Knaus Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Germany
Stefan Mundlos Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany RG Development & Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
Sarina Schwartzmann Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
Dominik Seelow Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany Berlin Institute of Health, Berlin, Germany
Nadja Ehmke Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany Berlin Institute of Health, Berlin, Germany
Martin Atta Mensah Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany BIH Biomedical Innovation Academy, Digital Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
Felix Boschann Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany Berlin Institute of Health, Berlin, Germany
Dieter Beule Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
Manuel Holtgrewe Berlin Institute of Health, Core Unit Bioinformatics, Berlin, Germany

Di Maria A, Bellomo L, Billeci F, Cardillo A, Alaimo S, Ferragina P, Ferro A, Pulvirenti A. NetMe 2.0: a web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph. Bioinformatics 2024;40:btae194. [PMID: 38597890 DOI: 10.1093/bioinformatics/btae194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Revised: 03/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]

Fisher JL, Wilk EJ, Oza VH, Gary SE, Howton TC, Flanary VL, Clark AD, Hjelmeland AB, Lasseigne BN. Signature reversion of three disease-associated gene signatures prioritizes cancer drug repurposing candidates. FEBS Open Bio 2024;14:803-830. [PMID: 38531616 PMCID: PMC11073506 DOI: 10.1002/2211-5463.13796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2024] [Revised: 03/13/2024] [Accepted: 03/14/2024] [Indexed: 03/28/2024] Open

Liu J, Wu H, Robertson DH, Zhang J. Text mining and portal development for gene-specific publications on Alzheimer's disease and other neurodegenerative diseases. BMC Med Inform Decis Mak 2024;24:98. [PMID: 38632621 PMCID: PMC11025191 DOI: 10.1186/s12911-024-02501-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Accepted: 04/04/2024] [Indexed: 04/19/2024] Open

Abstract

BACKGROUND

Tremendous research efforts have been made in the Alzheimer's disease (AD) field to understand the disease etiology, progression and discover treatments for AD. Many mechanistic hypotheses, therapeutic targets and treatment strategies have been proposed in the last few decades. Reviewing previous work and staying current on this ever-growing body of AD publications is an essential yet difficult task for AD researchers.

METHODS

In this study, we designed and implemented a natural language processing (NLP) pipeline to extract gene-specific, neurodegenerative disease (ND)-focused information from the PubMed database. The collected publication information was filtered and cleaned to construct AD-related gene-specific publication profiles. Six categories of AD-related information were extracted from the processed publication data: publication trend by year, dementia type occurrence, brain region occurrence, mouse model information, keyword occurrence, and co-occurring genes. A user-friendly web portal was then developed using the Django framework to provide gene query functions and data visualizations for the generalized and summarized publication information.

RESULTS

By implementing the NLP pipeline, we extracted gene-specific ND-related publication information from the abstracts of the publications in the PubMed database. The results are summarized and visualized through an interactive web query portal. Multiple visualization windows display the ND publication trends, mouse models used, dementia types, involved brain regions, keywords to major AD-related biological processes, and co-occurring genes. Direct links to PubMed sites are provided for all recorded publications on the query result page of the web portal.

CONCLUSION

The resulting portal is a valuable tool and data source for quickly querying and displaying AD publications tailored to users' research areas and gene targets of interest, which is especially convenient for users without informatics mining skills. Our study will not only keep AD researchers updated on the progress of AD research and help them conduct preliminary examinations efficiently, but also offer additional support for hypothesis generation and validation, which will contribute significantly to the communication, dissemination, and progress of AD research.

Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024:gkae235. [PMID: 38572754 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open

Arakane K, Imoto H, Ormersbach F, Okada M. Extending BioMASS to construct mathematical models from external knowledge. Bioinformatics Advances 2024;4:vbae042. [PMID: 38606187 PMCID: PMC11007111 DOI: 10.1093/bioadv/vbae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 02/13/2024] [Accepted: 04/03/2024] [Indexed: 04/13/2024]

Mateu-Sanz M, Fuenteslópez CV, Uribe-Gomez J, Haugen HJ, Pandit A, Ginebra MP, Hakimi O, Krallinger M, Samara A. Redefining biomaterial biocompatibility: challenges for artificial intelligence and text mining. Trends Biotechnol 2024;42:402-417. [PMID: 37858386 DOI: 10.1016/j.tibtech.2023.09.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 09/25/2023] [Accepted: 09/26/2023] [Indexed: 10/21/2023]

Wittau J, Celik S, Kacprowski T, Deserno TM, Seifert R. Fake paper identification in the pool of withdrawn and rejected manuscripts submitted to Naunyn-Schmiedeberg's Archives of Pharmacology. Naunyn-Schmiedeberg's Archives of Pharmacology 2024;397:2171-2181. [PMID: 37796310 PMCID: PMC10933159 DOI: 10.1007/s00210-023-02741-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/18/2023] [Accepted: 09/20/2023] [Indexed: 10/06/2023]

Rai P, Jain A, Kumar S, Sharma D, Jha N, Chawla S, Raj A, Gupta A, Poonia S, Majumdar A, Chakraborty T, Ahuja G, Sengupta D. Literature mining discerns latent disease-gene relationships. Bioinformatics 2024;40:btae185. [PMID: 38608194 PMCID: PMC11060865 DOI: 10.1093/bioinformatics/btae185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Revised: 01/30/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open

Abstract

MOTIVATION

Dysregulation of a gene's function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily due to genetic pleiotropy and lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape project, researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates.

RESULTS

To circumvent this, we mined ∼18 million PubMed abstracts published till May 2019 and automatically selected ∼4.5 million of them that describe roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained bidirectional encoder representations from transformers (BERT) for language modeling from the domain of natural language processing to learn vector representation of entities such as genes, diseases, tissues, cell-types, etc., in a way such that their relationship is preserved in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in silico synthesis of hypotheses linking different biological entities such as genes and conditions.

AVAILABILITY AND IMPLEMENTATION

PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model. BioSentVec-based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model. Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap.

Affiliation(s)

Priyadarshini Rai Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Atishay Jain Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Shivani Kumar Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Divya Sharma Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Neha Jha Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Smriti Chawla Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Abhijit Raj Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Apoorva Gupta Department of Biotechnology, Delhi Technological University, Shahbad Daulatpur, Delhi 110042, India
Sarita Poonia Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Angshul Majumdar IAI, TCG CREST, Kolkata 700091, India
Tanmoy Chakraborty Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi 110016, India Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi 110016, India
Gaurav Ahuja Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India Centre for Artificial Intelligence, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India
Debarka Sengupta Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India Centre for Artificial Intelligence, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla Phase III, New Delhi 110020, India

Richardson R, Tejedor Navarro H, Amaral LAN, Stoeger T. Meta-Research: Understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results. eLife 2024;12:RP93429. [PMID: 38546716 PMCID: PMC10977968 DOI: 10.7554/elife.93429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/01/2024] Open

Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024;25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] Open

Abstract

BACKGROUND

The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large, often manually or semi-manually, annotated corpora vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations also integrating automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools.

RESULTS

We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron performances in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performances.

CONCLUSIONS

MetaTron stands out as one of the few annotation tools targeting the biomedical domain that supports the annotation of relations and is fully customizable, accepting documents in several formats (PDF included) as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.

Yao X, He Z, Liu Y, Wang Y, Ouyang S, Xia J. Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer. Sci Data 2024;11:265. [PMID: 38431735 PMCID: PMC10908799 DOI: 10.1038/s41597-024-03083-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 02/20/2024] [Indexed: 03/05/2024] Open

Liu J, Li J, Jin F, Li Q, Zhao G, Wu L, Li X, Xia J, Cheng N. dbCRAF: a curated knowledgebase for regulation of radiation response in human cancer. NAR Cancer 2024;6:zcae008. [PMID: 38406264 PMCID: PMC10894039 DOI: 10.1093/narcan/zcae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 12/10/2023] [Accepted: 02/15/2024] [Indexed: 02/27/2024] Open

Affiliation(s)

Jie Liu Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
Jing Li Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
Fangfang Jin Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
Qian Li School of Environmental Science and Optoelectronic Technology, University of Science and Technology of China, Hefei, Anhui 230026, China
Guoping Zhao Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui 230031, China
Lijun Wu Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui 230031, China
Xiaoyan Li Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
Junfeng Xia Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui 230601, China
Na Cheng School of Biomedical Engineering, Anhui Medical University, Hefei, Anhui 230032, China

Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024:10.1007/s12539-024-00605-2. [PMID: 38340264 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]

Tumilovich A, Yablokov E, Mezentsev Y, Ershov P, Basina V, Gnedenko O, Kaluzhskiy L, Tsybruk T, Grabovec I, Kisel M, Shabunya P, Soloveva N, Vavilov N, Gilep A, Ivanov A. The Multienzyme Complex Nature of Dehydroepiandrosterone Sulfate Biosynthesis. Int J Mol Sci 2024;25:2072. [PMID: 38396748 PMCID: PMC10889563 DOI: 10.3390/ijms25042072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/16/2024] [Accepted: 01/26/2024] [Indexed: 02/25/2024] Open

Affiliation(s)

Anastasiya Tumilovich Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus
Evgeniy Yablokov Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Yuri Mezentsev Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Pavel Ershov Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Viktoriia Basina Research Centre for Medical Genetics, 1 Moskvorechye Street, 115522 Moscow, Russia
Oksana Gnedenko Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Leonid Kaluzhskiy Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Tatsiana Tsybruk Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus
Irina Grabovec Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus
Maryia Kisel Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus
Polina Shabunya Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus
Natalia Soloveva Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Nikita Vavilov Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Andrei Gilep Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia
Alexis Ivanov Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia

Kilicoglu H, Ensan F, McInnes B, Wang LL. Semantics-enabled biomedical literature analytics. J Biomed Inform 2024;150:104588. [PMID: 38244957 DOI: 10.1016/j.jbi.2024.104588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 01/10/2024] [Indexed: 01/22/2024]

Reed CJ, Denise R, Hourihan J, Babor J, Jaroch M, Martinelli M, Hutinet G, de Crécy-Lagard V. Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families. Microb Genom 2024;10:001183. [PMID: 38323604 PMCID: PMC10926702 DOI: 10.1099/mgen.0.001183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2023] [Accepted: 01/08/2024] [Indexed: 02/08/2024] Open

Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024;100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] Open

Alqaissi E, Alotaibi F, Sher Ramzan M, Algarni A. Novel graph-based machine-learning technique for viral infectious diseases: application to influenza and hepatitis diseases. Ann Med 2024;55:2304108. [PMID: 38242107 PMCID: PMC10802812 DOI: 10.1080/07853890.2024.2304108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Accepted: 12/18/2023] [Indexed: 01/21/2024] Open

Gao J, Mo S, Wang J, Zhang M, Shi Y, Zhu C, Shang Y, Tang X, Zhang S, Wu X, Xu X, Wang Y, Li Z, Zheng G, Chen Z, Wang Q, Tang K, Cao Z. MACC: a visual interactive knowledgebase of metabolite-associated cell communications. Nucleic Acids Res 2024;52:D633-D639. [PMID: 37897362 PMCID: PMC10767829 DOI: 10.1093/nar/gkad914] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 09/21/2023] [Accepted: 10/10/2023] [Indexed: 10/30/2023] Open

Affiliation(s)

Jian Gao School of Life Sciences, Fudan University, Shanghai, China International Human Phenome Institutes (Shanghai), Shanghai, China Department of Thoracic Surgery and State Key Laboratory of Genetic Engineering, Fudan University Shanghai Cancer Center, Shanghai, China
Saifeng Mo Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Jun Wang School of Life Sciences, Fudan University, Shanghai, China
Mou Zhang School of Life Sciences, Fudan University, Shanghai, China
Yao Shi School of Life Sciences, Fudan University, Shanghai, China
Chuhan Zhu Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Yuxuan Shang Biological Sciences, University of California Santa Barbara, CA, USA
Xinyue Tang Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Shiyue Zhang School of Life Sciences, Fudan University, Shanghai, China
Xinwen Wu Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Xinyan Xu Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Yiheng Wang School of Life Sciences, Fudan University, Shanghai, China
Zihao Li Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Genhui Zheng Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Zikun Chen Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Qiming Wang School of Life Sciences, Fudan University, Shanghai, China
Kailin Tang Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Zhiwei Cao School of Life Sciences, Fudan University, Shanghai, China International Human Phenome Institutes (Shanghai), Shanghai, China

Savage SR, Zhang Y, Jaehnig EJ, Liao Y, Shi Z, Pham HA, Xu H, Zhang B. IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining. Mol Cell Proteomics 2024;23:100682. [PMID: 37993103 PMCID: PMC10716774 DOI: 10.1016/j.mcpro.2023.100682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Revised: 10/25/2023] [Accepted: 11/14/2023] [Indexed: 11/24/2023] Open

Abstract

Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. 
Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.

Jeynes JCG, James T, Corney M. Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls. Methods Mol Biol 2024;2716:223-240. [PMID: 37702942 DOI: 10.1007/978-1-0716-3449-3_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/14/2023]

Fuenteslópez CV, McKitrick A, Corvi J, Ginebra MP, Hakimi O. Biomaterials text mining: A hands-on comparative study of methods on polydioxanone biocompatibility. N Biotechnol 2023;77:161-175. [PMID: 37673372 DOI: 10.1016/j.nbt.2023.09.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 08/14/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]

He F, Liu K, Yang Z, Chen Y, Hammer RD, Xu D, Popescu M. pathCLIP: Detection of Genes and Gene Relations from Biological Pathway Figures through Image-Text Contrastive Learning. bioRxiv 2023:2023.10.31.564859. [PMID: 37961680 PMCID: PMC10635012 DOI: 10.1101/2023.10.31.564859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]

Weber L, Barth F, Lorenz L, Konrath F, Huska K, Wolf J, Leser U. PEDL+: protein-centered relation extraction from PubMed at your fingertip. Bioinformatics 2023;39:btad603. [PMID: 37950510 PMCID: PMC10660277 DOI: 10.1093/bioinformatics/btad603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 08/29/2023] [Accepted: 10/31/2023] [Indexed: 11/12/2023] Open

Garda S, Weber-Genzel L, Martin R, Leser U. BELB: a biomedical entity linking benchmark. Bioinformatics 2023;39:btad698. [PMID: 37975879 PMCID: PMC10681865 DOI: 10.1093/bioinformatics/btad698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 10/30/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open

Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023;10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open

Wei CH, Luo L, Islamaj R, Lai PT, Lu Z. GNorm2: an improved gene name recognition and normalization system. Bioinformatics 2023;39:btad599. [PMID: 37878810 PMCID: PMC10612401 DOI: 10.1093/bioinformatics/btad599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 09/06/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open

Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023;146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]

Abstract

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

Marchesin S, Menotti L, Giachelle F, Silvello G, Alonso O. Building a large gene expression-cancer knowledge base with limited human annotations. Database (Oxford) 2023;2023:baad061. [PMID: 37768281 PMCID: PMC10533344 DOI: 10.1093/database/baad061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Revised: 07/27/2023] [Accepted: 08/25/2023] [Indexed: 09/29/2023]

Zhang Z, Fang M, Wu R, Zong H, Huang H, Tong Y, Xie Y, Cheng S, Wei Z, Crabbe MJC, Zhang X, Wang Y. Large-Scale Biomedical Relation Extraction Across Diverse Relation Types: Model Development and Usability Study on COVID-19. J Med Internet Res 2023;25:e48115. [PMID: 37632414 PMCID: PMC10551783 DOI: 10.2196/48115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Revised: 07/03/2023] [Accepted: 08/25/2023] [Indexed: 08/28/2023] Open

Abstract

BACKGROUND

Biomedical relation extraction (RE) is of great importance for researchers to conduct systematic biomedical studies. It not only helps knowledge mining, such as knowledge graphs and novel knowledge discovery, but also promotes translational applications, such as clinical diagnosis, decision-making, and precision medicine. However, the relations between biomedical entities are complex and diverse, and comprehensive biomedical RE is not yet well established.

OBJECTIVE

We aimed to investigate and improve large-scale RE with diverse relation types and conduct usability studies with application scenarios to optimize biomedical text mining.

METHODS

Data sets containing 125 relation types with different entity semantic levels were constructed to evaluate the impact of entity semantic information on RE, and performance analysis was conducted on different model architectures and domain models. This study also proposed a continued pretraining strategy and integrated models with scripts into a tool. Furthermore, this study applied RE to the COVID-19 corpus with article topics and application scenarios of clinical interest to assess and demonstrate its biological interpretability and usability.

RESULTS

The performance analysis revealed that RE achieves the best performance when the detailed semantic type is provided. For a single model, PubMedBERT with continued pretraining performed the best, with an F1-score of 0.8998. Usability studies on COVID-19 demonstrated the interpretability and usability of RE, and a relation graph database was constructed, which was used to reveal existing and novel drug paths with edge explanations. The models (including pretrained and fine-tuned models), integrated tool (Docker), and generated data (including the COVID-19 relation graph database and drug paths) have been made publicly available to the biomedical text mining community and clinical researchers.

CONCLUSIONS

This study provided a comprehensive analysis of RE with diverse relation types. Optimized RE models and tools for diverse relation types were developed, which can be widely used in biomedical text mining. Our usability studies provided a proof-of-concept demonstration of how large-scale RE can be leveraged to facilitate novel research.

Affiliation(s)

Zeyu Zhang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China Department of Clinical Laboratory Medicine Center, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
Meng Fang Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Rebecca Wu University of California, Berkeley, Berkeley, CA, United States
Hui Zong Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China Institutes for Systems Genetics, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
Honglian Huang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Yuantao Tong Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Yujia Xie Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Shiyang Cheng Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Ziyi Wei Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
M James C Crabbe Wolfson College, Oxford University, Oxford, United Kingdom Institute of Biomedical and Environmental Science & Technology, University of Bedfordshire, Luton, United Kingdom School of Life Sciences, Shanxi University, Taiyuan, China
Xiaoyan Zhang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
Ying Wang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China Department of Clinical Laboratory Medicine Center, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai, China

Jeynes JCG, Corney M, James T. A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction. PLoS One 2023;18:e0291142. [PMID: 37682956 PMCID: PMC10490933 DOI: 10.1371/journal.pone.0291142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Accepted: 08/22/2023] [Indexed: 09/10/2023] Open

Abstract

One area of active research is the use of natural language processing (NLP) to mine biomedical texts for sets of triples (subject-predicate-object) for knowledge graph (KG) construction. While statistical methods to mine co-occurrences of entities within sentences are relatively robust, accurate relationship extraction is more challenging. Herein, we evaluate the Global Network of Biomedical Relationships (GNBR), a dataset that uses distributional semantics to model relationships between biomedical entities. The focus of our paper is an evaluation of a subset of the GNBR data: the relationships between chemicals and genes/proteins. We use Evotec's structured 'Nexus' database of >2.76M chemical-protein interactions as a ground truth to compare with GNBR's relationships and find a micro-averaged precision-recall area under the curve (AUC) of 0.50 and a micro-averaged receiver operating characteristic (ROC) curve AUC of 0.71 across the relationship classes 'inhibits', 'binding', 'agonism' and 'antagonism', when a comparison is made on a sentence-by-sentence basis. We conclude that, even though these micro-averaged scores are modest, using a high threshold on certain relationship classes like 'inhibits' could yield high-fidelity triples that are not reported in structured datasets. We discuss how different methods of processing GNBR data and the factuality of triples could affect the accuracy of NLP data incorporated into knowledge graphs. We provide a GNBR-Nexus (ChEMBL-subset) merged datafile that contains over 20,000 sentences where a protein/gene and a chemical co-occur and includes both the GNBR relationship scores and the ChEMBL (manually curated) relationships (e.g., 'agonist', 'inhibitor'); this can be accessed at https://doi.org/10.5281/zenodo.8136752. We envisage this being used to aid curation efforts by the drug discovery community.

Neves M, Klippert A, Knöspel F, Rudeck J, Stolz A, Ban Z, Becker M, Diederich K, Grune B, Kahnau P, Ohnesorge N, Pucher J, Schönfelder G, Bert B, Butzke D. Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments. J Biomed Semantics 2023;14:13. [PMID: 37658458 PMCID: PMC10472567 DOI: 10.1186/s13326-023-00292-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 07/29/2023] [Indexed: 09/03/2023] Open

Abstract

Current animal protection laws require replacement of animal experiments with alternative methods, whenever such methods are suitable to reach the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of an enormously large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools specifically built for this purpose. Such tools are trained (or fine-tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the experimental model used according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed at improving the quality of the annotations with disagreements in the first round. Furthermore, we conducted various machine learning experiments based on supervised learning to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, and ranged from 0.42 (for "others") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment used the PubMedBERT pre-trained model with fine-tuning to our corpus, which gained an overall F-score of 0.83.
We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that our corpus is suitable for training reliable predictive models for automatic classification of biomedical literature according to the experimental models used. Our SMAFIRA ("Smart feature-based interactive") search tool (https://smafira.bf3r.de) will employ this classifier for supporting the retrieval of alternative methods to animal experiments. The corpus is available for download (https://doi.org/10.5281/zenodo.7152295), as well as the source code (https://github.com/mariananeves/goldhamster) and the model (https://huggingface.co/SMAFIRA/goldhamster).

Affiliation(s)

Mariana Neves German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany.
Antonina Klippert German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany Current affiliation: Nuvisan ICB GmbH, Müllerstraße 178, 13353, Berlin, Germany
Fanny Knöspel German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Juliane Rudeck German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Ailine Stolz German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Zsofia Ban German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Markus Becker German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Kai Diederich German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Barbara Grune German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Pia Kahnau German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Nils Ohnesorge German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Johannes Pucher German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Gilbert Schönfelder German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany; Institute of Clinical Pharmacology and Toxicology, Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
Bettina Bert German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
Daniel Butzke German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany

Sosa DN, Hintzen R, Xiong B, de Giorgio A, Fauqueur J, Davies M, Lever J, Altman RB. Associating biological context with protein-protein interactions through text mining at PubMed scale. J Biomed Inform 2023;145:104474. [PMID: 37572825 DOI: 10.1016/j.jbi.2023.104474] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/14/2022] [Revised: 08/03/2023] [Accepted: 08/05/2023] [Indexed: 08/14/2023]

Pu Y, Beck D, Verspoor K. Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease. J Biomed Inform 2023;145:104464. [PMID: 37541406 DOI: 10.1016/j.jbi.2023.104464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/22/2023] [Revised: 07/29/2023] [Accepted: 07/30/2023] [Indexed: 08/06/2023]

Abstract

OBJECTIVE

We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. The key link prediction setting of prediction window length is specifically examined in the context of a time-sliced evaluation methodology.

METHODS

We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed.
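The time-slicing step in the third stage can be sketched as follows. This is a minimal illustration, assuming each edge carries the year it first appeared in the literature; the function name, parameters, and toy edges are hypothetical, and the authors' actual procedure may differ.

```python
def time_sliced_split(edges, cutoff_year, window):
    """Split (node_u, node_v, first_year) edges into train and test sets.

    Train: edges first seen up to and including cutoff_year.
    Test: edges first seen within `window` years after the cutoff
    that are not already in the training graph.
    """
    train, test = set(), set()
    for u, v, year in edges:
        pair = frozenset((u, v))  # undirected edge
        if year <= cutoff_year:
            train.add(pair)
        elif year <= cutoff_year + window:
            test.add(pair)
    return train, test - train


# Hypothetical toy edges, not taken from the AD corpus.
toy_edges = [
    ("gene_a", "phenotype_x", 2000),
    ("gene_b", "phenotype_x", 2003),
    ("gene_c", "phenotype_y", 2013),
    ("gene_a", "phenotype_y", 2019),
]
train, test_short = time_sliced_split(toy_edges, cutoff_year=2005, window=10)
_, test_long = time_sliced_split(toy_edges, cutoff_year=2005, window=15)
print(len(train), len(test_short), len(test_long))
```

Lengthening the prediction window enlarges the set of future links that count as correct predictions, which is the setting the abstract examines.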

RESULTS

We constructed an AD corpus of over 16 k papers published in 1977-2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ∼11 k nodes and ∼394 k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years. A huge improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation.

CONCLUSION

Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying prediction window length on the time-sliced evaluation methodology leads to hugely different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases.
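As a rough check on the sparsity mentioned above, graph density can be computed from node and edge counts. This sketch assumes a simple undirected graph and plugs in the approximate totals from the RESULTS section (about 11 k nodes and 394 k edges); the densities of the individual time-sliced graphs used in the study are not given in the abstract and will differ.

```python
def undirected_density(num_nodes, num_edges):
    """Density of a simple undirected graph: edges / possible node pairs."""
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# Approximate totals quoted in the abstract (rounded figures).
approx_density = undirected_density(11_000, 394_000)
print(f"{approx_density:.2%}")
```

With these rounded counts the density comes out at roughly 0.65%, consistent with the "less than 1%" figure.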

AVAILABILITY

Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.

Basereh M, Caputo A, Brennan R. Automatic transparency evaluation for open knowledge extraction systems. J Biomed Semantics 2023;14:12. [PMID: 37653549 PMCID: PMC10468861 DOI: 10.1186/s13326-023-00293-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Accepted: 07/30/2023] [Indexed: 09/02/2023] Open

Abstract

BACKGROUND

This paper proposes Cyrus, a new transparency evaluation framework, for Open Knowledge Extraction (OKE) systems. Cyrus is based on the state-of-the-art transparency models and linked data quality assessment dimensions. It brings together a comprehensive view of transparency dimensions for OKE systems. The Cyrus framework is used to evaluate the transparency of three linked datasets, which are built from the same corpus by three state-of-the-art OKE systems. The evaluation is automatically performed using a combination of three state-of-the-art FAIRness (Findability, Accessibility, Interoperability, Reusability) assessment tools and a linked data quality evaluation framework, called Luzzu. This evaluation includes six Cyrus data transparency dimensions for which existing assessment tools could be identified. OKE systems extract structured knowledge from unstructured or semi-structured text in the form of linked data. These systems are fundamental components of advanced knowledge services. However, due to the lack of a transparency framework for OKE, most OKE systems are not transparent. This means that their processes and outcomes are not understandable and interpretable. A comprehensive framework sheds light on different aspects of transparency, allows comparison between the transparency of different systems by supporting the development of transparency scores, gives insight into the transparency weaknesses of the system, and ways to improve them. Automatic transparency evaluation helps with scalability and facilitates transparency assessment. The transparency problem has been identified as critical by the European Union Trustworthy Artificial Intelligence (AI) guidelines. In this paper, Cyrus provides the first comprehensive view of transparency dimensions for OKE systems by merging the perspectives of the FAccT (Fairness, Accountability, and Transparency), FAIR, and linked data quality research communities.

RESULTS

In Cyrus, data transparency includes ten dimensions, which are grouped into two categories. In this paper, six of these dimensions (provenance, interpretability, understandability, licensing, availability, and interlinking) have been evaluated automatically for three state-of-the-art OKE systems, using state-of-the-art metrics and tools. Covid-on-the-Web is identified as having the highest mean transparency.

CONCLUSIONS

This is the first research to study the transparency of OKE systems that provides a comprehensive set of transparency dimensions spanning ethics, trustworthy AI, and data quality approaches to transparency. It also demonstrates how to perform automated transparency evaluation that combines existing FAIRness and linked data quality assessment tools for the first time. We show that state-of-the-art OKE systems vary in the transparency of the linked data generated and that these differences can be automatically quantified leading to potential applications in trustworthy AI, compliance, data protection, data governance, and future OKE system design and testing.

Lee H, Jeon J, Jung D, Won JI, Kim K, Kim YJ, Yoon J. RelCurator: a text mining-based curation system for extracting gene-phenotype relationships specific to neurodegenerative disorders. Genes Genomics 2023;45:1025-1036. [PMID: 37300788 DOI: 10.1007/s13258-023-01405-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 05/18/2023] [Indexed: 06/12/2023]

Abstract

BACKGROUND

The identification of gene-phenotype relationships is important in medical genetics as it serves as a basis for precision medicine. However, most of the gene-phenotype relationship data are buried in the biomedical literature in textual form.

OBJECTIVE

We propose RelCurator, a curation system that extracts sentences including both gene and phenotype entities related to specific disease categories from PubMed articles, provides rich additional information such as entity taggings, and predictions of gene-phenotype relationships.

METHODS

We targeted neurodegenerative disorders and developed a deep learning model using Bidirectional Gated Recurrent Unit (BiGRU) networks and BioWordVec word embeddings for predicting gene-phenotype relationships from biomedical texts. The prediction model is trained with more than 130,000 labeled PubMed sentences including gene and phenotype entities, which are related to or unrelated to neurodegenerative disorders.

RESULTS

We compared the performance of our deep learning model with those of Bidirectional Encoder Representations from Transformers (BERT), Support Vector Machine (SVM), and simple Recurrent Neural Network (simple RNN) models. Our model performed better with an F1-score of 0.96. Furthermore, the evaluation done using a few curation cases in the real scenario showed the effectiveness of our work. Therefore, we conclude that RelCurator can identify not only new causative genes, but also new genes associated with neurodegenerative disorders' phenotype.

CONCLUSION

RelCurator is a user-friendly method for accessing deep learning-based supporting information and a concise web interface to assist curators while browsing the PubMed articles. Our curation process represents an important and broadly applicable improvement to the state of the art for the curation of gene-phenotype relationships.

Li X, Dai A, Tran R, Wang J. Text mining-based identification of promising miRNA biomarkers for diabetes mellitus. Front Endocrinol (Lausanne) 2023;14:1195145. [PMID: 37560309 PMCID: PMC10407569 DOI: 10.3389/fendo.2023.1195145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 07/05/2023] [Indexed: 08/11/2023] Open

Kowalski TW, Feira MF, Lord VO, Gomes JDA, Giudicelli GC, Fraga LR, Sanseverino MTV, Recamonde-Mendoza M, Schuler-Faccini L, Vianna FSL. A New Strategy for the Old Challenge of Thalidomide: Systems Biology Prioritization of Potential Immunomodulatory Drug (IMiD)-Targeted Transcription Factors. Int J Mol Sci 2023;24:11515. [PMID: 37511270 PMCID: PMC10380514 DOI: 10.3390/ijms241411515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 07/06/2023] [Accepted: 07/08/2023] [Indexed: 07/30/2023] Open

Affiliation(s)

Thayne Woycinck Kowalski Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil; Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Biomedical Sciences Course, Centro Universitário CESUCA, Cachoeirinha 94935-630, Brazil
Mariléa Furtado Feira Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil; Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
Vinícius Oliveira Lord Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Biomedical Sciences Course, Centro Universitário CESUCA, Cachoeirinha 94935-630, Brazil
Julia do Amaral Gomes Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
Giovanna Câmara Giudicelli Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil; Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
Lucas Rosa Fraga Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Post-Graduation Program in Medicine, Medical Sciences, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90035-003, Brazil; Department of Morphological Sciences, Institute of Health Sciences, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90010-150, Brazil
Maria Teresa Vieira Sanseverino Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil; Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; School of Medicine, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre 90619-900, Brazil
Mariana Recamonde-Mendoza Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Post-Graduation Program in Computer Science, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil
Lavinia Schuler-Faccini Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil; Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil
Fernanda Sales Luiz Vianna Graduate Program in Genetics and Molecular Biology, Genetics Department, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 91501-970, Brazil; Teratogen Information System (SIAT), Medical Genetics Service, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Laboratory of Genomic Medicine, Center of Experimental Research, Hospital de Clínicas de Porto Alegre (HCPA), Porto Alegre 90035-903, Brazil; Post-Graduation Program in Medicine, Medical Sciences, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90035-003, Brazil

Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023;143:104405. [PMID: 37270143 PMCID: PMC10528083 DOI: 10.1016/j.jbi.2023.104405] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Revised: 05/18/2023] [Accepted: 05/21/2023] [Indexed: 06/05/2023]

Abstract

BACKGROUND

Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge-bases yet exist to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition.

RESULTS

We present the first ignorance-base, a knowledge-base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge-base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements. These were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). The researchers could look to the field of neuroscience for potential answers to the ignorance statements.

CONCLUSION

Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.

Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets. ARXIV 2023:arXiv:2306.11189v1. [PMID: 37502629 PMCID: PMC10370213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]

Knafou J, Haas Q, Borissov N, Counotte M, Low N, Imeri H, Ipekci AM, Buitrago-Garcia D, Heron L, Amini P, Teodoro D. Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature. Syst Rev 2023;12:94. [PMID: 37277872 DOI: 10.1186/s13643-023-02247-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 04/24/2023] [Indexed: 06/07/2023] Open

Abstract

BACKGROUND

The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process.

METHODS

In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article.

RESULTS

The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with an F1-score of up to 97% on a subset of 80% of the collection instead of 93% on the whole dataset.
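The unanimity voting rule can be sketched as follows. This is a minimal illustration, assuming each model emits one label per document and the ensemble abstains whenever the models disagree; the function name, label names, and toy predictions are hypothetical and not taken from the study.

```python
def unanimity_vote(model_predictions):
    """Return one decision per document: the shared label, or None.

    model_predictions: list of per-model label lists, aligned by document.
    A label is emitted only when every model predicts the same one.
    """
    decisions = []
    for votes in zip(*model_predictions):
        decisions.append(votes[0] if len(set(votes)) == 1 else None)
    return decisions


# Hypothetical predictions from three models over four documents.
model_1 = ["original", "review", "original", "original"]
model_2 = ["original", "review", "review", "original"]
model_3 = ["original", "review", "original", "original"]

decisions = unanimity_vote([model_1, model_2, model_3])
coverage = sum(d is not None for d in decisions) / len(decisions)
print(decisions, coverage)
```

The rule trades coverage for confidence: documents on which the models disagree are left undecided and can be routed to manual review.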

CONCLUSION

This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting alternative to annotate a subset with higher predictive confidence.

Jeong M, Kang J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023;39:btad361. [PMID: 37261870 PMCID: PMC10272703 DOI: 10.1093/bioinformatics/btad361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 04/17/2023] [Accepted: 05/31/2023] [Indexed: 06/02/2023] Open

Allot A, Wei CH, Phan L, Hefferon T, Landrum M, Rehm HL, Lu Z. Tracking genetic variants in the biomedical literature using LitVar 2.0. Nat Genet 2023;55:901-903. [PMID: 37268776 PMCID: PMC11096795 DOI: 10.1038/s41588-023-01414-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 06/04/2023]

Faessler E, Hahn U, Schäuble S. GePI: large-scale text mining, customized retrieval and flexible filtering of gene/protein interactions. Nucleic Acids Res 2023:7177881. [PMID: 37224532 DOI: 10.1093/nar/gkad445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 05/01/2023] [Accepted: 05/11/2023] [Indexed: 05/26/2023] Open