1
Ming S, Zhang R, Kilicoglu H. Enhancing the coverage of SemRep using a relation classification approach. J Biomed Inform 2024; 155:104658. [PMID: 38782169 DOI: 10.1016/j.jbi.2024.104658] [Received: 12/19/2023] [Revised: 05/01/2024] [Accepted: 05/18/2024] [Indexed: 05/25/2024]
Abstract
OBJECTIVE: Relation extraction is an essential task in biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB. METHODS: We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts. RESULTS: Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on 12K abstracts shows that the model could double the size of SemMedDB when applied to the entire PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct. CONCLUSION: These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining.
Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.
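The F1 score quoted in the abstract can be checked against the reported precision and recall. The sketch below recomputes it as the harmonic mean of the two; the function name `f1_score` is ours and is not taken from the authors' repository. The abstract does not state an F1 for SemRep, so the baseline value printed here is only derived from the quoted precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall figures quoted in the abstract above.
semrep_f1 = f1_score(0.69, 0.42)  # SemRep baseline (F1 derived, not reported)
model_f1 = f1_score(0.62, 0.81)   # best relation classification model

print(f"SemRep F1: {semrep_f1:.2f}")  # 0.52
print(f"Model F1:  {model_f1:.2f}")   # 0.70
```

Under these figures the recall gain outweighs the drop in precision, which is why the F1 rises even though precision falls from 0.69 to 0.62.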
Affiliation(s)
- Shufan Ming
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel St., Champaign, 61820, IL, USA
- Rui Zhang
  - Division of Computational Health Sciences, Department of Surgery, University of Minnesota, 516 Delaware St SE, Minneapolis, 55455, MN, USA
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel St., Champaign, 61820, IL, USA
2
Du J, Soysal E, Wang D, He L, Lin B, Wang J, Manion FJ, Li Y, Wu E, Yao L. Machine learning models for abstract screening task - A systematic literature review application for health economics and outcome research. BMC Med Res Methodol 2024; 24:108. [PMID: 38724903 PMCID: PMC11080200 DOI: 10.1186/s12874-024-02224-3] [Received: 05/19/2023] [Accepted: 04/18/2024] [Indexed: 05/13/2024] [Open access]
Abstract
OBJECTIVE: Systematic literature reviews (SLRs) are critical for life-science research. However, the manual selection and retrieval of relevant publications can be a time-consuming process. This study aims to (1) develop two disease-specific annotated corpora, one for human papillomavirus (HPV) associated diseases and the other for pneumococcal-associated pediatric diseases (PAPD), and (2) optimize machine- and deep-learning models to facilitate automation of SLR abstract screening. METHODS: This study constructed two disease-specific SLR screening corpora for HPV and PAPD, which contained citation metadata and corresponding abstracts. Performance was evaluated using precision, recall, accuracy, and F1-score for multiple combinations of machine- and deep-learning algorithms and features such as keywords and MeSH terms. RESULTS AND CONCLUSIONS: The HPV corpus contained 1697 entries, with 538 relevant and 1159 irrelevant articles. The PAPD corpus included 2865 entries, with 711 relevant and 2154 irrelevant articles. Adding features beyond title and abstract improved the accuracy of machine learning models by 3% for the HPV corpus and 2% for the PAPD corpus. Transformer-based deep learning models consistently outperformed conventional machine learning algorithms, highlighting the strength of domain-specific pre-trained language models for SLR abstract screening. This study provides a foundation for the development of more intelligent SLR systems.
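Because the abstract reports accuracy gains on corpora with unequal class sizes, it can help to see the class balance implied by the quoted counts. The sketch below computes, for each corpus, the share of relevant articles and the accuracy of a trivial "label everything irrelevant" baseline. That baseline is our own illustration and was not evaluated in the study; the names `corpora` and `summarize` are likewise hypothetical.

```python
# Corpus sizes quoted in the abstract above.
corpora = {
    "HPV": {"relevant": 538, "irrelevant": 1159},
    "PAPD": {"relevant": 711, "irrelevant": 2154},
}


def summarize(counts: dict) -> dict:
    """Return corpus size, share of relevant items, and majority-class accuracy."""
    total = counts["relevant"] + counts["irrelevant"]
    return {
        "total": total,
        "relevant_fraction": counts["relevant"] / total,
        # Accuracy of always predicting the larger class (illustrative baseline).
        "majority_baseline_accuracy": max(counts.values()) / total,
    }


summary = {name: summarize(counts) for name, counts in corpora.items()}
for name, s in summary.items():
    print(
        f"{name}: n={s['total']}, "
        f"relevant={s['relevant_fraction']:.1%}, "
        f"majority-baseline accuracy={s['majority_baseline_accuracy']:.1%}"
    )
```

A baseline of this kind is one reason precision, recall, and F1-score are reported alongside accuracy for screening tasks.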
Affiliation(s)
- Ekin Soysal
  - Intelligent Medical Objects, Houston, TX, USA
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Long He
  - Intelligent Medical Objects, Houston, TX, USA
- Bin Lin
  - Intelligent Medical Objects, Houston, TX, USA
- Jingqi Wang
  - Intelligent Medical Objects, Houston, TX, USA
- Yeran Li
  - Merck & Co., Inc, Rahway, NJ, USA
- Elise Wu
  - Merck & Co., Inc, Rahway, NJ, USA
3
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024:gkae235. [PMID: 38572754 DOI: 10.1093/nar/gkae235] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] [Open access]
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
Affiliation(s)
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Alexis Allot
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Shubo Tian
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qiao Jin
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhizheng Wang
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
4
Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024; 25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] [Open access]
Abstract
BACKGROUND: The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large annotated corpora, often manually or semi-manually, which are vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations and integrates automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools. RESULTS: We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance. CONCLUSIONS: MetaTron stands out as one of the few annotation tools targeting the biomedical domain that support the annotation of relations and are fully customizable with documents in several formats (PDF included), as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
Affiliation(s)
- Ornella Irrera
  - Department of Information Engineering, University of Padova, Padua, Italy
- Stefano Marchesin
  - Department of Information Engineering, University of Padova, Padua, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padova, Padua, Italy
5
Xiong J, Liu X, Li Z, Xiao H, Wang G, Niu Z, Fei C, Zhong F, Wang G, Zhang W, Fu Z, Liu Z, Chen K, Jiang H, Zheng M. αExtractor: a system for automatic extraction of chemical information from biomedical literature. Sci China Life Sci 2024; 67:618-621. [PMID: 37758905 DOI: 10.1007/s11427-023-2388-x] [Received: 02/22/2023] [Accepted: 06/07/2023] [Indexed: 09/29/2023]
Affiliation(s)
- Jiacheng Xiong
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Xiaohong Liu
  - AI Department, Suzhou Alphama Biotechnology Co., Ltd., Suzhou, 215125, China
- Zhaojun Li
  - AI Department, Suzhou Alphama Biotechnology Co., Ltd., Suzhou, 215125, China
  - College of Computer and Information Engineering, Dezhou University, Dezhou, 253023, China
- Hongzhong Xiao
  - AI Department, Suzhou Alphama Biotechnology Co., Ltd., Suzhou, 215125, China
- Guangchao Wang
  - College of Computer and Information Engineering, Dezhou University, Dezhou, 253023, China
- Zhenjiang Niu
  - AI Department, Suzhou Alphama Biotechnology Co., Ltd., Suzhou, 215125, China
- Chaoyuan Fei
  - AI Department, Suzhou Alphama Biotechnology Co., Ltd., Suzhou, 215125, China
- Feisheng Zhong
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Gang Wang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Wei Zhang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Zunyun Fu
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Zhiguo Liu
  - AI Department, Suzhou Alphama Biotechnology Co., Ltd., Suzhou, 215125, China
- Kaixian Chen
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Hualiang Jiang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Mingyue Zheng
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
6
Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024; 100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] [Open access]
Abstract
Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming to help readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: (1) evidence-based medicine; (2) precision medicine and genomics; (3) searching by meaning, including questions; (4) finding related articles with literature recommendation; and (5) discovering hidden associations through literature mining. Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.
Affiliation(s)
- Qiao Jin
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
7
Azam M, Chen Y, Arowolo MO, Liu H, Popescu M, Xu D. A Comprehensive Evaluation of Large Language Models in Mining Gene Interactions and Pathway Knowledge. bioRxiv 2024:2024.01.21.576542. [PMID: 38328046 PMCID: PMC10849485 DOI: 10.1101/2024.01.21.576542] [Indexed: 02/09/2024]
Abstract
Background: Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways is useful but cannot keep up with the exponential growth of the literature. Large language models (LLMs), notable for their vast parameter sizes and comprehensive training on extensive text corpora, have great potential in automated text mining of biological pathways. Method: This study assesses the effectiveness of 21 LLMs, including both API-based models and open-source models. The evaluation focused on two key aspects: gene regulatory relations (specifically, 'activation', 'inhibition', and 'phosphorylation') and KEGG pathway component recognition. The performance of these models was analyzed using statistical metrics such as precision, recall, F1 scores, and the Jaccard similarity index. Results: Our results indicated a significant disparity in model performance. Among the API-based models, ChatGPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; among them, Falcon-180b-chat and llama1-7b led with the highest performance in gene regulatory relations (F1 of 0.2787 and 0.1923, respectively) and KEGG pathway recognition (Jaccard similarity index of 0.2237 and 0.2207, respectively). Conclusion: LLMs are valuable in biomedical research, especially in gene network analysis and pathway mapping. However, their effectiveness varies, necessitating careful model selection. This work also provided a case study and insight into using LLMs as knowledge graphs.
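The Jaccard similarity index used for the KEGG pathway task measures the overlap between a predicted set and a reference set as the size of their intersection divided by the size of their union. The sketch below shows the calculation on two made-up gene lists; the gene sets and the function name `jaccard` are illustrative assumptions and do not come from the study's data or code.

```python
def jaccard(predicted, reference) -> float:
    """Jaccard index |A & B| / |A | B| for two collections of items."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(predicted & reference) / len(union)


# Hypothetical gene lists, invented purely for illustration.
reference_genes = {"TP53", "MDM2", "CDKN1A", "ATM", "CHEK2"}
predicted_genes = {"TP53", "MDM2", "BAX", "ATM"}

score = jaccard(predicted_genes, reference_genes)
print(f"Jaccard index: {score:.4f}")  # 3 shared genes / 6 distinct genes = 0.5000
```

Unlike F1, the index penalizes both missing and spurious members through the union in the denominator, so a model that lists many extra genes scores low even if it recovers most of the reference set.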
Affiliation(s)
- Muhammad Azam
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
  - Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Yibo Chen
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
  - Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
  - Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
- Micheal Olaolu Arowolo
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
  - Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Haowang Liu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
  - Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Mihail Popescu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
  - Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
  - Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
  - Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, Missouri, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
  - Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
  - Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
8
Martini L, Baek SH, Lo I, Raby BA, Silverman E, Weiss S, Glass K, Halu A. Detecting and dissecting signaling crosstalk via the multilayer network integration of signaling and regulatory interactions. Nucleic Acids Res 2024; 52:e5. [PMID: 37953325 PMCID: PMC10783515 DOI: 10.1093/nar/gkad1035] [Received: 10/28/2022] [Revised: 06/27/2023] [Accepted: 10/23/2023] [Indexed: 11/14/2023] [Open access]
Abstract
The versatility of cellular response arises from the communication, or crosstalk, of signaling pathways in a complex network of signaling and transcriptional regulatory interactions. Understanding the various mechanisms underlying crosstalk on a global scale requires untargeted computational approaches. We present a network-based statistical approach, MuXTalk, that uses high-dimensional edges called multilinks to model the unique ways in which signaling and regulatory interactions can interface. We demonstrate that the signaling-regulatory interface is located primarily in the intermediary region between signaling pathways where crosstalk occurs, and that multilinks can differentiate between distinct signaling-transcriptional mechanisms. Using statistically over-represented multilinks as proxies of crosstalk, we infer crosstalk among 60 signaling pathways, expanding currently available crosstalk databases by more than five-fold. MuXTalk surpasses existing methods in terms of model performance metrics, identifies additions to manual curation efforts, and pinpoints potential mediators of crosstalk. Moreover, it accommodates the inherent context-dependence of crosstalk, allowing future applications to cell type- and disease-specific crosstalk.
Affiliation(s)
- Leonardo Martini
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
  - Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Rome, 00185, Italy
- Seung Han Baek
  - Division of Pulmonary Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Ian Lo
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Benjamin A Raby
  - Division of Pulmonary Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Edwin K Silverman
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Scott T Weiss
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Kimberly Glass
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Arda Halu
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
9
Wu Z, Feng C, Hu Y, Zhou Y, Li S, Zhang S, Hu Y, Chen Y, Chao H, Ni Q, Chen M. HALD, a human aging and longevity knowledge graph for precision gerontology and geroscience analyses. Sci Data 2023; 10:851. [PMID: 38040715 PMCID: PMC10692171 DOI: 10.1038/s41597-023-02781-0] [Received: 05/16/2023] [Accepted: 11/23/2023] [Indexed: 12/03/2023] [Open access]
Abstract
Human aging is a natural and inevitable biological process that leads to an increased risk of aging-related diseases. Developing anti-aging therapies for aging-related diseases requires a comprehensive understanding of the mechanisms and effects of aging and longevity from a multi-modal and multi-faceted perspective. However, most of the relevant knowledge is scattered in the biomedical literature, the volume of which has reached 36 million articles in PubMed. Here, we present HALD, a text mining-based biomedical knowledge graph dataset on human aging and longevity, built from all published literature related to human aging and longevity in PubMed. HALD integrates multiple state-of-the-art natural language processing (NLP) techniques to improve the accuracy and coverage of the knowledge graph for precision gerontology and geroscience analyses. As of September 2023, HALD contained 12,227 entities in 10 types (gene, RNA, protein, carbohydrate, lipid, peptide, pharmaceutical preparations, toxin, mutation, and disease), 115,522 relations, 1,855 aging biomarkers, and 525 longevity biomarkers from 339,918 biomedical articles in PubMed. HALD is available at https://bis.zju.edu.cn/hald.
Affiliation(s)
- Zexu Wu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Cong Feng
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
  - The First Affiliated Hospital, Zhejiang University School of Medicine; Institute of Hematology, Zhejiang University, Hangzhou, 310058, China
- Yanshi Hu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Yincong Zhou
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
  - Joint Research Centre for Engineering Biology, Zhejiang University-University of Edinburgh Institute, Zhejiang University, Haining, 314400, China
- Sida Li
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Shilong Zhang
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Yueming Hu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Yuhao Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Haoyu Chao
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Qingyang Ni
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
  - The First Affiliated Hospital, Zhejiang University School of Medicine; Institute of Hematology, Zhejiang University, Hangzhou, 310058, China
  - Joint Research Centre for Engineering Biology, Zhejiang University-University of Edinburgh Institute, Zhejiang University, Haining, 314400, China
10
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC Bioinformatics 2023; 24:412. [PMID: 37915001 PMCID: PMC10619245 DOI: 10.1186/s12859-023-05539-y] [Received: 05/24/2023] [Accepted: 10/19/2023] [Indexed: 11/03/2023] [Open access]
Abstract
BACKGROUND: The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: (1) they identify a relationship but not the type of relationship, (2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, (3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or (4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. RESULTS: We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. Finally, we provide a simple and intuitive open-source web interface (https://skim.morgridge.org) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. CONCLUSIONS: SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of the existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
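The A-B-C pattern described in this abstract can be illustrated with a toy co-occurrence search. The sketch below is not SKiM's implementation: the abstract says links must be statistically significant but does not name the test, so a one-sided hypergeometric tail (equivalent to a one-sided Fisher's exact test) is assumed here. The term index, document IDs, corpus size, threshold `alpha`, and all function names are hypothetical.

```python
from math import comb


def cooccurrence_p_value(docs_x: set, docs_y: set, n_docs: int) -> float:
    """P(overlap >= observed) if the two terms occurred independently."""
    k = len(docs_x & docs_y)
    nx, ny = len(docs_x), len(docs_y)
    # Hypergeometric upper tail; math.comb returns 0 when k > n.
    tail = sum(
        comb(nx, i) * comb(n_docs - nx, ny - i)
        for i in range(k, min(nx, ny) + 1)
    )
    return tail / comb(n_docs, ny)


def abc_search(a_term, b_terms, c_terms, index, n_docs, alpha=0.05):
    """Return (A, B, C) triples where both A-B and B-C co-occur significantly."""
    hits = []
    for b in b_terms:
        if cooccurrence_p_value(index[a_term], index[b], n_docs) >= alpha:
            continue  # B is not significantly linked to A; skip its C terms
        for c in c_terms:
            if cooccurrence_p_value(index[b], index[c], n_docs) < alpha:
                hits.append((a_term, b, c))
    return hits


# Toy inverted index: term -> IDs of documents mentioning it (invented data).
index = {
    "raynaud disease": {1, 2, 3, 4, 5},
    "blood viscosity": {3, 4, 5, 6, 7, 8},
    "fish oil": {6, 7, 8, 9},
    "aspirin": {15, 16, 17},
}

hits = abc_search(
    a_term="raynaud disease",
    b_terms=["blood viscosity"],
    c_terms=["fish oil", "aspirin"],
    index=index,
    n_docs=100,
)
print(hits)
```

In this toy index the A and C terms never share a document, so the link surfaces only through the B term, which is the situation LBD is meant to detect.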
Affiliation(s)
- Kalpana Raja
  - Morgridge Institute for Research, Madison, WI, USA
  - Currently at Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
- John Steill
  - Morgridge Institute for Research, Madison, WI, USA
- Cannon Lock
  - Morgridge Institute for Research, Madison, WI, USA
- Xuancheng Tu
  - Morgridge Institute for Research, Madison, WI, USA
- Ian Ross
  - Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
- Lam C Tsoi
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Finn Kuusisto
  - Morgridge Institute for Research, Madison, WI, USA
  - Currently at Data Science Institute, University of Wisconsin, Madison, WI, USA
- Zijian Ni
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Currently at Amazon, Seattle, WA, USA
- Miron Livny
  - Morgridge Institute for Research, Madison, WI, USA
  - Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
- James Thomson
  - Morgridge Institute for Research, Madison, WI, USA
  - Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI, USA
- Ron Stewart
  - Morgridge Institute for Research, Madison, WI, USA
Collapse
|
11
|
de Couvreur LA, Cobo MJ, Kennedy PJ, Ellis JT. Bibliometric analysis of parasite vaccine research from 1990 to 2019. Vaccine 2023; 41:6468-6477. [PMID: 37777454 DOI: 10.1016/j.vaccine.2023.09.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 08/21/2023] [Accepted: 09/19/2023] [Indexed: 10/02/2023]
Abstract
Bibliometric and bibliographic analyses are popular tools for investigating publication metrics and thematic transitions in an expanding codex of biomedical literature. Bibliometric techniques have been employed in parasitology and vaccinology, with only a few malaria-specific literature analyses being reported specifically on parasite vaccines. The pursuit of parasite prophylactics is an important global endeavour, both medically and economically. As such, a comprehensive understanding of the research topics would be a valuable tool in assessing the current status and future directions of parasite vaccine development. Consequently, this study investigated parasite vaccinology from 1990 to 2019 by analysing literature exported from the Web of Science and Dimensions databases using two commonly used bibliometric programs: SciMAT and VOSviewer. The results of this study show the common, emerging, and transient themes within the discipline, and where the future lies as vaccine development moves further into the age of omics and informatics.
Affiliation(s)
- L A de Couvreur
  - School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW, Australia
- M J Cobo
  - Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, Granada, Spain
- P J Kennedy
  - School of Software, Faculty of Engineering and Information Technology and the Australian Artificial Intelligence Institute, University of Technology Sydney, PO Box 123, Broadway, NSW, Australia
- J T Ellis
  - School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW, Australia
|
12
|
Cai L, Li J, Lv H, Liu W, Niu H, Wang Z. Integrating domain knowledge for biomedical text analysis into deep learning: A survey. J Biomed Inform 2023; 143:104418. [PMID: 37290540 DOI: 10.1016/j.jbi.2023.104418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Revised: 04/24/2023] [Accepted: 05/31/2023] [Indexed: 06/10/2023]
Abstract
The past decade has witnessed an explosion of textual information in the biomedical field. Biomedical texts provide a basis for healthcare delivery, knowledge discovery, and decision-making. Over the same period, deep learning has achieved remarkable performance in biomedical natural language processing; however, its development has been limited by the scarcity of well-annotated datasets and by poor interpretability. To address this, researchers have considered combining domain knowledge (such as biomedical knowledge graphs) with biomedical data, which has become a promising means of introducing more information into biomedical datasets and following evidence-based medicine. This paper comprehensively reviews more than 150 recent studies on incorporating domain knowledge into deep learning models to facilitate typical biomedical text analysis tasks, including information extraction, text classification, and text generation. Finally, we discuss various challenges and future directions.
Affiliation(s)
- Linkun Cai
  - School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
- Jia Li
  - Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
- Han Lv
  - Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
- Wenjuan Liu
  - Aerospace Center Hospital, 100049 Beijing, China
- Haijun Niu
  - School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
- Zhenchang Wang
  - School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China; Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
|
13
|
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) Discovers and Annotates Biomedical Knowledge Using Co-Occurrence and Transformer Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.30.542911. [PMID: 37397987 PMCID: PMC10312590 DOI: 10.1101/2023.05.30.542911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Background The PubMed database contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: 1) they identify a relationship but not the type of relationship, 2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, 3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or 4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. Results We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM.
Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. Conclusions SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
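The A-B-C pattern described in this abstract can be illustrated with a minimal sketch. This is not the SKiM implementation: the abstract says only that links are "statistically significant", so the one-sided Fisher's exact test used here is an assumption, and the toy corpus and the function names (`fisher_right_tail`, `linked`, `abc_search`) are hypothetical.

```python
from math import comb

def fisher_right_tail(k, n_x, n_y, total):
    """P(overlap >= k) under the hypergeometric null (assumed test, not confirmed by the abstract)."""
    upper = min(n_x, n_y)
    denom = comb(total, n_y)
    return sum(comb(n_x, i) * comb(total - n_x, n_y - i)
               for i in range(k, upper + 1)) / denom

def docs_with(term, corpus):
    """Indices of the abstracts (modelled as sets of terms) that mention `term`."""
    return {i for i, terms in enumerate(corpus) if term in terms}

def linked(x, y, corpus, alpha=0.05):
    """True if x and y co-occur more often than chance at level alpha."""
    dx, dy = docs_with(x, corpus), docs_with(y, corpus)
    k = len(dx & dy)
    if k == 0:
        return False
    return fisher_right_tail(k, len(dx), len(dy), len(corpus)) < alpha

def abc_search(a, b_terms, c_terms, corpus, alpha=0.05):
    """Return (A, B, C) triples where A-B and B-C are both significantly linked."""
    hits = []
    for b in b_terms:
        if not linked(a, b, corpus, alpha):
            continue  # skip B terms that are not tied to A
        for c in c_terms:
            if linked(b, c, corpus, alpha):
                hits.append((a, b, c))
    return hits

# Hypothetical 20-abstract corpus; A and C never co-occur directly.
corpus = ([{"fish oil", "blood viscosity"}] * 4
          + [{"blood viscosity", "raynaud disease"}] * 4
          + [{"caffeine", "migraine"}] * 6
          + [{"unrelated"}] * 6)

hits = abc_search("fish oil",
                  ["blood viscosity", "caffeine"],
                  ["raynaud disease", "migraine"],
                  corpus)
print(hits)
```

In this toy corpus the A and C terms share no abstract, so the connection surfaces only through the B term, which is the situation LBD targets. A real system would also need multiple-testing control when thousands of C terms are queried.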
|
14
|
Oliveira Dos Santos Á, Sergio da Silva E, Machado Couto L, Valadares Labanca Reis G, Silva Belo V. The use of artificial intelligence for automating or semi-automating biomedical literature analyses: a scoping review. J Biomed Inform 2023; 142:104389. [PMID: 37187321 DOI: 10.1016/j.jbi.2023.104389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 04/11/2023] [Accepted: 05/08/2023] [Indexed: 05/17/2023]
Abstract
OBJECTIVE Evidence-based medicine (EBM) is a decision-making process based on the conscious and judicious use of the best available scientific evidence. However, the exponential increase in the amount of information currently available likely exceeds the capacity of human-only analysis. In this context, artificial intelligence (AI) and its branches such as machine learning (ML) can be used to facilitate human efforts in analyzing the literature to foster EBM. The present scoping review aimed to examine the use of AI in the automation of biomedical literature survey and analysis with a view to establishing the state-of-the-art and identifying knowledge gaps. MATERIALS AND METHODS Comprehensive searches of the main databases were performed for articles published up to June 2022 and studies were selected according to inclusion and exclusion criteria. Data were extracted from the included articles and the findings categorized. RESULTS The total number of records retrieved from the databases was 12,145, of which 273 were included in the review. Classification of the studies according to the use of AI in evaluating the biomedical literature revealed three main application groups, namely assembly of scientific evidence (n=127; 47%), mining the biomedical literature (n=112; 41%) and quality analysis (n=34; 12%). Most studies addressed the preparation of systematic reviews, while articles focusing on the development of guidelines and evidence synthesis were the least frequent. The biggest knowledge gap was identified within the quality analysis group, particularly regarding methods and tools that assess the strength of recommendation and consistency of evidence. 
CONCLUSION Our review shows that, despite significant progress in the automation of biomedical literature surveys and analyses in recent years, intense research is needed to fill knowledge gaps on more difficult aspects of ML, deep learning and natural language processing, and to consolidate the use of automation by end-users (biomedical researchers and healthcare professionals).
Affiliation(s)
- Eduardo Sergio da Silva
  - Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil
- Letícia Machado Couto
  - Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil
- Vinícius Silva Belo
  - Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil
|
15
|
Lokker C, Bagheri E, Abdelkader W, Parrish R, Afzal M, Navarro T, Cotoi C, Germini F, Linkins L, Brian Haynes R, Chu L, Iorio A. Deep Learning to Refine the Identification of High-Quality Clinical Research Articles from the Biomedical Literature: Performance Evaluation. J Biomed Inform 2023; 142:104384. [PMID: 37164244 DOI: 10.1016/j.jbi.2023.104384] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/15/2022] [Revised: 04/24/2023] [Accepted: 05/03/2023] [Indexed: 05/12/2023]
Abstract
BACKGROUND Identifying practice-ready evidence-based journal articles in medicine is a challenge due to the sheer volume of biomedical research publications. Newer approaches to support evidence discovery apply deep learning techniques to improve the efficiency and accuracy of classifying sound evidence. OBJECTIVE To determine how well deep learning models using variants of Bidirectional Encoder Representations from Transformers (BERT) identify high-quality evidence with high clinical relevance from the biomedical literature for consideration in clinical practice. METHODS We fine-tuned variations of BERT models (BERT-Base, BioBERT, BlueBERT, and PubMedBERT) and compared their performance in classifying articles based on methodological quality criteria. The dataset used for fine-tuning models included titles and abstracts of >160,000 PubMed records from 2012-2020 that were of interest to human health and had been manually labeled based on meeting established critical appraisal criteria for methodological rigor. The data was randomly divided into 80:10:10 sets for training, validating, and testing. In addition to using the full unbalanced set, the training data was randomly undersampled into four balanced datasets to assess performance and select the best performing model. For each of the four sets, one model that maintained sensitivity (recall) at ≥99% was selected, and the four selected models were ensembled. The best performing model was evaluated in a prospective, blinded test and applied to an established reference standard, the Clinical Hedges dataset. RESULTS In training, three of the four selected best performing models were trained using BioBERT-Base. The ensembled model did not boost performance compared with the best individual model. Hence a solo BioBERT-based model (named DL-PLUS) was selected for further testing as it was computationally more efficient.
The model had high recall (>99%) and 60% to 77% specificity in a prospective evaluation conducted with blinded research associates and saved >60% of the work required to identify high quality articles. CONCLUSIONS Deep learning using pretrained language models and a large dataset of classified articles produced models with improved specificity while maintaining >99% recall. The resulting DL-PLUS model identifies high-quality, clinically relevant articles from PubMed at the time of publication. The model improves the efficiency of a literature surveillance program, which allows for faster dissemination of appraised research.
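The data handling described in the methods (an 80:10:10 random split, random undersampling of the training set into balanced subsets, and evaluation by recall and specificity) can be sketched as follows. This is an illustrative sketch on synthetic records, not the authors' pipeline; the function names, the `(record_id, label)` layout, and the seeds are assumptions.

```python
import random

def split_80_10_10(records, seed=0):
    """Shuffle and split records into train/validation/test at 80:10:10."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def undersample(train, seed=0):
    """Randomly drop majority-class records until both classes are the same size."""
    rng = random.Random(seed)
    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

def recall_specificity(y_true, y_pred):
    """Recall = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic, imbalanced data: 1,000 (record_id, label) pairs with 10% positives.
records = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
train, val, test = split_80_10_10(records)

# Four balanced training subsets, one per seed, mirroring the "four balanced datasets".
balanced_sets = [undersample(train, seed=s) for s in range(4)]

recall, specificity = recall_specificity([1, 1, 0, 0, 0], [1, 1, 1, 0, 0])
print(len(train), len(val), len(test), recall, specificity)
```

Undersampling is applied to the training portion only, so the validation and test sets keep the original class imbalance that the reported recall and specificity are measured against.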
Affiliation(s)
- Cynthia Lokker
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Elham Bagheri
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Wael Abdelkader
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Rick Parrish
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Muhammad Afzal
  - Department of Computing, Birmingham City University, Birmingham, UK
- Tamara Navarro
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Chris Cotoi
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Federico Germini
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada; Department of Medicine, McMaster University, Hamilton, Ontario, Canada
- Lori Linkins
  - Department of Medicine, McMaster University, Hamilton, Ontario, Canada
- R Brian Haynes
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada; Department of Medicine, McMaster University, Hamilton, Ontario, Canada
- Lingyang Chu
  - Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada
- Alfonso Iorio
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada; Department of Medicine, McMaster University, Hamilton, Ontario, Canada
|
16
|
Su C, Hou Y, Zhou M, Rajendran S, Maasch JRA, Abedi Z, Zhang H, Bai Z, Cuturrufo A, Guo W, Chaudhry FF, Ghahramani G, Tang J, Cheng F, Li Y, Zhang R, DeKosky ST, Bian J, Wang F. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience 2023; 26:106460. [PMID: 37020958 PMCID: PMC10068563 DOI: 10.1016/j.isci.2023.106460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 09/20/2022] [Accepted: 03/16/2023] [Indexed: 04/01/2023] Open
Abstract
The abundance of biomedical knowledge gained from biological experiments and clinical practices is an invaluable resource for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedical and life science. In this study, we created a comprehensive BKG called the integrative Biomedical Knowledge Hub (iBKH) by harmonizing and integrating information from diverse biomedical resources. To make iBKH easily accessible for biomedical research, we developed a web-based, user-friendly graphical portal that allows fast and interactive knowledge retrieval. Additionally, we implemented an efficient and scalable graph learning pipeline for discovering novel biomedical knowledge in iBKH. As a proof of concept, we applied our iBKH-based method to computational in-silico drug repurposing for Alzheimer's disease. The iBKH is publicly available.
Affiliation(s)
- Chang Su
  - Department of Health Service Administration and Policy, College of Public Health, Temple University, Philadelphia, PA 19122, USA
- Yu Hou
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Manqi Zhou
  - Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
- Suraj Rajendran
  - Tri-Institutional Computational Biology & Medicine Program, Cornell University, New York, NY 10065, USA
- Zehra Abedi
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Haotan Zhang
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Zilong Bai
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Winston Guo
  - Department of Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Fayzan F. Chaudhry
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Gregory Ghahramani
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Jian Tang
  - Mila-Quebec AI Institute and HEC Montreal, Montreal, QC H2S 3H1, Canada
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
  - Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
- Yue Li
  - School of Computer Science, McGill University, Montreal, QC H3A 0C6, Canada
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Steven T. DeKosky
  - Department of Neurology, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Jiang Bian
  - Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
|
17
|
Luo M, Li S, Pang Y, Yao L, Ma R, Huang HY, Huang HD, Lee TY. Extraction of microRNA-target interaction sentences from biomedical literature by deep learning approach. Brief Bioinform 2023; 24:6847797. [PMID: 36440972 DOI: 10.1093/bib/bbac497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2022] [Revised: 10/16/2022] [Accepted: 10/19/2022] [Indexed: 11/29/2022] Open
Abstract
MicroRNA (miRNA)-target interaction (MTI) plays a substantial role in various cell activities, molecular regulations and physiological processes. Published biomedical literature is the carrier of high-confidence MTI knowledge. However, digging out this knowledge in an efficient manner from large-scale published articles remains challenging. To address this issue, we were motivated to construct a deep learning-based model. We applied the pre-trained language models to biomedical text to obtain the representation, and subsequently fed them into a deep neural network with gate mechanism layers and a fully connected layer for the extraction of MTI information sentences. Performances of the proposed models were evaluated using two datasets constructed on the basis of text data obtained from miRTarBase. The validation and test results revealed that incorporating both PubMedBERT and SciBERT for sentence level encoding with the long short-term memory (LSTM)-based deep neural network can yield an outstanding performance, with both F1 and accuracy being higher than 80% on validation data and test data. Additionally, the proposed deep learning method outperformed the following machine learning methods: random forest, support vector machine, logistic regression and bidirectional LSTM. This work would greatly facilitate studies on MTI analysis and regulations. It is anticipated that this work can assist in large-scale screening of miRNAs, thereby revealing their functional roles in various diseases, which is important for the development of highly specific drugs with fewer side effects. Source code and corpus are publicly available at https://github.com/qi29.
Affiliation(s)
- Mengqi Luo
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life Sciences, University of Science and Technology of China, Hefei, China
- Shangfu Li
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen
- Yuxuan Pang
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China, and also in the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, PR China
- Lantian Yao
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China, and also in the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, PR China
- Renfei Ma
  - Warshel Institute for Computational Biology, Chinese University of Hong Kong, Shenzhen; School of Life Sciences, University of Science and Technology of China, Hefei, China
- Hsi-Yuan Huang
  - School of Medicine and the Warshel Institute of Computational Biology, The Chinese University of Hong Kong, Shenzhen
- Hsien-Da Huang
  - School of Medicine, and the executive director of Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen
- Tzong-Yi Lee
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
|
18
|
Zhao S, Wang A, Qin B, Wang F. Biomedical evidence engineering for data-driven discovery. Bioinformatics 2022; 38:5270-5278. [PMID: 36227057 DOI: 10.1093/bioinformatics/btac675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Revised: 10/04/2022] [Accepted: 10/11/2022] [Indexed: 01/29/2023] Open
Abstract
MOTIVATION With the rapid development of precision medicine, a large amount of health data (such as electronic health records, gene sequencing, medical images, etc.) has been produced. It encourages more and more interest in data-driven insight discovery from these data. A reasonable way to verify the derived insights is by checking evidence from biomedical literature. However, manual verification is inefficient and not scalable. Therefore, an intelligent technique is necessary to solve this problem. RESULTS This article introduces a framework for biomedical evidence engineering, addressing this problem more effectively. The framework consists of a biomedical literature retrieval module and an evidence extraction module. The retrieval module ensembles several methods and achieves state-of-the-art performance in biomedical literature retrieval. A BERT-based evidence extraction model is proposed to extract evidence from literature in response to queries. Moreover, we create a dataset with 1 million examples of biomedical evidence, 10 000 of which are manually annotated. AVAILABILITY AND IMPLEMENTATION Datasets are available at https://github.com/SendongZhao.
Affiliation(s)
- Sendong Zhao
  - Department of Population Health Sciences, College of Computer Science and Technology, Harbin Institute of Technology, Harbin 10065, China
- Aobo Wang
  - Department of Population Health Sciences, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Bing Qin
  - Department of Population Health Sciences, College of Computer Science and Technology, Harbin Institute of Technology, Harbin 10065, China
- Fei Wang
  - Department of Population Health Sciences, Weill Medical College, Cornell University, New York, NY 14853, USA
|
19
|
Su Y, Wang M, Wang P, Zheng C, Liu Y, Zeng X. Deep learning joint models for extracting entities and relations in biomedical: a survey and comparison. Brief Bioinform 2022; 23:6686739. [PMID: 36125190 DOI: 10.1093/bib/bbac342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 07/20/2022] [Accepted: 07/25/2022] [Indexed: 12/14/2022] Open
Abstract
The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data make it seriously challenging for biomedical researchers to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform the two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and easy redundancy. To overcome these shortcomings, many deep learning-based joint named entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarizes deep learning models for joint named entity recognition and relation extraction for biomedicine. The joint BioNER and BioRE models are discussed in the light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four biomedical public datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.
Affiliation(s)
- Yansen Su
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Minglu Wang
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Pengpeng Wang
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Chunhou Zheng
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
|
20
|
Literature Mining of Disease Associated Noncoding RNA in the Omics Era. Molecules 2022; 27:4710. [PMID: 35897884 PMCID: PMC9331993 DOI: 10.3390/molecules27154710] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2022] [Revised: 07/20/2022] [Accepted: 07/22/2022] [Indexed: 02/01/2023] Open
Abstract
Noncoding RNAs (ncRNAs) are transcripts without protein-coding potential that play fundamental regulatory roles in diverse cellular processes and diseases. The application of deep sequencing experiments in ncRNA research has generated massive omics datasets, which require rapid examination, interpretation and validation based on existing knowledge resources. Thus, text-mining methods have been increasingly adapted for automatic extraction of relations between an ncRNA and its target or a disease condition from biomedical literature. These bioinformatics tools can also assist in more complex research, such as database curation of candidate ncRNAs and hypothesis generation with respect to pathophysiological mechanisms. In this concise review, we first introduce the basic concepts and workflow of literature mining systems. Then, we compare available bioinformatics tools tailored for ncRNA studies, including their tasks, applicability, and limitations. Their powerful utilities and flexibility are demonstrated by examples in a variety of diseases, such as Alzheimer’s disease, atherosclerosis and cancers. Finally, we outline several challenges from the viewpoints of both system developers and end users. We conclude that the application of text-mining techniques will boost disease-associated ncRNA discoveries in the biomedical literature and enable integrative biology in the current omics era.
|
21
|
Transducer Cascades for Biological Literature-Based Discovery. INFORMATION 2022. [DOI: 10.3390/info13050262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
G protein-coupled receptors (GPCRs) control the response of cells to many signals, and as such, are involved in most cellular processes. As membrane receptors, they are accessible at the surface of the cell. GPCRs are also the largest family of membrane receptors, with more than 800 representatives in mammalian genomes. For this reason, they are ideal targets for drugs. Although about one third of approved drugs target GPCRs, only about 16% of GPCRs are targeted by drugs. One of the difficulties comes from the lack of knowledge on the intra-cellular events triggered by these molecules. In the last two decades, scientists have started mapping the signaling networks triggered by GPCRs. However, it soon appeared that the system is very complex, which led to the publication of more than 320,000 scientific papers. Clearly, a human cannot take into account such massive sources of information. These papers represent a mine of information about both ontological knowledge and experimental results related to GPCRs, which have to be exploited in order to build signaling networks. The ABLISS project aims at the automatic building of GPCR networks using automated deductive reasoning, which allows all available data to be integrated. To this end, we carried out the automatic extraction of network information from the literature using Natural Language Processing (NLP). We mainly focused on the experimental results about GPCRs reported in the scientific papers, as so far there is no source gathering all these experimental results. We designed a relational database in order to make them available to the scientific community later. After introducing the more general objectives of the ABLISS project, we describe the formalism in detail. We then explain the NLP program that we implemented using finite state methods (Unitex graph cascades) and discuss the extracted facts. Finally, we present the design of the relational database that stores the facts extracted from the selected papers.
|
22
|
Li PH, Chen TF, Yu JY, Shih SH, Su CH, Lin YH, Tsai HK, Juan HF, Chen CY, Huang JH. pubmedKB: an interactive web server for exploring biomedical entity relations in the biomedical literature. Nucleic Acids Res 2022; 50:W616-W622. [PMID: 35536289 PMCID: PMC9252824 DOI: 10.1093/nar/gkac310] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/27/2022] [Revised: 04/06/2022] [Accepted: 04/18/2022] [Indexed: 11/15/2022]
Abstract
With the proliferation of genomic sequence data for biomedical research, the exploration of human genetic information by domain experts requires a comprehensive interrogation of large numbers of scientific publications in PubMed. However, a query in PubMed essentially provides search results sorted only by the date of publication. A search engine for retrieving and interpreting complex relations between biomedical concepts in scientific publications is still lacking. Here, we present pubmedKB, a web server designed to extract and visualize semantic relationships between four biomedical entity types: variants, genes, diseases, and chemicals. pubmedKB uses state-of-the-art natural language processing techniques to extract semantic relations from the large number of PubMed abstracts. Currently, over 2 million semantic relations between biomedical entity pairs are extracted from over 33 million PubMed abstracts in pubmedKB. pubmedKB has a user-friendly interface with an interactive semantic graph, enabling the user to easily query entities and explore entity relations. Supporting sentences with highlighted snippets allow users to easily navigate the publications. Combined with a new explorative approach to literature mining and an interactive interface for researchers, pubmedKB thus enables rapid, intelligent searching of the large biomedical literature to provide useful knowledge and insights. pubmedKB is available at https://www.pubmedkb.cc/.
Affiliation(s)
- Huai-Kuang Tsai
- Taiwan AI Labs, Taipei 10351, Taiwan; Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Hsueh-Fen Juan
- Taiwan AI Labs, Taipei 10351, Taiwan; Department of Life Science, National Taiwan University, Taipei 10617, Taiwan; Center for Computational and Systems Biology, National Taiwan University, Taipei 10617, Taiwan
- Chien-Yu Chen
- Taiwan AI Labs, Taipei 10351, Taiwan; Center for Computational and Systems Biology, National Taiwan University, Taipei 10617, Taiwan; Department of Biomechatronics Engineering, National Taiwan University, Taipei 10617, Taiwan
|
23
|
Stocker M, Heger T, Schweidtmann A, Ćwiek-Kupczyńska H, Penev L, Dojchinovski M, Willighagen E, Vidal ME, Turki H, Balliet D, Tiddi I, Kuhn T, Mietchen D, Karras O, Vogt L, Hellmann S, Jeschke J, Krajewski P, Auer S. SKG4EOSC - Scholarly Knowledge Graphs for EOSC: Establishing a backbone of knowledge graphs for FAIR Scholarly Information in EOSC. RESEARCH IDEAS AND OUTCOMES 2022. [DOI: 10.3897/rio.8.e83789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
In the age of advanced information systems powering fast-paced knowledge economies that face global societal challenges, it is no longer adequate to express scholarly information - an essential resource for modern economies - primarily as article narratives in document form. Despite being a well-established tradition in scholarly communication, PDF-based text publishing is hindering scientific progress as it buries scholarly information in non-machine-readable formats. The key objective of SKG4EOSC is to improve science productivity through development and implementation of services for text and data conversion, and production, curation, and re-use of FAIR scholarly information. This will be achieved by (1) establishing the Open Research Knowledge Graph (ORKG, orkg.org), a service operated by the SKG4EOSC coordinator, as a Hub for access to FAIR scholarly information in the EOSC; (2) lifting to EOSC numerous and heterogeneous domain-specific research infrastructures through the ORKG Hub’s harmonized access facilities; and (3) leveraging the Hub to support cross-disciplinary research and policy decisions addressing societal challenges. SKG4EOSC will pilot the devised approaches and technologies in four research domains: biodiversity crisis, precision oncology, circular processes, and human cooperation. With the aim of improving machine-based scholarly information use, SKG4EOSC addresses an important current and future need of researchers. It extends the application of the FAIR data principles to scholarly communication practices, and hence provides more comprehensive coverage of the entire research lifecycle. Through explicit, machine-actionable provenance links between FAIR scholarly information, primary data and contextual entities, it will substantially contribute to reproducibility, validation and trust in science. The resulting advanced machine support will catalyse new discoveries in basic research and solutions in key application areas.
|
24
|
Kropiwnicki E, Lachmann A, Clarke DJB, Xie Z, Jagodnik KM, Ma’ayan A. DrugShot: querying biomedical search terms to retrieve prioritized lists of small molecules. BMC Bioinformatics 2022; 23:76. [PMID: 35183110 PMCID: PMC8858480 DOI: 10.1186/s12859-022-04590-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 10/04/2021] [Accepted: 01/28/2022] [Indexed: 11/29/2022]
Abstract
Background PubMed contains millions of abstracts that co-mention terms that describe drugs with other biomedical terms such as genes or diseases. Unique opportunities exist for leveraging these co-mentions by integrating them with other drug-drug similarity resources such as the Library of Integrated Network-based Cellular Signatures (LINCS) L1000 signatures to develop novel hypotheses. Results DrugShot is a web-based server application and an Appyter that enables users to enter any biomedical search term into a simple input form to receive ranked lists of drugs and other small molecules based on their relevance to the search term. To produce ranked lists of small molecules, DrugShot cross-references returned PubMed identifiers (PMIDs) with DrugRIF or AutoRIF, which are curated resources of drug-PMID associations, to produce an associated small molecule list where each small molecule is ranked according to total co-mentions with the search term from shared PubMed IDs. Additionally, using two types of drug-drug similarity matrices, lists of small molecules are predicted to be associated with the search term. Such predictions are based on literature co-mentions and signature similarity from LINCS L1000 drug-induced gene expression profiles. Conclusions DrugShot prioritizes drugs and small molecules associated with biomedical search terms. In addition to listing known associations, DrugShot predicts additional drugs and small molecules related to any search term. Hence, DrugShot can be used to prioritize drugs and preclinical compounds for drug repurposing and suggest indications and adverse events for preclinical compounds. DrugShot is freely and openly available at: https://maayanlab.cloud/drugshot and https://appyters.maayanlab.cloud/#/DrugShot. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04590-5.
|
25
|
Xiang J, Zhang J, Zhao Y, Wu FX, Li M. Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022]
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, many computational methods and tools/platforms have been developed to reveal intrinsic relationships between diseases, which can provide useful insights into the study of complex diseases, e.g. understanding the molecular mechanisms of diseases and discovering new treatments for diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of their use of biomedical data: disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we discuss and summarize the research on disease-disease associations. This review provides a systematic overview of current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.
Affiliation(s)
- Ju Xiang
- School of Computer Science and Engineering, Central South University, China
- Jiashuai Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yichao Zhao
- School of Computer Science and Engineering, Central South University, China
- Fang-Xiang Wu
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li
- Division of Biomedical Engineering and Department of Mechanical Engineering at University of Saskatchewan, Saskatoon, Canada
|
26
|
Brincat A, Hofmann M. Automated extraction of genes associated with antibiotic resistance from the biomedical literature. Database (Oxford) 2022; 2022:6520791. [PMID: 35134132 PMCID: PMC9263533 DOI: 10.1093/database/baab077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Revised: 09/21/2021] [Accepted: 11/22/2021] [Indexed: 11/15/2022]
Abstract
The detection of bacterial antibiotic resistance phenotypes is important when making clinical decisions for patient treatment. Conventional phenotypic testing involves culturing bacteria, which requires a significant amount of time and work. Whole-genome sequencing is emerging as a fast alternative for resistance prediction, by considering the presence/absence of certain genes. A lot of research has focused on determining which bacterial genes cause antibiotic resistance and efforts are being made to consolidate these facts in knowledge bases (KBs). KBs are usually manually curated by domain experts to be of the highest quality. However, this limits the pace at which new facts are added. Automated relation extraction of gene-antibiotic resistance relations from the biomedical literature is one solution that can simplify the curation process. This paper reports on the development of a text mining pipeline that takes in English biomedical abstracts and outputs genes that are predicted to cause resistance to antibiotics. To test the generalisability of this pipeline, it was then applied to predict genes associated with Helicobacter pylori antibiotic resistance that are not present in common antibiotic resistance KBs or publications studying H. pylori. These genes would be candidates for further lab-based antibiotic research and inclusion in these KBs. For relation extraction, state-of-the-art deep learning models were used. These models were trained on a newly developed silver corpus which was generated by distant supervision of abstracts using the facts obtained from KBs. The top performing model was superior to a co-occurrence model, achieving a recall of 95%, a precision of 60% and F1-score of 74% on a manually annotated holdout dataset. To our knowledge, this project was the first attempt at developing a complete text mining pipeline that incorporates deep learning models to extract gene-antibiotic resistance relations from the literature.
Additional related data can be found at https://github.com/AndreBrincat/Gene-Antibiotic-Resistance-Relation-Extraction
Affiliation(s)
- Andre Brincat
- Department of Informatics, TU Dublin, Blanchardstown Campus, Dublin D15 YV78, Ireland
- Markus Hofmann
- Department of Informatics, TU Dublin, Blanchardstown Campus, Dublin D15 YV78, Ireland
|
27
|
Bhasuran B. Combining Literature Mining and Machine Learning for Predicting Biomedical Discoveries. Methods Mol Biol 2022; 2496:123-140. [PMID: 35713862 DOI: 10.1007/978-1-0716-2305-3_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
The major outcomes and insights of scientific research and clinical study end up in the form of a publication or clinical record in an unstructured text format. Due to advancements in biomedical research, the published literature has grown tremendously in recent years. Scientists and clinical researchers face a big challenge in staying current with the knowledge and extracting hidden information from the sheer quantity of millions of published biomedical articles. A potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, empirical approaches and clinical affirmation are expensive and time-consuming. An in silico approach using text mining to identify disease-causing genes can contribute towards biomarker discovery. This chapter presents a protocol on combining literature mining and machine learning for predicting biomedical discoveries with a special emphasis on gene-disease relation based discovery. The protocol is presented as a literature-based discovery (LBD) pipeline for gene-disease based discovery. The protocol includes our web-based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep learning based method for association discovery. Our proposed deep learning based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drug/chemical, or miRNA.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
|
28
|
Liang L, Hu J, Sun G, Hong N, Wu G, He Y, Li Y, Hao T, Liu L, Gong M. Artificial Intelligence-Based Pharmacovigilance in the Setting of Limited Resources. Drug Saf 2022; 45:511-519. [PMID: 35579814 PMCID: PMC9112260 DOI: 10.1007/s40264-022-01170-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Accepted: 02/27/2022] [Indexed: 01/28/2023]
Abstract
With the rapid development of artificial intelligence (AI) technologies, and the large amount of pharmacovigilance-related data stored in an electronic manner, data-driven automatic methods need to be urgently applied to all aspects of pharmacovigilance to assist healthcare professionals. However, the quantity and quality of data directly affect the performance of AI, and there are particular challenges to implementing AI in limited-resource settings. Analyzing challenges and solutions for AI-based pharmacovigilance in resource-limited settings can improve pharmacovigilance frameworks and capabilities in these settings. In this review, we summarize the challenges into four categories: establishing a database for an AI-based pharmacovigilance system, lack of human resources, weak AI technology and insufficient government support. This study also discusses possible solutions and future perspectives on AI-based pharmacovigilance in resource-limited settings.
Affiliation(s)
- Likeng Liang
- School of Computer Science, South China Normal University, Guangzhou, China
- Jifa Hu
- The Central Hospital of Wuhan, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Gang Sun
- Key Laboratory of Oncology of Xinjiang Uyghur Autonomous Region, The Affiliated Cancer Hospital of Xinjiang Medical University, Ürümqi, China
- Na Hong
- Digital Health China Technologies Co., Ltd., Beijing, China
- Ge Wu
- Digital Health China Technologies Co., Ltd., Beijing, China
- Yuejun He
- Digital Health China Technologies Co., Ltd., Beijing, China
- Yong Li
- School of Computer Science, South China Normal University, Guangzhou, China
- Tianyong Hao
- School of Computer Science, South China Normal University, Guangzhou, China
- Li Liu
- Institute of Health Management, Southern Medical University, Guangzhou, China
- Mengchun Gong
- Institute of Health Management, Southern Medical University, Guangzhou, China
|
29
|
Bhasuran B. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol 2022; 2496:221-235. [PMID: 35713867 DOI: 10.1007/978-1-0716-2305-3_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In biomedicine, facts about relations between entities (disease, gene, drug, etc.) are hidden in the large trove of 30 million scientific publications. The curated information has been shown to play an important role in various applications such as drug repurposing and precision medicine. Recently, owing to advances in deep learning, a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers) has been proposed. This pretrained language model, trained using the Books Corpus with 800M words and English Wikipedia with 2500M words, reported state-of-the-art results in various NLP (Natural Language Processing) tasks including relation extraction. It is a widely accepted notion that, due to the word distribution shift, general domain models exhibit poor performance in information extraction tasks of the biomedical domain. For this reason, the architecture was later adapted to the biomedical domain by training the language models using 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT by discussing the state of the art for BERT versions in the biomedical domain such as BioBERT. The protocol emphasizes the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally a knowledge graph infusion into the BERT model layer.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
|
30
|
Crema C, Attardi G, Sartiano D, Redolfi A. Natural language processing in clinical neuroscience and psychiatry: A review. Front Psychiatry 2022; 13:946387. [PMID: 36186874 PMCID: PMC9515453 DOI: 10.3389/fpsyt.2022.946387] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/17/2022] [Accepted: 08/22/2022] [Indexed: 11/13/2022]
Abstract
Natural language processing (NLP) is rapidly becoming an important topic in the medical community. The ability to automatically analyze any type of medical document could be the key factor to fully exploit the data it contains. Cutting-edge artificial intelligence (AI) architectures, particularly machine learning and deep learning, have begun to be applied to this topic and have yielded promising results. We conducted a literature search for 1,024 papers that used NLP technology in neuroscience and psychiatry from 2010 to early 2022. After a selection process, 115 papers were evaluated. Each publication was classified into one of three categories: information extraction, classification, and data inference. Automated understanding of clinical reports in electronic health records has the potential to improve healthcare delivery. Overall, the performance of NLP applications is high, with an average F1-score and AUC above 85%. We also derived a composite measure in the form of Z-scores to better compare the performance of NLP models and their different classes as a whole. No statistical differences were found in the unbiased comparison. Strong asymmetry between English and non-English models, difficulty in obtaining high-quality annotated data, and training biases causing low generalizability are the main limitations. This review suggests that NLP could be an effective tool to help clinicians gain insights from medical reports, clinical research forms, and more, making NLP an effective tool to improve the quality of healthcare services.
Affiliation(s)
- Claudio Crema
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy
- Daniele Sartiano
- Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
- Alberto Redolfi
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy
|
31
|
Srivastava P, Bej S, Yordanova K, Wolkenhauer O. Self-Attention-Based Models for the Extraction of Molecular Interactions from Biological Texts. Biomolecules 2021; 11:biom11111591. [PMID: 34827589 PMCID: PMC8615611 DOI: 10.3390/biom11111591] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/08/2021] [Revised: 10/22/2021] [Accepted: 10/24/2021] [Indexed: 01/02/2023]
Abstract
For any molecule, network, or process of interest, keeping up with new publications is becoming increasingly difficult. For many cellular processes, the number of molecules and interactions that need to be considered can be very large. Automated mining of publications can support large-scale molecular interaction maps and database curation. Text mining and Natural-Language-Processing (NLP)-based techniques are finding their applications in mining the biological literature, handling problems such as Named Entity Recognition (NER) and Relationship Extraction (RE). Both rule-based and Machine-Learning (ML)-based NLP approaches have been popular in this context, with multiple research and review articles examining the scope of such models in Biological Literature Mining (BLM). In this review article, we explore self-attention-based models, a special type of Neural-Network (NN)-based architecture that has recently revitalized the field of NLP, applied to biological texts. We cover self-attention models operating either at the sentence level or at the abstract level, in the context of molecular interaction extraction, published from 2019 onwards. We conducted a comparative study of the models in terms of their architecture. Moreover, we also discuss some limitations in the field of BLM that identify opportunities for the extraction of molecular interactions from biological text.
Affiliation(s)
- Prashant Srivastava
- Institute of Computer Science, University of Rostock, 18059 Rostock, Germany
- Saptarshi Bej
- Institute of Computer Science, University of Rostock, 18059 Rostock, Germany
- Leibniz-Institute for Food Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Kristina Yordanova
- Institute of Computer Science, University of Rostock, 18059 Rostock, Germany
- Olaf Wolkenhauer
- Institute of Computer Science, University of Rostock, 18059 Rostock, Germany
- Leibniz-Institute for Food Systems Biology, Technical University of Munich, 85354 Freising, Germany
|
32
|
Rosário-Ferreira N, Guimarães V, Costa VS, Moreira IS. SicknessMiner: a deep-learning-driven text-mining tool to abridge disease-disease associations. BMC Bioinformatics 2021; 22:482. [PMID: 34607568 PMCID: PMC8491382 DOI: 10.1186/s12859-021-04397-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/30/2021] [Accepted: 09/24/2021] [Indexed: 12/24/2022]
Abstract
Background Blood cancers (BCs) are responsible for over 720 K yearly deaths worldwide. Their prevalence and mortality rate uphold the relevance of research related to BCs. Despite the availability of different resources establishing Disease-Disease Associations (DDAs), the knowledge is scattered and not accessible in a straightforward way to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the DDAs retrieved were compared to the DisGeNET resource for qualitative and quantitative comparison. Results We obtained the DDAs via co-mention using our SicknessMiner or gene- or variant-disease similarity on DisGeNET. SicknessMiner was able to retrieve around 92% of the DisGeNET results and nearly 15% of the SicknessMiner results were specific to our pipeline. Conclusions SicknessMiner is a valuable tool to extract disease-disease relationships from a raw input corpus. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04397-w.
Affiliation(s)
- Nícia Rosário-Ferreira
- CQC - Coimbra Chemistry Center, Chemistry Department, Faculty of Science and Technology, University of Coimbra, 3004-535, Coimbra, Portugal; CNC - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Victor Guimarães
- Department of Sciences, University of Porto, Porto, Portugal; INESC-TEC - Centre of Advanced Computing Systems, Porto, Portugal
- Vítor S Costa
- Department of Sciences, University of Porto, Porto, Portugal; INESC-TEC - Centre of Advanced Computing Systems, Porto, Portugal
- Irina S Moreira
- Department of Life Sciences, University of Coimbra, Calçada Martim de Freitas, 3000-456, Coimbra, Portugal; CNC - Center for Neuroscience and Cell Biology, CIBB - Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
33
|
Leonardelli L, Lofano G, Selvaggio G, Parolo S, Giampiccolo S, Tomasoni D, Domenici E, Priami C, Song H, Medini D, Marchetti L, Siena E. Literature Mining and Mechanistic Graphical Modelling to Improve mRNA Vaccine Platforms. Front Immunol 2021; 12:738388. [PMID: 34557200 PMCID: PMC8454234 DOI: 10.3389/fimmu.2021.738388] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/08/2021] [Accepted: 08/23/2021] [Indexed: 12/25/2022]
Abstract
RNA vaccines represent a milestone in the history of vaccinology. They provide several advantages over more traditional approaches to vaccine development, showing strong immunogenicity and an overall favorable safety profile. While preclinical testing has provided some key insights on how RNA vaccines interact with the innate immune system, knowledge of their mechanism of action is fragmented across the literature, making it difficult to formulate new hypotheses to be tested in clinical settings and ultimately improve this technology platform. Here, we propose a systems biology approach, based on the combination of literature mining and mechanistic graphical modeling, to consolidate existing knowledge around the mode of action of mRNA vaccines and enhance the translatability of preclinical hypotheses into clinical evidence. A Natural Language Processing (NLP) pipeline for automated knowledge extraction retrieved key biological evidence that was joined into an interactive mechanistic graphical model representing the chain of immune events induced by mRNA vaccine administration. The resulting mechanistic graphical model will help the design of future experiments, foster the generation of new hypotheses and set the basis for the development of mathematical models capable of simulating and predicting the immune response to mRNA vaccines.
Affiliation(s)
- Lorena Leonardelli
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Gianluca Selvaggio
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Silvia Parolo
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Stefano Giampiccolo
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Danilo Tomasoni
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Enrico Domenici
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy; Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Povo, Italy
- Corrado Priami
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy; Department of Computer Science, University of Pisa, Pisa, Italy
- Luca Marchetti
- Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy; Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Povo, Italy
- Emilio Siena
- Data Science and Computational Vaccinology, GSK, Siena, Italy
|
34
|
Zhu T, Qin Y, Xiang Y, Hu B, Chen Q, Peng W. Distantly supervised biomedical relation extraction using piecewise attentive convolutional neural network and reinforcement learning. J Am Med Inform Assoc 2021; 28:2571-2581. [PMID: 34524450 DOI: 10.1093/jamia/ocab176] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/11/2021] [Revised: 07/08/2021] [Accepted: 08/06/2021] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE There have been various methods to deal with the erroneous training data in distantly supervised relation extraction (RE); however, their performance is still far from satisfactory. We aimed to deal with the insufficient modeling problem on instance-label correlations for predicting biomedical relations using deep learning and reinforcement learning. MATERIALS AND METHODS In this study, a new computational model called piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) was proposed to perform RE on distantly supervised data generated from the Unified Medical Language System with MEDLINE abstracts and benchmark datasets. In PACNN+RL, PACNN was introduced to encode semantic information of biomedical text, and the RL method with a memory backtracking mechanism was leveraged to alleviate the erroneous data issue. Extensive experiments were conducted on 4 biomedical RE tasks. RESULTS The proposed PACNN+RL model achieved competitive performance on 8 biomedical corpora, outperforming most baseline systems. Specifically, PACNN+RL outperformed all baseline methods with F1-scores of 0.5592 on the may-prevent dataset, 0.6666 on the may-treat dataset, and 0.3838 on the DDI corpus, 2011. For the protein-protein interaction RE task, we obtained new state-of-the-art performance on 4 out of 5 benchmark datasets. CONCLUSIONS The performance on many distantly supervised biomedical RE tasks was substantially improved, primarily owing to the denoising effect of the proposed model. It is anticipated that PACNN+RL will become a useful tool for large-scale RE and other downstream tasks to facilitate biomedical knowledge acquisition. We also made the demonstration program and source code publicly available at http://112.74.48.115:9000/.
Affiliation(s)
- Tiantian Zhu
- Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Yang Qin
- Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Yang Xiang
- Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Baotian Hu
- Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Qingcai Chen
- Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Weihua Peng
- Department of Knowledge Graph, Baidu International Technology (Shenzhen), Shenzhen, China
|
35
|
Liu Z, Roberts RA, Lal-Nag M, Chen X, Huang R, Tong W. AI-based language models powering drug discovery and development. Drug Discov Today 2021; 26:2593-2607. [PMID: 34216835 PMCID: PMC8604259 DOI: 10.1016/j.drudis.2021.06.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 09/21/2020] [Revised: 04/28/2021] [Accepted: 06/25/2021] [Indexed: 02/08/2023]
Abstract
The discovery and development of new medicines is expensive, time-consuming, and often inefficient, with many failures along the way. Powered by artificial intelligence (AI), language models (LMs) have changed the landscape of natural language processing (NLP), offering possibilities to make treatment development more effective. Here, we summarize advances in AI-powered LMs and their potential to aid drug discovery and development. We highlight opportunities for AI-powered LMs in target identification, clinical design, regulatory decision-making, and pharmacovigilance. We specifically emphasize the potential role of AI-powered LMs in developing new treatment strategies for coronavirus disease 2019 (COVID-19), including drug repurposing, which can be extrapolated to other infectious diseases that have the potential to cause pandemics. Finally, we set out the remaining challenges and propose possible solutions for improvement.
Affiliation(s)
- Zhichao Liu
- National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA
- Ruth A Roberts
- National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA; ApconiX, BioHub at Alderley Park, Alderley Edge SK10 4TG, UK; University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
- Madhu Lal-Nag
- Office of Translational Sciences, Center for Drug Evaluation and Research, US FDA, Silver Spring, MD 20993, USA
- Xi Chen
- National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA
- Ruili Huang
- National Center for Advancing Translational Sciences, National Institutes of Health, Rockville, MD 20850, USA
- Weida Tong
- National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA
|
36
|
Bayram U, Roy R, Assalil A, BenHiba L. The unknown knowns: a graph-based approach for temporal COVID-19 literature mining. Online Information Review 2021. [DOI: 10.1108/oir-12-2020-0562] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
Abstract
Purpose: The COVID-19 pandemic has sparked a remarkable volume of research literature, and scientists are increasingly in need of intelligent tools to cut through the noise and uncover relevant research directions. As a response, the authors propose a novel framework. In this framework, the authors develop a novel weighted semantic graph model to compress the research studies efficiently. Also, the authors present two analyses on this graph to propose alternative ways to uncover additional aspects of COVID-19 research. Design/methodology/approach: The authors construct the semantic graph using state-of-the-art natural language processing (NLP) techniques on COVID-19 publication texts (>100,000 texts). Next, the authors conduct an evolutionary analysis to capture the changes in COVID-19 research across time. Finally, the authors apply a link prediction study to detect novel COVID-19 research directions that are so far undiscovered. Findings: Findings reveal the success of the semantic graph in capturing scientific knowledge and its evolution. Meanwhile, the prediction experiments provide 79% accuracy on returning intelligible links, showing the reliability of the methods for predicting novel connections that could help scientists discover potential new directions. Originality/value: To the authors’ knowledge, this is the first study to propose a holistic framework that includes encoding the scientific knowledge in a semantic graph, demonstrates an evolutionary examination of past and ongoing research and offers scientists tools to generate new hypotheses and research directions through predictive modeling and deep machine learning techniques.
|
37
|
Yi H, Zhang Q, Lin C, Ma S. Information-incorporated Gaussian graphical model for gene expression data. Biometrics 2021; 78:512-523. [PMID: 33527365 DOI: 10.1111/biom.13428] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/22/2020] [Revised: 09/19/2020] [Accepted: 01/13/2021] [Indexed: 11/29/2022]
Abstract
In the analysis of gene expression data, network approaches take a system perspective and have played an irreplaceably important role. Gaussian graphical models (GGMs) have been popular in the network analysis of gene expression data. They investigate the conditional dependence between genes and "transform" the problem of estimating network structures into a sparse estimation of precision matrices. When there is a moderate to large number of genes, the number of parameters to be estimated may overwhelm the limited sample size, leading to unreliable estimation and selection. In this article, we propose incorporating information from previous studies (for example, those deposited at PubMed) to assist estimating the network structure in the present data. It is recognized that such information can be partial, biased, or even wrong. A penalization-based estimation approach is developed, shown to have consistency properties, and realized using an effective computational algorithm. Simulation demonstrates its competitive performance under various information accuracy scenarios. The analysis of TCGA lung cancer prognostic genes leads to network structures different from the alternatives.
Affiliation(s)
- Huangdi Yi
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Qingzhao Zhang
- Department of Statistics, School of Economics; Key Laboratory of Econometrics, Ministry of Education; The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Cunjie Lin
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
- Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
|
38
|
|