1
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024; 52:W540-W546. [PMID: 38572754 PMCID: PMC11223843 DOI: 10.1093/nar/gkae235] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024]
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
Affiliation(s)
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Alexis Allot
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Shubo Tian
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qiao Jin
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhizheng Wang
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
2
Ming S, Zhang R, Kilicoglu H. Enhancing the coverage of SemRep using a relation classification approach. J Biomed Inform 2024; 155:104658. [PMID: 38782169 DOI: 10.1016/j.jbi.2024.104658] [Received: 12/19/2023] [Revised: 05/01/2024] [Accepted: 05/18/2024] [Indexed: 05/25/2024]
Abstract
OBJECTIVE Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB. METHODS We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts. RESULTS Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on 12K abstracts shows that the model could double the size of SemMedDB, when applied to entire PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct. CONCLUSION These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining. 
Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.
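The precision and recall figures in this abstract can be checked against the reported F1 score with a short calculation. The sketch below assumes the standard harmonic-mean definition of F1; the function name is illustrative, and the SemRep F1 it prints is derived here rather than quoted from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract.
semrep_f1 = f1_score(0.69, 0.42)   # SemRep baseline (F1 derived, not quoted)
model_f1 = f1_score(0.62, 0.81)    # best relation classification model

print(f"SemRep F1:     {semrep_f1:.2f}")
print(f"Best model F1: {model_f1:.2f}")
```

Under this definition the best model's figures are consistent with the reported F1 of 0.70, and the recall gain outweighs the drop in precision.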
Affiliation(s)
- Shufan Ming
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel St., Champaign, 61820, IL, USA
- Rui Zhang
  - Division of Computational Health Sciences, Department of Surgery, University of Minnesota, 516 Delaware St SE, Minneapolis, 55455, MN, USA
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel St., Champaign, 61820, IL, USA
3
Tyagin I, Safro I. Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique. BMC Bioinformatics 2024; 25:213. [PMID: 38872097 PMCID: PMC11177514 DOI: 10.1186/s12859-024-05812-8] [Received: 01/31/2024] [Accepted: 05/16/2024] [Indexed: 06/15/2024]
Abstract
BACKGROUND Automated hypothesis generation (HG) focuses on uncovering hidden connections within the extensive information that is publicly available. This domain has become increasingly popular, thanks to modern machine learning algorithms. However, the automated evaluation of HG systems is still an open problem, especially on a larger scale. RESULTS This paper presents a novel benchmarking framework, Dyport, for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact in biomedical research, which significantly extends traditional link prediction benchmarks. The applicability of our benchmarking process is demonstrated on several link prediction systems applied to biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. CONCLUSIONS Dyport is an open-source benchmarking framework designed for biomedical hypothesis generation systems evaluation, which takes into account knowledge dynamics, semantics and impact. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport.
Affiliation(s)
- Ilya Tyagin
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19713, USA
- Ilya Safro
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19716, USA
4
Xiao Y, Hou Y, Zhou H, Diallo G, Fiszman M, Wolfson J, Zhou L, Kilicoglu H, Chen Y, Su C, Xu H, Mantyh WG, Zhang R. Repurposing non-pharmacological interventions for Alzheimer's disease through link prediction on biomedical literature. Sci Rep 2024; 14:8693. [PMID: 38622164 PMCID: PMC11018822 DOI: 10.1038/s41598-024-58604-8] [Received: 12/23/2023] [Accepted: 04/01/2024] [Indexed: 04/17/2024]
Abstract
Non-pharmaceutical interventions (NPIs) have great potential to improve cognitive function, but there has been limited investigation into repurposing NPIs for Alzheimer's disease (AD). This is the first study to develop an innovative framework to extract and represent NPI information from biomedical literature in a knowledge graph (KG), and train link prediction models to repurpose novel NPIs for AD prevention. We constructed a comprehensive KG, called ADInt, by extracting NPI information from biomedical literature. We used the previously-created SuppKG and NPI lexicon to identify NPI entities. Four KG embedding models (i.e., TransE, RotatE, DistMult and ComplEx) and two novel graph convolutional network models (i.e., R-GCN and CompGCN) were trained and compared to learn the representation of ADInt. Models were evaluated and compared on two test sets (time slice and clinical trial ground truth) and the best performing model was used to predict novel NPIs for AD. Discovery patterns were applied to generate mechanistic pathways for high scoring candidates. ADInt has 162,212 nodes and 1,017,284 edges. R-GCN performed best on the time slice (MR = 5.2054, Hits@10 = 0.8496) and clinical trial ground truth (MR = 3.4996, Hits@10 = 0.9192) test sets. After evaluation by domain experts, 10 novel dietary supplements and 10 complementary and integrative health interventions were proposed from the score table calculated by R-GCN. Among the proposed novel NPIs, we found plausible mechanistic pathways for photodynamic therapy and Choerospondias axillaris to prevent AD, and validated psychotherapy and manual therapy techniques using real-world data analysis. The proposed framework shows potential for discovering new NPIs for AD prevention and understanding their mechanistic pathways.
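The MR and Hits@10 figures in this abstract are standard link prediction metrics. As a minimal sketch, assuming each test triple yields the 1-based rank of the correct entity among all candidates, they could be computed as follows. The function names and the toy ranks are illustrative and are not taken from the study.

```python
from statistics import mean


def mean_rank(ranks):
    """Mean rank (MR): average position of the correct entity; lower is better."""
    return mean(ranks)


def hits_at_k(ranks, k=10):
    """Hits@k: fraction of test triples whose correct entity ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks for eight hypothetical test triples (1 = best possible rank).
toy_ranks = [1, 3, 2, 15, 7, 1, 42, 4]

print("MR:", mean_rank(toy_ranks))
print("Hits@10:", hits_at_k(toy_ranks, k=10))
```

MR is sensitive to a few very poor ranks (the 42 above), whereas Hits@10 only counts whether each answer lands in the top ten, which is one reason the two are usually reported together.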
Affiliation(s)
- Yongkang Xiao
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Yu Hou
  - Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
- Huixue Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Gayo Diallo
  - INRIA SISTM, Team AHeaD - INSERM 1219 Bordeaux Population Health, University of Bordeaux, 33000, Bordeaux, France
- Marcelo Fiszman
  - NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
  - Semedy Inc, Needham, MA, USA
- Julian Wolfson
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Li Zhou
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, USA
- You Chen
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Chang Su
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, USA
- William G Mantyh
  - Department of Neurology, University of Minnesota, Minneapolis, MN, USA
- Rui Zhang
  - Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
5
Jin S, Liang H, Zhang W, Li H. Knowledge Graph for Breast Cancer Prevention and Treatment: Literature-Based Data Analysis Study. JMIR Med Inform 2024; 12:e52210. [PMID: 38409769 PMCID: PMC11004512 DOI: 10.2196/52210] [Received: 08/26/2023] [Revised: 01/02/2024] [Accepted: 01/06/2024] [Indexed: 02/28/2024]
Abstract
Background The incidence of breast cancer has remained high and continues to rise since the 21st century. Consequently, there has been a significant increase in research efforts focused on breast cancer prevention and treatment. Despite the extensive body of literature available on this subject, systematic integration is lacking. To address this issue, knowledge graphs have emerged as a valuable tool. By harnessing their powerful knowledge integration capabilities, knowledge graphs offer a comprehensive and structured approach to understanding breast cancer prevention and treatment. Objective We aim to integrate literature data on breast cancer treatment and prevention, build a knowledge graph, and provide support for clinical decision-making. Methods We used Medical Subject Headings terms to search for clinical trial literature on breast cancer prevention and treatment published on PubMed between 2018 and 2022. We downloaded triplet data from the Semantic MEDLINE Database (SemMedDB) and matched them with the retrieved literature to obtain triplet data for the target articles. We visualized the triplet information using NetworkX for knowledge discovery. Results Within the scope of literature research in the past 5 years, malignant neoplasms appeared most frequently (587/1387, 42.3%). Pharmacotherapy (267/1387, 19.3%) was the primary treatment method, with trastuzumab (209/1805, 11.6%) being the most commonly used therapeutic drug. Through the analysis of the knowledge graph, we have discovered a complex network of relationships between treatment methods, therapeutic drugs, and preventive measures for different types of breast cancer. Conclusions This study constructed a knowledge graph for breast cancer prevention and treatment, which enabled the integration and knowledge discovery of relevant literature in the past 5 years. 
Researchers can gain insights into treatment methods, drugs, preventive knowledge regarding adverse reactions to treatment, and the associations between different knowledge domains from the graph.
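The matching step described in this abstract (filtering downloaded SemMedDB triples down to those from the retrieved articles, then counting concept frequencies) can be sketched with the standard library alone. Everything in the block is an assumption for illustration: the `Triple` fields, the PMIDs, and the example rows are made up and do not reflect the actual SemMedDB schema or the study's data.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical, simplified subject-predicate-object record keyed by PMID."""
    pmid: str
    subject: str
    predicate: str
    obj: str


def triples_for_articles(triples, target_pmids):
    """Keep only the triples extracted from the retrieved articles."""
    targets = set(target_pmids)
    return [t for t in triples if t.pmid in targets]


# Made-up triples and PMIDs, for illustration only.
all_triples = [
    Triple("1001", "trastuzumab", "TREATS", "breast cancer"),
    Triple("1002", "tamoxifen", "TREATS", "breast cancer"),
    Triple("1003", "aspirin", "TREATS", "headache"),
    Triple("1001", "trastuzumab", "CAUSES", "cardiotoxicity"),
]
retrieved_pmids = ["1001", "1002"]

matched = triples_for_articles(all_triples, retrieved_pmids)
subject_counts = Counter(t.subject for t in matched)

for subject, count in subject_counts.most_common():
    print(f"{subject}: {count}/{len(matched)}")
```

The matched triples could then be loaded as edges into a graph library such as NetworkX, which the abstract names as the visualization tool.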
Affiliation(s)
- Shuyan Jin
  - Health Department, Shenzhen Maternity and Child Healthcare Hospital, Shenzhen, China
- Haobin Liang
  - School of Economics and Statistics, Guangzhou University, Guangzhou, China
- Wenxia Zhang
  - Health Department, Shenzhen Maternity and Child Healthcare Hospital, Shenzhen, China
- Huan Li
  - Health Department, Shenzhen Maternity and Child Healthcare Hospital, Shenzhen, China
6
Kilicoglu H, Ensan F, McInnes B, Wang LL. Semantics-enabled biomedical literature analytics. J Biomed Inform 2024; 150:104588. [PMID: 38244957 DOI: 10.1016/j.jbi.2024.104588] [Received: 01/08/2024] [Accepted: 01/10/2024] [Indexed: 01/22/2024]
Affiliation(s)
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana Champaign, Champaign, IL, USA
- Faezeh Ensan
  - Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, ON, Canada
- Bridget McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Lucy Lu Wang
  - Information School, University of Washington, Seattle, WA, USA
7
Colicchio TK, Osborne JD, Do Rosario CV, Anand A, Timkovich NA, Wyatt MC, Cimino JJ. Semantically oriented EHR navigation with a patient specific knowledge base and a clinical context ontology. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:309-318. [PMID: 38222434 PMCID: PMC10785934] [Indexed: 01/16/2024]
Abstract
Widespread adoption of electronic health records (EHR) in the U.S. has been followed by unintended consequences, overexposing clinicians to widely reported EHR limitations. In an attempt to fix the EHR, we propose the use of a clinical context ontology (CCO), applied to turn implicit contextual statements into formally represented data in the form of concept-relationship-concept tuples. These tuples form what we call a patient specific knowledge base (PSKB), a collection of formally defined tuples containing facts about the patient's care context. We report the process to create a CCO, which guides annotation of structured and narrative patient data to produce a PSKB. We also present an application of our PSKB using real patient data displayed on a semantically oriented patient summary to improve EHR navigation. Our approach can potentially save precious time spent by clinicians using today's EHRs, by showing a chronological view of the patient's record along with contextual statements needed for care decisions with minimum effort. We propose several other applications of a PSKB to improve multiple EHR functions to guide future research.
Affiliation(s)
- John D Osborne
  - Informatics Institute, University of Alabama at Birmingham
- Ankit Anand
  - Informatics Institute, University of Alabama at Birmingham
- James J Cimino
  - Informatics Institute, University of Alabama at Birmingham
8
Jeynes JCG, James T, Corney M. Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls. Methods Mol Biol 2024; 2716:223-240. [PMID: 37702942 DOI: 10.1007/978-1-0716-3449-3_10] [Indexed: 09/14/2023]
Abstract
Building and analyzing knowledge graphs (KGs) to aid drug discovery is a topical area of research. A salient feature of KGs is their ability to combine many heterogeneous data sources in a format that facilitates discovering connections. The utility of KGs has been exemplified in areas such as drug repurposing, with insights made through manual exploration and modeling of the data. In this chapter, we discuss promises and pitfalls of using natural language processing (NLP) to mine "unstructured text", typically from scientific literature, as a data source for KGs. This draws on our experience of initially parsing "structured" data sources, such as ChEMBL, as the basis for data within a KG, and then enriching or expanding upon them using NLP. The fundamental promise of NLP for KGs is the automated extraction of data from millions of documents, a task practically impossible to do via human curation alone. However, there are many potential pitfalls in NLP-KG pipelines, such as incorrect named entity recognition and ontology linking, all of which could ultimately lead to erroneous inferences and conclusions.
Affiliation(s)
- J Charles G Jeynes
  - Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK
- Tim James
  - Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK
- Matthew Corney
  - Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK
9
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC Bioinformatics 2023; 24:412. [PMID: 37915001 PMCID: PMC10619245 DOI: 10.1186/s12859-023-05539-y] [Received: 05/24/2023] [Accepted: 10/19/2023] [Indexed: 11/03/2023]
Abstract
BACKGROUND The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: (1) they identify a relationship but not the type of relationship, (2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, (3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or (4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. RESULTS We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. 
Finally, we provide a simple and intuitive open-source web interface (https://skim.morgridge.org) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. CONCLUSIONS SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of the existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
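The A-B-C pattern described in this abstract can be illustrated with a small co-occurrence search. This is a simplified sketch, not SKiM itself: SKiM looks for statistically significant links, whereas the block below substitutes a plain count threshold (`min_count`, an assumption made for brevity). The function names and the toy corpus are also made up, with term names borrowed from a well-known literature-based discovery example purely as labels.

```python
from collections import defaultdict


def cooccurrence_counts(docs, source, candidates):
    """Count documents in which `source` appears together with each candidate."""
    return {
        term: sum(1 for d in docs if source in d and term in d)
        for term in candidates
    }


def abc_links(docs, a_term, b_terms, c_terms, min_count=1):
    """Return {C term: [B intermediates]} for A-B and B-C pairs that co-occur.

    A count threshold stands in for a significance test (an assumption).
    """
    links = defaultdict(list)
    a_b = cooccurrence_counts(docs, a_term, b_terms)
    for b, n_ab in a_b.items():
        if n_ab < min_count:
            continue
        b_c = cooccurrence_counts(docs, b, c_terms)
        for c, n_bc in b_c.items():
            if n_bc >= min_count:
                links[c].append(b)
    return dict(links)


# Toy corpus: each document is the set of terms it mentions (made up).
docs = [
    {"raynaud disease", "blood viscosity"},
    {"raynaud disease", "platelet aggregation"},
    {"blood viscosity", "fish oil"},
    {"platelet aggregation", "fish oil"},
    {"blood viscosity", "hypothetical drug"},
]

links = abc_links(
    docs,
    a_term="raynaud disease",
    b_terms=["blood viscosity", "platelet aggregation"],
    c_terms=["fish oil", "hypothetical drug"],
)
print(links)
```

Accepting user-supplied `b_terms` and `c_terms` lists mirrors the flexibility the abstract emphasizes. A real implementation would also need an index rather than repeated scans of the corpus to handle thousands of C terms.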
Affiliation(s)
- Kalpana Raja
  - Morgridge Institute for Research, Madison, WI, USA
  - Currently at Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
- John Steill
  - Morgridge Institute for Research, Madison, WI, USA
- Cannon Lock
  - Morgridge Institute for Research, Madison, WI, USA
- Xuancheng Tu
  - Morgridge Institute for Research, Madison, WI, USA
- Ian Ross
  - Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
- Lam C Tsoi
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Finn Kuusisto
  - Morgridge Institute for Research, Madison, WI, USA
  - Currently at Data Science Institute, University of Wisconsin, Madison, WI, USA
- Zijian Ni
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Currently at Amazon, Seattle, WA, USA
- Miron Livny
  - Morgridge Institute for Research, Madison, WI, USA
  - Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
- James Thomson
  - Morgridge Institute for Research, Madison, WI, USA
  - Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI, USA
- Ron Stewart
  - Morgridge Institute for Research, Madison, WI, USA
10
Tan J, Hu J, Dong S. Incorporating entity-level knowledge in pretrained language model for biomedical dense retrieval. Comput Biol Med 2023; 166:107535. [PMID: 37788508 DOI: 10.1016/j.compbiomed.2023.107535] [Received: 03/17/2023] [Revised: 09/04/2023] [Accepted: 09/27/2023] [Indexed: 10/05/2023]
Abstract
In recent years, pre-trained language models (PLMs) have dominated natural language processing (NLP) and achieved outstanding performance in various NLP tasks, including dense retrieval based on PLMs. However, in the biomedical domain, the effectiveness of dense retrieval models based on PLMs still needs to be improved due to the diversity and ambiguity of entity expressions caused by the enrichment of biomedical entities. To alleviate the semantic gap, in this paper, we propose a method that incorporates external knowledge at the entity level into a dense retrieval model to enrich the dense representations of queries and documents. Specifically, we first add additional self-attention and information interaction modules in the Transformer layer of the BERT architecture to perform fusion and interaction between query/document text and entity embeddings from knowledge graphs. We then propose an entity similarity loss to constrain the model to better learn external knowledge from entity embeddings, and further propose a weighted entity concatenation mechanism to balance the impact of entity representations when matching queries and documents. Experiments on two publicly available biomedical retrieval datasets show that our proposed method outperforms state-of-the-art dense retrieval methods. In terms of NDCG metrics, the proposed method (called ELK) improves the ranking performance of coCondenser by at least 5% on both datasets, and also obtains further performance gain over state-of-the-art EVA methods. Although ELK has a more sophisticated architecture, its average query latency is still within the same order of magnitude as that of other efficient methods.
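The NDCG metric used in this abstract rewards rankings that place highly relevant documents early. The sketch below is one common formulation, assuming graded relevance labels, a linear gain, and a log2 position discount; it computes the ideal ordering only from the documents in the ranked list. The function names and example relevance values are illustrative and do not come from the paper.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the top k results divided by the DCG of the ideal ordering.

    Assumes every relevant document appears somewhere in `ranked_relevances`.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg


# Made-up relevance labels in the order a hypothetical retriever returned them.
perfect_ranking = [3, 2, 1, 0]
swapped_ranking = [0, 2, 1, 3]

print("perfect:", ndcg_at_k(perfect_ranking, k=4))
print("swapped:", round(ndcg_at_k(swapped_ranking, k=4), 4))
```

Some evaluation toolkits use an exponential gain (2^rel - 1) instead of the linear gain shown here, so absolute values are only comparable when the same variant is used.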
Affiliation(s)
- Jiajie Tan
  - Guangdong Key Lab of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Jinlong Hu
  - Guangdong Key Lab of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Shoubin Dong
  - Guangdong Key Lab of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
11
Lou P, Fang A, Zhao W, Yao K, Yang Y, Hu J. Potential Target Discovery and Drug Repurposing for Coronaviruses: Study Involving a Knowledge Graph-Based Approach. J Med Internet Res 2023; 25:e45225. [PMID: 37862061 PMCID: PMC10592722 DOI: 10.2196/45225] [Received: 12/21/2022] [Revised: 08/30/2023] [Accepted: 09/22/2023] [Indexed: 10/21/2023]
Abstract
BACKGROUND The global pandemics of severe acute respiratory syndrome, Middle East respiratory syndrome, and COVID-19 have caused unprecedented crises for public health. Coronaviruses are constantly evolving, and it is unknown which new coronavirus will emerge and when the next coronavirus will sweep across the world. Knowledge graphs are expected to help discover the pathogenicity and transmission mechanism of viruses. OBJECTIVE The aim of this study was to discover potential targets and candidate drugs to repurpose for coronaviruses through a knowledge graph-based approach. METHODS We propose a computational and evidence-based knowledge discovery approach to identify potential targets and candidate drugs for coronaviruses from biomedical literature and well-known knowledge bases. To organize the semantic triples extracted automatically from biomedical literature, a semantic conversion model was designed. The literature knowledge was associated and integrated with existing drug and gene knowledge through semantic mapping, and the coronavirus knowledge graph (CovKG) was constructed. We adopted both the knowledge graph embedding model and the semantic reasoning mechanism to discover unrecorded mechanisms of drug action as well as potential targets and drug candidates. Furthermore, we have provided evidence-based support with a scoring and backtracking mechanism. RESULTS The constructed CovKG contains 17,369,620 triples, of which 641,195 were extracted from biomedical literature, covering 13,065 concept unique identifiers, 209 semantic types, and 97 semantic relations of the Unified Medical Language System. Through multi-source knowledge integration, 475 drugs and 262 targets were mapped to existing knowledge, and 41 new drug mechanisms of action were found by semantic reasoning, which were not recorded in the existing knowledge base. Among the knowledge graph embedding models, TransR outperformed others (mean reciprocal rank=0.2510, Hits@10=0.3505). 
A total of 33 potential targets and 18 drug candidates were identified for coronaviruses. Among them, 7 novel drugs (ie, quinine, nelfinavir, ivermectin, asunaprevir, tylophorine, Artemisia annua extract, and resveratrol) and 3 highly ranked targets (ie, angiotensin converting enzyme 2, transmembrane serine protease 2, and M protein) were further discussed. CONCLUSIONS We showed the effectiveness of a knowledge graph-based approach in potential target discovery and drug repurposing for coronaviruses. Our approach can be extended to other viruses or diseases for biomedical knowledge discovery and relevant applications.
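The mean reciprocal rank (MRR) reported for the embedding models in this abstract averages the inverse of each correct answer's rank, so it rewards top-ranked hits far more than mid-list ones. A minimal sketch, assuming 1-based ranks; the function name and the toy ranks are illustrative and unrelated to the CovKG evaluation.

```python
def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank over all test triples (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up ranks of the correct entity for four hypothetical test triples.
toy_ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(toy_ranks)

print(f"MRR: {mrr:.4f}")
```

Because of the reciprocal, moving one answer from rank 2 to rank 1 changes MRR more than moving another from rank 100 to rank 10.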
Affiliation(s)
- Pei Lou
  - Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- An Fang
  - Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Wanqing Zhao
  - Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Kuanda Yao
  - Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Yusheng Yang
  - Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Jiahui Hu
  - Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
12
|
Jeynes JCG, Corney M, James T. A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction. PLoS One 2023; 18:e0291142. [PMID: 37682956 PMCID: PMC10490933 DOI: 10.1371/journal.pone.0291142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Accepted: 08/22/2023] [Indexed: 09/10/2023] Open
Abstract
One area of active research is the use of natural language processing (NLP) to mine biomedical texts for sets of triples (subject-predicate-object) for knowledge graph (KG) construction. While statistical methods to mine co-occurrences of entities within sentences are relatively robust, accurate relationship extraction is more challenging. Herein, we evaluate the Global Network of Biomedical Relationships (GNBR), a dataset that uses distributional semantics to model relationships between biomedical entities. The focus of our paper is an evaluation of a subset of the GNBR data; the relationships between chemicals and genes/proteins. We use Evotec's structured 'Nexus' database of >2.76M chemical-protein interactions as a ground truth to compare with GNBRs relationships and find a micro-averaged precision-recall area under the curve (AUC) of 0.50 and a micro-averaged receiver operating characteristic (ROC) curve AUC of 0.71 across the relationship classes 'inhibits', 'binding', 'agonism' and 'antagonism', when a comparison is made on a sentence-by-sentence basis. We conclude that, even though these micro-average scores are modest, using a high threshold on certain relationship classes like 'inhibits' could yield high fidelity triples that are not reported in structured datasets. We discuss how different methods of processing GNBR data, and the factuality of triples could affect the accuracy of NLP data incorporated into knowledge graphs. We provide a GNBR-Nexus(ChEMBL-subset) merged datafile that contains over 20,000 sentences where a protein/gene-chemical co-occur and includes both the GNBR relationship scores as well as the ChEMBL (manually curated) relationships (e.g., 'agonist', 'inhibitor') -this can be accessed at https://doi.org/10.5281/zenodo.8136752. We envisage this being used to aid curation efforts by the drug discovery community.
Affiliation(s)
- Jonathan C. G. Jeynes
  - Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom
- Matthew Corney
  - Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom
- Tim James
  - Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom
|
13
|
Fan W, Zhang Y, Wang D, Wang C, Yang J. The impact of Yiwei decoction on the LncRNA and CircRNA regulatory networks in premature ovarian insufficiency. Heliyon 2023; 9:e20022. [PMID: 37809621 PMCID: PMC10559751 DOI: 10.1016/j.heliyon.2023.e20022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 08/19/2023] [Accepted: 09/08/2023] [Indexed: 10/10/2023] Open
Abstract
Premature ovarian insufficiency (POI) is a female reproductive aging illness. Yiwei decoction (YWD) is a traditional treatment for Yangming nourishment. YWD can treat premature ovarian insufficiency, but the exact molecular mechanism is unknown. As a result, the differential expression of Long noncoding RNAs (LncRNAs) and Circular RNAs (CircRNAs) in the ovary of POI rats after YWD treatment was investigated in this paper, and the CeRNA regulatory network was built. The model was created using cyclophosphamide. The model group + YWD was in Group A, the model control group was in Group B, and the regular control group was in Group C. In this study, 177 differential expression Long noncoding RNAs (DELncRNAs) and 190 differential expression Circular RNAs (DECircRNAs) were discovered between A and B (P < 0.05, |LogFC| > 1). Following the analysis, 27 DELncRNAs and 96 DECircRNAs (P-adjusted < 0.05, |LogFC| > 1) were discovered. At the same time, we built the CeRNA network using differentially expressed mRNAs and microRNAs (miRNAs) between groups A and B. The DELncRNAs were used to construct a lncRNA-miRNA-mRNA ceRNA network with 27 LncRNAs, 4 miRNAs, and 19 mRNAs. The DECircRNAs were utilized to establish a CircRNA-miRNA-mRNA ceRNA network that was made up of 15 CircRNAs, 4 miRNAs, and 20 mRNAs. The highly correlated regulatory networks were the LncMSTRG.22691.3/miR-3102/ANGPT4 and Circ10_34698898_34699378/miR-33-5p/TTC22. Circ20_12035276_12036793, Circ20_30693935_30696337, Circ4_157723097_157723378 and Circ4_157923266_157923904 occurred concurrently in AvsB, BvsC, and AvsC. MiRDB predicted eight target miRNAs for these CircRNAs. The miRanda (score = 140, energy = -1) binding energy calculation revealed that seven miRNAs were well combined with three CircRNA base complementary pairs. This implies that 3 DECircRNAs could serve as sponges for these miRNAs. 
Network pharmacological analysis showed that ten active components in YWD may regulate the expression of LncRNAs and CircRNAs, such as Stigmasterol, Uridine, Ophiopogonanone A, Gamma-Aminobutyric Acid, and others. In conclusion, this study combined transcriptomics and network pharmacological analysis to identify differentially expressed lncRNAs as well as CircRNAs in ovaries of YWD-treated POI rats, thereby constructing ceRNA networks implicated in POI. This would contribute to clarifying the pathways by which Chinese herbal compounds regulate gene expression in POI.
Affiliation(s)
- Weisen Fan
  - The First Clinical Medical College of Shandong University of Traditional Chinese Medicine, Jinan, 250013, China
- Yingjie Zhang
  - School of Health, Shandong University of Traditional Chinese Medicine, Jinan, 250013, China
- Dandan Wang
  - School of Health, Shandong University of Traditional Chinese Medicine, Jinan, 250013, China
- Chen Wang
  - School of Traditional Chinese Medicine, Shandong University of Chinese Medicine, Jinan, 250013, China
- Jie Yang
  - School of Physical Education and Health, Shandong Sport University, Jinan, 250013, China
|
14
|
Malec SA, Taneja SB, Albert SM, Elizabeth Shaaban C, Karim HT, Levine AS, Munro P, Callahan TJ, Boyce RD. Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: A use case studying depression as a risk factor for Alzheimer's disease. J Biomed Inform 2023; 142:104368. [PMID: 37086959 PMCID: PMC10355339 DOI: 10.1016/j.jbi.2023.104368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Revised: 03/03/2023] [Accepted: 04/17/2023] [Indexed: 04/24/2023]
Abstract
BACKGROUND Causal feature selection is essential for estimating effects from observational data. Identifying confounders is a crucial step in this process. Traditionally, researchers employ content-matter expertise and literature review to identify confounders. Uncontrolled confounding from unidentified confounders threatens validity, conditioning on intermediate variables (mediators) weakens estimates, and conditioning on common effects (colliders) induces bias. Additionally, without special treatment, erroneous conditioning on variables combining roles introduces bias. However, the vast literature is growing exponentially, making it infeasible to assimilate this knowledge. To address these challenges, we introduce a novel knowledge graph (KG) application enabling causal feature selection by combining computable literature-derived knowledge with biomedical ontologies. We present a use case of our approach specifying a causal model for estimating the total causal effect of depression on the risk of developing Alzheimer's disease (AD) from observational data. METHODS We extracted computable knowledge from a literature corpus using three machine reading systems and inferred missing knowledge using logical closure operations. Using a KG framework, we mapped the output to target terminologies and combined it with ontology-grounded resources. We translated epidemiological definitions of confounder, collider, and mediator into queries for searching the KG and summarized the roles played by the identified variables. We compared the results with output from a complementary method and published observational studies and examined a selection of confounding and combined role variables in-depth. RESULTS Our search identified 128 confounders, including 58 phenotypes, 47 drugs, 35 genes, 23 collider, and 16 mediator phenotypes. However, only 31 of the 58 confounder phenotypes were found to behave exclusively as confounders, while the remaining 27 phenotypes played other roles. 
Obstructive sleep apnea emerged as a potential novel confounder for depression and AD. Anemia exemplified a variable playing combined roles. CONCLUSION Our findings suggest combining machine reading and KG could augment human expertise for causal feature selection. However, the complexity of causal feature selection for depression with AD highlights the need for standardized field-specific databases of causal variables. Further work is needed to optimize KG search and transform the output for human consumption.
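The idea of translating the epidemiological definitions of confounder, mediator, and collider into graph queries can be sketched in a few lines of Python. This is a simplified illustration, not the authors' KG queries. It considers only direct edges, the node names (`c1`, `m1`, `k1`, `z1`) are hypothetical placeholders, and the toy edge list carries no biomedical meaning.

```python
def classify_roles(edges, exposure, outcome):
    """Classify nodes by their direct causal edges to the exposure and outcome.

    confounder: node -> exposure and node -> outcome
    mediator:   exposure -> node and node -> outcome
    collider:   exposure -> node and outcome -> node
    """
    parents = lambda n: {s for s, t in edges if t == n}
    children = lambda n: {t for s, t in edges if s == n}
    skip = {exposure, outcome}

    roles = {
        "confounders": (parents(exposure) & parents(outcome)) - skip,
        "mediators": (children(exposure) & parents(outcome)) - skip,
        "colliders": (children(exposure) & children(outcome)) - skip,
    }
    # Nodes that satisfy more than one definition play combined roles.
    all_nodes = set().union(*roles.values())
    roles["combined"] = {
        n for n in all_nodes
        if sum(n in members for members in roles.values()) > 1
    }
    return roles


# Toy directed edges (source, target); purely illustrative.
edges = {
    ("exposure", "outcome"),
    ("c1", "exposure"), ("c1", "outcome"),      # c1 is a confounder
    ("exposure", "m1"), ("m1", "outcome"),      # m1 is a mediator
    ("exposure", "k1"), ("outcome", "k1"),      # k1 is a collider
    ("z1", "exposure"), ("z1", "outcome"),      # z1 is a confounder ...
    ("exposure", "z1"),                         # ... and also a mediator
}

roles = classify_roles(edges, "exposure", "outcome")
print(roles)
```

A literature-derived graph can contain edges in both directions between the same pair of concepts, which is how a single variable such as `z1` ends up satisfying several definitions at once.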
Affiliation(s)
- Scott A Malec
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Sanya B Taneja
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Steven M Albert
  - Department of Behavioral and Community Health Sciences, School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
- C Elizabeth Shaaban
  - Department of Epidemiology, School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
- Helmet T Karim
  - Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA, USA; Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA, USA
- Arthur S Levine
  - Department of Neurobiology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA; The Brain Institute, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Paul Munro
  - School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, USA
- Tiffany J Callahan
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Richard D Boyce
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
|
15
|
Jiang Y, Kavuluru R. End-to-End n-ary Relation Extraction for Combination Drug Therapies. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2023; 2023:72-80. [PMID: 38283165 PMCID: PMC10814995 DOI: 10.1109/ichi57859.2023.00021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2024]
Abstract
Combination drug therapies are treatment regimens that involve two or more drugs, administered more commonly for patients with cancer, HIV, malaria, or tuberculosis. Currently there are over 350K articles in PubMed that use the combination drug therapy MeSH heading with at least 10K articles published per year over the past two decades. Extracting combination therapies from scientific literature inherently constitutes an n-ary relation extraction problem. Unlike in the general n-ary setting where n is fixed (e.g., drug-gene-mutation relations where n = 3), extracting combination therapies is a special setting where n ≥ 2 is dynamic, depending on each instance. Recently, Tiktinsky et al. (NAACL 2022) introduced a first of its kind dataset, CombDrugExt, for extracting such therapies from literature. Here, we use a sequence-to-sequence style end-to-end extraction method to achieve an F1-Score of 66.7% on the CombDrugExt test set for positive (or effective) combinations. This is an absolute ≈ 5% F1-score improvement even over the prior best relation classification score with spotted drug entities (hence, not end-to-end). Thus our effort introduces a state-of-the-art first model for end-to-end extraction that is already superior to the best prior non end-to-end model for this task. Our model seamlessly extracts all drug entities and relations in a single pass and is highly suitable for dynamic n-ary extraction scenarios.
Affiliation(s)
- Yuhang Jiang
  - Department of Computer Science, University of Kentucky, Lexington, KY USA
- Ramakanth Kavuluru
  - Division of Biomedical Informatics, Dept. of Internal Medicine, Univ. of Kentucky, Lexington, KY USA
|
16
|
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) Discovers and Annotates Biomedical Knowledge Using Co-Occurrence and Transformer Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.30.542911. [PMID: 37397987 PMCID: PMC10312590 DOI: 10.1101/2023.05.30.542911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Background The PubMed database contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: 1) they identify a relationship but not the type of relationship, 2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, 3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or 4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. Results We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. 
Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. Conclusions SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
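The A-B-C pattern described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the SKiM implementation. It assumes documents are represented as sets of terms and that significance is judged with a one-sided Fisher exact test on document co-occurrence counts (the choice of test and the `alpha` threshold are assumptions made here). The term names and toy corpus are hypothetical.

```python
from math import comb


def fisher_right_tail(both, only_1, only_2, neither):
    """One-sided Fisher exact p-value: P(overlap >= both) if the terms were independent."""
    n = both + only_1 + only_2 + neither
    n1 = both + only_1            # documents containing term 1
    n2 = both + only_2            # documents containing term 2
    upper = min(n1, n2)
    tail = sum(comb(n1, k) * comb(n - n1, n2 - k) for k in range(both, upper + 1))
    return tail / comb(n, n2)


def cooccurrence_p(docs, t1, t2):
    """p-value for terms t1 and t2 co-occurring in the same documents."""
    both = sum(1 for d in docs if t1 in d and t2 in d)
    only_1 = sum(1 for d in docs if t1 in d and t2 not in d)
    only_2 = sum(1 for d in docs if t2 in d and t1 not in d)
    neither = len(docs) - both - only_1 - only_2
    return fisher_right_tail(both, only_1, only_2, neither)


def abc_links(docs, a, b_terms, c_terms, alpha=0.05):
    """Return (A, B, C) triples where both A-B and B-C co-occur significantly."""
    links = []
    for b in b_terms:
        if cooccurrence_p(docs, a, b) >= alpha:
            continue  # A-B link not significant, skip this intermediate
        for c in c_terms:
            if cooccurrence_p(docs, b, c) < alpha:
                links.append((a, b, c))
    return links


# Toy corpus: term_a and term_c never appear together, but both appear with term_b1.
docs = (
    [{"term_a", "term_b1"}] * 5
    + [{"term_b1", "term_c"}] * 5
    + [{"other", "term_b2"}] * 2
    + [{"other"}] * 8
)

links = abc_links(docs, "term_a", ["term_b1", "term_b2"], ["term_c"])
print(links)
```

In the toy corpus the A and C terms share no document, so a direct search would miss the connection, while the intermediate `term_b1` links them.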
|
17
|
Hsiao TK, Torvik VI. OpCitance: Citation contexts identified from the PubMed Central open access articles. Sci Data 2023; 10:243. [PMID: 37117220 PMCID: PMC10139909 DOI: 10.1038/s41597-023-02134-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Accepted: 04/04/2023] [Indexed: 04/30/2023] Open
Abstract
OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the "citation contexts"). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing styles. Only 0.5% of citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). PubMed IDs (PMIDs) linked to inline citations in the XML files differed from citations harvested using the NCBI E-Utilities for 70.96% of the articles. Using an in-house citation matcher, called Patci, 6.84% of the referenced PMIDs were supplemented and corrected. OpCitance includes fewer articles in total than the Semantic Scholar Open Research Corpus, but OpCitance has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly.
Affiliation(s)
- Tzu-Kun Hsiao
  - School of Information Sciences, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign, IL, 61820, USA.
- Vetle I Torvik
  - School of Information Sciences, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign, IL, 61820, USA.
|
18
|
Sousa DF, Couto FM. K-RET: knowledgeable biomedical relation extraction system. BIOINFORMATICS (OXFORD, ENGLAND) 2023; 39:7108769. [PMID: 37018156 PMCID: PMC10112952 DOI: 10.1093/bioinformatics/btad174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 02/25/2023] [Accepted: 03/29/2023] [Indexed: 04/20/2023]
Abstract
MOTIVATION Relation extraction (RE) is a crucial process to deal with the amount of text published daily, e.g. to find missing associations in a database. RE is a text mining task for which the state-of-the-art approaches use bidirectional encoders, namely, BERT. However, state-of-the-art performance may be limited by the lack of efficient external knowledge injection approaches, with a larger impact in the biomedical area given the widespread usage and high quality of biomedical ontologies. This knowledge can propel these systems forward by aiding them in predicting more explainable biomedical associations. With this in mind, we developed K-RET, a novel, knowledgeable biomedical RE system that, for the first time, injects knowledge by handling different types of associations, multiple sources and where to apply it, and multi-token entities. RESULTS We tested K-RET on three independent and open-access corpora (DDI, BC5CDR, and PGR) using four biomedical ontologies handling different entities. K-RET improved state-of-the-art results by 2.68% on average, with the DDI Corpus yielding the most significant boost in performance, from 79.30% to 87.19% in F-measure, representing a P-value of 2.91 × 10⁻¹². AVAILABILITY AND IMPLEMENTATION https://github.com/lasigeBioTM/K-RET.
Affiliation(s)
- Diana F Sousa
  - Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Francisco M Couto
  - Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
|
19
|
Taneja SB, Callahan TJ, Paine MF, Kane-Gill SL, Kilicoglu H, Joachimiak MP, Boyce RD. Developing a Knowledge Graph for Pharmacokinetic Natural Product-Drug Interactions. J Biomed Inform 2023; 140:104341. [PMID: 36933632 PMCID: PMC10150409 DOI: 10.1016/j.jbi.2023.104341] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/24/2022] [Revised: 01/09/2023] [Accepted: 03/13/2023] [Indexed: 03/17/2023]
Abstract
BACKGROUND Pharmacokinetic natural product-drug interactions (NPDIs) occur when botanical or other natural products are co-consumed with pharmaceutical drugs. With the growing use of natural products, the risk for potential NPDIs and consequent adverse events has increased. Understanding mechanisms of NPDIs is key to preventing or minimizing adverse events. Although biomedical knowledge graphs (KGs) have been widely used for drug-drug interaction applications, computational investigation of NPDIs is novel. We constructed NP-KG as a first step toward computational discovery of plausible mechanistic explanations for pharmacokinetic NPDIs that can be used to guide scientific research. METHODS We developed a large-scale, heterogeneous KG with biomedical ontologies, linked data, and full texts of the scientific literature. To construct the KG, biomedical ontologies and drug databases were integrated with the Phenotype Knowledge Translator framework. The semantic relation extraction systems, SemRep and Integrated Network and Dynamic Reasoning Assembler, were used to extract semantic predications (subject-relation-object triples) from full texts of the scientific literature related to the exemplar natural products green tea and kratom. A literature-based graph constructed from the predications was integrated into the ontology-grounded KG to create NP-KG. NP-KG was evaluated with case studies of pharmacokinetic green tea- and kratom-drug interactions through KG path searches and meta-path discovery to determine congruent and contradictory information in NP-KG compared to ground truth data. We also conducted an error analysis to identify knowledge gaps and incorrect predications in the KG. RESULTS The fully integrated NP-KG consisted of 745,512 nodes and 7,249,576 edges. 
Evaluation of NP-KG resulted in congruent (38.98% for green tea, 50% for kratom), contradictory (15.25% for green tea, 21.43% for kratom), and both congruent and contradictory (15.25% for green tea, 21.43% for kratom) information compared to ground truth data. Potential pharmacokinetic mechanisms for several purported NPDIs, including the green tea-raloxifene, green tea-nadolol, kratom-midazolam, kratom-quetiapine, and kratom-venlafaxine interactions were congruent with the published literature. CONCLUSION NP-KG is the first KG to integrate biomedical ontologies with full texts of the scientific literature focused on natural products. We demonstrate the application of NP-KG to identify known pharmacokinetic interactions between natural products and pharmaceutical drugs mediated by drug metabolizing enzymes and transporters. Future work will incorporate context, contradiction analysis, and embedding-based methods to enrich NP-KG. NP-KG is publicly available at https://doi.org/10.5281/zenodo.6814507. The code for relation extraction, KG construction, and hypothesis generation is available at https://github.com/sanyabt/np-kg.
Affiliation(s)
- Sanya B Taneja
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15206, USA.
- Tiffany J Callahan
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Mary F Paine
  - Department of Pharmaceutical Sciences, College of Pharmacy and Pharmaceutical Sciences, Washington State University, Spokane, WA 99202, USA
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA
- Marcin P Joachimiak
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Richard D Boyce
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
|
20
|
Lyu K, Tian Y, Shang Y, Zhou T, Yang Z, Liu Q, Yao X, Zhang P, Chen J, Li J. Causal knowledge graph construction and evaluation for clinical decision support of diabetic nephropathy. J Biomed Inform 2023; 139:104298. [PMID: 36731730 DOI: 10.1016/j.jbi.2023.104298] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/30/2022] [Revised: 12/25/2022] [Accepted: 01/25/2023] [Indexed: 01/31/2023]
Abstract
BACKGROUND Many important clinical decisions require causal knowledge (CK) to take action. Although many causal knowledge bases for medicine have been constructed, a comprehensive evaluation based on real-world data and methods for handling potential knowledge noise are still lacking. OBJECTIVE The objectives of our study are threefold: (1) propose a framework for the construction of a large-scale and high-quality causal knowledge graph (CKG); (2) design the methods for knowledge noise reduction to improve the quality of the CKG; (3) evaluate the knowledge completeness and accuracy of the CKG using real-world data. MATERIAL AND METHODS We extracted causal triples from three knowledge sources (SemMedDB, UpToDate and Churchill's Pocketbook of Differential Diagnosis) based on rule methods and language models, performed ontological encoding, and then designed semantic modeling between electronic health record (EHR) data and the CKG to complete knowledge instantiation. We proposed two graph pruning strategies (co-occurrence ratio and causality ratio) to reduce the potential noise introduced by SemMedDB. Finally, the evaluation was carried out by taking the diagnostic decision support (DDS) of diabetic nephropathy (DN) as a real-world case. The data originated from a Chinese hospital EHR system from October 2010 to October 2020. The knowledge completeness and accuracy of the CKG were evaluated based on three state-of-the-art embedding methods (R-GCN, MHGRN and MedPath), the annotated clinical text and the expert review, respectively. RESULTS This graph included 153,289 concepts and 1,719,968 causal triples. A total of 1427 inpatient data were used for evaluation. Better results were achieved by combining three knowledge sources than using only SemMedDB (three models: area under the receiver operating characteristic curve (AUC): p < 0.01, F1: p < 0.01), and the graph covered 93.9 % of the causal relations between diseases and diagnostic evidence recorded in clinical text. 
Causal relations played a vital role in all relations related to disease progression for DDS of DN (three models: AUC: p > 0.05, F1: p > 0.05), and after pruning, the knowledge accuracy of the CKG was significantly improved (three models: AUC: p < 0.01, F1: p < 0.01; expert review: average accuracy: + 5.5 %). CONCLUSIONS The results demonstrated that our proposed CKG could completely and accurately capture the abstract CK under the concrete EHR data, and the pruning strategies could improve the knowledge accuracy of our CKG. The CKG has the potential to be applied to the DDS of diseases.
Affiliation(s)
- Kewei Lyu
  - Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Yu Tian
  - Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Yong Shang
  - Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Tianshu Zhou
  - Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Ziyue Yang
  - Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Qianghua Liu
  - Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Xi Yao
  - Kidney Disease Center, the First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Ping Zhang
  - Kidney Disease Center, the First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Jianghua Chen
  - Kidney Disease Center, the First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Jingsong Li
  - Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China; Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China.
|
21
|
Sakor A, Jozashoori S, Niazmand E, Rivas A, Bougiatiotis K, Aisopos F, Iglesias E, Rohde PD, Padiya T, Krithara A, Paliouras G, Vidal ME. Knowledge4COVID-19: A semantic-based approach for constructing a COVID-19 related knowledge graph from various sources and analyzing treatments' toxicities. WEB SEMANTICS (ONLINE) 2023; 75:100760. [PMID: 36268112 PMCID: PMC9558693 DOI: 10.1016/j.websem.2022.100760] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/08/2021] [Revised: 09/08/2022] [Accepted: 10/05/2022] [Indexed: 05/20/2023]
Abstract
In this paper, we present Knowledge4COVID-19, a framework that aims to showcase the power of integrating disparate sources of knowledge to discover adverse drug effects caused by drug-drug interactions among COVID-19 treatments and pre-existing condition drugs. Initially, we focus on constructing the Knowledge4COVID-19 knowledge graph (KG) from the declarative definition of mapping rules using the RDF Mapping Language. Since valuable information about drug treatments, drug-drug interactions, and side effects is present in textual descriptions in scientific databases (e.g., DrugBank) or in scientific literature (e.g., CORD-19, the COVID-19 Open Research Dataset), the Knowledge4COVID-19 framework implements Natural Language Processing. The Knowledge4COVID-19 framework extracts relevant entities and predicates that enable the fine-grained description of COVID-19 treatments and the potential adverse events that may occur when these treatments are combined with treatments of common comorbidities, e.g., hypertension, diabetes, or asthma. Moreover, on top of the KG, several techniques for the discovery and prediction of interactions and potential adverse effects of drugs have been developed with the aim of suggesting more accurate treatments for the virus. We provide services to traverse the KG and visualize the effects that a group of drugs may have on a treatment outcome. Knowledge4COVID-19 was part of the Pan-European hackathon #EUvsVirus in April 2020 and is publicly available as a resource through a GitHub repository and a DOI.
Collapse
Affiliation(s)
- Ahmad Sakor
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Samaneh Jozashoori
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Emetis Niazmand
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Ariam Rivas
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Konstantinos Bougiatiotis
- Institute of Informatics & Telecommunications, NCSR Demokritos, Patr. Grigoriou & Neapoleos Str, Ag. Paraskevi, Athens, Greece
- Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiou 30, Athens, Greece
| | - Fotis Aisopos
- Institute of Informatics & Telecommunications, NCSR Demokritos, Patr. Grigoriou & Neapoleos Str, Ag. Paraskevi, Athens, Greece
| | - Enrique Iglesias
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Philipp D Rohde
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Trupti Padiya
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
| | - Anastasia Krithara
- Institute of Informatics & Telecommunications, NCSR Demokritos, Patr. Grigoriou & Neapoleos Str, Ag. Paraskevi, Athens, Greece
| | - Georgios Paliouras
- Institute of Informatics & Telecommunications, NCSR Demokritos, Patr. Grigoriou & Neapoleos Str, Ag. Paraskevi, Athens, Greece
| | - Maria-Esther Vidal
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1 B, Hannover, Germany
- L3S Research Center, University of Hannover, Appelstraße 9a, Hannover, Germany
|
22
|
Ma C, Zhou Z, Liu H, Koslicki D. KGML-xDTD: a knowledge graph-based machine learning framework for drug treatment prediction and mechanism description. Gigascience 2022; 12:giad057. [PMID: 37602759 PMCID: PMC10441000 DOI: 10.1093/gigascience/giad057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 05/05/2023] [Accepted: 07/04/2023] [Indexed: 08/22/2023] Open
Abstract
BACKGROUND Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action (MOAs) between repurposed drugs and their target diseases remain largely unknown, which is still a main obstacle for computational drug repurposing methods to be widely adopted in clinical settings. RESULTS In this work, we propose KGML-xDTD: a Knowledge Graph-based Machine Learning framework for explainably predicting Drugs Treating Diseases. It is a 2-module framework that not only predicts the treatment probabilities between drugs/compounds and diseases but also biologically explains them via knowledge graph (KG) path-based, testable MOAs. We leverage knowledge-and-publication-based information to extract biologically meaningful "demonstration paths" as the intermediate guidance in the Graph-based Reinforcement Learning (GRL) path-finding process. Comprehensive experiments and case study analyses show that the proposed framework can achieve state-of-the-art performance in both predictions of drug repurposing and recapitulation of human-curated drug MOA paths. CONCLUSIONS KGML-xDTD is the first model framework that can offer KG path explanations for drug repurposing predictions by leveraging the combination of prediction outcomes and existing biological knowledge and publications. We believe it can effectively reduce "black-box" concerns and increase prediction confidence for drug repurposing based on predicted path-based explanations and further accelerate the process of drug discovery for emerging diseases.
Affiliation(s)
- Chunyu Ma
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
| | - Zhihan Zhou
- Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
| | - Han Liu
- Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
| | - David Koslicki
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
- Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16801, USA
- Department of Biology, Pennsylvania State University, State College, PA 16801, USA
|
23
|
Zou L, Bao W, Gao Y, Chen M, Wu Y, Wang S, Li C, Zhang J, Zhang D, Wang Q, Zhu A. Integrated Analysis of Transcriptome and microRNA Profile Reveals the Toxicity of Euphorbia Factors toward Human Colon Adenocarcinoma Cell Line Caco-2. Molecules 2022; 27:molecules27206931. [PMID: 36296525 PMCID: PMC9608949 DOI: 10.3390/molecules27206931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 10/13/2022] [Accepted: 10/14/2022] [Indexed: 11/22/2022] Open
Abstract
Euphorbia factors, lathyrane-type diterpenoids isolated from the medical herb Euphorbia lathyris L. (Euphorbiaceae), have been associated with intestinal irritation toxicity, but the mechanisms underlying this phenomenon are still unknown. The objective of this study was to evaluate the transcriptome and miRNA profiles of human colon adenocarcinoma Caco-2 cells in response to Euphorbia factors L1 (EFL1) and EFL2. Whole transcriptomes of mRNA and microRNA (miRNA) were obtained using second generation high-throughput sequencing technology in response to 200 μM EFL treatment for 72 h, and the differentially expressed genes and metabolism pathway were enriched. Gene structure changes were analyzed by comparing them with reference genome sequences. After 72 h of treatment, 16 miRNAs and 154 mRNAs were differently expressed between the EFL1 group and the control group, and 47 miRNAs and 1101 mRNAs were differentially expressed between the EFL2 group and the control. Using clusters of orthologous protein enrichment, the sequenced mRNAs were shown to be mainly involved in transcription, post-translational modification, protein turnover, chaperones, signal transduction mechanisms, intracellular trafficking, secretion, vesicular transport, and the cytoskeleton. The differentially expressed mRNA functions and pathways were enriched in transmembrane transport, T cell extravasation, the IL-17 signaling pathway, apoptosis, and the cell cycle. The differentially expressed miRNA EFLs caused changes in the structure of the gene, including alternative splicing, insertion and deletion, and single nucleotide polymorphisms. This study reveals the underlying mechanism responsible for the toxicity of EFLs in intestinal cells based on transcriptome and miRNA profiles of gene expression and structure.
Affiliation(s)
- Lingyue Zou
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Wenqiang Bao
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Yadong Gao
- Department of Toxicology, School of Public Health, Peking University, Beijing 100191, China
- Fujian Provincial Key Laboratory of Zoonosis Research, Fujian Center for Disease Control and Prevention, Fuzhou 350001, China
| | - Mengting Chen
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Yajiao Wu
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Shuo Wang
- Department of Toxicology, School of Public Health, Peking University, Beijing 100191, China
| | - Chutao Li
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Jian Zhang
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Dongcheng Zhang
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
| | - Qi Wang
- Department of Toxicology, School of Public Health, Peking University, Beijing 100191, China
- Key Laboratory of State Administration of Traditional Chinese Medicine for Compatibility Toxicology, Beijing 100191, China
- Beijing Key Laboratory of Toxicological Research and Risk Assessment for Food Safety, Beijing 100191, China
- Correspondence: (Q.W.); (A.Z.)
| | - An Zhu
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350108, China
- Correspondence: (Q.W.); (A.Z.)
|
24
|
Foksinska A, Crowder CM, Crouse AB, Henrikson J, Byrd WE, Rosenblatt G, Patton MJ, He K, Tran-Nguyen TK, Zheng M, Ramsey SA, Amin N, Osborne J, Might M. The precision medicine process for treating rare disease using the artificial intelligence tool mediKanren. Front Artif Intell 2022; 5:910216. [PMID: 36248623 PMCID: PMC9562701 DOI: 10.3389/frai.2022.910216] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/01/2022] [Accepted: 08/23/2022] [Indexed: 12/03/2022] Open
Abstract
There are over 6,000 different rare diseases estimated to impact 300 million people worldwide. As genetic testing becomes more common practice in the clinical setting, the number of rare disease diagnoses will continue to increase, resulting in the need for novel treatment options. Identifying treatments for these disorders is challenging due to a limited understanding of disease mechanisms, small cohort sizes, interindividual symptom variability, and little commercial incentive to develop new treatments. A promising avenue for treatment is drug repurposing, where FDA-approved drugs are repositioned as novel treatments. However, linking disease mechanisms to drug action can be extraordinarily difficult and requires a depth of knowledge across multiple fields, which is complicated by the rapid pace of biomedical knowledge discovery. To address these challenges, The Hugh Kaul Precision Medicine Institute developed an artificial intelligence tool, mediKanren, that leverages the mechanistic insight of genetic disorders to identify therapeutic options. Using knowledge graphs, mediKanren enables an efficient way to link all relevant literature and databases. This tool has allowed for a scalable process that has been used to help over 500 rare disease families. Here, we provide a description of our process, the advantages of mediKanren, and its impact on rare disease patients.
Affiliation(s)
- Aleksandra Foksinska
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Camerron M. Crowder
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Neurobiology, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Andrew B. Crouse
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | | | - William E. Byrd
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Gregory Rosenblatt
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Michael J. Patton
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Kaiwen He
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Thi K. Tran-Nguyen
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Marissa Zheng
- Department of Molecular and Cellular Biology, Harvard College, Cambridge, MA, United States
| | - Stephen A. Ramsey
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, United States
| | - Nada Amin
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, United States
| | - John Osborne
- Department of Medicine, Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Matthew Might
- The Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, United States
|
25
|
Nian Y, Hu X, Zhang R, Feng J, Du J, Li F, Bu L, Zhang Y, Chen Y, Tao C. Mining on Alzheimer's diseases related knowledge graph to identity potential AD-related semantic triples for drug repurposing. BMC Bioinformatics 2022; 23:407. [PMID: 36180861 PMCID: PMC9523633 DOI: 10.1186/s12859-022-04934-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND To date, there are no effective treatments for most neurodegenerative diseases. Knowledge graphs can provide comprehensive and semantic representation for heterogeneous data, and have been successfully leveraged in many biomedical applications including drug repurposing. Our objective is to construct a knowledge graph from literature to study the relations between Alzheimer's disease (AD) and chemicals, drugs and dietary supplements in order to identify opportunities to prevent or delay neurodegenerative progression. We collected biomedical annotations and extracted their relations using SemRep via SemMedDB. We used both a BERT-based classifier and rule-based methods during data preprocessing to exclude noise while preserving most AD-related semantic triples. The 1,672,110 filtered triples were used to train knowledge graph completion algorithms (i.e., TransE, DistMult, and ComplEx) to predict candidates that might be helpful for AD treatment or prevention. RESULTS Among the three knowledge graph completion models, TransE outperformed the other two (MR = 10.53, Hits@1 = 0.28). We leveraged the time-slicing technique to further evaluate the prediction results. We found supporting evidence for most highly ranked candidates predicted by our model, which indicates that our approach can yield reliable new knowledge. CONCLUSION This paper shows that our graph mining model can predict reliable new relationships between AD and other entities (i.e., dietary supplements, chemicals, and drugs). The knowledge graph constructed can facilitate data-driven knowledge discoveries and the generation of novel hypotheses.
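As a rough illustration of the metrics reported in this abstract, the sketch below scores triples with a TransE-style function and derives a mean rank and Hits@k from the resulting ranks. It assumes that MR denotes mean rank, as is common in the knowledge graph completion literature. The two-dimensional embeddings, entity names, and helper functions are hypothetical and are not taken from the paper.

```python
import math


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_of_true_tail(head, relation, true_tail, candidates):
    """1-based rank of the true tail among all candidate tails (higher score ranks first)."""
    true_score = transe_score(head, relation, candidates[true_tail])
    better = sum(
        1
        for name, vec in candidates.items()
        if name != true_tail and transe_score(head, relation, vec) > true_score
    )
    return better + 1


def mean_rank(ranks):
    """Average rank of the true entities; lower is better."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy embeddings (illustrative only, not from the paper).
heads = {"chemical_a": [1.0, 0.0], "chemical_b": [0.0, 1.0]}
tails = {"entity_x": [1.0, 1.0], "entity_y": [3.0, 3.0]}
relation = [0.0, 1.0]

# Hypothetical held-out test pairs: (head, true tail).
test_pairs = [("chemical_a", "entity_x"), ("chemical_b", "entity_y")]
ranks = [rank_of_true_tail(heads[h], relation, t, tails) for h, t in test_pairs]

print(ranks, mean_rank(ranks), hits_at_k(ranks, 1))
```

Mean rank is sensitive to a few badly ranked triples, whereas Hits@1 only counts exact top-ranked hits, which is one reason the two are usually reported together.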
Affiliation(s)
- Yi Nian
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030 USA
| | - Xinyue Hu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030 USA
| | - Rui Zhang
- Department of Pharmaceutical Care & Health System (PCHS) and the Institute for Health Informatics (IHI), University of Minnesota, 7-115A Weaver-Densford Hall, Minneapolis, MN 55455 USA
| | - Jingna Feng
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030 USA
| | - Jingcheng Du
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030 USA
| | - Fang Li
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030 USA
| | - Larry Bu
- University of Maryland School of Medicine, 655 W Baltimore St S, Baltimore, MD 21201 USA
| | - Yuji Zhang
- University of Maryland School of Medicine, 655 W Baltimore St S, Baltimore, MD 21201 USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics (DBEI), the Perelman School of Medicine, University of Pennsylvania, 602 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104 USA
| | - Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030 USA
|
26
|
Lee Y, Son J, Song M. BertSRC: transformer-based semantic relation classification. BMC Med Inform Decis Mak 2022; 22:234. [PMID: 36068535 PMCID: PMC9446816 DOI: 10.1186/s12911-022-01977-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 08/11/2022] [Indexed: 11/13/2022] Open
Abstract
The relationships between biomedical entities are complex, and many of them have not yet been identified. For many biomedical research areas including drug discovery, it is of paramount importance to identify the relationships that have already been established through a comprehensive literature survey. However, manually searching through literature is difficult as the number of biomedical publications continues to increase. Therefore, the relation classification task, which automatically mines meaningful relations from the literature, is spotlighted in the field of biomedical text mining. By applying relation classification techniques to the accumulated biomedical literature, existing semantic relations between biomedical entities that can help to infer previously unknown relationships are efficiently grasped. To develop semantic relation classification models, which are a type of supervised machine learning model, it is essential to construct a training dataset that is manually annotated by biomedical experts with semantic relations among biomedical entities. Any advanced model must be trained on a dataset with reliable quality and meaningful scale to be deployed in the real world and to assist biologists in their research. In addition, as the number of such public datasets increases, the performance of machine learning algorithms can be accurately revealed and compared by using those datasets as a benchmark for model development and improvement. In this paper, we aim to build such a dataset. Along with that, to validate the usability of the dataset as training data for relation classification models and to improve the performance of the relation extraction task, we built a relation classification model based on Bidirectional Encoder Representations from Transformers (BERT) trained on our dataset, applying our newly proposed fine-tuning methodology.
In experiments comparing performance among several models based on different deep learning algorithms, our model with the proposed fine-tuning methodology showed the best performance. The experimental results show that the constructed training dataset is an important information resource for the development and evaluation of semantic relation extraction models. Furthermore, relation extraction performance can be improved by integrating our proposed fine-tuning methodology. Therefore, this can lead to the promotion of future text mining research in the biomedical field.
Affiliation(s)
- Yeawon Lee
- Department of Library and Information Science, Yonsei University, Seoul, South Korea
| | - Jinseok Son
- Department of Digital Analytics, Yonsei University, Seoul, South Korea
| | - Min Song
- Department of Library and Information Science, Yonsei University, Seoul, South Korea.
|
27
|
Sosa DN, Altman RB. Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference. Brief Bioinform 2022; 23:6640007. [PMID: 35817308 PMCID: PMC9294417 DOI: 10.1093/bib/bbac268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2022] [Revised: 05/25/2022] [Accepted: 06/07/2022] [Indexed: 11/30/2022] Open
Abstract
The cost of drug development continues to rise and may be prohibitive in cases of unmet clinical need, particularly for rare diseases. Artificial intelligence-based methods are promising in their potential to discover new treatment options. The task of drug repurposing hypothesis generation is well-posed as a link prediction problem in a knowledge graph (KG) of interacting drugs, proteins, genes and disease phenotypes. KGs derived from biomedical literature are semantically rich and up-to-date representations of scientific knowledge. Inference methods on scientific KGs can be confounded by unspecified contexts and contradictions. Extracting context enables incorporation of relevant pharmacokinetic and pharmacodynamic detail, such as tissue specificity of interactions. Contradictions in biomedical KGs may arise when contexts are omitted or due to contradicting research claims. In this review, we describe challenges to creating literature-scale representations of pharmacological knowledge and survey current approaches toward incorporating context and resolving contradictions.
Affiliation(s)
- Daniel N Sosa
- Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
| | - Russ B Altman
- Department of Biological Engineering; Department of Genetics; Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
|
28
|
Schutte D, Vasilakes J, Bompelli A, Zhou Y, Fiszman M, Xu H, Kilicoglu H, Bishop JR, Adam T, Zhang R. Discovering novel drug-supplement interactions using SuppKG generated from the biomedical literature. J Biomed Inform 2022; 131:104120. [PMID: 35709900 PMCID: PMC9335448 DOI: 10.1016/j.jbi.2022.104120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Revised: 04/26/2022] [Accepted: 06/08/2022] [Indexed: 12/04/2022]
Abstract
Objective: Develop a novel methodology to create a comprehensive knowledge graph (SuppKG) to represent a domain with limited coverage in the Unified Medical Language System (UMLS), specifically dietary supplement (DS) information for discovering drug-supplement interactions (DSI), by leveraging biomedical natural language processing (NLP) technologies and a DS domain terminology. Materials and Methods: We created SemRepDS (an extension of an NLP tool, SemRep), capable of extracting semantic relations from abstracts by leveraging a DS-specific terminology (iDISK) containing 28,884 DS terms not found in the UMLS. PubMed abstracts were processed using SemRepDS to generate semantic relations, which were then filtered using a PubMedBERT model to remove incorrect relations before generating SuppKG. Two discovery pathways were applied to SuppKG to identify potential DSIs, which were then compared with an existing DSI database and also evaluated by medical professionals for mechanistic plausibility. Results: SemRepDS returned 158.5% more DS entities and 206.9% more DS relations than SemRep. The fine-tuned PubMedBERT model (which significantly outperformed other machine learning and BERT models) obtained an F1 score of 0.8605 and removed 43.86% of semantic relations, improving the precision of the relations by 26.4% over pre-filtering. SuppKG consists of 56,635 nodes and 595,222 directed edges with 2,928 DS-specific nodes and 164,738 edges. Manual review of findings identified 182 of 250 (72.8%) proposed DS-Gene-Drug and 77 of 100 (77%) proposed DS-Gene1-Function-Gene2-Drug pathways to be mechanistically plausible. Discussion: With added DS terminology to the UMLS, SemRepDS has the capability to find more DS-specific semantic relationships from PubMed than SemRep. The utility of the resulting SuppKG was demonstrated using discovery patterns to find novel DSIs.
Conclusion: For a domain with limited coverage in traditional terminologies (e.g., UMLS), we demonstrated an approach to leverage domain terminology and improve existing NLP tools to generate a more comprehensive knowledge graph for the downstream task. Although this study focuses on DSI, the method may be adapted to other domains.
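The DS-Gene-Drug discovery pattern mentioned in this abstract can be pictured as a two-hop path query over directed triples. The sketch below is only an illustration under stated assumptions: the abstract does not specify the predicates, edge directions, or filtering rules used, so the edge direction (supplement to gene, gene to drug), the exclusion of pairs that already share a direct edge, and all entity and predicate names are hypothetical.

```python
from collections import defaultdict


def ds_gene_drug_paths(edges, supplements, genes, drugs):
    """Return (supplement, gene, drug) paths where the supplement and drug
    are linked through a gene but share no direct edge."""
    neighbours = defaultdict(set)
    for subject, _predicate, obj in edges:
        neighbours[subject].add(obj)

    paths = []
    for ds in sorted(supplements):
        direct_targets = neighbours.get(ds, set())
        for gene in sorted(direct_targets & genes):
            for drug in sorted(neighbours.get(gene, set()) & drugs):
                # Keep only candidate interactions not already stated directly.
                if drug not in direct_targets:
                    paths.append((ds, gene, drug))
    return paths


# Hypothetical toy triples (illustrative only, not taken from SuppKG).
edges = [
    ("supplement_1", "INHIBITS", "GENE_A"),
    ("GENE_A", "INTERACTS_WITH", "drug_1"),
    ("supplement_1", "STIMULATES", "GENE_B"),
    ("GENE_B", "INTERACTS_WITH", "drug_2"),
    ("supplement_1", "INTERACTS_WITH", "drug_2"),  # already known directly
]

candidates = ds_gene_drug_paths(
    edges,
    supplements={"supplement_1"},
    genes={"GENE_A", "GENE_B"},
    drugs={"drug_1", "drug_2"},
)
print(candidates)
```

In this toy graph only the path through `GENE_A` is returned, because the pair reached through `GENE_B` already has a direct edge. Candidates found this way would still need the kind of expert review for mechanistic plausibility that the abstract describes.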
Affiliation(s)
- Dalton Schutte
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
| | - Jake Vasilakes
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA; National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom
| | - Anu Bompelli
- Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
| | - Yuqi Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
| | - Marcelo Fiszman
- NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Brazil
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Halil Kilicoglu
- School of Information Sciences, University of Illinois, Champaign, IL, USA
| | - Jeffrey R Bishop
- Department of Experimental and Clinical Pharmacy, University of Minnesota, Minneapolis, MN, USA
| | - Terrence Adam
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
| | - Rui Zhang
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA.
|
29
|
Extracting and Measuring Uncertain Biomedical Knowledge from Scientific Statements. Journal of Data and Information Science 2022. [DOI: 10.2478/jdis-2022-0008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022] Open
Abstract
Purpose
Given the information overload of scientific literature, there is an increasing need for computable biomedical knowledge buried in free text. This study aimed to develop a novel approach to extracting and measuring uncertain biomedical knowledge from scientific statements.
Design/methodology/approach
Taking cardiovascular research publications in China as a sample, we extracted subject–predicate–object triples (SPO triples) as knowledge units and unknown/hedging/conflicting uncertainties as the knowledge context. We introduced information entropy (IE) as a potential metric to quantify the uncertainty of the epistemic status of scientific knowledge represented at the level of subject-object pairs (SO pairs).
Findings
The results indicated an extraordinary growth of cardiovascular publications in China but only a modest growth of novel SPO triples. After evaluating the uncertainty of biomedical knowledge with IE, we identified the top 10 SO pairs with the highest IE, which implied pluralism in their epistemic status. Visual presentation of the SO pairs overlaid with uncertainty provided a comprehensive overview of clusters of biomedical knowledge and contending topics in cardiovascular research.
Research limitations
The current methods did not distinguish the specificity and probabilities of uncertainty cue words. The number of sentences surrounding a given triple may also influence the value of IE.
Practical implications
Our approach identified major uncertain knowledge areas such as diagnostic biomarkers, genetic polymorphism and co-existing risk factors related to cardiovascular diseases in China. These areas are suggested to be prioritized; new hypotheses need to be verified, while disputes, conflicts, and contradictions need to be settled.
Originality/value
We provided a novel approach by combining natural language processing and computational linguistics with informetric methods to extract and measure uncertain knowledge from scientific statements.
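To make the entropy idea in this abstract concrete, the sketch below computes Shannon entropy over the epistemic-status labels attached to one SO pair. It is a minimal illustration, not the authors' formulation: the abstract does not give the exact definition, so the base-2 logarithm, the label set, and the function name are assumptions.

```python
import math
from collections import Counter


def information_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one SO pair.

    `labels` holds one epistemic-status label per statement mentioning the pair.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical label sets for three SO pairs (illustrative only).
settled = ["certain", "certain", "certain", "certain"]
split = ["certain", "certain", "hedging", "hedging"]
contested = ["certain", "unknown", "hedging", "conflicting"]

for name, labels in [("settled", settled), ("split", split), ("contested", contested)]:
    print(name, information_entropy(labels))
```

Entropy is zero when every statement agrees on the status of a pair and grows as the statements spread across more statuses, which matches the abstract's reading of high-IE pairs as contested knowledge. As the stated limitations note, the number of sentences per pair can also affect the value.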
|
30
|
Lardos A, Aghaebrahimian A, Koroleva A, Sidorova J, Wolfram E, Anisimova M, Gil M. Computational Literature-based Discovery for Natural Products Research: Current State and Future Prospects. Frontiers in Bioinformatics 2022; 2:827207. [PMID: 36304281 PMCID: PMC9580913 DOI: 10.3389/fbinf.2022.827207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Accepted: 02/28/2022] [Indexed: 11/21/2022] Open
Abstract
Literature-based discovery (LBD) mines existing literature in order to generate new hypotheses by finding links between previously disconnected pieces of knowledge. Although automated LBD systems are becoming widespread and indispensable in a wide variety of knowledge domains, little has been done to introduce LBD to the field of natural products research. Despite growing knowledge in the natural product domain, most of the accumulated information is found in detached data pools. LBD can facilitate better contextualization and exploitation of this wealth of data, for example by formulating new hypotheses for natural product research, especially in the context of drug discovery and development. Moreover, automated LBD systems promise to accelerate the currently tedious and expensive process of lead identification, optimization, and development. Focusing on natural product research, we briefly reflect on the development of automated LBD and summarize its methods and principal data sources. In a thorough review of published use cases of LBD in the biomedical domain, we highlight the immense potential of this data mining approach for natural product research, especially in the context of drug discovery or repurposing, mode of action, and drug or substance interactions. Most of the 91 natural product-related discoveries in our sample of reported use cases of LBD were addressed to a computer science audience. Therefore, it is the wider goal of this review to introduce automated LBD to researchers who work with natural products and to facilitate the dialogue between this community and the developers of automated LBD systems.
Affiliation(s)
- Andreas Lardos
- Natural Product Chemistry and Phytopharmacy Research Group, Institute of Chemistry and Biotechnology, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- *Correspondence: Andreas Lardos,
| | - Ahmad Aghaebrahimian
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Anna Koroleva
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Julia Sidorova
- Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, Madrid, Spain
| | - Evelyn Wolfram
- Natural Product Chemistry and Phytopharmacy Research Group, Institute of Chemistry and Biotechnology, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
| | - Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Manuel Gil
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| |
Collapse
|
31
|
A Knowledge Graph Completion Method Applied to Literature-Based Discovery for Predicting Missing Links Targeting Cancer Drug Repurposing. Artif Intell Med 2022. [DOI: 10.1007/978-3-031-09342-5_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
|
32
|
Balasubramanian V, Vivekanandhan S, Mahadevan V. Pandemic tele-smart: a contactless tele-health system for efficient monitoring of remotely located COVID-19 quarantine wards in India using near-field communication and natural language processing system. Med Biol Eng Comput 2021; 60:61-79. [PMID: 34705163 PMCID: PMC8548353 DOI: 10.1007/s11517-021-02456-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/28/2020] [Accepted: 10/07/2021] [Indexed: 11/28/2022]
Abstract
Efficient remote monitoring of the patient infected with coronavirus without spread to healthcare workers is the need of the hour. An effectual and faster communication system must be established wherein the healthcare workers at the remote quarantine ward can communicate with healthcare professionals present in specialty hospitals. Incidentally, there is a need to establish a contactless smart cloud-based connection between a specialty hospital and quarantine wards during pandemic situation. This paper proposes an initial contactless web-based tele-health clinical decision support system that integrates near-field communication (NFC) tags and a smart cloud-based structuring tool that enables the quick diagnosis of patients with COVID-19 symptoms and monitors the remotely located quarantine wards during the recent pandemic. The proposed framework consists of three-stages: (i) contactless health parameter extraction from the patient using an NFC tag; (ii) converting medical report into digital text using optical character recognition algorithm and extracting values of relevant medical-parameters using natural language processing; and (iii) smart visualization of key medical parameters. The accuracy of the proposed system from NFC reader until analysis using a novel structuring algorithm deployed in the cloud is more than 94%. Several capabilities of the proposed web-based system were compared with similar systems and tested in an authentic mock clinical setup, and the physicians found that the system is reliable and user friendly.
Affiliation(s)
- Vishal Balasubramanian
  - Department of Electronics & Communication Engineering, Rajalakshmi Engineering College, Chennai, 602105, India
- Sapthagirivasan Vivekanandhan
  - Department of Biomedical Engineering, Rajalakshmi Engineering College, Chennai, 602105, India
  - Medical Devices and Healthcare Technologies Department, Engineering R&D Division, IT Service Company, Bengaluru, 560066, India
|
33
|
Mante J, Hao Y, Jett J, Joshi U, Keating K, Lu X, Nakum G, Rodriguez NE, Tang J, Terry L, Wu X, Yu E, Downie JS, McInnes BT, Nguyen MH, Sepulvado B, Young EM, Myers CJ. Synthetic Biology Knowledge System. ACS Synth Biol 2021; 10:2276-2285. [PMID: 34387462 DOI: 10.1021/acssynbio.1c00188] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 02/07/2023]
Abstract
The Synthetic Biology Knowledge System (SBKS) is an instance of the SynBioHub repository that includes text and data information that has been mined from papers published in ACS Synthetic Biology. This paper describes the SBKS curation framework that is being developed to construct the knowledge stored in this repository. The text mining pipeline performs automatic annotation of the articles using natural language processing techniques to identify salient content such as key terms, relationships between terms, and main topics. The data mining pipeline performs automatic annotation of the sequences extracted from the supplemental documents with the genetic parts used in them. Together these two pipelines link genetic parts to papers describing the context in which they are used. Ultimately, SBKS will reduce the time necessary for synthetic biologists to find the information necessary to complete their designs.
Affiliation(s)
- Jeanet Mante
  - University of Colorado Boulder, Boulder, Colorado 80309, United States
- Yikai Hao
  - University of California San Diego, La Jolla, California 92093, United States
- Jacob Jett
  - University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Udayan Joshi
  - University of California San Diego, La Jolla, California 92093, United States
- Kevin Keating
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Xiang Lu
  - University of California San Diego, La Jolla, California 92093, United States
- Gaurav Nakum
  - University of California San Diego, La Jolla, California 92093, United States
- Jiawei Tang
  - University of California San Diego, La Jolla, California 92093, United States
- Logan Terry
  - University of Utah, Salt Lake City, Utah 84112, United States
- Xuanyu Wu
  - University of California San Diego, La Jolla, California 92093, United States
- Eric Yu
  - University of Utah, Salt Lake City, Utah 84112, United States
- J. Stephen Downie
  - University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Bridget T. McInnes
  - Virginia Commonwealth University, Richmond, Virginia 23284, United States
- Mai H. Nguyen
  - University of California San Diego, La Jolla, California 92093, United States
- Brandon Sepulvado
  - NORC at the University of Chicago Bethesda, Chicago, Illinois 60637, United States
- Eric M. Young
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Chris J. Myers
  - University of Colorado Boulder, Boulder, Colorado 80309, United States
|
34
|
Henry S, Wijesinghe DS, Myers A, McInnes BT. Using Literature Based Discovery to Gain Insights Into the Metabolomic Processes of Cardiac Arrest. Front Res Metr Anal 2021; 6:644728. [PMID: 34250435 PMCID: PMC8267364 DOI: 10.3389/frma.2021.644728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2020] [Accepted: 05/07/2021] [Indexed: 12/19/2022] Open
Abstract
In this paper, we describe how we applied LBD techniques to discover lecithin cholesterol acyltransferase (LCAT) as a druggable target for cardiac arrest. We fully describe our process which includes the use of high-throughput metabolomic analysis to identify metabolites significantly related to cardiac arrest, and how we used LBD to gain insights into how these metabolites relate to cardiac arrest. These insights lead to our proposal (for the first time) of LCAT as a druggable target; the effects of which are supported by in vivo studies which were brought forth by this work. Metabolites are the end product of many biochemical pathways within the human body. Observed changes in metabolite levels are indicative of changes in these pathways, and provide valuable insights toward the cause, progression, and treatment of diseases. Following cardiac arrest, we observed changes in metabolite levels pre- and post-resuscitation. We used LBD to help discover diseases implicitly linked via these metabolites of interest. Results of LBD indicated a strong link between Fish Eye disease and cardiac arrest. Since fish eye disease is characterized by an LCAT deficiency, it began an investigation into the effects of LCAT and cardiac arrest survival. In the investigation, we found that decreased LCAT activity may increase cardiac arrest survival rates by increasing ω-3 polyunsaturated fatty acid availability in circulation. We verified the effects of ω-3 polyunsaturated fatty acids on increasing survival rate following cardiac arrest via in vivo with rat models.
Affiliation(s)
- Sam Henry
  - Department of Physics, Computer Science and Engineering, Christopher Newport University, Newport News, VA, United States
- D. Shanaka Wijesinghe
  - Department of Pharmacotherapy and Outcomes Science, Virginia Commonwealth University, Richmond, VA, United States
- Aidan Myers
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
|
35
|
Malec SA, Wei P, Bernstam EV, Boyce RD, Cohen T. Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance. J Biomed Inform 2021; 117:103719. [PMID: 33716168 PMCID: PMC8559730 DOI: 10.1016/j.jbi.2021.103719] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/15/2020] [Revised: 12/31/2020] [Accepted: 01/04/2021] [Indexed: 10/21/2022]
Abstract
INTRODUCTION: Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders affecting both the exposure and outcome of interest. However, causal knowledge of dynamic biological systems is complex and challenging. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data.
METHODS: We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. These methods search SemMedDB for confounders by applying semantic constraint search for indications treated by the drug (exposure) and that are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety containing labeled pairwise relationships between drugs and adverse events and attempt to rediscover these relationships from a corpus of 2.2 M NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects by informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results from applying these procedures with naive measures of association (χ² and reporting odds ratio) and with each other.
RESULTS AND CONCLUSIONS: We found semantic vector-based search to be superior to string-based search at reducing confounding bias. However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, where confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic covariates.
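The naive measures of association named in the abstract above, the chi-square statistic and the reporting odds ratio, can both be computed from a 2×2 table of dichotomous exposure and outcome counts. The sketch below uses the standard textbook formulas (chi-square without continuity correction). The class name `Contingency`, the function names, and the counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    a: int  # exposed to the drug, adverse event observed
    b: int  # exposed to the drug, no adverse event
    c: int  # not exposed, adverse event observed
    d: int  # not exposed, no adverse event


def reporting_odds_ratio(t: Contingency) -> float:
    """Odds of the event among exposed divided by odds among unexposed."""
    if t.b == 0 or t.c == 0:
        raise ValueError("reporting odds ratio is undefined when b or c is zero")
    return (t.a * t.d) / (t.b * t.c)


def chi_square(t: Contingency) -> float:
    """Pearson chi-square statistic for a 2x2 table, no continuity correction."""
    n = t.a + t.b + t.c + t.d
    denominator = (t.a + t.b) * (t.c + t.d) * (t.a + t.c) * (t.b + t.d)
    if denominator == 0:
        raise ValueError("chi-square is undefined when a row or column total is zero")
    return n * (t.a * t.d - t.b * t.c) ** 2 / denominator


# Hypothetical counts, for illustration only.
table = Contingency(a=20, b=80, c=10, d=90)
ror = reporting_odds_ratio(table)
chi2 = chi_square(table)
print(f"ROR = {ror:.2f}, chi-square = {chi2:.2f}")
```

Neither measure adjusts for confounders, which is why the abstract treats them as a baseline for the adjusted models.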
Affiliation(s)
- Scott A Malec
  - University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Peng Wei
  - The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, TX, United States
- Elmer V Bernstam
  - University of Texas Health Science Center at Houston, School of Biomedical Informatics, Houston, TX, United States
- Richard D Boyce
  - University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Trevor Cohen
  - University of Washington, Department of Biomedical Informatics and Medical Education, Seattle, WA, United States
|
36
|
Zhang R, Hristovski D, Schutte D, Kastrin A, Fiszman M, Kilicoglu H. Drug repurposing for COVID-19 via knowledge graph completion. J Biomed Inform 2021; 115:103696. [PMID: 33571675 PMCID: PMC7869625 DOI: 10.1016/j.jbi.2021.103696] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 10/19/2020] [Revised: 12/23/2020] [Accepted: 02/01/2021] [Indexed: 02/07/2023]
Abstract
OBJECTIVE: To discover candidate drugs to repurpose for COVID-19 using literature-derived knowledge and knowledge graph completion methods.
METHODS: We propose a novel, integrative, and neural network-based literature-based discovery (LBD) approach to identify drug candidates from PubMed and other COVID-19-focused research literature. Our approach relies on semantic triples extracted using SemRep (via SemMedDB). We identified an informative and accurate subset of semantic triples using filtering rules and an accuracy classifier developed on a BERT variant. We used this subset to construct a knowledge graph, and applied five state-of-the-art, neural knowledge graph completion algorithms (i.e., TransE, RotatE, DistMult, ComplEx, and STELP) to predict drug repurposing candidates. The models were trained and assessed using a time slicing approach and the predicted drugs were compared with a list of drugs reported in the literature and evaluated in clinical trials. These models were complemented by a discovery pattern-based approach.
RESULTS: Accuracy classifier based on PubMedBERT achieved the best performance (F1 = 0.854) in identifying accurate semantic predications. Among five knowledge graph completion models, TransE outperformed others (MR = 0.923, Hits@1 = 0.417). Some known drugs linked to COVID-19 in the literature were identified, as well as others that have not yet been studied. Discovery patterns enabled identification of additional candidate drugs and generation of plausible hypotheses regarding the links between the candidate drugs and COVID-19. Among them, five highly ranked and novel drugs (i.e., paclitaxel, SB 203580, alpha 2-antiplasmin, metoclopramide, and oxymatrine) and the mechanistic explanations for their potential use are further discussed.
CONCLUSION: We showed that a LBD approach can be feasible not only for discovering drug candidates for COVID-19, but also for generating mechanistic explanations. Our approach can be generalized to other diseases as well as to other clinical questions. Source code and data are available at https://github.com/kilicogluh/lbd-covid.
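TransE, one of the knowledge graph completion models named in the abstract above, represents a relation as a translation in embedding space: a triple (head, relation, tail) is plausible when head + relation lies close to tail. The sketch below scores candidate heads for a fixed relation and tail using that idea. It is not the authors' code (which is at the repository linked above). The two-dimensional vectors, the names `drug_a` to `drug_c`, and the helper `rank_heads` are hypothetical, and the embeddings are hand-picked rather than trained.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative Euclidean distance between head + relation
    and tail. Scores closer to zero mean a more plausible triple."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_heads(candidates, relation, tail):
    """Rank candidate head entities for the query (?, relation, tail), best first."""
    scored = [(name, transe_score(vec, relation, tail)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hand-picked toy embeddings (assumption: real embeddings are learned from the graph).
TOY_DRUGS = {
    "drug_a": (0.9, 0.1),
    "drug_b": (0.2, 0.8),
    "drug_c": (0.5, 0.5),
}
TREATS = (0.1, 0.4)     # relation vector
DISEASE = (1.0, 0.5)    # tail entity vector

ranking = rank_heads(TOY_DRUGS, TREATS, DISEASE)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Metrics such as Hits@1 are then computed from where the known correct entity falls in rankings like this one.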
Affiliation(s)
- Rui Zhang
  - Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA
- Dimitar Hristovski
  - Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Dalton Schutte
  - Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA
- Andrej Kastrin
  - Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Marcelo Fiszman
  - NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Brazil
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
|
37
|
Towards medical knowmetrics: representing and computing medical knowledge using semantic predications as the knowledge unit and the uncertainty as the knowledge context. Scientometrics 2021; 126:6225-6251. [PMID: 33612884 PMCID: PMC7882417 DOI: 10.1007/s11192-021-03880-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/05/2020] [Accepted: 01/19/2021] [Indexed: 11/05/2022]
Abstract
In China, Prof. Hongzhou Zhao and Zeyuan Liu are the pioneers of the concept “knowledge unit” and “knowmetrics” for measuring knowledge. However, the definition on “computable knowledge object” remains controversial so far in different fields. For example, it is defined as (1) quantitative scientific concept in natural science and engineering, (2) knowledge point in the field of education research, and (3) semantic predications, i.e., Subject-Predicate-Object (SPO) triples in biomedical fields. The Semantic MEDLINE Database (SemMedDB), a high-quality public repository of SPO triples extracted from medical literature, provides a basic data infrastructure for measuring medical knowledge. In general, the study of extracting SPO triples as computable knowledge unit from unstructured scientific text has been overwhelmingly focusing on scientific knowledge per se. Since the SPO triples would be possibly extracted from hypothetical, speculative statements or even conflicting and contradictory assertions, the knowledge status (i.e., the uncertainty), which serves as an integral and critical part of scientific knowledge has been largely overlooked. This article aims to put forward a framework for Medical Knowmetrics using the SPO triples as the knowledge unit and the uncertainty as the knowledge context. The lung cancer publications dataset is used to validate the proposed framework. The uncertainty of medical knowledge and how its status evolves over time indirectly reflect the strength of competing knowledge claims, and the probability of certainty for a given SPO triple. We try to discuss the new insights using the uncertainty-centric approaches to detect research fronts, and identify knowledge claims with high certainty level, in order to improve the efficacy of knowledge-driven decision support.
|
38
|
Biziukova N, Tarasova O, Ivanov S, Poroikov V. Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies. Front Genet 2021; 11:618862. [PMID: 33414815 PMCID: PMC7783389 DOI: 10.3389/fgene.2020.618862] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/18/2020] [Accepted: 11/26/2020] [Indexed: 12/16/2022] Open
Abstract
Text analysis can help to identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for the analysis of molecular mechanisms of disease progression and development of new strategies for the treatment of various diseases and pathological conditions. The texts of publications represent a primary source of information, which is especially important to collect the data of the highest quality due to the immediate obtaining information, in comparison with databases. In our study, we aimed at the development and testing of an approach to the named entity recognition in the abstracts of publications. More specifically, we have developed and tested an algorithm based on the conditional random fields, which provides recognition of NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest leads to the possibility of extracting the NEs strongly associated with the subject. To test the applicability of our approach, we have applied it for the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments performed provide the estimations of evaluating the accuracy of recognition of chemical NEs and proteins (genes). The precision of the chemical NEs recognition is over 0.91; recall is 0.86, and the F1-score (harmonic mean of precision and recall) is 0.89; the precision of recognition of proteins and genes names is over 0.86; recall is 0.83; while F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms our suggestion about the possibility of extracting the NEs strongly relevant to (i) HIV inhibitors and (ii) a group of patients i.e., the group of HIV-positive individuals with an ability to maintain an undetectable HIV-1 viral load overtime in the absence of antiretroviral therapy. 
Analysis of the results obtained provides insights into the function of proteins that can be responsible for viremic control. Our study demonstrated the applicability of the developed approach for the extraction of useful data on HIV treatment.
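The abstract above reports precision, recall, and the F1-score (the harmonic mean of precision and recall) for named entity recognition. As a small worked example of how the three values relate, the sketch below derives them from counts of true positives, false positives, and false negatives. The counts are hypothetical, chosen only so that the results land near the chemical-entity figures quoted in the abstract. They are not the study's data.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) for an entity recognition run."""
    if true_pos == 0:
        # With no true positives every metric is zero (or undefined); report zero.
        return 0.0, 0.0, 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 86 entities found correctly, 8 spurious, 14 missed.
precision, recall, f1 = precision_recall_f1(true_pos=86, false_pos=8, false_neg=14)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, F1 always lies between precision and recall and sits closer to the lower one.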
Affiliation(s)
- Nadezhda Biziukova
  - Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Olga Tarasova
  - Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Sergey Ivanov
  - Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
  - Department of Bioinformatics, Faculty of Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia
- Vladimir Poroikov
  - Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
|
39
|
Comparing NLP Systems to Extract Entities of Eligibility Criteria in Dietary Supplements Clinical Trials Using NLP-ADAPT. Artif Intell Med 2020. [DOI: 10.1007/978-3-030-59137-3_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|