1
Tyagin I, Safro I. Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique. BMC Bioinformatics 2024; 25:213. [PMID: 38872097 PMCID: PMC11177514 DOI: 10.1186/s12859-024-05812-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 05/16/2024] [Indexed: 06/15/2024] Open
Abstract
BACKGROUND: Automated hypothesis generation (HG) focuses on uncovering hidden connections within the extensive information that is publicly available. This domain has become increasingly popular, thanks to modern machine learning algorithms. However, the automated evaluation of HG systems is still an open problem, especially on a larger scale.
RESULTS: This paper presents Dyport, a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact in biomedical research, which significantly extends traditional link prediction benchmarks. The applicability of our benchmarking process is demonstrated on several link prediction systems applied to biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community.
CONCLUSIONS: Dyport is an open-source benchmarking framework designed for evaluating biomedical hypothesis generation systems, which takes into account knowledge dynamics, semantics and impact. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport.
Affiliation(s)
- Ilya Tyagin
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19713, USA.
- Ilya Safro
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19716, USA.
2
Zhou H, Jiang H, Wang L, Yao W, Lin Y. Temporal attention networks for biomedical hypothesis generation. J Biomed Inform 2024; 151:104607. [PMID: 38360080 DOI: 10.1016/j.jbi.2024.104607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Revised: 02/11/2024] [Accepted: 02/12/2024] [Indexed: 02/17/2024]
Abstract
OBJECTIVES: Hypothesis Generation (HG) is a task that aims to uncover hidden associations between disjoint scientific terms, which influences innovations in prevention, treatment, and overall public health. Several recent studies use Recurrent Neural Networks (RNNs) to learn evolutional embeddings for HG. However, the complex spatiotemporal dependencies of term-pair relations are difficult to capture with an inherently recurrent structure. This paper aims to accurately model the temporal evolution of term-pair relations using only attention mechanisms, capturing the information that is crucial for inferring future connectivity.
METHODS: This paper proposes Temporal Attention Networks (TAN) to produce powerful spatiotemporal embeddings for Biomedical Hypothesis Generation. Specifically, we formulate the HG problem as a future connectivity prediction task in a temporal attributed graph. Our TAN develops a Temporal Spatial Attention Module (TSAM) to establish temporal dependencies of node-pair (term-pair) embeddings between any two time-steps for smoothing spatiotemporal node-pair embeddings. Meanwhile, a Temporal Difference Attention Module (TDAM) is proposed to sharpen temporal differences of spatiotemporal embeddings for highlighting the historical changes of node-pair relations. As such, TAN can adaptively calibrate spatiotemporal embeddings by considering both continuity and difference of node-pair embeddings.
RESULTS: Three real-world biomedical term relationship datasets are constructed from PubMed papers. TAN significantly outperforms the best baseline with 12.03%, 4.59%, and 2.34% Micro-F1 Score improvement in Immunotherapy, Virology and Neurology, respectively. Extensive experiments demonstrate that TAN can model complex spatiotemporal dependencies of term-pairs for explicitly capturing the temporal evolution of relations, significantly outperforming existing state-of-the-art methods.
CONCLUSION: We proposed a novel TAN to learn spatiotemporal embeddings based on pure attention mechanisms for HG. TAN learns the evolution of relationships by modeling both the continuity and difference of temporal term-pair embeddings. The important spatiotemporal dependencies of term-pair relations are extracted based solely on attention mechanisms for generating hypotheses.
Affiliation(s)
- Huiwei Zhou
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Haibin Jiang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Lanlan Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Weihong Yao
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Yingyu Lin
  - School of Foreign Languages, Dalian University of Technology, Dalian 116024, Liaoning, China.
3
Lardos A, Aghaebrahimian A, Koroleva A, Sidorova J, Wolfram E, Anisimova M, Gil M. Computational Literature-based Discovery for Natural Products Research: Current State and Future Prospects. Frontiers in Bioinformatics 2022; 2:827207. [PMID: 36304281 PMCID: PMC9580913 DOI: 10.3389/fbinf.2022.827207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Accepted: 02/28/2022] [Indexed: 11/21/2022] Open
Abstract
Literature-based discovery (LBD) mines existing literature in order to generate new hypotheses by finding links between previously disconnected pieces of knowledge. Although automated LBD systems are becoming widespread and indispensable in a wide variety of knowledge domains, little has been done to introduce LBD to the field of natural products research. Despite growing knowledge in the natural product domain, most of the accumulated information is found in detached data pools. LBD can facilitate better contextualization and exploitation of this wealth of data, for example by formulating new hypotheses for natural product research, especially in the context of drug discovery and development. Moreover, automated LBD systems promise to accelerate the currently tedious and expensive process of lead identification, optimization, and development. Focusing on natural product research, we briefly reflect on the development of automated LBD and summarize its methods and principal data sources. In a thorough review of published use cases of LBD in the biomedical domain, we highlight the immense potential of this data mining approach for natural product research, especially in the context of drug discovery or repurposing, mode of action, as well as drug or substance interactions. Most of the 91 natural product-related discoveries in our sample of reported use cases of LBD were addressed to a computer science audience. Therefore, it is the wider goal of this review to introduce automated LBD to researchers who work with natural products and to facilitate the dialogue between this community and the developers of automated LBD systems.
Affiliation(s)
- Andreas Lardos
  - Natural Product Chemistry and Phytopharmacy Research Group, Institute of Chemistry and Biotechnology, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- Ahmad Aghaebrahimian
  - Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Anna Koroleva
  - Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Julia Sidorova
  - Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, Madrid, Spain
- Evelyn Wolfram
  - Natural Product Chemistry and Phytopharmacy Research Group, Institute of Chemistry and Biotechnology, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
- Maria Anisimova
  - Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Manuel Gil
  - Institute of Applied Simulation, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
4
Optimizations for Computing Relatedness in Biomedical Heterogeneous Information Networks: SemNet 2.0. Big Data and Cognitive Computing 2022; 6. [PMID: 35936510 PMCID: PMC9351549 DOI: 10.3390/bdcc6010027] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/10/2022]
Abstract
Literature-based discovery (LBD) summarizes information and generates insight from large text corpuses. The SemNet framework utilizes a large heterogeneous information network or “knowledge graph” of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j to improve knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is a comprehensive open-source software for significantly faster, more effective, and user-friendly means of automated biomedical LBD. An example case is performed to rank relationships between Alzheimer’s disease and metabolic co-morbidities.
5
Henry S, Wijesinghe DS, Myers A, McInnes BT. Using Literature Based Discovery to Gain Insights Into the Metabolomic Processes of Cardiac Arrest. Front Res Metr Anal 2021; 6:644728. [PMID: 34250435 PMCID: PMC8267364 DOI: 10.3389/frma.2021.644728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2020] [Accepted: 05/07/2021] [Indexed: 12/19/2022] Open
Abstract
In this paper, we describe how we applied LBD techniques to discover lecithin cholesterol acyltransferase (LCAT) as a druggable target for cardiac arrest. We fully describe our process, which includes the use of high-throughput metabolomic analysis to identify metabolites significantly related to cardiac arrest, and how we used LBD to gain insights into how these metabolites relate to cardiac arrest. These insights led to our proposal (for the first time) of LCAT as a druggable target, the effects of which are supported by in vivo studies that were brought forth by this work. Metabolites are the end product of many biochemical pathways within the human body. Observed changes in metabolite levels are indicative of changes in these pathways, and provide valuable insights toward the cause, progression, and treatment of diseases. Following cardiac arrest, we observed changes in metabolite levels pre- and post-resuscitation. We used LBD to help discover diseases implicitly linked via these metabolites of interest. Results of LBD indicated a strong link between fish eye disease and cardiac arrest. Since fish eye disease is characterized by an LCAT deficiency, this prompted an investigation into the effect of LCAT on cardiac arrest survival. In the investigation, we found that decreased LCAT activity may increase cardiac arrest survival rates by increasing ω-3 polyunsaturated fatty acid availability in circulation. We verified the effects of ω-3 polyunsaturated fatty acids on increasing survival rate following cardiac arrest in vivo with rat models.
Affiliation(s)
- Sam Henry
  - Department of Physics, Computer Science and Engineering, Christopher Newport University, Newport News, VA, United States
- D. Shanaka Wijesinghe
  - Department of Pharmacotherapy and Outcomes Science, Virginia Commonwealth University, Richmond, VA, United States
- Aidan Myers
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
6
Mejia C, Kajikawa Y. Exploration of Shared Themes Between Food Security and Internet of Things Research Through Literature-Based Discovery. Front Res Metr Anal 2021; 6:652285. [PMID: 34056514 PMCID: PMC8159171 DOI: 10.3389/frma.2021.652285] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/12/2021] [Accepted: 04/19/2021] [Indexed: 11/28/2022] Open
Abstract
This paper applied a literature-based discovery methodology utilizing citation networks and text mining in order to extract and represent shared terminologies found in disjoint academic literature on food security and the Internet of Things. The topic of food security includes research on improvements in nutrition, sustainable agriculture, and a plurality of other social challenges, while the Internet of Things refers to a collection of technologies from which solutions can be drawn. Academic articles on both topics were classified into subclusters, and their text contents were compared against each other to find shared terms. These terms formed a network from which clusters of related keywords could be identified, potentially easing the exploration of common themes. Thirteen transversal themes, including blockchain, healthcare, and air quality, were found. This method can be applied by policymakers and other stakeholders to understand how a given technology could contribute to solving a pressing social issue.
Affiliation(s)
- Cristian Mejia
  - Graduate School of Environment and Society, Tokyo Institute of Technology, Tokyo, Japan
- Yuya Kajikawa
  - Graduate School of Environment and Society, Tokyo Institute of Technology, Tokyo, Japan
  - Institute for Future Initiatives, The University of Tokyo, Tokyo, Japan
7
Zhang R, Hristovski D, Schutte D, Kastrin A, Fiszman M, Kilicoglu H. Drug repurposing for COVID-19 via knowledge graph completion. J Biomed Inform 2021; 115:103696. [PMID: 33571675 PMCID: PMC7869625 DOI: 10.1016/j.jbi.2021.103696] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 10/19/2020] [Revised: 12/23/2020] [Accepted: 02/01/2021] [Indexed: 02/07/2023]
Abstract
OBJECTIVE: To discover candidate drugs to repurpose for COVID-19 using literature-derived knowledge and knowledge graph completion methods.
METHODS: We propose a novel, integrative, and neural network-based literature-based discovery (LBD) approach to identify drug candidates from PubMed and other COVID-19-focused research literature. Our approach relies on semantic triples extracted using SemRep (via SemMedDB). We identified an informative and accurate subset of semantic triples using filtering rules and an accuracy classifier developed on a BERT variant. We used this subset to construct a knowledge graph, and applied five state-of-the-art, neural knowledge graph completion algorithms (i.e., TransE, RotatE, DistMult, ComplEx, and STELP) to predict drug repurposing candidates. The models were trained and assessed using a time slicing approach and the predicted drugs were compared with a list of drugs reported in the literature and evaluated in clinical trials. These models were complemented by a discovery pattern-based approach.
RESULTS: The accuracy classifier based on PubMedBERT achieved the best performance (F1 = 0.854) in identifying accurate semantic predications. Among five knowledge graph completion models, TransE outperformed others (MR = 0.923, Hits@1 = 0.417). Some known drugs linked to COVID-19 in the literature were identified, as well as others that have not yet been studied. Discovery patterns enabled identification of additional candidate drugs and generation of plausible hypotheses regarding the links between the candidate drugs and COVID-19. Among them, five highly ranked and novel drugs (i.e., paclitaxel, SB 203580, alpha 2-antiplasmin, metoclopramide, and oxymatrine) and the mechanistic explanations for their potential use are further discussed.
CONCLUSION: We showed that an LBD approach can be feasible not only for discovering drug candidates for COVID-19, but also for generating mechanistic explanations. Our approach can be generalized to other diseases as well as to other clinical questions. Source code and data are available at https://github.com/kilicogluh/lbd-covid.
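The abstract above names TransE and reports rank-based metrics (mean rank and Hits@1). As a rough illustration only, the sketch below shows how a TransE-style score and these two metrics are commonly computed. It is not the authors' code: the function names, the L1 distance, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L1 distance between head + relation and tail."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def rank_of_true_tail(head, relation, true_tail, candidate_tails):
    """1-based rank of the true tail among the candidates (higher score ranks first)."""
    true_score = transe_score(head, relation, true_tail)
    better = sum(
        1 for c in candidate_tails if transe_score(head, relation, c) > true_score
    )
    return better + 1


def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks):
    """Average rank of the true entity over all test triples (lower is better)."""
    return sum(ranks) / len(ranks)


# Toy embeddings (hypothetical values, two dimensions for readability).
head = [0.0, 0.0]
relation = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
rank = rank_of_true_tail(head, relation, [1.0, 0.5], candidates)
```

With these toy vectors the true tail is outscored by exactly one candidate, so it receives rank 2. Real systems learn the embeddings and usually filter out other known true triples before ranking, which this sketch omits.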
Affiliation(s)
- Rui Zhang
  - Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA.
- Dimitar Hristovski
  - Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Dalton Schutte
  - Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA
- Andrej Kastrin
  - Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Marcelo Fiszman
  - NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Brazil
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
8
9
Choudhury N, Faisal F, Khushi M. Mining Temporal Evolution of Knowledge Graphs and Genealogical Features for Literature-based Discovery Prediction. J Informetr 2020. [DOI: 10.1016/j.joi.2020.101057] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023]
10
Malec SA, Boyce RD. Exploring Novel Computable Knowledge in Structured Drug Product Labels. AMIA Joint Summits on Translational Science Proceedings 2020; 2020:403-412. [PMID: 32477661 PMCID: PMC7233092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
This paper introduces a database derived from Structured Product Labels (SPLs). SPLs are legally mandated snapshots containing information on all drugs released to market in the United States. Since publication is not required for pre-trial findings, we hypothesize that SPLs may contain knowledge absent in the literature, and hence "novel." SemMedDB is an existing database of computable knowledge derived from the literature. If SPL content could be similarly transformed, novel clinically relevant assertions in the SPLs could be identified through comparison with SemMedDB. After we derive a database (containing 4,297,481 assertions), we compare the extracted content with SemMedDB for recent FDA drug approvals. We find that novelty between the SPLs and the literature is nuanced, due to the redundancy of SPLs. Highlighting areas for improvement and future work, we conclude that SPLs contain a wealth of novel knowledge relevant to research and complementary to the literature.
Affiliation(s)
- Scott A Malec
  - University of Pittsburgh Department of Biomedical Informatics, Pittsburgh, PA
- Richard D Boyce
  - University of Pittsburgh Department of Biomedical Informatics, Pittsburgh, PA
11
Crichton G, Baker S, Guo Y, Korhonen A. Neural networks for open and closed Literature-based Discovery. PLoS One 2020; 15:e0232891. [PMID: 32413059 PMCID: PMC7228051 DOI: 10.1371/journal.pone.0232891] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 11/02/2019] [Accepted: 04/23/2020] [Indexed: 12/18/2022] Open
Abstract
Literature-based Discovery (LBD) aims to discover new knowledge automatically from large collections of literature. Scientific literature is growing at an exponential rate, making it difficult for researchers to stay current in their discipline and easy to miss knowledge necessary to advance their research. LBD can facilitate hypothesis testing and generation and thus accelerate scientific progress. Neural networks have demonstrated improved performance on LBD-related tasks but have yet to be applied to LBD itself. We propose four graph-based, neural network methods to perform open and closed LBD. We compared our methods with those used by the state-of-the-art LION LBD system on the same evaluations to replicate recently published findings in cancer biology. We also applied them to a time-sliced dataset of human-curated peer-reviewed biological interactions. These evaluations and the metrics they employ represent performance on real-world knowledge advances and are thus robust indicators of approach efficacy. In the first experiments, our best methods performed 2-4 times better than the baselines in closed discovery and 2-3 times better in open discovery. In the second, our best methods performed almost 2 times better than the baselines in open discovery. These results are strong indications that neural LBD is potentially a very effective approach for generating new scientific discoveries from existing literature. The code for our models and other information can be found at: https://github.com/cambridgeltl/nn_for_LBD.
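Several entries in this list, including the one above, evaluate on time-sliced data. The sketch below shows one common way such a split is set up: term pairs first seen up to a cutoff year form the training graph, and pairs that appear only afterwards become the discoveries to be predicted. The function name, the tuple layout, and the placeholder terms and years are assumptions for illustration and are not taken from any of the listed papers.

```python
def time_slice(edges, cutoff_year):
    """Split dated term-pair edges into a training set and a held-out future set.

    edges: iterable of (term_a, term_b, year) tuples.
    Returns (train_pairs, test_pairs), each a set of unordered pairs.
    """
    train = set()
    later = set()
    for term_a, term_b, year in edges:
        pair = frozenset((term_a, term_b))
        if year <= cutoff_year:
            train.add(pair)
        else:
            later.add(pair)
    # Only pairs that are genuinely new after the cutoff count as discoveries.
    test = later - train
    return train, test


# Placeholder data: A-B and B-C are known before the cutoff, A-C appears later.
toy_edges = [
    ("A", "B", 1980),
    ("B", "C", 1982),
    ("A", "C", 1989),
    ("A", "B", 1990),  # repeat of an already known pair, not a discovery
]
train_pairs, test_pairs = time_slice(toy_edges, cutoff_year=1985)
```

Here the only held-out pair is A-C, which is the classic open-discovery pattern of linking A to C through an intermediate B.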
Affiliation(s)
- Gamal Crichton
  - Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
- Simon Baker
  - Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
- Yufan Guo
  - Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
- Anna Korhonen
  - Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
12
Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573 PMCID: PMC7222583 DOI: 10.1186/s12859-020-3517-7] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/18/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships.
RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level.
CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
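The F1 scores quoted in the SemRep abstract above follow from the reported precision and recall through the standard harmonic-mean formula. The short check below recomputes them; the function name and the labels are chosen here for illustration, and only the three evaluations for which the abstract states both precision and recall are included.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (label, precision, recall) as stated in the abstract.
reported = [
    ("strict", 0.55, 0.34),
    ("relaxed", 0.69, 0.42),
    ("CDR corpus", 0.90, 0.24),
]

for label, precision, recall in reported:
    print(label, round(f1_score(precision, recall), 2))
# prints:
# strict 0.42
# relaxed 0.52
# CDR corpus 0.38
```

The recomputed values match the 0.42, 0.52, and 0.38 reported in the abstract.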
Affiliation(s)
- Halil Kilicoglu
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
  - University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820 IL USA
- Graciela Rosemblat
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- Dongwook Shin
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
13
Guo ZH, You ZH, Huang DS, Yi HC, Zheng K, Chen ZH, Wang YB. MeSHHeading2vec: a new method for representing MeSH headings as vectors based on graph embedding algorithm. Brief Bioinform 2020; 22:2085-2095. [PMID: 32232320 PMCID: PMC7986599 DOI: 10.1093/bib/bbaa037] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 12/05/2019] [Revised: 02/13/2020] [Indexed: 02/03/2023] Open
Abstract
Effectively representing Medical Subject Headings (MeSH) headings (terms) such as disease and drug as discriminative vectors could greatly improve the performance of downstream computational prediction models. However, these terms are often abstract and difficult to quantify. In this paper, we converted the MeSH tree structure into a relationship network and applied several graph embedding algorithms on it to represent these terms. Specifically, the relationship network consists of nodes (MeSH headings) and edges (relationships) and can be constructed from the tree numbers. Then, five graph embedding algorithms including DeepWalk, LINE, SDNE, LAP and HOPE were implemented on the relationship network to represent MeSH headings as vectors. In order to evaluate the performance of the proposed methods, we carried out the node classification and relationship prediction tasks. The results show that the MeSH headings characterized by graph embedding algorithms can not only be treated as an independent carrier for representation, but can also be utilized as additional information to enhance the representation ability of vectors. Thus, they can serve as an input and continue to play a significant role in any computational models related to disease, drug, microbe, etc. In addition, we hope our method will inspire researchers to study the representation of terms from this network perspective.
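The abstract above says the relationship network is built from MeSH tree numbers. MeSH tree numbers are dot-separated paths, so one plausible construction, sketched below, links each tree number to the number obtained by dropping its last segment. This is an assumption about the general idea, not the authors' procedure: the function name and the sample tree numbers are illustrative, and mapping tree numbers back to headings (a heading can carry several tree numbers) is left out.

```python
def tree_edges(tree_numbers):
    """Return (parent, child) edges implied by dot-separated tree numbers.

    An edge is added only when the parent number is itself in the input.
    """
    known = set(tree_numbers)
    edges = set()
    for number in known:
        if "." in number:
            parent = number.rsplit(".", 1)[0]  # drop the last segment
            if parent in known:
                edges.add((parent, number))
    return edges


# Illustrative tree numbers forming a small hierarchy.
sample = ["C04", "C04.557", "C04.557.337", "C04.588"]
edges = tree_edges(sample)
```

The resulting edge set can be passed to any graph library as input for embedding methods such as those named in the abstract.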
Affiliation(s)
- Zhu-Hong You
  - Corresponding author: Zhu-Hong You, The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China. Tel: +86-991-367-2967; E-mail:
14
Pesaranghader A, Matwin S, Sokolova M, Pesaranghader A. deepBioWSD: effective deep neural word sense disambiguation of biomedical text data. J Am Med Inform Assoc 2020; 26:438-446. [PMID: 30811548 DOI: 10.1093/jamia/ocy189] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 09/21/2018] [Revised: 12/03/2018] [Accepted: 12/19/2018] [Indexed: 01/05/2023] Open
Abstract
OBJECTIVE: In biomedicine, there is a wealth of information hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm prevents downstream difficulties in the natural language processing applications pipeline. Supervised WSD algorithms largely outperform un- or semisupervised and knowledge-based methods; however, they train 1 separate classifier for each ambiguous term, necessitating a large number of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with the vocabulary size is desirable.
MATERIALS AND METHODS: Built on recent advances in deep learning, our deepBioWSD model leverages 1 single bidirectional long short-term memory network that makes sense predictions for any ambiguous term. In the model, first, the Unified Medical Language System sense embeddings will be computed using their text definitions; and then, after initializing the network with these embeddings, it will be trained on all (available) training data collectively. This method also considers a novel technique for automatic collection of training data from PubMed to (pre)train the network in an unsupervised manner.
RESULTS: We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracies employed as evaluation metrics. deepBioWSD outperforms existing models in biomedical text WSD by achieving the state-of-the-art performance of 96.82% for macro accuracy.
CONCLUSIONS: Apart from the disambiguation improvement and unsupervised training, deepBioWSD depends on considerably fewer expert-labeled data as it learns the target and the context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.
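The abstract above reports macro and micro accuracy without defining them. Under the usual convention, assumed here, micro accuracy pools every test instance, while macro accuracy first computes accuracy per ambiguous term and then averages over terms. The sketch below illustrates the difference; the function name, the record layout, and the toy terms are hypothetical.

```python
from collections import defaultdict


def micro_macro_accuracy(records):
    """records: iterable of (ambiguous_term, is_correct) pairs.

    Returns (micro_accuracy, macro_accuracy).
    """
    per_term = defaultdict(lambda: [0, 0])  # term -> [correct, total]
    for term, is_correct in records:
        per_term[term][0] += int(is_correct)
        per_term[term][1] += 1
    total_correct = sum(correct for correct, _ in per_term.values())
    total = sum(count for _, count in per_term.values())
    micro = total_correct / total
    macro = sum(correct / count for correct, count in per_term.values()) / len(per_term)
    return micro, macro


# Toy predictions: "cold" is right 3 times out of 4, "discharge" 1 out of 1.
toy_records = [
    ("cold", True), ("cold", True), ("cold", True), ("cold", False),
    ("discharge", True),
]
micro, macro = micro_macro_accuracy(toy_records)
```

For this toy input, micro accuracy is 4/5 = 0.8 and macro accuracy is (0.75 + 1.0) / 2 = 0.875, which shows how rare terms weigh more heavily in the macro figure.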
Affiliation(s)
- Ahmad Pesaranghader
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada; Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
| | - Stan Matwin
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada; Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
| | - Marina Sokolova
- Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada; School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada; School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON K1G 5Z3, Canada
| | - Ali Pesaranghader
- School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada
|
15
|
|
16
|
Henry S, McInnes BT. Indirect association and ranking hypotheses for literature based discovery. BMC Bioinformatics 2019; 20:425. [PMID: 31416434 PMCID: PMC6694578 DOI: 10.1186/s12859-019-2989-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2018] [Accepted: 07/09/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Literature Based Discovery (LBD) produces more potential hypotheses than can be manually reviewed, making automatically ranking these hypotheses critical. In this paper, we introduce the indirect association measures of Linking Term Association (LTA), Minimum Weight Association (MWA), and Shared B to C Set Association (SBC), and compare them to Linking Set Association (LSA), concept embeddings vector cosine, Linking Term Count (LTC), and direct co-occurrence vector cosine. Our proposed indirect association measures extend traditional association measures to quantify indirect rather than direct associations while preserving valuable statistical properties. RESULTS We perform a comparison between several different hypothesis ranking methods for LBD, and compare them against our proposed indirect association measures. We intrinsically evaluate each method's performance using its ability to estimate semantic relatedness on standard evaluation datasets. We extrinsically evaluate each method's ability to rank hypotheses in LBD using a time-slicing dataset based on co-occurrence information, and another time-slicing dataset based on SemRep extracted-relationships. Precision and recall curves are generated by ranking term pairs and applying a threshold at each rank. CONCLUSIONS Results differ depending on the evaluation methods and datasets, but it is unclear if this is a result of biases in the evaluation datasets or if one method is truly better than another. We conclude that LTC and SBC are the best suited methods for hypothesis ranking in LBD, but there is value in having a variety of methods to choose from.
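The evaluation described above ranks term pairs and applies a threshold at each rank. A minimal sketch of that procedure, assuming a ranked list of candidate pairs and a gold set of pairs that appear in the later time slice; the function name and the toy pairs are hypothetical:

```python
def precision_recall_curve(ranked_pairs, gold_pairs):
    """Return a (precision, recall) point for a cutoff at every rank."""
    gold = set(gold_pairs)
    if not gold:
        raise ValueError("gold set must not be empty")
    hits = 0
    curve = []
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            hits += 1
        # Precision: hits among the top `rank` pairs; recall: hits among all gold pairs.
        curve.append((hits / rank, hits / len(gold)))
    return curve


# Toy ranking (hypothetical): best-scored pair first.
ranked = [("A", "C1"), ("A", "C2"), ("A", "C3"), ("A", "C4")]
gold = {("A", "C1"), ("A", "C3")}
curve = precision_recall_curve(ranked, gold)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Each ranking method yields its own curve, so methods can be compared at matching recall levels.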
Affiliation(s)
- Sam Henry
- Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
| | - Bridget T. McInnes
- Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
|
17
|
Gopalakrishnan V, Jha K, Xun G, Ngo HQ, Zhang A. Towards self-learning based hypotheses generation in biomedical text domain. Bioinformatics 2019; 34:2103-2115. [PMID: 29293920 DOI: 10.1093/bioinformatics/btx837] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2017] [Accepted: 12/22/2017] [Indexed: 01/01/2023] Open
Abstract
Motivation The overwhelming number of research articles in the domain of bio-medicine might cause important connections to remain unnoticed. Literature Based Discovery is a sub-field within biomedical text mining that peruses these articles to formulate high-confidence hypotheses on possible connections between medical concepts. Although many alternate methodologies have been proposed over the last decade, they still suffer from scalability issues. The primary reason, apart from the dense inter-connections between biological concepts, is the absence of information on the factors that lead to the edge-formation. In this work, we formulate this problem as a collaborative filtering task and leverage a relatively new concept of word-vectors to learn and mimic the implicit edge-formation process. Along with a single-class classifier, we prune the search-space of redundant and irrelevant hypotheses to increase the efficiency of the system while maintaining, and in some cases even boosting, the overall accuracy. Results We show that our proposed framework is able to prune up to 90% of the hypotheses while still retaining high recall in top-K results. This level of efficiency enables the discovery algorithm to look for higher-order hypotheses, something that was infeasible until now. Furthermore, the generic formulation allows our approach to be agile enough to perform both open and closed discovery. We also experimentally validate that the core data-structures upon which the system bases its decision have a high concordance with the opinion of the experts. This, coupled with the ability to understand the edge formation process, provides us with interpretable results without any manual intervention. Availability and implementation The relevant Java code is available at: https://github.com/vishrawas/Medline-Code_v2. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vishrawas Gopalakrishnan
- Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
| | - Kishlay Jha
- Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
| | - Guangxu Xun
- Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
| | - Hung Q Ngo
- Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
| | - Aidong Zhang
- Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
|
18
|
Henry S, Panahi A, Wijesinghe DS, McInnes BT. A Literature Based Discovery Visualization System with Hierarchical Clustering and Linking Set Associations. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2019; 2019:582-591. [PMID: 31259013 PMCID: PMC6568119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Literature Based Discovery (LBD) seeks to find information that is implicit in text but never explicitly stated. In this work, we develop a method of visually summarizing LBD output in an automatically generated tree structure. This structure promotes a comprehensive understanding of LBD output as a whole, and encourages the user to explore branches of the hierarchy they find most interesting or surprising. This novel visualization system requires the development and integration of automatic functional group discovery, set associations, and linking set associations. Specifically, we perform hierarchical clustering on the potential discoveries generated by an LBD system to create a tree of potential hypotheses. We weight the tree by developing set association measures, and extending them to linking set association measures. This weighted tree is displayed in an interactive visual environment, and validated by replicating the historic Raynaud's Disease - fish oil discovery.
Affiliation(s)
- Sam Henry
- Dept. Computer Science, Virginia Commonwealth University, VA, USA
| | - Aliakbar Panahi
- Dept. Computer Science, Virginia Commonwealth University, VA, USA
| | - D Shanaka Wijesinghe
- Dept. of Pharmacotherapy & Outcomes Science, Virginia Commonwealth University, VA, USA
|
19
|
Kim YH, Song M. A context-based ABC model for literature-based discovery. PLoS One 2019; 14:e0215313. [PMID: 31017923 PMCID: PMC6481912 DOI: 10.1371/journal.pone.0215313] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2018] [Accepted: 03/29/2019] [Indexed: 12/13/2022] Open
Abstract
Background In literature-based discovery, considerable research has been done based on the ABC model developed by Swanson. The ABC model hypothesizes that there is a meaningful relation between entity A extracted from document set 1 and entity C extracted from document set 2 through B entities that appear in both document sets. The results of the ABC model are relations among entities A, B, and C, which are referred to as paths. A path allows for hypothesizing the relationship between entity A and entity C, or helps discover entity B as new evidence for the relationship between entity A and entity C. The co-occurrence based approach of the ABC model is a well-known approach to automatic hypothesis generation by creating various paths. However, the co-occurrence based ABC model has a limitation, in that biological context is not considered. It focuses only on matching the B entity that commonly appears in relations between two entities. Therefore, the co-occurrence based ABC model tends to extract many irrelevant paths, meaning that expert verification is essential. Methods In order to overcome this limitation of the co-occurrence based ABC model, we propose a context-based approach to connecting one entity relation to another, modifying the ABC model using biological contexts. In this study, we defined four biological context elements: cell, drug, disease, and organism. Based on these biological contexts, we propose two extended ABC models: a context-based ABC model and a context-assignment-based ABC model. In order to measure the performance of both proposed models, we examined the relevance of the B entities between the well-known relations “APOE–MAPT” as well as “FUS–TARDBP”. Each relation represents an interaction between proteins associated with neurodegenerative disease. The interaction between APOE and MAPT is known to play a crucial role in Alzheimer’s disease as APOE affects tau-mediated neurodegeneration. It has been shown that mutations in FUS and TARDBP are associated with amyotrophic lateral sclerosis (ALS), a motor neuron disease, by leading to neuronal cell death. Using these two relations, we compared both proposed models to the co-occurrence based ABC model. Results The precision of B entities extracted by the co-occurrence based ABC model was 27.1% for “APOE–MAPT” and 22.1% for “FUS–TARDBP”. In the context-based ABC model, the precision of extracted B entities was 71.4% for “APOE–MAPT” and 77.9% for “FUS–TARDBP”. The context-assignment-based ABC model achieved 89% and 97.5% precision for the two relations, respectively. Both proposed models achieved a higher precision than the co-occurrence based ABC model.
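The co-occurrence based ABC baseline described in this abstract can be sketched in a few lines. This is only an illustration of the baseline, not of the authors' context-based models; it assumes each document has already been reduced to a set of extracted entities, and the function name and placeholder entities are hypothetical:

```python
def linking_terms(a_term, c_term, documents):
    """Return B entities that co-occur with A in some document and with C in some document.

    documents: list of sets, each holding the entities extracted from one document.
    """
    a_neighbors = set()
    c_neighbors = set()
    for entities in documents:
        if a_term in entities:
            a_neighbors |= entities
        if c_term in entities:
            c_neighbors |= entities
    # B entities are shared neighbors; A and C themselves are not linking terms.
    return (a_neighbors & c_neighbors) - {a_term, c_term}


# Placeholder entities (hypothetical), not real biomedical findings.
documents = [
    {"A", "B1"},
    {"A", "B2"},
    {"C", "B1"},
    {"C", "B3"},
]
print(linking_terms("A", "C", documents))
```

Because any shared neighbor qualifies as a B entity, this baseline ignores biological context entirely, which is the limitation the abstract points out.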
Affiliation(s)
- Yong Hwan Kim
- Division of Humanities, CheongJu University, CheongJu, Korea
| | - Min Song
- Department of Library and Information Science, Yonsei University, Seoul, Korea
|
20
|
Zhao D, Wang J, Sang S, Lin H, Wen J, Yang C. Relation path feature embedding based convolutional neural network method for drug discovery. BMC Med Inform Decis Mak 2019; 19:59. [PMID: 30961599 PMCID: PMC6454669 DOI: 10.1186/s12911-019-0764-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Drug development is an expensive and time-consuming process. Literature-based discovery has played a critical role in drug development and may be a supplementary method to help scientists speed up the discovery of drugs. METHODS Here, we propose a relation path features embedding based convolutional neural network model with attention mechanism for drug discovery from literature, which we denote as PACNN. First, we use predications from biomedical abstracts to construct a biomedical knowledge graph, and then apply a path ranking algorithm to extract drug-disease relation path features on the biomedical knowledge graph. After that, we use these drug-disease relation features to train a convolutional neural network model combined with the attention mechanism. Finally, we employ the trained models to mine drugs for treating diseases. RESULTS The experiment shows that the proposed model achieved promising results compared to several random walk algorithms. CONCLUSIONS In this paper, we propose a relation path features embedding based convolutional neural network with attention mechanism for discovering potential drugs from literature. Our method could be an auxiliary method for drug discovery, which can speed up the discovery of new drugs for incurable diseases.
Affiliation(s)
- Di Zhao
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
| | - Shengtian Sang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jiabin Wen
- Department of VIP, the Second Hospital of Dalian Medical University, Dalian, China
| | - Chunmei Yang
- Department of VIP, the Second Hospital of Dalian Medical University, Dalian, China
|
21
|
Gopalakrishnan V, Jha K, Jin W, Zhang A. A survey on literature based discovery approaches in biomedical domain. J Biomed Inform 2019; 93:103141. [PMID: 30857950 DOI: 10.1016/j.jbi.2019.103141] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 02/06/2023]
Abstract
Literature Based Discovery (LBD) refers to the problem of inferring new and interesting knowledge by logically connecting independent fragments of information units through explicit or implicit means. This area of research, which incorporates techniques from Natural Language Processing (NLP), Information Retrieval and Artificial Intelligence, has significant potential to reduce discovery time in biomedical research fields. Formally introduced in 1986, LBD has grown to be a significant and core task for text mining practitioners in the biomedical domain. Together with its inter-disciplinary nature, this has led researchers across domains to contribute to advancing this field of study. This survey attempts to consolidate and present the evolution of techniques in this area. We cover a variety of techniques and provide a detailed description of the problem setting, the intuition, the advantages and limitations of various influential papers. We also list the current bottlenecks in this field and provide a general direction of research activities for the future. In an effort to be comprehensive and for ease of reference for off-the-shelf users, we also list many publicly available tools for LBD. We hope this survey will act as a guide to both academic and industry (bio)-informaticians, introduce the various methodologies currently employed and also the challenges yet to be tackled.
Affiliation(s)
| | | | - Wei Jin
- University of North Texas at Denton, TX, United States.
|
22
|
Thilakaratne M, Falkner K, Atapattu T. A systematic review on literature-based discovery workflow. PeerJ Comput Sci 2019; 5:e235. [PMID: 33816888 PMCID: PMC7924697 DOI: 10.7717/peerj-cs.235] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Accepted: 10/17/2019] [Indexed: 05/02/2023]
Abstract
As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), supporting automated knowledge discovery, helps facilitate this process by eliciting novel knowledge by analysing existing scientific literature. This systematic review provides a comprehensive overview of the LBD workflow by answering nine research questions related to the major components of the LBD workflow (i.e., input, process, output, and evaluation). With regard to the input component, we discuss the data types and data sources used in the literature. The process component presents filtering techniques, ranking/thresholding techniques, domains, generalisability levels, and resources. Subsequently, the output component focuses on the visualisation techniques used in the LBD discipline. As for the evaluation component, we outline the evaluation techniques, their generalisability, and the quantitative measures used to validate results. To conclude, we summarise the findings of the review for each component by highlighting the possible future research directions.
Affiliation(s)
- Menasha Thilakaratne
- Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
| | - Katrina Falkner
- Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
| | - Thushari Atapattu
- Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
|
23
|
Ilgisonis E, Lisitsa A, Kudryavtseva V, Ponomarenko E. Creation of Individual Scientific Concept-Centered Semantic Maps Based on Automated Text-Mining Analysis of PubMed. Adv Bioinformatics 2018; 2018:4625394. [PMID: 30147721 PMCID: PMC6083525 DOI: 10.1155/2018/4625394] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2018] [Accepted: 07/05/2018] [Indexed: 01/22/2023] Open
Abstract
Concept-centered semantic maps were created based on a text-mining analysis of PubMed using the BiblioEngine_v2018 software. The objects ("concepts") of a semantic map can be MeSH-terms or other terms (names of proteins, diseases, chemical compounds, etc.) structured in the form of controlled vocabularies. The edges between the two objects were automatically calculated based on the index of semantic similarity, which is proportional to the number of publications related to both objects simultaneously. On the one hand, an individual semantic map created based on the already published papers allows us to trace scientific inquiry. On the other hand, a prospective analysis based on the study of PubMed search history enables us to determine the possible directions for future research.
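The abstract states that edge weights are proportional to the number of publications related to both objects. A minimal sketch of that counting step, assuming each publication is given as a list of controlled-vocabulary concepts; this is not the BiblioEngine_v2018 implementation, and the function name and toy concepts are hypothetical:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(publications):
    """Count, for every concept pair, how many publications mention both concepts."""
    edges = Counter()
    for concepts in publications:
        # Deduplicate and sort so each unordered pair maps to a single key.
        for pair in combinations(sorted(set(concepts)), 2):
            edges[pair] += 1
    return edges


# Toy corpus (hypothetical): each inner list holds the concepts tagged on one publication.
publications = [
    ["X", "Y"],
    ["Y", "X", "Z"],
    ["Z", "Y"],
]
edges = cooccurrence_edges(publications)
print(edges.most_common())
```

Any normalization that turns these raw counts into a similarity index is left out here, since the abstract does not specify one.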
|
24
|
Lossio-Ventura JA, Hogan W, Modave F, Guo Y, He Z, Yang X, Zhang H, Bian J. OC-2-KB: integrating crowdsourcing into an obesity and cancer knowledge base curation system. BMC Med Inform Decis Mak 2018; 18:55. [PMID: 30066655 PMCID: PMC6069686 DOI: 10.1186/s12911-018-0635-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND There is strong scientific evidence linking obesity and overweight to the risk of various cancers and to cancer survivorship. Nevertheless, the existing online information about the relationship between obesity and cancer is poorly organized, not evidence-based, of poor quality, and confusing to health information consumers. A formal knowledge representation such as a Semantic Web knowledge base (KB) can help better organize and deliver quality health information. We previously presented the OC-2-KB (Obesity and Cancer to Knowledge Base), a software pipeline that can automatically build an obesity and cancer KB from scientific literature. In this work, we investigated crowdsourcing strategies to increase the number of ground truth annotations and improve the quality of the KB. METHODS We developed a new release of the OC-2-KB system addressing key challenges in automatic KB construction. OC-2-KB automatically extracts semantic triples in the form of subject-predicate-object expressions from PubMed abstracts related to the obesity and cancer literature. The accuracy of the facts extracted from scientific literature heavily relies on both the quantity and quality of the available ground truth triples. Thus, we incorporated a crowdsourcing process to improve the quality of the KB. RESULTS We conducted two rounds of crowdsourcing experiments using a new corpus with 82 obesity and cancer-related PubMed abstracts. We demonstrated that crowdsourcing is indeed a low-cost mechanism to collect labeled data from non-expert laypeople. Even though an individual layperson might not offer reliable answers, the collective wisdom of the crowd is comparable to expert opinions. We also retrained the relation detection machine learning models in OC-2-KB using the crowd annotated data and evaluated the content of the curated KB with a set of competency questions. Our evaluation showed improved performance of the underlying relation detection model in comparison to the baseline OC-2-KB. CONCLUSIONS We presented a new version of OC-2-KB, a system that automatically builds an evidence-based obesity and cancer KB from scientific literature. Our KB construction framework integrated automatic information extraction with crowdsourcing techniques to verify the extracted knowledge. Our ultimate goal is a paradigm shift in how the general public accesses, reads, digests, and uses online health information.
Affiliation(s)
- Juan Antonio Lossio-Ventura
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA
| | - William Hogan
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA
| | - François Modave
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA
| | - Yi Guo
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA
| | - Zhe He
- School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, FL, 32306, USA
| | - Xi Yang
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA
| | - Hansi Zhang
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA
| | - Jiang Bian
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, FL, 32610, USA.
|
25
|
Bakal G, Talari P, Kakani EV, Kavuluru R. Exploiting semantic patterns over biomedical knowledge graphs for predicting treatment and causative relations. J Biomed Inform 2018; 82:189-199. [PMID: 29763706 PMCID: PMC6070294 DOI: 10.1016/j.jbi.2018.05.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2017] [Revised: 01/31/2018] [Accepted: 05/09/2018] [Indexed: 01/27/2023]
Abstract
BACKGROUND Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since all candidate drugs cannot be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying different causal relations between biomedical entities is also critical to understand biomedical processes. Generally, natural language processing (NLP) and machine learning are used to predict specific relations between any given pair of entities using the distant supervision approach. OBJECTIVE To build high accuracy supervised predictive models to predict previously unknown treatment and causative relations between biomedical entities based only on semantic graph pattern features extracted from biomedical knowledge graphs. METHODS We used 7000 treats and 2918 causes hand-curated relations from the UMLS Metathesaurus to train and test our models. Our graph pattern features are extracted from simple paths connecting biomedical entities in the SemMedDB graph (based on the well-known SemMedDB database made available by the U.S. National Library of Medicine). Using these graph patterns connecting biomedical entities as features of logistic regression and decision tree models, we computed mean performance measures (precision, recall, F-score) over 100 distinct 80-20% train-test splits of the datasets. For all experiments, we used a positive:negative class imbalance of 1:10 in the test set to model relatively more realistic scenarios. RESULTS Our models predict treats and causes relations with high F-scores of 99% and 90% respectively. Logistic regression model coefficients also help us identify highly discriminative patterns that have an intuitive interpretation. We are also able to predict some new plausible relations based on false positives that our models scored highly based on our collaborations with two physician co-authors. 
Finally, our decision tree models are able to retrieve over 50% of treatment relations from a recently created external dataset. CONCLUSIONS We employed semantic graph patterns connecting pairs of candidate biomedical entities in a knowledge graph as features to predict treatment/causative relations between them. We provide what we believe is the first evidence in direct prediction of biomedical relations based on graph features. Our work complements lexical pattern based approaches in that the graph patterns can be used as additional features for weakly supervised relation prediction.
Affiliation(s)
- Gokhan Bakal
- Department of Computer Science, University of Kentucky, United States.
| | - Preetham Talari
- Division of Hospital Medicine, Department of Internal Medicine, University of Kentucky, United States.
| | - Elijah V Kakani
- Division of Hospital Medicine, Department of Internal Medicine, University of Kentucky, United States.
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States; Department of Computer Science, University of Kentucky, United States.
|
26
|
Elsworth B, Dawe K, Vincent EE, Langdon R, Lynch BM, Martin RM, Relton C, Higgins JPT, Gaunt TR. MELODI: Mining Enriched Literature Objects to Derive Intermediates. Int J Epidemiol 2018; 47:4803214. [PMID: 29342271 PMCID: PMC5913624 DOI: 10.1093/ije/dyx251] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 11/02/2017] [Accepted: 01/03/2018] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND The scientific literature contains a wealth of information from different fields on potential disease mechanisms. However, identifying and prioritizing mechanisms for further analytical evaluation presents enormous challenges in terms of the quantity and diversity of published research. The application of data mining approaches to the literature offers the potential to identify and prioritize mechanisms for more focused and detailed analysis. METHODS Here we present MELODI, a literature mining platform that can identify mechanistic pathways between any two biomedical concepts. RESULTS Two case studies demonstrate the potential uses of MELODI and how it can generate hypotheses for further investigation. First, an analysis of ETS-related gene ERG and prostate cancer derives the intermediate transcription factor SP1, recently confirmed to be physically interacting with ERG. Second, examining the relationship between a new potential risk factor for pancreatic cancer identifies possible mechanistic insights which can be studied in vitro. CONCLUSIONS We have demonstrated the possible applications of MELODI, including two case studies. MELODI has been implemented as a Python/Django web application, and is freely available to use at [www.melodi.biocompute.org.uk].
Affiliation(s)
- Benjamin Elsworth
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | - Karen Dawe
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | - Emma E Vincent
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | - Ryan Langdon
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | - Brigid M Lynch
- Cancer Epidemiology and Intelligence Division, Cancer Council Victoria, Melbourne, VIC, Australia
- Centre for Epidemiology and Biostatistics, University of Melbourne, Melbourne, VIC, Australia
- Physical Activity Laboratory, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
| | - Richard M Martin
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | - Caroline Relton
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | | | - Tom R Gaunt
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
|
27
|
Smalheiser NR. Rediscovering Don Swanson: the Past, Present and Future of Literature-Based Discovery. JOURNAL OF DATA AND INFORMATION SCIENCE 2017; 2:43-64. [PMID: 29355246 PMCID: PMC5771422 DOI: 10.1515/jdis-2017-0019] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open
Abstract
The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed.
Affiliation(s)
- Neil R Smalheiser
- Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612 USA, +1 312-413-4581
|
28
|
Sabbir A, Jimeno-Yepes A, Kavuluru R. Knowledge-Based Biomedical Word Sense Disambiguation with Neural Concept Embeddings. PROCEEDINGS. IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING 2017; 2017:163-170. [PMID: 29399672 PMCID: PMC5792196 DOI: 10.1109/bibe.2017.00-61] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Biomedical word sense disambiguation (WSD) is an important intermediate task in many natural language processing applications such as named entity recognition, syntactic parsing, and relation extraction. In this paper, we employ knowledge-based approaches that also exploit recent advances in neural word/concept embeddings to improve over the state-of-the-art in biomedical WSD using the public MSH WSD dataset [1] as the test set. Our methods involve weak supervision - we do not use any hand-labeled examples for WSD to build our prediction models; however, we employ an existing concept mapping program, MetaMap, to obtain our concept vectors. Over the MSH WSD dataset, our linear time (in terms of numbers of senses and words in the test instance) method achieves an accuracy of 92.24% which is a 3% improvement over the best known results [2] obtained via unsupervised means. A more expensive approach that we developed relies on a nearest neighbor framework and achieves accuracy of 94.34%, essentially cutting the error rate in half. Employing dense vector representations learned from unlabeled free text has been shown to benefit many language processing tasks recently and our efforts show that biomedical WSD is no exception to this trend. For a complex and rapidly evolving domain such as biomedicine, building labeled datasets for larger sets of ambiguous terms may be impractical. Here, we show that weak supervision that leverages recent advances in representation learning can rival supervised approaches in biomedical WSD. However, external knowledge bases (here sense inventories) play a key role in the improvements achieved.
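One common way to use concept embeddings for knowledge-based WSD is to pick the sense whose vector is closest to the averaged context vector. The sketch below illustrates that general idea only; it is an assumption for illustration and not the method of the paper above, and the function names, sense labels, and toy vectors are hypothetical:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context_vectors, sense_vectors):
    """Return the sense whose embedding is most similar to the mean context vector.

    context_vectors: non-empty list of word vectors from the test instance.
    sense_vectors: dict mapping each candidate sense to its concept vector.
    """
    dims = len(context_vectors[0])
    centroid = [
        sum(vec[i] for vec in context_vectors) / len(context_vectors)
        for i in range(dims)
    ]
    return max(sense_vectors, key=lambda sense: cosine(centroid, sense_vectors[sense]))


# Toy 2-dimensional embeddings (hypothetical values).
context = [[1.0, 0.0], [0.8, 0.2]]
senses = {"sense_1": [1.0, 0.1], "sense_2": [0.0, 1.0]}
print(disambiguate(context, senses))
```

The cost of this sketch grows linearly with the number of context words and candidate senses, and it needs no hand-labeled examples once the vectors exist.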
Affiliation(s)
- Akm Sabbir
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Ramakanth Kavuluru
- Division of Biomedical Informatics (Department of Internal Medicine) and the Department of Computer Science, University of Kentucky, Lexington, KY, USA
29
|
Henry S, McInnes BT. Literature Based Discovery: Models, methods, and trends. J Biomed Inform 2017; 74:20-32. [PMID: 28838802 DOI: 10.1016/j.jbi.2017.08.011] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 03/03/2017] [Revised: 07/21/2017] [Accepted: 08/20/2017] [Indexed: 01/25/2023]
Abstract
OBJECTIVES This paper provides an introduction and overview of literature based discovery (LBD) in the biomedical domain. It introduces the reader to modern and historical LBD models, key system components, evaluation methodologies, and current trends. After completion, the reader will be familiar with the challenges and methodologies of LBD. The reader will be capable of distinguishing between recent LBD systems and publications, and be capable of designing an LBD system for a specific application. TARGET AUDIENCE From biomedical researchers curious about LBD, to someone looking to design an LBD system, to an LBD expert trying to catch up on trends in the field. The reader need not be familiar with LBD, but knowledge of biomedical text processing tools is helpful. SCOPE This paper describes a unifying framework for LBD systems. Within this framework, different models and methods are presented to both distinguish and show overlap between systems. Topics include term and document representation, system components, and an overview of models including co-occurrence models, semantic models, and distributional models. Other topics include uninformative term filtering, term ranking, results display, system evaluation, an overview of the application areas of drug development, drug repurposing, and adverse drug event prediction, and challenges and future directions. A timeline showing contributions to LBD, and a table summarizing the works of several authors is provided. Topics are presented from a high level perspective. References are given if more detailed analysis is required.
Affiliation(s)
- Sam Henry
- Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Rm E4222, Richmond, VA 23284, USA.
- Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Rm E4222, Richmond, VA 23284, USA
30
|
Harris DR, Kavuluru R, Jaromczyk JW, Johnson TR. Rapid and Reusable Text Visualization and Exploration Development with DELVE. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2017; 2017:139-148. [PMID: 28815123 PMCID: PMC5543346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Abstract
We present DELVE (Document ExpLoration and Visualization Engine), a framework for developing interactive visualizations as modular Web-applications to assist researchers with exploratory literature search. The goal for web-applications driven by DELVE is to better satisfy the information needs of researchers and to help explore and understand the state of research in scientific literature by providing immersive visualizations that both contain facets and are driven by facets derived from the literature. We base our framework on principles from user-centered design and human-computer interaction (HCI). Preliminary evaluations demonstrate the usefulness of DELVE's techniques: (1) a clinical researcher immediately saw that her original query was inappropriate simply due to the frequencies displayed via generalized clouds and (2) a muscle biologist quickly learned of vocabulary differences found between two disciplines that were referencing the same idea, which we feel is critical for interdisciplinary work. We discuss the underlying category-theoretic model of our framework and show that it naturally encourages the development of reusable visualizations by emphasizing interoperability.
Affiliation(s)
- Daniel R. Harris
- Center for Clinical and Translational Sciences, University of Kentucky, Lexington, KY 40506; Department of Computer Science, University of Kentucky, Lexington, KY 40506
- Ramakanth Kavuluru
- Department of Computer Science, University of Kentucky, Lexington, KY 40506; Institute of Biomedical Informatics, University of Kentucky, Lexington, KY 40506
- Jerzy W. Jaromczyk
- Department of Computer Science, University of Kentucky, Lexington, KY 40506
- Todd R. Johnson
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030
31
|
Automated extraction of potential migraine biomarkers using a semantic graph. J Biomed Inform 2017; 71:178-189. [PMID: 28579531 DOI: 10.1016/j.jbi.2017.05.018] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 12/29/2016] [Revised: 04/03/2017] [Accepted: 05/23/2017] [Indexed: 01/20/2023]
Abstract
PROBLEM Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, with 547 out of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that were not in the top of the list, an extensive error analysis has been performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.
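The percentages quoted in this abstract follow directly from the reported counts, and the ROC-AUC it reports is a ranking metric that can be computed from scores and binary labels alone. The sketch below checks the two ratios and shows a pairwise (Mann-Whitney style) AUC on made-up data. The `roc_auc` helper, the scores, and the labels are hypothetical illustrations, not the study's data or code.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    Quadratic in the number of items, which is fine for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Ratios reported in the abstract.
share_in_top_2000 = 163 / 222      # reference compounds ranked in the top 2000
share_of_weight_points = 547 / 644  # weight points covered by those compounds

# Hypothetical ranking: 1 marks a reference compound, 0 marks any other compound.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)  # 5 of the 6 positive/negative pairs are ordered correctly
```

The two ratios round to 73% and 85%, matching the abstract. The weighted figure is higher than the unweighted one, which suggests the compounds that ranked in the top 2000 carried more weight points on average than those that did not.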
32
|
Abstract
BACKGROUND Literature based discovery (LBD) automatically infers missed connections between concepts in literature. It is often assumed that LBD generates more information than can be reasonably examined. METHODS We present a detailed analysis of the quantity of hidden knowledge produced by an LBD system and the effect of various filtering approaches upon this. The investigation of filtering combined with single or multi-step linking term chains is carried out on all articles in PubMed. RESULTS The evaluation is carried out using both replication of existing discoveries, which provides justification for multi-step linking chain knowledge in specific cases, and using timeslicing, which gives a large scale measure of performance. CONCLUSIONS While the quantity of hidden knowledge generated by LBD can be vast, we demonstrate that (a) intelligent filtering can greatly reduce the number of hidden knowledge pairs generated, (b) for a specific term, the number of single step connections can be manageable, and (c) in the absence of single step hidden links, considering multiple steps can provide valid links.
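The effect of filtering on single-step hidden links can be illustrated with a co-occurrence version of the classic A-B-C pattern: a source term A is linked to a target C when both co-occur with some linking term B but never with each other. This is a minimal sketch under stated assumptions, not the system evaluated in the paper: the three toy documents, the stop-term list, and both function names are hypothetical, and document-level co-occurrence is an assumed stand-in for whatever relation extraction a real LBD system uses.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(documents):
    """Map each term to the set of terms it shares at least one document with."""
    neighbours = defaultdict(set)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

def hidden_links(neighbours, source, stop_terms=frozenset()):
    """Single-step hidden links from `source`.

    Returns {target: set of linking terms} for every target that shares a
    linking term with `source` but never co-occurs with it directly.
    Terms in `stop_terms` are ignored both as linking terms and as targets.
    """
    direct = neighbours.get(source, set())
    candidates = defaultdict(set)
    for linking in direct - stop_terms:
        for target in neighbours[linking]:
            if target != source and target not in direct and target not in stop_terms:
                candidates[target].add(linking)
    return dict(candidates)

# Hypothetical mini-corpus; each document is reduced to its list of terms.
documents = [
    ["fish oil", "blood viscosity", "patient"],
    ["blood viscosity", "raynaud disease", "patient"],
    ["patient", "aspirin"],
]
neighbours = build_cooccurrence(documents)

unfiltered = hidden_links(neighbours, "fish oil")
filtered = hidden_links(neighbours, "fish oil", stop_terms={"patient"})
```

In this toy corpus the uninformative term "patient" links "fish oil" to everything it touches. Removing it leaves only the link through "blood viscosity", which is the kind of reduction in hidden knowledge pairs the abstract attributes to filtering.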
Affiliation(s)
- Judita Preiss
- Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, UK.
- Mark Stevenson
- Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, UK
33
|
Abstract
Literature-based discovery systems aim at discovering valuable latent connections between previously disparate research areas. This is achieved by analyzing the contents of their respective literatures with the help of various intelligent computational techniques. In this paper, we review the progress of literature-based discovery research, focusing on understanding their technical features and evaluating their performance. The present literature-based discovery techniques can be divided into two general approaches: the traditional approach and the emerging approach. The traditional approach, which dominates the current research landscape, consists mainly of techniques that rely on lexical statistics, knowledge-based and visualization methods to address literature-based discovery problems. On the other hand, we have also observed the birth of new trends and unprecedented paradigm shifts among the recently emerging literature-based discovery approaches. These trends are likely to shape the future trajectory of the next generation of literature-based discovery systems.
34
|
Sebastian Y, Siew EG, Orimaye SO. Learning the heterogeneous bibliographic information network for literature-based discovery. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2016.10.015] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 10/20/2022]
35
|
Rios A, Kavuluru R. Analyzing the Moving Parts of a Large-Scale Multi-Label Text Classification Pipeline: Experiences in Indexing Biomedical Articles. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2015; 2015:1-7. [PMID: 28758165 DOI: 10.1109/ichi.2015.6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
Medical subject headings (MeSH) is a controlled hierarchical vocabulary used by the National Library of Medicine (NLM) to index biomedical articles. In the 2014 version of MeSH terminology there are a total of 27,149 terms. Librarians at the NLM tag each biomedical article to be indexed for the PubMed literature search system with terms from MeSH. This means the human indexers look at each article's full text and index it with a small set of descriptors, 13 on average, from over 27,000 descriptors available in MeSH. There have been many recent attempts to automate this process focused on using the article title and abstract text to predict MeSH terms for the corresponding article. There has also been an open automated biomedical indexing challenge, BioASQ [1], that started in 2013. The best general supervised learning framework in these challenges has been a pipeline with four different components: 1. pre-processing and feature extraction; 2. employing the binary relevance and/or nearest neighbor approaches to select a set of candidate terms; 3. ranking these candidate terms using corresponding informative features; and 4. applying label calibration to dynamically predict the number of top terms to be included in the final selection for the current instance. The specific details in how each of these components is implemented determines the performance variations of various entries in the challenge. In this paper, we analyze these moving parts of the MeSH indexing multi-label classification pipeline with experiments involving different combinations. Our best combination achieves ≈ 1% increase in micro F-score compared with the top performing team across the five weeks of the final batch of the BioASQ 2014 challenge. The main take away from our efforts is that small improvements/modifications to different components of the pipeline can offer moderate improvements to the overall performance of the method. 
Our experiences show that, at least thus far, top performances have resulted mostly due to these improvements rather than drastic changes of the core methodology.
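The micro F-score used to compare entries in this abstract pools true positives, false positives, and false negatives over all labels before computing F1, whereas a macro F-score averages per-label F1 values so that rare labels count as much as frequent ones. The sketch below computes both for multi-label predictions. The function names and the three toy documents with their MeSH-like labels are hypothetical, and this is not the challenge's official scoring code.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def micro_macro_f1(gold, predicted):
    """Return (micro_f1, macro_f1) for parallel lists of label sets."""
    labels = set().union(*gold, *predicted)
    if not labels:
        return 0.0, 0.0
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for gold_set, predicted_set in zip(gold, predicted):
        for label in predicted_set & gold_set:
            counts[label][0] += 1
        for label in predicted_set - gold_set:
            counts[label][1] += 1
        for label in gold_set - predicted_set:
            counts[label][2] += 1
    # Micro: pool the counts across labels, then compute a single F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro

# Hypothetical gold and predicted label sets for three articles.
gold = [{"Humans", "Neoplasms"}, {"Humans"}, {"Mice"}]
predicted = [{"Humans"}, {"Humans", "Mice"}, {"Mice"}]

micro_f1, macro_f1 = micro_macro_f1(gold, predicted)
```

Here the pooled counts are 3 true positives, 1 false positive, and 1 false negative, giving a micro F-score of 0.75. The macro value is lower because the label that was never predicted contributes a per-label F1 of zero.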
Affiliation(s)
- Anthony Rios
- Department of Computer Science, University of Kentucky, Lexington, KY
- Ramakanth Kavuluru
- Department of Computer Science, University of Kentucky, Lexington, KY; Division of Biomedical Informatics, Dept. of Biostatistics, University of Kentucky, Lexington, KY
36
|
Rios A, Kavuluru R. Convolutional Neural Networks for Biomedical Text Classification: Application in Indexing Biomedical Articles. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2015; 2015:258-267. [PMID: 28736769 PMCID: PMC5521984 DOI: 10.1145/2808719.2808746] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 10/23/2022]
Abstract
Building high accuracy text classifiers is an important task in biomedicine given the wealth of information hidden in unstructured narratives such as research articles and clinical documents. Due to large feature spaces, traditionally, discriminative approaches such as logistic regression and support vector machines with n-gram and semantic features (e.g., named entities) have been used for text classification where additional performance gains are typically made through feature selection and ensemble approaches. In this paper, we demonstrate that a more direct approach using convolutional neural networks (CNNs) outperforms several traditional approaches in biomedical text classification with the specific use-case of assigning medical subject headings (or MeSH terms) to biomedical articles. Trained annotators at the national library of medicine (NLM) assign on an average 13 codes to each biomedical article, thus semantically indexing scientific literature to support NLM's PubMed search system. Recent evidence suggests that effective automated efforts for MeSH term assignment start with binary classifiers for each term. In this paper, we use CNNs to build binary text classifiers and achieve an absolute improvement of over 3% in macro F-score over a set of selected hard-to-classify MeSH terms when compared with the best prior results on a public dataset. Additional experiments on 50 high frequency terms in the dataset also show improvements with CNNs. Our results indicate the strong potential of CNNs in biomedical text classification tasks.
Affiliation(s)
- Anthony Rios
- Department of Computer Science, University of Kentucky, Lexington, Kentucky
- Ramakanth Kavuluru
- Division of Biomedical Informatics, Depts. of Biostatistics and Computer Science, University of Kentucky, Lexington, Kentucky