1
|
Millikin RJ, Raja K, Steill J, Lock C, Tu X, Ross I, Tsoi LC, Kuusisto F, Ni Z, Livny M, Bockelman B, Thomson J, Stewart R. Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC Bioinformatics 2023; 24:412. [PMID: 37915001 PMCID: PMC10619245 DOI: 10.1186/s12859-023-05539-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 10/19/2023] [Indexed: 11/03/2023]
Abstract
BACKGROUND The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: (1) they identify a relationship but not the type of relationship, (2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, (3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or (4) they are specific to a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues.

RESULTS We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches.

CONCLUSIONS SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of the existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
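As an illustration of the A-B-C search described in this abstract, the following minimal Python sketch links an A term to C terms through B terms by testing document co-occurrence. The toy corpus `TOY_DOCS` (with terms borrowed from the classic fish oil and Raynaud's example), the function names, the `alpha` cutoff, and the choice of a one-sided Fisher's exact test are all assumptions made for illustration; the abstract states only that links are statistically significant and does not describe SKiM's actual implementation.

```python
from math import comb


def fisher_right_tail(both, only_x, only_y, neither):
    """One-sided Fisher's exact p-value: probability of at least `both`
    co-occurrences if x and y were independent (assumed test, not SKiM's)."""
    n = both + only_x + only_y + neither
    with_x = both + only_x
    with_y = both + only_y
    favourable = sum(
        comb(with_x, k) * comb(n - with_x, with_y - k)
        for k in range(both, min(with_x, with_y) + 1)
    )
    return favourable / comb(n, with_y)


def cooccurrence_p(docs, x, y):
    """Build the 2x2 co-occurrence table for terms x and y and test it."""
    both = sum(1 for d in docs if x in d and y in d)
    only_x = sum(1 for d in docs if x in d and y not in d)
    only_y = sum(1 for d in docs if y in d and x not in d)
    neither = len(docs) - both - only_x - only_y
    return fisher_right_tail(both, only_x, only_y, neither)


def abc_links(docs, a, b_terms, c_terms, alpha=0.05):
    """Return (A, B, C) triples where both A-B and B-C co-occur significantly."""
    hits = []
    for b in b_terms:
        if cooccurrence_p(docs, a, b) >= alpha:
            continue  # A and B are not linked, so skip this intermediate
        for c in c_terms:
            if cooccurrence_p(docs, b, c) < alpha:
                hits.append((a, b, c))
    return hits


# Hypothetical toy corpus: each document is the set of terms it mentions.
TOY_DOCS = (
    [{"raynaud", "blood viscosity"}] * 6
    + [{"blood viscosity", "fish oil"}] * 6
    + [{"aspirin"}] * 3
    + [{"unrelated"}] * 9
)

links = abc_links(TOY_DOCS, "raynaud", ["blood viscosity"], ["fish oil", "aspirin"])
```

In this toy corpus the A and C terms never appear in the same document, yet they are connected through the shared B term, which is the situation an LBD search is meant to surface.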
Affiliation(s)
- Kalpana Raja
  - Morgridge Institute for Research, Madison, WI, USA
  - Currently at Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
- John Steill
  - Morgridge Institute for Research, Madison, WI, USA
- Cannon Lock
  - Morgridge Institute for Research, Madison, WI, USA
- Xuancheng Tu
  - Morgridge Institute for Research, Madison, WI, USA
- Ian Ross
  - Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
- Lam C Tsoi
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Finn Kuusisto
  - Morgridge Institute for Research, Madison, WI, USA
  - Currently at Data Science Institute, University of Wisconsin, Madison, WI, USA
- Zijian Ni
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Currently at Amazon, Seattle, WA, USA
- Miron Livny
  - Morgridge Institute for Research, Madison, WI, USA
  - Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin, Madison, WI, USA
- James Thomson
  - Morgridge Institute for Research, Madison, WI, USA
  - Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI, USA
- Ron Stewart
  - Morgridge Institute for Research, Madison, WI, USA
|
2
|
Nelson B, Faquin W. Breaking free of the research silo: A growing case for multidisciplinary work: From studying human origins to developing cancer diagnoses and treatments, working across disciplines is not always easy, but it is often transformative. Cancer Cytopathol 2023; 131:275-276. [PMID: 37139788 DOI: 10.1002/cncy.22687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2023]
|
3
|
Moreau E. Literature-based discovery: addressing the issue of the subpar evaluation methodology. Bioinformatics 2023; 39:btad090. [PMID: 36786419 PMCID: PMC9945845 DOI: 10.1093/bioinformatics/btad090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Revised: 01/26/2023] [Accepted: 02/13/2023] [Indexed: 02/15/2023]
Affiliation(s)
- Erwan Moreau
  - Adapt Centre, Trinity College Dublin, Dublin, Ireland
|
4
|
Doroudi S. What is a related work? A typology of relationships in research literature. Synthese 2023; 201:24. [PMID: 36643731 PMCID: PMC9829224 DOI: 10.1007/s11229-022-03976-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Accepted: 11/10/2022] [Indexed: 06/17/2023]
Abstract
An important part of research is situating one's work in a body of existing literature, thereby connecting to existing ideas. Despite this, the various kinds of relationships that might exist among academic literature do not appear to have been formally studied. Here I present a graphical representation of academic work in terms of entities and relations, drawing on structure-mapping theory (used in the study of analogies). I then use this representation to present a typology of operations that could relate two pieces of academic work. I illustrate the various types of relationships with examples from medicine, physics, psychology, history and philosophy of science, machine learning, education, and neuroscience. The resulting typology not only gives insights into the relationships that might exist between static publications, but also the rich process whereby an ongoing research project evolves through interactions with the research literature.
Affiliation(s)
- Shayan Doroudi
  - School of Education, University of California, Irvine, 401 E. Peltason Drive, Suite 3200, Irvine, CA 92617 USA
|
5
|
Bhasuran B. Combining Literature Mining and Machine Learning for Predicting Biomedical Discoveries. Methods Mol Biol 2022; 2496:123-140. [PMID: 35713862 DOI: 10.1007/978-1-0716-2305-3_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
The major outcomes and insights of scientific research and clinical studies end up as publications or clinical records in unstructured text. Owing to advances in biomedical research, the published literature has grown tremendously in recent years. Scientists and clinical researchers face a major challenge in staying current with this knowledge and in extracting hidden information from the millions of published biomedical articles. A potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, empirical approaches and clinical affirmation are expensive and time-consuming. An in silico approach that uses text mining to identify disease-causing genes can contribute to biomarker discovery. This chapter presents a protocol that combines literature mining and machine learning to predict biomedical discoveries, with a special emphasis on gene-disease relations. The protocol is presented as a literature-based discovery (LBD) pipeline for gene-disease discovery. It includes our web-based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep-learning-based method for association discovery. Our proposed deep-learning-based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drugs/chemicals or miRNAs.
Affiliation(s)
- Balu Bhasuran
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
|
6
|
Phang CSJ, Vong WT, Sebastian Y, Raman V, Then PHH. Understanding the Usability of a Literature-Based Discovery System Among Clinical Researchers in Sarawak, Malaysia. International Journal of Technology and Human Interaction 2022. [DOI: 10.4018/ijthi.304092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The rapid increase in scientific publications makes it difficult for researchers to keep up with the latest literature and to explore new research directions. Literature-based discovery (LBD) systems aim to resolve this issue by bridging literatures from disparate fields to assist researchers in knowledge discovery and the formulation and testing of research hypotheses. Previous studies have focused mainly on evaluating the efficacy of LBD systems by replicating historical LBD events. The usability of LBD systems has been under-researched, which partly explains the low adoption of the systems. This paper presents a survey study that evaluates the usability of an LBD system for knowledge discovery and hypothesis refinement, and also investigates factors affecting its adoption among biomedical researchers in Sarawak, Malaysia. The findings suggest that the adoption of the LBD system is related to their perceived usefulness and perceived difficulty in interacting with the user interface features of the system.
Affiliation(s)
- Wan-Tze Vong
  - Swinburne University of Technology, Sarawak, Malaysia
|
7
|
Rahaman T. Discovering New Trends & Connections: Current Applications of Biomedical Text Mining. Med Ref Serv Q 2021; 40:329-336. [PMID: 34495798 DOI: 10.1080/02763869.2021.1945869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
The explosive growth of digital information in recent years has amplified the information overload experienced by today's health-care professionals. In particular, the wide variety of unstructured text makes it difficult for researchers to find meaningful data without spending a considerable amount of time reading. Text mining can be used to facilitate better discoverability and analysis, and aid researchers in identifying critical trends and connections. This column will introduce key text-mining terms, recent use cases of biomedical text mining, and current applications for this technology in medical libraries.
Affiliation(s)
- Tariq Rahaman
  - Tampa Bay Regional Campus Library, Nova Southeastern University, Clearwater, Florida, USA
|
8
|
Henry S, Wijesinghe DS, Myers A, McInnes BT. Using Literature Based Discovery to Gain Insights Into the Metabolomic Processes of Cardiac Arrest. Front Res Metr Anal 2021; 6:644728. [PMID: 34250435 PMCID: PMC8267364 DOI: 10.3389/frma.2021.644728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2020] [Accepted: 05/07/2021] [Indexed: 12/19/2022]
Abstract
In this paper, we describe how we applied LBD techniques to discover lecithin cholesterol acyltransferase (LCAT) as a druggable target for cardiac arrest. We fully describe our process, which includes the use of high-throughput metabolomic analysis to identify metabolites significantly related to cardiac arrest, and how we used LBD to gain insights into how these metabolites relate to cardiac arrest. These insights led to our proposal (for the first time) of LCAT as a druggable target, the effects of which are supported by in vivo studies that were brought forth by this work. Metabolites are the end product of many biochemical pathways within the human body. Observed changes in metabolite levels are indicative of changes in these pathways, and provide valuable insights toward the cause, progression, and treatment of diseases. Following cardiac arrest, we observed changes in metabolite levels pre- and post-resuscitation. We used LBD to help discover diseases implicitly linked via these metabolites of interest. Results of LBD indicated a strong link between fish eye disease and cardiac arrest. Since fish eye disease is characterized by an LCAT deficiency, this finding prompted an investigation into the effects of LCAT on cardiac arrest survival. In the investigation, we found that decreased LCAT activity may increase cardiac arrest survival rates by increasing ω-3 polyunsaturated fatty acid availability in circulation. We verified the effects of ω-3 polyunsaturated fatty acids on increasing survival rate following cardiac arrest in vivo in rat models.
Affiliation(s)
- Sam Henry
  - Department of Physics, Computer Science and Engineering, Christopher Newport University, Newport News, VA, United States
- D. Shanaka Wijesinghe
  - Department of Pharmacotherapy and Outcomes Science, Virginia Commonwealth University, Richmond, VA, United States
- Aidan Myers
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
|
9
|
Mejia C, Kajikawa Y. Exploration of Shared Themes Between Food Security and Internet of Things Research Through Literature-Based Discovery. Front Res Metr Anal 2021; 6:652285. [PMID: 34056514 PMCID: PMC8159171 DOI: 10.3389/frma.2021.652285] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/12/2021] [Accepted: 04/19/2021] [Indexed: 11/28/2022]
Abstract
This paper applied a literature-based discovery methodology utilizing citation networks and text mining in order to extract and represent shared terminologies found in disjoint academic literature on food security and the Internet of Things. The topic of food security includes research on improvements in nutrition, sustainable agriculture, and a plurality of other social challenges, while the Internet of Things refers to a collection of technologies from which solutions can be drawn. Academic articles on both topics were classified into subclusters, and their text contents were compared against each other to find shared terms. These terms formed a network from which clusters of related keywords could be identified, potentially easing the exploration of common themes. Thirteen transversal themes, including blockchain, healthcare, and air quality, were found. This method can be applied by policymakers and other stakeholders to understand how a given technology could contribute to solving a pressing social issue.
Affiliation(s)
- Cristian Mejia
  - Graduate School of Environment and Society, Tokyo Institute of Technology, Tokyo, Japan
- Yuya Kajikawa
  - Graduate School of Environment and Society, Tokyo Institute of Technology, Tokyo, Japan
  - Institute for Future Initiatives, The University of Tokyo, Tokyo, Japan
|
10
|
Malec SA, Wei P, Bernstam EV, Boyce RD, Cohen T. Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance. J Biomed Inform 2021; 117:103719. [PMID: 33716168 PMCID: PMC8559730 DOI: 10.1016/j.jbi.2021.103719] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/15/2020] [Revised: 12/31/2020] [Accepted: 01/04/2021] [Indexed: 10/21/2022]
Abstract
INTRODUCTION Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders affecting both the exposure and outcome of interest. However, causal knowledge of dynamic biological systems is complex and challenging. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data.

METHODS We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. These methods search SemMedDB for confounders by applying semantic constraint search for indications treated by the drug (exposure) and that are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety containing labeled pairwise relationships between drugs and adverse events and attempt to rediscover these relationships from a corpus of 2.2 M NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects by informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results from applying these procedures with naive measures of association (χ² and reporting odds ratio) and with each other.

RESULTS AND CONCLUSIONS We found semantic vector-based search to be superior to string-based search at reducing confounding bias. However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, where confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic covariates.
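The two naive association measures named in this abstract, χ² and the reporting odds ratio, can both be computed from a 2×2 table of dichotomous exposure and outcome indicators. The sketch below uses the standard textbook formulas (Pearson's χ² without continuity correction); the function names and the counts are hypothetical and are not taken from the study.

```python
def reporting_odds_ratio(a, b, c, d):
    """Reporting odds ratio for a 2x2 table.

    a: exposed to the drug, adverse event present
    b: exposed to the drug, adverse event absent
    c: not exposed, adverse event present
    d: not exposed, adverse event absent
    """
    if b == 0 or c == 0:
        raise ValueError("ROR is undefined when b or c is zero")
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the same table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("chi-square is undefined when a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts, for illustration only.
ror = reporting_odds_ratio(20, 80, 10, 90)
chi2 = chi_square_2x2(20, 80, 10, 90)
```

Neither measure adjusts for confounders, which is why the abstract treats them as a naive baseline for the adjusted models.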
Affiliation(s)
- Scott A Malec
  - University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Peng Wei
  - The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, TX, United States
- Elmer V Bernstam
  - University of Texas Health Science Center at Houston, School of Biomedical Informatics, Houston, TX, United States
- Richard D Boyce
  - University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Trevor Cohen
  - University of Washington, Department of Biomedical Informatics and Medical Education, Seattle, WA, United States
|
11
|
Škrlj B, Kokalj E, Lavrač N. PubMed-Scale Chemical Concept Embeddings Reconstruct Physical Protein Interaction Networks. Front Res Metr Anal 2021; 6:644614. [PMID: 33928210 PMCID: PMC8076635 DOI: 10.3389/frma.2021.644614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2020] [Accepted: 02/08/2021] [Indexed: 11/13/2022]
Abstract
PubMed is the largest resource of curated biomedical knowledge to date, entailing more than 25 million documents. Large quantities of novel literature prevent a single expert from keeping track of all potentially relevant papers, resulting in knowledge gaps. In this article, we present CHEMMESHNET, a newly developed PubMed-based network comprising more than 10,000,000 associations, constructed from expert-curated MeSH annotations of chemicals based on all currently available PubMed articles. By learning latent representations of concepts in the obtained network, we demonstrate in a proof of concept study that purely literature-based representations are sufficient for the reconstruction of a large part of the currently known network of physical, empirically determined protein-protein interactions. We demonstrate that simple linear embeddings of node pairs, when coupled with a neural network-based classifier, reliably reconstruct the existing collection of empirically confirmed protein-protein interactions. Furthermore, we demonstrate how pairs of learned representations can be used to prioritize potentially interesting novel interactions based on the common chemical context. Highly ranked interactions are qualitatively inspected in terms of potential complex formation at the structural level and represent potentially interesting new knowledge. We demonstrate that two protein-protein interactions, prioritized by structure-based approaches, also emerge as probable with regard to the trained machine-learning model.
Affiliation(s)
- Blaž Škrlj
  - Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  - Jožef Stefan Institute, Ljubljana, Slovenia
- Enja Kokalj
  - Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  - Jožef Stefan Institute, Ljubljana, Slovenia
- Nada Lavrač
  - Jožef Stefan Institute, Ljubljana, Slovenia
  - University of Nova Gorica, Vipava, Slovenia
|
12
|
|
13
|
Choudhury N, Faisal F, Khushi M. Mining Temporal Evolution of Knowledge Graphs and Genealogical Features for Literature-based Discovery Prediction. J Informetr 2020. [DOI: 10.1016/j.joi.2020.101057] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023]
|
14
|
Malec SA, Boyce RD. Exploring Novel Computable Knowledge in Structured Drug Product Labels. AMIA Joint Summits on Translational Science Proceedings 2020; 2020:403-412. [PMID: 32477661 PMCID: PMC7233092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
This paper introduces a database derived from Structured Product Labels (SPLs). SPLs are legally mandated snapshots containing information on all drugs released to market in the United States. Since publication is not required for pre-trial findings, we hypothesize that SPLs may contain knowledge absent in the literature, and hence "novel." SemMedDB is an existing database of computable knowledge derived from the literature. If SPL content could be similarly transformed, novel clinically relevant assertions in the SPLs could be identified through comparison with SemMedDB. After we derive a database (containing 4,297,481 assertions), we compare the extracted content with SemMedDB for recent FDA drug approvals. We find that novelty between the SPLs and the literature is nuanced, due to the redundancy of SPLs. Highlighting areas for improvement and future work, we conclude that SPLs contain a wealth of novel knowledge relevant to research and complementary to the literature.
Affiliation(s)
- Scott A Malec
  - University of Pittsburgh Department of Biomedical Informatics, Pittsburgh, PA
- Richard D Boyce
  - University of Pittsburgh Department of Biomedical Informatics, Pittsburgh, PA
|
15
|
Visualizing a field of research: A methodology of systematic scientometric reviews. PLoS One 2019; 14:e0223994. [PMID: 31671124 PMCID: PMC6822756 DOI: 10.1371/journal.pone.0223994] [Citation(s) in RCA: 348] [Impact Index Per Article: 69.6] [Received: 06/12/2019] [Accepted: 10/02/2019] [Indexed: 12/14/2022]
Abstract
Systematic scientometric reviews, empowered by computational and visual analytic approaches, offer opportunities to improve the timeliness, accessibility, and reproducibility of studies of the literature of a field of research. On the other hand, effectively and adequately identifying the most representative body of scholarly publications as the basis of subsequent analyses remains a common bottleneck in the current practice. What can we do to reduce the risk of missing something potentially significant? How can we compare different search strategies in terms of the relevance and specificity of topical areas covered? In this study, we introduce a flexible and generic methodology based on a significant extension of the general conceptual framework of citation indexing for delineating the literature of a research field. The method, through cascading citation expansion, provides a practical connection between studies of science from local and global perspectives. We demonstrate an application of the methodology to the research of literature-based discovery (LBD) and compare five datasets constructed based on three use scenarios and corresponding retrieval strategies, namely a query-based lexical search (one dataset), forward expansions starting from a groundbreaking article of LBD (two datasets), and backward expansions starting from a recently published review article by a prominent expert in LBD (two datasets). We particularly discuss the relevance of areas captured by expansion processes with reference to the query-based scientometric visualization. The method used in this study for comparing bibliometric datasets is applicable to comparative studies of search strategies.
|
16
|
Henry S, McInnes BT. Indirect association and ranking hypotheses for literature based discovery. BMC Bioinformatics 2019; 20:425. [PMID: 31416434 PMCID: PMC6694578 DOI: 10.1186/s12859-019-2989-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/24/2018] [Accepted: 07/09/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND Literature Based Discovery (LBD) produces more potential hypotheses than can be manually reviewed, making automatically ranking these hypotheses critical. In this paper, we introduce the indirect association measures of Linking Term Association (LTA), Minimum Weight Association (MWA), and Shared B to C Set Association (SBC), and compare them to Linking Set Association (LSA), concept embeddings vector cosine, Linking Term Count (LTC), and direct co-occurrence vector cosine. Our proposed indirect association measures extend traditional association measures to quantify indirect rather than direct associations while preserving valuable statistical properties.

RESULTS We perform a comparison between several different hypothesis ranking methods for LBD, and compare them against our proposed indirect association measures. We intrinsically evaluate each method's performance using its ability to estimate semantic relatedness on standard evaluation datasets. We extrinsically evaluate each method's ability to rank hypotheses in LBD using a time-slicing dataset based on co-occurrence information, and another time-slicing dataset based on SemRep extracted-relationships. Precision and recall curves are generated by ranking term pairs and applying a threshold at each rank.

CONCLUSIONS Results differ depending on the evaluation methods and datasets, but it is unclear if this is a result of biases in the evaluation datasets or if one method is truly better than another. We conclude that LTC and SBC are the best suited methods for hypothesis ranking in LBD, but there is value in having a variety of methods to choose from.
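The evaluation step described in this abstract, ranking term pairs and applying a threshold at each rank, can be sketched as follows. This is a minimal illustration with hypothetical names and toy data, not the authors' code. `linking_term_count` reads Linking Term Count as the number of B terms shared by the A term and the C term, which is an assumption based on the measure's name because the abstract does not define it.

```python
def linking_term_count(a_linked_terms, c_linked_terms):
    """Assumed reading of LTC: number of B terms linked to both A and C."""
    return len(set(a_linked_terms) & set(c_linked_terms))


def precision_recall_by_rank(scores, gold_pairs):
    """Rank term pairs by descending score and apply a cutoff at every rank.

    scores: dict mapping (a_term, c_term) -> ranking score
    gold_pairs: set of pairs counted as true discoveries
    Returns a list of (rank, precision, recall) tuples.
    """
    if not gold_pairs:
        raise ValueError("gold_pairs must not be empty")
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    curve = []
    hits = 0
    for rank, pair in enumerate(ranked, start=1):
        if pair in gold_pairs:
            hits += 1
        curve.append((rank, hits / rank, hits / len(gold_pairs)))
    return curve


# Hypothetical toy data: B terms linked to one A term and three C terms.
a_links = {"b1", "b2", "b3"}
c_links = {"c1": {"b1", "b2", "b3"}, "c2": {"b1", "b2"}, "c3": {"b3"}}
toy_scores = {("a", c): linking_term_count(a_links, bs) for c, bs in c_links.items()}
toy_gold = {("a", "c1"), ("a", "c3")}
curve = precision_recall_by_rank(toy_scores, toy_gold)
```

Any of the measures listed in the abstract could be substituted for `linking_term_count`, since the curve depends only on the order the scores induce.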
Affiliation(s)
- Sam Henry
  - Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
|
17
|
Meng G, Huang Y, Yu Q, Ding Y, Wild D, Zhao Y, Liu X, Song M. Adopting Text Mining on Rehabilitation Therapy Repositioning for Stroke. Front Neuroinform 2019; 13:17. [PMID: 30941028 PMCID: PMC6433708 DOI: 10.3389/fninf.2019.00017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/05/2018] [Accepted: 03/05/2019] [Indexed: 12/30/2022]
Abstract
Stroke is a common disabling disease that severely affects the daily life of patients. Accumulating evidence indicates that rehabilitation therapy can improve movement function. However, no clear guidelines have specific and effective rehabilitation therapy schemes, and the development of new rehabilitation techniques has been relatively slow. This study used a text mining approach, the ABC model, to identify an existing rehabilitation candidate therapy method that is most likely to be repositioned for stroke. In the model, we built the internal links of stroke (A), assessment scales (B), and rehabilitation therapies (C) in PubMed and the links were related to upper limb function measurements for patients with stroke. In the first step, using E-utility, we retrieved both stroke-related assessment scales and rehabilitation therapy records and then compiled two datasets, which were called Stroke_Scales and Stroke_Therapies, respectively. In the next step, we crawled all rehabilitation therapies co-occurring with the Stroke_Therapies and then named them as All_Therapies. Therapies that were already included in Stroke_Therapies were deleted from All_Therapies; therefore, the remaining therapies were the potential rehabilitation therapies, which could be repositioned for stroke after subsequent filtration by a manual check. We identified the top-ranked repositioning rehabilitation therapy and subsequently examined its clinical validation. Hand-arm bimanual intensive training (HABIT) was ranked the first in our repositioning rehabilitation therapies and had the most interaction links with Stroke_Scales. HABIT significantly improved clinical scores on assessment scales [Fugl-Meyer Assessment (FMA) and action research arm test (ARAT)] in the clinical validation study for acute stroke patients with upper limb dysfunction. 
Therefore, based on the ABC model and clinical validation, HABIT is a promising repositioned rehabilitation therapy for stroke, and the ABC model is an effective text mining approach for rehabilitation therapy repositioning. The findings in this study would be helpful in clinical knowledge discovery.
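The candidate-selection step described in this abstract (remove therapies already linked to stroke, then rank the remainder by their links to stroke assessment scales) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the `therapy_scale_links` mapping, and the toy therapy and scale names are all hypothetical, and the E-utility retrieval and the manual filtering step are omitted.

```python
def rank_repositioning_candidates(stroke_therapies, all_therapies, therapy_scale_links):
    """Rank therapies not yet linked to stroke (C) by their links to stroke scales (B).

    stroke_therapies:    therapies already co-occurring with stroke (Stroke_Therapies)
    all_therapies:       the wider therapy set (All_Therapies)
    therapy_scale_links: hypothetical mapping, therapy -> iterable of linked scales
    """
    # Candidates are therapies in All_Therapies that are absent from Stroke_Therapies.
    candidates = set(all_therapies) - set(stroke_therapies)

    # Score each candidate by the number of distinct assessment scales it links to.
    link_counts = {
        therapy: len(set(therapy_scale_links.get(therapy, ())))
        for therapy in candidates
    }

    # Most links first; ties are broken alphabetically so the order is deterministic.
    return sorted(link_counts.items(), key=lambda item: (-item[1], item[0]))


# Toy, made-up inputs purely for illustration (not data from the study).
stroke_therapies = {"therapy_a", "therapy_b"}
all_therapies = {"therapy_a", "therapy_b", "therapy_c", "therapy_d", "therapy_e"}
therapy_scale_links = {
    "therapy_a": ["scale_1", "scale_2", "scale_3"],
    "therapy_c": ["scale_1", "scale_2", "scale_2"],
    "therapy_d": ["scale_1"],
}

ranked = rank_repositioning_candidates(stroke_therapies, all_therapies, therapy_scale_links)
for therapy, n_links in ranked:
    print(therapy, n_links)
```

Counting distinct scales rather than raw co-occurrences is an assumption made here for simplicity; the abstract only states that the top-ranked therapy had the most interaction links with Stroke_Scales.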
Affiliation(s)
- Guilin Meng
  - Shanghai Tenth People's Hospital, School of Medicine, Tongji University, Shanghai, China; School of Informatics Computing and Engineering, Indiana University, Bloomington, IN, United States
- Yong Huang
  - School of Informatics Computing and Engineering, Indiana University, Bloomington, IN, United States; School of Information Management, Wuhan University, Wuhan, China
- Qi Yu
  - School of Management, Shanxi Medical University, Shanxi, China
- Ying Ding
  - School of Informatics Computing and Engineering, Indiana University, Bloomington, IN, United States
- David Wild
  - School of Informatics Computing and Engineering, Indiana University, Bloomington, IN, United States
- Yanxin Zhao
  - Shanghai Tenth People's Hospital, School of Medicine, Tongji University, Shanghai, China
- Xueyuan Liu
  - Shanghai Tenth People's Hospital, School of Medicine, Tongji University, Shanghai, China
- Min Song
  - School of Informatics, Yonsei University, Seoul, South Korea
|
18
|
Gopalakrishnan V, Jha K, Jin W, Zhang A. A survey on literature based discovery approaches in biomedical domain. J Biomed Inform 2019; 93:103141. [PMID: 30857950 DOI: 10.1016/j.jbi.2019.103141] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/24/2018] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 02/06/2023]
Abstract
Literature Based Discovery (LBD) refers to the problem of inferring new and interesting knowledge by logically connecting independent fragments of information units through explicit or implicit means. This area of research, which incorporates techniques from Natural Language Processing (NLP), Information Retrieval and Artificial Intelligence, has significant potential to reduce discovery time in biomedical research fields. Formally introduced in 1986, LBD has grown to be a significant and a core task for text mining practitioners in the biomedical domain. Together with its inter-disciplinary nature, this has led researchers across domains to contribute in advancing this field of study. This survey attempts to consolidate and present the evolution of techniques in this area. We cover a variety of techniques and provide a detailed description of the problem setting, the intuition, the advantages and limitations of various influential papers. We also list the current bottlenecks in this field and provide a general direction of research activities for the future. In an effort to be comprehensive and for ease of reference for off-the-shelf users, we also list many publicly available tools for LBD. We hope this survey will act as a guide to both academic and industry (bio)-informaticians, introduce the various methodologies currently employed and also the challenges yet to be tackled.
Affiliation(s)
- Wei Jin
  - University of North Texas at Denton, TX, United States
|
19
|
Thilakaratne M, Falkner K, Atapattu T. A systematic review on literature-based discovery workflow. PeerJ Comput Sci 2019; 5:e235. [PMID: 33816888 PMCID: PMC7924697 DOI: 10.7717/peerj-cs.235] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/08/2019] [Accepted: 10/17/2019] [Indexed: 05/02/2023]
Abstract
As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), supporting automated knowledge discovery, helps facilitate this process by eliciting novel knowledge by analysing existing scientific literature. This systematic review provides a comprehensive overview of the LBD workflow by answering nine research questions related to the major components of the LBD workflow (i.e., input, process, output, and evaluation). With regards to the input component, we discuss the data types and data sources used in the literature. The process component presents filtering techniques, ranking/thresholding techniques, domains, generalisability levels, and resources. Subsequently, the output component focuses on the visualisation techniques used in LBD discipline. As for the evaluation component, we outline the evaluation techniques, their generalisability, and the quantitative measures used to validate results. To conclude, we summarise the findings of the review for each component by highlighting the possible future research directions.
Affiliation(s)
- Menasha Thilakaratne
  - Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
- Katrina Falkner
  - Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
- Thushari Atapattu
  - Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
|
20
|
Sybrandt J, Shtutman M, Safro I. Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking. PROCEEDINGS : ... IEEE INTERNATIONAL CONFERENCE ON BIG DATA. IEEE INTERNATIONAL CONFERENCE ON BIG DATA 2018; 2018:1494-1503. [PMID: 35789222 PMCID: PMC9248026 DOI: 10.1109/bigdata.2018.8622637] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/15/2023]
Abstract
The first step of many research projects is to define and rank a short list of candidates for study. In the modern rapidity of scientific progress, some turn to automated hypothesis generation (HG) systems to aid this process. These systems can identify implicit or overlooked connections within a large scientific corpus, and while their importance grows alongside the pace of science, they lack thorough validation. Without any standard numerical evaluation method, many validate general-purpose HG systems by rediscovering a handful of historical findings, and some wishing to be more thorough may run laboratory experiments based on automatic suggestions. These methods are expensive, time consuming, and cannot scale. Thus, we present a numerical evaluation framework for the purpose of validating HG systems that leverages thousands of validation hypotheses. This method evaluates a HG system by its ability to rank hypotheses by plausibility; a process reminiscent of human candidate selection. Because HG systems do not produce a ranking criteria, specifically those that produce topic models, we additionally present novel metrics to quantify the plausibility of hypotheses given topic model system output. Finally, we demonstrate that our proposed validation method aligns with real-world research goals by deploying our method within MOLIERE, our recent topic-driven HG system, in order to automatically generate a set of candidate genes related to HIV-associated neurodegenerative disease (HAND). By performing laboratory experiments based on this candidate set, we discover a new connection between HAND and Dead Box RNA Helicase 3 (DDX3). REPRODUCIBILITY code, validation data, and results can be found at sybrandt.com/2018/validation.
Affiliation(s)
- Michael Shtutman
  - University of South Carolina, Drug Discovery and Biomedical Sciences, Columbia, USA
- Ilya Safro
  - Clemson University, School of Computing, Clemson, USA
|