1
Singh G, Papoutsoglou EA, Keijts-Lalleman F, Vencheva B, Rice M, Visser RG, Bachem CW, Finkers R. Extracting knowledge networks from plant scientific literature: potato tuber flesh color as an exemplary trait. BMC Plant Biology 2021; 21:198. PMID: 33894758; PMCID: PMC8070292; DOI: 10.1186/s12870-021-02943-5.
Abstract
BACKGROUND Scientific literature carries a wealth of information crucial for research, but only a fraction of it is available as structured information in databases and can therefore be analyzed with traditional data analysis tools. Natural language processing (NLP) is often and successfully employed to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analyses. For this pilot, we developed a pipeline that uses NLP on biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and investigated whether these knowledge networks can assist us in formulating new hypotheses on the underlying biological processes. RESULTS We trained an NLP model on a manually annotated corpus of 34 full-text potato articles to recognize relevant biological entities (genes, proteins, metabolites and traits) and relationships between them in text. This model detected biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant-genetics articles covering four major solanaceous crops (tomato, potato, eggplant and capsicum), and found that the networks contained both previously known and, at the time, unknown leads to subsequently discovered biological phenomena relating to flesh color. A novel time-based analysis of these networks indicated a connection between our trait and a candidate gene (zeaxanthin epoxidase) two years before that connection was stated explicitly in the literature. CONCLUSIONS Our time-based analysis indicates that knowledge networks extracted from the literature show promise for knowledge discovery, data integration and hypothesis generation in scientific research.
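The precision and recall figures above are standard entity-level metrics. A minimal sketch of how they are computed, with hypothetical entity spans (this is not the authors' pipeline):

```python
# Illustrative sketch (not the authors' pipeline): entity-level
# precision and recall, computed by comparing predicted entity
# spans against a gold-standard annotation.

def precision_recall(predicted, gold):
    """Both arguments are sets of (start, end, type) entity spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical annotations for one sentence about tuber flesh color.
gold = {(0, 3, "gene"), (10, 14, "trait"), (20, 26, "metabolite")}
pred = {(0, 3, "gene"), (10, 14, "trait"), (30, 35, "protein")}

p, r = precision_recall(pred, gold)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

Precision penalizes spurious predictions and recall penalizes missed gold entities; the reported 97.65%/88.91% are these quantities over the annotated training corpus.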
Affiliation(s)
- Gurnoor Singh
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700 AJ, The Netherlands
- Mark Rice
- IBM Netherlands, Amsterdam, The Netherlands
- Richard G.F. Visser
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700 AJ, The Netherlands
- Christian W.B. Bachem
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700 AJ, The Netherlands
- Richard Finkers
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700 AJ, The Netherlands
2
Legrand J, Gogdemir R, Bousquet C, Dalleau K, Devignes MD, Digan W, Lee CJ, Ndiaye NC, Petitpain N, Ringot P, Smaïl-Tabbone M, Toussaint Y, Coulet A. PGxCorpus, a manually annotated corpus for pharmacogenomics. Sci Data 2020; 7:3. PMID: 31896797; PMCID: PMC6940385; DOI: 10.1038/s41597-019-0342-9.
Abstract
Pharmacogenomics (PGx) studies how individual gene variations impact drug response phenotypes, which makes PGx-related knowledge a key component of precision medicine. A significant part of the state-of-the-art knowledge in PGx is accumulated in scientific publications, where it is hardly reusable by humans or software. Natural language processing techniques have been developed to guide experts who curate this knowledge, but existing work is limited by the absence of a high-quality annotated corpus focused on the PGx domain. In particular, this absence restricts the use of supervised machine learning. This article introduces PGxCorpus, a manually annotated corpus designed to fill this gap and to enable the automatic extraction of PGx relationships from text. It comprises 945 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs and phenotypes) and relationships between them. We present the corpus itself, its construction, and a baseline experiment that illustrates how it may be leveraged to synthesize and summarize PGx knowledge.
Affiliation(s)
- Joël Legrand
- Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
- Cédric Bousquet
- Sorbonne Université, INSERM, Université Paris 13, LIMICS, Paris, France
- Kevin Dalleau
- Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
- William Digan
- Hôpital Européen Georges Pompidou, AP-HP, Université Paris Descartes, Université Sorbonne Paris Cité, Paris, France
- INSERM UMR 1138 Equipe 22, Université Paris Descartes, Université Sorbonne Paris Cité, Paris, France
- Chia-Ju Lee
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, USA
- Nadine Petitpain
- Centre Régional de Pharmacovigilance, CHRU of Nancy, Nancy, France
- Patrice Ringot
- Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
- Adrien Coulet
- Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
3
Zhang M, Zhang M, Ge C, Liu Q, Wang J, Wei J, Zhu KQ. Automatic discovery of adverse reactions through Chinese social media. Data Min Knowl Discov 2019. DOI: 10.1007/s10618-018-00610-2.
4
Tian Z, Teng Z, Cheng S, Guo M. Computational drug repositioning using meta-path-based semantic network analysis. BMC Systems Biology 2018; 12:134. PMID: 30598084; PMCID: PMC6311940; DOI: 10.1186/s12918-018-0658-7.
Abstract
BACKGROUND Drug repositioning is a promising and efficient way to discover new indications for existing drugs, and holds great potential for precision medicine in the post-genomic era. Many network-based approaches to drug repositioning build on similarity networks that integrate multiple sources of information about drugs and diseases. However, these methods often treat all nodes as if they were of the same type and neglect the semantic meanings of different meta-paths in the heterogeneous network. A rational method for inferring new indications for approved drugs is therefore needed. RESULTS In this study, we propose a methodology named HeteSim_DrugDisease (HSDD) for predicting drug repositioning. First, we build a drug-drug similarity network and a disease-disease similarity network by integrating information about drugs and diseases. Second, a drug-disease heterogeneous network is constructed, combining the drug similarity network, the disease similarity network and the known drug-disease association network. Finally, HSDD predicts novel drug-disease associations from the HeteSim scores of different meta-paths. Experimental results show that HSDD performs significantly better than existing state-of-the-art approaches, achieving an AUC of 0.8994 in leave-one-out cross-validation. Case studies for selected drugs further illustrate its practical usefulness. CONCLUSIONS HSDD is an effective and feasible way to infer associations between drugs and diseases using meta-path-based semantic network analysis.
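The HeteSim measure that HSDD relies on can be sketched for a short drug-gene-disease meta-path: the score of a (drug, disease) pair is the cosine similarity between the drug's normalized reach vector along the left half of the path and the disease's along the reversed right half. The toy matrices below are hypothetical, and this simplification omits HSDD's similarity-network construction:

```python
# Minimal sketch of the HeteSim relevance measure that HSDD builds on,
# not the authors' implementation. For a meta-path split at its
# midpoint, the score of (source, target) is the cosine similarity
# between the two reachable-probability vectors meeting in the middle.
import math

def row_normalize(m):
    out = []
    for row in m:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical adjacency for the meta-path drug -> gene <- disease.
drug_gene = row_normalize([[1, 1, 0],    # drug 0 targets genes 0, 1
                           [0, 1, 1]])   # drug 1 targets genes 1, 2
disease_gene = row_normalize([[1, 0, 1],   # disease 0 linked to genes 0, 2
                              [0, 1, 0]])  # disease 1 linked to gene 1

def hetesim(drug, disease):
    return cosine(drug_gene[drug], disease_gene[disease])

print(round(hetesim(0, 1), 3))  # 0.707
```

Ranking all (drug, disease) pairs by such scores, aggregated over several meta-paths, is the essence of the prediction step described above.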
Affiliation(s)
- Zhen Tian
- School of Information Engineering, Zhengzhou University, Zhengzhou, 450001, People's Republic of China
- Zhixia Teng
- School of Information and Computer Engineering, Northeast Forestry University, Harbin, 150001, People's Republic of China
- Shuang Cheng
- Institute of Materials, China Academy of Engineering Physics, Jiangyou, 621907, Sichuan, China
- Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, 100044, People's Republic of China
- Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, 100044, China
5
Shameer K, Glicksberg BS, Hodos R, Johnson KW, Badgeley MA, Readhead B, Tomlinson MS, O’Connor T, Miotto R, Kidd BA, Chen R, Ma’ayan A, Dudley JT. Systematic analyses of drugs and disease indications in RepurposeDB reveal pharmacological, biological and epidemiological factors influencing drug repositioning. Brief Bioinform 2018; 19:656-678. PMID: 28200013; PMCID: PMC6192146; DOI: 10.1093/bib/bbw136.
Abstract
Growth in the global population and the rising disease burden from emerging infectious diseases (Zika virus), multidrug-resistant pathogens, drug-resistant cancers (cisplatin-resistant ovarian cancer) and chronic diseases (arterial hypertension) call for effective therapies to improve health outcomes. At the same time, the rapid increase in drug development costs demands innovative and sustainable drug discovery approaches. Drug repositioning, the discovery of new or improved therapies through the reevaluation of approved or investigational compounds, fills a significant gap in the public health setting and improves the productivity of drug development. As the number of drug repurposing investigations increases, a new opportunity has emerged to understand the factors driving drug repositioning through systematic analyses of drugs, drug targets and associated disease indications. Such analyses have so far been hampered by the lack of a centralized knowledge base, benchmarking data sets and reporting standards. To address these needs, we present RepurposeDB, a collection of repurposed drugs, drug targets and diseases assembled, indexed and annotated from public data. RepurposeDB combines information on 253 drugs [small molecules (74.30%) and protein drugs (25.29%)] and 1125 diseases. Using RepurposeDB data, we identified pharmacological (chemical descriptors; physicochemical features; and absorption, distribution, metabolism, excretion and toxicity properties), biological (protein domains, functional processes, molecular mechanisms and pathway cross talk) and epidemiological (shared genetic architectures, disease comorbidities and clinical phenotype similarities) factors mediating drug repositioning. Collectively, RepurposeDB serves as a reference database for drug repositioning investigations, and the pharmacological, biological and epidemiological principles identified from these meta-analyses could augment therapeutic development.
Affiliation(s)
- Khader Shameer
- Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA
- Benjamin S Glicksberg
- Icahn School of Medicine at Mount Sinai, Mount Sinai Health System, New York, NY, USA
- Rachel Hodos
- Icahn School of Medicine at Mount Sinai, Mount Sinai Health System, New York, NY, USA
- New York University, New York, NY, USA
- Kipp W Johnson
- Icahn School of Medicine at Mount Sinai, Mount Sinai Health System, New York, NY, USA
- Marcus A Badgeley
- Icahn School of Medicine at Mount Sinai, Mount Sinai Health System, New York, NY, USA
- Ben Readhead
- Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA
- Max S Tomlinson
- Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA
- Riccardo Miotto
- Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA
- Brian A Kidd
- Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA
- Rong Chen
- Clinical Genome Informatics, Icahn Institute of Genetics and Multiscale Biology, Mount Sinai Health System, New York, NY
- Avi Ma’ayan
- Mount Sinai Center for Bioinformatics, Mount Sinai Health System, New York, NY
- Joel T Dudley
- Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA
- Department of Genetics and Genomic Sciences, Mount Sinai Health System, New York, NY, USA
- Department of Population Health Science and Policy, Mount Sinai Health System, New York, NY, USA
- Director of Biomedical Informatics, Icahn School of Medicine at Mount Sinai, Mount Sinai Health System, New York, NY
6
Chen L, Friedman C, Finkelstein J. Automated Metabolic Phenotyping of Cytochrome Polymorphisms Using PubMed Abstract Mining. AMIA Annu Symp Proc 2018; 2017:535-544. PMID: 29854118; PMCID: PMC5977704.
Abstract
Pharmacogenetics-related publications, which are increasing rapidly, provide important new pharmacogenetics knowledge. Automated approaches to extract information of new alleles and to identify their impact on metabolic phenotypes from publications are urgently needed to facilitate personalized medicine and improve clinical outcomes. Cytochrome polymorphisms, responsible for a wide variation of drug pharmacodynamics, individual efficacy and adverse effects, have significant potential for optimizing drug therapy. A few studies have addressed specialized efforts to automatically extract cytochrome polymorphisms and their characterizations regarding metabolic phenotypes from the literature. In this paper, we present a novel rule-based text-mining system to extract metabolic phenotypes of polymorphisms from PubMed abstracts with a focus on cytochrome P450. This system is promising as it achieved a precision of 85.71% in a preliminary proof-of-concept evaluation and is expected to automatically provide up-to-date metabolic information for cytochrome polymorphisms, which is critical to advance personalized medicine and improve clinical care.
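A rule of the kind such a system might use can be sketched with a single regular expression that pairs a CYP allele with a nearby metabolizer keyword; the pattern and example sentence are hypothetical, not taken from the paper:

```python
# Hypothetical rule-based sketch in the spirit of the system described
# (not the authors' rules): a regular expression pairs a CYP450 allele
# with a nearby metabolizer-phenotype keyword in an abstract sentence.
import re

PATTERN = re.compile(
    r"(CYP\d+[A-Z]\d+\*\d+)"            # star-allele name, e.g. CYP2D6*4
    r".{0,60}?"                          # short gap of arbitrary text
    r"\b(poor|intermediate|extensive|ultrarapid)\s+metaboli[sz]er",
    re.IGNORECASE,
)

def extract_phenotypes(text):
    return [(allele, phen.lower()) for allele, phen in PATTERN.findall(text)]

sentence = ("Carriers of CYP2D6*4 were classified as poor metabolizers, "
            "whereas CYP2C19*17 carriers behaved as ultrarapid metabolizers.")
print(extract_phenotypes(sentence))
# [('CYP2D6*4', 'poor'), ('CYP2C19*17', 'ultrarapid')]
```

Real systems layer many such rules with negation handling and dictionary lookup, which is where the reported 85.71% precision comes from.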
Affiliation(s)
- Luoxin Chen
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Joseph Finkelstein
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
7
Wang P, Hao T, Yan J, Jin L. Large-scale extraction of drug-disease pairs from the medical literature. J Assoc Inf Sci Technol 2017. DOI: 10.1002/asi.23876.
Affiliation(s)
- Pengwei Wang
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
- Tianyong Hao
- Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou, China
- Jun Yan
- Microsoft Research Asia, Beijing, China
- Lianwen Jin
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
8
Abstract
Literature-based discovery systems aim to discover valuable latent connections between previously disparate research areas by analyzing the contents of their respective literatures with various intelligent computational techniques. In this paper, we review the progress of literature-based discovery research, focusing on the technical features of these systems and on evaluating their performance. Current literature-based discovery techniques fall into two general approaches: a traditional approach and an emerging one. The traditional approach, which dominates the current research landscape, comprises mainly techniques that rely on lexical statistics, knowledge-based methods and visualization to address literature-based discovery problems. Among the recently emerging approaches, by contrast, we observe new trends and paradigm shifts that are likely to shape the trajectory of the next generation of literature-based discovery systems.
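The traditional approach traces back to Swanson's "ABC" model, which a few lines of code can illustrate: if one literature links A to intermediate terms B, and a disjoint literature links those Bs to C, then A-C is a candidate hidden connection. The term lists below are hypothetical, echoing Swanson's classic fish-oil/Raynaud's example:

```python
# Illustrative sketch of the "ABC" open-discovery model underlying many
# literature-based discovery systems. The co-occurrence lists below are
# hypothetical, not mined from a real corpus.

a_links = {"fish oil": {"blood viscosity", "platelet aggregation"}}
c_links = {"Raynaud's disease": {"blood viscosity", "vasoconstriction"}}

def abc_candidates(a_links, c_links):
    """Return (A, C) pairs sharing at least one intermediate B term."""
    candidates = {}
    for a, bs in a_links.items():
        for c, bs2 in c_links.items():
            shared = bs & bs2
            if shared:
                candidates[(a, c)] = shared
    return candidates

print(abc_candidates(a_links, c_links))
# {('fish oil', "Raynaud's disease"): {'blood viscosity'}}
```

The techniques surveyed above differ mainly in how the B terms are weighted, filtered and ranked, rather than in this basic bridging scheme.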
9
Lou Y, Tu SW, Nyulas C, Tudorache T, Chalmers RJG, Musen MA. Use of ontology structure and Bayesian models to aid the crowdsourcing of ICD-11 sanctioning rules. J Biomed Inform 2017; 68:20-34. PMID: 28192233; DOI: 10.1016/j.jbi.2017.02.004.
Abstract
The International Classification of Diseases (ICD) is the de facto standard international classification for mortality reporting and for many epidemiological, clinical, and financial use cases. The next version, ICD-11, will be submitted for approval by the World Health Assembly in 2018. Unlike previous versions of ICD, in which coders mostly select single codes from pre-enumerated disease and disorder codes, ICD-11 coding will allow extensive use of multiple codes to give more detailed disease descriptions. For example, "severe malignant neoplasms of left breast" may be coded by combining a "stem code" (e.g., the code for malignant neoplasms of breast) with a variety of "extension codes" (e.g., codes for laterality and severity). The use of multiple codes (a process called post-coordination), while avoiding the pitfall of having to pre-enumerate a vast number of possible disease and qualifier combinations, risks the creation of meaningless expressions that combine stem codes with inappropriate qualifiers. To prevent this, "sanctioning rules" that define legal combinations are necessary. In this work, we developed a crowdsourcing method for obtaining sanctioning rules for the post-coordination of concepts in ICD-11. Our method exploits the hierarchical structures in the domain to improve the accuracy of the sanctioning rules and to lower the crowdsourcing cost. We used Bayesian networks to model crowd workers' skills, the accuracy of their responses, and our confidence in the acquired sanctioning rules. We applied reinforcement learning to develop an agent that continually adjusted the confidence cutoffs during the crowdsourcing process to maximize the overall quality of the sanctioning rules under a fixed budget. Finally, we performed formative evaluations using a skin-disease branch of the draft ICD-11 and demonstrated that the crowdsourced sanctioning rules replicated those defined by an expert dermatologist with high precision and recall. This work demonstrates that crowdsourcing can offer a reasonably efficient way to generate a first draft of sanctioning rules that subject matter experts can then verify and edit, relieving them of the tedium and cost of formulating the initial set of rules.
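The core idea of weighting crowd responses by worker reliability can be sketched with a simple Bayesian update. This is a toy model, not the paper's Bayesian network, and the worker accuracies are assumed known:

```python
# Toy sketch (not the paper's model) of accuracy-weighted crowd voting:
# each vote on whether a stem/extension combination is legal updates the
# posterior probability that the sanctioning rule is valid.

def posterior_valid(prior, votes):
    """votes: list of (says_valid, worker_accuracy) pairs."""
    p_valid, p_invalid = prior, 1.0 - prior
    for says_valid, acc in votes:
        if says_valid:           # a "legal" vote supports validity
            p_valid *= acc
            p_invalid *= 1.0 - acc
        else:                    # an "illegal" vote supports invalidity
            p_valid *= 1.0 - acc
            p_invalid *= acc
    return p_valid / (p_valid + p_invalid)

# Two accurate workers say "legal"; one weaker worker disagrees.
votes = [(True, 0.9), (True, 0.8), (False, 0.6)]
print(round(posterior_valid(0.5, votes), 3))  # 0.96
```

A confidence cutoff on this posterior then decides whether more votes are worth buying, which is the quantity the paper's reinforcement-learning agent tunes.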
Affiliation(s)
- Yun Lou
- Stanford University, Stanford, CA, USA
10
Rodriguez-Esteban R, Bundschus M. Text mining patents for biomedical knowledge. Drug Discov Today 2016; 21:997-1002. PMID: 27179985; DOI: 10.1016/j.drudis.2016.05.002.
Abstract
Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery.
Affiliation(s)
- Raul Rodriguez-Esteban
- Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center Basel, 4070 Basel, Switzerland
- Markus Bundschus
- Scientific & Business Information Services, Roche Diagnostics GmbH, 82377 Penzberg, Germany
11
Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health. Adv Exp Med Biol 2016; 939:139-166. PMID: 27807747; DOI: 10.1007/978-981-10-1503-8_7.
Abstract
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers with a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text, found in biomedical publications and clinical notes, is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data; likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care and will undoubtedly play an important role in the eventual realization of precision medicine.
12
An Overview of Biomolecular Event Extraction from Scientific Documents. Comput Math Methods Med 2015; 2015:571381. PMID: 26587051; PMCID: PMC4637451; DOI: 10.1155/2015/571381.
Abstract
This paper presents a review of state-of-the-art approaches to automatic extraction of biomolecular events from scientific texts. Events involving biomolecules such as genes, transcription factors, or enzymes, for example, have a central role in biological processes and functions and provide valuable information for describing physiological and pathogenesis mechanisms. Event extraction from biomedical literature has a broad range of applications, including support for information retrieval, knowledge summarization, and information extraction and discovery. However, automatic event extraction is a challenging task due to the ambiguity and diversity of natural language and higher-level linguistic phenomena, such as speculations and negations, which occur in biological texts and can lead to misunderstanding or incorrect interpretation. Many strategies have been proposed in the last decade, originating from different research areas such as natural language processing, machine learning, and statistics. This review summarizes the most representative approaches in biomolecular event extraction and presents an analysis of the current state of the art and of commonly used methods, features, and tools. Finally, current research trends and future perspectives are also discussed.
13
Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery. Brief Bioinform 2015; 17:33-42. PMID: 26420781; PMCID: PMC4719073; DOI: 10.1093/bib/bbv087.
Abstract
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.
14
Bravo À, Piñero J, Queralt-Rosinach N, Rautschka M, Furlong LI. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics 2015; 16:55. PMID: 25886734; PMCID: PMC4466840; DOI: 10.1186/s12859-015-0472-9.
Abstract
Background Current biomedical research needs to leverage the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key to identifying actionable knowledge from free text repositories. We present BeFree, a system for identifying relationships between biomedical entities, with a special focus on genes and their associated diseases. Results By exploiting morpho-syntactic information in the text, BeFree identifies gene-disease, drug-disease and drug-target associations with state-of-the-art performance, and its application to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and through integration with other data sources. BeFree succeeds in identifying genes associated with a major cause of morbidity worldwide, depression, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and their integration with current biomedical knowledge, provided interesting insights into the kind of information that can be found in the literature and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered with BeFree is collected in expert-curated databases; there is thus a pressing need for alternative strategies to manual curation that review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. Conclusions BeFree is a novel text mining system that performs competitively in identifying gene-disease, drug-disease and drug-target associations. Mining only a small fraction of MEDLINE already yields a large dataset of gene-disease associations, of which only a small proportion (2%) is recorded in curated resources, raising several issues of data prioritization and curation. We propose joint analysis of text-mined data with expert-curated data as a suitable approach to both assess data quality and highlight novel and interesting information.
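A co-occurrence baseline of the kind BeFree improves upon can be sketched as follows; the dictionaries, trigger words and sentence are hypothetical, and BeFree itself additionally exploits morpho-syntactic analysis rather than flat keyword matching:

```python
# Illustrative baseline only, far simpler than BeFree: flag a candidate
# gene-disease association when both entities co-occur in a sentence
# together with an association trigger word. All dictionaries below
# are hypothetical.
import re

GENES = {"SLC6A4", "BDNF"}
DISEASES = {"depression", "anxiety"}
TRIGGERS = re.compile(r"\b(associated|linked|implicated)\b", re.IGNORECASE)

def candidate_relations(sentence):
    tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
    genes = GENES & tokens
    diseases = {d for d in DISEASES if d in sentence.lower()}
    if genes and diseases and TRIGGERS.search(sentence):
        return sorted((g, d) for g in genes for d in diseases)
    return []

print(candidate_relations("Variants in SLC6A4 have been associated with depression."))
# [('SLC6A4', 'depression')]
```

Systems like the one described above replace the trigger test with syntactic patterns over parse trees, which is what lifts precision to a state-of-the-art level.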
Affiliation(s)
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain
- Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain
- Michael Rautschka
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain
- Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain
15
Application of text mining in the biomedical domain. Methods 2015; 74:97-106. PMID: 25641519; DOI: 10.1016/j.ymeth.2015.01.015.
Abstract
In recent years, the amount of experimental data produced in biomedical research and the number of papers published in the field have grown rapidly. To keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers turn increasingly to automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality and can nowadays be used to address a variety of research questions, ranging from de novo drug target discovery to enhanced biological interpretation of the results of high-throughput experiments. In this paper we introduce the most important techniques used for text mining and give an overview of the text mining tools currently in use and the types of problems they are typically applied to.
16
de la Iglesia D, García-Remesal M, Anguita A, Muñoz-Mármol M, Kulikowski C, Maojo V. A machine learning approach to identify clinical trials involving nanodrugs and nanodevices from ClinicalTrials.gov. PLoS One 2014; 9:e110331. PMID: 25347075; PMCID: PMC4210133; DOI: 10.1371/journal.pone.0110331.
Abstract
BACKGROUND Clinical Trials (CTs) are essential for bridging the gap between experimental research on new drugs and their clinical application. Just like CTs for traditional drugs and biologics have helped accelerate the translation of biomedical findings into medical practice, CTs for nanodrugs and nanodevices could advance novel nanomaterials as agents for diagnosis and therapy. Although there is publicly available information about nanomedicine-related CTs, the online archiving of this information is carried out without adhering to criteria that discriminate between studies involving nanomaterials or nanotechnology-based processes (nano), and CTs that do not involve nanotechnology (non-nano). Finding out whether nanodrugs and nanodevices were involved in a study from CT summaries alone is a challenging task. At the time of writing, CTs archived in the well-known online registry ClinicalTrials.gov are not easily told apart as nano or non-nano CTs, even by domain experts, owing to the lack of both a common definition of nanotechnology and of standards for reporting nanomedical experiments and results. METHODS We propose a supervised learning approach for classifying CT summaries from ClinicalTrials.gov according to whether they fall into the nano or the non-nano categories. Our method involves several stages: i) extraction and manual annotation of CTs as nano vs. non-nano, ii) pre-processing and automatic classification, and iii) performance evaluation using several state-of-the-art classifiers under different transformations of the original dataset. RESULTS AND CONCLUSIONS The performance of the best automated classifier closely matches that of experts (AUC over 0.95), suggesting that it is feasible to automatically detect the presence of nanotechnology products in CT summaries with a high degree of accuracy.
This can significantly speed up the process of finding whether reports on ClinicalTrials.gov might be relevant to a particular nanoparticle or nanodevice, which is essential to discover any precedents for nanotoxicity events or advantages for targeted drug therapy.
Affiliation(s)
- Diana de la Iglesia
- Biomedical Informatics Group, Dept. Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain
- Miguel García-Remesal
- Biomedical Informatics Group, Dept. Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain
- Alberto Anguita
- Biomedical Informatics Group, Dept. Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain
- Miguel Muñoz-Mármol
- Biomedical Informatics Group, Dept. Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain
- Casimir Kulikowski
- Department of Computer Science, Rutgers – The State University of New Jersey, Piscataway, New Jersey, United States of America
- Víctor Maojo
- Biomedical Informatics Group, Dept. Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain
17
Sampathkumar H, Chen XW, Luo B. Mining adverse drug reactions from online healthcare forums using hidden Markov model. BMC Med Inform Decis Mak 2014; 14:91. [PMID: 25341686 PMCID: PMC4283122 DOI: 10.1186/1472-6947-14-91] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2013] [Accepted: 08/18/2014] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Adverse Drug Reactions are one of the leading causes of injury or death among patients undergoing medical treatments. Not all Adverse Drug Reactions are identified before a drug is made available in the market. Current post-marketing drug surveillance methods, which are based purely on voluntary spontaneous reports, are unable to provide the early indications necessary to prevent the occurrence of such injuries or fatalities. The objective of this research is to extract reports of adverse drug side-effects from messages in online healthcare forums and use them as early indicators to assist in post-marketing drug surveillance. METHODS We treat the task of extracting adverse side-effects of drugs from healthcare forum messages as a sequence labeling problem and present a Hidden Markov Model (HMM)-based text mining system that can be used to classify a message as containing drug side-effect information and then extract the adverse side-effect mentions from it. A manually annotated dataset from http://www.medications.com is used in the training and validation of the HMM-based text mining system. RESULTS A 10-fold cross-validation on the manually annotated dataset yielded on average an F-Score of 0.76 from the HMM Classifier, in comparison to 0.575 from the Baseline classifier. Without the Plain Text Filter component as a part of the Text Processing module, the F-Score of the HMM Classifier was reduced to 0.378 on average, while absence of the HTML Filter component was found to have no impact. Reducing the Drug names dictionary size by half, on average reduced the F-Score of the HMM Classifier to 0.359, while a similar reduction to the side-effects dictionary yielded an F-Score of 0.651 on average. Adverse side-effects mined from http://www.medications.com and http://www.steadyhealth.com were found to match the Adverse Drug Reactions on the Drug Package Labels of several drugs.
In addition, some novel adverse side-effects, which can be potential Adverse Drug Reactions, were also identified. CONCLUSIONS The results from the HMM based Text Miner are encouraging to pursue further enhancements to this approach. The mined novel side-effects can act as early indicators for health authorities to help focus their efforts in post-marketing drug surveillance.
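The core of the approach above is treating side-effect extraction as HMM sequence labeling: each token is assigned a hidden state, and Viterbi decoding recovers the most likely state sequence. The toy model below is a minimal sketch of that idea, not the paper's actual system; the states, vocabulary, and all probabilities are invented for illustration.

```python
# Toy HMM sequence labeling for side-effect mentions, Viterbi-decoded.
# All states, tokens and probabilities are hypothetical.
states = ["O", "SE"]  # outside vs. side-effect mention

start_p = {"O": 0.8, "SE": 0.2}
trans_p = {"O": {"O": 0.9, "SE": 0.1}, "SE": {"O": 0.4, "SE": 0.6}}
emit_p = {
    "O":  {"took": 0.32, "and": 0.32, "felt": 0.32, "nausea": 0.02, "dizziness": 0.02},
    "SE": {"took": 0.02, "and": 0.06, "felt": 0.02, "nausea": 0.45, "dizziness": 0.45},
}

def viterbi(tokens):
    """Return the most likely state sequence for a token list."""
    # V[t][s]: best path probability ending in state s at position t.
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-6) for s in states}]
    back = []  # backpointers per position
    for tok in tokens[1:]:
        col, ptr = {}, {}
        for s in states:
            prev, score = max(
                ((p, V[-1][p] * trans_p[p][s]) for p in states),
                key=lambda x: x[1],
            )
            col[s] = score * emit_p[s].get(tok, 1e-6)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):  # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["took", "and", "felt", "nausea", "and", "dizziness"]))
# → ['O', 'O', 'O', 'SE', 'SE', 'SE']
```

Note how the strong SE-to-SE transition lets the decoder keep "and" inside the mention span "nausea and dizziness", which token-by-token classification would miss.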
Affiliation(s)
- Xue-wen Chen
- Dept. of Computer Science, Wayne State University, 48202 Detroit, USA
- Bo Luo
- EECS, University of Kansas, 66045 Lawrence, USA
18
Bui QC, Sloot PMA, van Mulligen EM, Kors JA. A novel feature-based approach to extract drug-drug interactions from biomedical text. Bioinformatics 2014; 30:3365-71. [PMID: 25143286 DOI: 10.1093/bioinformatics/btu557] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Knowledge of drug-drug interactions (DDIs) is crucial for health-care professionals to avoid adverse effects when co-administering drugs to patients. As most newly discovered DDIs are made available through scientific publications, automatic DDI extraction is highly relevant. RESULTS We propose a novel feature-based approach to extract DDIs from text. Our approach consists of three steps. First, we apply text preprocessing to convert input sentences from a given dataset into structured representations. Second, we map each candidate DDI pair from that dataset into a suitable syntactic structure. Based on that, a novel set of features is used to generate feature vectors for these candidate DDI pairs. Third, the obtained feature vectors are used to train a support vector machine (SVM) classifier. When evaluated on two DDI extraction challenge test datasets from 2011 and 2013, our system achieves F-scores of 71.1% and 83.5%, respectively, outperforming existing state-of-the-art DDI extraction systems. AVAILABILITY AND IMPLEMENTATION The source code is available for academic use at http://www.biosemantics.org/uploads/DDI.zip.
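The second step of pipelines like this one enumerates candidate drug pairs in a sentence and turns each into a feature vector before SVM classification. The sketch below illustrates only that feature-extraction step under strong simplifying assumptions: the drug lexicon, trigger words, and feature set are invented, and the real system uses syntactic structures rather than flat token windows.

```python
# Sketch of candidate DDI pair generation with toy surface features.
# DRUGS and TRIGGERS are hypothetical stand-ins for real lexicons.
from itertools import combinations

DRUGS = {"warfarin", "aspirin", "ibuprofen"}
TRIGGERS = {"increases", "inhibits", "interacts"}

def candidate_features(sentence):
    """One feature dict per unordered pair of drug mentions."""
    tokens = sentence.lower().replace(".", "").split()
    mentions = [i for i, t in enumerate(tokens) if t in DRUGS]
    feats = []
    for i, j in combinations(mentions, 2):
        between = tokens[i + 1:j]
        feats.append({
            "pair": (tokens[i], tokens[j]),
            "distance": j - i - 1,  # tokens between the two mentions
            "trigger_between": any(t in TRIGGERS for t in between),
        })
    return feats

for f in candidate_features("Aspirin increases the effect of warfarin."):
    print(f)
# → {'pair': ('aspirin', 'warfarin'), 'distance': 4, 'trigger_between': True}
```

In a full system each such dict would be vectorized and passed to an SVM trained on annotated DDI corpora.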
Affiliation(s)
- Quoc-Chinh Bui
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Informatics Institute, University of Amsterdam, The Netherlands, Complexity Institute, Nanyang Technological University, Singapore and ITMO University, St. Petersburg, Russian Federation
- Peter M A Sloot
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Informatics Institute, University of Amsterdam, The Netherlands, Complexity Institute, Nanyang Technological University, Singapore and ITMO University, St. Petersburg, Russian Federation
- Erik M van Mulligen
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Informatics Institute, University of Amsterdam, The Netherlands, Complexity Institute, Nanyang Technological University, Singapore and ITMO University, St. Petersburg, Russian Federation
- Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Informatics Institute, University of Amsterdam, The Netherlands, Complexity Institute, Nanyang Technological University, Singapore and ITMO University, St. Petersburg, Russian Federation
19
Abstract
OBJECTIVES To summarise current research that takes advantage of "Big Data" in health and biomedical informatics applications. METHODS Survey of trends in this work, and exploration of literature describing how large-scale structured and unstructured data sources are being used to support applications from clinical decision making and health policy, to drug design and pharmacovigilance, and further to systems biology and genetics. RESULTS The survey highlights ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas. Consideration of this body of work identifies several important paradigm shifts that are facilitated by Big Data resources and methods: in clinical and translational research, from hypothesis-driven research to data-driven research, and in medicine, from evidence-based practice to practice-based evidence. CONCLUSIONS The increasing scale and availability of large quantities of health data require strategies for data management, data linkage, and data integration beyond the limits of many existing information systems, and substantial effort is underway to meet those needs. As our ability to make sense of that data improves, the value of the data will continue to increase. Health systems, genetics and genomics, and population and public health: all areas of biomedicine stand to benefit from Big Data and the associated technologies.
Affiliation(s)
- F Martin-Sanchez
- Health and Biomedical Informatics Centre, The University of Melbourne, Parkville VIC 3010, Australia
20
Hasegawa T, Nagasaki M, Yamaguchi R, Imoto S, Miyano S. An efficient method of exploring simulation models by assimilating literature and biological observational data. Biosystems 2014; 121:54-66. [PMID: 24907678 DOI: 10.1016/j.biosystems.2014.06.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2014] [Revised: 04/09/2014] [Accepted: 06/01/2014] [Indexed: 11/26/2022]
Abstract
Recently, several biological simulation models of, e.g., gene regulatory networks and metabolic pathways, have been constructed based on existing knowledge of biomolecular reactions, e.g., DNA-protein and protein-protein interactions. However, since these do not always contain all necessary molecules and reactions, their simulation results can be inconsistent with observational data. Therefore, improvements in such simulation models are urgently required. A previously reported method created multiple candidate simulation models by partially modifying existing models. However, this approach was computationally costly and could not handle a large number of candidates that are required to find models whose simulation results are highly consistent with the data. In order to overcome the problem, we focused on the fact that the qualitative dynamics of simulation models are highly similar if they share a certain amount of regulatory structures. This indicates that better fitting candidates tend to share the basic regulatory structure of the best fitting candidate, which can best predict the data among candidates. Thus, instead of evaluating all candidates, we propose an efficient explorative method that can selectively and sequentially evaluate candidates based on the similarity of their regulatory structures. Furthermore, in estimating the parameter values of a candidate, e.g., synthesis and degradation rates of mRNA, for the data, those of the previously evaluated candidates can be utilized. The method is applied here to the pharmacogenomic pathways for corticosteroids in rats, using time-series microarray expression data. In the performance test, we succeeded in obtaining more than 80% of consistent solutions within 15% of the computational time as compared to the comprehensive evaluation. Then, we applied this approach to 142 literature-recorded simulation models of corticosteroid-induced genes, and consequently selected 134 newly constructed better models. 
The method described here was found to be capable of efficiently exploring candidate simulation models and obtaining better models within a short span of time. Furthermore, the results suggest that there may be room for improvement in literature-recorded pathways and that they can be systematically updated using biological observational data.
Affiliation(s)
- Takanori Hasegawa
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan.
- Masao Nagasaki
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 6-3-09 Aoba, Aramaki, Aoba-ku, Sendai, Japan.
- Rui Yamaguchi
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, Japan.
- Seiya Imoto
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, Japan.
- Satoru Miyano
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, Japan.
21
Bravo À, Cases M, Queralt-Rosinach N, Sanz F, Furlong LI. A knowledge-driven approach to extract disease-related biomarkers from the literature. BIOMED RESEARCH INTERNATIONAL 2014; 2014:253128. [PMID: 24839601 PMCID: PMC4009255 DOI: 10.1155/2014/253128] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/03/2013] [Revised: 02/17/2014] [Accepted: 02/20/2014] [Indexed: 12/16/2022]
Abstract
The biomedical literature represents a rich source of biomarker information. However, both the size of literature databases and their lack of standardization hamper the automatic exploitation of the information contained in these resources. Text mining approaches have proven to be useful for the exploitation of information contained in scientific publications. Here, we show that a knowledge-driven text mining approach can exploit a large literature database to extract a dataset of biomarkers related to diseases covering all therapeutic areas. Our methodology takes advantage of the annotation of MEDLINE publications pertaining to biomarkers with MeSH terms, narrowing the search to specific publications and, therefore, minimizing the false positive ratio. It is based on a dictionary-based named entity recognition system and a relation extraction module. The application of this methodology resulted in the identification of 131,012 disease-biomarker associations between 2,803 genes and 2,751 diseases, and represents a valuable knowledge base for those interested in disease-related biomarkers. Additionally, we present a bibliometric analysis of the journals reporting biomarker related information during the last 40 years.
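The two components named above, dictionary-based named entity recognition followed by relation extraction, can be sketched in a few lines. This is a toy illustration only: the dictionaries below are tiny invented stand-ins for the MeSH and gene vocabularies the paper uses, and the relation step here is simple sentence-level co-occurrence rather than the paper's actual extraction module.

```python
# Dictionary-based NER plus sentence-level co-occurrence relation extraction.
# GENES and DISEASES are hypothetical micro-dictionaries for illustration.
import re

GENES = {"brca1", "tp53"}
DISEASES = {"breast cancer", "lung cancer"}

def find_entities(sentence, vocab):
    """Return all dictionary terms appearing in the lowercased sentence."""
    low = sentence.lower()
    return {term for term in vocab if term in low}

def extract_associations(text):
    """Pair a gene with a disease when both occur in the same sentence."""
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_entities(sentence, GENES)
        diseases = find_entities(sentence, DISEASES)
        pairs.update((g, d) for g in genes for d in diseases)
    return pairs

text = ("BRCA1 mutations are a biomarker for breast cancer. "
        "TP53 is frequently studied.")
print(extract_associations(text))  # → {('brca1', 'breast cancer')}
```

Restricting the pairing to single sentences is a crude precision filter, playing a role loosely analogous to the MeSH-based narrowing described in the abstract.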
Affiliation(s)
- À. Bravo
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- M. Cases
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- N. Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- F. Sanz
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- L. I. Furlong
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
22
Jimeno Yepes A, Verspoor K. Literature mining of genetic variants for curation: quantifying the importance of supplementary material. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau003. [PMID: 24520105 PMCID: PMC3920087 DOI: 10.1093/database/bau003] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains 'all of the information', and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia and Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
23
Rebholz-Schuhmann D, Kafkas S, Kim JH, Li C, Jimeno Yepes A, Hoehndorf R, Backofen R, Lewin I. Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources. J Biomed Semantics 2013; 4:28. [PMID: 24112383 PMCID: PMC4021975 DOI: 10.1186/2041-1480-4-28] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2012] [Accepted: 09/11/2013] [Indexed: 11/10/2022] Open
Abstract
Motivation The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: Terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring, how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs. Results In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best, if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and – on the other hand – the profiles of the false positive mistakes characterize the tagging solutions. 
LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions. Conclusion The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
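The corpus comparisons above rest on exact-match precision, recall, and F1-measure of tagger output against a gold standard corpus. A minimal sketch of that scoring follows, where each annotation is represented as a (start, end) character span; the example spans are invented.

```python
# Exact-match precision/recall/F1 of predicted spans against gold spans.

def prf1(gold, predicted):
    """Score predicted annotation spans against gold-standard spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact span matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 5), (10, 14), (20, 28)]  # gold-standard PGN spans (hypothetical)
pred = [(0, 5), (10, 15), (20, 28)]  # tagger output with one boundary error
print(prf1(gold, pred))  # precision = recall = F1 = 2/3 here
```

Exact matching is deliberately strict: the single boundary error above costs both a false positive and a false negative, which is why partial-match variants are sometimes reported alongside it.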
24
Trugenberger CA, Wälti C, Peregrim D, Sharp ME, Bureeva S. Discovery of novel biomarkers and phenotypes by semantic technologies. BMC Bioinformatics 2013; 14:51. [PMID: 23402646 PMCID: PMC3605201 DOI: 10.1186/1471-2105-14-51] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2012] [Accepted: 02/01/2013] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Biomarkers and target-specific phenotypes are important to targeted drug design and individualized medicine, thus constituting an important aspect of modern pharmaceutical research and development. More and more, the discovery of relevant biomarkers is aided by in silico techniques based on applying data mining and computational chemistry on large molecular databases. However, there is an even larger source of valuable information available that can potentially be tapped for such discoveries: repositories constituted by research documents. RESULTS This paper reports on a pilot experiment to discover potential novel biomarkers and phenotypes for diabetes and obesity by self-organized text mining of about 120,000 PubMed abstracts, public clinical trial summaries, and internal Merck research documents. These documents were directly analyzed by the InfoCodex semantic engine, without prior human manipulations such as parsing. Recall and precision against established, but different benchmarks lie in ranges up to 30% and 50% respectively. Retrieval of known entities missed by other traditional approaches could be demonstrated. Finally, the InfoCodex semantic engine was shown to discover new diabetes and obesity biomarkers and phenotypes. Amongst these were many interesting candidates with a high potential, although noticeable noise (uninteresting or obvious terms) was generated. CONCLUSIONS The reported approach of employing autonomous self-organising semantic engines to aid biomarker discovery, supplemented by appropriate manual curation processes, shows promise. Conservatively, it offers a faster alternative to vocabulary-curation processes that depend on humans reading and analyzing all the texts; more optimistically, it could impact pharmaceutical research, for example by shortening the time-to-market of novel drugs or speeding up early recognition of dead ends and adverse reactions.
Affiliation(s)
- Carlo A Trugenberger
- InfoCodex AG, Semantic Technologies, Bahnhofstrasse 50, Buchs (SG), CH-9470, Switzerland
- Christoph Wälti
- InfoCodex AG, Semantic Technologies, Bahnhofstrasse 50, Buchs (SG), CH-9470, Switzerland
- David Peregrim
- Merck Research Laboratories, 126 East Lincoln Avenue, Rahway, NJ 07065, USA
- Mark E Sharp
- Merck Research Laboratories, 126 East Lincoln Avenue, Rahway, NJ 07065, USA
- Svetlana Bureeva
- Thomson Reuters, 5901 Priestly Drive, STE 200, Carlsbad, CA, 92008, USA