Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 14.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]

For:	Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 14.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]

Number

Cited by Other Article(s)

101

Ekins S, Southan C, Coffee M. Finding small molecules for the 'next Ebola'. F1000Res 2015;4:58. [PMID: 25949804 DOI: 10.12688/f1000research.6181.1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 02/26/2015] [Indexed: 01/11/2023] Open

102

Bravo À, Piñero J, Queralt-Rosinach N, Rautschka M, Furlong LI. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics 2015;16:55. [PMID: 25886734 PMCID: PMC4466840 DOI: 10.1186/s12859-015-0472-9] [Citation(s) in RCA: 116] [Impact Index Per Article: 12.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2014] [Accepted: 01/19/2015] [Indexed: 11/23/2022] Open

Abstract

Background

Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases.

Results

By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.

Conclusions

BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0472-9) contains supplementary material, which is available to authorized users.

Collapse

103

Pérez-Pérez M, Glez-Peña D, Fdez-Riverola F, Lourenço A. Marky: a tool supporting annotation consistency in multi-user and iterative document annotation projects. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2015;118:242-251. [PMID: 25480679 DOI: 10.1016/j.cmpb.2014.11.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2014] [Revised: 10/24/2014] [Accepted: 11/18/2014] [Indexed: 06/04/2023]

Abstract

BACKGROUND AND OBJECTIVES

Document annotation is a key task in the development of Text Mining methods and applications. High quality annotated corpora are invaluable, but their preparation requires a considerable amount of resources and time. Although the existing annotation tools offer good user interaction interfaces to domain experts, project management and quality control abilities are still limited. Therefore, the current work introduces Marky, a new Web-based document annotation tool equipped to manage multi-user and iterative projects, and to evaluate annotation quality throughout the project life cycle.

METHODS

At the core, Marky is a Web application based on the open source CakePHP framework. User interface relies on HTML5 and CSS3 technologies. Rangy library assists in browser-independent implementation of common DOM range and selection tasks, and Ajax and JQuery technologies are used to enhance user-system interaction.

RESULTS

Marky grants solid management of inter- and intra-annotator work. Most notably, its annotation tracking system supports systematic and on-demand agreement analysis and annotation amendment. Each annotator may work over documents as usual, but all the annotations made are saved by the tracking system and may be further compared. So, the project administrator is able to evaluate annotation consistency among annotators and across rounds of annotation, while annotators are able to reject or amend subsets of annotations made in previous rounds. As a side effect, the tracking system minimises resource and time consumption.

CONCLUSIONS

Marky is a novel environment for managing multi-user and iterative document annotation projects. Compared to other tools, Marky offers a similar visually intuitive annotation experience while providing unique means to minimise annotation effort and enforce annotation quality, and therefore corpus consistency. Marky is freely available for non-commercial use at http://sing.ei.uvigo.es/marky.

Collapse

104

Xie B, Ding Q, Wu D. Text Mining on Big and Complex Biomedical Literature. BIG DATA ANALYTICS IN BIOINFORMATICS AND HEALTHCARE 2015. [DOI: 10.4018/978-1-4666-6611-5.ch002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]

105

Machado CM, Rebholz-Schuhmann D, Freitas AT, Couto FM. The semantic web in translational medicine: current applications and future directions. Brief Bioinform 2015;16:89-103. [PMID: 24197933 PMCID: PMC4293377 DOI: 10.1093/bib/bbt079] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2013] [Accepted: 10/08/2013] [Indexed: 11/14/2022] Open

106

Nim HT, Boyd SE, Rosenthal NA. Systems approaches in integrative cardiac biology: illustrations from cardiac heterocellular signalling studies. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2014;117:69-77. [PMID: 25499442 DOI: 10.1016/j.pbiomolbio.2014.11.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2014] [Revised: 11/26/2014] [Accepted: 11/28/2014] [Indexed: 12/27/2022]

107

Yi X, Zhang Z, Ling Y, Xu W, Su Z. PNRD: a plant non-coding RNA database. Nucleic Acids Res 2014;43:D982-9. [PMID: 25398903 PMCID: PMC4383960 DOI: 10.1093/nar/gku1162] [Citation(s) in RCA: 167] [Impact Index Per Article: 16.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open

108

Jamieson DG, Moss A, Kennedy M, Jones S, Nenadic G, Robertson DL, Sidders B. The pain interactome: connecting pain-specific protein interactions. Pain 2014;155:2243-52. [PMID: 24978826 PMCID: PMC4247380 DOI: 10.1016/j.pain.2014.06.020] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2014] [Revised: 06/13/2014] [Accepted: 06/23/2014] [Indexed: 11/29/2022]

109

Grady CR, Knepper MA, Burg MB, Ferraris JD. Database of osmoregulated proteins in mammalian cells. Physiol Rep 2014;2:e12180. [PMID: 25355853 PMCID: PMC4254105 DOI: 10.14814/phy2.12180] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2014] [Revised: 09/15/2014] [Accepted: 09/29/2014] [Indexed: 11/24/2022] Open

110

Bui QC, Sloot PMA, van Mulligen EM, Kors JA. A novel feature-based approach to extract drug-drug interactions from biomedical text. Bioinformatics 2014;30:3365-71. [PMID: 25143286 DOI: 10.1093/bioinformatics/btu557] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

111

Martin-Sanchez F, Verspoor K. Big data in medicine is driving big changes. Yearb Med Inform 2014;9:14-20. [PMID: 25123716 DOI: 10.15265/iy-2014-0020] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open

112

Titz B, Elamin A, Martin F, Schneider T, Dijon S, Ivanov NV, Hoeng J, Peitsch MC. Proteomics for systems toxicology. Comput Struct Biotechnol J 2014;11:73-90. [PMID: 25379146 PMCID: PMC4212285 DOI: 10.1016/j.csbj.2014.08.004] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open

113

Sanghi A, Zaringhalam M, Corcoran CC, Saeed F, Hoffert JD, Sandoval P, Pisitkun T, Knepper MA. A knowledge base of vasopressin actions in the kidney. Am J Physiol Renal Physiol 2014;307:F747-55. [PMID: 25056354 DOI: 10.1152/ajprenal.00012.2014] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open

114

Kwon D, Kim S, Shin SY, Chatr-aryamontri A, Wilbur WJ. Assisting manual literature curation for protein-protein interactions using BioQRator. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau067. [PMID: 25052701 PMCID: PMC4105708 DOI: 10.1093/database/bau067] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Affiliation(s)

Dongseop Kwon Department of Computer Engineering, Myongji University, Yongin 449-728, South Korea, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA, Department of Biomedical Informatics, Asan Medical Center, Seoul 138-736, South Korea and Institute for Research in Immunology and Cancer, Université de Montréal, Montréal QC H3C 3J7, Canada
Sun Kim Department of Computer Engineering, Myongji University, Yongin 449-728, South Korea, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA, Department of Biomedical Informatics, Asan Medical Center, Seoul 138-736, South Korea and Institute for Research in Immunology and Cancer, Université de Montréal, Montréal QC H3C 3J7, Canada
Soo-Yong Shin Department of Computer Engineering, Myongji University, Yongin 449-728, South Korea, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA, Department of Biomedical Informatics, Asan Medical Center, Seoul 138-736, South Korea and Institute for Research in Immunology and Cancer, Université de Montréal, Montréal QC H3C 3J7, Canada
Andrew Chatr-aryamontri Department of Computer Engineering, Myongji University, Yongin 449-728, South Korea, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA, Department of Biomedical Informatics, Asan Medical Center, Seoul 138-736, South Korea and Institute for Research in Immunology and Cancer, Université de Montréal, Montréal QC H3C 3J7, Canada
W John Wilbur Department of Computer Engineering, Myongji University, Yongin 449-728, South Korea, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA, Department of Biomedical Informatics, Asan Medical Center, Seoul 138-736, South Korea and Institute for Research in Immunology and Cancer, Université de Montréal, Montréal QC H3C 3J7, Canada

Collapse

115

Jung JY, DeLuca TF, Nelson TH, Wall DP. A literature search tool for intelligent extraction of disease-associated genes. J Am Med Inform Assoc 2014;21:399-405. [PMID: 23999671 PMCID: PMC3994846 DOI: 10.1136/amiajnl-2012-001563] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2012] [Revised: 07/15/2013] [Accepted: 08/08/2013] [Indexed: 12/27/2022] Open

116

Hoehndorf R, Haendel M, Stevens R, Rebholz-Schuhmann D. Thematic series on biomedical ontologies in JBMS: challenges and new directions. J Biomed Semantics 2014;5:15. [PMID: 24602198 PMCID: PMC4006457 DOI: 10.1186/2041-1480-5-15] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2014] [Accepted: 02/09/2014] [Indexed: 01/08/2023] Open

117

Staunton L, Clancy T, Tonry C, Hernández B, Ademowo S, Dharsee M, Evans K, Parnell AC, Watson RW, Tasken KA, Pennington SR. Protein Quantification by MRM for Biomarker Validation. QUANTITATIVE PROTEOMICS 2014. [DOI: 10.1039/9781782626985-00277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]

118

Stubben CJ, Challacombe JF. Mining locus tags in PubMed Central to improve microbial gene annotation. BMC Bioinformatics 2014;15:43. [PMID: 24499370 PMCID: PMC3937057 DOI: 10.1186/1471-2105-15-43] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2013] [Accepted: 01/18/2014] [Indexed: 11/10/2022] Open

119

Pesquita C, Ferreira JD, Couto FM, Silva MJ. The epidemiology ontology: an ontology for the semantic annotation of epidemiological resources. J Biomed Semantics 2014;5:4. [PMID: 24438387 PMCID: PMC3926306 DOI: 10.1186/2041-1480-5-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2013] [Accepted: 12/24/2013] [Indexed: 11/29/2022] Open

120

Graph theory enables drug repurposing--how a mathematical model can drive the discovery of hidden mechanisms of action. PLoS One 2014;9:e84912. [PMID: 24416311 PMCID: PMC3886994 DOI: 10.1371/journal.pone.0084912] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2013] [Accepted: 11/28/2013] [Indexed: 12/21/2022] Open

121

Heinzel A, Mühlberger I, Fechete R, Mayer B, Perco P. Functional molecular units for guiding biomarker panel design. Methods Mol Biol 2014;1159:109-133. [PMID: 24788264 DOI: 10.1007/978-1-4939-0709-0_7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]

122

Link Prediction on the Semantic MEDLINE Network. DISCOVERY SCIENCE 2014. [DOI: 10.1007/978-3-319-11812-3_12] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]

123

Pavlopoulos GA, Promponas VJ, Ouzounis CA, Iliopoulos I. Biological information extraction and co-occurrence analysis. Methods Mol Biol 2014;1159:77-92. [PMID: 24788262 DOI: 10.1007/978-1-4939-0709-0_5] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022]

124

Integrative literature and data mining to rank disease candidate genes. Methods Mol Biol 2014;1159:207-26. [PMID: 24788269 DOI: 10.1007/978-1-4939-0709-0_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]

125

Imboden M, Probst-Hensch NM. Biobanking across the phenome - at the center of chronic disease research. BMC Public Health 2013;13:1094. [PMID: 24274136 PMCID: PMC4222669 DOI: 10.1186/1471-2458-13-1094] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2012] [Accepted: 09/25/2013] [Indexed: 11/10/2022] Open

Abstract

Background

Recognized public health relevant risk factors such as obesity, physical inactivity, smoking or air pollution are common to many non-communicable diseases (NCDs). NCDs cluster and co-morbidities increase in parallel to age. Pleiotropic genes and genetic variants have been identified by genome-wide association studies (GWAS) linking NCD entities hitherto thought to be distant in etiology. These different lines of evidence suggest that NCD disease mechanisms are in part shared.

Discussion

Identification of common exogenous and endogenous risk patterns may promote efficient prevention, an urgent need in the light of the global NCD epidemic. The prerequisite to investigate causal risk patterns including biologic, genetic and environmental factors across different NCDs are well characterized cohorts with associated biobanks. Prospectively collected data and biospecimen from subjects of various age, sociodemographic, and cultural groups, both healthy and affected by one or more NCD, are essential for exploring biologic mechanisms and susceptibilities interlinking different environmental and lifestyle exposures, co-morbidities, as well as cellular senescence and aging. A paradigm shift in the research activities can currently be observed, moving from focused investigations on the effect of a single risk factor on an isolated health outcome to a more comprehensive assessment of risk patterns and a broader phenome approach. Though important methodological and analytical challenges need to be resolved, the ongoing international efforts to establish large-scale population-based biobank cohorts are a critical basis for moving NCD disease etiology forward.

Summary

Future epidemiologic and public health research should aim at sustaining a comprehensive systems view on health and disease. The political and public discussions about the utilitarian aspect of investing in and contributing to cohort and biobank research are essential and are indirectly linked to the achievement of public health programs effectively addressing the global NCD epidemic.

Collapse

126

van Haagen HHHBM, 't Hoen PAC, Mons B, Schultes EA. Generic information can retrieve known biological associations: implications for biomedical knowledge discovery. PLoS One 2013;8:e78665. [PMID: 24260124 PMCID: PMC3834066 DOI: 10.1371/journal.pone.0078665] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2013] [Accepted: 09/13/2013] [Indexed: 02/01/2023] Open

Abstract

Motivation

Weighted semantic networks built from text-mined literature can be used to retrieve known protein-protein or gene-disease associations, and have been shown to anticipate associations years before they are explicitly stated in the literature. Our text-mining system recognizes over 640,000 biomedical concepts: some are specific (i.e., names of genes or proteins) others generic (e.g., ‘Homo sapiens’). Generic concepts may play important roles in automated information retrieval, extraction, and inference but may also result in concept overload and confound retrieval and reasoning with low-relevance or even spurious links. Here, we attempted to optimize the retrieval performance for protein-protein interactions (PPI) by filtering generic concepts (node filtering) or links to generic concepts (edge filtering) from a weighted semantic network. First, we defined metrics based on network properties that quantify the specificity of concepts. Then using these metrics, we systematically filtered generic information from the network while monitoring retrieval performance of known protein-protein interactions. We also systematically filtered specific information from the network (inverse filtering), and assessed the retrieval performance of networks composed of generic information alone.

Results

Filtering generic or specific information induced a two-phase response in retrieval performance: initially the effects of filtering were minimal but beyond a critical threshold network performance suddenly drops. Contrary to expectations, networks composed exclusively of generic information demonstrated retrieval performance comparable to unfiltered networks that also contain specific concepts. Furthermore, an analysis using individual generic concepts demonstrated that they can effectively support the retrieval of known protein-protein interactions. For instance the concept “binding” is indicative for PPI retrieval and the concept “mutation abnormality” is indicative for gene-disease associations.

Conclusion

Generic concepts are important for information retrieval and cannot be removed from semantic networks without negative impact on retrieval performance.

Collapse

127

Lee HJ, Shim SH, Song MR, Lee H, Park JC. CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations. BMC Bioinformatics 2013;14:323. [PMID: 24225062 PMCID: PMC3833657 DOI: 10.1186/1471-2105-14-323] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2013] [Accepted: 11/05/2013] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations.

RESULTS

In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at http://biopathway.org/CoMAGCunder the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0).

CONCLUSIONS

The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.

Collapse

128

Rebholz-Schuhmann D, Grabmüller C, Kavaliauskas S, Croset S, Woollard P, Backofen R, Filsell W, Clark D. A case study: semantic integration of gene-disease associations for type 2 diabetes mellitus from literature and biomedical data resources. Drug Discov Today 2013;19:882-9. [PMID: 24201223 DOI: 10.1016/j.drudis.2013.10.024] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2012] [Revised: 09/24/2013] [Accepted: 10/28/2013] [Indexed: 10/26/2022]

129

Rebholz-Schuhmann D, Kafkas S, Kim JH, Li C, Jimeno Yepes A, Hoehndorf R, Backofen R, Lewin I. Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources. J Biomed Semantics 2013;4:28. [PMID: 24112383 PMCID: PMC4021975 DOI: 10.1186/2041-1480-4-28] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2012] [Accepted: 09/11/2013] [Indexed: 11/10/2022] Open

Abstract

Motivation

The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: Terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring, how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.

Results

In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best, if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and – on the other hand – the profiles of the false positive mistakes characterize the tagging solutions. Lex-Tag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions.

Conclusion

The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.

Collapse

130

Rebholz-Schuhmann D, Kim JH, Yan Y, Dixit A, Friteyre C, Hoehndorf R, Backofen R, Lewin I. Evaluation and cross-comparison of lexical entities of biological interest (LexEBI). PLoS One 2013;8:e75185. [PMID: 24124474 PMCID: PMC3790750 DOI: 10.1371/journal.pone.0075185] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2012] [Accepted: 08/14/2013] [Indexed: 01/12/2023] Open

Abstract

MOTIVATION

Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal does not only require that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).

RESULT

This study compiles a resource for lexical terms of biomedical interest in a standard format (called "LexEBI"), determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants and chemical entities amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show only little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities, both do comprise enzymes leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.

CONCLUSION

LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.

Collapse

131

Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2013;19:140-4. [PMID: 24070668 DOI: 10.1016/j.drudis.2013.09.012] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2013] [Revised: 09/05/2013] [Accepted: 09/12/2013] [Indexed: 01/08/2023]

132

Rebholz-Schuhmann D, Kafkas S, Kim JH, Jimeno Yepes A, Lewin I. Monitoring named entity recognition: the League Table. J Biomed Semantics 2013;4:19. [PMID: 24034148 PMCID: PMC4015903 DOI: 10.1186/2041-1480-4-19] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2012] [Accepted: 07/25/2013] [Indexed: 01/13/2023] Open

133

Vos R, Aarts S, van Mulligen E, Metsemakers J, van Boxtel MP, Verhey F, van den Akker M. Finding potentially new multimorbidity patterns of psychiatric and somatic diseases: exploring the use of literature-based discovery in primary care research. J Am Med Inform Assoc 2013;21:139-45. [PMID: 23775174 DOI: 10.1136/amiajnl-2012-001448] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open

Abstract

BACKGROUND

Multimorbidity, the co-occurrence of two or more chronic medical conditions within a single individual, is increasingly becoming part of daily care of general medical practice. Literature-based discovery may help to investigate the patterns of multimorbidity and to integrate medical knowledge for improving healthcare delivery for individuals with co-occurring chronic conditions.

OBJECTIVE

To explore the usefulness of literature-based discovery in primary care research through the key-case of finding associations between psychiatric and somatic diseases relevant to general practice in a large biomedical literature database (Medline).

METHODS

By using literature based discovery for matching disease profiles as vectors in a high-dimensional associative concept space, co-occurrences of a broad spectrum of chronic medical conditions were matched for their potential in biomedicine. An experimental setting was chosen in parallel with expert evaluations and expert meetings to assess performance and to generate targets for integrating literature-based discovery in multidisciplinary medical research of psychiatric and somatic disease associations.

RESULTS

Through stepwise reductions a reference set of 21,945 disease combinations was generated, from which a set of 166 combinations between psychiatric and somatic diseases was selected and assessed by text mining and expert evaluation.

CONCLUSIONS

Literature-based discovery tools generate specific patterns of associations between psychiatric and somatic diseases: one subset was appraised as promising for further research; the other subset surprised the experts, leading to intricate discussions and further eliciting of frameworks of biomedical knowledge. These frameworks enable us to specify targets for further developing and integrating literature-based discovery in multidisciplinary research of general practice, psychology and psychiatry, and epidemiology.

Collapse

134

Valsesia A, Macé A, Jacquemont S, Beckmann JS, Kutalik Z. The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation. Front Genet 2013;4:92. [PMID: 23750167 PMCID: PMC3667386 DOI: 10.3389/fgene.2013.00092] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2013] [Accepted: 05/04/2013] [Indexed: 02/03/2023] Open

135

Li C, Jimeno-Yepes A, Arregui M, Kirsch H, Rebholz-Schuhmann D. PCorral--interactive mining of protein interactions from MEDLINE. Database (Oxford) 2013;2013:bat030. [PMID: 23640984 PMCID: PMC3641755 DOI: 10.1093/database/bat030] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2012] [Revised: 03/15/2013] [Accepted: 03/27/2013] [Indexed: 11/13/2022]

136

Sookoian S, Gianotti TF, Burgueño AL, Pirola CJ. Fetal metabolic programming and epigenetic modifications: a systems biology approach. Pediatr Res 2013;73:531-42. [PMID: 23314294 DOI: 10.1038/pr.2013.2] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]