1. Browaeys R, Saelens W, Saeys Y. NicheNet: modeling intercellular communication by linking ligands to target genes. Nat Methods 2019; 17:159-162. [DOI: 10.1038/s41592-019-0667-5] [Citation(s) in RCA: 408] [Impact Index Per Article: 81.6] [Received: 03/05/2019] [Accepted: 10/29/2019] [Indexed: 12/15/2022]
2. Belenahalli Shekarappa S, Kandagalla S, H Malojirao V, G.S PK, B.T P, Hanumanthappa M. A systems biology approach to identify the key targets of curcumin and capsaicin that downregulate pro-inflammatory pathways in human monocytes. Comput Biol Chem 2019; 83:107162. [DOI: 10.1016/j.compbiolchem.2019.107162] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/27/2019] [Revised: 10/25/2019] [Accepted: 11/07/2019] [Indexed: 12/17/2022]
3. Buchan DWA, Jones DT. Learning a functional grammar of protein domains using natural language word embedding techniques. Proteins 2019; 88:616-624. [PMID: 31703152 DOI: 10.1002/prot.25842] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 07/10/2019] [Revised: 10/08/2019] [Accepted: 11/03/2019] [Indexed: 11/10/2022]
Abstract
In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered "words." Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
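The "proteins as sentences" idea can be illustrated with a minimal sketch of the preprocessing a skip-gram Word2vec model would consume. The abstract does not state the model variant, window size, or data layout, so the function name `skipgram_pairs`, the window of 1, and the two example domain architectures are assumptions made purely for illustration; no embedding is trained here.

```python
from __future__ import annotations


def skipgram_pairs(sentence: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Return (target, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


# Hypothetical input: each protein is an ordered list of Pfam accessions,
# i.e. one "sentence" whose "words" are domain identifiers.
proteins = {
    "protein_A": ["PF00017", "PF00018", "PF00069"],
    "protein_B": ["PF00018", "PF00069"],
}

# Flatten all architectures into the training pairs a skip-gram model would see.
training_pairs = [
    pair
    for architecture in proteins.values()
    for pair in skipgram_pairs(architecture, window=1)
]
```

Domains that repeatedly share neighbours produce overlapping context sets, which is the signal an embedding model would use to place them close together in the vector space.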
Affiliation(s)
- Daniel W A Buchan
  - Department of Computer Science, University College London, London, UK
- David T Jones
  - Department of Computer Science, University College London, London, UK
4. Profiti G, Martelli PL, Casadio R. The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation. Nucleic Acids Res 2019; 45:W285-W290. [PMID: 28453653 PMCID: PMC5570247 DOI: 10.1093/nar/gkx330] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 01/31/2017] [Accepted: 04/18/2017] [Indexed: 01/03/2023]
Abstract
BAR 3.0 updates our server BAR (Bologna Annotation Resource) for predicting protein structural and functional features from sequence. We increase data volume, query capabilities and information conveyed to the user. The core of BAR 3.0 is a graph-based clustering procedure of UniProtKB sequences, following strict pairwise similarity criteria (sequence identity ≥40% with alignment coverage ≥90%). Each cluster contains the available annotation downloaded from UniProtKB, GO, PFAM and PDB. After statistical validation, GO terms and PFAM domains are cluster-specific and annotate new sequences entering the cluster after satisfying similarity constraints. BAR 3.0 includes 28 869 663 sequences in 1 361 773 clusters, of which 22.2% (22 241 661 sequences) and 47.4% (24 555 055 sequences) have at least one validated GO term and one PFAM domain, respectively. 1.4% of the clusters (36% of all sequences) include PDB structures and the cluster is associated to a hidden Markov model that allows building template-target alignment suitable for structural modeling. Some other 3 399 026 sequences are singletons. BAR 3.0 offers an improved search interface, allowing queries by UniProtKB-accession, Fasta sequence, GO-term, PFAM-domain, organism, PDB and ligand/s. When evaluated on the CAFA2 targets, BAR 3.0 largely outperforms our previous version and scores among state-of-the-art methods. BAR 3.0 is publicly available and accessible at http://bar.biocomp.unibo.it/bar3.
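The clustering criteria quoted above (sequence identity ≥40% with alignment coverage ≥90%) can be sketched as a small graph procedure. The abstract does not specify how clusters are derived from the similarity graph, so treating clusters as connected components is an assumption, as are the function name `cluster_sequences`, the tuple layout of `hits`, and the toy data.

```python
def cluster_sequences(ids, hits, min_identity=0.40, min_coverage=0.90):
    """Group sequences into connected components of the similarity graph.

    `hits` holds (seq_a, seq_b, identity, coverage) tuples; an edge is kept
    only if both thresholds are met. This layout is hypothetical.
    """
    parent = {s: s for s in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    groups = {}
    for s in ids:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(group) for group in groups.values())


# Toy pairwise alignment results (made up for illustration).
ids = ["s1", "s2", "s3", "s4"]
hits = [
    ("s1", "s2", 0.55, 0.95),  # passes both thresholds
    ("s2", "s3", 0.42, 0.91),  # passes both thresholds
    ("s3", "s4", 0.80, 0.60),  # high identity but insufficient coverage
]
clusters = cluster_sequences(ids, hits)
```

In this toy input, `s4` fails the coverage constraint and is left alone in its own group, which corresponds to what the abstract calls a singleton.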
Affiliation(s)
- Giuseppe Profiti
  - Biocomputing Group, BiGeA/CIG, 'Luigi Galvani' Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics and Biocomplexity, University of Bologna, Bologna 40126, Italy
- Pier Luigi Martelli
  - Biocomputing Group, BiGeA/CIG, 'Luigi Galvani' Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics and Biocomplexity, University of Bologna, Bologna 40126, Italy
- Rita Casadio
  - Biocomputing Group, BiGeA/CIG, 'Luigi Galvani' Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics and Biocomplexity, University of Bologna, Bologna 40126, Italy
5. Hoyt CT, Domingo-Fernández D, Aldisi R, Xu L, Kolpeja K, Spalek S, Wollert E, Bachman J, Gyori BM, Greene P, Hofmann-Apitius M. Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database (Oxford) 2019; 2019:baz068. [PMID: 31225582 PMCID: PMC6587072 DOI: 10.1093/database/baz068] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 01/31/2019] [Revised: 04/03/2019] [Accepted: 04/29/2019] [Indexed: 12/23/2022]
Abstract
The rapid accumulation of new biomedical literature not only causes curated knowledge graphs (KGs) to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich KGs. We have developed two workflows: one for re-curating a given KG to assure its syntactic and semantic quality and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the KGs encoded in Biological Expression Language from the NeuroMMSig database using content that was pre-extracted from MEDLINE abstracts and PubMed Central full-text articles using text mining output integrated by INDRA. We have made this workflow freely available at https://github.com/bel-enrichment/bel-enrichment.
Affiliation(s)
- Charles Tapley Hoyt
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Daniel Domingo-Fernández
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Rana Aldisi
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Lingling Xu
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Kristian Kolpeja
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Sandra Spalek
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Esther Wollert
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- John Bachman
  - Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, USA
- Benjamin M Gyori
  - Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, USA
- Patrick Greene
  - Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, USA
- Martin Hofmann-Apitius
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
6. Savojardo C, Martelli P, Fariselli P, Profiti G, Casadio R. BUSCA: an integrative web server to predict subcellular localization of proteins. Nucleic Acids Res 2018; 46:W459-W466. [PMID: 29718411 PMCID: PMC6031068 DOI: 10.1093/nar/gky320] [Citation(s) in RCA: 249] [Impact Index Per Article: 41.5] [Received: 01/24/2018] [Revised: 04/12/2018] [Accepted: 04/17/2018] [Indexed: 12/28/2022]
Abstract
Here, we present BUSCA (http://busca.biocomp.unibo.it), a novel web server that integrates different computational tools for predicting protein subcellular localization. BUSCA combines methods for identifying signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware) with tools for discriminating subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). Outcomes from the different tools are processed and integrated for annotating subcellular localization of both eukaryotic and bacterial protein sequences. We benchmark BUSCA against protein targets derived from recent CAFA experiments and other specific data sets, reporting performance at the state-of-the-art. BUSCA scores better than all other evaluated methods on 2732 targets from CAFA2, with a F1 value equal to 0.49 and among the best methods when predicting targets from CAFA3. We propose BUSCA as an integrated and accurate resource for the annotation of protein subcellular localization.
Affiliation(s)
- Castrense Savojardo
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
- Pier Luigi Martelli
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
- Piero Fariselli
  - Department of Comparative Biomedicine and Food Science, University of Padova, Padova 35020, Italy
- Giuseppe Profiti
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
  - Institute of Biomembrane, Bioenergetics and Molecular Biotechnologies, Italian National Research Council (CNR), Bari 70126, Italy
- Rita Casadio
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
  - Institute of Biomembrane, Bioenergetics and Molecular Biotechnologies, Italian National Research Council (CNR), Bari 70126, Italy
7. Roth A, Subramanian S, Ganapathiraju MK. Towards Extracting Supporting Information About Predicted Protein-Protein Interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1239-1246. [PMID: 26672046 DOI: 10.1109/tcbb.2015.2505278] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
One of the goals of relation extraction is to identify protein-protein interactions (PPIs) in biomedical literature. Current systems capture binary relations as well as the direction and type of an interaction. Besides assisting in the curation of PPIs into databases, there has been little real-world application of these algorithms. We describe UPSITE, a text mining tool for extracting evidence in support of a hypothesized interaction. Given a predicted PPI, UPSITE uses a binary relation detector to check whether the PPI is found in PubMed abstracts. If it is not found, UPSITE retrieves documents relevant to each of the two proteins separately, extracts contextual information about biological events surrounding each protein, and calculates the semantic similarity of the two proteins to provide evidential support for the predicted PPI. In evaluations, relation extraction achieved an F-score of 0.88 on the HPRD50 corpus, and semantic similarity measured with angular distance was found to be statistically significant. With the development of PPI prediction algorithms, the burden of interpreting the validity and relevance of novel PPIs is on biologists. We suggest that presenting annotations of the two proteins in a PPI side by side, together with a score that quantifies their similarity, lessens this burden to some extent.
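The similarity score mentioned above can be illustrated with a short sketch of angular distance between two feature vectors. The abstract does not give the formula or the vector representation UPSITE uses, so this uses a common definition (the angle between the vectors divided by π, giving a value in [0, 1]); the function name and the toy annotation-count vectors are assumptions.

```python
import math


def angular_distance(u, v):
    """Angle between two vectors, scaled to [0, 1] (0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine) / math.pi


# Hypothetical term-count vectors for two proteins over the same vocabulary.
protein_a = [3, 0, 1, 2]
protein_b = [2, 0, 1, 3]
distance = angular_distance(protein_a, protein_b)
similarity = 1.0 - distance
```

Unlike raw cosine similarity, the scaled angle is a proper distance metric, which is one reason it is sometimes preferred for comparing sparse count vectors.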
8. You R, Huang X, Zhu S. DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation. Methods 2018; 145:82-90. [PMID: 29883746 DOI: 10.1016/j.ymeth.2018.05.026] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 02/13/2018] [Revised: 04/30/2018] [Accepted: 05/31/2018] [Indexed: 11/16/2022]
Abstract
As of April 2018, UniProtKB has collected more than 115 million protein sequences. Less than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. Previous studies conclude that sequence homology-based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found to be very helpful for AFP. Information sources other than sequences, such as text, may also be useful for AFP. Instead of using the BOW (bag of words) representation of traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on the benchmark dataset extracted from UniProt/SwissProt have demonstrated that DeepText2GO significantly outperformed both text-based and sequence-based methods, validating its superiority.
Affiliation(s)
- Ronghui You
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, China
  - Center for Computational System Biology, ISTBI, Fudan University, Shanghai 200433, China
- Xiaodi Huang
  - School of Computing and Mathematics, Charles Sturt University, Albury, NSW 2640, Australia
- Shanfeng Zhu
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, China
  - Center for Computational System Biology, ISTBI, Fudan University, Shanghai 200433, China
9. Lee D, Cho KH. Topological estimation of signal flow in complex signaling networks. Sci Rep 2018; 8:5262. [PMID: 29588498 PMCID: PMC5869720 DOI: 10.1038/s41598-018-23643-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/10/2017] [Accepted: 03/16/2018] [Indexed: 12/15/2022]
Abstract
In a cell, any information about extra- or intra-cellular changes is transferred and processed through a signaling network, and dysregulation of signal flow often leads to diseases such as cancer. Understanding signal flow in the signaling network is therefore critical for identifying drug targets. Owing to the development of high-throughput measurement technologies, the structure of signaling networks is becoming more available, but detailed kinetic parameter information about molecular interactions is still very limited. A question then arises as to whether we can estimate the signal flow based only on the structural information of a signaling network. To answer this question, we develop a novel algorithm that can estimate the signal flow using only the topological information and apply it to predict the direction of activity change in various signaling networks. Interestingly, we find that the average accuracy of the estimation algorithm is about 60–80% even though we only use the topological information. We also find that this predictive power collapses if we randomly alter the network topology, showing the importance of network topology. Our study provides a basis for utilizing the topological information of signaling networks in precision medicine or drug target discovery.
Affiliation(s)
- Daewon Lee
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
- Kwang-Hyun Cho
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
10. Rifaioglu AS, Doğan T, Saraç ÖS, Ersahin T, Saidi R, Atalay MV, Martin MJ, Cetin-Atalay R. Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins 2017; 86:135-151. [PMID: 29098713 DOI: 10.1002/prot.25416] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 09/09/2017] [Revised: 10/24/2017] [Accepted: 11/01/2017] [Indexed: 12/24/2022]
Abstract
Recent advances in computing power and machine learning empower functional annotation of protein sequences and their transcript variations. Here, we present UniGOPred, an automated prediction system for GO annotations, and a database of GO term predictions for proteomes of several organisms in UniProt Knowledgebase (UniProtKB). UniGOPred provides function predictions for 514 molecular function (MF), 2909 biological process (BP), and 438 cellular component (CC) GO terms for each protein sequence. UniGOPred covers nearly the whole functionality spectrum in the Gene Ontology system and can predict both generic and specific GO terms. UniGOPred was run on CAFA2 challenge target protein sequences and ranks within the top 10 best performing methods for the molecular function category. In addition, the performance of UniGOPred is higher than that of the baseline BLAST classifier in all categories of GO. UniGOPred predictions are compared with UniProtKB/TrEMBL database annotations as well. Furthermore, the proposed tool's ability to predict negatively associated GO terms, which define the functions that a protein does not possess, is discussed. UniGOPred annotations were also validated by case studies, experimentally on PTEN protein variants and with literature on CHD8 protein variants. The UniGOPred protein functional annotation system is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.
Affiliation(s)
- Ahmet Sureyya Rifaioglu
  - Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey
  - Department of Computer Engineering, İskenderun Technical University, Hatay, 31200, Turkey
- Tunca Doğan
  - Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Ömer Sinan Saraç
  - Department of Computer Engineering, Istanbul Technical University, İstanbul, 34467, Turkey
- Tulin Ersahin
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Rabie Saidi
  - Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Mehmet Volkan Atalay
  - Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey
- Maria Jesus Martin
  - Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Rengul Cetin-Atalay
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
11.
Abstract
Selecting and filtering a reference expression and interaction dataset when studying specific pathways and regulatory interactions can be a very time-consuming and error-prone task. In order to reduce the duplicated efforts required to amass such datasets, we have created the CORNET (CORrelation NETworks) platform, which allows for easy access to a wide variety of data types: coexpression data, protein-protein interactions, regulatory interactions, and functional annotations. The CORNET platform outputs its results in either text format or through the Cytoscape framework, which is automatically launched by the CORNET website. CORNET 3.0 is the third iteration of the web platform designed for the user exploration of the coexpression space of plant genomes, with a focus on the model species Arabidopsis thaliana. Here we describe the platform: the tools, data, and best practices when using the platform. We indicate how the platform can be used to infer networks from a set of input genes, such as upregulated genes from an expression experiment. By exploring the network, new target and regulator genes can be discovered, allowing for follow-up experiments and more in-depth study. We also indicate how to avoid common pitfalls when evaluating the networks and how to avoid overinterpretation of the results. All CORNET versions are available at http://bioinformatics.psb.ugent.be/cornet/.
Affiliation(s)
- Michiel Van Bel
  - Department of Plant Systems Biology, VIB, Technologiepark 927, 9052, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052, Ghent, Belgium
- Frederik Coppens
  - Department of Plant Systems Biology, VIB, Technologiepark 927, 9052, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052, Ghent, Belgium
12. Mehryary F, Kaewphan S, Hakala K, Ginter F. Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification. J Biomed Semantics 2016; 7:27. [PMID: 27175227 PMCID: PMC4864999 DOI: 10.1186/s13326-016-0070-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/28/2015] [Accepted: 05/01/2016] [Indexed: 11/19/2022]
Abstract
Background: Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Methods: Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature and containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of the trigger word classification produced by the unsupervised clustering method and manual annotation. Results: The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not bound solely to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
Availability: The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
Affiliation(s)
- Farrokh Mehryary
  - Department of Information Technology, University of Turku, Turku, Finland
  - The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
- Suwisa Kaewphan
  - Department of Information Technology, University of Turku, Turku, Finland
  - The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
  - Turku Centre for Computer Science (TUCS), Turku, Finland
- Kai Hakala
  - Department of Information Technology, University of Turku, Turku, Finland
  - The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
- Filip Ginter
  - Department of Information Technology, University of Turku, Turku, Finland
13. Wu C, Schwartz JM, Brabant G, Peng SL, Nenadic G. Constructing a molecular interaction network for thyroid cancer via large-scale text mining of gene and pathway events. BMC Systems Biology 2015; 9 Suppl 6:S5. [PMID: 26679379 PMCID: PMC4674859 DOI: 10.1186/1752-0509-9-s6-s5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/04/2022]
Abstract
Background: Biomedical studies need assistance from automated tools and easily accessible data to address the problem of the rapidly accumulating literature. Text-mining tools and curated databases have been developed to address such needs, and they can be applied to improve the understanding of the molecular pathogenesis of complex diseases like thyroid cancer. Results: We have developed a system, PWTEES, which extracts pathway interactions from the literature utilizing an existing event extraction tool (TEES) and pathway named entity recognition (PathNER). We then applied the system to a thyroid cancer corpus and systematically extracted molecular interactions involving either genes or pathways. With the extracted information, we constructed a molecular interaction network taking genes and pathways as nodes. Using curated pathway information and network topological analyses, we highlight key genes and pathways involved in thyroid carcinogenesis. Conclusions: Mining events involving genes and pathways from the literature and integrating curated pathway knowledge can help improve the understanding of molecular interactions of complex diseases. The system developed for this study can be applied in studies other than thyroid cancer. The source code is freely available online at https://github.com/chengkun-wu/PWTEES.
14. Hakala K, Van Landeghem S, Salakoski T, Van de Peer Y, Ginter F. Application of the EVEX resource to event extraction and network construction: Shared Task entry and result analysis. BMC Bioinformatics 2015; 16 Suppl 16:S3. [PMID: 26551766 PMCID: PMC4642107 DOI: 10.1186/1471-2105-16-s16-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
Background: Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to the information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents, can be utilized to improve the performance of the state-of-the-art event extraction systems. Results: In the GE task, our re-ranking approach led to a modest performance increase and resulted in the first rank of the official Shared Task results with 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge. Conclusions: For the GRN task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates.
15. Verkest A, Byzova M, Martens C, Willems P, Verwulgen T, Slabbinck B, Rombaut D, Van de Velde J, Vandepoele K, Standaert E, Peeters M, Van Lijsebettens M, Van Breusegem F, De Block M. Selection for Improved Energy Use Efficiency and Drought Tolerance in Canola Results in Distinct Transcriptome and Epigenome Changes. Plant Physiology 2015; 168:1338-50. [PMID: 26082400 PMCID: PMC4528734 DOI: 10.1104/pp.15.00155] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 02/02/2015] [Accepted: 06/11/2015] [Indexed: 05/21/2023]
Abstract
To increase both the yield potential and stability of crops, integrated breeding strategies are used that have mostly a direct genetic basis, but the utility of epigenetics to improve complex traits is unclear. A better understanding of the status of the epigenome and its contribution to agronomic performance would help in developing approaches to incorporate the epigenetic component of complex traits into breeding programs. Starting from isogenic canola (Brassica napus) lines, epilines were generated by selecting, repeatedly for three generations, for increased energy use efficiency and drought tolerance. These epilines had an enhanced energy use efficiency, drought tolerance, and nitrogen use efficiency. Transcriptome analysis of the epilines and a line selected for its energy use efficiency solely revealed common differentially expressed genes related to the onset of stress tolerance-regulating signaling events. Genes related to responses to salt, osmotic, abscisic acid, and drought treatments were specifically differentially expressed in the drought-tolerant epilines. The status of the epigenome, scored as differential trimethylation of lysine-4 of histone 3, further supported the phenotype by targeting drought-responsive genes and facilitating the transcription of the differentially expressed genes. From these results, we conclude that the canola epigenome can be shaped by selection to increase energy use efficiency and stress tolerance. Hence, these findings warrant the further development of strategies to incorporate epigenetics into breeding.
Affiliation(s)
- Aurine Verkest, Marina Byzova, Cindy Martens, Patrick Willems, Tom Verwulgen, Bram Slabbinck, Debbie Rombaut, Jan Van de Velde, Klaas Vandepoele, Evi Standaert, Marrit Peeters, Mieke Van Lijsebettens, Frank Van Breusegem, Marc De Block
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium (A.V., M.B., C.M., P.W., T.V., B.S., D.R., J.V.d.V., K.V., M.V.L., F.V.B.)
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium (A.V., M.B., C.M., P.W., T.V., B.S., D.R., J.V.d.V., K.V., M.V.L., F.V.B.)
- Bayer CropScience, 9052 Ghent, Belgium (C.M., E.S., M.P., M.D.B.)
- Department of Medical Protein Research, VIB, 9000 Ghent, Belgium (P.W.)
- Department of Biochemistry, Ghent University, 9000 Ghent, Belgium (P.W.)
|
16
|
Clauw P, Coppens F, De Beuf K, Dhondt S, Van Daele T, Maleux K, Storme V, Clement L, Gonzalez N, Inzé D. Leaf responses to mild drought stress in natural variants of Arabidopsis. PLANT PHYSIOLOGY 2015; 167:800-16. [PMID: 25604532 PMCID: PMC4348775 DOI: 10.1104/pp.114.254284] [Citation(s) in RCA: 112] [Impact Index Per Article: 12.4] [Received: 11/24/2014] [Accepted: 01/16/2015] [Indexed: 05/18/2023]
Abstract
Although the response of plants exposed to severe drought stress has been studied extensively, little is known about how plants adapt their growth under mild drought stress conditions. Here, we analyzed the leaf and rosette growth response of six Arabidopsis (Arabidopsis thaliana) accessions originating from different geographic regions when exposed to mild drought stress. The automated phenotyping platform WIWAM was used to impose stress early during leaf development, when the third leaf emerges from the shoot apical meristem. Analysis of growth-related phenotypes showed differences in leaf development between the accessions. In all six accessions, mild drought stress reduced both leaf pavement cell area and number without affecting the stomatal index. Genome-wide transcriptome analysis (using RNA sequencing) of early developing leaf tissue identified 354 genes differentially expressed under mild drought stress in the six accessions. Our results indicate the existence of a robust response over different genetic backgrounds to mild drought stress in developing leaves. The processes involved in the overall mild drought stress response comprised abscisic acid signaling, proline metabolism, and cell wall adjustments. In addition to these known severe drought-related responses, 87 genes were found to be specific for the response of young developing leaves to mild drought stress.
Affiliation(s)
- Pieter Clauw, Frederik Coppens, Kristof De Beuf, Stijn Dhondt, Twiggy Van Daele, Katrien Maleux, Veronique Storme, Lieven Clement, Nathalie Gonzalez, Dirk Inzé
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium (P.C., F.C., S.D., T.V.D., K.M., V.S., N.G., D.I.)
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium (P.C., F.C., S.D., T.V.D., K.M., V.S., N.G., D.I.)
- Department of Applied Mathematics Computer Science and Statistics (K.D.B., L.C.) and Stat-Gent CRESCENDO, Department of Applied Mathematics and Computer Science (K.D.B.), Ghent University, 9000 Ghent, Belgium
|
17
|
Zwick M. Automated curation of gene name normalization results using the Konstanz information miner. J Biomed Inform 2014; 53:58-64. [PMID: 25218035 DOI: 10.1016/j.jbi.2014.08.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2014] [Revised: 08/13/2014] [Accepted: 08/31/2014] [Indexed: 10/24/2022]
Abstract
BACKGROUND: Gene name recognition and normalization is, together with detection of other named entities, a crucial step in biomedical text mining and the underlying basis for development of more advanced techniques like extraction of complex events. While the current state of the art solutions achieve highly promising results on average, performance can drop significantly for specific genes with highly ambiguous synonyms. Depending on the topic of interest, this can cause the need for extensive manual curation of such text mining results. Our goal was to enhance this curation step based on tools widely used in pharmaceutical industry utilizing the text processing and classification capabilities of the Konstanz Information Miner (KNIME) along with publicly available sources.
RESULTS: F-score achieved on gene specific test corpora for highly ambiguous genes could be improved from values close to zero, due to very low precision, to values >0.9 for several cases. Interestingly the presented approach even resulted in an increased F-score for genes showing already good results in initial gene name normalization. For most test cases, we could significantly improve precision, while retaining a high recall.
CONCLUSIONS: We could show that KNIME can be used to assist in manual curation of text mining results containing high numbers of false positive hits. Our results also indicate that it could be beneficial for future development in the field of gene name normalization to create gene specific training corpora based on incorrectly identified genes common to current state of the art algorithms.
Affiliation(s)
- Matthias Zwick
- Department Research Networking, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88400 Biberach a.d. Riß, Germany.
|
18
|
Lee HJ, Dang TC, Lee H, Park JC. OncoSearch: cancer gene search engine with literature evidence. Nucleic Acids Res 2014; 42:W416-21. [PMID: 24813447 PMCID: PMC4086113 DOI: 10.1093/nar/gku368] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022] Open
Abstract
In order to identify genes that are involved in oncogenesis and to understand how such genes affect cancers, abnormal gene expressions in cancers are actively studied. For an efficient access to the results of such studies that are reported in biomedical literature, the relevant information is accumulated via text-mining tools and made available through the Web. However, current Web tools are not yet tailored enough to allow queries that specify how a cancer changes along with the change in gene expression level, which is an important piece of information to understand an involved gene's role in cancer progression or regression. OncoSearch is a Web-based engine that searches Medline abstracts for sentences that mention gene expression changes in cancers, with queries that specify (i) whether a gene expression level is up-regulated or down-regulated, (ii) whether a certain type of cancer progresses or regresses along with such gene expression change and (iii) the expected role of the gene in the cancer. OncoSearch is available through http://oncosearch.biopathway.org.
Affiliation(s)
- Hee-Jin Lee
- Department of Computer Science, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea
- Tien Cuong Dang
- Department of Computer Science, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea
- Hyunju Lee
- School of Information and Communications, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-gu, Gwangju 500-712, Republic of Korea
- Jong C Park
- Department of Computer Science, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea
|
19
|
Funk C, Baumgartner W, Garcia B, Roeder C, Bada M, Cohen KB, Hunter LE, Verspoor K. Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics 2014; 15:59. [PMID: 24571547 PMCID: PMC4015610 DOI: 10.1186/1471-2105-15-59] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Received: 04/22/2013] [Accepted: 01/24/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem.
Results: Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented.
Conclusions: Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14–0.83). Out of the three systems we tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure of seven out of eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters F-measure can be increased by up to 0.4. Not only are best performing parameters presented, but suggestions for choosing the best parameters based on ontology characteristics are presented.
Affiliation(s)
- Christopher Funk
- Computational Bioscience Program, U. of Colorado School of Medicine, Aurora, CO 80045, USA.
|
20
|
Large-scale event extraction from literature with multi-level gene normalization. PLoS One 2013; 8:e55814. [PMID: 23613707 PMCID: PMC3629104 DOI: 10.1371/journal.pone.0055814] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.6] [Received: 11/02/2012] [Accepted: 01/02/2013] [Indexed: 11/19/2022] Open
Abstract
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). 
Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.
|
21
|
Van Landeghem S, De Bodt S, Drebert ZJ, Inzé D, Van de Peer Y. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis. THE PLANT CELL 2013; 25:794-807. [PMID: 23532071 PMCID: PMC3634689 DOI: 10.1105/tpc.112.108753] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 12/20/2012] [Revised: 02/27/2013] [Accepted: 03/08/2013] [Indexed: 05/21/2023]
Abstract
Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
Affiliation(s)
- Sofie Van Landeghem
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Stefanie De Bodt
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Zuzanna J. Drebert
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Dirk Inzé
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Yves Van de Peer
- Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Address correspondence to
|