1
|
Žitnik S, Žitnik M, Zupan B, Bajec M. Sieve-based relation extraction of gene regulatory networks from biological literature. BMC Bioinformatics 2015; 16 Suppl 16:S1. [PMID: 26551454 PMCID: PMC4642041 DOI: 10.1186/1471-2105-16-s16-s1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023] Open
Abstract
Background Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic relations between parts of text, called mentions. Biomedical literature includes an enormous amount of textual descriptions of biological entities, their interactions and results of related experiments. To extract them in an explicit, computer readable format, these relations were at first extracted manually from databases. Manual curation was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The current challenge is the development of information extraction procedures that can directly infer more complex relational structures, such as gene regulatory networks. Results We develop a computational approach for extraction of gene regulatory networks from textual data. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each of which is able to extract different relationship types. Following the shared task, we conducted additional analysis using different system settings that resulted in reducing the reconstruction error of bacterial sporulation network from 0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of extraction. Analysis of distances between different mention types in the text shows that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions. Conclusions Linear-chain conditional random fields, along with appropriate data transformations, can be efficiently used to extract relations. The sieve-based architecture simplifies the system as new sieves can be easily added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional random fields can be trained on arbitrary text data and hence are applicable to broad range of relation extraction tasks and data domains.
Collapse
|
2
|
Abstract
Background Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers are generated daily. Nowadays, these documents are mainly available in the form of unstructured free texts, which require heavy processing for their registration into organized databases. This organization is instrumental for information retrieval, enabling to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic methods of text-mining that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these biomedical events, due to the peculiarities of the domain. Indeed, biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text-mining in molecular biology a specific discipline. Results We address biomedical event extraction as the classification of pairs of text entities into the classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on Support Vector Machines. This recursive process extracts events involving other events as arguments. Compared to joint models based on Markov Random Fields, our model simplifies inference and hence requires shorter training and prediction times along with lower memory capacity. Compared to usual pipeline approaches, our model passes over a complex intermediate problem, while making a more extensive usage of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of BioNLP challenges yielding the best result reported so far on the 2013 edition.
Collapse
|
3
|
Ananiadou S, Thompson P, Nawaz R, McNaught J, Kell DB. Event-based text mining for biology and functional genomics. Brief Funct Genomics 2015; 14:213-30. [PMID: 24907365 PMCID: PMC4499874 DOI: 10.1093/bfgp/elu015] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of 'events', i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research.
Collapse
|
4
|
Torii M, Arighi CN, Li G, Wang Q, Wu CH, Vijay-Shanker K. RLIMS-P 2.0: A Generalizable Rule-Based Information Extraction System for Literature Mining of Protein Phosphorylation Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:17-29. [PMID: 26357075 PMCID: PMC4568560 DOI: 10.1109/tcbb.2014.2372765] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/05/2023]
Abstract
We introduce RLIMS-P version 2.0, an enhanced rule-based information extraction (IE) system for mining kinase, substrate, and phosphorylation site information from scientific literature. Consisting of natural language processing and IE modules, the system has integrated several new features, including the capability of processing full-text articles and generalizability towards different post-translational modifications (PTMs). To evaluate the system, sets of abstracts and full-text articles, containing a variety of textual expressions, were annotated. On the abstract corpus, the system achieved F-scores of 0.91, 0.92, and 0.95 for kinases, substrates, and sites, respectively. The corresponding scores on the full-text corpus were 0.88, 0.91, and 0.92. It was additionally evaluated on the corpus of the 2013 BioNLP-ST GE task, and achieved an F-score of 0.87 for the phosphorylation core task, improving upon the results previously reported on the corpus. Full-scale processing of all abstracts in MEDLINE and all articles in PubMed Central Open Access Subset has demonstrated scalability for mining rich information in literature, enabling its adoption for biocuration and for knowledge discovery. The new system is generalizable and it will be adapted to tackle other major PTM types. RLIMS-P 2.0 online system is available online (http://proteininformationresource.org/rlimsp/) and the developed corpora are available from iProLINK (http://proteininformationresource.org/iprolink/).
Collapse
Affiliation(s)
- Manabu Torii
- Medical Informatics Group, Kaiser Permanente Southern California, 11975 El Camino Real, San Diego, CA 92130
| | - Cecilia N. Arighi
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711
| | - Gang Li
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 1971
| | - Qinghua Wang
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711
| | - Cathy H. Wu
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711
| | - K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE 19716
| |
Collapse
|
5
|
Peng Y, Torii M, Wu CH, Vijay-Shanker K. A generalizable NLP framework for fast development of pattern-based biomedical relation extraction systems. BMC Bioinformatics 2014; 15:285. [PMID: 25149151 PMCID: PMC4262219 DOI: 10.1186/1471-2105-15-285] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2013] [Accepted: 08/15/2014] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large amount of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task. RESULTS A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations. CONCLUSIONS In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.
Collapse
Affiliation(s)
- Yifan Peng
- />Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716 USA
| | - Manabu Torii
- />Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716 USA
- />Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711 USA
| | - Cathy H Wu
- />Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716 USA
- />Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711 USA
| | - K Vijay-Shanker
- />Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716 USA
| |
Collapse
|
6
|
Pyysalo S, Ohta T, Rak R, Sullivan D, Mao C, Wang C, Sobral B, Tsujii J, Ananiadou S. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinformatics 2012; 13 Suppl 11:S2. [PMID: 22759456 PMCID: PMC3384257 DOI: 10.1186/1471-2105-13-s11-s2] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.
Collapse
Affiliation(s)
- Sampo Pyysalo
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| | - Tomoko Ohta
- Department of Computer Science, University of Tokyo, Tokyo, Japan
| | - Rafal Rak
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| | - Dan Sullivan
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | - Chunhong Mao
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | - Chunxia Wang
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | - Bruno Sobral
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | | | - Sophia Ananiadou
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| |
Collapse
|
7
|
Van Landeghem S, Hakala K, Rönnqvist S, Salakoski T, Van de Peer Y, Ginter F. Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology, and Indirect Associations. Adv Bioinformatics 2012; 2012:582765. [PMID: 22719757 PMCID: PMC3375141 DOI: 10.1155/2012/582765] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2011] [Revised: 03/16/2012] [Accepted: 03/28/2012] [Indexed: 01/20/2023] Open
Abstract
Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
Collapse
Affiliation(s)
- Sofie Van Landeghem
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
| | - Kai Hakala
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
| | - Samuel Rönnqvist
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
| | - Tapio Salakoski
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Turku BioNLP Group, Turku Centre for Computer Science (TUCS), Joukahaisenkatu 3-5, 20520 Turku, Finland
| | - Yves Van de Peer
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
| | - Filip Ginter
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
| |
Collapse
|
8
|
Friedel S, Usadel B, von Wirén N, Sreenivasulu N. Reverse engineering: a key component of systems biology to unravel global abiotic stress cross-talk. FRONTIERS IN PLANT SCIENCE 2012; 3:294. [PMID: 23293646 PMCID: PMC3533172 DOI: 10.3389/fpls.2012.00294] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/14/2012] [Accepted: 12/10/2012] [Indexed: 05/18/2023]
Abstract
Understanding the global abiotic stress response is an important stepping stone for the development of universal stress tolerance in plants in the era of climate change. Although co-occurrence of several stress factors (abiotic and biotic) in nature is found to be frequent, current attempts are poor to understand the complex physiological processes impacting plant growth under combinatory factors. In this review article, we discuss the recent advances of reverse engineering approaches that led to seminal discoveries of key candidate regulatory genes involved in cross-talk of abiotic stress responses and summarized the available tools of reverse engineering and its relevant application. Among the universally induced regulators involved in various abiotic stress responses, we highlight the importance of (i) abscisic acid (ABA) and jasmonic acid (JA) hormonal cross-talks and (ii) the central role of WRKY transcription factors (TF), potentially mediating both abiotic and biotic stress responses. Such interactome networks help not only to derive hypotheses but also play a vital role in identifying key regulatory targets and interconnected hormonal responses. To explore the full potential of gene network inference in the area of abiotic stress tolerance, we need to validate hypotheses by implementing time-dependent gene expression data from genetically engineered plants with modulated expression of target genes. We further propose to combine information on gene-by-gene interactions with data from physical interaction platforms such as protein-protein or TF-gene networks.
Collapse
Affiliation(s)
- Swetlana Friedel
- Leibniz Institute of Plant Genetics and Crop Plant ResearchGatersleben, Germany
| | - Björn Usadel
- RWTH Aachen UniversityAachen, Germany
- IBG-2: Plant Sciences, Institute of Bio- and Geosciences, Forschungszentrum JülichJülich, Germany
| | - Nicolaus von Wirén
- Leibniz Institute of Plant Genetics and Crop Plant ResearchGatersleben, Germany
| | - Nese Sreenivasulu
- Leibniz Institute of Plant Genetics and Crop Plant ResearchGatersleben, Germany
- *Correspondence: Nese Sreenivasulu, Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany. e-mail:
| |
Collapse
|