Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Thompson P, Iqbal SA, McNaught J, Ananiadou S. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics 2009;10:349. [PMID: 19852798 PMCID: PMC2774701 DOI: 10.1186/1471-2105-10-349] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2009] [Accepted: 10/23/2009] [Indexed: 11/10/2022] Open

For:	Thompson P, Iqbal SA, McNaught J, Ananiadou S. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics 2009;10:349. [PMID: 19852798 PMCID: PMC2774701 DOI: 10.1186/1471-2105-10-349] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2009] [Accepted: 10/23/2009] [Indexed: 11/10/2022] Open

Number

Cited by Other Article(s)

Wu P, Li X, Gu J, Qian L, Zhou G. Pipelined biomedical event extraction rivaling joint learning. Methods 2024;226:9-18. [PMID: 38604412 DOI: 10.1016/j.ymeth.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2023] [Revised: 11/20/2023] [Accepted: 04/07/2024] [Indexed: 04/13/2024] Open

Nagano N, Tokunaga N, Ikeda M, Inoura H, Khoa DA, Miwa M, Sohrab MG, Topić G, Nogami-Itoh M, Takamura H. A novel corpus of molecular to higher-order events that facilitates the understanding of the pathogenic mechanisms of idiopathic pulmonary fibrosis. Sci Rep 2023;13:5986. [PMID: 37045907 PMCID: PMC10092917 DOI: 10.1038/s41598-023-32915-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2022] [Accepted: 04/04/2023] [Indexed: 04/14/2023] Open

Affiliation(s)

Nozomi Nagano Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan.
Narumi Tokunaga Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan
Masami Ikeda Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan
Hiroko Inoura Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan
Duong A Khoa Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan
Makoto Miwa Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-Ku, Nagoya, 468-8511, Japan
Mohammad G Sohrab Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan
Goran Topić Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan
Mari Nogami-Itoh Laboratory of Bioinformatics, Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, 3-17, Senrioka-Shinmachi, Settsu, Osaka, 566-0002, Japan
Hiroya Takamura Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan

Collapse

Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022;5:186. [PMID: 36544046 PMCID: PMC9770568 DOI: 10.1038/s41746-022-00730-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2022] [Accepted: 11/29/2022] [Indexed: 12/24/2022] Open

Affiliation(s)

Honghan Wu Institute of Health Informatics, University College London, London, UK.
Minhong Wang Institute of Health Informatics, University College London, London, UK
Jinge Wu Institute of Health Informatics, University College London, London, UK Usher Institute, University of Edinburgh, Edinburgh, UK
Farah Francis Usher Institute, University of Edinburgh, Edinburgh, UK
Yun-Hsuan Chang Institute of Health Informatics, University College London, London, UK
Alex Shavick Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
Hang Dong Usher Institute, University of Edinburgh, Edinburgh, UK Department of Computer Science, University of Oxford, Oxford, UK
Michael T C Poon Usher Institute, University of Edinburgh, Edinburgh, UK
Natalie Fitzpatrick Institute of Health Informatics, University College London, London, UK
Adam P Levine Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
Luke T Slater Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
Alex Handy Institute of Health Informatics, University College London, London, UK University College London Hospitals NHS Trust, London, UK
Andreas Karwath Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
Georgios V Gkoutos Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
Claude Chelala Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
Anoop Dinesh Shah Institute of Health Informatics, University College London, London, UK
Robert Stewart Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK South London and Maudsley NHS Foundation Trust, London, UK
Nigel Collier Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
Beatrice Alex Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
William Whiteley Usher Institute, University of Edinburgh, Edinburgh, UK
Cathie Sudlow Usher Institute, University of Edinburgh, Edinburgh, UK
Angus Roberts Department of Biostatistics & Health Informatics, King's College London, London, UK
Richard J B Dobson Institute of Health Informatics, University College London, London, UK Department of Biostatistics & Health Informatics, King's College London, London, UK

Collapse

Rojas-Garcia J. Semantic Representation of Context for Description of Named Rivers in a Terminological Knowledge Base. Front Psychol 2022;13:847024. [PMID: 36059758 PMCID: PMC9435467 DOI: 10.3389/fpsyg.2022.847024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2021] [Accepted: 06/22/2022] [Indexed: 11/18/2022] Open

Abstract

The description of named entities in terminological knowledge bases has never been addressed in any depth in terminology. Firm preconceptions, rooted in philosophy, about the only referential function of proper names have presumably led to disparage their inclusion in terminology resources, despite the relevance of named entities having been highlighted by prominent figures in the discipline of terminology. Scholars from different branches of linguistics depart from the conservative stance on proper names and have foregrounded the need for a novel approach, more linguistic than philosophical, to describing proper names. Therefore, this paper proposed a linguistic and terminological approach to the study of named entities when used in scientific discourse, with the purpose of representing them in EcoLexicon, an environmental knowledge base designed according to the premises of Frame-based Terminology. We focused more specifically on named rivers (or potamonyms) mentioned in a coastal engineering corpus. Inclusion of named entities in terminological knowledge bases requires analyzing the context that surrounds them in specialized texts because these contexts convey specialized knowledge about named entities. For the semantic representation of context, this paper thus analyzed the local syntactic and semantic contexts that surrounded potamonyms in coastal engineering texts and described the semantic annotation of the predicate-argument structure of sentences where a potamonym was mentioned. The semantic variables annotated were the following: (1) semantic category of the arguments; (2) semantic role of the arguments; (3) semantic relation between the arguments; and (4) lexical domain of the verbs. This method yielded valuable insight into the different semantic roles that named rivers played, the entities and processes that participated in the events educed by potamonyms through verbs, and how they all interacted. Furthermore, since arguments are specialized terms and verbs are relational constructs, the analysis of argument structure led to the construction of semantic networks that depicted specialized knowledge about named rivers. These conceptual networks were then used to craft the thematic description of potamonyms. Accordingly, the semantic network and the thematic description not only constituted the representation of a potamonym in EcoLexicon, but also allowed the geographic contextualization of specialized concepts in the terminological resource.

Collapse

Using Twitter to Detect Hate Crimes and Their Motivations: The HateMotiv Corpus. DATA 2022. [DOI: 10.3390/data7060069] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open

Using Social Media to Detect Fake News Information Related to Product Marketing: The FakeAds Corpus. DATA 2022. [DOI: 10.3390/data7040044] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023] Open

Song B, Li F, Liu Y, Zeng X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Brief Bioinform 2021;22:6326536. [PMID: 34308472 DOI: 10.1093/bib/bbab282] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2021] [Revised: 06/07/2021] [Accepted: 07/02/2021] [Indexed: 11/13/2022] Open

Alnazzawi N. Building a semantically annotated corpus for chronic disease complications using two document types. PLoS One 2021;16:e0247319. [PMID: 33735207 PMCID: PMC7971867 DOI: 10.1371/journal.pone.0247319] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2020] [Accepted: 02/04/2021] [Indexed: 11/19/2022] Open

Ju M, Short AD, Thompson P, Bakerly ND, Gkoutos GV, Tsaprouni L, Ananiadou S. Annotating and detecting phenotypic information for chronic obstructive pulmonary disease. JAMIA Open 2020;2:261-271. [PMID: 31984360 PMCID: PMC6951876 DOI: 10.1093/jamiaopen/ooz009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2018] [Revised: 02/21/2019] [Accepted: 03/19/2019] [Indexed: 12/29/2022] Open

Abstract

Objectives

Chronic obstructive pulmonary disease (COPD) phenotypes cover a range of lung abnormalities. To allow text mining methods to identify pertinent and potentially complex information about these phenotypes from textual data, we have developed a novel annotated corpus, which we use to train a neural network-based named entity recognizer to detect fine-grained COPD phenotypic information.

Materials and methods

Since COPD phenotype descriptions often mention other concepts within them (proteins, treatments, etc.), our corpus annotations include both outermost phenotype descriptions and concepts nested within them. Our neural layered bidirectional long short-term memory conditional random field (BiLSTM-CRF) network firstly recognizes nested mentions, which are fed into subsequent BiLSTM-CRF layers, to help to recognize enclosing phenotype mentions.

Results

Our corpus of 30 full papers (available at: http://www.nactem.ac.uk/COPD) is annotated by experts with 27 030 phenotype-related concept mentions, most of which are automatically linked to UMLS Metathesaurus concepts. When trained using the corpus, our BiLSTM-CRF network outperforms other popular approaches in recognizing detailed phenotypic information.

Discussion

Information extracted by our method can facilitate efficient location and exploration of detailed information about phenotypes, for example, those specifically concerning reactions to treatments.

Conclusion

The importance of our corpus for developing methods to extract fine-grained information about COPD phenotypes is demonstrated through its successful use to train a layered BiLSTM-CRF network to extract phenotypic information at various levels of granularity. The minimal human intervention needed for training should permit ready adaption to extracting phenotypic information about other diseases.

Collapse

Galea D, Laponogov I, Veselkov K. Exploiting and assessing multi-source data for supervised biomedical named entity recognition. Bioinformatics 2019. [PMID: 29538614 PMCID: PMC6041968 DOI: 10.1093/bioinformatics/bty152] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open

Abstract

Motivation

Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed.

Results

Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model 'overtraining') which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data.

Availability and implementation

Compiled primary and secondary sources of the aggregated corpora are available on: https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner.

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse

ProtFus: A Comprehensive Method Characterizing Protein-Protein Interactions of Fusion Proteins. PLoS Comput Biol 2019;15:e1007239. [PMID: 31437145 PMCID: PMC6705771 DOI: 10.1371/journal.pcbi.1007239] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2018] [Accepted: 07/03/2019] [Indexed: 01/10/2023] Open

Abstract

Tailored therapy aims to cure cancer patients effectively and safely, based on the complex interactions between patients' genomic features, disease pathology and drug metabolism. Thus, the continual increase in scientific literature drives the need for efficient methods of data mining to improve the extraction of useful information from texts based on patients' genomic features. An important application of text mining to tailored therapy in cancer encompasses the use of mutations and cancer fusion genes as moieties that change patients' cellular networks to develop cancer, and also affect drug metabolism. Fusion proteins, which are derived from the slippage of two parental genes, are produced in cancer by chromosomal aberrations and trans-splicing. Given that the two parental proteins for predicted fusion proteins are known, we used our previously developed method for identifying chimeric protein-protein interactions (ChiPPIs) associated with the fusion proteins. Here, we present a validation approach that receives fusion proteins of interest, predicts their cellular network alterations by ChiPPI and validates them by our new method, ProtFus, using an online literature search. This process resulted in a set of 358 fusion proteins and their corresponding protein interactions, as a training set for a Naïve Bayes classifier, to identify predicted fusion proteins that have reliable evidence in the literature and that were confirmed experimentally. Next, for a test group of 1817 fusion proteins, we were able to identify from the literature 2908 PPIs in total, across 18 cancer types. The described method, ProtFus, can be used for screening the literature to identify unique cases of fusion proteins and their PPIs, as means of studying alterations of protein networks in cancers. Availability: http://protfus.md.biu.ac.il/.

Collapse

Representing oncology in datasets: Standard or custom biomedical terminology? INFORMATICS IN MEDICINE UNLOCKED 2019. [DOI: 10.1016/j.imu.2019.100186] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform 2018;10:37. [PMID: 30105604 PMCID: PMC6089860 DOI: 10.1186/s13321-018-0290-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2018] [Accepted: 07/20/2018] [Indexed: 02/02/2023] Open

Abstract

Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/ .

Collapse

Arguello Casteleiro M, Demetriou G, Read W, Fernandez Prieto MJ, Maroto N, Maseda Fernandez D, Nenadic G, Klein J, Keane J, Stevens R. Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature. J Biomed Semantics 2018;9:13. [PMID: 29650041 PMCID: PMC5896136 DOI: 10.1186/s13326-018-0181-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2017] [Accepted: 03/06/2018] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Automatic identification of term variants or acceptable alternative free-text terms for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) if word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) if biological knowledge from the CVDO can improve such a list without modifying the word embeddings created.

METHODS

We have manually annotated 105 gene/protein names from 25 PubMed titles/abstracts and mapped them to 79 unique UniProtKB entries corresponding to gene and protein classes from the CVDO. Using more than 14 M PubMed articles (titles and available abstracts), word embeddings were generated with CBOW and Skip-gram. We setup two experiments for a synonym detection task, each with four raters, and 3672 pairs of terms (target term and candidate term) from the word embeddings created. For Experiment I, the target terms for 64 UniProtKB entries were those that appear in the titles/abstracts; Experiment II involves 63 UniProtKB entries and the target terms are a combination of terms from PubMed titles/abstracts with terms (i.e. increased context) from the CVDO protein class expressions and labels.

RESULTS

In Experiment I, Skip-gram finds term variants (full and/or partial) for 89% of the 64 UniProtKB entries, while CBOW finds term variants for 67%. In Experiment II (with the aid of the CVDO), Skip-gram finds term variants for 95% of the 63 UniProtKB entries, while CBOW finds term variants for 78%. Combining the results of both experiments, Skip-gram finds term variants for 97% of the 79 UniProtKB entries, while CBOW finds term variants for 81%.

CONCLUSIONS

This study shows performance improvements for both CBOW and Skip-gram on a gene/protein synonym detection task by adding knowledge formalised in the CVDO and without modifying the word embeddings created. Hence, the CVDO supplies context that is effective in inducing term variability for both CBOW and Skip-gram while reducing ambiguity. Skip-gram outperforms CBOW and finds more pertinent term variants for gene/protein names annotated from the scientific literature.

Collapse

Zerva C, Batista-Navarro R, Day P, Ananiadou S. Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 2017;33:3784-3792. [PMID: 29036627 PMCID: PMC5860317 DOI: 10.1093/bioinformatics/btx466] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2017] [Revised: 06/27/2017] [Accepted: 07/21/2017] [Indexed: 11/20/2022] Open

Abstract

MOTIVATION

In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models.

RESULTS

We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research.

AVAILABILITY AND IMPLEMENTATION

The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material.

CONTACT

sophia.ananiadou@manchester.ac.uk.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Raja K, Patrick M, Gao Y, Madu D, Yang Y, Tsoi LC. A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries. Int J Genomics 2017;2017:6213474. [PMID: 28331849 PMCID: PMC5346376 DOI: 10.1155/2017/6213474] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2016] [Accepted: 02/09/2017] [Indexed: 12/13/2022] Open

Pérez-Pérez M, Pérez-Rodríguez G, Fdez-Riverola F, Lourenço A. Collaborative relation annotation and quality analysis in Markyt environment. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017;2017:4693828. [PMID: 29220479 PMCID: PMC5737204 DOI: 10.1093/database/bax090] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/18/2017] [Accepted: 11/09/2017] [Indexed: 11/30/2022]

Kilicoglu H, Rosemblat G, Fiszman M, Rindflesch TC. Sortal anaphora resolution to enhance relation extraction from biomedical literature. BMC Bioinformatics 2016;17:163. [PMID: 27080229 PMCID: PMC4832532 DOI: 10.1186/s12859-016-1009-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2015] [Accepted: 04/01/2016] [Indexed: 11/16/2022] Open

Abstract

BACKGROUND

Entity coreference is common in biomedical literature and it can affect text understanding systems that rely on accurate identification of named entities, such as relation extraction and automatic summarization. Coreference resolution is a foundational yet challenging natural language processing task which, if performed successfully, is likely to enhance such systems significantly. In this paper, we propose a semantically oriented, rule-based method to resolve sortal anaphora, a specific type of coreference that forms the majority of coreference instances in biomedical literature. The method addresses all entity types and relies on linguistic components of SemRep, a broad-coverage biomedical relation extraction system. It has been incorporated into SemRep, extending its core semantic interpretation capability from sentence level to discourse level.

RESULTS

We evaluated our sortal anaphora resolution method in several ways. The first evaluation specifically focused on sortal anaphora relations. Our methodology achieved a F1 score of 59.6 on the test portion of a manually annotated corpus of 320 Medline abstracts, a 4-fold improvement over the baseline method. Investigating the impact of sortal anaphora resolution on relation extraction, we found that the overall effect was positive, with 50 % of the changes involving uninformative relations being replaced by more specific and informative ones, while 35 % of the changes had no effect, and only 15 % were negative. We estimate that anaphora resolution results in changes in about 1.5 % of approximately 82 million semantic relations extracted from the entire PubMed.

CONCLUSIONS

Our results demonstrate that a heavily semantic approach to sortal anaphora resolution is largely effective for biomedical literature. Our evaluation and error analysis highlight some areas for further improvements, such as coordination processing and intra-sentential antecedent selection.

Collapse

Thompson P, Batista-Navarro RT, Kontonatsios G, Carter J, Toon E, McNaught J, Timmermann C, Worboys M, Ananiadou S. Text Mining the History of Medicine. PLoS One 2016;11:e0144717. [PMID: 26734936 PMCID: PMC4703377 DOI: 10.1371/journal.pone.0144717] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2015] [Accepted: 11/23/2015] [Indexed: 11/19/2022] Open

Abstract

Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.

Collapse

An Overview of Biomolecular Event Extraction from Scientific Documents. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015;2015:571381. [PMID: 26587051 PMCID: PMC4637451 DOI: 10.1155/2015/571381] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/13/2015] [Revised: 08/10/2015] [Accepted: 08/18/2015] [Indexed: 01/09/2023]

Ananiadou S, Thompson P, Nawaz R, McNaught J, Kell DB. Event-based text mining for biology and functional genomics. Brief Funct Genomics 2015;14:213-30. [PMID: 24907365 PMCID: PMC4499874 DOI: 10.1093/bfgp/elu015] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open

Pérez-Pérez M, Glez-Peña D, Fdez-Riverola F, Lourenço A. Marky: a tool supporting annotation consistency in multi-user and iterative document annotation projects. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2015;118:242-251. [PMID: 25480679 DOI: 10.1016/j.cmpb.2014.11.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2014] [Revised: 10/24/2014] [Accepted: 11/18/2014] [Indexed: 06/04/2023]

Abstract

BACKGROUND AND OBJECTIVES

Document annotation is a key task in the development of Text Mining methods and applications. High quality annotated corpora are invaluable, but their preparation requires a considerable amount of resources and time. Although the existing annotation tools offer good user interaction interfaces to domain experts, project management and quality control abilities are still limited. Therefore, the current work introduces Marky, a new Web-based document annotation tool equipped to manage multi-user and iterative projects, and to evaluate annotation quality throughout the project life cycle.

METHODS

At the core, Marky is a Web application based on the open source CakePHP framework. User interface relies on HTML5 and CSS3 technologies. Rangy library assists in browser-independent implementation of common DOM range and selection tasks, and Ajax and JQuery technologies are used to enhance user-system interaction.

RESULTS

Marky grants solid management of inter- and intra-annotator work. Most notably, its annotation tracking system supports systematic and on-demand agreement analysis and annotation amendment. Each annotator may work over documents as usual, but all the annotations made are saved by the tracking system and may be further compared. So, the project administrator is able to evaluate annotation consistency among annotators and across rounds of annotation, while annotators are able to reject or amend subsets of annotations made in previous rounds. As a side effect, the tracking system minimises resource and time consumption.

CONCLUSIONS

Marky is a novel environment for managing multi-user and iterative document annotation projects. Compared to other tools, Marky offers a similar visually intuitive annotation experience while providing unique means to minimise annotation effort and enforce annotation quality, and therefore corpus consistency. Marky is freely available for non-commercial use at http://sing.ei.uvigo.es/marky.

Collapse

Pecina P, Dušek O, Goeuriot L, Hajič J, Hlaváčová J, Jones GJ, Kelly L, Leveling J, Mareček D, Novák M, Popel M, Rosa R, Tamchyna A, Urešová Z. Adaptation of machine translation for multilingual information retrieval in the medical domain. Artif Intell Med 2014;61:165-85. [DOI: 10.1016/j.artmed.2014.01.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2013] [Revised: 01/15/2014] [Accepted: 01/28/2014] [Indexed: 11/16/2022]

Tsai RTH, Lai PT. A resource-saving collective approach to biomedical semantic role labeling. BMC Bioinformatics 2014;15:160. [PMID: 24884358 PMCID: PMC4062501 DOI: 10.1186/1471-2105-15-160] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2013] [Accepted: 04/22/2014] [Indexed: 11/24/2022] Open

Abstract

Background

Biomedical semantic role labeling (BioSRL) is a natural language processing technique that identifies the semantic roles of the words or phrases in sentences describing biological processes and expresses them as predicate-argument structures (PAS’s). Currently, a major problem of BioSRL is that most systems label every node in a full parse tree independently; however, some nodes always exhibit dependency. In general SRL, collective approaches based on the Markov logic network (MLN) model have been successful in dealing with this problem. However, in BioSRL such an approach has not been attempted because it would require more training data to recognize the more specialized and diverse terms found in biomedical literature, increasing training time and computational complexity.

Results

We first constructed a collective BioSRL system based on MLN. This system, called collective BIOSMILE (CBIOSMILE), is trained on the BioProp corpus. To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes in the tree. Nodes not recognized by any candidate identifier are discarded. The pruned annotated parse trees are used to train a resource-saving MLN-based system, which is referred to as resource-saving collective BIOSMILE (RCBIOSMILE). Our experimental results show that our proposed CBIOSMILE system outperforms BIOSMILE, which is the top BioSRL system. Furthermore, our proposed RCBIOSMILE maintains the same level of accuracy as CBIOSMILE using 92% less memory and 57% less training time.

Conclusions

This greatly improved efficiency makes RCBIOSMILE potentially suitable for training on much larger BioSRL corpora over more biomedical domains. Compared to real-world biomedical corpora, BioProp is relatively small, containing only 445 MEDLINE abstracts and 30 event triggers. It is not large enough for practical applications, such as pathway construction. We consider it of primary importance to pursue SRL training on large corpora in the future.

Collapse

Neves M. An analysis on the entity annotations in biological corpora. F1000Res 2014;3:96. [PMID: 25254099 PMCID: PMC4168744 DOI: 10.12688/f1000research.3216.1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 04/17/2014] [Indexed: 11/20/2022] Open

Doğan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014;47:1-10. [PMID: 24393765 DOI: 10.1016/j.jbi.2013.12.006] [Citation(s) in RCA: 242] [Impact Index Per Article: 24.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2013] [Revised: 11/06/2013] [Accepted: 12/07/2013] [Indexed: 10/25/2022]

Abstract

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.

Collapse

Nawaz R, Thompson P, Ananiadou S. Negated bio-events: analysis and identification. BMC Bioinformatics 2013;14:14. [PMID: 23323936 PMCID: PMC3561152 DOI: 10.1186/1471-2105-14-14] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2012] [Accepted: 01/10/2013] [Indexed: 12/19/2022] Open

Abstract

BACKGROUND

Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations.

RESULTS

We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP'09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP'09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events.

CONCLUSIONS

Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events.

Collapse

Mihăilă C, Ohta T, Pyysalo S, Ananiadou S. BioCause: Annotating and analysing causality in the biomedical domain. BMC Bioinformatics 2013;14:2. [PMID: 23323613 PMCID: PMC3621543 DOI: 10.1186/1471-2105-14-2] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2012] [Accepted: 12/29/2012] [Indexed: 11/24/2022] Open

Abstract

Background

Biomedical corpora annotated with event-level information represent an important resource for domain-specific information extraction (IE) systems. However, bio-event annotation alone cannot cater for all the needs of biologists. Unlike work on relation and event extraction, most of which focusses on specific events and named entities, we aim to build a comprehensive resource, covering all statements of causal association present in discourse. Causality lies at the heart of biomedical knowledge, such as diagnosis, pathology or systems biology, and, thus, automatic causality recognition can greatly reduce the human workload by suggesting possible causal connections and aiding in the curation of pathway models. A biomedical text corpus annotated with such relations is, hence, crucial for developing and evaluating biomedical text mining.

Results

We have defined an annotation scheme for enriching biomedical domain corpora with causality relations. This schema has subsequently been used to annotate 851 causal relations to form BioCause, a collection of 19 open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents have been pre-annotated with named entity and event information in the context of previous shared tasks. We report an inter-annotator agreement rate of over 60% for triggers and of over 80% for arguments using an exact match constraint. These increase significantly using a relaxed match setting. Moreover, we analyse and describe the causality relations in BioCause from various points of view. This information can then be leveraged for the training of automatic causality detection systems.

Conclusion

Augmenting named entity and event annotations with information about causal discourse relations could benefit the development of more sophisticated IE systems. These will further influence the development of multiple tasks, such as enabling textual inference to detect entailments, discovering new facts and providing new hypotheses for experimental work.

Collapse

Neves M, Leser U. A survey on annotation tools for the biomedical literature. Brief Bioinform 2012;15:327-40. [PMID: 23255168 DOI: 10.1093/bib/bbs084] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023] Open

Ozyurt IB. Automatic identification and classification of noun argument structures in biomedical literature. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012;9:1639-1648. [PMID: 22868678 DOI: 10.1109/tcbb.2012.111] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]

Kang N, Singh B, Afzal Z, van Mulligen EM, Kors JA. Using rule-based natural language processing to improve disease normalization in biomedical text. J Am Med Inform Assoc 2012;20:876-81. [PMID: 23043124 PMCID: PMC3756254 DOI: 10.1136/amiajnl-2012-001173] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022] Open

Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA, Cohen KB, Verspoor K, Blake JA, Hunter LE. Concept annotation in the CRAFT corpus. BMC Bioinformatics 2012;13:161. [PMID: 22776079 PMCID: PMC3476437 DOI: 10.1186/1471-2105-13-161] [Citation(s) in RCA: 109] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2011] [Accepted: 06/08/2012] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

RESULTS

This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

CONCLUSIONS

As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Collapse

Hahn U, Cohen KB, Garten Y, Shah NH. Mining the pharmacogenomics literature--a survey of the state of the art. Brief Bioinform 2012;13:460-94. [PMID: 22833496 PMCID: PMC3404399 DOI: 10.1093/bib/bbs018] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2011] [Accepted: 03/23/2012] [Indexed: 01/05/2023] Open

Pyysalo S, Ohta T, Rak R, Sullivan D, Mao C, Wang C, Sobral B, Tsujii J, Ananiadou S. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinformatics 2012;13 Suppl 11:S2. [PMID: 22759456 PMCID: PMC3384257 DOI: 10.1186/1471-2105-13-s11-s2] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open

Abstract

We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.

Collapse

Kilicoglu H, Bergler S. Biological event composition. BMC Bioinformatics 2012;13 Suppl 11:S7. [PMID: 22759461 PMCID: PMC3384260 DOI: 10.1186/1471-2105-13-s11-s7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open

Abstract

BACKGROUND

In recent years, biological event extraction has emerged as a key natural language processing task, aiming to address the information overload problem in accessing the molecular biology literature. The BioNLP shared task competitions have contributed to this recent interest considerably. The first competition (BioNLP'09) focused on extracting biological events from Medline abstracts from a narrow domain, while the theme of the latest competition (BioNLP-ST'11) was generalization and a wider range of text types, event types, and subject domains were considered. We view event extraction as a building block in larger discourse interpretation and propose a two-phase, linguistically-grounded, rule-based methodology. In the first phase, a general, underspecified semantic interpretation is composed from syntactic dependency relations in a bottom-up manner. The notion of embedding underpins this phase and it is informed by a trigger dictionary and argument identification rules. Coreference resolution is also performed at this step, allowing extraction of inter-sentential relations. The second phase is concerned with constraining the resulting semantic interpretation by shared task specifications. We evaluated our general methodology on core biological event extraction and speculation/negation tasks in three main tracks of BioNLP-ST'11 (GENIA, EPI, and ID).

RESULTS

We achieved competitive results in GENIA and ID tracks, while our results in the EPI track leave room for improvement. One notable feature of our system is that its performance across abstracts and articles bodies is stable. Coreference resolution results in minor improvement in system performance. Due to our interest in discourse-level elements, such as speculation/negation and coreference, we provide a more detailed analysis of our system performance in these subtasks.

CONCLUSIONS

The results demonstrate the viability of a robust, linguistically-oriented methodology, which clearly distinguishes general semantic interpretation from shared task specific aspects, for biological event extraction. Our error analysis pinpoints some shortcomings, which we plan to address in future work within our incremental system development methodology.

Collapse

Miwa M, Thompson P, McNaught J, Kell DB, Ananiadou S. Extracting semantically enriched events from biomedical literature. BMC Bioinformatics 2012;13:108. [PMID: 22621266 PMCID: PMC3464657 DOI: 10.1186/1471-2105-13-108] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2011] [Accepted: 05/23/2012] [Indexed: 11/21/2022] Open

Abstract

BACKGROUND

Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them.

RESULTS

Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP'09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP'09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task.

CONCLUSIONS

We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.

Collapse

Miwa M, Thompson P, Ananiadou S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics 2012;28:1759-65. [PMID: 22539668 PMCID: PMC3381963 DOI: 10.1093/bioinformatics/bts237] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Kilicoglu H, Rosemblat G, Fiszman M, Rindflesch TC. Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics 2011;12:486. [PMID: 22185221 PMCID: PMC3281188 DOI: 10.1186/1471-2105-12-486] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2011] [Accepted: 12/20/2011] [Indexed: 11/30/2022] Open

Abstract

Background

Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology.

Results

We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, the agreement increases by 12% (0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations.

Conclusions

While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increasing agreement in the main annotation phase points out that an acceptable level of agreement can be achieved in multiple iterations, by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. Annotating predications involving biomolecular entities and processes is particularly challenging. While the resulting gold standard is mainly intended to serve as a test collection for our semantic interpreter, we believe that the lessons learned are applicable generally.

Collapse

Carreira R, Carneiro S, Pereira R, Rocha M, Rocha I, Ferreira EC, Lourenço A. Semantic annotation of biological concepts interplaying microbial cellular responses. BMC Bioinformatics 2011;12:460. [PMID: 22122862 PMCID: PMC3259143 DOI: 10.1186/1471-2105-12-460] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2011] [Accepted: 11/28/2011] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Automated extraction systems have become a time saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes.

RESULTS

Here, we present a novel corpus concerning the integrated cellular responses to nutrient starvation in the model-organism Escherichia coli. Our corpus is a unique resource in that it annotates biomedical concepts that play a functional role in expression, regulation and metabolism. Namely, it includes annotations for genetic information carriers (genes and DNA, RNA molecules), proteins (transcription factors, enzymes and transporters), small metabolites, physiological states and laboratory techniques. The corpus consists of 130 full-text papers with a total of 59043 annotations for 3649 different biomedical concepts; the two dominant classes are genes (highest number of unique concepts) and compounds (most frequently annotated concepts), whereas other important cellular concepts such as proteins account for no more than 10% of the annotated concepts.

CONCLUSIONS

To the best of our knowledge, a corpus that details such a wide range of biological concepts has never been presented to the text mining community. The inter-annotator agreement statistics provide evidence of the importance of a consolidated background when dealing with such complex descriptions, the ambiguities naturally arising from the terminology and their impact for modelling purposes.Availability is granted for the full-text corpora of 130 freely accessible documents, the annotation scheme and the annotation guidelines. Also, we include a corpus of 340 abstracts.

Collapse

Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, Monachini M, Pezik P, Quochi V, Rupp CJ, Sasaki Y, Venturi G, Rebholz-Schuhmann D, Ananiadou S. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 2011;12:397. [PMID: 21992002 PMCID: PMC3228855 DOI: 10.1186/1471-2105-12-397] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2011] [Accepted: 10/12/2011] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.

RESULTS

This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.

CONCLUSIONS

The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.

Collapse

Thompson P, Nawaz R, McNaught J, Ananiadou S. Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinformatics 2011;12:393. [PMID: 21985429 PMCID: PMC3222636 DOI: 10.1186/1471-2105-12-393] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2011] [Accepted: 10/10/2011] [Indexed: 11/17/2022] Open

Abstract

Background

Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE) systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information, e.g., does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event.

Results

We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for 5 different aspects of meta-knowledge that can be derived from the textual context of the event. Textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme, as well as its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, falling in the range of 0.84-0.93 Kappa.

Conclusion

By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regards to the diversity of meta-knowledge aspects annotated for each event.

Collapse

Event extraction for DNA methylation. J Biomed Semantics 2011;2 Suppl 5:S2. [PMID: 22166595 PMCID: PMC3239302 DOI: 10.1186/2041-1480-2-s5-s2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open

Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 2011;11:1467-89. [PMID: 21047206 DOI: 10.2217/pgs.10.136] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open

Bada M, Hunter L. Desiderata for ontologies to be used in semantic annotation of biomedical documents. J Biomed Inform 2010;44:94-101. [PMID: 20971216 DOI: 10.1016/j.jbi.2010.10.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2009] [Revised: 10/03/2010] [Accepted: 10/09/2010] [Indexed: 11/20/2022]

Ananiadou S, Pyysalo S, Tsujii J, Kell DB. Event extraction for systems biology by text mining the literature. Trends Biotechnol 2010;28:381-90. [PMID: 20570001 DOI: 10.1016/j.tibtech.2010.04.005] [Citation(s) in RCA: 140] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2010] [Revised: 04/20/2010] [Accepted: 04/26/2010] [Indexed: 01/08/2023]