Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, Monachini M, Pezik P, Quochi V, Rupp CJ, Sasaki Y, Venturi G, Rebholz-Schuhmann D, Ananiadou S. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 2011;12:397. [PMID: 21992002 PMCID: PMC3228855 DOI: 10.1186/1471-2105-12-397] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2011] [Accepted: 10/12/2011] [Indexed: 11/10/2022] Open

For:	Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, Monachini M, Pezik P, Quochi V, Rupp CJ, Sasaki Y, Venturi G, Rebholz-Schuhmann D, Ananiadou S. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 2011;12:397. [PMID: 21992002 PMCID: PMC3228855 DOI: 10.1186/1471-2105-12-397] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2011] [Accepted: 10/12/2011] [Indexed: 11/10/2022] Open

Number

Cited by Other Article(s)

Campillos-Llanos L. MedLexSp - a medical lexicon for Spanish medical natural language processing. J Biomed Semantics 2023;14:2. [PMID: 36732862 PMCID: PMC9892682 DOI: 10.1186/s13326-022-00281-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2022] [Accepted: 12/03/2022] [Indexed: 02/04/2023] Open

Abstract

BACKGROUND

Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, and linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, there is no such type of resource for Spanish.

CONSTRUCTION AND CONTENT

This article describes an unified medical lexicon for Medical Natural Language Processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System[Formula: see text] (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases vs. 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp. First, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials. Second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default Spacy and Stanza python libraries.

CONCLUSIONS

The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the Spacy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the Spacy and Stanza lemmatizers enriched with medical terms are provided in a public repository.

Collapse

Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022;5:186. [PMID: 36544046 PMCID: PMC9770568 DOI: 10.1038/s41746-022-00730-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2022] [Accepted: 11/29/2022] [Indexed: 12/24/2022] Open

Affiliation(s)

Honghan Wu Institute of Health Informatics, University College London, London, UK.
Minhong Wang Institute of Health Informatics, University College London, London, UK
Jinge Wu Institute of Health Informatics, University College London, London, UK Usher Institute, University of Edinburgh, Edinburgh, UK
Farah Francis Usher Institute, University of Edinburgh, Edinburgh, UK
Yun-Hsuan Chang Institute of Health Informatics, University College London, London, UK
Alex Shavick Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
Hang Dong Usher Institute, University of Edinburgh, Edinburgh, UK Department of Computer Science, University of Oxford, Oxford, UK
Michael T C Poon Usher Institute, University of Edinburgh, Edinburgh, UK
Natalie Fitzpatrick Institute of Health Informatics, University College London, London, UK
Adam P Levine Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
Luke T Slater Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
Alex Handy Institute of Health Informatics, University College London, London, UK University College London Hospitals NHS Trust, London, UK
Andreas Karwath Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
Georgios V Gkoutos Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
Claude Chelala Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
Anoop Dinesh Shah Institute of Health Informatics, University College London, London, UK
Robert Stewart Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK South London and Maudsley NHS Foundation Trust, London, UK
Nigel Collier Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
Beatrice Alex Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
William Whiteley Usher Institute, University of Edinburgh, Edinburgh, UK
Cathie Sudlow Usher Institute, University of Edinburgh, Edinburgh, UK
Angus Roberts Department of Biostatistics & Health Informatics, King's College London, London, UK
Richard J B Dobson Institute of Health Informatics, University College London, London, UK Department of Biostatistics & Health Informatics, King's College London, London, UK

Collapse

Jung E, Jain H, Sinha AP, Gaudioso C. Building a specialized lexicon for breast cancer clinical trial subject eligibility analysis. Health Informatics J 2021;27:1460458221989392. [PMID: 33535885 DOI: 10.1177/1460458221989392] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]

Queirós P, Novikova P, Wilmes P, May P. Unification of functional annotation descriptions using text mining. Biol Chem 2021;402:983-990. [PMID: 33984880 DOI: 10.1515/hsz-2021-0125] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2021] [Accepted: 05/03/2021] [Indexed: 02/06/2023]

Inferring Drug-Protein⁻Side Effect Relationships from Biomedical Text. Genes (Basel) 2019;10:genes10020159. [PMID: 30791472 PMCID: PMC6409686 DOI: 10.3390/genes10020159] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 02/13/2019] [Accepted: 02/14/2019] [Indexed: 11/16/2022] Open

Abstract

Background: Although there are many studies of drugs and their side effects, the underlying mechanisms of these side effects are not well understood. It is also difficult to understand the specific pathways between drugs and side effects. Objective: The present study seeks to construct putative paths between drugs and their side effects by applying text-mining techniques to free text of biomedical studies, and to develop ranking metrics that could identify the most-likely paths. Materials and Methods: We extracted three types of relationships—drug-protein, protein-protein, and protein–side effect—from biomedical texts by using text mining and predefined relation-extraction rules. Based on the extracted relationships, we constructed whole drug-protein–side effect paths. For each path, we calculated its ranking score by a new ranking function that combines corpus- and ontology-based semantic similarity as well as co-occurrence frequency. Results: We extracted 13 plausible biomedical paths connecting drugs and their side effects from cancer-related abstracts in the PubMed database. The top 20 paths were examined, and the proposed ranking function outperformed the other methods tested, including co-occurrence, COALS, and UMLS by P@5-P@20. In addition, we confirmed that the paths are novel hypotheses that are worth investigating further. Discussion: The risk of side effects has been an important issue for the US Food and Drug Administration (FDA). However, the causes and mechanisms of such side effects have not been fully elucidated. This study extends previous research on understanding drug side effects by using various techniques such as Named Entity Recognition (NER), Relation Extraction (RE), and semantic similarity. Conclusion: It is not easy to reveal the biomedical mechanisms of side effects due to a huge number of possible paths. However, we automatically generated predictable paths using the proposed approach, which could provide meaningful information to biomedical researchers to generate plausible hypotheses for the understanding of such mechanisms.

Collapse

Chiu B, Majewska O, Pyysalo S, Wey L, Stenius U, Korhonen A, Palmer M. A neural classification method for supporting the creation of BioVerbNet. J Biomed Semantics 2019;10:2. [PMID: 30658707 PMCID: PMC6339329 DOI: 10.1186/s13326-018-0193-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2018] [Accepted: 12/19/2018] [Indexed: 11/21/2022] Open

Constructing a biodiversity terminological inventory. PLoS One 2017;12:e0175277. [PMID: 28414821 PMCID: PMC5393592 DOI: 10.1371/journal.pone.0175277] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2016] [Accepted: 03/23/2017] [Indexed: 11/22/2022] Open

Abstract

The increasing growth of literature in biodiversity presents challenges to users who need to discover pertinent information in an efficient and timely manner. In response, text mining techniques offer solutions by facilitating the automated discovery of knowledge from large textual data. An important step in text mining is the recognition of concepts via their linguistic realisation, i.e., terms. However, a given concept may be referred to in text using various synonyms or term variants, making search systems likely to overlook documents mentioning less known variants, which are albeit relevant to a query term. Domain-specific terminological resources, which include term variants, synonyms and related terms, are thus important in supporting semantic search over large textual archives. This article describes the use of text mining methods for the automatic construction of a large-scale biodiversity term inventory. The inventory consists of names of species, amongst which naming variations are prevalent. We apply a number of distributional semantic techniques on all of the titles in the Biodiversity Heritage Library, to compute semantic similarity between species names and support the automated construction of the resource. With the construction of our biodiversity term inventory, we demonstrate that distributional semantic models are able to identify semantically similar names that are not yet recorded in existing taxonomies. Such methods can thus be used to update existing taxonomies semi-automatically by deriving semantically related taxonomic names from a text corpus and allowing expert curators to validate them. We also evaluate our inventory as a means to improve search by facilitating automatic query expansion. Specifically, we developed a visual search interface that suggests semantically related species names, which are available in our inventory but not always in other repositories, to incorporate into the search query. An assessment of the interface by domain experts reveals that our query expansion based on related names is useful for increasing the number of relevant documents retrieved. Its exploitation can benefit both users and developers of search engines and text mining applications.

Collapse

Question-driven topic-based extraction of Protein–Protein Interaction Methods from biomedical literature. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.04.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]

Kim S, Islamaj Doğan R, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Batista-Navarro R, Carter J, Ananiadou S, Matos S, Santos A, Campos D, Oliveira JL, Singh O, Jonnagaddala J, Dai HJ, Su ECY, Chang YC, Su YC, Chu CH, Chen CC, Hsu WL, Peng Y, Arighi C, Wu CH, Vijay-Shanker K, Aydın F, Hüsünbeyi ZM, Özgür A, Shin SY, Kwon D, Dolinski K, Tyers M, Wilbur WJ, Comeau DC. BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database (Oxford) 2016;2016:baw121. [PMID: 27589962 PMCID: PMC5009341 DOI: 10.1093/database/baw121] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2016] [Revised: 07/29/2016] [Accepted: 08/02/2016] [Indexed: 11/14/2022]

Affiliation(s)

Sun Kim National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Rezarta Islamaj Doğan National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Andrew Chatr-Aryamontri Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
Christie S Chang Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Rose Oughtred Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Jennifer Rust Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Riza Batista-Navarro National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Jacob Carter National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Sophia Ananiadou National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Sérgio Matos DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
André Santos DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
David Campos BMD Software, Lda, Rua Calouste Gulbenkian 1, 3810-074 Aveiro, Portugal
José Luís Oliveira DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
Onkar Singh Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Jitendra Jonnagaddala School of Public Health and Community Medicine, University of New South Wales, Kensington NSW 2033, Australia Prince of Wales Clinical School, University of New South Wales, Kensington NSW 2033, Australia
Hong-Jie Dai Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
Emily Chia-Yu Su Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Yung-Chun Chang Institute of Information Science, Academia Sinica, Taipei, Taiwan Department of Information Management, National Taiwan University, Taipei, Taiwan
Yu-Chen Su Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Chun-Han Chu Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chien Chin Chen Department of Information Management, National Taiwan University, Taipei, Taiwan
Wen-Lian Hsu Institute of Information Science, Academia Sinica, Taipei, Taiwan
Yifan Peng Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
Cecilia Arighi Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
Cathy H Wu Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
K Vijay-Shanker Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
Ferhat Aydın Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Zehra Melce Hüsünbeyi Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Arzucan Özgür Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Soo-Yong Shin Department of Biomedical Informatics, Asan Medical Center, 138-736 Seoul, South Korea
Dongseop Kwon Department of Computer Engineering, Myongji University, 449-728 Yongin, South Korea
Kara Dolinski Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Mike Tyers Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
W John Wilbur National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Donald C Comeau National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Collapse

Thompson P, Nawaz R, McNaught J, Ananiadou S. Enriching news events with meta-knowledge information. LANG RESOUR EVAL 2016. [DOI: 10.1007/s10579-016-9344-9] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]

Nguyen NTH, Miwa M, Tsuruoka Y, Tojo S. Identifying synonymy between relational phrases using word embeddings. J Biomed Inform 2015;56:94-102. [PMID: 26004792 DOI: 10.1016/j.jbi.2015.05.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2014] [Revised: 05/12/2015] [Accepted: 05/15/2015] [Indexed: 11/26/2022]

Ananiadou S, Thompson P, Nawaz R, McNaught J, Kell DB. Event-based text mining for biology and functional genomics. Brief Funct Genomics 2015;14:213-30. [PMID: 24907365 PMCID: PMC4499874 DOI: 10.1093/bfgp/elu015] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open

Soldatova LN, Nadis D, King RD, Basu PS, Haddi E, Baumlé V, Saunders NJ, Marwan W, Rudkin BB. EXACT2: the semantics of biomedical protocols. BMC Bioinformatics 2014;15 Suppl 14:S5. [PMID: 25472549 PMCID: PMC4255744 DOI: 10.1186/1471-2105-15-s14-s5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Abstract

Background

The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for the better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that ensures that all information required for the replication of experimental protocols is essential to achieve reproducibility.

Methods

We have developed the ontology EXACT2 (EXperimental ACTions) that is designed to capture the full semantics of biomedical protocols required for their reproducibility.

To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information we utilized text-mining tools to translate the protocols into a machine amenable format. We have verified the utility of EXACT2 through the successful processing of previously 'unseen' (not used for the construction of EXACT2) protocols.

Results

The paper reports on a fundamentally new version EXACT2 that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case. In this EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine amenable format is proposed.

Conclusions

The EXACT2 ontology is sufficient to record, in a machine processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions, and can be used by various computer applications. It can serve as a reference model for for the translation of biomedical protocols in natural language into a semantically-defined format.

Collapse

Blair DR, Wang K, Nestorov S, Evans JA, Rzhetsky A. Quantifying the impact and extent of undocumented biomedical synonymy. PLoS Comput Biol 2014;10:e1003799. [PMID: 25255227 PMCID: PMC4177665 DOI: 10.1371/journal.pcbi.1003799] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2013] [Accepted: 06/26/2014] [Indexed: 12/14/2022] Open

Abstract

Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through “crowd-sourcing.” Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for “next-generation,” high-coverage lexical terminologies.

Automated systems that extract and integrate information from the research literature have become common in biomedicine. As the same meaning can be expressed in many distinct but synonymous ways, access to comprehensive thesauri may enable such systems to maximize their performance. Here, we establish the importance of synonymy for a specific text-mining task (named-entity normalization), and we suggest that current thesauri may be woefully inadequate in their documentation of this linguistic phenomenon. To test this claim, we develop a model for estimating the amount of missing synonymy. We apply our model to both biomedical terminologies and general-English thesauri, predicting massive amounts of missing synonymy for both lexicons. Furthermore, we verify some of our predictions for the latter domain through “crowd-sourcing.” Overall, our work highlights the dramatic incompleteness of current biomedical thesauri, and to mitigate this issue, we propose the creation of “living” terminologies, which would automatically harvest undocumented synonymy and help smart machines enrich biomedicine.

Collapse

Rak R, Batista-Navarro RT, Carter J, Rowley A, Ananiadou S. Processing biological literature with customizable Web services supporting interoperable formats. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau064. [PMID: 25006225 PMCID: PMC4086403 DOI: 10.1093/database/bau064] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]

Rebholz-Schuhmann D, Grabmüller C, Kavaliauskas S, Croset S, Woollard P, Backofen R, Filsell W, Clark D. A case study: semantic integration of gene-disease associations for type 2 diabetes mellitus from literature and biomedical data resources. Drug Discov Today 2013;19:882-9. [PMID: 24201223 DOI: 10.1016/j.drudis.2013.10.024] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2012] [Revised: 09/24/2013] [Accepted: 10/28/2013] [Indexed: 10/26/2022]

Rebholz-Schuhmann D, Kafkas S, Kim JH, Li C, Jimeno Yepes A, Hoehndorf R, Backofen R, Lewin I. Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources. J Biomed Semantics 2013;4:28. [PMID: 24112383 PMCID: PMC4021975 DOI: 10.1186/2041-1480-4-28] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2012] [Accepted: 09/11/2013] [Indexed: 11/10/2022] Open

Abstract

Motivation

The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: Terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring, how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.

Results

In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best, if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and – on the other hand – the profiles of the false positive mistakes characterize the tagging solutions. Lex-Tag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions.

Conclusion

The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.

Collapse

Rebholz-Schuhmann D, Kim JH, Yan Y, Dixit A, Friteyre C, Hoehndorf R, Backofen R, Lewin I. Evaluation and cross-comparison of lexical entities of biological interest (LexEBI). PLoS One 2013;8:e75185. [PMID: 24124474 PMCID: PMC3790750 DOI: 10.1371/journal.pone.0075185] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2012] [Accepted: 08/14/2013] [Indexed: 01/12/2023] Open

Abstract

MOTIVATION

Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal does not only require that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).

RESULT

This study compiles a resource for lexical terms of biomedical interest in a standard format (called "LexEBI"), determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants and chemical entities amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show only little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities, both do comprise enzymes leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.

CONCLUSION

LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.

Collapse

A modular framework for biomedical concept recognition. BMC Bioinformatics 2013;14:281. [PMID: 24063607 PMCID: PMC3849280 DOI: 10.1186/1471-2105-14-281] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2013] [Accepted: 09/20/2013] [Indexed: 11/24/2022] Open

Abstract

Background

Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools.

Results

This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification.

Conclusions

Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.

Collapse

Application of Information Retrieval Approaches to Case Classification in the Vaccine Adverse Event Reporting System. Drug Saf 2013;36:573-82. [PMID: 23703591 DOI: 10.1007/s40264-013-0064-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]

Biomedical text mining and its applications in cancer research. J Biomed Inform 2013;46:200-11. [DOI: 10.1016/j.jbi.2012.10.007] [Citation(s) in RCA: 159] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2012] [Revised: 10/30/2012] [Accepted: 10/30/2012] [Indexed: 11/21/2022]

Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2013;15:856-77. [PMID: 23434632 DOI: 10.1093/bib/bbt006] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Rimell L, Lippincott T, Verspoor K, Johnson HL, Korhonen A. Acquisition and evaluation of verb subcategorization resources for biomedicine. J Biomed Inform 2013;46:228-37. [PMID: 23347886 DOI: 10.1016/j.jbi.2013.01.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2012] [Revised: 01/03/2013] [Accepted: 01/05/2013] [Indexed: 10/27/2022]

Abstract

BACKGROUND

Biomedical natural language processing (NLP) applications that have access to detailed resources about the linguistic characteristics of biomedical language demonstrate improved performance on tasks such as relation extraction and syntactic or semantic parsing. Such applications are important for transforming the growing unstructured information buried in the biomedical literature into structured, actionable information. In this paper, we address the creation of linguistic resources that capture how individual biomedical verbs behave. We specifically consider verb subcategorization, or the tendency of verbs to "select" co-occurrence with particular phrase types, which influences the interpretation of verbs and identification of verbal arguments in context. There are currently a limited number of biomedical resources containing information about subcategorization frames (SCFs), and these are the result of either labor-intensive manual collation, or automatic methods that use tools adapted to a single biomedical subdomain. Either method may result in resources that lack coverage. Moreover, the quality of existing verb SCF resources for biomedicine is unknown, due to a lack of available gold standards for evaluation.

RESULTS

This paper presents three new resources related to verb subcategorization frames in biomedicine, and four experiments making use of the new resources. We present the first biomedical SCF gold standards, capturing two different but widely-used definitions of subcategorization, and a new SCF lexicon, BioCat, covering a large number of biomedical sub-domains. We evaluate the SCF acquisition methodologies for BioCat with respect to the gold standards, and compare the results with the accuracy of the only previously existing automatically-acquired SCF lexicon for biomedicine, the BioLexicon. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. Finally, we explore the definition of subcategorization using these resources and its implications for biomedical NLP. All resources are made publicly available.

CONCLUSION

The SCF resources we have evaluated still show considerably lower accuracy than that reported with general English lexicons, demonstrating the need for domain- and subdomain-specific SCF acquisition tools for biomedicine. Our new gold standards reveal major differences when annotators use the different definitions. Moreover, evaluation of BioCat yields major differences in accuracy depending on the gold standard, demonstrating that the definition of subcategorization adopted will have a direct impact on perceived system accuracy for specific tasks.

Collapse

Approaches to verb subcategorization for biomedicine. J Biomed Inform 2012;46:212-27. [PMID: 23276747 DOI: 10.1016/j.jbi.2012.12.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2012] [Revised: 12/05/2012] [Accepted: 12/06/2012] [Indexed: 11/23/2022]

Hahn U, Cohen KB, Garten Y, Shah NH. Mining the pharmacogenomics literature--a survey of the state of the art. Brief Bioinform 2012;13:460-94. [PMID: 22833496 PMCID: PMC3404399 DOI: 10.1093/bib/bbs018] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2011] [Accepted: 03/23/2012] [Indexed: 01/05/2023] Open

Wu ST, Liu H, Li D, Tao C, Musen MA, Chute CG, Shah NH. Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis. J Am Med Inform Assoc 2012;19:e149-56. [PMID: 22493050 PMCID: PMC3392861 DOI: 10.1136/amiajnl-2011-000744] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open

Liu H, Christiansen T, Baumgartner WA, Verspoor K. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. J Biomed Semantics 2012;3:3. [PMID: 22464129 PMCID: PMC3359276 DOI: 10.1186/2041-1480-3-3] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2011] [Accepted: 04/01/2012] [Indexed: 11/10/2022] Open