1. Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data 2019; 6:52. [PMID: 31076572 PMCID: PMC6510737 DOI: 10.1038/s41597-019-0055-0] [Citation(s) in RCA: 150] [Impact Index Per Article: 30.0] [Received: 09/24/2018] [Accepted: 03/27/2019] [Indexed: 11/10/2022]
Abstract
Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring the information present in the internal structure of words and any information available in domain-specific structured resources such as ontologies. However, such information holds great potential for improving the quality of the word representation, as suggested in some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.
Affiliation(s)
- Yijia Zhang
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China
- Hongfei Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA
2. Müller HM, Van Auken KM, Li Y, Sternberg PW. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics 2018; 19:94. [PMID: 29523070 PMCID: PMC5845379 DOI: 10.1186/s12859-018-2103-8] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 07/13/2017] [Accepted: 03/01/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. RESULTS: We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. CONCLUSION: Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world. Textpresso Central URL: http://www.textpresso.org/tpc.
Affiliation(s)
- H.-M. Müller
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
- K. M. Van Auken
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
- Y. Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
- P. W. Sternberg
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
3. Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Teixeira D, Michel PA, Ruch P, Gaudet P. Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt. Database (Oxford) 2018; 2018:5255187. [PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/21/2018] [Revised: 10/04/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]
Abstract
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
Affiliation(s)
- Aurore Britan
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Isabelle Cusin
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valérie Hinard
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Luc Mottin
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Emilie Pasche
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Julien Gobeill
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valentine Rech de Laval
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Anne Gleizes
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Daniel Teixeira
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pierre-André Michel
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Patrick Ruch
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pascale Gaudet
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland