1
|
Kumar A, Sharaff A. ABEE: automated bio entity extraction from biomedical text documents. DATA TECHNOLOGIES AND APPLICATIONS 2022. [DOI: 10.1108/dta-04-2022-0151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
PurposeThe purpose of this study was to design a multitask learning model so that biomedical entities can be extracted without having any ambiguity from biomedical texts.Design/methodology/approachIn the proposed automated bio entity extraction (ABEE) model, a multitask learning model has been introduced with the combination of single-task learning models. Our model used Bidirectional Encoder Representations from Transformers to train the single-task learning model. Then combined model's outputs so that we can find the verity of entities from biomedical text.FindingsThe proposed ABEE model targeted unique gene/protein, chemical and disease entities from the biomedical text. The finding is more important in terms of biomedical research like drug finding and clinical trials. This research aids not only to reduce the effort of the researcher but also to reduce the cost of new drug discoveries and new treatments.Research limitations/implicationsAs such, there are no limitations with the model, but the research team plans to test the model with gigabyte of data and establish a knowledge graph so that researchers can easily estimate the entities of similar groups.Practical implicationsAs far as the practical implication concerned, the ABEE model will be helpful in various natural language processing task as in information extraction (IE), it plays an important role in the biomedical named entity recognition and biomedical relation extraction and also in the information retrieval task like literature-based knowledge discovery.Social implicationsDuring the COVID-19 pandemic, the demands for this type of our work increased because of the increase in the clinical trials at that time. If this type of research has been introduced previously, then it would have reduced the time and effort for new drug discoveries in this area.Originality/valueIn this work we proposed a novel multitask learning model that is capable to extract biomedical entities from the biomedical text without any ambiguity. The proposed model achieved state-of-the-art performance in terms of precision, recall and F1 score.
Collapse
|
2
|
Dai HJ, Wang CK, Chang NW, Huang MS, Jonnagaddala J, Wang FD, Hsu WL. Statistical principle-based approach for recognizing and normalizing microRNAs described in scientific literature. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2019; 2019:5365313. [PMID: 30809637 PMCID: PMC6391575 DOI: 10.1093/database/baz030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/28/2018] [Revised: 02/01/2019] [Accepted: 02/06/2019] [Indexed: 01/08/2023]
Abstract
The detection of MicroRNA (miRNA) mentions in scientific literature facilitates researchers with the ability to find relevant and appropriate literature based on queries formulated using miRNA information. Considering most published biological studies elaborated on signal transduction pathways or genetic regulatory information in the form of figure captions, the extraction of miRNA from both the main content and figure captions of a manuscript is useful in aggregate analysis and comparative analysis of the studies published. In this study, we present a statistical principle-based miRNA recognition and normalization method to identify miRNAs and link them to the identifiers in the Rfam database. As one of the core components in the text mining pipeline of the database miRTarBase, the proposed method combined the advantages of previous works relying on pattern, dictionary and supervised learning and provided an integrated solution for the problem of miRNA identification. Furthermore, the knowledge learned from the training data was organized in a human-interpretable manner to understand the reason why the system considers a span of text as a miRNA mention, and the represented knowledge can be further complemented by domain experts. We studied the ambiguity level of miRNA nomenclature to connect the miRNA mentions to the Rfam database and evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus by extending the later corpus with additional Rfam normalization information. Our study highlights and also proposes a better understanding of the challenges associated with miRNA identification and normalization in scientific literature and the research gap that needs to be further explored in prospective studies.
Collapse
Affiliation(s)
- Hong-Jie Dai
- Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, ROC
| | - Chen-Kai Wang
- Big Data Laboratories, Chunghwa Telecom Co., Taoyuan, Taiwan, ROC
| | - Nai-Wen Chang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan.,Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Ming-Siang Huang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Jitendra Jonnagaddala
- School of Public Health and Community Medicine, University of New South Wales, Sydney, Australia
| | - Feng-Duo Wang
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| |
Collapse
|
3
|
Borstein SR, O’Meara BC. AnnotationBustR: an R package to extract subsequences from GenBank annotations. PeerJ 2018; 6:e5179. [PMID: 30002984 PMCID: PMC6034590 DOI: 10.7717/peerj.5179] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2017] [Accepted: 06/18/2018] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND DNA sequences are pivotal for a wide array of research in biology. Large sequence databases, like GenBank, provide an amazing resource to utilize DNA sequences for large scale analyses. However, many sequence records on GenBank contain more than one gene or are portions of genomes. Inconsistencies in the way genes are annotated and the numerous synonyms a single gene may be listed under provide major challenges for extracting large numbers of subsequences for comparative analysis across taxa. At present, there is no easy way to extract portions from many GenBank accessions based on annotations where gene names may vary extensively. RESULTS The R package AnnotationBustR allows users to extract sequences based on GenBank annotations through the ACNUC retrieval system given search terms of gene synonyms and accession numbers. AnnotationBustR extracts subsequences of interest and then writes them to a FASTA file for users to employ in their research endeavors. CONCLUSION FASTA files of extracted subsequences and accession tables generated by AnnotationBustR allow users to quickly find and extract subsequences from GenBank accessions. These sequences can then be incorporated in various analyses, like the construction of phylogenies to test a wide range of ecological and evolutionary hypotheses.
Collapse
Affiliation(s)
- Samuel R. Borstein
- Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
| | - Brian C. O’Meara
- Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
| |
Collapse
|
4
|
Kang T, Zhang S, Tang Y, Hruby GW, Rusanov A, Elhadad N, Weng C. EliIE: An open-source information extraction system for clinical trial eligibility criteria. J Am Med Inform Assoc 2017; 24:1062-1071. [PMID: 28379377 PMCID: PMC6259668 DOI: 10.1093/jamia/ocx019] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2016] [Revised: 01/31/2017] [Accepted: 03/02/2017] [Indexed: 12/22/2022] Open
Abstract
OBJECTIVE To develop an open-source information extraction system called Eligibility Criteria Information Extraction (EliIE) for parsing and formalizing free-text clinical research eligibility criteria (EC) following Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.0. MATERIALS AND METHODS EliIE parses EC in 4 steps: (1) clinical entity and attribute recognition, (2) negation detection, (3) relation extraction, and (4) concept normalization and output structuring. Informaticians and domain experts were recruited to design an annotation guideline and generate a training corpus of annotated EC for 230 Alzheimer's clinical trials, which were represented as queries against the OMOP CDM and included 8008 entities, 3550 attributes, and 3529 relations. A sequence labeling-based method was developed for automatic entity and attribute recognition. Negation detection was supported by NegEx and a set of predefined rules. Relation extraction was achieved by a support vector machine classifier. We further performed terminology-based concept normalization and output structuring. RESULTS In task-specific evaluations, the best F1 score for entity recognition was 0.79, and for relation extraction was 0.89. The accuracy of negation detection was 0.94. The overall accuracy for query formalization was 0.71 in an end-to-end evaluation. CONCLUSIONS This study presents EliIE, an OMOP CDM-based information extraction system for automatic structuring and formalization of free-text EC. According to our evaluation, machine learning-based EliIE outperforms existing systems and shows promise to improve.
Collapse
Affiliation(s)
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Youlan Tang
- Institute of Human Nutrition, Columbia University, New York, NY, USA
| | - Gregory W Hruby
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Alexander Rusanov
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| |
Collapse
|
5
|
Asiaee AH, Minning T, Doshi P, Tarleton RL. A framework for ontology-based question answering with application to parasite immunology. J Biomed Semantics 2015; 6:31. [PMID: 26185615 PMCID: PMC4504081 DOI: 10.1186/s13326-015-0029-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2013] [Accepted: 06/19/2015] [Indexed: 11/15/2022] Open
Abstract
Background Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms. With ontologies proliferating, data is increasingly being stored using the RDF data model and queried using RDF based querying languages. While existing systems facilitate the querying in various ways, the scientist must map the question in his or her mind to the interface used by the systems. The field of natural language processing has long investigated the challenges of designing natural language based retrieval systems. Recent efforts seek to bring the ability to pose natural language questions to RDF data querying systems while leveraging the associated ontologies. These analyze the input question and extract triples (subject, relationship, object), if possible, mapping them to RDF triples in the data. However, in the biomedical context, relationships between entities are not always explicit in the question and these are often complex involving many intermediate concepts. Results We present a new framework, OntoNLQA, for querying RDF data annotated using ontologies which allows posing questions in natural language. OntoNLQA offers five steps in order to answer natural language questions. In comparison to previous systems, OntoNLQA differs in how some of the methods are realized. In particular, it introduces a novel approach for discovering the sophisticated semantic associations that may exist between the key terms of a natural language question, in order to build an intuitive query and retrieve precise answers. We apply this framework to the context of parasite immunology data, leading to a system called AskCuebee that allows parasitologists to pose genomic, proteomic and pathway questions in natural language related to the parasite, Trypanosoma cruzi. We separately evaluate the accuracy of each component of OntoNLQA as implemented in AskCuebee and the accuracy of the whole system. AskCuebee answers 68 % of the questions in a corpus of 125 questions, and 60 % of the questions in a new previously unseen corpus. If we allow simple corrections by the scientists, this proportion increases to 92 %. Conclusions We introduce a novel framework for question answering and apply it to parasite immunology data. Evaluations of translating the questions to RDF triple queries by combining machine learning, lexical similarity matching with ontology classes, properties and instances for specificity, and discovering associations between them demonstrate that the approach performs well and improves on previous systems. Subsequently, OntoNLQA offers a viable framework for building question answering systems in other biomedical domains. Electronic supplementary material The online version of this article (doi:10.1186/s13326-015-0029-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Amir H Asiaee
- THINC Lab, Department of Computer Science, University of Georgia, Athens, GA USA
| | - Todd Minning
- Tarleton Research Group, Department of Cellular Biology, University of Georgia, Athens, GA USA
| | - Prashant Doshi
- THINC Lab, Department of Computer Science, University of Georgia, Athens, GA USA
| | - Rick L Tarleton
- Tarleton Research Group, Department of Cellular Biology, University of Georgia, Athens, GA USA
| |
Collapse
|
6
|
Kim S, Lu Z, Wilbur WJ. Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 2015; 16:57. [PMID: 25887671 PMCID: PMC4349776 DOI: 10.1186/s12859-015-0487-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2014] [Accepted: 01/30/2015] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Controlled vocabularies such as the Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE. RESULTS We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, "gene", "protein", "disease", "cell" and "cells". CONCLUSIONS Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature.
Collapse
Affiliation(s)
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA.
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA.
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA.
| |
Collapse
|
7
|
Collier N, Tran MV, Le HQ, Ha QT, Oellrich A, Rebholz-Schuhmann D. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013; 8:e72965. [PMID: 24155869 PMCID: PMC3796529 DOI: 10.1371/journal.pone.0072965] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2013] [Accepted: 07/15/2013] [Indexed: 11/19/2022] Open
Abstract
The identification of phenotype descriptions in the scientific literature, case reports and patient records is a rewarding task for bio-medical text mining. Any progress will support knowledge discovery and linkage to other resources. However because of their wide variation a number of challenges still remain in terms of their identification and semantic normalisation before they can be fully exploited for research purposes. This paper presents novel techniques for identifying potential complex phenotype mentions by exploiting a hybrid model based on machine learning, rules and dictionary matching. A systematic study is made of how to combine sequence labels from these modules as well as the merits of various ontological resources. We evaluated our approach on a subset of Medline abstracts cited by the Online Mendelian Inheritance of Man database related to auto-immune diseases. Using partial matching the best micro-averaged F-score for phenotypes and five other entity classes was 79.9%. A best performance of 75.3% was achieved for phenotype candidates using all semantics resources. We observed the advantage of using SVM-based learn-to-rank for sequence label combination over maximum entropy and a priority list approach. The results indicate that the identification of simple entity types such as chemicals and genes are robustly supported by single semantic resources, whereas phenotypes require combinations. Altogether we conclude that our approach coped well with the compositional structure of phenotypes in the auto-immune domain.
Collapse
Affiliation(s)
- Nigel Collier
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom
- National Institute of Informatics, Tokyo, Japan
- * E-mail:
| | - Mai-vu Tran
- National Institute of Informatics, Tokyo, Japan
- Knowledge Technology Laboratory, University of Engineering and Technology - VNU, Hanoi, Vietnam
| | - Hoang-quynh Le
- National Institute of Informatics, Tokyo, Japan
- Knowledge Technology Laboratory, University of Engineering and Technology - VNU, Hanoi, Vietnam
| | - Quang-Thuy Ha
- Knowledge Technology Laboratory, University of Engineering and Technology - VNU, Hanoi, Vietnam
| | - Anika Oellrich
- Mouse Informatics Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Dietrich Rebholz-Schuhmann
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom
- Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
| |
Collapse
|
8
|
Kim S, Kim W, Wei CH, Lu Z, Wilbur WJ. Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas042. [PMID: 23160415 PMCID: PMC3500521 DOI: 10.1093/database/bas042] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.
Collapse
Affiliation(s)
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | | | | | | | | |
Collapse
|
9
|
Schober D, Tudose I, Svatek V, Boeker M. OntoCheck: verifying ontology naming conventions and metadata completeness in Protégé 4. J Biomed Semantics 2012; 3 Suppl 2:S4. [PMID: 23046606 PMCID: PMC3448530 DOI: 10.1186/2041-1480-3-s2-s4] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although policy providers have outlined minimal metadata guidelines and naming conventions, ontologies of today still display inter- and intra-ontology heterogeneities in class labelling schemes and metadata completeness. This fact is at least partially due to missing or inappropriate tools. Software support can ease this situation and contribute to overall ontology consistency and quality by helping to enforce such conventions. OBJECTIVE We provide a plugin for the Protégé Ontology editor to allow for easy checks on compliance towards ontology naming conventions and metadata completeness, as well as curation in case of found violations. IMPLEMENTATION In a requirement analysis, derived from a prior standardization approach carried out within the OBO Foundry, we investigate the needed capabilities for software tools to check, curate and maintain class naming conventions. A Protégé tab plugin was implemented accordingly using the Protégé 4.1 libraries. The plugin was tested on six different ontologies. Based on these test results, the plugin could be refined, also by the integration of new functionalities. RESULTS The new Protégé plugin, OntoCheck, allows for ontology tests to be carried out on OWL ontologies. In particular the OntoCheck plugin helps to clean up an ontology with regard to lexical heterogeneity, i.e. enforcing naming conventions and metadata completeness, meeting most of the requirements outlined for such a tool. Found test violations can be corrected to foster consistency in entity naming and meta-annotation within an artefact. Once specified, check constraints like name patterns can be stored and exchanged for later re-use. Here we describe a first version of the software, illustrate its capabilities and use within running ontology development efforts and briefly outline improvements resulting from its application. Further, we discuss OntoChecks capabilities in the context of related tools and highlight potential future expansions. CONCLUSIONS The OntoCheck plugin facilitates labelling error detection and curation, contributing to lexical quality assurance in OWL ontologies. Ultimately, we hope this Protégé extension will ease ontology alignments as well as lexical post-processing of annotated data and hence can increase overall secondary data usage by humans and computers.
Collapse
Affiliation(s)
- Daniel Schober
- Institute of Medical Biometry and Medical Informatics (IMBI), University Medical Center, 79104 Freiburg, Germany.
| | | | | | | |
Collapse
|
10
|
Abstract
Background To access and utilize the rich information contained in the biomedical literature, the ability to recognize and normalize gene mentions referenced in the literature is crucial. In this paper, we focus on improvements to the accuracy of gene normalization in cases where species information is not provided. Gene names are often ambiguous, in that they can refer to the genes of many species. Therefore, gene normalization is a difficult challenge. Methods We define “gene normalization” as a series of tasks involving several issues, including gene name recognition, species assignation and species-specific gene normalization. We propose an integrated method, GenNorm, consisting of three modules to handle the issues of this task. Every issue can affect overall performance, though the most important is species assignation. Clearly, correct identification of the species can decrease the ambiguity of orthologous genes. Results In experiments, the proposed model attained the top-1 threshold average precision (TAP-k) scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) when tested against 50 articles that had been selected for their difficulty and the most divergent results from pooled team submissions. In the silver-standard-507 evaluation, our TAP-k scores are 0.4591 for k=5, 10, and 20 and were ranked 2nd, 2nd, and 3rd respectively. Availability A web service and input, output formats of GenNorm are available at http://ikmbio.csie.ncku.edu.tw/GN/.
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
| | | |
Collapse
|
11
|
Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 2011; 11:1467-89. [PMID: 21047206 DOI: 10.2217/pgs.10.136] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmocogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications.
Collapse
Affiliation(s)
- Yael Garten
- Biomedical Informatics, Stanford University, Stanford, CA 94305, USA
| | | | | |
Collapse
|
12
|
van Haagen HHHBM, 't Hoen PAC, Botelho Bovo A, de Morrée A, van Mulligen EM, Chichester C, Kors JA, den Dunnen JT, van Ommen GJB, van der Maarel SM, Kern VM, Mons B, Schuemie MJ. Novel protein-protein interactions inferred from literature context. PLoS One 2009; 4:e7894. [PMID: 19924298 PMCID: PMC2774517 DOI: 10.1371/journal.pone.0007894] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2009] [Accepted: 10/09/2009] [Indexed: 01/15/2023] Open
Abstract
We have developed a method that predicts Protein-Protein Interactions (PPIs) based on the similarity of the context in which proteins appear in literature. This method outperforms previously developed PPI prediction algorithms that rely on the conjunction of two protein names in MEDLINE abstracts. We show significant increases in coverage (76% versus 32%) and sensitivity (66% versus 41% at a specificity of 95%) for the prediction of PPIs currently archived in 6 PPI databases. A retrospective analysis shows that PPIs can efficiently be predicted before they enter PPI databases and before their interaction is explicitly described in the literature. The practical value of the method for discovery of novel PPIs is illustrated by the experimental confirmation of the inferred physical interaction between CAPN3 and PARVB, which was based on frequent co-occurrence of both proteins with concepts like Z-disc, dysferlin, and alpha-actinin. The relationships between proteins predicted by our method are broader than PPIs, and include proteins in the same complex or pathway. Dependent on the type of relationships deemed useful, the precision of our method can be as high as 90%. The full set of predicted interactions is available in a downloadable matrix and through the webtool Nermal, which lists the most likely interaction partners for a given protein. Our framework can be used for prioritizing potential interaction partners, hitherto undiscovered, for follow-up studies and to aid the generation of accurate protein interaction maps.
Collapse
Affiliation(s)
- Herman H H B M van Haagen
- Biosemantics Association, Department of Human Genetics, Leiden University Medical Center, Leiden, and Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
| | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
13
|
Roos M, Marshall MS, Gibson AP, Schuemie M, Meij E, Katrenko S, van Hage WR, Krommydas K, Adriaans PW. Structuring and extracting knowledge for the support of hypothesis generation in molecular biology. BMC Bioinformatics 2009; 10 Suppl 10:S9. [PMID: 19796406 PMCID: PMC2755830 DOI: 10.1186/1471-2105-10-s10-s9] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND Hypothesis generation in molecular and cellular biology is an empirical process in which knowledge derived from prior experiments is distilled into a comprehensible model. The requirement of automated support is exemplified by the difficulty of considering all relevant facts that are contained in the millions of documents available from PubMed. Semantic Web provides tools for sharing prior knowledge, while information retrieval and information extraction techniques enable its extraction from literature. Their combination makes prior knowledge available for computational analysis and inference. While some tools provide complete solutions that limit the control over the modeling and extraction processes, we seek a methodology that supports control by the experimenter over these critical processes. RESULTS We describe progress towards automated support for the generation of biomolecular hypotheses. Semantic Web technologies are used to structure and store knowledge, while a workflow extracts knowledge from text. We designed minimal proto-ontologies in OWL for capturing different aspects of a text mining experiment: the biological hypothesis, text and documents, text mining, and workflow provenance. The models fit a methodology that allows focus on the requirements of a single experiment while supporting reuse and posterior analysis of extracted knowledge from multiple experiments. Our workflow is composed of services from the 'Adaptive Information Disclosure Application' (AIDA) toolkit as well as a few others. The output is a semantic model with putative biological relations, with each relation linked to the corresponding evidence. CONCLUSION We demonstrated a 'do-it-yourself' approach for structuring and extracting knowledge in the context of experimental research on biomolecular mechanisms. The methodology can be used to bootstrap the construction of semantically rich biological models using the results of knowledge extraction processes. Models specific to particular experiments can be constructed that, in turn, link with other semantic models, creating a web of knowledge that spans experiments. Mapping mechanisms can link to other knowledge resources such as OBO ontologies or SKOS vocabularies. AIDA Web Services can be used to design personalized knowledge extraction procedures. In our example experiment, we found three proteins (NF-Kappa B, p21, and Bax) potentially playing a role in the interplay between nutrients and epigenetic gene regulation.
Collapse
Affiliation(s)
- Marco Roos
- grid.7177.60000000084992262Informatics Institute, University of Amsterdam, Amsterdam, 1098 SJ The Netherlands
| | - M Scott Marshall
- grid.7177.60000000084992262Informatics Institute, University of Amsterdam, Amsterdam, 1098 SJ The Netherlands
| | - Andrew P Gibson
- grid.7177.60000000084992262Swammerdam Institute for Life Science, University of Amsterdam, Amsterdam, 1018 WB The Netherlands
| | - Martijn Schuemie
- grid.6906.90000000092621349BioSemantics group, Erasmus University of Rotterdam, Rotterdam, 3000 DR The Netherlands
| | - Edgar Meij
- grid.7177.60000000084992262Informatics Institute, University of Amsterdam, Amsterdam, 1098 SJ The Netherlands
| | - Sophia Katrenko
- grid.7177.60000000084992262Informatics Institute, University of Amsterdam, Amsterdam, 1098 SJ The Netherlands
| | - Willem Robert van Hage
- grid.12380.380000000417549227Business Informatics, Faculty of Sciences, Vrije Universiteit, Amsterdam, 1081 HV The Netherlands
| | - Konstantinos Krommydas
- grid.7177.60000000084992262Informatics Institute, University of Amsterdam, Amsterdam, 1098 SJ The Netherlands
| | - Pieter W Adriaans
- grid.7177.60000000084992262Informatics Institute, University of Amsterdam, Amsterdam, 1098 SJ The Netherlands
| |
Collapse
|
14
|
Schober D, Smith B, Lewis SE, Kusnierczyk W, Lomax J, Mungall C, Taylor CF, Rocca-Serra P, Sansone SA. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 2009; 10:125. [PMID: 19397794 PMCID: PMC2684543 DOI: 10.1186/1471-2105-10-125] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2008] [Accepted: 04/27/2009] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND A wide variety of ontologies relevant to the biological and medical domains are available through the OBO Foundry portal, and their number is growing rapidly. Integration of these ontologies, while requiring considerable effort, is extremely desirable. However, heterogeneities in format and style pose serious obstacles to such integration. In particular, inconsistencies in naming conventions can impair the readability and navigability of ontology class hierarchies, and hinder their alignment and integration. While other sources of diversity are tremendously complex and challenging, agreeing a set of common naming conventions is an achievable goal, particularly if those conventions are based on lessons drawn from pooled practical experience and surveys of community opinion. RESULTS We summarize a review of existing naming conventions and highlight certain disadvantages with respect to general applicability in the biological domain. We also present the results of a survey carried out to establish which naming conventions are currently employed by OBO Foundry ontologies and to determine what their special requirements regarding the naming of entities might be. Lastly, we propose an initial set of typographic, syntactic and semantic conventions for labelling classes in OBO Foundry ontologies. CONCLUSION Adherence to common naming conventions is more than just a matter of aesthetics. Such conventions provide guidance to ontology creators, help developers avoid flaws and inaccuracies when editing, and especially when interlinking, ontologies. Common naming conventions will also assist consumers of ontologies to more readily understand what meanings were intended by the authors of ontologies used in annotating bodies of data.
Collapse
Affiliation(s)
- Daniel Schober
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Institute of Medical Biometry and Medical Informatics (IMBI), University Medical Center, 79104 Freiburg, Germany
| | - Barry Smith
- Center of Excellence in Bioinformatics and Life Sciences, and Department of Philosophy, University at Buffalo, NY, USA
| | - Suzanna E Lewis
- Berkeley Bioinformatics and Ontologies Project, Lawrence Berkeley National Labs, Berkeley, CA 94720 USA
| | - Waclaw Kusnierczyk
- Department of Information and Computer Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
| | - Jane Lomax
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Chris Mungall
- Berkeley Bioinformatics and Ontologies Project, Lawrence Berkeley National Labs, Berkeley, CA 94720 USA
| | - Chris F Taylor
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- NERC Environmental Bioinformatics Centre (NEBC), Mansfield Road, Oxford, OX1 3SR, UK
| | | | | |
Collapse
|
15
|
Baumgartner WA, Lu Z, Johnson HL, Caporaso JG, Paquette J, Lindemann A, White EK, Medvedeva O, Cohen KB, Hunter L. Concept recognition for extracting protein interaction relations from biomedical text. Genome Biol 2008; 9 Suppl 2:S9. [PMID: 18834500 PMCID: PMC2559993 DOI: 10.1186/gb-2008-9-s2-s9] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Background: Reliable information extraction applications have been a long sought goal of the biomedical text mining community, a goal that if reached would provide valuable tools to benchside biologists in their increasingly difficult task of assimilating the knowledge contained in the biomedical literature. We present an integrated approach to concept recognition in biomedical text. Concept recognition provides key information that has been largely missing from previous biomedical information extraction efforts, namely direct links to well defined knowledge resources that explicitly cement the concept's semantics. The BioCreative II tasks discussed in this special issue have provided a unique opportunity to demonstrate the effectiveness of concept recognition in the field of biomedical language processing. Results: Through the modular construction of a protein interaction relation extraction system, we present several use cases of concept recognition in biomedical text, and relate these use cases to potential uses by the benchside biologist. Conclusion: Current information extraction technologies are approaching performance standards at which concept recognition can begin to deliver high quality data to the benchside biologist. Our system is available as part of the BioCreative Meta-Server project and on the internet .
Collapse
Affiliation(s)
- William A Baumgartner
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
| | | | | | | | | | | | | | | | | | | |
Collapse
|
16
|
Fundel K, Zimmer R. Gene and protein nomenclature in public databases. BMC Bioinformatics 2006; 7:372. [PMID: 16899134 PMCID: PMC1560172 DOI: 10.1186/1471-2105-7-372] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2006] [Accepted: 08/09/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require the knowledge of all used names referring to a given gene or protein. Various organism-specific or general public databases aim at organizing knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguities and overlap. RESULTS We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries, to a lexicon of common English words and domain-related non-gene terms, and we compared different data sources in terms of size of extracted dictionaries and overlap of synonyms between those. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between different organisms. Furthermore, it shows that, despite considerable efforts of co-curation, the overlap of synonyms in different data sources is rather moderate and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms varies depending on the considered organism. CONCLUSION In conclusion, these results indicate that the combination of data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more used names than dictionaries obtained from individual data sources. Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed- or Google-search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application.
Collapse
Affiliation(s)
- Katrin Fundel
- Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 München, Germany
| | - Ralf Zimmer
- Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 München, Germany
| |
Collapse
|
17
|
Abstract
Background Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings. Results Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate. Conclusion We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours on PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.
Collapse
Affiliation(s)
- Aditya K Sehgal
- Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA
| | - Padmini Srinivasan
- Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA
- School of Library and Information Science, The University of Iowa, Iowa City, IA 52246, USA
| |
Collapse
|
18
|
Marshall B, Su H, McDonald D, Eggers S, Chen H. Aggregating automatically extracted regulatory pathway relations. ACTA ACUST UNITED AC 2006; 10:100-8. [PMID: 16445255 DOI: 10.1109/titb.2005.856857] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Automatic tools to extract information from biomedical texts are needed to help researchers leverage the vast and increasing body of biomedical literature. While several biomedical relation extraction systems have been created and tested, little work has been done to meaningfully organize the extracted relations. Organizational processes should consolidate multiple references to the same objects over various levels of granularity, connect those references to other resources, and capture contextual information. We propose a feature decomposition approach to relation aggregation to support a five-level aggregation framework. Our BioAggregate tagger uses this approach to identify key features in extracted relation name strings. We show encouraging feature assignment accuracy and report substantial consolidation in a network of extracted relations.
Collapse
|
19
|
Schijvenaars BJA, Mons B, Weeber M, Schuemie MJ, van Mulligen EM, Wain HM, Kors JA. Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics 2005; 6:149. [PMID: 15958172 PMCID: PMC1183190 DOI: 10.1186/1471-2105-6-149] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2004] [Accepted: 06/16/2005] [Indexed: 11/28/2022] Open
Abstract
Background Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. Results We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. Conclusion The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
Collapse
Affiliation(s)
- Bob JA Schijvenaars
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Barend Mons
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Marc Weeber
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Martijn J Schuemie
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Erik M van Mulligen
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Hester M Wain
- HUGO Gene Nomenclature Committee, Department of Biology, University College London, Wolfson House, 4 Stephenson Way, London NW1 2HE, UK
| | - Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| |
Collapse
|