1
Gong L, Yang R, Liu Q, Dong Z, Chen H, Yang G. A Dictionary-Based Approach for Identifying Biomedical Concepts. INT J PATTERN RECOGN 2017. [DOI: 10.1142/s021800141757004x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
In this research, we present a dictionary-based approach for identifying biomedical concepts in the literature. The approach first crawls an experimental corpus using E-utilities and builds a concept dictionary. We then developed an algorithm called the Variable-step Window Identification Algorithm (VWIA), which matches biomedical concepts based on preprocessing, POS tagging and the formation of phrase blocks. The approach can identify embedded biomedical concepts and new concepts, and therefore identifies concepts more completely. The proposed approach obtained an overall F-measure of 95.0% on the test dataset. Thus, it is a promising method for biomedical text mining.
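As a rough illustration of dictionary-based matching with a window whose size varies, the sketch below tries the widest token window first and narrows it until a dictionary entry is found. It is not the authors' VWIA: the function name `match_concepts`, the `max_window` parameter, and the toy dictionary are assumptions made for this example, and the sketch omits POS tagging, phrase blocks, and the detection of embedded or new concepts.

```python
def match_concepts(tokens, dictionary, max_window=4):
    """Greedy longest-match lookup with a window that shrinks on failure.

    Hypothetical helper for illustration only; not the published VWIA.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # Try the widest window first, then narrow it one token at a time.
        for size in range(min(max_window, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size]).lower()
            if phrase in dictionary:
                matches.append((i, i + size, phrase))
                i += size  # step past the matched phrase
                break
        else:
            i += 1  # nothing matched at this position; advance one token
    return matches


# Toy dictionary and sentence (assumed data, not from the paper).
concept_dictionary = {"breast cancer", "brca1", "tumor suppressor gene"}
sentence = "BRCA1 is a tumor suppressor gene linked to breast cancer"
found = match_concepts(sentence.split(), concept_dictionary)
```

Each match is a `(start, end, phrase)` tuple over token positions, so the step taken after a hit varies with the length of the matched phrase.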
Affiliation(s)
- Lejun Gong
  - Jiangsu Key Lab of Big Data Security and Intelligent Processing, Jiangsu High Technology Research Key Lab for Wireless Sensor Networks, College of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P. R. China
- Ronggen Yang
  - College of Intelligent Science and Control Engineering, Jinling Institute of Technology, Nanjing, 211169, P. R. China
- Quan Liu
  - Jiangsu Key Lab of Big Data Security and Intelligent Processing, Jiangsu High Technology Research Key Lab for Wireless Sensor Networks, College of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P. R. China
- Zhenjiang Dong
  - Zhongxing Telecommunication Equipment Corporation, Shenzhen, 518057, P. R. China
- Hong Chen
  - Zhongxing Telecommunication Equipment Corporation, Shenzhen, 518057, P. R. China
- Geng Yang
  - Jiangsu Key Lab of Big Data Security and Intelligent Processing, Jiangsu High Technology Research Key Lab for Wireless Sensor Networks, College of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P. R. China
2
Mahmood ASMA, Wu TJ, Mazumder R, Vijay-Shanker K. DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One 2016; 11:e0152725. [PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Received: 04/17/2015] [Accepted: 03/19/2016] [Indexed: 11/22/2022] Open
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation-disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall, with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has also been evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low-precision problems suffered by most approaches. DiMeX was applied to a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators in enriching mutation databases.
Affiliation(s)
- A. S. M. Ashique Mahmood
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Tsung-Jung Wu
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, District of Columbia, United States of America
- K. Vijay-Shanker
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
3
Klein A, Riazanov A, Hindle MM, Baker CJO. Benchmarking infrastructure for mutation text mining. J Biomed Semantics 2014; 5:11. [PMID: 24568600 PMCID: PMC3939821 DOI: 10.1186/2041-1480-5-11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/22/2013] [Accepted: 02/05/2014] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND: Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. RESULTS: We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents that can support mutation grounding and mutation impact extraction experiments. CONCLUSION: We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.
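The performance metrics mentioned above are typically precision, recall and F-measure over predicted versus gold annotations. The paper computes them with SPARQL over RDF annotations; the sketch below is a plain-Python stand-in for the same arithmetic, not the authors' queries. The function name `evaluate` and the toy annotation pairs are assumptions for this example.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F-measure) for two annotation collections.

    Hypothetical helper: annotations are compared by exact equality.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy (document, mutation) annotations; assumed data, not from the corpus.
gold = {("doc1", "A123T"), ("doc1", "G45D"), ("doc2", "L10P"), ("doc2", "R7K")}
predicted = {("doc1", "A123T"), ("doc2", "L10P"), ("doc2", "V8M")}
precision, recall, f_measure = evaluate(predicted, gold)
```

Here two of three predictions are correct and two of four gold annotations are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.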
Affiliation(s)
- Artjom Klein
  - Computer Science And Applied Statistics Department, University of New Brunswick, Saint John, Canada
- Matthew M Hindle
  - Synthetic and Systems Biology, Edinburgh University, Edinburgh, UK
- Christopher JO Baker
  - Computer Science And Applied Statistics Department, University of New Brunswick, Saint John, Canada
4
Wei CH, Harris BR, Kao HY, Lu Z. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013; 29:1433-9. [PMID: 23564842 DOI: 10.1093/bioinformatics/btt156] [Citation(s) in RCA: 101] [Impact Index Per Article: 8.4] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Text-mining mutation information from the literature has become a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used to assist the creation of disease-related mutation databases. Most existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual effort in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy. RESULTS: Here, we report tmVar, a text-mining approach based on conditional random fields (CRFs) for extracting a wide range of sequence variants described at the protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature. AVAILABILITY: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar
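To make the limitation of rule-based extraction concrete, the sketch below shows a single hand-written rule for protein point mutations in one-letter notation (for example `A123T`). It is not tmVar, which is CRF-based, nor any published rule set; the pattern `POINT_MUTATION`, the helper `find_point_mutations` and the sample sentence are assumptions for this example.

```python
import re

# The twenty standard amino acids in one-letter code.
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Wild-type residue, position, mutant residue, e.g. "A123T".
POINT_MUTATION = re.compile(rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


# Assumed sample sentence mixing protein-level and DNA-level mentions.
text = "The A123T and G45D substitutions reduced activity, unlike c.76A>T."
mutations = find_point_mutations(text)
```

The rule finds `A123T` and `G45D` but misses the DNA-level mention `c.76A>T`, so covering it would need a further hand-written rule; this is the kind of manual effort the abstract describes.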
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), 8600 Rockville Pike, Bethesda, MD 20894, USA
5
Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing. PLoS One 2012; 7:e39230. [PMID: 22745720 PMCID: PMC3383748 DOI: 10.1371/journal.pone.0039230] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 03/19/2012] [Accepted: 05/21/2012] [Indexed: 11/25/2022] Open
Abstract
Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use SVM to separate biological terms from non-biological terms, before we use CRFs to determine the types of biological terms, which makes full use of the power of SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop a useful algorithm and apply two rules. To ensure biological terms with a maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach will also gradually extend the context so more contextual information can be included. We examined the performance of four approaches with GENIA corpus and JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.
6
Vroling B, Thorne D, McDermott P, Attwood TK, Vriend G, Pettifer S. Integrating GPCR-specific information with full text articles. BMC Bioinformatics 2011; 12:362. [PMID: 21910883 PMCID: PMC3179973 DOI: 10.1186/1471-2105-12-362] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 06/28/2011] [Accepted: 09/12/2011] [Indexed: 11/29/2022] Open
Abstract
Background: With the continued growth in the volume both of experimental G protein-coupled receptor (GPCR) data and of the related peer-reviewed literature, the ability of GPCR researchers to keep up-to-date is becoming increasingly curtailed. Results: We present work that integrates the biological data and annotations in the GPCR information system (GPCRDB) with next-generation methods for intelligently exploring, visualising and interacting with the scientific articles used to disseminate them. This solution automatically retrieves relevant information from GPCRDB and displays it both within and as an adjunct to an article. Conclusions: This approach allows researchers to extract more knowledge more swiftly from literature. Importantly, it allows reinterpretation of data in articles published before GPCR structure data became widely available, thereby rescuing these valuable data from long-dormant sources.
Affiliation(s)
- Bas Vroling
- CMBI, NCMLS, Radboud University Nijmegen Medical Centre, Geert Grooteplein 26-28, Nijmegen, 6525 GA, The Netherlands
7
Laurila JB, Naderi N, Witte R, Riazanov A, Kouznetsov A, Baker CJO. Algorithms and semantic infrastructure for mutation impact extraction and grounding. BMC Genomics 2010; 11 Suppl 4:S24. [PMID: 21143808 PMCID: PMC3005927 DOI: 10.1186/1471-2164-11-s4-s24] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 02/02/2023] Open
Abstract
Background: Mutation impact extraction is a hitherto unaccomplished task in state-of-the-art mutation extraction systems. Protein mutations and their impacts on protein properties are hidden in scientific literature, making them poorly accessible for protein engineers and inaccessible for phenotype-prediction systems that currently depend on manually curated genomic variation databases. Results: We present the first rule-based approach for the extraction of mutation impacts on protein properties, categorizing their directionality as positive, negative or neutral. Furthermore, protein and mutation mentions are grounded to their respective UniProtKB IDs, and selected protein properties, namely protein functions, to concepts found in the Gene Ontology. The extracted entities are populated to an OWL-DL Mutation Impact ontology, facilitating complex querying for mutation impacts using SPARQL. We illustrate retrieval of proteins and mutant sequences for a given direction of impact on specific protein properties. Moreover, we provide programmatic access to the data through semantic web services using the SADI (Semantic Automated Discovery and Integration) framework. Conclusion: We address the problem of access to legacy mutation data in unstructured form through the creation of novel mutation impact extraction methods, which are evaluated on a corpus of full-text articles on haloalkane dehalogenases, tagged by domain experts. Our approaches show state-of-the-art levels of precision and recall for Mutation Grounding and a respectable level of precision but lower recall for the task of Mutant-Impact relation extraction. The system is deployed using text mining and semantic web technologies with the goal of publishing to a broad spectrum of consumers.
Affiliation(s)
- Jonas B Laurila
- Department of Computer Science & Applied Statistics, University of New Brunswick, Saint John, New Brunswick, Canada.
8
Preissner S, Kroll K, Dunkel M, Senger C, Goldsobel G, Kuzman D, Guenther S, Winnenburg R, Schroeder M, Preissner R. SuperCYP: a comprehensive database on Cytochrome P450 enzymes including a tool for analysis of CYP-drug interactions. Nucleic Acids Res 2009; 38:D237-43. [PMID: 19934256 PMCID: PMC2808967 DOI: 10.1093/nar/gkp970] [Citation(s) in RCA: 185] [Impact Index Per Article: 11.6] [Indexed: 12/20/2022] Open
Abstract
Much of the information on the Cytochrome P450 enzymes (CYPs) is spread across literature and the internet. Aggregating knowledge about CYPs into one database makes the search more efficient. Text mining on 57 CYPs and drugs led to a mass of papers, which were screened manually for facts about metabolism, SNPs and their effects on drug degradation. Information was put into a database, which enables the user not only to look up a particular CYP and all metabolized drugs, but also to check tolerability of drug-cocktails and to find alternative combinations, to use metabolic pathways more efficiently. The SuperCYP database contains 1170 drugs with more than 3800 interactions including references. Approximately 2000 SNPs and mutations are listed and ordered according to their effect on expression and/or activity. SuperCYP (http://bioinformatics.charite.de/supercyp) is a comprehensive resource focused on CYPs and drug metabolism. Homology-modeled structures of the CYPs can be downloaded in PDB format and related drugs are available as MOL-files. Within the resource, CYPs can be aligned with each other, drug-cocktails can be 'mixed', SNPs, protein point mutations, and their effects can be viewed and corresponding PubMed IDs are given. SuperCYP is meant to be a platform and a starting point for scientists and health professionals for furthering their research.
Affiliation(s)
- Saskia Preissner
- Structural Bioinformatics Group, Institute of Physiology, Charité-University Medicine Berlin, Arnimallee 22, 14197 Berlin, Germany
9
Marsico A, Scheubert K, Tuukkanen A, Henschel A, Winter C, Winnenburg R, Schroeder M. MeMotif: a database of linear motifs in alpha-helical transmembrane proteins. Nucleic Acids Res 2009; 38:D181-9. [PMID: 19910368 PMCID: PMC2808916 DOI: 10.1093/nar/gkp1042] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 01/09/2023] Open
Abstract
Membrane proteins are important for many processes in the cell and are major drug targets. The increasing number of available high-resolution structures makes it possible, for the first time, to characterize local structural and functional motifs in α-helical transmembrane proteins. MeMotif (http://projects.biotec.tu-dresden.de/memotif) is a database and wiki that collects more than 2000 known and novel computationally predicted linear motifs in α-helical transmembrane proteins. Motifs are fully described in terms of several structural and functional features and are editable. Motifs contained in MeMotif can be used in different biological applications, from the identification of biochemically important functional residues that are candidates for mutagenesis experiments to the improvement of tools for transmembrane protein modeling.
Affiliation(s)
- Annalisa Marsico
- Bioinformatics Department, Biotechnology Center, TU Dresden, Tatzberg 47/49, 01307 Dresden, Germany.
10
Baker CJO, Rebholz-Schuhmann D. Between proteins and phenotypes: annotation and interpretation of mutations. BMC Bioinformatics 2009; 10 Suppl 8:I1. [PMID: 19758463 PMCID: PMC2745581 DOI: 10.1186/1471-2105-10-s8-i1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open