1
Arumugam K, Shanker RR. Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed. Methods Mol Biol 2022; 2496:159-177. [PMID: 35713864 DOI: 10.1007/978-1-0716-2305-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In modern health care research, protein phosphorylation has gained enormous attention from researchers across the globe, and automated approaches are required to process the huge volume of data on proteins and their modifications at the cellular level. The data generated at the cellular level are unique as well as arbitrary, and an accumulation of a massive volume of information is inevitable. Biological research has revealed that a huge array of cellular communication aided by protein phosphorylation and other similar mechanisms implies different and diverse meanings. This has led to the collection of a huge volume of data to understand the biological functions of human evolution, especially for combating diseases in a better way. Text mining, an automated approach to mining information from unstructured data, finds application in extracting protein phosphorylation information from biomedical literature databases such as PubMed. This chapter outlines a recent text mining protocol that applies natural language parsing (NLP) for named entity recognition and text processing, and support vector machines (SVM), a machine learning algorithm, for classifying the processed text related to human protein phosphorylation. We discuss the evaluation of the text mining system that results from the protocol on three corpora, namely, the human Protein Phosphorylation (hPP) corpus, the Integrated Protein Literature Information and Knowledge corpus (iProLink), and the Phosphorylation Literature corpus (PLC). We also present a basic understanding of the chemistry and biology that drive the protein phosphorylation process in the human body. We believe that this basic understanding will be useful for advancing existing text mining systems for extracting protein phosphorylation information from PubMed.
Affiliation(s)
- Krishnamurthy Arumugam
- Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India.
- Raja Ravi Shanker
- International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India.
2
Wang M, Xia H, Sun D, Chen Z, Wang M, Li A. Literature mining of protein phosphorylation using dependency parse trees. Methods 2014; 67:386-93. [PMID: 24440484 DOI: 10.1016/j.ymeth.2014.01.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/31/2013] [Revised: 12/23/2013] [Accepted: 01/05/2014] [Indexed: 10/25/2022]
Abstract
As one of the most common post-translational modifications (PTMs), protein phosphorylation plays an important role in various biological processes, such as signal transduction, cellular metabolism, differentiation, growth, regulation and apoptosis. Protein phosphorylation is of great value not only in illustrating the underlying molecular mechanisms but also in the treatment of diseases and the design of new drugs. Recently, there has been increasing interest in automatically extracting phosphorylation information from the biomedical literature. However, it remains a challenging task due to the tremendous volume of literature and the circuitous modes of expression for protein phosphorylation. To address this issue, we propose a novel text-mining method for efficiently retrieving and extracting protein phosphorylation information from the literature. By employing natural language processing (NLP) technologies, this method transforms each sentence into dependency parse trees that can precisely reflect the intrinsic relationships of phosphorylation-related key words, from which detailed information on substrates, kinases and phosphorylation sites is extracted based on syntactic patterns. Compared with other existing approaches, the proposed method demonstrates significantly improved performance, suggesting it is a powerful bioinformatics approach to retrieving phosphorylation information from a large amount of literature. A web server for the proposed method is freely available at http://bioinformatics.ustc.edu.cn/pptm/.
Affiliation(s)
- Mang Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.
- Hong Xia
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.
- Dongdong Sun
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.
- Zhaoxiong Chen
- School of Life Sciences, University of Science and Technology of China, Hefei AH230027, China.
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.
- Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.
3
Graph theory enables drug repurposing--how a mathematical model can drive the discovery of hidden mechanisms of action. PLoS One 2014; 9:e84912. [PMID: 24416311 PMCID: PMC3886994 DOI: 10.1371/journal.pone.0084912] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 06/20/2013] [Accepted: 11/28/2013] [Indexed: 12/21/2022]
Abstract
We introduce a methodology to efficiently exploit biomedical knowledge expressed in natural language for repurposing existing drugs towards diseases for which they were not initially intended. Leveraging developments in Computational Linguistics and Graph Theory, a methodology is defined to build a graph representation of knowledge, which is automatically analysed to discover hidden relations between any drug and any disease: these relations are specific paths among the biomedical entities of the graph, representing possible Modes of Action for any given pharmacological compound. We propose a measure for the likelihood of these paths based on a stochastic process on the graph. This measure depends on the abundance of indirect paths between a peptide and a disease, rather than solely on the strength of the shortest path connecting them. We provide real-world examples, showing how the method successfully retrieves known pathophysiological Modes of Action and finds new ones by meaningfully selecting and aggregating contributions from known bio-molecular interactions. Applications of this methodology are presented, and they demonstrate the efficacy of the method for selecting drugs as treatment options for rare diseases.
4
Comeau DC, Islamaj Doğan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, Rinaldi F, Torii M, Valencia A, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ. BioC: a minimalist approach to interoperability for biomedical text processing. Database: The Journal of Biological Databases and Curation 2013; 2013:bat064. [PMID: 24048470 PMCID: PMC3889917 DOI: 10.1093/database/bat064] [Citation(s) in RCA: 100] [Impact Index Per Article: 9.1] [Indexed: 12/31/2022]
Abstract
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, Harvard Medical School, Harvard University, Boston, MA 02115 USA, Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO 80045, USA, Structural and Computational Biology Group, Spanish National Cancer Research Centre, Madrid E-28029, Spain, Center for Bioinformatics and Computational Biology, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, National ICT Australia (NICTA), Victoria Research Laboratory, The University of Melbourne, Parkville VIC 3010, Australia and Department of Biology, North Carolina State University, Raleigh, NC 27695, USA
5
Ross KE, Arighi CN, Ren J, Huang H, Wu CH. Construction of protein phosphorylation networks by data mining, text mining and ontology integration: analysis of the spindle checkpoint. Database: The Journal of Biological Databases and Curation 2013; 2013:bat038. [PMID: 23749465 PMCID: PMC3675891 DOI: 10.1093/database/bat038] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 01/10/2023]
Abstract
Knowledge representation of the role of phosphorylation is essential for the meaningful understanding of many biological processes. However, such a representation is challenging because proteins can exist in numerous phosphorylated forms with each one having its own characteristic protein–protein interactions (PPIs), functions and subcellular localization. In this article, we evaluate the current state of phosphorylation event curation and then present a bioinformatics framework for the annotation and representation of phosphorylated proteins and construction of phosphorylation networks that addresses some of the gaps in current curation efforts. The integrated approach involves (i) text mining guided by RLIMS-P, a tool that identifies phosphorylation-related information in scientific literature; (ii) data mining from curated PPI databases; (iii) protein form and complex representation using the Protein Ontology (PRO); (iv) functional annotation using the Gene Ontology (GO); and (v) network visualization and analysis with Cytoscape. We use this framework to study the spindle checkpoint, the process that monitors the assembly of the mitotic spindle and blocks cell cycle progression at metaphase until all chromosomes have made bipolar spindle attachments. The phosphorylation networks we construct, centered on the human checkpoint kinase BUB1B (BubR1) and its yeast counterpart MAD3, offer a unique view of the spindle checkpoint that emphasizes biologically relevant phosphorylated forms, phosphorylation-state–specific PPIs and kinase–substrate relationships. Our approach for constructing protein phosphorylation networks can be applied to any biological process that is affected by phosphorylation. Database URL: http://www.yeastgenome.org/
Affiliation(s)
- Karen E Ross
- Center for Bioinformatics and Computational Biology, 15 Innovation Way, Suite 205, University of Delaware, Newark, DE 19711, USA.
6
Ross KE, Arighi CN, Ren J, Natale DA, Huang H, Wu CH. Use of the protein ontology for multi-faceted analysis of biological processes: a case study of the spindle checkpoint. Front Genet 2013; 4:62. [PMID: 23637705 PMCID: PMC3636526 DOI: 10.3389/fgene.2013.00062] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 02/25/2013] [Accepted: 04/05/2013] [Indexed: 11/13/2022]
Abstract
As a member of the Open Biomedical Ontologies (OBO) foundry, the Protein Ontology (PRO) provides an ontological representation of protein forms and complexes and their relationships. Annotations in PRO can be assigned to individual protein forms and complexes, each distinguishable down to the level of post-translational modification, thereby allowing for a more precise depiction of protein function than is possible with annotations to the gene as a whole. Moreover, PRO is fully interoperable with other OBO ontologies and integrates knowledge from other protein-centric resources such as UniProt and Reactome. Here we demonstrate the value of the PRO framework in the investigation of the spindle checkpoint, a highly conserved biological process that relies extensively on protein modification and protein complex formation. The spindle checkpoint maintains genomic integrity by monitoring the attachment of chromosomes to spindle microtubules and delaying cell cycle progression until the spindle is fully assembled. Using PRO in conjunction with other bioinformatics tools, we explored the cross-species conservation of spindle checkpoint proteins, including phosphorylated forms and complexes; studied the impact of phosphorylation on spindle checkpoint function; and examined the interactions of spindle checkpoint proteins with the kinetochore, the site of checkpoint activation. Our approach can be generalized to any biological process of interest.
Affiliation(s)
- Karen E Ross
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
7
Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012; 13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 14.2] [Indexed: 12/24/2022]
8
Pyysalo S, Ohta T, Rak R, Sullivan D, Mao C, Wang C, Sobral B, Tsujii J, Ananiadou S. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinformatics 2012; 13 Suppl 11:S2. [PMID: 22759456 PMCID: PMC3384257 DOI: 10.1186/1471-2105-13-s11-s2] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Indexed: 12/12/2022]
Abstract
We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.
Affiliation(s)
- Sampo Pyysalo
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Tomoko Ohta
- Department of Computer Science, University of Tokyo, Tokyo, Japan
- Rafal Rak
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Dan Sullivan
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Chunhong Mao
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Chunxia Wang
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Bruno Sobral
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Sophia Ananiadou
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
9
Xu H, Schaniel C, Lemischka IR, Ma'ayan A. Toward a complete in silico, multi-layered embryonic stem cell regulatory network. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2011; 2:708-33. [PMID: 20890967 DOI: 10.1002/wsbm.93] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 12/18/2022]
Abstract
Recent efforts in systematically profiling embryonic stem (ES) cells have yielded a wealth of high-throughput data. Complementarily, emerging databases and computational tools facilitate ES cell studies and further pave the way toward the in silico reconstruction of regulatory networks encompassing multiple molecular layers. Here, we briefly survey databases, algorithms, and software tools used to organize and analyze high-throughput experimental data collected to study mammalian cellular systems with a focus on ES cells. The vision of using heterogeneous data to reconstruct a complete multi-layered ES cell regulatory network is discussed. This review also provides an accompanying manually extracted dataset of different types of regulatory interactions from low-throughput experimental ES cell studies available at http://amp.pharm.mssm.edu/iscmid/literature.
Affiliation(s)
- Huilei Xu
- Department of Gene and Cell Medicine and The Black Family Stem Cell Institute, Mount Sinai School of Medicine, New York, NY 10029, USA
10
Mewes HW, Ruepp A, Theis F, Rattei T, Walter M, Frishman D, Suhre K, Spannagl M, Mayer KFX, Stümpflen V, Antonov A. MIPS: curated databases and comprehensive secondary data resources in 2010. Nucleic Acids Res 2010; 39:D220-4. [PMID: 21109531 PMCID: PMC3013725 DOI: 10.1093/nar/gkq1157] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Indexed: 01/21/2023]
Abstract
The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38 000 000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de).
Affiliation(s)
- H Werner Mewes
- Institute for Bioinformatics and Systems Biology, MIPS, Helmholtz Center for Health and Environment, Ingolstädter Landstr 1, D-85764 Neuherberg, Germany.
11
Everett L, Hansen M, Hannenhalli S. Regulating the regulators: modulators of transcription factor activity. Methods Mol Biol 2010; 674:297-312. [PMID: 20827600 DOI: 10.1007/978-1-60761-854-6_19] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 12/15/2022]
Abstract
Gene transcription is largely regulated by DNA-binding transcription factors (TFs). However, the TF activity itself is modulated via, among other things, post-translational modifications (PTMs) by specific modification enzymes in response to cellular stimuli. TF-PTMs thus serve as "molecular switchboards" that map upstream signaling events to the downstream transcriptional events. An important long-term goal is to obtain a genome-wide map of "regulatory triplets" consisting of a TF, target gene, and a modulator gene that specifically modulates the regulation of the target gene by the TF. A variety of genome-wide data sets can be exploited by computational methods to obtain a rough map of regulatory triplets, which can guide directed experiments. However, a prerequisite to developing such computational tools is a systematic catalog of known instances of regulatory triplets. We first describe PTM-Switchboard, a recent database that stores triplets of genes such that the ability of one gene (the TF) to regulate a target gene is dependent on one or more PTMs catalyzed by a third gene, the modifying enzyme. We also review current computational approaches to infer regulatory triplets from genome-wide data sets and conclude with a discussion of potential future research. PTM-Switchboard is accessible at http://cagr.pcbi.upenn.edu/PTMswitchboard/
Affiliation(s)
- Logan Everett
- Department of Genetics, Penn Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA, USA.
12
13
Wiegers TC, Davis AP, Cohen KB, Hirschman L, Mattingly CJ. Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD). BMC Bioinformatics 2009; 10:326. [PMID: 19814812 PMCID: PMC2768719 DOI: 10.1186/1471-2105-10-326] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Received: 05/26/2009] [Accepted: 10/08/2009] [Indexed: 11/11/2022]
Abstract
BACKGROUND The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage. RESULTS Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking). CONCLUSION This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
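The abstract above reports re-ranking quality as mean average precision (MAP). As a minimal sketch of how that metric is commonly computed, and not the authors' evaluation code, the following assumes each ranking is a list of booleans marking whether the document at that rank is relevant (curatable), and that every relevant document appears somewhere in the ranking. The function names and the toy rankings are hypothetical.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision of one ranked list.

    relevance[i] is True when the document at rank i + 1 is relevant.
    Assumption: all relevant documents are present in the list, so the
    sum of precisions is divided by the number of relevant hits found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-ranking average precisions."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical toy rankings: the same four documents before and after re-ranking.
baseline = [False, True, False, True]
reranked = [True, True, False, False]
```

Moving the relevant documents toward the top raises average precision, which is the effect a re-ranking step aims for.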
Affiliation(s)
- Thomas C Wiegers
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME, USA
- Allan Peter Davis
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME, USA
- K Bretonnel Cohen
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO, USA
- Information Technology Center, The MITRE Corporation, 202 Burlington Road, Bedford, MA, USA
- Lynette Hirschman
- Information Technology Center, The MITRE Corporation, 202 Burlington Road, Bedford, MA, USA
- Carolyn J Mattingly
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME, USA
14
Song YL, Chen SS. Text mining biomedical literature for constructing gene regulatory networks. Interdiscip Sci 2009; 1:179-86. [PMID: 20640836 DOI: 10.1007/s12539-009-0028-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 10/18/2008] [Revised: 12/06/2008] [Accepted: 12/07/2008] [Indexed: 11/26/2022]
Abstract
In this paper, we present the framework of a Gene Regulatory Networks System: GRNS. The goals of GRNS include automatically mining biomedical literature to extract gene regulatory information (strain number, genotype, gene regulatory relation, and phenotype), automatically constructing gene regulatory networks based on extracted information and integrating biomedical knowledge into the regulatory networks.
Affiliation(s)
- Yong-Ling Song
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
15
Yang CY, Chang CH, Yu YL, Lin TCE, Lee SA, Yen CC, Yang JM, Lai JM, Hong YR, Tseng TL, Chao KM, Huang CYF. PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database. Bioinformatics 2008; 24:i14-20. [PMID: 18689816 DOI: 10.1093/bioinformatics/btn297] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION To fully understand how a protein kinase regulates biological processes, it is imperative to first identify its substrate(s) and interacting protein(s). However, of the 518 known human serine/threonine/tyrosine kinases, 35% have known substrates, while 14% have identified substrate recognition motifs. In contrast, 85% of the kinases have protein-protein interaction (PPI) datasets, raising the possibility that we might reveal potential kinase-substrate pairs from these PPIs. RESULTS PhosphoPOINT, a comprehensive human kinase interactome and phospho-protein database, is a collection of 4195 phospho-proteins with a total of 15 738 phosphorylation sites. PhosphoPOINT annotates the interactions among kinases, with their downstream substrates and with interacting (phospho)-proteins to modulate the kinase-substrate pairs. PhosphoPOINT implements various gene expression profiles and Gene Ontology cellular component information to evaluate each kinase and its interacting (phospho)-proteins/substrates. Integration of cSNPs that cause amino acid changes with the phosphoprotein dataset reveals that 64 phosphorylation sites result in disease phenotypes when changed; the linked phenotypes include schizophrenia and hypertension. PhosphoPOINT also provides a search function for all phospho-peptides using about 300 known kinase/phosphatase substrate/binding motifs. Altogether, PhosphoPOINT provides robust annotation for kinases, their downstream substrates and their interacting (phospho)-proteins, and this should accelerate the functional characterization of kinome-mediated signaling. AVAILABILITY PhosphoPOINT can be freely accessed at http://kinase.bioinformatics.tw/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chia-Ying Yang
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, Republic of China
16
|
Everett L, Vo A, Hannenhalli S. PTM-Switchboard--a database of posttranslational modifications of transcription factors, the mediating enzymes and target genes. Nucleic Acids Res 2008; 37:D66-71. [PMID: 18927104 PMCID: PMC2686453 DOI: 10.1093/nar/gkn731] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
Abstract
Gene transcription is largely regulated by sequence-specific transcription factors (TFs). The TF activity is significantly regulated by its posttranslational modifications (PTMs). TF-PTMs serve as ‘molecular switchboards’ that map multiple upstream signaling events, in response to various environmental perturbations, to the downstream transcriptional events. While many instances of TF-PTMs and their effect on gene regulation have been experimentally determined, a systematic meta-analysis or a quantitative model-based investigation of this process has not been undertaken. A prerequisite to such analyses is a database of known instances of TF-PTMs affecting transcriptional regulation. The PTM-Switchboard database meets this need by cataloging such instances in the model system Saccharomyces cerevisiae. The database stores triplets of genes such that the ability of one gene (TF) to regulate a target gene is dependent on one or more PTMs catalyzed by a third gene (modifying enzyme). The database currently includes a large sample of experimentally characterized instances curated from the literature. In addition to providing a framework for searching and analyzing the data, the database will serve to benchmark computational methods. In the future, the database will be expanded to mammalian organisms, and will also include triplets predicted from computational approaches. The database can be accessed at http://cagr.pcbi.upenn.edu/PTMswitchboard.
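The triplet structure described in this abstract (a TF, its target gene, and the modifying enzyme whose PTM the regulation depends on) might be represented as in the sketch below. This is an assumption about one convenient in-memory layout, not the database's actual schema; the field names, the helper functions, and the placeholder gene identifiers (`TF_A`, `GENE_X`, `ENZ_1`, and so on) are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Triplet(NamedTuple):
    """One TF-PTM instance: regulation of `target` by `tf` depends on a PTM by `enzyme`."""
    tf: str       # transcription factor gene
    target: str   # regulated target gene
    enzyme: str   # modifying enzyme gene
    ptm: str      # PTM type, e.g. "phosphorylation"

def index_by_tf(triplets):
    """Group triplets by transcription factor for quick lookup."""
    idx = defaultdict(list)
    for t in triplets:
        idx[t.tf].append(t)
    return dict(idx)

def enzymes_affecting(triplets, tf, target):
    """Return the sorted enzymes whose PTMs affect regulation of `target` by `tf`."""
    return sorted({t.enzyme for t in triplets if t.tf == tf and t.target == target})

# Placeholder records for demonstration only (not real database entries).
TRIPLETS = [
    Triplet("TF_A", "GENE_X", "ENZ_1", "phosphorylation"),
    Triplet("TF_A", "GENE_X", "ENZ_2", "acetylation"),
    Triplet("TF_A", "GENE_Y", "ENZ_1", "phosphorylation"),
    Triplet("TF_B", "GENE_X", "ENZ_3", "ubiquitination"),
]
by_tf = index_by_tf(TRIPLETS)
```

Keeping each record flat makes it straightforward to query from any of the three genes, which matches the abstract's framing of the data as triplets rather than as a TF-centric hierarchy.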
Affiliation(s)
- Logan Everett
- Penn Center for Bioinformatics, Department of Genetics and Department of Computer Science, University of Pennsylvania, Philadelphia, PA, USA.
|
17
|
Cohen KB, Palmer M, Hunter L. Nominalization and alternations in biomedical language. PLoS One 2008; 3:e3158. [PMID: 18779866 PMCID: PMC2527518 DOI: 10.1371/journal.pone.0003158] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Received: 04/18/2008] [Accepted: 06/04/2008] [Indexed: 12/04/2022]
Abstract
Background This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of "stimulate" in "FSH stimulates follicular development" and "follicular development is stimulated by FSH". The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts. Methodology/Principal Findings We examined 1,872 tokens of the ten most common domain-specific verbs or their zero-related nouns in the PennBioIE corpus and labelled them for the presence or absence of three alternations. We then annotated the arguments of 746 tokens of the nominalizations related to these verbs and counted alternations related to the presence or absence of arguments and to the syntactic position of non-absent arguments. We found that alternations are quite common both for verbs and for nominalizations. We also found a previously undescribed alternation involving an adjectival present participle. Conclusions/Significance We found that even in this semantically restricted domain, alternations are quite common, and alternations involving nominalizations are exceptionally diverse. Nonetheless, the sublanguage model applies to biomedical language. We also report on a previously undescribed alternation involving an adjectival present participle.
Affiliation(s)
- K Bretonnel Cohen
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, United States of America.
|
18
|
Tsai RTH, Chou WC, Su YS, Lin YC, Sung CL, Dai HJ, Yeh ITH, Ku W, Sung TY, Hsu WL. BIOSMILE: a semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features. BMC Bioinformatics 2007; 8:325. [PMID: 17764570 PMCID: PMC2072962 DOI: 10.1186/1471-2105-8-325] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Received: 11/20/2006] [Accepted: 09/01/2007] [Indexed: 11/21/2022]
Abstract
Background Bioinformatics tools for automatic processing of biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have thus been developed for use in the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems usually ignore adverbial and prepositional phrases and words identifying location, manner, timing, and condition, which are essential for describing biomedical relations. Semantic role labeling (SRL) is a natural language processing technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We construct a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatic, annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events. Results To evaluate the performance of BIOSMILE, we conducted two experiments to (1) compare the performance of SRL systems trained on newswire and biomedical corpora; and (2) examine the effects of using biomedical-specific features. The experimental results show that using BioProp improves the F-score of the SRL system by 21.45% over an SRL system that uses a newswire corpus. It is noteworthy that adding automatically generated template features improves the overall F-score by a further 0.52%. Specifically, ArgM-LOC, ArgM-MNR, and Arg2 achieve statistically significant performance improvements of 3.33%, 2.27%, and 1.44%, respectively. 
Conclusion We demonstrate the necessity of using a biomedical proposition bank for training SRL systems in the biomedical domain. Besides the different characteristics of biomedical and newswire sentences, factors such as cross-domain framesets and verb usage variations also influence the performance of SRL systems. For argument classification, we find that NE (named entity) features indicating if the target node matches with NEs are not effective, since NEs may match with a node of the parsing tree that does not have semantic role labels in the training set. We therefore incorporate templates composed of specific words, NE types, and POS tags into the SRL system. As a result, the classification accuracy for adjunct arguments, which is especially important for biomedical SRL, is improved significantly.
Affiliation(s)
- Wen-Chi Chou
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Ying-Shan Su
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Institute of Human Nutrition, Columbia University, New York, NY 10032, USA
- Yu-Chun Lin
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Cheng-Lung Sung
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Hong-Jie Dai
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Irene Tzu-Hsuan Yeh
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Biological Sciences & Psychology, Mellon College of Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Wei Ku
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Ting-Yi Sung
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
- Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, PRoC
|
19
|
Liu H, Hu ZZ, Torii M, Wu C, Friedman C. Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc 2006; 13:497-507. [PMID: 16799122 PMCID: PMC1561801 DOI: 10.1197/jamia.m2085] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022]
Abstract
OBJECTIVE Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step for biological literature mining is biological named entity tagging (BNET) that identifies names mentioned in text and normalizes them with entries in biological databases. The aim of this study was to provide quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources. METHODS We evaluated the complexity through several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized names in BioThesaurus and measures were obtained twice, once before normalization and once after. RESULTS The current version of BioThesaurus has over 2.6 million names or 2.1 million normalized names covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), ambiguity is 2.31 before normalization and 2.32 after, while the coverage is 94.0% based on the BioCreAtive data set comprising MEDLINE abstracts containing genes/proteins. CONCLUSION The study indicated that names for genes/proteins are highly ambiguous and there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
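The three measures defined in this abstract (ambiguity, synonymy, and coverage) can be computed on a toy thesaurus as in the following sketch. It assumes the thesaurus is available as (name, entry ID) pairs and that the averages are simple means; the function names, the entry IDs `E1` to `E3`, and the normalization rule (lowercasing and dropping non-alphanumeric characters) are hypothetical stand-ins, not the procedure used in the study.

```python
import re
from collections import defaultdict

def simple_normalize(name):
    """Hypothetical normalization: lowercase and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def thesaurus_metrics(pairs, text_mentions, normalize=None):
    """Return (average ambiguity, average synonymy, coverage).

    ambiguity: mean number of entries represented by one name
    synonymy:  mean number of names associated with one entry
    coverage:  fraction of names mentioned in text that appear in the thesaurus
    """
    norm = normalize or (lambda s: s)
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for name, entry in pairs:
        key = norm(name)
        name_to_entries[key].add(entry)
        entry_to_names[entry].add(key)
    ambiguity = sum(len(v) for v in name_to_entries.values()) / len(name_to_entries)
    synonymy = sum(len(v) for v in entry_to_names.values()) / len(entry_to_names)
    covered = sum(1 for m in text_mentions if norm(m) in name_to_entries)
    coverage = covered / len(text_mentions)
    return ambiguity, synonymy, coverage

# Toy data with made-up entry IDs, for demonstration only.
PAIRS = [("p53", "E1"), ("TP53", "E1"), ("P-53", "E1"),
         ("CAT", "E2"), ("CAT", "E3"), ("catalase", "E2")]
MENTIONS = ["p53", "CAT", "BRCA1", "catalase"]

before = thesaurus_metrics(PAIRS, MENTIONS)
after = thesaurus_metrics(PAIRS, MENTIONS, normalize=simple_normalize)
```

On this toy data, normalization merges "p53" and "P-53", so synonymy drops while ambiguity rises slightly, the same direction of change the abstract reports for BioThesaurus.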
Affiliation(s)
- Hongfang Liu
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC 20007, USA.
|