1
Jiang Y, Wang Y, Shen L, Adjeroh DA, Liu Z, Lin J. Identification of all-against-all protein-protein interactions based on deep hash learning. BMC Bioinformatics 2022; 23:266. [PMID: 35804303 PMCID: PMC9264577 DOI: 10.1186/s12859-022-04811-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/03/2021] [Accepted: 06/17/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein-protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner, which is time-consuming. RESULTS In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database with M proteins can be transformed into a much simpler problem: finding a number inside a sorted array of length M. This pre-screening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship. CONCLUSIONS The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior to or competitive with other state-of-the-art methods in terms of precision, recall or F1 score.
Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from [Formula: see text] to [Formula: see text] for performing an all-against-all PPI prediction for a database with M proteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.
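The two-stage search described in this abstract (look up a number in a sorted array, then confirm with the Hamming distance) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the code length `CODE_BITS`, the idea of pre-screening on the leading `PREFIX_BITS` bits, the threshold `max_dist`, and the function names are all assumptions introduced here, and the hash codes are taken as given rather than learned.

```python
from bisect import bisect_left

CODE_BITS = 16    # hypothetical hash-code length
PREFIX_BITS = 8   # hypothetical number of leading bits used for pre-screening


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def build_index(codes: dict[str, int]) -> tuple[list[int], list[str]]:
    """Sort the database once; returns parallel lists of codes and protein IDs."""
    pairs = sorted((code, pid) for pid, code in codes.items())
    return [c for c, _ in pairs], [p for _, p in pairs]


def search(keys: list[int], ids: list[str], query: int, max_dist: int) -> list[str]:
    """Pre-screen by binary search on the code prefix, then verify by Hamming distance."""
    shift = CODE_BITS - PREFIX_BITS
    low = (query >> shift) << shift   # smallest code sharing the query's prefix
    high = low + (1 << shift)         # first code past that prefix range
    start = bisect_left(keys, low)    # binary search in the sorted array
    stop = bisect_left(keys, high)
    return [ids[i] for i in range(start, stop) if hamming(keys[i], query) <= max_dist]
```

Because the array is sorted once and reused, each query costs a binary search plus a Hamming check on the few candidates that survive pre-screening, rather than a comparison against every protein in the database.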
Affiliation(s)
- Yue Jiang
- College of Computer and Cyber Security, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Yuxuan Wang
- No. 2 Thoracic Surgery Department, Beijing Chest Hospital, Capital Medical University, Beijing Tuberculosis and Thoracic Tumor Research Institute, Beijing, 101149, People's Republic of China
- Lin Shen
- College of Computer and Cyber Security, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Donald A Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, USA
- Zhidong Liu
- No. 2 Thoracic Surgery Department, Beijing Chest Hospital, Capital Medical University, Beijing Tuberculosis and Thoracic Tumor Research Institute, Beijing, 101149, People's Republic of China
- Jie Lin
- College of Computer and Cyber Security, Fujian Normal University, Fuzhou, 350108, People's Republic of China
2
Wang H, Qiu J, Liu H, Xu Y, Jia Y, Zhao Y. HKPocket: human kinase pocket database for drug design. BMC Bioinformatics 2019; 20:617. [PMID: 31783725 PMCID: PMC6884818 DOI: 10.1186/s12859-019-3254-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 04/04/2019] [Accepted: 11/15/2019] [Indexed: 01/06/2023] Open
Abstract
Background Structural information on kinase pockets is important for drug discovery targeting cancer or other diseases. Although some kinase sequence, structure or drug databases have been developed, these databases cannot be directly used in kinase drug studies. Therefore, a comprehensive database of human kinase protein pockets is urgently needed. Results Here, we have developed HKPocket, a comprehensive Human Kinase Pocket database. This database provides sequence, structure, hydrophilic-hydrophobic, critical interaction, and druggability information for 1717 pockets from 255 kinases. We further divided these pockets into 91 pocket clusters using structural and position features in each kinase group. The pocket structural information would be useful for preliminary drug screening. Then, the potential drugs can be further selected and optimized by analyzing the sequence conservation, critical interactions, and hydrophobicity of identified drug pockets. HKPocket also provides online visualization and pse files of all identified pockets. Conclusion The HKPocket database would be helpful for drug screening and optimization. Besides, drugs targeting the non-catalytic pockets would cause fewer side effects. HKPocket is available at http://zhaoserver.com.cn/HKPocket/HKPocket.html.
Affiliation(s)
- Huiwen Wang
- Department of Physics, Central China Normal University, Wuhan, 430079, China
- Jiadi Qiu
- Department of Physics, Central China Normal University, Wuhan, 430079, China
- Haoquan Liu
- Department of Physics, Central China Normal University, Wuhan, 430079, China
- Ying Xu
- Department of Physics, Central China Normal University, Wuhan, 430079, China
- Ya Jia
- Department of Physics, Central China Normal University, Wuhan, 430079, China
- Yunjie Zhao
- Department of Physics, Central China Normal University, Wuhan, 430079, China
3
Hyung D, Mallon AM, Kyung DS, Cho SY, Seong JK. TarGo: network based target gene selection system for human disease related mouse models. Lab Anim Res 2019; 35:23. [PMID: 32257911 PMCID: PMC7081697 DOI: 10.1186/s42826-019-0023-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2019] [Accepted: 10/21/2019] [Indexed: 11/25/2022] Open
Abstract
Genetically engineered mouse models are used in high-throughput phenotyping screens to understand genotype-phenotype associations and their relevance to human diseases. However, not all mutant mouse lines with detectable phenotypes are associated with human diseases. Here, we propose the “Target gene selection system for Genetically engineered mouse models” (TarGo). Using a combination of human disease descriptions, network topology, and genotype-phenotype correlations, novel genes that are potentially related to human diseases are suggested. We constructed a gene interaction network using protein-protein interactions, molecular pathways, and co-expression data. Several repositories for human disease signatures were used to obtain information on human disease-related genes. We calculated disease- or phenotype-specific gene ranks using network topology and disease signatures. In conclusion, TarGo provides many novel features for gene function prediction.
Affiliation(s)
- Daejin Hyung
- National Cancer Center, 323 Ilsan-ro, Goyang-si, Kyeonggi-do 10408 Republic of Korea
- Ann-Marie Mallon
- MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire, OX11 0RD UK
- Dong Soo Kyung
- Laboratory of Developmental Biology and Genomics, Research Institute for Veterinary Science, and BK21 Plus Program for Creative Veterinary Science, College of Veterinary Medicine, Seoul National University, Seoul, 08826 Republic of Korea; Korea Mouse Phenotyping Center (KMPC), Seoul National University, Seoul, 08826 Republic of Korea; Interdisciplinary Program for Bioinformatics, Program for Cancer Biology and BIO-MAX institute, Seoul National University, Seoul, 08826 Republic of Korea
- Soo Young Cho
- National Cancer Center, 323 Ilsan-ro, Goyang-si, Kyeonggi-do 10408 Republic of Korea; Korea Mouse Phenotyping Center (KMPC), Seoul National University, Seoul, 08826 Republic of Korea
- Je Kyung Seong
- Laboratory of Developmental Biology and Genomics, Research Institute for Veterinary Science, and BK21 Plus Program for Creative Veterinary Science, College of Veterinary Medicine, Seoul National University, Seoul, 08826 Republic of Korea; Korea Mouse Phenotyping Center (KMPC), Seoul National University, Seoul, 08826 Republic of Korea; Interdisciplinary Program for Bioinformatics, Program for Cancer Biology and BIO-MAX institute, Seoul National University, Seoul, 08826 Republic of Korea
4
Pozzi B, Mammi P, Bragado L, Giono LE, Srebrow A. When SUMO met splicing. RNA Biol 2018; 15:689-695. [PMID: 29741121 PMCID: PMC6152442 DOI: 10.1080/15476286.2018.1457936] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 12/26/2017] [Revised: 02/22/2018] [Accepted: 03/20/2018] [Indexed: 12/12/2022] Open
Abstract
Spliceosomal proteins have been revealed as SUMO conjugation targets. Moreover, we have reported that many of these are in a SUMO-conjugated form when bound to a pre-mRNA substrate during a splicing reaction. We demonstrated that SUMOylation of Prp3 (PRPF3), a component of the U4/U6 di-snRNP, is required for U4/U6•U5 tri-snRNP formation and/or recruitment to active spliceosomes. Expanding upon our previous results, we have shown that the splicing factor SRSF1 stimulates SUMO conjugation to several spliceosomal proteins. Given the relevance of the splicing process, as well as the complex and dynamic nature of its governing machinery, the spliceosome, the molecular mechanisms that modulate its function represent an attractive topic of research. We posit that SUMO conjugation could represent a way of modulating spliceosome assembly and thus, splicing efficiency. How cycles of SUMOylation/de-SUMOylation of spliceosomal proteins become integrated throughout the highly choreographed spliceosomal cycle awaits further investigation.
Affiliation(s)
- Berta Pozzi
- Instituto de Fisiología, Biología Molecular y Neurociencias (IFIBYNE, UBA-CONICET); Departamento de Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Ciudad Universitaria, Buenos Aires, Argentina
- Pablo Mammi
- Instituto de Fisiología, Biología Molecular y Neurociencias (IFIBYNE, UBA-CONICET); Departamento de Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Ciudad Universitaria, Buenos Aires, Argentina
- Laureano Bragado
- Instituto de Fisiología, Biología Molecular y Neurociencias (IFIBYNE, UBA-CONICET); Departamento de Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Ciudad Universitaria, Buenos Aires, Argentina
- Luciana E. Giono
- Instituto de Fisiología, Biología Molecular y Neurociencias (IFIBYNE, UBA-CONICET); Departamento de Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Ciudad Universitaria, Buenos Aires, Argentina
- Anabella Srebrow
- Instituto de Fisiología, Biología Molecular y Neurociencias (IFIBYNE, UBA-CONICET); Departamento de Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Ciudad Universitaria, Buenos Aires, Argentina
5
Stroehlein AJ, Young ND, Gasser RB. Advances in kinome research of parasitic worms - implications for fundamental research and applied biotechnological outcomes. Biotechnol Adv 2018; 36:915-934. [PMID: 29477756 DOI: 10.1016/j.biotechadv.2018.02.013] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/01/2017] [Revised: 02/15/2018] [Accepted: 02/21/2018] [Indexed: 12/17/2022]
Abstract
Protein kinases are enzymes that play essential roles in the regulation of many cellular processes. Despite expansions in the fields of genomics, transcriptomics and bioinformatics, there is limited information on the kinase complements (kinomes) of most eukaryotic organisms, including parasitic worms that cause serious diseases of humans and animals. The biological uniqueness of these worms and the draft status of their genomes pose challenges for the identification and classification of protein kinases using established tools. In this article, we provide an account of kinase biology, the roles of kinases in diseases and their importance as drug targets, and drug discovery efforts in key socioeconomically important parasitic worms. In this context, we summarise methods and resources commonly used for the curation, identification, classification and functional annotation of protein kinase sequences from draft genomes; review recent advances made in the characterisation of the worm kinomes; and discuss the implications of these advances for investigating kinase signalling and developing small-molecule inhibitors as new anti-parasitic drugs.
Affiliation(s)
- Andreas J Stroehlein
- Melbourne Veterinary School, Department of Veterinary Biosciences, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, Victoria 3010, Australia
- Neil D Young
- Melbourne Veterinary School, Department of Veterinary Biosciences, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, Victoria 3010, Australia
- Robin B Gasser
- Melbourne Veterinary School, Department of Veterinary Biosciences, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, Victoria 3010, Australia
6
Žitnik S, Žitnik M, Zupan B, Bajec M. Sieve-based relation extraction of gene regulatory networks from biological literature. BMC Bioinformatics 2015; 16 Suppl 16:S1. [PMID: 26551454 PMCID: PMC4642041 DOI: 10.1186/1471-2105-16-s16-s1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 01/17/2023] Open
Abstract
Background Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic relations between parts of text, called mentions. Biomedical literature includes an enormous amount of textual descriptions of biological entities, their interactions and results of related experiments. To extract them in an explicit, computer readable format, these relations were at first extracted manually from databases. Manual curation was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The current challenge is the development of information extraction procedures that can directly infer more complex relational structures, such as gene regulatory networks. Results We develop a computational approach for extraction of gene regulatory networks from textual data. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each of which is able to extract different relationship types. Following the shared task, we conducted additional analysis using different system settings that resulted in reducing the reconstruction error of bacterial sporulation network from 0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of extraction. 
Analysis of distances between different mention types in the text shows that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions. Conclusions Linear-chain conditional random fields, along with appropriate data transformations, can be efficiently used to extract relations. The sieve-based architecture simplifies the system, as new sieves can be easily added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional random fields can be trained on arbitrary text data and hence are applicable to a broad range of relation extraction tasks and data domains.
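The slot error rate used above to compare a predicted network with a reference network is commonly defined as the number of substitutions, deletions and insertions divided by the number of reference slots. The sketch below assumes that definition and a simple edge representation (a dict from a source/target pair to a relation type); the function name and the gene names in the usage note are placeholders, and the shared task's official scorer may differ in its details.

```python
def slot_error_rate(reference: dict, predicted: dict) -> float:
    """SER = (substitutions + deletions + insertions) / number of reference edges.

    Both arguments map an edge (source, target) to a relation type.
    """
    if not reference:
        raise ValueError("reference network must contain at least one edge")
    # Edge predicted between the right genes but with the wrong relation type.
    substitutions = sum(
        1 for edge, kind in predicted.items()
        if edge in reference and reference[edge] != kind
    )
    # Reference edge that was not predicted at all.
    deletions = sum(1 for edge in reference if edge not in predicted)
    # Predicted edge that does not exist in the reference.
    insertions = sum(1 for edge in predicted if edge not in reference)
    return (substitutions + deletions + insertions) / len(reference)
```

For example, with four reference edges, a prediction that gets two edges right, mislabels one, misses one and adds one spurious edge scores (1 + 1 + 1) / 4 = 0.75. Note that the value can exceed 1 when there are many insertions.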
7
van Linden OPJ, Kooistra AJ, Leurs R, de Esch IJP, de Graaf C. KLIFS: a knowledge-based structural database to navigate kinase-ligand interaction space. J Med Chem 2013; 57:249-77. [PMID: 23941661 DOI: 10.1021/jm400378w] [Citation(s) in RCA: 212] [Impact Index Per Article: 17.7] [Indexed: 12/17/2022]
Abstract
Protein kinases regulate the majority of signal transduction pathways in cells and have become important targets for the development of designer drugs. We present a systematic analysis of kinase-ligand interactions in all regions of the catalytic cleft of all 1252 human kinase-ligand cocrystal structures present in the Protein Data Bank (PDB). The kinase-ligand interaction fingerprints and structure database (KLIFS) contains a consistent alignment of 85 kinase ligand binding site residues that enables the identification of family specific interaction features and classification of ligands according to their binding modes. We illustrate how systematic mining of kinase-ligand interaction space gives new insights into how conserved and selective kinase interaction hot spots can accommodate the large diversity of chemical scaffolds in kinase ligands. These analyses lead to an improved understanding of the structural requirements of kinase binding that will be useful in ligand discovery and design studies.
Affiliation(s)
- Oscar P J van Linden
- Division of Medicinal Chemistry, Faculty of Sciences, Amsterdam Institute for Molecules, Medicines and Systems (AIMMS), VU University Amsterdam, De Boelelaan 1083, 1081 HV Amsterdam, The Netherlands
8
Friedman C, Rindflesch TC, Corn M. Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. J Biomed Inform 2013; 46:765-73. [PMID: 23810857 DOI: 10.1016/j.jbi.2013.06.004] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Received: 04/29/2013] [Revised: 06/07/2013] [Accepted: 06/07/2013] [Indexed: 01/29/2023]
Abstract
Natural language processing (NLP) is crucial for advancing healthcare because it is needed to transform relevant information locked in text into structured data that can be used by computer processes aimed at improving patient care and advancing medicine. In light of the importance of NLP to health, the National Library of Medicine (NLM) recently sponsored a workshop to review the state of the art in NLP focusing on text in English, both in biomedicine and in the general language domain. Specific goals of the NLM-sponsored workshop were to identify the current state of the art, grand challenges and specific roadblocks, and to identify effective use and best practices. This paper reports on the main outcomes of the workshop, including an overview of the state of the art, strategies for advancing the field, and obstacles that need to be addressed, resulting in recommendations for a research agenda intended to advance the field.
Affiliation(s)
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, United States
9
Tikk D, Solt I, Thomas P, Leser U. A detailed error analysis of 13 kernel methods for protein-protein interaction extraction. BMC Bioinformatics 2013; 14:12. [PMID: 23323857 PMCID: PMC3680070 DOI: 10.1186/1471-2105-14-12] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 11/04/2012] [Accepted: 12/19/2012] [Indexed: 11/21/2022] Open
Abstract
Background Kernel-based classification is the current state of the art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. Results We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance. Conclusions Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.
Affiliation(s)
- Domonkos Tikk
- Knowledge Management in Bioinformatics, Computer Science Department, Humboldt-Universität zu Berlin, 10099 Berlin, Germany
10
RASOnD-a comprehensive resource and search tool for RAS superfamily oncogenes from various species. BMC Genomics 2011; 12:341. [PMID: 21729256 PMCID: PMC3141677 DOI: 10.1186/1471-2164-12-341] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 04/01/2011] [Accepted: 07/05/2011] [Indexed: 12/30/2022] Open
Abstract
Background The Ras superfamily plays an important role in the control of cell signalling and division. Mutations in the Ras genes convert them into active oncogenes. The Ras oncogenes form a major thrust of global cancer research as they are involved in the development and progression of tumors. This has resulted in the exponential growth of data on the Ras superfamily across different public databases and in the literature. However, no dedicated public resource is currently available for data mining and analysis on this family. The present database was developed to facilitate straightforward access, retrieval and analysis of information available on Ras oncogenes from one particular site. Description We have developed the RAS Oncogene Database (RASOnD) as a comprehensive knowledgebase that provides integrated and curated information on a single platform for oncogenes of the Ras superfamily. RASOnD encompasses exhaustive genomics and proteomics data existing across diverse publicly accessible databases. This resource currently includes 199,046 entries in total from 101 different species. It provides a search tool to generate information about their nucleotide and amino acid sequences, single nucleotide polymorphisms, chromosome positions, orthologies, motifs, structures, related pathways and associated diseases. We have implemented a number of user-friendly search interfaces and sequence analysis tools. At present the user can (i) browse the data, (ii) search any field through a simple or advanced search interface and (iii) perform a BLAST search and subsequently a CLUSTALW multiple sequence alignment by selecting sequences of Ras oncogenes. The generic gene browser GBrowse, JMOL for structural visualization and TREEVIEW for phylograms have been integrated for clear perception of retrieved data. External links to related databases have been included in RASOnD. Conclusions This database is a resource and search tool dedicated to Ras oncogenes.
It is useful to cancer biologists and cell molecular biologists as a ready source for research, identification and elucidation of the role of these oncogenes. The data generated can be used for understanding the relationship between the Ras oncogenes and their association with cancer. The database, updated monthly, is freely accessible online at http://202.141.47.181/rasond/ and http://www.aiims.edu/RAS.html.
11
Dlxin-1, a member of MAGE family, inhibits cell proliferation, invasion and tumorigenicity of glioma stem cells. Cancer Gene Ther 2010; 18:206-18. [DOI: 10.1038/cgt.2010.71] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 12/21/2022]
12
Yang L, Zhang X, Chen J, Wang Q, Wang L, Jiang Y, Pan Y. ReCGiP, a database of reproduction candidate genes in pigs based on bibliomics. Reprod Biol Endocrinol 2010; 8:96. [PMID: 20707928 PMCID: PMC3224910 DOI: 10.1186/1477-7827-8-96] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 05/21/2010] [Accepted: 08/14/2010] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Reproduction in pigs is one of the most economically important traits. To improve reproductive performance, numerous studies have focused on the identification of candidate genes. However, it is hard to read all of the literature thoroughly to extract this information. We have therefore developed a database, named ReCGiP, that provides candidate genes for reproduction research in pigs by mining and processing the existing biological literature on humans and pigs. DESCRIPTION Based on text mining and comparative genomics, ReCGiP presents diverse information on reproduction-relevant genes in human and pig. The genes were sorted by their degree of relevance to the reproduction topics and were visualized in a gene co-occurrence network in which two genes were connected if they were co-cited in a PubMed abstract. The 'hub' genes, which had more 'neighbors', were thought to have more important functions and could be identified by the user in their web browser. In addition, ReCGiP provided integrated GO annotation, OMIM and biological pathway information collected from the Internet. Both pig and human gene information can be found in the database, which is now available. CONCLUSIONS ReCGiP is a unique database providing information on reproduction-related genes for pig. It can be used in molecular genetics, genetic linkage maps, and the breeding of pigs and other livestock. Moreover, it can be used as a reference for human reproduction research.
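The co-occurrence network described in this abstract (two genes are linked when they are co-cited in one abstract, and hubs are the genes with the most neighbours) can be illustrated with a short sketch. This is a minimal example under stated assumptions, not ReCGiP's own code: gene mentions are assumed to have been extracted already, one list per abstract, and the function names and the tie-breaking rule are introduced here for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(abstract_genes: list[list[str]]) -> dict[str, set[str]]:
    """Link every pair of genes mentioned together in at least one abstract."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for genes in abstract_genes:
        # De-duplicate mentions so a gene is never paired with itself.
        for a, b in combinations(sorted(set(genes)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def hubs(neighbours: dict[str, set[str]], top_n: int) -> list[str]:
    """Rank genes by number of neighbours (ties broken alphabetically)."""
    ranked = sorted(neighbours, key=lambda g: (-len(neighbours[g]), g))
    return ranked[:top_n]
```

A gene that only ever appears alone in an abstract has no co-citations and therefore does not enter the network at all in this sketch.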
Affiliation(s)
- Lun Yang
- School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Bio-X Center, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Shanghai Jiao Tong University, Shanghai, 200240, China
- Xiangzhe Zhang
- School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai Key Laboratory of Veterinary Biotechnology, Shanghai, 200240, China
- Jian Chen
- Bio-X Center, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Shanghai Jiao Tong University, Shanghai, 200240, China
- Qishan Wang
- School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Lishan Wang
- Bio-X Center, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Shanghai Jiao Tong University, Shanghai, 200240, China
- Yue Jiang
- School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Yuchun Pan
- School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai Key Laboratory of Veterinary Biotechnology, Shanghai, 200240, China
13
Krallinger M, Leitner F, Valencia A. Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 2010; 593:341-382. [PMID: 19957157 DOI: 10.1007/978-1-60327-194-3_16] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.3] [Indexed: 05/28/2023]
Abstract
A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. 
Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.
|
14
|
Abstract
Background: The RAS/RAF/MEK/ERK pathway is involved in the balance between melanocyte proliferation and differentiation. The same pathway is constitutively activated in cutaneous and uveal melanoma (UM) and related to tumour growth and survival. Whereas mutant BRAF and NRAS are responsible for the activation of the RAS/RAF/MEK/ERK pathway in most cutaneous melanoma, mutations in these genes are usually absent in UM. Methods: We set out to explore the RAS/RAF/MEK/ERK pathway and used mitogen-activated protein kinase profiling and tyrosine kinase arrays. Results: We identified Src as a kinase that is associated with ERK1/2 activation in UM. However, low Src levels and reduced ERK1/2 activation in metastatic cell lines suggest that proliferation in metastases can become independent of Src and RAS/RAF/MEK/ERK signalling. Inhibition of Src led to the growth reduction of primary UM cultures and cell lines, whereas metastatic cell line growth was only slightly reduced. Conclusion: We identified Src as an important kinase and a potential target for treatment in primary UM. Metastasis cell lines seemed largely resistant to Src inhibition and indicate that in metastases treatment, a different approach may be required.
|
15
|
Yang CY, Chang CH, Yu YL, Lin TCE, Lee SA, Yen CC, Yang JM, Lai JM, Hong YR, Tseng TL, Chao KM, Huang CYF. PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database. Bioinformatics 2008; 24:i14-20. [PMID: 18689816 DOI: 10.1093/bioinformatics/btn297] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION To fully understand how a protein kinase regulates biological processes, it is imperative to first identify its substrate(s) and interacting protein(s). However, of the 518 known human serine/threonine/tyrosine kinases, 35% have known substrates, while 14% have identified substrate recognition motifs. In contrast, 85% of the kinases have protein-protein interaction (PPI) datasets, raising the possibility that we might reveal potential kinase-substrate pairs from these PPIs. RESULTS PhosphoPOINT, a comprehensive human kinase interactome and phospho-protein database, is a collection of 4195 phospho-proteins with a total of 15 738 phosphorylation sites. PhosphoPOINT annotates the interactions among kinases, with their downstream substrates and with interacting (phospho)-proteins to modulate the kinase-substrate pairs. PhosphoPOINT implements various gene expression profiles and Gene Ontology cellular component information to evaluate each kinase and its interacting (phospho)-proteins/substrates. Integration of cSNPs that cause amino acid changes with the phosphoprotein dataset reveals that 64 phosphorylation sites result in disease phenotypes when changed; the linked phenotypes include schizophrenia and hypertension. PhosphoPOINT also provides a search function for all phospho-peptides using about 300 known kinase/phosphatase substrate/binding motifs. Altogether, PhosphoPOINT provides robust annotation for kinases, their downstream substrates and their interacting (phospho)-proteins, and this should accelerate the functional characterization of kinome-mediated signaling. AVAILABILITY PhosphoPOINT can be freely accessed at http://kinase.bioinformatics.tw/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chia-Ying Yang
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, Republic of China
|
16
|
Rinaldi F, Kappeler T, Kaljurand K, Schneider G, Klenner M, Clematide S, Hess M, von Allmen JM, Parisot P, Romacker M, Vachon T. OntoGene in BioCreative II. Genome Biol 2008; 9 Suppl 2:S13. [PMID: 18834491 PMCID: PMC2559984 DOI: 10.1186/gb-2008-9-s2-s13] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022]
Abstract
Background: Research scientists and companies working in the domains of biomedicine and genomics are increasingly faced with the problem of efficiently locating, within the vast body of published scientific findings, the critical pieces of information that are needed to direct current and future research investment. Results: In this report we describe approaches taken within the scope of the second BioCreative competition in order to solve two aspects of this problem: detection of novel protein interactions reported in scientific articles, and detection of the experimental method that was used to confirm the interaction. Our approach to the former problem is based on a high-recall protein annotation step, followed by two strict disambiguation steps. The remaining proteins are then combined according to a number of lexico-syntactic filters, which deliver high-precision results while maintaining reasonable recall. The detection of the experimental methods is tackled by a pattern matching approach, which has delivered the best results in the official BioCreative evaluation. Conclusion: Although the results of BioCreative clearly show that no tool is sufficiently reliable for fully automated annotations, a few of the proposed approaches (including our own) already perform at a competitive level. This makes them interesting either as standalone tools for preliminary document inspection, or as modules within an environment aimed at supporting the process of curation of biomedical literature.
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland.
|
17
|
Abstract
Background Relationships between entities such as genes, chemicals, metabolites, phenotypes and diseases in MEDLINE are often directional. That is, one may affect the other in a positive or negative manner. Detection of causality and direction is key in piecing pathways together and in examining possible implications of experimental results. Because of the size and growth of biomedical literature, it is increasingly important to be able to automate this process as much as possible. Results Here we present a method of relation extraction using dependency graph parsing with SVM classification. We tested the SVM classifier first on gold standard corpora from GENIA and find it achieved 82% precision and 94.8% recall (F-measure of 87.9) on these standardized test sets. We then applied the entire system to all available MEDLINE abstracts for two target interactions with known effects. We find that while some directional relations are extracted with low ambiguity, others are apparently contradictory, at least when considered in an isolated context. When examined, it is apparent some are dependent upon the surrounding context (e.g. whether the relationship referred to short-term or long-term effects, or whether the focus was extracellular versus intracellular). Conclusion Thesaurus-based directional relation extraction can be done with reasonable accuracy, but is prone to false-positives on larger corpora due to noun modifiers. Furthermore, methods of resolving or disambiguating relationship context and contingencies are important for large-scale corpora.
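The precision, recall and F-measure figures quoted in this abstract can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the function name `f_measure` is ours, and it assumes the reported F-measure is the balanced F1 score (the abstract does not state the weighting).

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported in the abstract: 82% precision, 94.8% recall.
score = f_measure(0.82, 0.948)
print(f"F-measure: {score * 100:.1f}")
```

Under the F1 assumption the result agrees with the 87.9 reported above.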
Affiliation(s)
- Cory B Giles
- Arthritis and Immunology Research Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, Oklahoma 73104-5005, USA.
|
18
|
Kim JD, Ohta T, Tsujii J. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 2008; 9:10. [PMID: 18182099 PMCID: PMC2267702 DOI: 10.1186/1471-2105-9-10] [Citation(s) in RCA: 120] [Impact Index Per Article: 7.1] [Received: 07/26/2007] [Accepted: 01/08/2008] [Indexed: 11/24/2022]
Abstract
Background Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflect biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. Conclusion The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.
Affiliation(s)
- Jin-Dong Kim
- Department of Computer Science, School of Information Science and Technology, University of Tokyo, Tokyo, Japan.
|
19
|
Tsuruoka Y, McNaught J, Tsujii J, Ananiadou S. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics 2007; 23:2768-74. [PMID: 17698493 DOI: 10.1093/bioinformatics/btm393] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed. RESULTS We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks. AVAILABILITY A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.
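The idea of learning a similarity measure with logistic regression can be illustrated with a small, self-contained sketch. Everything below is an assumption made for illustration: the three features (character-bigram Jaccard overlap, length ratio, shared first letter), the toy training pairs, and the plain gradient-descent trainer are ours and are not the features, dictionaries, or training procedure used in the paper.

```python
import math


def bigrams(s: str) -> set:
    """Lower-cased character bigrams of a name."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def features(a: str, b: str) -> list:
    """Hypothetical feature vector for a name pair (illustrative only)."""
    ba, bb = bigrams(a), bigrams(b)
    union = ba | bb
    jaccard = len(ba & bb) / len(union) if union else 0.0
    len_ratio = min(len(a), len(b)) / max(len(a), len(b), 1)
    same_initial = 1.0 if a[:1].lower() == b[:1].lower() else 0.0
    return [1.0, jaccard, len_ratio, same_initial]  # leading 1.0 is the bias term


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    X = [features(a, b) for a, b in pairs]
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(X, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def similarity(w, a: str, b: str) -> float:
    """Learned similarity: modelled probability that a and b name the same entry."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))


# Toy training data (made up): 1 = same concept, 0 = different concepts.
PAIRS = [
    ("interleukin 2", "interleukin-2"),
    ("TNF alpha", "TNF-alpha"),
    ("p53 protein", "protein p53"),
    ("interleukin 2", "insulin"),
    ("TNF alpha", "p53 protein"),
    ("BRCA1", "interleukin 2"),
]
LABELS = [1, 1, 1, 0, 0, 0]

weights = train(PAIRS, LABELS)
print(similarity(weights, "TNF alpha", "TNF-alpha"))
print(similarity(weights, "TNF alpha", "p53 protein"))
```

In a dictionary look-up setting, a query name would be scored against each dictionary entry with `similarity` and the highest-scoring ID returned; a real system would train on far more pairs and richer features.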
Affiliation(s)
- Yoshimasa Tsuruoka
- School of Computer Science, The University of Manchester, Manchester, UK.
|
20
|
Xuan W, Wang P, Watson SJ, Meng F. Medline search engine for finding genetic markers with biological significance. Bioinformatics 2007; 23:2477-84. [PMID: 17823133 DOI: 10.1093/bioinformatics/btm375] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
MOTIVATION Genome-wide high density SNP association studies are expected to identify various SNP alleles associated with different complex disorders. Understanding the biological significance of these SNP alleles in the context of existing literature is a major challenge since existing search engines are not designed to search literature for SNPs or other genetic markers. The literature mining of gene and protein functions has received significant attention and effort while similar work on genetic markers and their related diseases is still in its infancy. Our goal is to develop a web-based tool that facilitates the mining of Medline literature related to genetic studies and gene/protein function studies. Our solution consists of four main function modules for (1) identification of different types of genetic markers or genetic variations in Medline records (2) distinguishing positive versus negative linkage or association between genetic markers and diseases (3) integrating marker genomic location data from different databases to enable the retrieval of Medline records related to markers in the same linkage disequilibrium region (4) and a web interface called MarkerInfoFinder to search, display, sort and download Medline citation results. Tests using published data suggest MarkerInfoFinder can significantly increase the efficiency of finding genetic disorders and their underlying molecular mechanisms. The functions we developed will also be used to build a knowledge base for genetic markers and diseases. AVAILABILITY The MarkerInfoFinder is publicly available at: http://brainarray.mbni.med.umich.edu/brainarray/datamining/MarkerInfoFinder.
Affiliation(s)
- Weijian Xuan
- Molecular and Behavioral Neuroscience Institute and Department of Psychiatry, University of Michigan, Ann Arbor, Michigan 48109, USA
|
21
|
Mueller M, Martens L, Apweiler R. Annotating the human proteome: Beyond establishing a parts list. Biochimica et Biophysica Acta - Proteins and Proteomics 2007; 1774:175-91. [PMID: 17223395 DOI: 10.1016/j.bbapap.2006.11.011] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Received: 07/03/2006] [Revised: 11/16/2006] [Accepted: 11/21/2006] [Indexed: 12/31/2022]
Abstract
The completion of the human genome has shifted the attention from deciphering the sequence to the identification and characterisation of the functional components, including genes. Improved gene prediction algorithms, together with the existing transcript and protein information, have enabled the identification of most exons in a genome. Availability of the 'parts list' has fostered the development of experimental approaches to systematically interrogate gene function on the genome, transcriptome and proteome level. Studying gene function at the protein level is vital to the understanding of how cells perform their functions as variations in protein isoforms and protein quantity which may underlie a change in phenotype can often not be deduced from sequence or transcript level genomics experiments alone. Recent advancements in proteomics have afforded technologies capable of measuring protein expression, post-translational modifications of these proteins, their subcellular localisation and assembly into complexes and pathways. Although an enormous amount of data already exists on the function of many human proteins, much of it is scattered over multiple resources. Public domain databases are therefore required to manage and collate this information and present it to the user community in both a human and machine readable manner. Of special importance here is the integration of heterogeneous data to facilitate the creation of resources that go beyond a mere parts list.
Affiliation(s)
- Michael Mueller
- EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
|
22
|
Rinaldi F, Schneider G, Kaljurand K, Hess M, Romacker M. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics 2006; 7 Suppl 3:S3. [PMID: 17134476 PMCID: PMC1764447 DOI: 10.1186/1471-2105-7-s3-s3] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/11/2022]
Abstract
Background The biomedical domain is witnessing a rapid growth of the amount of published scientific results, which makes it increasingly difficult to filter the core information. There is a real need for support tools that 'digest' the published results and extract the most important information. Results We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. Conclusion The experiments show that our approach described is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
- Gerold Schneider
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
- Kaarel Kaljurand
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
- Michael Hess
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
|
23
|
Kohn KW, Aladjem MI, Weinstein JN, Pommier Y. Molecular interaction maps of bioregulatory networks: a general rubric for systems biology. Mol Biol Cell 2005; 17:1-13. [PMID: 16267266 PMCID: PMC1345641 DOI: 10.1091/mbc.e05-09-0824] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.8] [Indexed: 01/22/2023]
Abstract
A standard for bioregulatory network diagrams is urgently needed in the same way that circuit diagrams are needed in electronics. Several graphical notations have been proposed, but none has become standard. We have prepared many detailed bioregulatory network diagrams using the molecular interaction map (MIM) notation, and we now feel confident that it is suitable as a standard. Here, we describe the MIM notation formally and discuss its merits relative to alternative proposals. We show by simple examples how to denote all of the molecular interactions commonly found in bioregulatory networks. There are two forms of MIM diagrams. "Heuristic" MIMs present the repertoire of interactions possible for molecules that are colocalized in time and place. "Explicit" MIMs define particular models (derived from heuristic MIMs) for computer simulation. We show also how pathways or processes can be highlighted on a canonical heuristic MIM. Drawing a MIM diagram, adhering to the rules of notation, imposes a logical discipline that sharpens one's understanding of the structure and function of a network.
Affiliation(s)
- Kurt W Kohn
- Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
|
24
|
Abstract
UNLABELLED BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus
Affiliation(s)
- Hongfang Liu
- Department of Information Systems, University of Maryland at Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
|
25
|
Abstract
A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics.
Affiliation(s)
- Martin Krallinger
- Protein Design Group, National Center of Biotechnology, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain
- Alfonso Valencia
- Protein Design Group, National Center of Biotechnology, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain
|
26
|
Mullen T, Mizuta Y, Collier N. A baseline feature set for learning rhetorical zones using full articles in the biomedical domain. ACM SIGKDD Explorations Newsletter 2005. [DOI: 10.1145/1089815.1089823] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Abstract
At a time when experimental throughput in the field of molecular biology is increasing, it is necessary for biologists and people working in related fields to have access to sophisticated tools to enable them to efficiently process large amounts of information in order to stay abreast of current research. Rhetorical zone analysis is an application of natural language processing in which areas of text in scientific papers are classified in terms of argumentation and intellectual contribution in order to pinpoint and distinguish certain types of information. Such analysis can be employed to assist in information extraction, helping to assess and integrate data generated by experiments into the scientific community's store of knowledge. We present results for several experiments in automatic zone identification on the ZAISA-1 dataset, a new dataset composed of full biomedical research papers hand-annotated for rhetorical zones. We concentrate on general purpose and linguistically motivated features, and report results for a variety of sets of features. It is our intention to provide a baseline feature set for modeling, which can be extended in future work using combinations of heuristics and more sophisticated and task-specific modeling techniques.
Affiliation(s)
- Tony Mullen
- National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
- Yoko Mizuta
- National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
- Nigel Collier
- National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
|
27
|
Hakenberg J, Schmeier S, Kowald A, Klipp E, Leser U. Finding kinetic parameters using text mining. OMICS: A Journal of Integrative Biology 2005; 8:131-52. [PMID: 15268772 DOI: 10.1089/1536231041388366] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]
Abstract
The mathematical modeling and description of complex biological processes has become more and more important over the last years. Systems biology aims at the computational simulation of complex systems, up to whole cell simulations. An essential part focuses on solving a large number of parameterized differential equations. However, measuring those parameters is an expensive task, and finding them in the literature is very laborious. We developed a text mining system that supports researchers in their search for experimentally obtained parameters for kinetic models. Our system classifies full text documents regarding the question whether or not they contain appropriate data using a support vector machine. We evaluated our approach on a manually tagged corpus of 800 documents and found that it outperforms keyword searches in abstracts by a factor of five in terms of precision.
Affiliation(s)
- Jörg Hakenberg
- Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany.
|
28
|
Eis K, Ince SJ, Jahn C, Jautelat R, Katchourovsky V, Kettschau G, Woloszczak R. Kinase Data Mining: Dealing with the Information (Over-)Flow. Chembiochem 2005; 6:567-70. [PMID: 15712317 DOI: 10.1002/cbic.200400154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Knut Eis
- Medicinal Chemistry, Research Center Europe, Schering AG, Corporate Research, 13342 Berlin, Germany.
|
29
|
Santos C, Eggle D, States DJ. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 2004; 21:1653-8. [PMID: 15564295 DOI: 10.1093/bioinformatics/bti165] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION Wnt signaling is a very active area of research with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work we describe a natural language processing (NLP) system that is able to identify references to biological interaction networks in free text and automatically assembles a protein association and interaction map. RESULTS A 'gold standard' set of names and assertions was derived by manual scanning of the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html) including 53 interactions involved in Wnt signaling. This system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling including 3369 Pubmed and 1230 full text papers. Names for key Wnt-pathway associated proteins and biological entities are identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature as compared to the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website when more specific terms occur in the literature, and one typographic error on the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction of the corpus, 34 of the 53 interactions in the 'gold standard' Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules which were missing or different from those in the canonical diagram, and these were confirmed by manual review of the text. 
These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases. AVAILABILITY The pipeline software components are freely available on request to the authors. CONTACT dstates@umich.edu SUPPLEMENTARY INFORMATION http://stateslab.bioinformatics.med.umich.edu/software.html.
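The chi-squared test for noun phrases over-represented in one corpus relative to another can be sketched as a 2x2 contingency computation. This is a minimal illustration and not the authors' pipeline: the function name and the document counts below are hypothetical, and only the 34-of-53 recall figure is taken from the abstract.

```python
def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for a 2x2 table.

    a: domain documents containing the phrase
    b: domain documents without the phrase
    c: background documents containing the phrase
    d: background documents without the phrase
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: a phrase found in 120 of 1000 Wnt documents
# and in 30 of 1000 general signal transduction documents.
stat = chi_squared_2x2(120, 880, 30, 970)

# 3.84 is the 5% critical value of chi-squared with one degree of freedom.
print("over-represented" if stat > 3.84 else "not significant")

# Recall reported in the abstract: 34 of 53 gold-standard interactions.
print(f"recall: {34 / 53:.0%}")
```

Phrases whose statistic exceeds the chosen critical value, and whose frequency is higher in the domain corpus, would be kept as candidate entity names.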
Affiliation(s)
- Carlos Santos
- Bioinformatics Program, The University of Michigan, Ann Arbor, MI 48109, USA
|
30
|
Koike A, Niwa Y, Takagi T. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 2004; 21:1227-36. [PMID: 15509601 DOI: 10.1093/bioinformatics/bti084] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Since automatic functional annotations of genes are quite useful for interpreting large amounts of high-throughput data efficiently, the demand for automatic extraction of information related to gene functions from text has been increasing. RESULTS We have developed a method for automatically extracting the biological process functions of genes/protein/families based on Gene Ontology (GO) from text using a shallow parser and sentence structure analysis techniques. When the gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of the gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54-64% with a precision of 91-94% for actually described functions in abstracts. When applied to the PUBMED, it extracted over 190 000 gene-GO relationships and 150 000 family-GO relationships for major eukaryotes.
Affiliation(s)
- Asako Koike
- Department of Computational Biology, Graduate School of Frontier Science, The University of Tokyo Kiban-3A1(CB01) 5-1-5, Kashiwanoha Kashiwa, Chiba 277-8561, Japan.
|
31
|
Affiliation(s)
- Samir Hanash
- Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, M5-C800, PO Box 19024, Seattle, Washington 98109, USA.
|