101
|
Li S, Wu L, Zhang Z. Constructing biological networks through combined literature mining and microarray analysis: a LMMA approach. Bioinformatics 2006; 22:2143-50. [PMID: 16820422 DOI: 10.1093/bioinformatics/btl363] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.3] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Network reconstruction of biological entities is very important for understanding biological processes and the organizational principles of biological systems. This work focuses on integrating the literature with microarray gene-expression data, and a combined literature mining and microarray analysis (LMMA) approach is developed to construct gene networks of a specific biological system. RESULTS In the LMMA approach, a global network is first constructed using the literature-based co-occurrence method. It is then refined using microarray data through a multivariate selection procedure. An application of LMMA to angiogenesis is presented. Our results show that the LMMA-based network is more reliable than the co-occurrence-based network in dealing with multiple levels of KEGG gene, KEGG Orthology and pathway. AVAILABILITY The LMMA program is available upon request.
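The literature-based co-occurrence step mentioned in this abstract can be illustrated with a minimal sketch. This is not the LMMA program: the function name `cooccurrence_network`, the whitespace tokenization, and the `min_count` threshold are all assumptions made for illustration, and the microarray refinement step is not covered.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(abstracts, genes, min_count=1):
    """Count how often each pair of gene names appears in the same abstract.

    Hypothetical helper: tokenization is a plain whitespace split, which a
    real text-mining pipeline would replace with proper gene-name tagging.
    """
    pair_counts = Counter()
    for text in abstracts:
        tokens = set(text.lower().split())
        # Sort so each unordered pair is always stored in the same order.
        present = sorted(g for g in genes if g.lower() in tokens)
        pair_counts.update(combinations(present, 2))
    # Keep only edges supported by at least `min_count` abstracts.
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Toy abstracts, invented for the example.
abstracts = [
    "VEGF and KDR regulate angiogenesis",
    "KDR signalling requires VEGF",
    "TP53 is unrelated to KDR here",
]
network = cooccurrence_network(abstracts, ["VEGF", "KDR", "TP53"], min_count=2)
```

Raising `min_count` trades recall for precision: pairs mentioned together only once are dropped as weakly supported edges.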
|
102
|
Abstract
MOTIVATION Recently, several information extraction systems have been developed to retrieve relevant information out of biomedical text. However, these methods represent individual efforts. In this paper, we show that by combining different algorithms and their outcome, the results improve significantly. For this reason, CONAN has been created, a system which combines different programs and their outcome. Its methods include tagging of gene/protein names, finding interaction and mutation data, tagging of biological concepts and linking to MeSH and Gene Ontology terms. RESULTS In this paper, we will present data that show that combining different text-mining algorithms significantly improves the results. Not only is CONAN a full-scale approach that will ultimately cover all of PubMed/MEDLINE, we also show that this universality has no effect on quality: our system performs as well as or better than existing systems. AVAILABILITY The LDD corpus presented is available by request to the author. The system will be available shortly. For information and updates on CONAN please visit http://www.cs.uu.nl/people/rainer/conan.html.
|
103
|
McGuffin LJ, Smith RT, Bryson K, Sørensen SA, Jones DT. High throughput profile-profile based fold recognition for the entire human proteome. BMC Bioinformatics 2006; 7:288. [PMID: 16759376 PMCID: PMC1513610 DOI: 10.1186/1471-2105-7-288] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 02/22/2006] [Accepted: 06/07/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In order to maintain the most comprehensive structural annotation databases we must carry out regular updates for each proteome using the latest profile-profile fold recognition methods. The ability to carry out these updates on demand is necessary to keep pace with the regular updates of sequence and structure databases. Providing the highest quality structural models requires the most intensive profile-profile fold recognition methods running with the very latest available sequence databases and fold libraries. However, running these methods on such a regular basis for every sequenced proteome requires large amounts of processing power. In this paper we describe and benchmark the JYDE (Job Yield Distribution Environment) system, which is a meta-scheduler designed to work above cluster schedulers, such as Sun Grid Engine (SGE) or Condor. We demonstrate the ability of JYDE to distribute the load of genomic-scale fold recognition across multiple independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software in order to annotate the latest version of the Human proteome against the latest sequence and structure databases in as short a time as possible. RESULTS We show that our JYDE system is able to scale to large numbers of intensive fold recognition jobs running across several independent computer clusters. Using our JYDE system we have been able to annotate 99.9% of the protein sequences within the Human proteome in less than 24 hours, by harnessing over 500 CPUs from 3 independent Grid domains. CONCLUSION This study clearly demonstrates the feasibility of carrying out on demand high quality structural annotations for the proteomes of major eukaryotic organisms. Specifically, we have shown that it is now possible to provide complete regular updates of profile-profile based fold recognition models for entire eukaryotic proteomes, through the use of Grid middleware such as JYDE.
|
104
|
Stein KK, Go JC, Lane WS, Primakoff P, Myles DG. Proteomic analysis of sperm regions that mediate sperm-egg interactions. Proteomics 2006; 6:3533-43. [PMID: 16758446 DOI: 10.1002/pmic.200500845] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Indexed: 11/11/2022]
Abstract
The sperm interacts with three oocyte-associated structures during fertilization: the cumulus cell layer surrounding the oocyte, the egg extracellular matrix (the zona pellucida), and the oocyte plasma membrane. Each of these interactions is mediated by the sperm head, probably through proteins both on the sperm surface and within the acrosome, a specialized secretory granule. In this study, we have used subcellular fractionation in order to generate a proteome of the sperm head subcellular compartments that interact with oocytes. Of the proteins we identified for which a gene knockout has been tested, a third have been shown to be essential for efficient reproduction in vivo. Many of the other presently untested proteins are likely to have a similarly important role. Twenty-five percent of the cell surface fraction proteins are previously uncharacterized. We have shown that at least two of these novel proteins are localized to the sperm head. In summary, we have identified over 100 proteins that are expressed on mature sperm at the site of sperm-oocyte interactions.
|
105
|
García-Serna R, Opatowski L, Mestres J. FCP: functional coverage of the proteome by structures. Bioinformatics 2006; 22:1792-3. [PMID: 16705012 DOI: 10.1093/bioinformatics/btl188] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Tools and resources for translating the remarkable growth witnessed in recent years in the number of protein structures determined experimentally into actual gain in the functional coverage of the proteome are becoming increasingly necessary. We introduce FCP, a publicly accessible web tool dedicated to analyzing the current state and trends of the population of structures within protein families. FCP offers both graphical and quantitative data on the degree of functional coverage of enzymes and nuclear receptors by existing structures, as well as on the bias observed in the distribution of structures along their respective functional classification schemes. AVAILABILITY http://cgl.imim.es/fcp CONTACT jmestres@imim.es.
|
106
|
Abstract
MOTIVATION Amino acid changing mutations in proteins are constrained by purifying selection and accumulate at different rates. We estimate evolutionary rates on multiple alignments of eukaryotic protein families in a maximum likelihood framework and spot sets of slow and fast evolving proteins. RESULTS We find that the evolution of indispensable proteins is constrained by selection and that protein secretion is coupled to an increased evolutionary rate.
|
107
|
Palidwor G, Reynaud EG, Andrade-Navarro MA. Taxonomic colouring of phylogenetic trees of protein sequences. BMC Bioinformatics 2006; 7:79. [PMID: 16503967 PMCID: PMC1386715 DOI: 10.1186/1471-2105-7-79] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 11/10/2005] [Accepted: 02/17/2006] [Indexed: 11/10/2022] Open
Abstract
Background Phylogenetic analyses of protein families are used to define the evolutionary relationships between homologous proteins. The interpretation of protein-sequence phylogenetic trees requires the examination of the taxonomic properties of the species associated to those sequences. However, there is no online tool to facilitate this interpretation, for example, by automatically attaching taxonomic information to the nodes of a tree, or by interactively colouring the branches of a tree according to any combination of taxonomic divisions. This is especially problematic if the tree contains on the order of hundreds of sequences, which, given the accelerated increase in the size of the protein sequence databases, is a situation that is becoming common. Results We have developed PhyloView, a web based tool for colouring phylogenetic trees upon arbitrary taxonomic properties of the species represented in a protein sequence phylogenetic tree. Provided that the tree contains SwissProt, SpTrembl, or GenBank protein identifiers, the tool retrieves the taxonomic information from the corresponding database. A colour picker displays a summary of the findings and allows the user to associate colours to the leaves of the tree according to any number of taxonomic partitions. Then, the colours are propagated to the branches of the tree. Conclusion PhyloView can be used at . A tutorial, the software with documentation, and GPL licensed source code, can be accessed at the same web address.
|
108
|
Yang C, Zeng E, Li T, Narasimhan G. Clustering genes using gene expression and text literature data. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2006:329-40. [PMID: 16447990 DOI: 10.1109/csb.2005.23] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Clustering of gene expression data is a standard technique used to identify closely related genes. In this paper, we develop a new clustering algorithm, MSC (Multi-Source Clustering), to perform exploratory analysis using two or more diverse sources of data. In particular, we investigate the problem of improving the clustering by integrating information obtained from gene expression data with knowledge extracted from biomedical text literature. In each iteration of algorithm MSC, an EM-type procedure is employed to bootstrap the model obtained from one data source by starting with the cluster assignments obtained in the previous iteration using the other data sources. Upon convergence, the two individual models are used to construct the final cluster assignment. We compare the results of algorithm MSC for two data sources with the results obtained when the clustering is applied on the two sources of data separately. We also compare it with that obtained using the feature level integration method that performs the clustering after simply concatenating the features obtained from the two data sources. We show that the z-scores of the clustering results from MSC are better than that from the other methods. To evaluate our clusters better, function enrichment results are presented using terms from the Gene Ontology database. Finally, by investigating the success of motif detection programs that use the clusters, we show that our approach integrating gene expression data and text data reveals clusters that are biologically more meaningful than those identified using gene expression data alone.
|
109
|
Li H, Li J, Wong L. Discovering motif pairs at interaction sites from protein sequences on a proteome-wide scale. Bioinformatics 2006; 22:989-96. [PMID: 16446278 DOI: 10.1093/bioinformatics/btl020] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Protein-protein interaction, mediated by protein interaction sites, is intrinsic to many functional processes in the cell. In this paper, we propose a novel method to discover patterns in protein interaction sites. We observed from protein interaction networks that there exists a kind of significant substructure, called interacting protein group pairs, which exhibit an all-versus-all interaction between the two protein sets in such a pair. The full interaction between the pair indicates a common interaction mechanism shared by the proteins in the pair, which can be referred to as an interaction type. Motif pairs at the interaction sites of the protein group pairs can be used to represent such an interaction type, with each motif derived from the sequences of a protein group by standard motif discovery algorithms. The systematic discovery of all pairs of interacting protein groups from large protein interaction networks is a computationally challenging problem. By a careful and sophisticated problem transformation, the problem is solved using efficient algorithms for mining frequent patterns, a problem extensively studied in data mining. RESULTS We found 5349 pairs of interacting protein groups from a yeast interaction dataset. The expected value of sequence identity within the groups is only 7.48%, indicating non-homology within these protein groups. We derived 5343 motif pairs from these group pairs, represented in the form of blocks. Comparing our motifs with domains in the BLOCKS and PRINTS databases, we found that our blocks could be mapped to an average of 3.08 correlated blocks in these two databases. The mapped blocks occur in 4221 out of a total of 6794 domains (protein groups) in these two databases. Comparing our motif pairs with iPfam, consisting of 3045 interacting domain pairs derived from PDB, we found 47 matches occurring in 105 distinct PDB complexes. Comparing with another putative domain interaction database, InterDom, we found 203 matches.
AVAILABILITY http://research.i2r.a-star.edu.sg/BindingMotifPairs/resources. SUPPLEMENTARY INFORMATION http://research.i2r.a-star.edu.sg/BindingMotifPairs and Bioinformatics online.
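The all-versus-all property that defines an interacting protein group pair can be checked with a few lines of code. The sketch below only verifies the property for two given groups; it does not perform the frequent-pattern mining the authors use to discover such pairs, and the function name `is_all_versus_all` and the protein identifiers are hypothetical.

```python
def is_all_versus_all(interactions, group_a, group_b):
    """Return True if every protein in group_a interacts with every protein in group_b.

    `interactions` is an iterable of undirected (protein, protein) pairs.
    Hypothetical helper for illustration only.
    """
    # Store each interaction as a frozenset so (a, b) and (b, a) are equivalent.
    edges = {frozenset(pair) for pair in interactions}
    return all(frozenset((a, b)) in edges for a in group_a for b in group_b)


# Toy interaction list, invented for the example.
interactions = [
    ("P1", "Q1"), ("P1", "Q2"),
    ("P2", "Q1"), ("P2", "Q2"),
    ("P3", "Q1"),
]
full = is_all_versus_all(interactions, {"P1", "P2"}, {"Q1", "Q2"})
partial = is_all_versus_all(interactions, {"P1", "P2", "P3"}, {"Q1", "Q2"})
```

In graph terms, the two groups must form a complete bipartite subgraph of the interaction network; here `P3` breaks the property because it lacks an interaction with `Q2`.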
|
110
|
Höglund A, Dönnes P, Blum T, Adolph HW, Kohlbacher O. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics 2006; 22:1158-65. [PMID: 16428265 DOI: 10.1093/bioinformatics/btl002] [Citation(s) in RCA: 213] [Impact Index Per Article: 11.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Functional annotation of unknown proteins is a major goal in proteomics. A key annotation is the prediction of a protein's subcellular localization. Numerous prediction techniques have been developed, typically focusing on a single underlying biological aspect or predicting a subset of all possible localizations. An important step is taken towards emulating the protein sorting process by capturing and bringing together biologically relevant information, and addressing the clear need to improve prediction accuracy and localization coverage. RESULTS Here we present a novel SVM-based approach for predicting subcellular localization, which integrates N-terminal targeting sequences, amino acid composition and protein sequence motifs. We show how this approach improves the prediction based on N-terminal targeting sequences, by comparing our method TargetLoc against existing methods. Furthermore, MultiLoc performs considerably better than comparable methods predicting all major eukaryotic subcellular localizations, and shows better or comparable results to methods that are specialized on fewer localizations or for one organism. AVAILABILITY http://www-bs.informatik.uni-tuebingen.de/Services/MultiLoc/
|
111
|
Chen F, Mackey AJ, Stoeckert CJ, Roos DS. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006; 34:D363-8. [PMID: 16381887 PMCID: PMC1347485 DOI: 10.1093/nar/gkj123] [Citation(s) in RCA: 647] [Impact Index Per Article: 35.9] [Received: 08/15/2005] [Revised: 10/20/2005] [Accepted: 10/20/2005] [Indexed: 11/12/2022] Open
Abstract
The OrthoMCL database (http://orthomcl.cbil.upenn.edu) houses ortholog group predictions for 55 species, including 16 bacterial and 4 archaeal genomes representing phylogenetically diverse lineages, and most currently available complete eukaryotic genomes: 24 unikonts (12 animals, 9 fungi, microsporidium, Dictyostelium, Entamoeba), 4 plants/algae and 7 apicomplexan parasites. OrthoMCL software was used to cluster proteins based on sequence similarity, using an all-against-all BLAST search of each species' proteome, followed by normalization of inter-species differences, and Markov clustering. A total of 511,797 proteins (81.6% of the total dataset) were clustered into 70,388 ortholog groups. The ortholog database may be queried based on protein or group accession numbers, keyword descriptions or BLAST similarity. Ortholog groups exhibiting specific phyletic patterns may also be identified, using either a graphical interface or a text-based Phyletic Pattern Expression grammar. Information for ortholog groups includes the phyletic profile, the list of member proteins and a multiple sequence alignment, a statistical summary and graphical view of similarities, and a graphical representation of domain architecture. OrthoMCL software, the entire FASTA dataset employed and clustering results are available for download. OrthoMCL-DB provides a centralized warehouse for orthology prediction among multiple species, and will be updated and expanded as additional genome sequence data become available.
|
112
|
Abstract
The structural analysis of large metabolic networks exhibits a combinatorial explosion of elementary modes. A new method of classification has been developed [called aggregation around common motif (ACoM)], which groups elementary modes into classes with similar substructures. This method is applied to the tricarboxylic acid cycle and metabolite carriers. The analysis of this network reveals a large number of elementary flux modes (204) despite the low number of reactions (23). ACoM is used to classify these elementary modes into a small number of sets (8) with biological meaning.
|
113
|
Zhu J, Chen S, Alvarez S, Asirvatham VS, Schachtman DP, Wu Y, Sharp RE. Cell wall proteome in the maize primary root elongation zone. I. Extraction and identification of water-soluble and lightly ionically bound proteins. PLANT PHYSIOLOGY 2006; 140:311-25. [PMID: 16377746 PMCID: PMC1326053 DOI: 10.1104/pp.105.070219] [Citation(s) in RCA: 108] [Impact Index Per Article: 6.0] [Received: 08/19/2005] [Revised: 10/04/2005] [Accepted: 11/07/2005] [Indexed: 05/05/2023]
Abstract
Cell wall proteins (CWPs) play important roles in various processes, including cell elongation. However, relatively little is known about the composition of CWPs in growing regions. We are using a proteomics approach to gain a comprehensive understanding of the identity of CWPs in the maize (Zea mays) primary root elongation zone. As the first step, we examined the effectiveness of a vacuum infiltration-centrifugation technique for extracting water-soluble and loosely ionically bound (fraction 1) CWPs from the root elongation zone. The purity of the CWP extract was evaluated by comparing with total soluble proteins extracted from homogenized tissue. Several lines of evidence indicated that the vacuum infiltration-centrifugation technique effectively enriched for CWPs. Protein identification revealed that 84% of the CWPs were different from the total soluble proteins. About 40% of the fraction 1 CWPs had traditional signal peptides and 33% were predicted to be nonclassical secretory proteins, whereas only 3% and 11%, respectively, of the total soluble proteins were in these categories. Many of the CWPs have previously been shown to be involved in cell wall metabolism and cell elongation. In addition, maize has type II cell walls, and several of the CWPs identified in this study have not been identified in previous cell wall proteomics studies that have focused only on type I walls. These proteins include endo-1,3;1,4-beta-D-glucanase and alpha-L-arabinofuranosidase, which act on the major polysaccharides only or mainly present in type II cell walls.
|
114
|
Persico M, Ceol A, Gavrila C, Hoffmann R, Florio A, Cesareni G. HomoMINT: an inferred human network based on orthology mapping of protein interactions discovered in model organisms. BMC Bioinformatics 2005; 6 Suppl 4:S21. [PMID: 16351748 PMCID: PMC1866386 DOI: 10.1186/1471-2105-6-s4-s21] [Citation(s) in RCA: 108] [Impact Index Per Article: 5.7] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND The application of high throughput approaches to the identification of protein interactions has offered for the first time a glimpse of the global interactome of some model organisms. Until now, however, such genome-wide approaches have not been applied to the human proteome. RESULTS In order to fill this gap we have assembled an inferred human protein interaction network where interactions discovered in model organisms are mapped onto the corresponding human orthologs. In addition to a stringent assignment to orthology classes based on the InParanoid algorithm, we have implemented a string matching algorithm to filter out orthology assignments of proteins whose global domain organization is not conserved. Finally, we have assessed the accuracy of our own, and related, inferred networks by benchmarking them against i) an assembled experimental interactome and ii) a network derived by mining of the scientific literature, and iii) by measuring the enrichment of interacting protein pairs sharing common Gene Ontology annotation. CONCLUSION The resulting networks are named HomoMINT and HomoMINT_filtered, the latter being based on the orthology table filtered by the domain architecture matching algorithm. They contain 9749 and 5203 interactions respectively and can be analyzed and viewed in the context of the experimentally verified interactions between human proteins stored in the MINT database. HomoMINT is constantly updated to take into account the growing information in the MINT database.
|
115
|
López-Bigas N, Blencowe BJ, Ouzounis CA. Highly consistent patterns for inherited human diseases at the molecular level. Bioinformatics 2005; 22:269-77. [PMID: 16287936 DOI: 10.1093/bioinformatics/bti781] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
Over 1600 mammalian genes are known to cause an inherited disorder, when subjected to one or more mutations. These disease genes represent a unique resource for the identification and quantification of relationships between phenotypic attributes of a disease and the molecular features of the associated disease genes, including their ascribed annotated functional classes and expression patterns. Such analyses can provide a more global perspective and a deeper understanding of the probable causes underlying human hereditary diseases. In this perspective and critical view of disease genomics, we present a comparative analysis of genes reported to cause inherited diseases in humans in terms of their causative effects on physiology, their genetics and inheritance modes, the functional processes they are involved in and their expression profiles across a wide spectrum of tissues. Our analysis reveals that there are more extensive correlations between these attributes of genetic disease genes than previously appreciated. For instance, the functional pattern of genes causing dominant and recessive diseases is markedly different. Also, the function of the genes and their expression correlate with the type of disease they cause when mutated. The results further indicate that a comparative genomics approach for the analysis of genes linked to human genetic diseases will facilitate the elucidation of the underlying molecular and cellular mechanisms.
|
116
|
Sharabiani MTA, Siermala M, Lehtinen TO, Vihinen M. Dynamic covariation between gene expression and proteome characteristics. BMC Bioinformatics 2005; 6:215. [PMID: 16131395 PMCID: PMC1236912 DOI: 10.1186/1471-2105-6-215] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/30/2004] [Accepted: 08/30/2005] [Indexed: 02/07/2023] Open
Abstract
Background Cells react to changing intra- and extracellular signals by dynamically modulating complex biochemical networks. Cellular responses to extracellular signals lead to changes in gene and protein expression. Since the majority of genes encode proteins, we investigated possible correlations between protein parameters and gene expression patterns to identify proteome-wide characteristics indicative of trends common to expressed proteins. Results Numerous bioinformatics methods were used to filter and merge information regarding gene and protein annotations. A new statistical time point-oriented analysis was developed for the study of dynamic correlations in large time series data. The method was applied to investigate microarray datasets for different cell types, organisms and processes, including human B and T cell stimulation, Drosophila melanogaster life span, and Saccharomyces cerevisiae cell cycle. Conclusion We show that the properties of proteins synthesized correlate dynamically with the gene expression profile, indicating that not only is the actual identity and function of expressed proteins important for cellular responses but that several physicochemical and other protein properties correlate with gene expression as well. Gene expression correlates strongly with amino acid composition, composition- and sequence-derived variables, functional, structural, localization and gene ontology parameters. Thus, our results suggest that a dynamic relationship exists between proteome properties and gene expression in many biological systems, and therefore this relationship is fundamental to understanding cellular mechanisms in health and disease.
|
117
|
Abstract
We analyzed length differences of eukaryotic, bacterial and archaeal proteins in relation to function, conservation and environmental factors. Comparing Eukaryotes and Prokaryotes, we found that the greater length of eukaryotic proteins is pervasive over all functional categories and involves the vast majority of protein families. The magnitude of these differences suggests that the evolution of eukaryotic proteins was influenced by processes of fusion of single-function proteins into extended multi-functional and multi-domain proteins. Comparing Bacteria and Archaea, we determined that the small but significant length difference observed between their proteins results from a combination of three factors: (i) bacterial proteomes include a greater proportion than archaeal proteomes of longer proteins involved in metabolism or cellular processes, (ii) within most functional classes, protein families unique to Bacteria are generally longer than protein families unique to Archaea and (iii) within the same protein family, homologs from Bacteria tend to be longer than the corresponding homologs from Archaea. These differences are interpreted with respect to evolutionary trends and prevailing environmental conditions within the two prokaryotic groups.
|
118
|
Mao X, Cai T, Olyarchuk JG, Wei L. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 2005; 21:3787-93. [PMID: 15817693 DOI: 10.1093/bioinformatics/bti430] [Citation(s) in RCA: 2257] [Impact Index Per Article: 118.8] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION High-throughput technologies such as DNA sequencing and microarrays have created the need for automated annotation of large sets of genes, including whole genomes, and automated identification of pathways. Ontologies, such as the popular Gene Ontology (GO), provide a common controlled vocabulary for these types of automated analysis. Yet, while GO offers tremendous value, it also has certain limitations such as the lack of direct association with pathways. RESULTS We demonstrated the use of the KEGG Orthology (KO), part of the KEGG suite of resources, as an alternative controlled vocabulary for automated annotation and pathway identification. We developed a KO-Based Annotation System (KOBAS) that can automatically annotate a set of sequences with KO terms and identify both the most frequent and the statistically significantly enriched pathways. Results from both whole genome and microarray gene cluster annotations with KOBAS are comparable and complementary to known annotations. KOBAS is a freely available stand-alone Python program that can contribute significantly to genome annotation and microarray analysis.
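Identifying statistically enriched pathways is commonly done with an over-representation test. The abstract does not say which test KOBAS uses, so the hypergeometric tail probability below is an assumption chosen for illustration, and `hypergeom_enrichment_p` is a hypothetical name, not part of KOBAS.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes assigned to the pathway
    sample:     number of genes in the query set
    hits:       query genes assigned to the pathway
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of drawing k pathway genes for every k >= hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 queried, 2 of them hits.
p_value = hypergeom_enrichment_p(population=10, annotated=4, sample=3, hits=2)
```

In practice the p-values from testing many pathways would also need a multiple-testing correction, which this sketch omits.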
|
119
|
Pedrioli PGA, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, Cheung K, Costello CE, Hermjakob H, Huang S, Julian RK, Kapp E, McComb ME, Oliver SG, Omenn G, Paton NW, Simpson R, Smith R, Taylor CF, Zhu W, Aebersold R. A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2005; 22:1459-66. [PMID: 15529173 DOI: 10.1038/nbt1031] [Citation(s) in RCA: 570] [Impact Index Per Article: 30.0] [Indexed: 12/31/2022]
Abstract
A broad range of mass spectrometers are used in mass spectrometry (MS)-based proteomics research. Each type of instrument possesses a unique design, data system and performance specifications, resulting in strengths and weaknesses for different types of experiments. Unfortunately, the native binary data formats produced by each type of mass spectrometer also differ and are usually proprietary. The diverse, nontransparent nature of the data structure complicates the integration of new instruments into preexisting infrastructure, impedes the analysis, exchange, comparison and publication of results from different experiments and laboratories, and prevents the bioinformatics community from accessing data sets required for software development. Here, we introduce the 'mzXML' format, an open, generic XML (extensible markup language) representation of MS data. We have also developed an accompanying suite of supporting programs. We expect that this format will facilitate data management, interpretation and dissemination in proteomics research.
|
120
|
Grønborg M, Bunkenborg J, Kristiansen TZ, Jensen ON, Yeo CJ, Hruban RH, Maitra A, Goggins MG, Pandey A. Comprehensive proteomic analysis of human pancreatic juice. J Proteome Res 2004; 3:1042-55. [PMID: 15473694 DOI: 10.1021/pr0499085] [Citation(s) in RCA: 173] [Impact Index Per Article: 9.1] [Indexed: 12/16/2022]
Abstract
Proteomic technologies provide an excellent means for analysis of body fluids for cataloging protein constituents and identifying biomarkers for early detection of cancers. The biomarkers currently available for pancreatic cancer, such as CA19-9, lack adequate sensitivity and specificity contributing to late diagnosis of this deadly disease. In this study, we carried out a comprehensive characterization of the "pancreatic juice proteome" in patients with pancreatic adenocarcinoma. Pancreatic juice was first fractionated by 1-dimensional gel electrophoresis and subsequently analyzed by liquid chromatography tandem mass spectrometry (LC-MS/MS). A total of 170 unique proteins were identified including known pancreatic cancer tumor markers (e.g., CEA, MUC1) and proteins overexpressed in pancreatic cancers (e.g., hepatocarcinoma-intestine-pancreas/pancreatitis-associated protein (HIP/PAP) and lipocalin 2). In addition, we identified a number of proteins that have not been previously described in pancreatic juice (e.g., tumor rejection antigen (pg96) and azurocidin). Interestingly, a novel protein that is 85% identical to HIP/PAP was identified, which we have designated as PAP-2. The proteins identified in this study could be directly assessed for their potential as biomarkers for pancreatic cancer by quantitative proteomics methods or immunoassays.
MESH Headings
- Agglutinins/analysis
- Agglutinins/genetics
- Agglutinins/metabolism
- Amino Acid Sequence
- Antigens, Neoplasm/analysis
- Antigens, Neoplasm/genetics
- Antigens, Neoplasm/metabolism
- Antimicrobial Cationic Peptides
- Biomarkers, Tumor/analysis
- Biomarkers, Tumor/genetics
- Biomarkers, Tumor/metabolism
- Blood Proteins/analysis
- Blood Proteins/metabolism
- Calcium-Binding Proteins/genetics
- Carrier Proteins/analysis
- Carrier Proteins/genetics
- Carrier Proteins/metabolism
- Cell Adhesion Molecules/analysis
- Cell Adhesion Molecules/genetics
- Cell Adhesion Molecules/metabolism
- Chromatography, Liquid
- DNA-Binding Proteins
- Electrophoresis, Polyacrylamide Gel
- Gene Expression/genetics
- Glycoproteins/genetics
- Humans
- Lectins, C-Type/analysis
- Lectins, C-Type/genetics
- Lectins, C-Type/metabolism
- Lithostathine
- Mass Spectrometry
- Membrane Proteins/analysis
- Membrane Proteins/genetics
- Membrane Proteins/metabolism
- Molecular Sequence Data
- Pancreatic Juice/chemistry
- Pancreatic Juice/metabolism
- Pancreatic Neoplasms/metabolism
- Pancreatitis-Associated Proteins
- Peptide Fragments/analysis
- Phylogeny
- Proteome/analysis
- Proteome/classification
- Proteome/genetics
- RNA, Messenger/genetics
- RNA, Messenger/metabolism
- Receptors, Cell Surface/analysis
- Receptors, Cell Surface/genetics
- Receptors, Cell Surface/metabolism
- Sequence Alignment
- Sequence Homology, Amino Acid
- Trypsin/metabolism
- Tumor Suppressor Proteins
- alpha-Defensins/analysis
- alpha-Defensins/genetics
- alpha-Defensins/metabolism
|
121
|
Abstract
Previous studies have suggested that nature is restricted to about 1,000 protein folds to perform a great diversity of functions. Here, we use protein interaction data from different sources and three-dimensional structures to suggest that the total number of interaction types is also limited, and estimate that most interactions in nature will conform to one of about 10,000 types. We currently know fewer than 2,000, and at the present rate of structure determination, it will be more than 20 years before we know a full representative set.
|
122
|
Andersen JS, Lam YW, Leung AKL, Ong SE, Lyon CE, Lamond AI, Mann M. Nucleolar proteome dynamics. Nature 2005; 433:77-83. [PMID: 15635413 DOI: 10.1038/nature03207] [Citation(s) in RCA: 890] [Impact Index Per Article: 46.8] [Received: 10/18/2004] [Accepted: 11/16/2004] [Indexed: 01/17/2023]
Abstract
The nucleolus is a key organelle that coordinates the synthesis and assembly of ribosomal subunits and forms in the nucleus around the repeated ribosomal gene clusters. Because the production of ribosomes is a major metabolic activity, the function of the nucleolus is tightly linked to cell growth and proliferation, and recent data suggest that the nucleolus also plays an important role in cell-cycle regulation, senescence and stress responses. Here, using mass-spectrometry-based organellar proteomics and stable isotope labelling, we perform a quantitative analysis of the proteome of human nucleoli. In vivo fluorescent imaging techniques are directly compared to endogenous protein changes measured by proteomics. We characterize the flux of 489 endogenous nucleolar proteins in response to three different metabolic inhibitors that each affect nucleolar morphology. Proteins that are stably associated, such as RNA polymerase I subunits and small nuclear ribonucleoprotein particle complexes, exit from or accumulate in the nucleolus with similar kinetics, whereas protein components of the large and small ribosomal subunits leave the nucleolus with markedly different kinetics. The data establish a quantitative proteomic approach for the temporal characterization of protein flux through cellular organelles and demonstrate that the nucleolar proteome changes significantly over time in response to changes in cellular growth conditions.
|
123
|
Kopka J, Schauer N, Krueger S, Birkemeyer C, Usadel B, Bergmüller E, Dörmann P, Weckwerth W, Gibon Y, Stitt M, Willmitzer L, Fernie AR, Steinhauser D. GMD@CSB.DB: the Golm Metabolome Database. Bioinformatics 2005; 21:1635-8. [PMID: 15613389 DOI: 10.1093/bioinformatics/bti236] [Citation(s) in RCA: 875] [Impact Index Per Article: 43.8] [Indexed: 01/17/2023] Open
Abstract
Metabolomics, in particular gas chromatography-mass spectrometry (GC-MS) based metabolite profiling of biological extracts, is rapidly becoming one of the cornerstones of functional genomics and systems biology. Metabolite profiling has profound applications in discovering the mode of action of drugs or herbicides, and in unravelling the effect of altered gene expression on metabolism and organism performance in biotechnological applications. As such the technology needs to be available to many laboratories. For this, an open exchange of information is required, like that already achieved for transcript and protein data. One of the key steps in metabolite profiling is the unambiguous identification of metabolites in highly complex metabolite preparations from biological samples. Collections of mass spectra, which comprise frequently observed metabolites of either known or unknown exact chemical structure, represent the most effective means to pool the identification efforts currently performed in many laboratories around the world. Here we present GMD, the Golm Metabolome Database, an open access metabolome database, which should enable these processes. GMD provides public access to custom mass spectral libraries, metabolite profiling experiments as well as additional information and tools, e.g. with regard to methods, spectral information or compounds. The main goal will be the representation of an exchange platform for experimental research activities and bioinformatics to develop and improve metabolomics by multidisciplinary cooperation. AVAILABILITY http://csbdb.mpimp-golm.mpg.de/gmd.html CONTACT Steinhauser@mpimp-golm.mpg.de SUPPLEMENTARY INFORMATION http://csbdb.mpimp-golm.mpg.de/
|
124
|
Halperin E, Buhler J, Karp R, Krauthgamer R, Westover B. Detecting protein sequence conservation via metric embeddings. Bioinformatics 2003; 19 Suppl 1:i122-9. [PMID: 12855448 DOI: 10.1093/bioinformatics/btg1016] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison. RESULTS This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.
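The Hamming distance mentioned above (the number of mismatching positions between two bit vectors) can be sketched as follows. This is only a minimal illustration of the distance itself: the paper's mapping from peptides to bit vectors is not reproduced here, and the bit patterns and the function name `hamming_distance` are made up for the example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two equal-length bit vectors differ.

    The vectors are represented as non-negative Python integers.
    """
    # XOR sets a 1 exactly where the two vectors disagree.
    return bin(a ^ b).count("1")


# Hypothetical 8-bit encodings of two peptides (not derived from BLOSUM).
peptide_x = 0b10110010
peptide_y = 0b10010110
distance = hamming_distance(peptide_x, peptide_y)
```

Because the distance reduces to an XOR followed by a bit count, it is cheap to evaluate, which is part of what makes bit-vector embeddings attractive for large all-pairs searches.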
|
125
|
Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, Mewes HW. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 2004; 32:5539-45. [PMID: 15486203 PMCID: PMC524302 DOI: 10.1093/nar/gkh894] [Citation(s) in RCA: 757] [Impact Index Per Article: 37.9] [Indexed: 11/12/2022] Open
Abstract
In this paper, we present the Functional Catalogue (FunCat), a hierarchically structured, organism-independent, flexible and scalable controlled classification system enabling the functional description of proteins from any organism. FunCat has been applied for the manual annotation of prokaryotes, fungi, plants and animals. We describe how FunCat is implemented as a highly efficient and robust tool for the manual and automatic annotation of genomic sequences. Owing to its hierarchical architecture, FunCat has also proved to be useful for many subsequent downstream bioinformatic applications. This is illustrated by the analysis of large-scale experiments from various investigations in transcriptomics and proteomics, where FunCat was used to project experimental data into functional units, as 'gold standard' for functional classification methods, and also served to compare the significance of different experimental methods. Over the last decade, the FunCat has been established as a robust and stable annotation scheme that offers both, meaningful and manageable functional classification as well as ease of perception.
|