1
|
Abstract
We have developed an automatic algorithm STRIDE for protein secondary structure assignment from atomic coordinates based on the combined use of hydrogen bond energy and statistically derived backbone torsional angle information. Parameters of the pattern recognition procedure were optimized using designations provided by the crystallographers as a standard-of-truth. Comparison to the currently most widely used technique DSSP by Kabsch and Sander (Biopolymers 22:2577-2637, 1983) shows that STRIDE and DSSP assign secondary structural states in 58 and 31% of 226 protein chains in our data sample, respectively, in greater agreement with the specific residue-by-residue definitions provided by the discoverers of the structures while in 11% of the chains, the assignments are the same. STRIDE delineates every 11th helix and every 32nd strand more in accord with published assignments.
|
Comparative Study |
30 |
1939 |
2
|
Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, Jaffe D, FitzHugh W, Ma LJ, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen CB, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson MA, Werner-Washburne M, Selitrennikoff CP, Kinsey JA, Braun EL, Zelter A, Schulte U, Kothe GO, Jedd G, Mewes W, Staben C, Marcotte E, Greenberg D, Roy A, Foley K, Naylor J, Stange-Thomann N, Barrett R, Gnerre S, Kamal M, Kamvysselis M, Mauceli E, Bielke C, Rudd S, Frishman D, Krystofova S, Rasmussen C, Metzenberg RL, Perkins DD, Kroken S, Cogoni C, Macino G, Catcheside D, Li W, Pratt RJ, Osmani SA, DeSouza CPC, Glass L, Orbach MJ, Berglund JA, Voelker R, Yarden O, Plamann M, Seiler S, Dunlap J, Radford A, Aramayo R, Natvig DO, Alex LA, Mannhaupt G, Ebbole DJ, Freitag M, Paulsen I, Sachs MS, Lander ES, Nusbaum C, Birren B. The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003; 422:859-68. [PMID: 12712197 DOI: 10.1038/nature01554] [Citation(s) in RCA: 1145] [Impact Index Per Article: 52.0] [Received: 12/24/2002] [Accepted: 03/14/2003] [Indexed: 11/09/2022]
Abstract
Neurospora crassa is a central organism in the history of twentieth-century genetics, biochemistry and molecular biology. Here, we report a high-quality draft sequence of the N. crassa genome. The approximately 40-megabase genome encodes about 10,000 protein-coding genes--more than twice as many as in the fission yeast Schizosaccharomyces pombe and only about 25% fewer than in the fruitfly Drosophila melanogaster. Analysis of the gene set yields insights into unexpected aspects of Neurospora biology including the identification of genes potentially associated with red light photobiology, genes implicated in secondary metabolism, and important differences in Ca2+ signalling as compared with plants and animals. Neurospora possesses the widest array of genome defence mechanisms known for any eukaryotic organism, including a process unique to fungi called repeat-induced point mutation (RIP). Genome analysis suggests that RIP has had a profound impact on genome evolution, greatly slowing the creation of new genes through genomic duplication and resulting in a genome with an unusually low proportion of closely related genes.
|
|
22 |
1145 |
3
|
Heinig M, Frishman D. STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 2004; 32:W500-2. [PMID: 15215436 PMCID: PMC441567 DOI: 10.1093/nar/gkh429] [Citation(s) in RCA: 745] [Impact Index Per Article: 35.5] [Indexed: 11/13/2022] Open
Abstract
STRIDE is a software tool for secondary structure assignment from atomic resolution protein structures. It implements a knowledge-based algorithm that makes combined use of hydrogen bond energy and statistically derived backbone torsional angle information and is optimized to return resulting assignments in maximal agreement with crystallographers' designations. The STRIDE web server provides access to this tool and allows visualization of the secondary structure, as well as contact and Ramachandran maps for any file uploaded by the user with atomic coordinates in the Protein Data Bank (PDB) format. A searchable database of STRIDE assignments for the latest PDB release is also provided. The STRIDE server is accessible from http://webclu.bio.wzw.tum.de/stride/.
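The backbone torsional angles and Ramachandran maps mentioned in this abstract rest on the standard dihedral-angle calculation over four consecutive backbone atoms. The following is a minimal sketch of that general calculation, not STRIDE's own code; the function name `dihedral_deg` and the tuple-based coordinates are assumptions made for illustration. For residue i, phi uses the atoms C(i-1), N(i), CA(i), C(i), and psi uses N(i), CA(i), C(i), N(i+1).

```python
import math

# Hypothetical helper names; plain tuples stand in for atomic coordinates.
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points (range -180..180)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Idealised geometries: cis (0 degrees) and a quarter turn (+90 degrees).
print(dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))
print(dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```

Plotting phi against psi for every residue of a chain gives the Ramachandran map that the server is described as displaying.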
|
Journal Article |
21 |
745 |
4
|
Strous M, Pelletier E, Mangenot S, Rattei T, Lehner A, Taylor MW, Horn M, Daims H, Bartol-Mavel D, Wincker P, Barbe V, Fonknechten N, Vallenet D, Segurens B, Schenowitz-Truong C, Médigue C, Collingro A, Snel B, Dutilh BE, Op den Camp HJM, van der Drift C, Cirpus I, van de Pas-Schoonen KT, Harhangi HR, van Niftrik L, Schmid M, Keltjens J, van de Vossenberg J, Kartal B, Meier H, Frishman D, Huynen MA, Mewes HW, Weissenbach J, Jetten MSM, Wagner M, Le Paslier D. Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature 2006; 440:790-4. [PMID: 16598256 DOI: 10.1038/nature04647] [Citation(s) in RCA: 740] [Impact Index Per Article: 38.9] [Received: 12/22/2005] [Accepted: 02/15/2006] [Indexed: 11/09/2022]
Abstract
Anaerobic ammonium oxidation (anammox) has become a main focus in oceanography and wastewater treatment. It is also the nitrogen cycle's major remaining biochemical enigma. Among its features, the occurrence of hydrazine as a free intermediate of catabolism, the biosynthesis of ladderane lipids and the role of cytoplasm differentiation are unique in biology. Here we use environmental genomics--the reconstruction of genomic data directly from the environment--to assemble the genome of the uncultured anammox bacterium Kuenenia stuttgartiensis from a complex bioreactor community. The genome data illuminate the evolutionary history of the Planctomycetes and allow us to expose the genetic blueprint of the organism's special properties. Most significantly, we identified candidate genes responsible for ladderane biosynthesis and biological hydrazine metabolism, and discovered unexpected metabolic versatility.
|
|
19 |
740 |
5
|
Mewes HW, Frishman D, Güldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Münsterkötter M, Rudd S, Weil B. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002; 30:31-4. [PMID: 11752246 PMCID: PMC99165 DOI: 10.1093/nar/30.1.31] [Citation(s) in RCA: 532] [Impact Index Per Article: 23.1] [Indexed: 11/14/2022] Open
Abstract
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidopsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).
|
research-article |
23 |
532 |
6
|
Kerner MJ, Naylor DJ, Ishihama Y, Maier T, Chang HC, Stines AP, Georgopoulos C, Frishman D, Hayer-Hartl M, Mann M, Hartl FU. Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli. Cell 2005; 122:209-20. [PMID: 16051146 DOI: 10.1016/j.cell.2005.05.028] [Citation(s) in RCA: 495] [Impact Index Per Article: 24.8] [Received: 02/19/2005] [Revised: 05/15/2005] [Accepted: 05/27/2005] [Indexed: 11/18/2022]
Abstract
The E. coli chaperonin GroEL and its cofactor GroES promote protein folding by sequestering nonnative polypeptides in a cage-like structure. Here we define the contribution of this system to protein folding across the entire E. coli proteome. Approximately 250 different proteins interact with GroEL, but most of these can utilize either GroEL or the upstream chaperones trigger factor (TF) and DnaK for folding. Obligate GroEL-dependence is limited to only approximately 85 substrates, including 13 essential proteins, and occupying more than 75% of GroEL capacity. These proteins appear to populate kinetically trapped intermediates during folding; they are stabilized by TF/DnaK against aggregation but reach native state only upon transfer to GroEL/GroES. Interestingly, substantially enriched among the GroEL substrates are proteins with (βα)8 TIM-barrel domains. We suggest that the chaperonin system may have facilitated the evolution of this fold into a versatile platform for the implementation of numerous enzymatic functions.
|
Research Support, Non-U.S. Gov't |
20 |
495 |
7
|
Houry WA, Frishman D, Eckerskorn C, Lottspeich F, Hartl FU. Identification of in vivo substrates of the chaperonin GroEL. Nature 1999; 402:147-54. [PMID: 10647006 DOI: 10.1038/45977] [Citation(s) in RCA: 376] [Impact Index Per Article: 14.5] [Indexed: 11/08/2022]
Abstract
The chaperonin GroEL has an essential role in mediating protein folding in the cytosol of Escherichia coli. Here we show that GroEL interacts strongly with a well-defined set of approximately 300 newly translated polypeptides, including essential components of the transcription/translation machinery and metabolic enzymes. About one third of these proteins are structurally unstable and repeatedly return to GroEL for conformational maintenance. GroEL substrates consist preferentially of two or more domains with αβ-folds, which contain α-helices and buried β-sheets with extensive hydrophobic surfaces. These proteins are expected to fold slowly and be prone to aggregation. The hydrophobic binding regions of GroEL may be well adapted to interact with the non-native states of αβ-domain proteins.
|
|
26 |
376 |
8
|
Ishihama Y, Schmidt T, Rappsilber J, Mann M, Hartl FU, Kerner MJ, Frishman D. Protein abundance profiling of the Escherichia coli cytosol. BMC Genomics 2008; 9:102. [PMID: 18304323 PMCID: PMC2292177 DOI: 10.1186/1471-2164-9-102] [Citation(s) in RCA: 363] [Impact Index Per Article: 21.4] [Received: 01/24/2008] [Accepted: 02/27/2008] [Indexed: 11/10/2022] Open
Abstract
Background: Knowledge about the abundance of molecular components is an important prerequisite for building quantitative predictive models of cellular behavior. Proteins are central components of these models, since they carry out most of the fundamental processes in the cell. Thus far, protein concentrations have been difficult to measure on a large scale, but proteomic technologies have now advanced to a stage where this information becomes readily accessible. Results: Here, we describe an experimental scheme to maximize the coverage of proteins identified by mass spectrometry of a complex biological sample. Using a combination of LC-MS/MS approaches with protein and peptide fractionation steps we identified 1103 proteins from the cytosolic fraction of the Escherichia coli strain MC4100. A measure of abundance is presented for each of the identified proteins, based on the recently developed emPAI approach which takes into account the number of sequenced peptides per protein. The values of abundance are within a broad range and accurately reflect independently measured copy numbers per cell. As expected, the most abundant proteins were those involved in protein synthesis, most notably ribosomal proteins. Proteins involved in energy metabolism as well as those with binding function were also found in high copy number while proteins annotated with the terms metabolism, transcription, transport, and cellular organization were rare. The barrel-sandwich fold was found to be the structural fold with the highest abundance. Highly abundant proteins are predicted to be less prone to aggregation based on their length, pI values, and occurrence patterns of hydrophobic stretches. We also find that abundant proteins tend to be predominantly essential. Additionally we observe a significant correlation between protein and mRNA abundance in E. coli cells. Conclusion: Abundance measurements for more than 1000 E. coli proteins presented in this work represent the most complete study of protein abundance in a bacterial cell so far. We show significant associations between the abundance of a protein and its properties and functions in the cell. In this way, we provide both data and novel insights into the role of protein concentration in this model organism.
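The emPAI measure named in this abstract can be illustrated with a short sketch. The abstract itself gives no formula; the definition used below (emPAI = 10^(observed peptides / observable peptides) - 1, with molar percentages obtained by normalising over all proteins) is taken from the general emPAI literature as an assumption, and the function names and peptide counts are hypothetical.

```python
def empai(n_observed: int, n_observable: int) -> float:
    """emPAI for one protein (assumed definition): 10 ** (observed / observable) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1

def molar_percent(empai_by_protein: dict) -> dict:
    """Normalise emPAI values so that they sum to 100 (protein content in mol %)."""
    total = sum(empai_by_protein.values())
    if total <= 0:
        raise ValueError("at least one protein needs a positive emPAI value")
    return {name: 100.0 * value / total for name, value in empai_by_protein.items()}

# Hypothetical counts: (sequenced peptides, theoretically observable peptides).
counts = {"proteinA": (10, 10), "proteinB": (3, 12), "proteinC": (0, 8)}
values = {name: empai(obs, possible) for name, (obs, possible) in counts.items()}
print(molar_percent(values))
```

The exponential form reflects the idea that the number of observed peptides grows roughly with the logarithm of protein amount, so a protein with no sequenced peptides gets an emPAI of zero.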
|
Research Support, Non-U.S. Gov't |
17 |
363 |
9
|
Mewes HW, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, Warfsmann J, Ruepp A. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2004; 32:D41-4. [PMID: 14681354 PMCID: PMC308826 DOI: 10.1093/nar/gkh092] [Citation(s) in RCA: 359] [Impact Index Per Article: 17.1] [Indexed: 11/12/2022] Open
Abstract
The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
|
Research Support, Non-U.S. Gov't |
21 |
359 |
10
|
Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes HW, Ruepp A, Frishman D. The MIPS mammalian protein-protein interaction database. Bioinformatics 2004; 21:832-4. [PMID: 15531608 DOI: 10.1093/bioinformatics/bti115] [Citation(s) in RCA: 342] [Impact Index Per Article: 16.3] [Indexed: 11/12/2022] Open
Abstract
Summary: The MIPS mammalian protein-protein interaction database (MPPI) is a new resource of high-quality experimental protein interaction data in mammals. The content is based on published experimental evidence that has been processed by human expert curators. We provide the full dataset for download and a flexible and powerful web interface for users with various requirements.
|
Research Support, Non-U.S. Gov't |
21 |
342 |
11
|
Horn M, Collingro A, Schmitz-Esser S, Beier CL, Purkhold U, Fartmann B, Brandt P, Nyakatura GJ, Droege M, Frishman D, Rattei T, Mewes HW, Wagner M. Illuminating the evolutionary history of chlamydiae. Science 2004; 304:728-30. [PMID: 15073324 DOI: 10.1126/science.1096330] [Citation(s) in RCA: 312] [Impact Index Per Article: 14.9] [Indexed: 11/02/2022]
Abstract
Chlamydiae are the major cause of preventable blindness and sexually transmitted disease. Genome analysis of a chlamydia-related symbiont of free-living amoebae revealed that it is twice as large as any of the pathogenic chlamydiae and had few signs of recent lateral gene acquisition. We showed that about 700 million years ago the last common ancestor of pathogenic and symbiotic chlamydiae was already adapted to intracellular survival in early eukaryotes and contained many virulence factors found in modern pathogenic chlamydiae, including a type III secretion system. Ancient chlamydiae appear to be the originators of mechanisms for the exploitation of eukaryotic cells.
|
Research Support, Non-U.S. Gov't |
21 |
312 |
12
|
Mewes HW, Albermann K, Bähr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, Pfeiffer F, Zollner A. Overview of the yeast genome. Nature 1997; 387:7-65. [PMID: 9169865 DOI: 10.1038/42755] [Citation(s) in RCA: 294] [Impact Index Per Article: 10.5] [Indexed: 02/04/2023]
Abstract
The collaboration of more than 600 scientists from over 100 laboratories to sequence the Saccharomyces cerevisiae genome was the largest decentralised experiment in modern molecular biology and resulted in a unique data resource representing the first complete set of genes from a eukaryotic organism. 12 million bases were sequenced in a truly international effort involving European, US, Canadian and Japanese laboratories. While the yeast genome represents only a small fraction of the information in today's public sequence databases, the complete, ordered and non-redundant sequence provides an invaluable resource for the detailed analysis of cellular gene function and genome architecture. In terms of throughput, completeness and information content, yeast has always been the lead eukaryotic organism in genomics; it is still the largest genome to be completely sequenced.
|
|
28 |
294 |
13
|
Ruepp A, Graml W, Santos-Martinez ML, Koretke KK, Volker C, Mewes HW, Frishman D, Stocker S, Lupas AN, Baumeister W. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 2000; 407:508-13. [PMID: 11029001 DOI: 10.1038/35035069] [Citation(s) in RCA: 289] [Impact Index Per Article: 11.6] [Indexed: 11/09/2022]
Abstract
Thermoplasma acidophilum is a thermoacidophilic archaeon that thrives at 59 °C and pH 2, which was isolated from self-heating coal refuse piles and solfatara fields. Species of the genus Thermoplasma do not possess a rigid cell wall, but are only delimited by a plasma membrane. Many macromolecular assemblies from Thermoplasma, primarily proteases and chaperones, have been pivotal in elucidating the structure and function of their more complex eukaryotic homologues. Our interest in protein folding and degradation led us to seek a more complete representation of the proteins involved in these pathways by determining the genome sequence of the organism. Here we have sequenced the 1,564,905-base-pair genome in just 7,855 sequencing reactions by using a new strategy. The 1,509 open reading frames identify Thermoplasma as a typical euryarchaeon with a substantial complement of bacteria-related genes; however, evidence indicates that there has been much lateral gene transfer between Thermoplasma and Sulfolobus solfataricus, a phylogenetically distant crenarchaeon inhabiting the same environment. At least 252 open reading frames, including a complete protein degradation pathway and various transport proteins, resemble Sulfolobus proteins most closely.
|
|
25 |
289 |
14
|
Abstract
In this study we present an accurate secondary structure prediction procedure by using a query and related sequences. The most novel aspect of our approach is its reliance on local pairwise alignment of the sequence to be predicted with each related sequence rather than utilization of a multiple alignment. The residue-by-residue accuracy of the method is 75% in three structural states after jack-knife tests. The gain in prediction accuracy compared with the existing techniques, which are at best 72%, is achieved by secondary structure propensities based on both local and long-range effects, utilization of similar sequence information in the form of carefully selected pairwise alignment fragments, and reliance on a large collection of known protein primary structures. The method is especially appropriate for large-scale sequence analysis efforts such as genome characterization, where precise and significant multiple sequence alignments are not available or achievable.
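The residue-by-residue accuracy in three structural states quoted here is commonly computed as the percentage of positions where the predicted state matches the observed one. A minimal sketch of that measure follows, assuming the one-letter codes H (helix), E (strand) and C (coil); the function name `q3_accuracy` and the example strings are hypothetical and do not come from the paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Hypothetical 10-residue example: 8 of the 10 positions agree.
print(q3_accuracy("HHHHCCEEEE", "HHHCCCEEEC"))  # 80.0
```

In a jack-knife test of the kind the abstract mentions, each protein would be scored this way after being left out of the data used to derive the prediction parameters.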
|
|
28 |
283 |
15
|
Frishman D, Argos P. Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Engineering 1996; 9:133-42. [PMID: 9005434 DOI: 10.1093/protein/9.2.133] [Citation(s) in RCA: 249] [Impact Index Per Article: 8.6] [Indexed: 02/03/2023]
Abstract
Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on the statistics of local residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved accuracy limit of such single residue and single sequence prediction methods is 65% in three structural states (alpha-helix, beta-strand and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure prediction based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence. The unique feature of our approach involves database-derived statistics on residue type occurrences in different classes of beta-bridges to delineate interacting beta-strands. The alpha-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded pairs (i, i + 4). The algorithm has a prediction accuracy of 68% in three structural states, relies only on a single protein sequence as input and has the potential to be improved by 5-7% if homologous aligned sequences are also considered.
|
|
29 |
249 |
16
|
Mewes HW, Frishman D, Mayer KFX, Münsterkötter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stümpflen V. MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 2006; 34:D169-72. [PMID: 16381839 PMCID: PMC1347510 DOI: 10.1093/nar/gkj148] [Citation(s) in RCA: 228] [Impact Index Per Article: 12.0] [Indexed: 11/17/2022] Open
Abstract
The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information enabling the representation of complex information such as functional modules or networks (Genome Research Environment System), (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework, and (iii) the compilation and manual annotation of information related to interactions such as protein–protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server ().
|
Research Support, Non-U.S. Gov't |
19 |
228 |
17
|
Mayer K, Schüller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Düsterhöft A, Stiekema W, Entian KD, Terryn N, Harris B, Ansorge W, Brandt P, Grivell L, Rieger M, Weichselgartner M, de Simone V, Obermaier B, Mache R, Müller M, Kreis M, Delseny M, Puigdomenech P, Watson M, Schmidtheini T, Reichert B, Portatelle D, Perez-Alonso M, Boutry M, Bancroft I, Vos P, Hoheisel J, Zimmermann W, Wedler H, Ridley P, Langham SA, McCullagh B, Bilham L, Robben J, Van der Schueren J, Grymonprez B, Chuang YJ, Vandenbussche F, Braeken M, Weltjens I, Voet M, Bastiaens I, Aert R, Defoor E, Weitzenegger T, Bothe G, Ramsperger U, Hilbert H, Braun M, Holzer E, Brandt A, Peters S, van Staveren M, Dirske W, Mooijman P, Klein Lankhorst R, Rose M, Hauf J, Kötter P, Berneiser S, Hempel S, Feldpausch M, Lamberth S, Van den Daele H, De Keyser A, Buysshaert C, Gielen J, Villarroel R, De Clercq R, Van Montagu M, Rogers J, Cronin A, Quail M, Bray-Allen S, Clark L, Doggett J, Hall S, Kay M, Lennard N, McLay K, Mayes R, Pettett A, Rajandream MA, Lyne M, Benes V, Rechmann S, Borkova D, Blöcker H, Scharfe M, Grimm M, Löhnert TH, Dose S, de Haan M, Maarse A, Schäfer M, Müller-Auer S, Gabel C, Fuchs M, Fartmann B, Granderath K, Dauner D, Herzl A, Neumann S, Argiriou A, Vitale D, Liguori R, Piravandi E, Massenet O, Quigley F, Clabauld G, Mündlein A, Felber R, Schnabl S, Hiller R, Schmidt W, Lecharny A, Aubourg S, Chefdor F, Cooke R, Berger C, Montfort A, Casacuberta E, Gibbons T, Weber N, Vandenbol M, Bargues M, Terol J, Torres A, Perez-Perez A, Purnelle B, Bent E, Johnson S, Tacon D, Jesse T, Heijnen L, Schwarz S, Scholler P, Heber S, Francs P, Bielke C, Frishman D, Haase D, Lemcke K, Mewes HW, Stocker S, Zaccaria P, Bevan M, Wilson RK, de la Bastide M, Habermann K, Parnell L, Dedhia N, Gnoj L, Schutz K, Huang E, Spiegel L, Sehkon M, Murray J, Sheet P, Cordes M, Abu-Threideh J, Stoneking T, Kalicki J, Graves T, Harmon G, Edwards J, Latreille P, Courtney L, Cloud J, Abbott A, Scott K, Johnson D, Minx P, Bentley D, Fulton B, Miller N, Greco T, Kemp K, Kramer J, Fulton L, Mardis E, Dante M, Pepin K, Hillier L, Nelson J, Spieth J, Ryan E, Andrews S, Geisel C, Layman D, Du H, Ali J, Berghoff A, Jones K, Drone K, Cotton M, Joshu C, Antonoiu B, Zidanic M, Strong C, Sun H, Lamar B, Yordan C, Ma P, Zhong J, Preston R, Vil D, Shekher M, Matero A, Shah R, Swaby IK, O'Shaughnessy A, Rodriguez M, Hoffmann J, Till S, Granat S, Shohdy N, Hasegawa A, Hameed A, Lodhi M, Johnson A, Chen E, Marra M, Martienssen R, McCombie WR. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 1999; 402:769-77. [PMID: 10617198 DOI: 10.1038/47134] [Citation(s) in RCA: 228] [Impact Index Per Article: 8.8] [Indexed: 11/08/2022]
Abstract
The higher plant Arabidopsis thaliana (Arabidopsis) is an important model for identifying plant genes and determining their function. To assist biological investigations and to define chromosome structure, a coordinated effort to sequence the Arabidopsis genome was initiated in late 1996. Here we report one of the first milestones of this project, the sequence of chromosome 4. Analysis of 17.38 megabases of unique sequence, representing about 17% of the genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements. Heterochromatic regions surrounding the putative centromere, which has not yet been completely sequenced, are characterized by an increased frequency of a variety of repeats, new repeats, reduced recombination, lowered gene density and lowered gene expression. Roughly 60% of the predicted protein-coding genes have been functionally characterized on the basis of their homology to known genes. Many genes encode predicted proteins that are homologous to human and Caenorhabditis elegans proteins.
|
|
26 |
228 |
18
|
Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, Lemcke K, Mannhaupt G, Pfeiffer F, Schüller C, Stocker S, Weil B. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000; 28:37-40. [PMID: 10592176 PMCID: PMC102494 DOI: 10.1093/nar/28.1.37] [Citation(s) in RCA: 216] [Impact Index Per Article: 8.6] [Indexed: 11/14/2022] Open
Abstract
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried, near Munich, Germany, continues its longstanding tradition to develop and maintain high quality curated genome databases. In addition, efforts have been intensified to cover the wealth of complete genome sequences in a systematic, comprehensive form. Bioinformatics, supporting national as well as European sequencing and functional analysis projects, has resulted in several up-to-date genome-oriented databases. This report describes growing databases reflecting the progress of sequencing the Arabidopsis thaliana (MATDB) and Neurospora crassa genomes (MNCDB), the yeast genome database (MYGD) extended by functional analysis data, the database of annotated human EST-clusters (HIB) and the database of the complete cDNA sequences from the DHGP (German Human Genome Project). It also contains information on the up-to-date database of complete genomes (PEDANT), the classification of protein sequences (ProtFam) and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database. These databases can be accessed through the MIPS WWW server (http://www.mips.biochem.mpg.de).
|
research-article |
25 |
216 |
19
|
Meyer TE, Tsapin AI, Vandenberghe I, de Smet L, Frishman D, Nealson KH, Cusanovich MA, van Beeumen JJ. Identification of 42 possible cytochrome C genes in the Shewanella oneidensis genome and characterization of six soluble cytochromes. OMICS: A Journal of Integrative Biology 2004; 8:57-77. [PMID: 15107237 DOI: 10.1089/153623104773547499] [Citation(s) in RCA: 158] [Impact Index Per Article: 7.5] [Indexed: 11/13/2022]
Abstract
Through pattern matching of the cytochrome c heme-binding site (CXXCH) against the genome sequence of Shewanella oneidensis MR-1, we identified 42 possible cytochrome c genes (27 of which should be soluble) out of a total of 4758. However, we found only six soluble cytochromes c in extracts of S. oneidensis grown under several different conditions: (1) a small tetraheme cytochrome c, (2) a tetraheme flavocytochrome c-fumarate reductase, (3) a diheme cytochrome c4, (4) a monoheme cytochrome c5, (5) a monoheme cytochrome c', and (6) a diheme bacterial cytochrome c peroxidase. These cytochromes were identified either through N-terminal or complete amino acid sequence determination combined with mass spectroscopy. All six cytochromes were about 10-fold more abundant when cells were grown at low than at high aeration, whereas the flavocytochrome c-fumarate reductase was specifically induced by anaerobic growth on fumarate. When adjusted for the different heme content, the monoheme cytochrome c5 is as abundant as are the small tetraheme cytochrome and the tetraheme fumarate reductase. Published results on regulation of cytochromes from DNA microarrays and 2D-PAGE differ somewhat from our results, emphasizing the importance of multifaceted analyses in proteomics.
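The motif scan mentioned in this abstract can be illustrated with a minimal sketch. It assumes protein sequences are already available as one-letter amino acid strings; the function names and the toy input are hypothetical and are not taken from the paper, which does not describe its implementation.

```python
import re

# Heme-binding motif of c-type cytochromes: Cys, any two residues, Cys, His.
# A lookahead is used so that overlapping motifs are all reported.
CXXCH = re.compile(r"(?=(C..CH))")


def find_cxxch(protein_seq):
    """Return 0-based start positions of every CXXCH motif in the sequence."""
    return [m.start() for m in CXXCH.finditer(protein_seq.upper())]


def candidate_cytochromes(proteins):
    """Map protein id -> motif count, keeping only proteins with at least one motif."""
    hits = {}
    for name, seq in proteins.items():
        count = len(find_cxxch(seq))
        if count:
            hits[name] = count
    return hits


if __name__ == "__main__":
    # Toy sequences (hypothetical, for illustration only).
    toy = {"p1": "MKCAACHGGCKSCHW", "p2": "MKTAYIAKQR"}
    print(candidate_cytochromes(toy))
```

A hit from such a scan is only a candidate: as the abstract notes, far fewer soluble cytochromes were found experimentally than the pattern search suggested.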
|
Research Support, U.S. Gov't, P.H.S. |
21 |
158 |
20
|
Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, Frishman D. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 1999; 27:44-8. [PMID: 9847138 PMCID: PMC148093 DOI: 10.1093/nar/27.1.44] [Citation(s) in RCA: 152] [Impact Index Per Article: 5.8] [Indexed: 11/13/2022] Open
Abstract
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome-oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above-mentioned ones as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database.
|
research-article |
26 |
152 |
21
|
Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A. Protein function annotation by homology-based inference. Genome Biol 2009; 10:207. [PMID: 19226439 PMCID: PMC2688287 DOI: 10.1186/gb-2009-10-2-207] [Citation(s) in RCA: 149] [Impact Index Per Article: 9.3] [Indexed: 11/10/2022] Open
Abstract
Where information on homologous proteins is available, progress is being made in automated prediction of protein function from sequence and structure. With many genomes now sequenced, computational annotation methods to characterize genes and proteins from their sequence are increasingly important. The BioSapiens Network has developed tools to address all stages of this process, and here we review progress in the automated prediction of protein function based on protein sequence and structure.
|
Review |
16 |
149 |
22
|
Frishman D, Mironov A, Mewes HW, Gelfand M. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res 1998; 26:2941-7. [PMID: 9611239 PMCID: PMC147632 DOI: 10.1093/nar/26.12.2941] [Citation(s) in RCA: 141] [Impact Index Per Article: 5.2] [Indexed: 11/14/2022] Open
Abstract
Analysis of a newly sequenced bacterial genome starts with identification of protein-coding genes. Functional assignment of proteins requires the exact knowledge of protein N-termini. We present a new program ORPHEUS that identifies candidate genes and accurately predicts gene starts. The analysis starts with a database similarity search and identification of reliable gene fragments. The latter are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites and to predict the complete set of genes in the analyzed genome. In a test on Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% (resp. 96.3%) of experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% (83.9%) of starts were predicted exactly. Furthermore, 98.9% (99.1%) of genes longer than 100 codons annotated in GenBank were found, and 92.9% (75.7%) of predicted starts coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9 and 87.1%, respectively, with 94.2 and 76.7% starts coinciding with the existing annotation.
|
research-article |
27 |
141 |
23
|
Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D. PROSO II--a new method for protein solubility prediction. FEBS J 2012; 279:2192-200. [PMID: 22536855 DOI: 10.1111/j.1742-4658.2012.08603.x] [Citation(s) in RCA: 139] [Impact Index Per Article: 10.7] [Indexed: 11/29/2022]
Abstract
Many fields of science and industry depend on efficient production of active protein using heterologous expression in Escherichia coli. The solubility of proteins upon expression is dependent on their amino acid sequence. Prediction of solubility from sequence is therefore highly valuable. We present a novel machine-learning-based model called PROSO II which makes use of new classification methods and growth in experimental data to improve coverage and accuracy of solubility predictions. The classification algorithm is organized as a two-layered structure in which the output of a primary Parzen window model for sequence similarity and a logistic regression classifier of amino acid k-mer composition serve as input for a second-level logistic regression classifier. Compared with previously published research, our model is trained on five times more data than used by any other method before (82 000 proteins). When tested on a separate holdout set not used at any point of method development, our server attained the best results in comparison with other currently available methods: accuracy 75.4%, Matthews correlation coefficient 0.39, sensitivity 0.731, specificity 0.759, gain (soluble) 2.263. In summary, due to utilization of cutting-edge machine learning technologies combined with the largest currently available experimental data set, the PROSO II server constitutes a substantial improvement in protein solubility predictions. PROSO II is available at http://mips.helmholtz-muenchen.de/prosoII.
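For readers unfamiliar with the evaluation measures quoted in this abstract, the following is a minimal sketch of how accuracy, sensitivity, specificity and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name and the example counts are hypothetical; the "gain" measure is omitted because the abstract does not define it.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn count soluble proteins predicted soluble/insoluble;
    tn/fp count insoluble proteins predicted insoluble/soluble.
    Ratios with a zero denominator are reported as 0.0 by convention here.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Made-up counts for illustration; not the PROSO II holdout results.
    print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Because the Matthews correlation coefficient uses all four cells of the confusion matrix, it is less sensitive to class imbalance than accuracy alone.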
|
Research Support, Non-U.S. Gov't |
13 |
139 |
24
|
Sturm M, Hackenberg M, Langenberger D, Frishman D. TargetSpy: a supervised machine learning approach for microRNA target prediction. BMC Bioinformatics 2010; 11:292. [PMID: 20509939 PMCID: PMC2889937 DOI: 10.1186/1471-2105-11-292] [Citation(s) in RCA: 129] [Impact Index Per Article: 8.6] [Received: 10/09/2009] [Accepted: 05/28/2010] [Indexed: 11/21/2022] Open
Abstract
Background: Virtually all currently available microRNA target site prediction algorithms require the presence of a (conserved) seed match to the 5' end of the microRNA. Recently, however, it has been shown that this requirement might be too stringent, leading to a substantial number of missed target sites. Results: We developed TargetSpy, a novel computational approach for predicting target sites regardless of the presence of a seed match. It is based on machine learning and automatic feature selection using a wide spectrum of compositional, structural, and base pairing features covering current biological knowledge. Our model does not rely on evolutionary conservation, which allows the detection of species-specific interactions and makes TargetSpy suitable for analyzing unconserved genomic sequences. In order to allow for an unbiased comparison of TargetSpy to other methods, we classified all algorithms into three groups: I) no seed match requirement, II) seed match requirement, and III) conserved seed match requirement. TargetSpy predictions for classes II and III are generated by appropriate postfiltering. On a human dataset revealing fold-change in protein production for five selected microRNAs, our method shows superior performance in all classes. In Drosophila melanogaster, not only are our class II and III predictions on par with other algorithms, but notably the class I (no-seed) predictions are just marginally less accurate. We estimate that TargetSpy predicts between 26 and 112 functional target sites without a seed match per microRNA that are missed by all other currently available algorithms. Conclusion: Only a few algorithms can predict target sites without demanding a seed match, and TargetSpy demonstrates a substantial improvement in prediction accuracy in that class. Furthermore, when conservation and the presence of a seed match are required, the performance is comparable with state-of-the-art algorithms. TargetSpy was trained on mouse and performs well in human and Drosophila, suggesting that it may be applicable to a broad range of species. Moreover, we have demonstrated that the application of machine learning techniques in combination with upcoming deep sequencing data results in a powerful microRNA target site prediction tool (http://www.targetspy.org).
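To make the notion of a seed match concrete, here is a minimal sketch of a seed-match scan. It assumes the common convention that the seed spans nucleotides 2 to 8 of the microRNA and that a target site is its exact reverse complement; the abstract does not state which seed definition TargetSpy's postfilter uses, and the function names and toy UTR are hypothetical.

```python
# Watson-Crick complement for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=2, end=8):
    """Return the mRNA site (5'->3') complementary to the miRNA seed.

    start/end are 1-based, inclusive positions of the seed in the miRNA;
    positions 2-8 are an assumed convention, not taken from the paper.
    """
    seed = mirna.upper().replace("T", "U")[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(mirna, utr):
    """Return 0-based start positions of exact seed matches in a UTR sequence."""
    site = seed_site(mirna)
    utr = utr.upper().replace("T", "U")
    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)
    return positions


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"       # let-7a mature sequence
    toy_utr = "AAACUACCUCGGGCUACCUCA"      # toy UTR, for illustration only
    print(seed_site(let7a), find_seed_matches(let7a, toy_utr))
```

A filter of this kind corresponds to the class II setting described above; class I predictions, by definition, would not pass it.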
|
Research Support, Non-U.S. Gov't |
15 |
129 |
25
|
Böhm S, Frishman D, Mewes HW. Variations of the C2H2 zinc finger motif in the yeast genome and classification of yeast zinc finger proteins. Nucleic Acids Res 1997; 25:2464-9. [PMID: 9171100 PMCID: PMC146766 DOI: 10.1093/nar/25.12.2464] [Citation(s) in RCA: 121] [Impact Index Per Article: 4.3] [Indexed: 02/04/2023] Open
Abstract
The PROSITE pattern Zinc_Finger_C2H2 was extended to permit the detection of all C2H2 zinc fingers and their parent proteins in the recently completed sequence of the yeast genome. Additionally, a new computer program was written that extracts other zinc binding motifs (non-C2H2 'fingers'), overlapping with the classical zinc finger pattern, from the found set of yeast C2H2 fingers. The complete and correct detection of all fingers is a prerequisite for the classification of the yeast zinc finger proteins in functional terms. The 53 detected yeast C2H2 zinc finger proteins do not contain finger clusters with 10 or more repeats, as is frequently found in higher eukaryotes. Only three proteins contain four or more fingers in a cluster. Moreover, nearly all 27 yeast proteins with tandem arrays of two or three finger domains can be classified into nine subgroups with high sequence conservation in their finger clusters, in particular of their DNA recognition helices. These results and application of the recently elaborated finger/DNA recognition rules suggest that the yeast proteins belonging to the same subgroup may recognize identical or very similar DNA sites.
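As an illustration of PROSITE-style pattern matching, the sketch below converts a simple pattern into a regular expression and scans a sequence with it. The pattern used is a simplified textbook form of the C2H2 spacing (C-x(2,4)-C-x(12)-H-x(3,5)-H), chosen as an assumption for this example; it is neither the exact PROSITE entry nor the extended pattern developed in the paper, and the converter handles only a small subset of PROSITE syntax.

```python
import re

# Simplified C2H2 consensus (assumption for illustration, not the paper's pattern).
C2H2_SIMPLIFIED = "C-x(2,4)-C-x(12)-H-x(3,5)-H"

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Supported elements: a residue (C), a residue set ([LIV]), the wildcard x,
    each optionally followed by a repeat count (n) or range (n,m).
    """
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element}")
        residue, lo, hi = m.groups()
        atom = "." if residue == "x" else residue
        if lo is None:
            parts.append(atom)
        elif hi is None:
            parts.append(f"{atom}{{{lo}}}")
        else:
            parts.append(f"{atom}{{{lo},{hi}}}")
    return "".join(parts)


def find_fingers(seq, pattern=C2H2_SIMPLIFIED):
    """Return (start, matched_text) for each match, including overlapping ones."""
    rx = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in rx.finditer(seq.upper())]


if __name__ == "__main__":
    print(prosite_to_regex(C2H2_SIMPLIFIED))
    print(find_fingers("YKCPECGKSFSQKSNLQKHQRTH"))  # toy finger-like sequence
```

Relaxing the spacing ranges in such a pattern is one way a motif definition can be broadened to catch atypical fingers, at the cost of more false positives.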
|
research-article |
28 |
121 |