51
Prediction of deleterious non-synonymous single nucleotide polymorphisms of genes related to ethanol-induced toxicity. Toxicol Lett 2009; 187:99-114. [DOI: 10.1016/j.toxlet.2009.02.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 12/15/2008] [Revised: 02/05/2009] [Accepted: 02/09/2009] [Indexed: 12/30/2022]
52
Kumar A, Cowen L. Augmented training of hidden Markov models to recognize remote homologs via simulated evolution. Bioinformatics 2009; 25:1602-8. [PMID: 19389731 PMCID: PMC2732314 DOI: 10.1093/bioinformatics/btp265] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION While profile hidden Markov models (HMMs) are successful and powerful methods to recognize homologous proteins, they can break down when homology becomes too distant due to lack of sufficient training data. We show that we can improve the performance of HMMs in this domain by using a simple simulated model of evolution to create an augmented training set. RESULTS We show, in two different remote protein homolog tasks, that HMMs whose training is augmented with simulated evolution outperform HMMs trained only on real data. We find that a mutation rate between 15 and 20% performs best for recognizing G-protein coupled receptor proteins in different classes, and for recognizing SCOP super-family proteins from different families.
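The augmentation idea in this abstract can be illustrated with a minimal sketch. It assumes a deliberately crude evolution model in which each residue is replaced, with probability `rate`, by a different amino acid drawn uniformly at random; the abstract does not specify the authors' model, and the function names `mutate` and `augment` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Replace each residue with probability `rate` by a different,
    uniformly chosen amino acid (an assumed, simplified model)."""
    out = []
    for aa in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in AMINO_ACIDS if x != aa]))
        else:
            out.append(aa)
    return "".join(out)


def augment(seqs, rate=0.15, copies=3, seed=0):
    """Return the real sequences followed by `copies` simulated
    variants of each one."""
    rng = random.Random(seed)
    augmented = list(seqs)
    for s in seqs:
        augmented.extend(mutate(s, rate, rng) for _ in range(copies))
    return augmented
```

The default `rate=0.15` sits at the lower end of the 15 to 20% range the abstract reports as performing best; the augmented list would then be passed to whatever HMM training tool is in use.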
Affiliation(s)
- Anoop Kumar
- Department of Computer Science, Tufts University, Medford, MA, USA.
53
Abstract
It has been known for more than 35 years that, during evolution, new proteins are formed by gene duplications, sequence and structural divergence and, in many cases, gene combinations. The genome projects have produced complete, or almost complete, descriptions of the protein repertoires of over 600 distinct organisms. Analyses of these data have dramatically increased our understanding of the formation of new proteins. At the present time, we can accurately trace the evolutionary relationships of about half the proteins found in most genomes, and it is these proteins that we discuss in the present review. Usually, the units of evolution are protein domains that are duplicated, diverge and form combinations. Small proteins contain one domain, and large proteins contain combinations of two or more domains. Domains descended from a common ancestor are clustered into superfamilies. In most genomes, the net growth of superfamily members means that more than 90% of domains are duplicates. In a section on domain duplications, we discuss the number of currently known superfamilies, their size and distribution, and superfamily expansions related to biological complexity and to specific lineages. In a section on divergence, we describe how sequences and structures diverge, the changes in stability produced by acceptable mutations, and the nature of functional divergence and selection. In a section on domain combinations, we discuss their general nature, the sequential order of domains, how combinations modify function, and the extraordinary variety of the domain combinations found in different genomes. We conclude with a brief note on other forms of protein evolution and speculations of the origins of the duplication, divergence and combination processes.
54
Williams SG, Lovell SC. The effect of sequence evolution on protein structural divergence. Mol Biol Evol 2009; 26:1055-65. [PMID: 19193735 DOI: 10.1093/molbev/msp020] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Indexed: 12/14/2022]
Abstract
The complex constraints imposed by protein structure and function result in varied rates of sequence and structural divergence in proteins. Analysis of sequence differences between homologous proteins can advance our understanding of structural divergence and some of the constraints that govern the evolution of these molecules. Here, we assess the relationship between amino acid sequence and structural divergence. Firstly, we demonstrate that the relationship between protein sequence and structural divergence is governed by a variety of evolutionary constraints, including solvent exposure and secondary structure. Secondly, although compensatory substitutions are widespread, we find many radical size-changing mutations that are not compensated by neighboring complementary changes. Instead, these noncompensated substitutions are mitigated by alteration of protein structure. These results suggest a combined mechanism of accommodating substitutions in proteins, involving both coevolution and structural accommodation. Such a mechanism can explain previously observed correlated substitutions of residues that are distant both in sequence and structure, allowing an integrated view of sequence and structural divergence of proteins.
Affiliation(s)
- Simon G Williams
- Faculty of Life Sciences, University of Manchester, Manchester, UK
55
Altschul SF, Gertz EM, Agarwala R, Schäffer AA, Yu YK. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 2009; 37:815-24. [PMID: 19088134 PMCID: PMC2647318 DOI: 10.1093/nar/gkn981] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.0] [Indexed: 11/14/2022]
Abstract
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
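As a rough illustration of how pseudocounts enter a column score, the sketch below blends observed counts with background frequencies and converts the result to log-odds scores. This is a generic textbook-style scheme, not PSI-BLAST's actual formula; the function name `column_scores`, the `pseudocounts` default, and the two-letter toy alphabet in the usage note are assumptions.

```python
import math
from collections import Counter


def column_scores(column, background, pseudocounts=5.0):
    """Log-odds scores (in bits) for one alignment column.

    column: string of residues observed in the column.
    background: dict mapping each residue to its background frequency.
    pseudocounts: total pseudocount mass, split by background frequency.
    """
    counts = Counter(column)
    n = len(column)
    scores = {}
    for aa, p in background.items():
        # Estimated frequency: observed counts plus background-weighted pseudocounts.
        q = (counts.get(aa, 0) + pseudocounts * p) / (n + pseudocounts)
        scores[aa] = math.log2(q / p)
    return scores
```

With a toy background of `{"A": 0.5, "G": 0.5}` and the column `"AAAA"`, the score for `A` is positive and the score for `G` is negative; raising `pseudocounts` pulls both toward zero, which is why fewer pseudocounts give a sharper contrast for conserved columns.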
Affiliation(s)
- Stephen F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health Bethesda, MD 20894, USA.
56
Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A. Rfam: updates to the RNA families database. Nucleic Acids Res 2008; 37:D136-40. [PMID: 18953034 PMCID: PMC2686503 DOI: 10.1093/nar/gkn766] [Citation(s) in RCA: 708] [Impact Index Per Article: 41.6] [Indexed: 01/09/2023]
Abstract
Rfam is a collection of RNA sequence families, represented by multiple sequence alignments and covariance models (CMs). The primary aim of Rfam is to annotate new members of known RNA families on nucleotide sequences, particularly complete genomes, using sensitive BLAST filters in combination with CMs. A minority of families with a very broad taxonomic range (e.g. tRNA and rRNA) provide the majority of the sequence annotations, whilst the majority of Rfam families (e.g. snoRNAs and miRNAs) have a limited taxonomic range and provide a limited number of annotations. Recent improvements to the website, methodologies and data used by Rfam are discussed. Rfam is freely available on the Web at http://rfam.sanger.ac.uk/ and http://rfam.janelia.org/.
Affiliation(s)
- Paul P Gardner
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA, UK.
57
Buvoli M, Hamady M, Leinwand LA, Knight R. Bioinformatics assessment of beta-myosin mutations reveals myosin's high sensitivity to mutations. Trends Cardiovasc Med 2008; 18:141-9. [PMID: 18555187 DOI: 10.1016/j.tcm.2008.04.001] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Received: 02/14/2008] [Revised: 04/01/2008] [Accepted: 04/04/2008] [Indexed: 01/12/2023]
Abstract
More than 200 mutations in the beta-myosin gene (MYH7) that cause clinically distinct cardiac and/or skeletal myopathies have been reported, but to date, no comprehensive statistical analysis of these mutations has been performed. As a part of this review, we developed a new interactive database and research tool called MyoMAPR (Myopathic Mutation Analysis Profiler and Repository). We report that the distribution of mutations along the beta-myosin gene is not homogeneous, and that myosin is a highly constrained molecule with an uncommon sensitivity to amino acid substitutions. Increasing knowledge of the characteristics of MYH7 mutations may provide a valuable resource for scientists and clinicians studying diagnosis, risk stratification, and treatment of disease associated with these mutations.
Affiliation(s)
- Massimo Buvoli
- Department of Molecular, Cellular, and Developmental Biology, University of Colorado, Boulder, CO 80309, USA
58
Sonavane S, Chakrabarti P. Cavities and atomic packing in protein structures and interfaces. PLoS Comput Biol 2008; 4:e1000188. [PMID: 19005575 PMCID: PMC2582456 DOI: 10.1371/journal.pcbi.1000188] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Received: 03/03/2008] [Accepted: 08/19/2008] [Indexed: 11/17/2022]
Abstract
A comparative analysis of cavities enclosed in a tertiary structure of proteins and interfaces formed by the interaction of two protein subunits in obligate and non-obligate categories (represented by homodimeric molecules and heterocomplexes, respectively) is presented. The total volume of cavities increases with the size of the protein (or the interface), though the exact relationship may vary in different cases. Likewise, for individual cavities also there is quantitative dependence of the volume on the number of atoms (or residues) lining the cavity. The larger cavities tend to be less spherical, solvated, and the interfaces are enriched in these. On average 15 Å³ of cavity volume is found to accommodate single water, with another 40-45 Å³ needed for each additional solvent molecule. Polar atoms/residues have a higher propensity to line solvated cavities. Relative to the frequency of occurrence in the whole structure (or interface), residues in beta-strands are found more often lining the cavities, and those in turn and loop the least. Any depression in one chain not complemented by a protrusion in the other results in a cavity in the protein-protein interface. Through the use of the Voronoi volume, the packing of residues involved in protein-protein interaction has been compared to that in the protein interior. For a comparable number of atoms the interface has about twice the number of cavities relative to the tertiary structure.
59
Lobanov MY, Bogatyreva NS, Galzitskaya OV. Radius of gyration as an indicator of protein structure compactness. Mol Biol 2008. [DOI: 10.1134/s0026893308040195] [Citation(s) in RCA: 617] [Impact Index Per Article: 36.3] [Indexed: 11/23/2022]
60
The distributions, mechanisms, and structures of metabolite-binding riboswitches. Genome Biol 2008; 8:R239. [PMID: 17997835 PMCID: PMC2258182 DOI: 10.1186/gb-2007-8-11-r239] [Citation(s) in RCA: 370] [Impact Index Per Article: 21.8] [Received: 07/26/2007] [Revised: 10/01/2007] [Accepted: 11/12/2007] [Indexed: 12/21/2022]
Abstract
BACKGROUND Riboswitches are noncoding RNA structures that appropriately regulate genes in response to changing cellular conditions. The expression of many proteins involved in fundamental metabolic processes is controlled by riboswitches that sense relevant small molecule ligands. Metabolite-binding riboswitches that recognize adenosylcobalamin (AdoCbl), thiamin pyrophosphate (TPP), lysine, glycine, flavin mononucleotide (FMN), guanine, adenine, glucosamine-6-phosphate (GlcN6P), 7-aminoethyl 7-deazaguanine (preQ1), and S-adenosylmethionine (SAM) have been reported. RESULTS We have used covariance model searches to identify examples of ten widespread riboswitch classes in the genomes of organisms from all three domains of life. This data set rigorously defines the phylogenetic distributions of these riboswitch classes and reveals how their gene control mechanisms vary across different microbial groups. By examining the expanded aptamer sequence alignments resulting from these searches, we have also re-evaluated and refined their consensus secondary structures. Updated riboswitch structure models highlight additional RNA structure motifs, including an unusual double T-loop arrangement common to AdoCbl and FMN riboswitch aptamers, and incorporate new, sometimes noncanonical, base-base interactions predicted by a mutual information analysis. CONCLUSION Riboswitches are vital components of many genomes. The additional riboswitch variants and updated aptamer structure models reported here will improve future efforts to annotate these widespread regulatory RNAs in genomic sequences and inform ongoing structural biology efforts. There remain significant questions about what physiological and evolutionary forces influence the distributions and mechanisms of riboswitches and about what forms of regulation substitute for riboswitches that appear to be missing in certain lineages.
61
Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science 2008; 321:263-6. [PMID: 18621671 DOI: 10.1126/science.1158140] [Citation(s) in RCA: 840] [Impact Index Per Article: 49.4] [Indexed: 11/02/2022]
Abstract
Targets for drugs have so far been predicted on the basis of molecular or cellular features, for example, by exploiting similarity in chemical structure or in activity across cell lines. We used phenotypic side-effect similarities to infer whether two drugs share a target. Applied to 746 marketed drugs, a network of 1018 side effect-driven drug-drug relations became apparent, 261 of which are formed by chemically dissimilar drugs from different therapeutic indications. We experimentally tested 20 of these unexpected drug-drug relations and validated 13 implied drug-target relations by in vitro binding assays, of which 11 reveal inhibition constants equal to less than 10 micromolar. Nine of these were tested and confirmed in cell assays, documenting the feasibility of using phenotypic information to infer molecular interactions and hinting at new uses of marketed drugs.
Affiliation(s)
- Monica Campillos
- European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, 69117 Heidelberg, Germany
62
Paquet ER, Rey G, Naef F. Modeling an evolutionary conserved circadian cis-element. PLoS Comput Biol 2008; 4:e38. [PMID: 18282089 PMCID: PMC2242825 DOI: 10.1371/journal.pcbi.0040038] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 09/10/2007] [Accepted: 01/04/2008] [Indexed: 11/19/2022]
Abstract
Circadian oscillator networks rely on a transcriptional activator called CLOCK/CYCLE (CLK/CYC) in insects and CLOCK/BMAL1 or NPAS2/BMAL1 in mammals. Identifying the targets of this heterodimeric basic-helix-loop-helix (bHLH) transcription factor poses challenges and it has been difficult to decipher its specific sequence affinity beyond a canonical E-box motif, except perhaps for some flanking bases contributing weakly to the binding energy. Thus, no good computational model presently exists for predicting CLK/CYC, CLOCK/BMAL1, or NPAS2/BMAL1 targets. Here, we use a comparative genomics approach and first study the conservation properties of the best-known circadian enhancer: a 69-bp element upstream of the Drosophila melanogaster period gene. This fragment shows a signal involving the presence of two closely spaced E-box-like motifs, a configuration that we can also detect in the other four prominent CLK/CYC target genes in flies: timeless, vrille, Pdp1, and cwo. This allows for the training of a probabilistic sequence model that we test using functional genomics datasets. We find that the predicted sequences are overrepresented in promoters of genes induced in a recent study by a glucocorticoid receptor-CLK fusion protein. We then scanned the mouse genome with the fly model and found that many known CLOCK/BMAL1 targets harbor sequences matching our consensus. Moreover, the phase of predicted cyclers in liver agreed with known CLOCK/BMAL1 regulation. Taken together, we built a predictive model for CLK/CYC or CLOCK/BMAL1-bound cis-enhancers through the integration of comparative and functional genomics data. Finally, a deeper phylogenetic analysis reveals that the link between the CLOCK/BMAL1 complex and the circadian cis-element dates back to before insects and vertebrates diverged.
Affiliation(s)
- Eric R Paquet
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Swiss Institute of Experimental Cancer Research (ISREC), Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Guillaume Rey
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Swiss Institute of Experimental Cancer Research (ISREC), Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Felix Naef
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Swiss Institute of Experimental Cancer Research (ISREC), Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
63
Yip KY, Patel P, Kim PM, Engelman DM, McDermott D, Gerstein M. An integrated system for studying residue coevolution in proteins. Bioinformatics 2007; 24:290-2. [PMID: 18056067 DOI: 10.1093/bioinformatics/btm584] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022]
Abstract
Residue coevolution has recently emerged as an important concept, especially in the context of protein structures. While a multitude of different functions for quantifying it have been proposed, not much is known about their relative strengths and weaknesses. Also, subtle algorithmic details have discouraged implementing and comparing them. We addressed this issue by developing an integrated online system that enables comparative analyses with a comprehensive set of commonly used scoring functions, including Statistical Coupling Analysis (SCA), Explicit Likelihood of Subset Variation (ELSC), mutual information and correlation-based methods. A set of data preprocessing options are provided for improving the sensitivity and specificity of coevolution signal detection, including sequence weighting, residue grouping and the filtering of sequences, sites and site pairs. A total of more than 100 scoring variations are available. The system also provides facilities for studying the relationship between coevolution scores and inter-residue distances from a crystal structure if provided, which may help in understanding protein structures. AVAILABILITY The system is available at http://coevolution.gersteinlab.org. The source code and JavaDoc API can also be downloaded from the web site.
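Of the scoring functions listed in this abstract, mutual information is the simplest to write down. The sketch below computes it for two alignment columns using plain empirical frequencies; it is an illustration only, not the system's implementation, and it omits the sequence weighting and filtering options the abstract describes. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns,
    given as equal-length strings with one residue per sequence."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi
```

Two perfectly coupled columns such as `"AAGG"` and `"CCTT"` score 1 bit, while `"AAGG"` against `"CTCT"` scores 0 because the residues vary independently.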
Affiliation(s)
- Kevin Y Yip
- Department of Computer Science, Yale University, 51 Prospect Street, New Haven, CT 06511, USA
64
Galzitskaya OV, Reifsnyder DC, Bogatyreva NS, Ivankov DN, Garbuzynskiy SO. More compact protein globules exhibit slower folding rates. Proteins 2007; 70:329-32. [PMID: 17876831 DOI: 10.1002/prot.21619] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.7] [Indexed: 11/07/2022]
Abstract
We have demonstrated that, among proteins of the same size, alpha/beta proteins have on the average a greater number of contacts per residue due to their more compact (more "spherical") structure, rather than due to tighter packing. We have examined the relationship between the average number of contacts per residue and folding rates in globular proteins according to general protein structural class (all-alpha, all-beta, alpha/beta, alpha+beta). Our analysis demonstrates that alpha/beta proteins have both the greatest number of contacts and the slowest folding rates in comparison to proteins from the other structural classes. Because alpha/beta proteins are also known to be the oldest proteins, it can be suggested that proteins have evolved to pack more quickly and into looser structures.
Affiliation(s)
- Oxana V Galzitskaya
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow Region, Russia.
65
Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci U S A 2007; 104:13913-8. [PMID: 17717083 PMCID: PMC1955820 DOI: 10.1073/pnas.0702636104] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.5] [Indexed: 11/18/2022]
Abstract
To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
Affiliation(s)
- E. D. Harrington
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- A. H. Singh
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- T. Doerks
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- I. Letunic
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- C. von Mering
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- L. J. Jensen
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- J. Raes
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- P. Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, D-13092 Berlin, Germany
- To whom correspondence should be addressed. E-mail:
66
Trinklein ND, Karaöz U, Wu J, Halees A, Force Aldred S, Collins PJ, Zheng D, Zhang ZD, Gerstein MB, Snyder M, Myers RM, Weng Z. Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome. Genome Res 2007; 17:720-31. [PMID: 17567992 PMCID: PMC1891333 DOI: 10.1101/gr.5716607] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Indexed: 11/24/2022]
Abstract
The regulation of transcriptional initiation in the human genome is a critical component of global gene regulation, but a complete catalog of human promoters currently does not exist. In order to identify regulatory regions, we developed four computational methods to integrate 129 sets of ENCODE-wide chromatin immunoprecipitation data. They collectively predicted 1393 regions. Roughly 47% of the regions were unique to one method, as each method makes different assumptions about the data. Overall, predicted regions tend to localize to highly conserved, DNase I hypersensitive, and actively transcribed regions in the genome. Interestingly, a significant portion of the regions overlaps with annotated 3'-UTRs, suggesting that some of them might regulate anti-sense transcription. The majority of the predicted regions are >2 kb away from the 5'-ends of previously annotated human cDNAs and hence are novel. These novel regions may regulate unannotated transcripts or may represent new alternative transcription start sites of known genes. We tested 163 such regions for promoter activity in four cell lines using transient transfection assays, and 25% of them showed transcriptional activity above background in at least one cell line. We also performed 5'-RACE experiments on 62 novel regions, and 76% of the regions were associated with the 5'-ends of at least two RACE products. Our results suggest that there are at least 35% more functional promoters in the human genome than currently annotated.
Affiliation(s)
- Nathan D. Trinklein
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Ulaş Karaöz
- Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA
- Jiaqian Wu
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
- Anason Halees
- Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA
- Shelley Force Aldred
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Patrick J. Collins
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Deyou Zheng
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Zhengdong D. Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Mark B. Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Michael Snyder
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Richard M. Myers
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Corresponding authors. E-mail ; fax (617) 353-6766. E-mail ; fax (650) 725-9689
- Zhiping Weng
- Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA
- Biomedical Engineering Department, Boston University, Boston, Massachusetts 02215, USA
- Corresponding authors. E-mail ; fax (617) 353-6766. E-mail ; fax (650) 725-9689
67
Vinogradov SN, Hoogewijs D, Bailly X, Mizuguchi K, Dewilde S, Moens L, Vanfleteren JR. A model of globin evolution. Gene 2007; 398:132-42. [PMID: 17540514 DOI: 10.1016/j.gene.2007.02.041] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.7] [Received: 12/06/2006] [Revised: 02/20/2007] [Accepted: 02/21/2007] [Indexed: 11/19/2022]
Abstract
Putative globins have been identified in 426 bacterial, 32 Archaeal and 67 eukaryote genomes. Among these sequences are the hitherto unsuspected presence of single domain sensor globins within Bacteria, Fungi, and a Euryarchaeote. Bayesian phylogenetic trees suggest that their occurrence in the latter two groups could be the result of lateral gene transfer from Bacteria. Iterated psiblast searches based on groups of globin sequences indicate that bacterial flavohemoglobins are closer to metazoan globins than to the other two lineages, the 2-over-2 globins and the globin-coupled sensors. Since Bacteria is the only kingdom to have all the subgroups of the three globin lineages, we propose a working model of globin evolution based on the assumption that all three lineages originated and evolved only in Bacteria. Although the 2-over-2 globins and the globin-coupled sensors recognize flavohemoglobins, there is little recognition between them. Thus, in the first stage of globin evolution, we favor a flavohemoglobin-like single domain protein as the ancestral globin. The next stage comprised the splitting off to single domain 2-over-2 and sensor-like globins, followed by the covalent addition of C-terminal domains resulting in the chimeric flavohemoglobins and globin-coupled sensors. The last stage encompassed the lateral gene transfers of some members of the three globin lineages to specific groups of Archaea and Eukaryotes.
Affiliation(s)
- Serge N Vinogradov
- Department of Biochemistry and Molecular Biology, Wayne State University School of Medicine, Detroit, MI 48201, USA.
68
Weinberg Z, Barrick JE, Yao Z, Roth A, Kim JN, Gore J, Wang JX, Lee ER, Block KF, Sudarsan N, Neph S, Tompa M, Ruzzo WL, Breaker RR. Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline. Nucleic Acids Res 2007; 35:4809-19. [PMID: 17621584 PMCID: PMC1950547 DOI: 10.1093/nar/gkm487] [Citation(s) in RCA: 231] [Impact Index Per Article: 12.8] [Indexed: 12/14/2022]
Abstract
We applied a computational pipeline based on comparative genomics to bacteria, and identified 22 novel candidate RNA motifs. We predicted six to be riboswitches, which are mRNA elements that regulate gene expression on binding a specific metabolite. In separate studies, we confirmed that two of these are novel riboswitches. Three other riboswitch candidates are upstream of either a putative transporter gene in the order Lactobacillales, citric acid cycle genes in Burkholderiales or molybdenum cofactor biosynthesis genes in several phyla. The remaining riboswitch candidate, the widespread Genes for the Environment, for Membranes and for Motility (GEMM) motif, is associated with genes important for natural competence in Vibrio cholerae and the use of metal ions as electron acceptors in Geobacter sulfurreducens. Among the other motifs, one has a genetic distribution similar to a previously published candidate riboswitch, ykkC/yxkD, but has a different structure. We identified possible non-coding RNAs in five phyla, and several additional cis-regulatory RNAs, including one in ε-proteobacteria (upstream of purD, involved in purine biosynthesis), and one in Cyanobacteria (within an ATP synthase operon). These candidate RNAs add to the growing list of RNA motifs involved in multiple cellular processes, and suggest that many additional RNAs remain to be discovered.
Affiliation(s)
- Zasha Weinberg, Jeffrey E. Barrick, Zizhen Yao, Adam Roth, Jane N. Kim, Jeremy Gore, Joy Xin Wang, Elaine R. Lee, Kirsten F. Block, Narasimhan Sudarsan, Shane Neph, Martin Tompa, Walter L. Ruzzo, Ronald R. Breaker
- Department of Molecular, Cellular and Developmental Biology, Howard Hughes Medical Institute, Department of Molecular Biophysics and Biochemistry, Yale University, Box 208103, New Haven, CT 06520-8103, USA
- Department of Computer Science and Engineering and Department of Genome Sciences, University of Washington, Box 352350, Seattle, WA 98195-2350, USA
- *To whom correspondence should be addressed. (203) 432-6554; (203) 432-6161
|
69
|
Stone EA, Sidow A. Constructing a meaningful evolutionary average at the phylogenetic center of mass. BMC Bioinformatics 2007; 8:222. [PMID: 17594490 PMCID: PMC1919398 DOI: 10.1186/1471-2105-8-222] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 01/09/2007] [Accepted: 06/26/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND As a consequence of the evolutionary process, data collected from related species tend to be similar. This similarity by descent can obscure subtler signals in the data such as the evidence of constraint on variation due to shared selective pressures. In comparative sequence analysis, for example, sequence similarity is often used to illuminate important regions of the genome, but if the comparison is between closely related species, then similarity is the rule rather than the interesting exception. Furthermore, and perhaps worse yet, the contribution of a divergent third species may be masked by the strong similarity between the other two. Here we propose a remedy that weighs the contribution of each species according to its phylogenetic placement. RESULTS We first solve the problem of summarizing data related by phylogeny, and we explain why an average should operate on the entire evolutionary trajectory that relates the data. This perspective leads to a new approach in which we define the average in terms of the phylogeny, using the data and a stochastic model to obtain a probability on evolutionary trajectories. With the assumption that the data evolve according to a Brownian motion process on the tree, we show that our evolutionary average can be computed as convex combination of the species data. Thus, our approach, called the BranchManager, defines both an average and a novel taxon weighting scheme. We compare the BranchManager to two other methods, demonstrating why it exhibits desirable properties. In doing so, we devise a framework for comparison and introduce the concept of a representative point at which the average is situated. CONCLUSION The BranchManager uses as its representative point the phylogenetic center of mass, a choice which has both intuitive and practical appeal. 
Because our average is intrinsic to both the dataset and to the phylogeny, we expect it and its corresponding weighting scheme to be useful in all sorts of studies where interspecies data need to be combined. Obvious applications include evolutionary studies of morphology, physiology or behaviour, but quantitative measures such as sequence hydrophobicity and gene expression level are amenable to our approach as well. Other areas of potential impact include motif discovery and vaccine design. A Java implementation of the BranchManager is available for download, as is a script written in the statistical language R.
Affiliation(s)
- Eric A Stone
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695-7566, USA
- Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA
- Arend Sidow
- Department of Pathology, Stanford University, Stanford, CA 94305-5324, USA
- Department of Genetics, Stanford University, Stanford, CA 94305-5120, USA
|
70
|
Roth A, Winkler WC, Regulski EE, Lee BWK, Lim J, Jona I, Barrick JE, Ritwik A, Kim JN, Welz R, Iwata-Reuyl D, Breaker RR. A riboswitch selective for the queuosine precursor preQ1 contains an unusually small aptamer domain. Nat Struct Mol Biol 2007; 14:308-17. [PMID: 17384645 DOI: 10.1038/nsmb1224] [Citation(s) in RCA: 190] [Impact Index Per Article: 10.6] [Received: 12/01/2006] [Accepted: 03/05/2007] [Indexed: 01/09/2023]
Abstract
A previous bioinformatics-based search for riboswitches yielded several candidate motifs in eubacteria. One of these motifs commonly resides in the 5' untranslated regions of genes involved in the biosynthesis of queuosine (Q), a hypermodified nucleoside occupying the anticodon wobble position of certain transfer RNAs. Here we show that this structured RNA is part of a riboswitch selective for 7-aminomethyl-7-deazaguanine (preQ1), an intermediate in queuosine biosynthesis. Compared with other natural metabolite-binding RNAs, the preQ1 aptamer appears to have a simple structure, consisting of a single stem-loop and a short tail sequence that together are formed from as few as 34 nucleotides. Despite its small size, this aptamer is highly selective for its cognate ligand in vitro and has an affinity for preQ1 in the low nanomolar range. Relatively compact RNA structures can therefore serve effectively as metabolite receptors to regulate gene expression.
Affiliation(s)
- Adam Roth
- Howard Hughes Medical Institute, USA
|
71
|
Li Y, Wang Y, Li Y, Yang L. Prediction of the deleterious nsSNPs in ABCB transporters. FEBS Lett 2006; 580:6800-6. [PMID: 17141228 DOI: 10.1016/j.febslet.2006.11.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/11/2006] [Revised: 11/02/2006] [Accepted: 11/14/2006] [Indexed: 01/11/2023]
Abstract
The non-synonymous SNPs (nsSNPs) in coding regions, neutral or deleterious, could lead to the alteration of the function or structure of proteins. We have developed the computational models to analyze the deleterious nsSNPs in the transporters and predict ones in ABCB (ATP-binding cassette B) transporters of interest. The RPLS (ridge partial least square) and LDA (linear discriminant analysis) methods were applied to the problem, by training on a selection of datasets from a specified source, i.e., human transporters. The best combination of datasets and prediction attributes was ascertained. The prediction accuracy of the theoretical RPLS model for the training and testing sets is 84.8% and 80.4%, respectively (LDA: 84.3% and 80.4%), which indicates the models are reasonable and may be helpful for pharmacogenetics studies.
Affiliation(s)
- Yanhong Li
- Laboratory of Pharmaceutical Resource Discovery, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, #457 Zhongshan Road, Dalian 116023, China
|
72
|
Campbell-Valois FX, Tarassov K, Michnick SW. Massive sequence perturbation of the Raf ras binding domain reveals relationships between sequence conservation, secondary structure propensity, hydrophobic core organization and stability. J Mol Biol 2006; 362:151-71. [PMID: 16916524 DOI: 10.1016/j.jmb.2006.06.061] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 01/23/2006] [Revised: 05/23/2006] [Accepted: 06/21/2006] [Indexed: 11/25/2022]
Abstract
The contributions of specific residues to the delicate balance between function, stability and folding rates could be determined, in part, by comparing the sequences of structures having identical folds, but insignificant sequence homology. Recently, we have devised an experimental strategy to thoroughly explore residue substitutions consistent with a specific class of structure. Using this approach, the amino acids tolerated at virtually all residues of the c-Raf/Raf1 ras binding domain (Raf RBD), an exemplar of the common beta-grasp ubiquitin-like topology, were obtained and used to define the sequence determinants of this fold. Herein, we present analyses suggesting that more subtle sequence selection pressure, including propensity for secondary structure, the hydrophobic core organization and charge distribution are imposed on the Raf RBD sequence. Secondly, using the Gibbs free energies (ΔG(F-U)) obtained for 51 mutants of Raf RBD, we demonstrate a strong correlation between amino acid conservation and the destabilization induced by truncating mutants. In addition, four mutants are shown to significantly stabilize Raf RBD native structure. Two of these mutations, including the well-studied R89L, are known to severely compromise binding affinity for ras. Another stabilized mutant consisted of a deletion of amino acid residues E104-K106. This deletion naturally occurs in the homologues a-Raf and b-Raf and could indicate functional divergence. Finally, the combination of mutations affecting five of 78 residues of Raf RBD results in stabilization of the structure by approximately 12 kJ mol⁻¹ (ΔG(F-U) is -22 and -34 kJ mol⁻¹ for wt and mutant, respectively).
The sequence perturbation approach combined with sequence/structure analysis of the ubiquitin-like fold provide a basis for the identification of sequence-specific requirements for function, stability and folding rate of the Raf RBD and structural analogues, highlighting the utility of conservation profiles as predictive tools of structural organization.
Affiliation(s)
- F-X Campbell-Valois
- Département de Biochimie, Université de Montréal, C.P. 6128, Succ. centre-ville, Montréal, Québec, Canada H3C 3J7
|
73
|
Sutormin RA, Mironov AA. Membrane profile-based probabilistic method for predicting transmembrane segments via multiple protein sequence alignment. Mol Biol 2006. [DOI: 10.1134/s0026893306030150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
74
|
Vinogradov SN, Hoogewijs D, Bailly X, Arredondo-Peter R, Gough J, Dewilde S, Moens L, Vanfleteren JR. A phylogenomic profile of globins. BMC Evol Biol 2006; 6:31. [PMID: 16600051 PMCID: PMC1457004 DOI: 10.1186/1471-2148-6-31] [Citation(s) in RCA: 173] [Impact Index Per Article: 9.1] [Received: 12/15/2005] [Accepted: 04/07/2006] [Indexed: 12/26/2022]
Abstract
Background Globins occur in all three kingdoms of life: they can be classified into single-domain globins and chimeric globins. The latter comprise the flavohemoglobins with a C-terminal FAD-binding domain and the gene-regulating globin coupled sensors, with variable C-terminal domains. The single-domain globins encompass sequences related to chimeric globins and «truncated» hemoglobins with a 2-over-2 instead of the canonical 3-over-3 α-helical fold. Results A census of globins in 26 archaeal, 245 bacterial and 49 eukaryote genomes was carried out. Only ~25% of archaea have globins, including globin coupled sensors, related single domain globins and 2-over-2 globins. From one to seven globins per genome were found in ~65% of the bacterial genomes: the presence and number of globins are positively correlated with genome size. Globins appear to be mostly absent in Bacteroidetes/Chlorobi, Chlamydia, Lactobacillales, Mollicutes, Rickettsiales, Pasteurellales and Spirochaetes. Single domain globins occur in metazoans and flavohemoglobins are found in fungi, diplomonads and mycetozoans. Although red algae have single domain globins, including 2-over-2 globins, the green algae and ciliates have only 2-over-2 globins. Plants have symbiotic and nonsymbiotic single domain hemoglobins and 2-over-2 hemoglobins. Over 90% of eukaryotes have globins: the nematode Caenorhabditis has the most putative globins, ~33. No globins occur in the parasitic, unicellular eukaryotes such as Encephalitozoon, Entamoeba, Plasmodium and Trypanosoma. Conclusion Although Bacteria have all three types of globins, Archaea do not have flavohemoglobins and Eukaryotes lack globin coupled sensors. Since the hemoglobins in organisms other than animals are enzymes or sensors, it is likely that the evolution of an oxygen transport function accompanied the emergence of multicellular animals.
Affiliation(s)
- Serge N Vinogradov
- Department of Biochemistry and Molecular Biology, Wayne State University School of Medicine, Detroit, MI 48201, USA
- David Hoogewijs
- Department of Biology, Ghent University, B-9000 Ghent, Belgium
- Xavier Bailly
- Station Biologique de Roscoff, 29680 Roscoff, France
- Raúl Arredondo-Peter
- Laboratorio de Biofísica y Biología Molecular, Facultad de Ciencias, Universidad Autónoma del Estado de Morelos, 62210 Cuernavaca, Morelos, México
- Julian Gough
- RIKEN Genomic Sciences Centre, Yokohama 230-0045, Japan
- Sylvia Dewilde
- Department of Biomedical Sciences, University of Antwerp, 2610 Antwerp, Belgium
- Luc Moens
- Department of Biomedical Sciences, University of Antwerp, 2610 Antwerp, Belgium
|
75
|
Johnston CR, Shields DC. A sequence sub-sampling algorithm increases the power to detect distant homologues. Nucleic Acids Res 2005; 33:3772-8. [PMID: 16006623 PMCID: PMC1174907 DOI: 10.1093/nar/gki687] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022]
Abstract
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At false-positive rates of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity, with a 37.5% greater sensitivity compared with HMMER alone, when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) weighted towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.
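The sub-sampling step described in this abstract (random subsets of three sequences drawn from a parent alignment) might be sketched as follows. This is an illustration only, not the authors' Rand-shuffle program: the function name `random_subsets`, the toy alignment, and the number of subsets are assumptions, and the abstract does not say how the per-subset HMMER results are combined, so that step is left out.

```python
import random

def random_subsets(alignment, subset_size=3, n_subsets=10, seed=0):
    """Draw random subsets of aligned sequences from a parent alignment.

    alignment: dict mapping sequence id -> aligned sequence string.
    Returns a list of dicts, each holding `subset_size` sequences.
    """
    if subset_size > len(alignment):
        raise ValueError("subset_size exceeds number of sequences")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ids = sorted(alignment)
    subsets = []
    for _ in range(n_subsets):
        chosen = rng.sample(ids, subset_size)  # sample without replacement
        subsets.append({seq_id: alignment[seq_id] for seq_id in chosen})
    return subsets

# Toy parent alignment (hypothetical sequences, gaps shown as '-').
parent = {
    "seq1": "MKV-LA",
    "seq2": "MKI-LA",
    "seq3": "MRVQLA",
    "seq4": "MKVQLS",
    "seq5": "MKV-LG",
}
subsets = random_subsets(parent, subset_size=3, n_subsets=4)
```

Each subset would then be used to build its own profile HMM and searched separately; how the search results are merged is not specified in the abstract.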
Affiliation(s)
- Catrióna R Johnston
- Department of Clinical Pharmacology, Bioinformatics Group, Royal College of Surgeons in Ireland, 123 St Stephens Green, Dublin 2, Ireland.
|
76
|
Barrick JE, Sudarsan N, Weinberg Z, Ruzzo WL, Breaker RR. 6S RNA is a widespread regulator of eubacterial RNA polymerase that resembles an open promoter. RNA (New York, N.Y.) 2005; 11:774-84. [PMID: 15811922 PMCID: PMC1370762 DOI: 10.1261/rna.7286705] [Citation(s) in RCA: 174] [Impact Index Per Article: 8.7] [Received: 12/23/2004] [Accepted: 02/01/2005] [Indexed: 05/24/2023]
Abstract
6S RNA is an abundant noncoding RNA in Escherichia coli that binds to sigma70 RNA polymerase holoenzyme to globally regulate gene expression in response to the shift from exponential growth to stationary phase. We have computationally identified >100 new 6S RNA homologs in diverse eubacterial lineages. Two abundant Bacillus subtilis RNAs of unknown function (BsrA and BsrB) and cyanobacterial 6Sa RNAs are now recognized as 6S homologs. Structural probing of E. coli 6S RNA and a B. subtilis homolog supports a common secondary structure derived from comparative sequence analysis. The conserved features of 6S RNA suggest that it binds RNA polymerase by mimicking the structure of DNA template in an open promoter complex. Interestingly, the two B. subtilis 6S RNAs are discoordinately expressed during growth, and many proteobacterial 6S RNAs could be cotranscribed with downstream homologs of the E. coli ygfA gene encoding a putative methenyltetrahydrofolate synthetase. The prevalence and robust expression of 6S RNAs emphasize their critical role in bacterial adaptation.
Affiliation(s)
- Jeffrey E Barrick
- Department of Molecular, Cellular, and Developmental Biology, Yale University, P.O. Box 208103, New Haven, CT 06520, USA
|
77
|
Wistrand M, Sonnhammer ELL. Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics 2005; 6:99. [PMID: 15831105 PMCID: PMC1097716 DOI: 10.1186/1471-2105-6-99] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.0] [Received: 02/01/2005] [Accepted: 04/15/2005] [Indexed: 11/24/2022]
Abstract
Background Profile hidden Markov model (HMM) techniques are among the most powerful methods for protein homology detection. Yet, the critical features for successful modelling are not fully known. In the present work we approached this by using two of the most popular HMM packages: SAM and HMMER. The programs' abilities to build models and score sequences were compared on a SCOP/Pfam based test set. The comparison was done separately for local and global HMM scoring. Results Using default settings, SAM was overall more sensitive. SAM's model estimation was superior, while HMMER's model scoring was more accurate. Critical features for model building were then analysed by comparing the two packages' algorithmic choices and parameters. The weighting between prior probabilities and multiple alignment counts held the primary explanation why SAM's model building was superior. Our analysis suggests that HMMER gives too much weight to the sequence counts. SAM's emission prior probabilities were also shown to be more sensitive. The relative sequence weighting schemes are different in the two packages but performed equivalently. Conclusion SAM model estimation was more sensitive, while HMMER model scoring was more accurate. By combining the best algorithmic features from both packages the accuracy was substantially improved compared to their default performance.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
- Erik LL Sonnhammer
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
78
|
Balasubramanian S, Xia Y, Freinkman E, Gerstein M. Sequence variation in G-protein-coupled receptors: analysis of single nucleotide polymorphisms. Nucleic Acids Res 2005; 33:1710-21. [PMID: 15784611 PMCID: PMC1069129 DOI: 10.1093/nar/gki311] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.9] [Indexed: 11/18/2022]
Abstract
We assessed the disease-causing potential of single nucleotide polymorphisms (SNPs) based on a simple set of sequence-based features. We focused on SNPs from the dbSNP database in G-protein-coupled receptors (GPCRs), a large class of important transmembrane (TM) proteins. Apart from the location of the SNP in the protein, we evaluated the predictive power of three major classes of features to differentiate between disease-causing mutations and neutral changes: (i) properties derived from amino-acid scales, such as volume and hydrophobicity; (ii) position-specific phylogenetic features reflecting evolutionary conservation, such as normalized site entropy, residue frequency and SIFT score; and (iii) substitution-matrix scores, such as those derived from the BLOSUM62, GRANTHAM and PHAT matrices. We validated our approach using a control dataset consisting of known disease-causing mutations and neutral variations. Logistic regression analyses indicated that position-specific phylogenetic features that describe the conservation of an amino acid at a specific site are the best discriminators of disease mutations versus neutral variations, and integration of all our features improves discrimination power. Overall, we identify 115 SNPs in GPCRs from dbSNP that are likely to be associated with disease and thus are good candidates for genotyping in association studies.
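One of the conservation features named in this abstract, normalized site entropy, can be illustrated with a short sketch. The abstract does not give the exact formula, so the version below is an assumption: Shannon entropy of the residues in one alignment column, ignoring gaps and divided by the logarithm of the alphabet size (20 amino acids) so that values fall between 0 and 1. The function name and example columns are hypothetical.

```python
import math
from collections import Counter

def normalized_site_entropy(column, alphabet_size=20):
    """Shannon entropy of one alignment column, scaled to [0, 1].

    column: iterable of residues observed at one site; gaps ('-') are ignored.
    Normalizing by log(alphabet_size) is an assumption of this sketch.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column has no residues")
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(alphabet_size)

conserved = normalized_site_entropy("LLLLLLLL")  # fully conserved site
variable = normalized_site_entropy("LIVMFAGS")   # eight different residues
```

Under this definition a fully conserved site scores zero and a more variable site scores higher, which matches the intuition that substitutions at low-entropy sites are more likely to be deleterious.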
Affiliation(s)
- Suganthi Balasubramanian
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
- Yu Xia
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
- Elizaveta Freinkman
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
- Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
- Department of Computer Science, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
- To whom correspondence should be addressed. Tel: +1 203 432 6105; Fax: +1 360 838 7861
|
79
|
Ilyin VA, Abyzov A, Leslin CM. Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci 2005; 13:1865-74. [PMID: 15215530 PMCID: PMC2279929 DOI: 10.1110/ps.04672604] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.9] [Indexed: 10/26/2022]
Abstract
Similarity of protein structures has been analyzed using three-dimensional Delaunay triangulation patterns derived from the backbone representation. It has been found that structurally related proteins have a common spatial invariant part, a set of tetrahedrons, mathematically described as a common spatial subgraph volume of the three-dimensional contact graph derived from Delaunay tessellation (DT). Based on this property of protein structures, we present a novel common volume superimposition (TOPOFIT) method to produce structural alignments. Structural alignments are usually evaluated by a number of equivalent (aligned) positions (N(e)) with corresponding root mean square deviation (RMSD). The superimposition of the DT patterns allows one to uniquely identify a maximal common number of equivalent residues in the structural alignment. In other words, TOPOFIT identifies a feature point on the RMSD N(e) curve, a topomax point, until which the topologies of two structures correspond to each other, including backbone and interresidue contacts, whereas the growing number of mismatches between the DT patterns occurs at larger RMSD (N(e)) after the topomax point. It has been found that the topomax point is present in all alignments from different protein structural classes; therefore, the TOPOFIT method identifies common, invariant structural parts between proteins. The alignments produced by the TOPOFIT method have a good correlation with alignments produced by other current methods. This novel method opens new opportunities for the comparative analysis of protein structures and for more detailed studies on understanding the molecular principles of tertiary structure organization and functionality. The TOPOFIT method also helps to detect conformational changes, topological differences in variable parts, which are particularly important for studies of variations in active/binding sites and protein classification.
Affiliation(s)
- Valentin A Ilyin
- Biology Department, Northeastern University, 360 Huntington Avenue, Boston, MA 02115, USA.
|
80
|
Cuff AL, Martin ACR. Analysis of void volumes in proteins and application to stability of the p53 tumour suppressor protein. J Mol Biol 2005; 344:1199-209. [PMID: 15561139 DOI: 10.1016/j.jmb.2004.10.015] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.4] [Received: 05/11/2004] [Revised: 09/24/2004] [Accepted: 10/12/2004] [Indexed: 10/26/2022]
Abstract
We have developed a new method for the analysis of voids in proteins (defined as empty cavities not accessible to solvent). This method combines analysis of individual discrete voids with analysis of packing quality. While these are different aspects of the same effect, they have traditionally been analysed using different approaches. The method has been applied to the calculation of total void volume and maximum void size in a non-redundant set of protein domains and has been used to examine correlations between thermal stability and void size. The tumour-suppressor protein p53 has then been compared with the non-redundant data set to determine whether its low thermal stability results from poor packing. We found that p53 has average packing, but the detrimental effects of some previously unexplained mutations to p53 observed in cancer can be explained by the creation of unusually large voids.
Affiliation(s)
- Alison L Cuff
- School of Animal and Microbial Sciences, University of Reading, Whiteknights, P.O. Box 228, Reading RG6 6AJ, UK
|
81
|
Voss NR, Gerstein M. Calculation of standard atomic volumes for RNA and comparison with proteins: RNA is packed more tightly. J Mol Biol 2005; 346:477-92. [PMID: 15670598 DOI: 10.1016/j.jmb.2004.11.072] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.6] [Received: 08/19/2004] [Revised: 11/24/2004] [Accepted: 11/24/2004] [Indexed: 11/22/2022]
Abstract
Traditionally, for biomolecular packing calculations research has focused on proteins. Besides proteins, RNA is the other large biomolecule that has tertiary structure interactions and complex packing. No one has yet quantitatively investigated RNA packing or compared its packing to that of proteins because, until recently, there were no large RNA structures. Here we address this question in detail, using Voronoi volume calculations on a set of high-resolution RNA crystal structures. We do a careful parameterization, taking into account many factors such as atomic radii, crystal packing, structural complexity, solvent, and associated protein to obtain a self-consistent, universal set of volumes that can be applied to both RNA and protein. We report this set of volumes, which we call the NucProt parameter set. Our measured values are consistent across the many different RNA structures and packing environments. When common atom types are compared between proteins and RNA, nine of 12 types show that RNA has a smaller volume and packs more tightly than protein, suggesting that close-packing may be as important for the folding of RNAs as it is for proteins. Moreover, calculated partial specific volumes show that RNA bases pack more densely than corresponding aromatic residues from proteins. Finally, we find that RNA bases have similar packing volumes to DNA bases, despite the absence of tertiary contacts in DNA. Programs, parameter sets and raw data are available online at.
Affiliation(s)
- N R Voss
- Molecular Biophysics and Biochemistry, Yale University, 260 Whitney Ave, P.O. Box 208114, New Haven, CT 06520, USA
|
82
|
Littler SJ, Hubbard SJ. Conservation of orientation and sequence in protein domain--domain interactions. J Mol Biol 2004; 345:1265-79. [PMID: 15644220 DOI: 10.1016/j.jmb.2004.11.011] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Received: 10/18/2004] [Accepted: 11/05/2004] [Indexed: 11/16/2022]
Abstract
The repertoire of naturally occurring protein structures is usually characterised in structural terms at the domain level by their constituent folds. As structure is acknowledged to be an important stepping stone to the understanding of protein function, an appreciation of how individual domain interactions are built to form complete, functional protein structures is essential. A comprehensive study of protein domain interactions has been undertaken, covering all those observed in known structures, as well as those predicted to occur in 46 completed genome sequences from all three domains of life. In particular, we examine the promiscuity of protein domains characterised by SCOP superfamilies in terms of their interacting partners, the surface they use to form these interactions, and the relative orientations of their domain partners. Protein domains are shown to display a variety of behaviours, ranging from high promiscuity to absolute monogamy of domain surface employed, with both multiple and single domain partners. In addition, the conservation of sequence and volume at domain interface surfaces is observed to be significantly higher than at accessible surface in general, acting as a powerful potential predictor for domain interactions. We also examine the separation of interacting domains in protein sequence, showing that standard thresholds of 30 amino acid residues lead to a significant false positive rate, and an even more significant false negative rate of approximately 40%. These data suggest that there may be many more than the 2000 domain--domain interactions that have not yet been observed structurally, and we provide a top 30 hit-list of putative domain interactions which should be targeted.
Affiliation(s)
- Stephen J Littler
- Faculty of Life Sciences, The University of Manchester, Jackson's Mill, P.O. Box 88, Manchester M60 1QD, UK
|
83
|
Wistrand M, Sonnhammer ELL. Transition priors for protein hidden Markov models: an empirical study towards maximum discrimination. J Comput Biol 2004; 11:181-93. [PMID: 15072695 DOI: 10.1089/106652704773416957] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
Insertions and deletions in a profile hidden Markov model (HMM) are modeled by transition probabilities between insert, delete and match states. These are estimated by combining observed data and prior probabilities. The transition prior probabilities can be defined either ad hoc or by maximum likelihood (ML) estimation. We show that the choice of transition prior greatly affects the HMM's ability to discriminate between true and false hits. HMM discrimination was measured using the HMMER 2.2 package applied to 373 families from Pfam. We measured the discrimination between true members and noise sequences employing various ML transition priors and also systematically scanned the parameter space of ad hoc transition priors. Our results indicate that ML priors produce far from optimal discrimination, and we present an empirically derived prior that considerably decreases the number of misclassifications compared to ML. Most of the difference stems from the probabilities for exiting a delete state. The ML prior, which is unaware of noise sequences, estimates a delete-to-delete probability that is relatively high and does not penalize noise sequences enough for optimal discrimination.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
84
|
Afonnikov DA, Kolchanov NA. CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences. Nucleic Acids Res 2004; 32:W64-8. [PMID: 15215352 PMCID: PMC441589 DOI: 10.1093/nar/gkh451] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.6] [Indexed: 11/14/2022]
Abstract
Recent results suggest that during evolution certain substitutions at protein sites may occur in a coordinated manner due to interactions between amino acid residues. Information on these coordinated substitutions may be useful for analysis of protein structure and function. CRASP is an Internet-available software tool for the detection and analysis of coordinated substitutions in multiple alignments of protein sequences. The approach is based on estimation of the correlation coefficient between the values of a physicochemical parameter at a pair of positions of sequence alignment. The program enables the user to detect and analyze pairwise relationships between amino acid substitutions at protein sequence positions, estimate the contribution of the coordinated substitutions to the evolutionary invariance or variability in integral protein physicochemical characteristics such as the net charge of protein residues and hydrophobic core volume. The CRASP program is available at http://wwwmgs.bionet.nsc.ru/mgs/programs/crasp/.
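The correlation-based idea in this abstract can be illustrated with a minimal sketch. It is not the CRASP implementation: the choice of the Kyte-Doolittle hydropathy index as the physicochemical parameter, the handling of gaps (sequences with a gap at either position are skipped), and the function names `paired_values` and `pearson_r` are all assumptions made here for illustration.

```python
from math import sqrt

# Kyte-Doolittle hydropathy index; an assumed example of a physicochemical
# parameter (the abstract does not say which parameters CRASP offers).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def paired_values(alignment, i, j, scale=KYTE_DOOLITTLE):
    """Collect parameter values at columns i and j, one pair per sequence.

    Sequences with a gap or unknown symbol at either column are skipped
    (an assumption of this sketch).
    """
    xs, ys = [], []
    for seq in alignment:
        a, b = seq[i], seq[j]
        if a in scale and b in scale:
            xs.append(scale[a])
            ys.append(scale[b])
    return xs, ys


def pearson_r(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is invariant."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Toy alignment (hypothetical data): columns 0 and 2 swap hydrophobic and
# acidic residues, column 1 is invariant.
toy_alignment = ["ILD", "VLE", "DLI", "ELV"]
for i, j in [(0, 1), (0, 2), (1, 2)]:
    r = pearson_r(*paired_values(toy_alignment, i, j))
    print(f"columns {i},{j}: r = {r:.3f}")
```

A strongly negative or positive coefficient between two columns is the kind of signal the abstract calls a coordinated substitution; a real analysis would also need a significance test and a correction for phylogenetic relatedness, which this sketch omits.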
|
85
|
Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004; 5:113. [PMID: 15318951 PMCID: PMC517706 DOI: 10.1186/1471-2105-5-113] [Citation(s) in RCA: 6028] [Impact Index Per Article: 287.0] [Received: 03/25/2004] [Accepted: 08/19/2004] [Indexed: 11/22/2022]
Abstract
Background In a previous paper, we introduced MUSCLE, a new program for creating multiple alignments of protein sequences, giving a brief summary of the algorithm and showing MUSCLE to achieve the highest scores reported to date on four alignment accuracy benchmarks. Here we present a more complete discussion of the algorithm, describing several previously unpublished techniques that improve biological accuracy and/or computational complexity. We introduce a new option, MUSCLE-fast, designed for high-throughput applications. We also describe a new protocol for evaluating objective functions that align two profiles. Results We compare the speed and accuracy of MUSCLE with CLUSTALW, Progressive POA and the MAFFT script FFTNS1, the fastest previously published program known to the author. Accuracy is measured using four benchmarks: BAliBASE, PREFAB, SABmark and SMART. We test three variants that offer highest accuracy (MUSCLE with default settings), highest speed (MUSCLE-fast), and a carefully chosen compromise between the two (MUSCLE-prog). We find MUSCLE-fast to be the fastest algorithm on all test sets, achieving average alignment accuracy similar to CLUSTALW in times that are typically two to three orders of magnitude less. MUSCLE-fast is able to align 1,000 sequences of average length 282 in 21 seconds on a current desktop computer. Conclusions MUSCLE offers a range of options that provide improved speed and/or alignment accuracy compared with currently available programs. MUSCLE is freely available at .
Affiliation(s)
- Robert C Edgar
- Department of Plant and Microbial Biology, 461 Koshland Hall, University of California, Berkeley, CA 94720-3102, USA.
|
86
|
Podell S, Gribskov M. Predicting N-terminal myristoylation sites in plant proteins. BMC Genomics 2004; 5:37. [PMID: 15202951 PMCID: PMC449705 DOI: 10.1186/1471-2164-5-37] [Citation(s) in RCA: 121] [Impact Index Per Article: 5.8] [Received: 02/18/2004] [Accepted: 06/17/2004] [Indexed: 01/09/2023]
Abstract
Background N-terminal myristoylation plays a vital role in membrane targeting and signal transduction in plant responses to environmental stress. Although N-myristoyltransferase enzymatic function is conserved across plant, animal, and fungal kingdoms, exact substrate specificities vary, making it difficult to predict protein myristoylation accurately within specific taxonomic groups. Results A new method for predicting N-terminal myristoylation sites specifically in plants has been developed and statistically tested for sensitivity, specificity, and robustness. Compared to previously available methods, the new model is both more sensitive in detecting known positives, and more selective in avoiding false positives. Scores of myristoylated and non-myristoylated proteins are more widely separated than with other methods, greatly reducing ambiguity and the number of sequences giving intermediate, uninformative results. The prediction model is available at . Conclusion Superior performance of the new model is due to the selection of a plant-specific training set, covering 266 unique sequence examples from 40 different species, the use of a probability-based hidden Markov model to obtain predictive scores, and a threshold cutoff value chosen to provide maximum positive-negative discrimination. The new model has been used to predict 589 plant proteins likely to contain N-terminal myristoylation signals, and to analyze the functional families in which these proteins occur.
Affiliation(s)
- Sheila Podell
- San Diego Supercomputer Center, University of California San Diego, La Jolla CA 92093-0537, USA
- Department of Biology, University of California San Diego, La Jolla CA 92093-0537, USA
- Michael Gribskov
- San Diego Supercomputer Center, University of California San Diego, La Jolla CA 92093-0537, USA
- Department of Biology, University of California San Diego, La Jolla CA 92093-0537, USA
|
87
|
La D, Silver M, Edgar RC, Livesay DR. Using motif-based methods in multiple genome analyses: a case study comparing orthologous mesophilic and thermophilic proteins. Biochemistry 2003; 42:8988-98. [PMID: 12885231 DOI: 10.1021/bi027435e] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Indexed: 11/28/2022]
Abstract
Protein motifs represent highly conserved regions within protein families and are generally accepted to describe critical regions required for protein stability and/or function. In this comprehensive analysis, we present a robust, unique approach to identify and compare corresponding mesophilic and thermophilic sequence motifs between all orthologous proteins within 44 microbial genomes. Motif similarity is determined through global sequence alignment of mesophilic and thermophilic motif pairs, which are identified by a greedy algorithm. Our results reveal only modest correlation between motif and overall sequence similarity, highlighting the rationale of motif-based approaches in comprehensive multigenome comparisons. Conserved mutations reflect previously suggested physiochemical principles for conferring thermostability. Additionally, comparisons between corresponding mesophilic and thermophilic motif pairs provide key biochemical insights related to thermostability and can be used to test the evolutionary robustness of individual structural comparisons. We demonstrate the ability of our unique approach to provide key insights in two examples: the TATA-box binding protein and glutamate dehydrogenase families. In the latter example, conserved mutations hint at novel origins leading to structural stability differences within the hexamer structures. Additionally, we present amino acid composition data and average protein length comparisons for all 44 microbial genomes.
Affiliation(s)
- David La
- Department of Chemistry, California State Polytechnic University at Pomona, 3801 West Temple Avenue, Pomona, California 91768, USA
|
88
|
Abstract
Computing the volume occupied by individual atoms in macromolecular structures has been the subject of research for several decades. This interest has grown in the recent years, because weighted volumes are widely used in implicit solvent models. Applications of the latter in molecular mechanics simulations require that the derivatives of these weighted volumes be known. In this article, we give a formula for the volume derivative of a molecule modeled as a space-filling diagram made up of balls in motion. The formula is given in terms of the weights, radii, and distances between the centers as well as the sizes of the facets of the power diagram restricted to the space-filling diagram. Special attention is given to the detection and treatment of singularities as well as discontinuities of the derivative.
|
89
|
Abstract
The importance of a residue for maintaining the structure and function of a protein can usually be inferred from how conserved it appears in a multiple sequence alignment of that protein and its homologues. A reliable metric for quantifying residue conservation is desirable. Over the last two decades many such scores have been proposed, but none has emerged as a generally accepted standard. This work surveys the range of scores that biologists, biochemists, and, more recently, bioinformatics workers have developed, and reviews the intrinsic problems associated with developing and evaluating such a score. A general formula is proposed that may be used to compare the properties of different particular conservation scores or as a measure of conservation in its own right.
Affiliation(s)
- William S J Valdar
- Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.
|
90
|
Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001; 29:2994-3005. [PMID: 11452024 PMCID: PMC55814 DOI: 10.1093/nar/29.14.2994] [Citation(s) in RCA: 977] [Impact Index Per Article: 40.7] [Received: 04/23/2001] [Revised: 05/30/2001] [Accepted: 05/30/2001] [Indexed: 11/13/2022]
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
Affiliation(s)
- A A Schäffer
- National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
|
91
|
May AC. Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics. Protein Eng 2001; 14:209-17. [PMID: 11391012 DOI: 10.1093/protein/14.4.209] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Hierarchical classification is probably the most popular approach to group related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled with the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to cluster related proteins without the need for an a priori threshold: one, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme that combines the information in all the multiply aligned positions of an alignment in a novel way is put forward. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. 
Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
Affiliation(s)
- A C May
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.
|
92
|
Abstract
The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimations of the entropy of the source are not possible due to finite sample effects. Compression algorithms also indicate that the redundancy is in the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the redundancy observed. The findings are related to numerical and biochemical experiments with random polypeptides.
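The two kinds of estimate named in this abstract, a Shannon entropy estimate and a compression-based estimate compared against random surrogates, can be sketched as follows. This is only an illustration under stated assumptions: the plug-in (frequency-count) block entropy, the use of `zlib` as the compressor, and the composition-preserving shuffle as the surrogate are choices made here, not the estimators used in the paper, and the function names are hypothetical.

```python
import math
import random
import zlib
from collections import Counter


def block_entropy(seq, k=1):
    """Plug-in Shannon entropy, in bits, of overlapping length-k blocks.

    The plug-in estimate is biased low for small samples, which is the
    finite-sample effect the abstract warns about.
    """
    blocks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(blocks)
    counts = Counter(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compressed_bits_per_residue(seq):
    """Upper-bound style estimate: zlib output size in bits per residue."""
    compressed = zlib.compress(seq.encode("ascii"), 9)
    return 8 * len(compressed) / len(seq)


def shuffled_surrogate(seq, seed=0):
    """Random surrogate with the same amino acid composition as seq."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


# Hypothetical toy input: a strongly repetitive sequence, chosen so that the
# difference from its shuffled surrogate is easy to see.
sequence = "ACDEFGHIKL" * 50
surrogate = shuffled_surrogate(sequence)
for name, s in [("original", sequence), ("surrogate", surrogate)]:
    print(name,
          round(block_entropy(s, 1), 3),
          round(block_entropy(s, 2), 3),
          round(compressed_bits_per_residue(s), 3))
```

Shuffling leaves the single-residue entropy unchanged and destroys correlations between positions, so any gap between the original and the surrogate in the block entropy or the compressed size reflects sequence correlations. For real protein data the abstract reports that this gap is only about 1%.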
Affiliation(s)
- O Weiss
- Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstr. 43, Berlin, D-10115, Germany
|
93
|
Affiliation(s)
- S Henikoff
- Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109-1024, USA
|
94
|
Fleming PJ, Richards FM. Protein packing: dependence on protein size, secondary structure and amino acid composition. J Mol Biol 2000; 299:487-98. [PMID: 10860754 DOI: 10.1006/jmbi.2000.3750] [Citation(s) in RCA: 110] [Impact Index Per Article: 4.4] [Indexed: 11/22/2022]
Abstract
We have used the occluded surface algorithm to estimate the packing of both buried and exposed amino acid residues in protein structures. This method works equally well for buried residues and solvent-exposed residues in contrast to the commonly used Voronoi method that works directly only on buried residues. The atomic packing of individual globular proteins may vary significantly from the average packing of a large data set of globular proteins. Here, we demonstrate that these variations in protein packing are due to a complex combination of protein size, secondary structure composition and amino acid composition. Differences in protein packing are conserved in protein families of similar structure despite significant sequence differences. This conclusion indicates that quality assessments of packing in protein structures should include a consideration of various parameters including the packing of known homologous proteins. Also, modeling of protein structures based on homologous templates should take into account the packing of the template protein structure.
Affiliation(s)
- P J Fleming
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520-8114, USA.
|
95
|
Abstract
We have developed a novel, fully automatic method for aligning the three-dimensional structures of two proteins. The basic approach is to first align the proteins' secondary structure elements and then extend the alignment to include any equivalent residues found in loops or turns. The initial secondary structure element alignment is determined by a genetic algorithm. After refinement of the secondary structure element alignment, the protein backbones are superposed and a search is performed to identify any additional equivalent residues in a convergent process. Alignments are evaluated using intramolecular distance matrices. Alignments can be performed with or without sequential connectivity constraints. We have applied the method to proteins from several well-studied families: globins, immunoglobulins, serine proteases, dihydrofolate reductases, and DNA methyltransferases. Agreement with manually curated alignments is excellent. A web-based server and additional supporting information are available at http://engpub1.bu.edu/-josephs.
Affiliation(s)
- J D Szustakowski
- Boston University, Department of Biomedical Engineering, Massachusetts, USA
|
96
|
Liu R, Baase WA, Matthews BW. The introduction of strain and its effects on the structure and stability of T4 lysozyme. J Mol Biol 2000; 295:127-45. [PMID: 10623513 DOI: 10.1006/jmbi.1999.3300] [Citation(s) in RCA: 43] [Impact Index Per Article: 1.7] [Indexed: 11/22/2022]
Abstract
In order to try to better understand the role played by strain in the structure and stability of a protein, a series of "small-to-large" mutations was made within the core of T4 lysozyme. Three different alanine residues, one involved in backbone contacts, one in side-chain contacts, and the third adjacent to a small cavity, were each replaced with subsets of the larger residues, Val, Leu, Ile, Met, Phe and Trp. As expected, the protein is progressively destabilized as the size of the introduced side-chain becomes larger. There does, however, seem to be a limit to the destabilization, suggesting that a protein of a given size may be capable of maintaining only a certain amount of strain. The changes in stability vary greatly from site to site. Substitution of larger residues for both Ala42 and Ala98 substantially destabilizes the protein, even though the primary contacts in one case are predominantly with side-chain atoms and in the other with backbone. The results suggest that it is neither practical nor meaningful to try to separate the effects of introduced strain on side-chains from the effects on the backbone. Substitutions at Ala129 are much less destabilizing than at sites 42 or 98. This is most easily understood in terms of the pre-existing cavity, which provides partial space to accommodate the introduced side-chains. Crystal structures were obtained for a number of the mutants. These show that the changes in structure to accommodate the introduced side-chains usually consist of essentially rigid-body displacements of groups of linked atoms, achieved through relatively small changes in torsion angles. On rare occasions, a side-chain close to the site of substitution may change to a different rotamer. When such rotamer changes occur, they permit the structure to dissipate strain by a response that is plastic rather than elastic.
In one case, a surface loop moves 1.2 Å, not in direct response to a mutation, but in an interaction mediated via an intermolecular contact. It illustrates how the structure of a protein can be modified by crystal contacts.
Affiliation(s)
- R Liu
- Institute of Molecular Biology, Howard Hughes Medical Institute and Department of Physics, Eugene, OR, 97403, USA
|
97
|
Drummond M, Stamper J. DNAPROBE, a computer program which generates oligonucleotide probes from protein alignments. Nucleic Acids Res 1999; 27:3493. [PMID: 10446238 PMCID: PMC148592 DOI: 10.1093/nar/27.17.3493] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/12/2022]
Abstract
We describe a program to assist in designing oligonucleotide probes on the basis of protein alignments and the codon usage of the target organism. If necessary, the input sequences can be weighted to neutralise the effect of closely similar sequences or to bias the output in favour of a particular taxon.
Affiliation(s)
- M Drummond
- Nitrogen Fixation Laboratory, John Innes Institute, Colney Lane, Norwich NR4 7UH, UK.
|
98
|
Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, Kuznetsov EN. PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng 1999; 12:387-94. [PMID: 10360979 DOI: 10.1093/protein/12.5.387] [Citation(s) in RCA: 165] [Impact Index Per Article: 6.3] [Indexed: 11/13/2022]
Abstract
Sequence weighting techniques are aimed at balancing redundant observed information from subsets of similar sequences in multiple alignments. Traditional approaches apply the same weight to all positions of a given sequence, hence equal efficiency of phylogenetic changes is assumed along the whole sequence. This restrictive assumption is not required for the new method PSIC (position-specific independent counts) described in this paper. The number of independent observations (counts) of an amino acid type at a given alignment position is calculated from the overall similarity of the sequences that share the amino acid type at this position with the help of statistical concepts. This approach allows the fast computation of position-specific sequence weights even for alignments containing hundreds of sequences. The PSIC approach has been applied to profile extraction and to the fold family assignment of protein sequences with known structures. Our method was shown to be very productive in finding distantly related sequences and more powerful than Hidden Markov Models or the profile methods in WiseTools and PSI-BLAST in many cases. The profile extraction routine is available on the WWW (http://www.bork.embl-heidelberg.de/PSIC or http://www.imb.ac.ru/PSIC).
Affiliation(s)
- S R Sunyaev
- European Molecular Biology Laboratory, Meyerhofstrasse 1, Postfach 10.2209, D-69012 Heidelberg, Germany
|
99
|
Gerstein M, Hegyi H. Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev 1998; 22:277-304. [PMID: 10357579 DOI: 10.1111/j.1574-6976.1998.tb00371.x] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.5] [Indexed: 11/29/2022]
Abstract
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we described how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being comprised of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias. 
One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
|
100
|
Vandergon TL, Riggs CK, Gorr TA, Colacino JM, Riggs AF. The mini-hemoglobins in neural and body wall tissue of the nemertean worm, Cerebratulus lacteus. J Biol Chem 1998; 273:16998-7011. [PMID: 9642264 DOI: 10.1074/jbc.273.27.16998] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.1] [Indexed: 11/06/2022]
Abstract
Hemoglobin (Hb) occurs in circulating red blood cells, neural tissue, and body wall muscle tissue of the nemertean worm, Cerebratulus lacteus. The neural and body wall tissue each express single major Hb components for which the amino acid sequences have been deduced from cDNA and genomic DNA. These 109-residue globins form the smallest stable Hbs known. The globin genes have three exons and two introns with splice sites in the highly conserved positions of most globin genes. Alignment of the sequences with those of other globins indicates that the A, B, and H helices are about one-half the typical length. Phylogenetic analysis indicates that shortening results in a small tendency of globins to group together regardless of their actual relationships. The neural and body wall Hbs in situ are half-saturated with O2 at 2.9 and 4.1 torr, respectively. The Hill coefficient for the neural Hb in situ, approximately 2.9, suggests that the neural Hb self-associates in the deoxy state at least to tetramers at the 2-3 mM (heme) concentration estimated in the cells. The Hb must dissociate upon oxygenation and dilution because the weight-average molecular mass of the HbO2 in vitro is only about 18 kDa at 2-3 µM heme concentration. Calculations suggest that the Hb can function as an O2 store capable of extending neuronal activity in an anoxic environment for 5-30 min.
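The reported half-saturation pressure and Hill coefficient can be related through the standard Hill equation, Y = pO2^n / (P50^n + pO2^n). The sketch below evaluates it for the neural Hb using the values quoted in the abstract (P50 = 2.9 torr, n of approximately 2.9). The function name and the chosen pO2 values are assumptions for illustration; the abstract gives no Hill coefficient for the body wall Hb, so that case is not computed here.

```python
def hill_saturation(po2_torr, p50_torr, hill_n):
    """Fractional O2 saturation from the Hill equation.

    Y = pO2**n / (P50**n + pO2**n), where P50 is the partial pressure at
    half-saturation and n is the Hill coefficient.
    """
    if po2_torr < 0:
        raise ValueError("pO2 must be non-negative")
    if p50_torr <= 0 or hill_n <= 0:
        raise ValueError("P50 and the Hill coefficient must be positive")
    num = po2_torr ** hill_n
    return num / (p50_torr ** hill_n + num)


if __name__ == "__main__":
    # Values for the neural Hb in situ, as quoted in the abstract.
    NEURAL_P50_TORR = 2.9
    NEURAL_HILL_N = 2.9
    # The pO2 values below are arbitrary illustration points.
    for po2 in (1.0, 2.9, 5.0, 10.0):
        y = hill_saturation(po2, NEURAL_P50_TORR, NEURAL_HILL_N)
        print(f"pO2 = {po2:5.1f} torr -> saturation = {y:.3f}")
```

A Hill coefficient near 2.9 gives a steep, sigmoidal curve around P50, which is the cooperative behaviour the abstract attributes to self-association of the deoxy form.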
Affiliation(s)
- T L Vandergon
- Department of Zoology, University of Texas, Austin, Texas 78712-1064, USA.
|