1
Huang YF, Siepel A. Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease. Genome Res 2019; 29:1310-1321. [PMID: 31249063 PMCID: PMC6673719 DOI: 10.1101/gr.245522.118] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/23/2018] [Accepted: 06/20/2019] [Indexed: 12/16/2022]
Abstract
A central challenge in human genomics is to understand the cellular, evolutionary, and clinical significance of genetic variants. Here, we introduce a unified population-genetic and machine-learning model, called Linear Allele-Specific Selection InferencE (LASSIE), for estimating the fitness effects of all observed and potential single-nucleotide variants, based on polymorphism data and predictive genomic features. We applied LASSIE to 51 high-coverage genome sequences annotated with 33 genomic features and constructed a map of allele-specific selection coefficients across all protein-coding sequences in the human genome. This map is generally consistent with previous inferences of the bulk distribution of fitness effects but reveals pervasive weak negative selection against synonymous mutations. In addition, the estimated selection coefficients are highly predictive of inherited pathogenic variants and cancer driver mutations, outperforming state-of-the-art variant prioritization methods. By contrasting our estimated model with ultrahigh coverage ExAC exome-sequencing data, we identified 1118 genes under unusually strong negative selection, which tend to be exclusively expressed in the central nervous system or associated with autism spectrum disorder, as well as 773 genes under unusually weak selection, which tend to be associated with metabolism. This combination of classical population genetic theory with modern machine-learning and large-scale genomic data is a powerful paradigm for the study of both human evolution and disease.
Affiliation(s)
- Yi-Fei Huang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
- Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
2
Savisaar R, Hurst LD. Exonic splice regulation imposes strong selection at synonymous sites. Genome Res 2018; 28:1442-1454. [PMID: 30143596 PMCID: PMC6169883 DOI: 10.1101/gr.233999.117] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 01/11/2018] [Accepted: 07/31/2018] [Indexed: 01/17/2023]
Abstract
What proportion of coding sequence nucleotides have roles in splicing, and how strong is the selection that maintains them? Despite a large body of research into exonic splice regulatory signals, these questions have not been answered. This is because, to our knowledge, previous investigations have not explicitly disentangled the frequency of splice regulatory elements from the strength of the evolutionary constraint under which they evolve. Current data are consistent both with a scenario of weak and diffuse constraint, enveloping large swaths of sequence, as well as with well-defined pockets of strong purifying selection. In the former case, natural selection on exonic splice enhancers (ESEs) might primarily act as a slight modifier of codon usage bias. In the latter, mutations that disrupt ESEs are likely to have large fitness and, potentially, clinical effects. To distinguish between these scenarios, we used several different methods to determine the distribution of selection coefficients for new mutations within ESEs. The analyses converged to suggest that ∼15%-20% of fourfold degenerate sites are part of functional ESEs. Most of these sites are under strong evolutionary constraint. Therefore, exonic splice regulation does not simply impose a weak bias that gently nudges coding sequence evolution in a particular direction. Rather, the selection to preserve these motifs is a strong force that severely constrains the evolution of a substantial proportion of coding nucleotides. Thus synonymous mutations that disrupt ESEs should be considered as a potentially common cause of single-locus genetic disorders.
Affiliation(s)
- Rosina Savisaar
- The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, United Kingdom
- Laurence D Hurst
- The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, United Kingdom
3
Kainov YA, Aushev VN, Naumenko SA, Tchevkina EM, Bazykin GA. Complex Selection on Human Polyadenylation Signals Revealed by Polymorphism and Divergence Data. Genome Biol Evol 2016; 8:1971-9. [PMID: 27324920 PMCID: PMC4943204 DOI: 10.1093/gbe/evw137] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Accepted: 06/05/2016] [Indexed: 12/19/2022]
Abstract
Polyadenylation is a step of mRNA processing which is crucial for its expression and stability. The major polyadenylation signal (PAS) represents a nucleotide hexamer that adheres to the AATAAA consensus sequence. Over a half of human genes have multiple cleavage and polyadenylation sites, resulting in a great diversity of transcripts differing in function, stability, and translational activity. Here, we use available whole-genome human polymorphism data together with data on interspecies divergence to study the patterns of selection acting on PAS hexamers. Common variants of PAS hexamers are depleted of single nucleotide polymorphisms (SNPs), and SNPs within PAS hexamers have a reduced derived allele frequency (DAF) and increased conservation, indicating prevalent negative selection; at the same time, the SNPs that "improve" the PAS (i.e., those leading to higher cleavage efficiency) have increased DAF, compared to those that "impair" it. SNPs are rarer at PAS of "unique" polyadenylation sites (one site per gene); among alternative polyadenylation sites, at the distal PAS and at exonic PAS. Similar trends were observed in DAFs and divergence between species of placental mammals. Thus, selection permits PAS mutations mainly at redundant and/or weakly functional PAS. Nevertheless, a fraction of the SNPs at PAS hexamers likely affect gene functions; in particular, some of the observed SNPs are associated with disease.
Affiliation(s)
- Yaroslav A Kainov
- Centre for Developmental Neurobiology, King's College London, London, United Kingdom; Oncogenes Regulation Department, N.N. Blokhin Russian Cancer Research Center, Institute of Carcinogenesis, Moscow, Russia
- Vasily N Aushev
- Oncogenes Regulation Department, N.N. Blokhin Russian Cancer Research Center, Institute of Carcinogenesis, Moscow, Russia; Department of Preventive Medicine, Icahn School of Medicine at Mount Sinai, New York
- Sergey A Naumenko
- Institute for Information Transmission Problems (Kharkevich Institute) of the Russian Academy of Sciences, Moscow, Russia; Genetics and Genome Biology Program, The Hospital for Sick Children, Toronto, Canada
- Elena M Tchevkina
- Oncogenes Regulation Department, N.N. Blokhin Russian Cancer Research Center, Institute of Carcinogenesis, Moscow, Russia
- Georgii A Bazykin
- Institute for Information Transmission Problems (Kharkevich Institute) of the Russian Academy of Sciences, Moscow, Russia; Skolkovo Institute of Science and Technology, Skolkovo, Russia; Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Russia; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russia; Pirogov Russian National Research Medical University, Moscow, Russia
4
Evidence for stabilizing selection on codon usage in chromosomal rearrangements of Drosophila pseudoobscura. G3-GENES GENOMES GENETICS 2014; 4:2433-49. [PMID: 25326424 PMCID: PMC4267939 DOI: 10.1534/g3.114.014860] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 12/30/2022]
Abstract
There has been a renewed interest in investigating the role of stabilizing selection acting on genome-wide traits such as codon usage bias. Codon bias, when synonymous codons are used at unequal frequencies, occurs in a wide variety of taxa. Standard evolutionary models explain the maintenance of codon bias through a balance of genetic drift, mutation and weak purifying selection. The efficacy of selection is expected to be reduced in regions of suppressed recombination. Contrary to observations in Drosophila melanogaster, some recent studies have failed to detect a relationship between the recombination rate, intensity of selection acting at synonymous sites, and the magnitude of codon bias as predicted under these standard models. Here, we examined codon bias in 2798 protein coding loci on the third chromosome of D. pseudoobscura using whole-genome sequences of 47 individuals, representing five common third chromosome gene arrangements. Fine-scale recombination maps were constructed using more than 1 million segregating sites. As expected, recombination was demonstrated to be significantly suppressed between chromosome arrangements, allowing for a direct examination of the relationship between recombination, selection, and codon bias. As with other Drosophila species, we observe a strong mutational bias away from the most frequently used codons. We find the rate of synonymous and nonsynonymous polymorphism is variable between different amino acids. However, we do not observe a reduction in codon bias or the strength of selection in regions of suppressed recombination as expected. Instead, we find that the interaction between weak stabilizing selection and mutational bias likely plays a role in shaping the composition of synonymous codons across the third chromosome in D. pseudoobscura.
5
Gonsky R, Deem RL, Landers CJ, Haritunians T, Yang S, Targan SR. IFNG rs1861494 polymorphism is associated with IBD disease severity and functional changes in both IFNG methylation and protein secretion. Inflamm Bowel Dis 2014; 20:1794-801. [PMID: 25171510 PMCID: PMC4327845 DOI: 10.1097/mib.0000000000000172] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Mucosal expression of interferon (IFN)-γ plays a pivotal role in the pathogenesis of inflammatory bowel disease (IBD) and IBD risk regions flank IFNG. The conserved IFNG rs1861494 T/C introduces a new CpG methylation site, is associated with disease severity and lack of therapeutic response in other infectious and immune-mediated disorders, and is in linkage disequilibrium with an ulcerative colitis (UC) disease severity region. It seems likely that CpG-altering single nucleotide polymorphisms modify methylation and gene expression. This study evaluated the association between rs1861494 and clinical, serologic, and methylation patterns in patients with IBD.
METHODS: Peripheral T cells of UC and Crohn's disease (CD) patients were genotyped for rs1861494 and analyzed for allele-specific and IFNG promoter methylation. Serum antineutrophil cytoplasmic autoantibodies and IFN-γ secretion were measured by enzyme-linked immunosorbent assay and nucleoprotein complex formation by electrophoretic mobility shift assay.
RESULTS: IFNG rs1861494 T allele carriage in patients with IBD was associated with enhanced secretion of IFN-γ. T allele carriage was associated in UC with high levels of antineutrophil cytoplasmic autoantibodies and faster progression to colectomy. In CD, it was associated with complicated disease involving a stricturing/penetrating phenotype. Likewise, IFNG rs1861494 displayed genotype-specific modulation of DNA methylation and transcription factor complex formation.
CONCLUSIONS: This study reports the first association of IFNG rs1861494 T allele with enhanced IFN-γ secretion and known IBD clinical parameters indicative of more aggressive disease and serological markers associated with treatment resistance to anti-tumor necrosis factor therapy in patients with IBD. These data may be useful prognostically as predictors of early response to anti-tumor necrosis factor therapy to identify patients with IBD for improved personalized therapeutics.
Affiliation(s)
- Rivkah Gonsky
- F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California 90048 USA
- Richard L Deem
- F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California 90048 USA
- Carol J Landers
- F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California 90048 USA
- Talin Haritunians
- F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California 90048 USA
- Shaohong Yang
- F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California 90048 USA
- Stephan R Targan
- F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California 90048 USA
6
Kessler MD, Dean MD. Effective population size does not predict codon usage bias in mammals. Ecol Evol 2014; 4:3887-900. [PMID: 25505518 PMCID: PMC4242573 DOI: 10.1002/ece3.1249] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 07/22/2014] [Revised: 08/04/2014] [Accepted: 08/07/2014] [Indexed: 12/20/2022]
Abstract
Synonymous codons are not used at equal frequency throughout the genome, a phenomenon termed codon usage bias (CUB). It is often assumed that interspecific variation in the intensity of CUB is related to species differences in effective population sizes (Ne), with selection on CUB operating less efficiently in species with small Ne. Here, we specifically ask whether variation in Ne predicts differences in CUB in mammals and report two main findings. First, across 41 mammalian genomes, CUB was not correlated with two indirect proxies of Ne (body mass and generation time), even though there was statistically significant evidence of selection shaping CUB across all species. Interestingly, autosomal genes showed higher codon usage bias compared to X-linked genes, and high-recombination genes showed higher codon usage bias compared to low recombination genes, suggesting intraspecific variation in Ne predicts variation in CUB. Second, across six mammalian species with genetic estimates of Ne (human, chimpanzee, rabbit, and three mouse species: Mus musculus, M. domesticus, and M. castaneus), Ne and CUB were weakly and inconsistently correlated. At least in mammals, interspecific divergence in Ne does not strongly predict variation in CUB. One hypothesis is that each species responds to a unique distribution of selection coefficients, confounding any straightforward link between Ne and CUB.
Affiliation(s)
- Michael D Kessler
- Molecular and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, California, 90089
- Matthew D Dean
- Molecular and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, California, 90089
7
Abstract
Some species exhibit very high levels of DNA sequence variability; there is also evidence for the existence of heritable epigenetic variants that experience state changes at a much higher rate than sequence variants. In both cases, the resulting high diversity levels within a population (hyperdiversity) mean that standard population genetics methods are not trustworthy. We analyze a population genetics model that incorporates purifying selection, reversible mutations, and genetic drift, assuming a stationary population size. We derive analytical results for both population parameters and sample statistics and discuss their implications for studies of natural genetic and epigenetic variation. In particular, we find that (1) many more intermediate-frequency variants are expected than under standard models, even with moderately strong purifying selection, and (2) rates of evolution under purifying selection may be close to, or even exceed, neutral rates. These findings are related to empirical studies of sequence and epigenetic variation.
8
Du J, Dungan SZ, Sabouhanian A, Chang BSW. Selection on synonymous codons in mammalian rhodopsins: a possible role in optimizing translational processes. BMC Evol Biol 2014; 14:96. [PMID: 24884412 PMCID: PMC4021273 DOI: 10.1186/1471-2148-14-96] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/02/2013] [Accepted: 04/11/2014] [Indexed: 01/21/2023]
Abstract
Background: Synonymous codon usage can affect many cellular processes, particularly those associated with translation such as polypeptide elongation and folding, mRNA degradation/stability, and splicing. Highly expressed genes are thought to experience stronger selection pressures on synonymous codons. This should result in codon usage bias even in species with relatively low effective population sizes, like mammals, where synonymous site selection is thought to be weak. Here we use phylogenetic codon-based likelihood models to explore patterns of codon usage bias in a dataset of 18 mammalian rhodopsin sequences, the protein mediating the first step in vision in the eye, and one of the most highly expressed genes in vertebrates. We use these patterns to infer selection pressures on key translational mechanisms including polypeptide elongation, protein folding, mRNA stability, and splicing.
Results: Overall, patterns of selection in mammalian rhodopsin appear to be correlated with post-transcriptional and translational processes. We found significant evidence for selection at synonymous sites using phylogenetic mutation-selection likelihood models, with C-ending codons found to have the highest relative fitness, and to be significantly more abundant at conserved sites. In general, these codons corresponded with the most abundant tRNAs in mammals. We found significant differences in codon usage bias between rhodopsin loops versus helices, though there was no significant difference in mean synonymous substitution rate between these motifs. We also found a significantly higher proportion of GC-ending codons at paired sites in rhodopsin mRNA secondary structure, and significantly lower synonymous mutation rates in putative exonic splicing enhancer (ESE) regions than in non-ESE regions.
Conclusions: By focusing on a single highly expressed gene we both distinguish synonymous codon selection from mutational effects and analytically explore underlying functional mechanisms. Our results suggest that codon bias in mammalian rhodopsin arises from selection to optimally balance high overall translational speed, accuracy, and proper protein folding, especially in structurally complicated regions. Selection at synonymous sites may also be contributing to mRNA stability and splicing efficiency at exonic-splicing-enhancer (ESE) regions. Our results highlight the importance of investigating highly expressed genes in a broader phylogenetic context in order to better understand the evolution of synonymous substitutions.
Affiliation(s)
- Belinda S W Chang
- Department of Ecology & Evolutionary Biology, University of Toronto, 25 Harbord Street, Toronto, ON M5S 3G5, Canada.
9
Charlesworth B. Stabilizing selection, purifying selection, and mutational bias in finite populations. Genetics 2013; 194:955-71. [PMID: 23709636 PMCID: PMC3730922 DOI: 10.1534/genetics.113.151555] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 03/22/2013] [Accepted: 05/18/2013] [Indexed: 12/16/2022]
Abstract
Genomic traits such as codon usage and the lengths of noncoding sequences may be subject to stabilizing selection rather than purifying selection. Mutations affecting these traits are often biased in one direction. To investigate the potential role of stabilizing selection on genomic traits, the effects of mutational bias on the equilibrium value of a trait under stabilizing selection in a finite population were investigated, using two different mutational models. Numerical results were generated using a matrix method for calculating the probability distribution of variant frequencies at sites affecting the trait, as well as by Monte Carlo simulations. Analytical approximations were also derived, which provided useful insights into the numerical results. A novel conclusion is that the scaled intensity of selection acting on individual variants is nearly independent of the effective population size over a wide range of parameter space and is strongly determined by the logarithm of the mutational bias parameter. This is true even when there is a very small departure of the mean from the optimum, as is usually the case. This implies that studies of the frequency spectra of DNA sequence variants may be unable to distinguish between stabilizing and purifying selection. A similar investigation of purifying selection against deleterious mutations was also carried out. Contrary to previous suggestions, the scaled intensity of purifying selection with synergistic fitness effects is sensitive to population size, which is inconsistent with the general lack of sensitivity of codon usage to effective population size.
Affiliation(s)
- Brian Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, United Kingdom.
10
Joyner-Matos J, Hicks KA, Cousins D, Keller M, Denver DR, Baer CF, Estes S. Evolution of a higher intracellular oxidizing environment in Caenorhabditis elegans under relaxed selection. PLoS One 2013; 8:e65604. [PMID: 23776511 PMCID: PMC3679170 DOI: 10.1371/journal.pone.0065604] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 02/05/2013] [Accepted: 04/29/2013] [Indexed: 01/22/2023]
Abstract
We explored the relationship between relaxed selection, oxidative stress, and spontaneous mutation in a set of mutation-accumulation (MA) lines of the nematode Caenorhabditis elegans and in their common ancestor. We measured steady-state levels of free radicals and oxidatively damaged guanosine nucleosides in the somatic tissues of five MA lines for which nuclear genome base substitution and GC-TA transversion frequencies are known. The two markers of oxidative stress are highly correlated and are elevated in the MA lines relative to the ancestor; point estimates of the per-generation rate of mutational decay (ΔM) of these measures of oxidative stress are similar to those reported for fitness-related traits. Conversely, there is no significant relationship between either marker of oxidative stress and the per-generation frequencies of base substitution or GC-TA transversion. Although these results provide no direct evidence for a causative relationship between oxidative damage and base substitution mutations, to the extent that oxidative damage may be weakly mutagenic in the germline, the case for condition-dependent mutation is advanced.
Affiliation(s)
- Joanna Joyner-Matos
- Department of Biology, Eastern Washington University, Cheney, Washington, United States of America.
11
12
Shabalina SA, Spiridonov NA, Kashina A. Sounds of silence: synonymous nucleotides as a key to biological regulation and complexity. Nucleic Acids Res 2013; 41:2073-94. [PMID: 23293005 PMCID: PMC3575835 DOI: 10.1093/nar/gks1205] [Citation(s) in RCA: 187] [Impact Index Per Article: 17.0] [Indexed: 01/10/2023]
Abstract
Messenger RNA is a key component of an intricate regulatory network of its own. It accommodates numerous nucleotide signals that overlap protein coding sequences and are responsible for multiple levels of regulation and generation of biological complexity. A wealth of structural and regulatory information, which mRNA carries in addition to the encoded amino acid sequence, raises the question of how these signals and overlapping codes are delineated along non-synonymous and synonymous positions in protein coding regions, especially in eukaryotes. Silent or synonymous codon positions, which do not determine amino acid sequences of the encoded proteins, define mRNA secondary structure and stability and affect the rate of translation, folding and post-translational modifications of nascent polypeptides. The RNA level selection is acting on synonymous sites in both prokaryotes and eukaryotes and is more common than previously thought. Selection pressure on the coding gene regions follows three-nucleotide periodic pattern of nucleotide base-pairing in mRNA, which is imposed by the genetic code. Synonymous positions of the coding regions have a higher level of hybridization potential relative to non-synonymous positions, and are multifunctional in their regulatory and structural roles. Recent experimental evidence and analysis of mRNA structure and interspecies conservation suggest that there is an evolutionary tradeoff between selective pressure acting at the RNA and protein levels. Here we provide a comprehensive overview of the studies that define the role of silent positions in regulating RNA structure and processing that exert downstream effects on proteins and their functions.
Affiliation(s)
- Svetlana A Shabalina
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20984, USA.
13
Povolotskaya IS, Kondrashov FA, Ledda A, Vlasov PK. Stop codons in bacteria are not selectively equivalent. Biol Direct 2012; 7:30. [PMID: 22974057 PMCID: PMC3549826 DOI: 10.1186/1745-6150-7-30] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 05/13/2012] [Accepted: 08/22/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The evolution and genomic stop codon frequencies have not been rigorously studied with the exception of coding of non-canonical amino acids. Here we study the rate of evolution and frequency distribution of stop codons in bacterial genomes.
RESULTS: We show that in bacteria stop codons evolve slower than synonymous sites, suggesting the action of weak negative selection. However, the frequency of stop codons relative to genomic nucleotide content indicated that this selection regime is not straightforward. The frequency of TAA and TGA stop codons is GC-content dependent, with TAA decreasing and TGA increasing with GC-content, while TAG frequency is independent of GC-content. Applying a formal, analytical model to these data we found that the relationship between stop codon frequencies and nucleotide content cannot be explained by mutational biases or selection on nucleotide content. However, with weak nucleotide content-dependent selection on TAG, -0.5 < Nes < 1.5, the model fits all of the data and recapitulates the relationship between TAG and nucleotide content. For biologically plausible rates of mutations we show that, in bacteria, TAG stop codon is universally associated with lower fitness, with TAA being the optimal for G-content < 16% while for G-content > 16% TGA has a higher fitness than TAG.
CONCLUSIONS: Our data indicate that TAG codon is universally suboptimal in the bacterial lineage, such that TAA is likely to be the preferred stop codon for low GC content while the TGA is the preferred stop codon for high GC content. The optimization of stop codon usage may therefore be useful in genome engineering or gene expression optimization applications.
Affiliation(s)
- Inna S Povolotskaya
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, 88 Dr. Aiguader, Barcelona 08003, Spain
14
Akashi H, Osada N, Ohta T. Weak selection and protein evolution. Genetics 2012; 192:15-31. [PMID: 22964835 PMCID: PMC3430532 DOI: 10.1534/genetics.112.140178] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.6] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 01/23/2023]
Abstract
The "nearly neutral" theory of molecular evolution proposes that many features of genomes arise from the interaction of three weak evolutionary forces: mutation, genetic drift, and natural selection acting at its limit of efficacy. Such forces generally have little impact on allele frequencies within populations from generation to generation but can have substantial effects on long-term evolution. The evolutionary dynamics of weakly selected mutations are highly sensitive to population size, and near neutrality was initially proposed as an adjustment to the neutral theory to account for general patterns in available protein and DNA variation data. Here, we review the motivation for the nearly neutral theory, discuss the structure of the model and its predictions, and evaluate current empirical support for interactions among weak evolutionary forces in protein evolution. Near neutrality may be a prevalent mode of evolution across a range of functional categories of mutations and taxa. However, multiple evolutionary mechanisms (including adaptive evolution, linked selection, changes in fitness-effect distributions, and weak selection) can often explain the same patterns of genome variation. Strong parameter sensitivity remains a limitation of the nearly neutral model, and we discuss concave fitness functions as a plausible underlying basis for weak selection.
Affiliation(s)
- Hiroshi Akashi
- Division of Evolutionary Genetics, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan.
15
On parameters of the human genome. J Theor Biol 2011; 288:92-104. [DOI: 10.1016/j.jtbi.2011.07.021] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 04/13/2011] [Revised: 06/28/2011] [Accepted: 07/21/2011] [Indexed: 02/06/2023]
16
|
Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K. Statistics and truth in phylogenomics. Mol Biol Evol 2011; 29:457-72. [PMID: 21873298 DOI: 10.1093/molbev/msr202] [Citation(s) in RCA: 164] [Impact Index Per Article: 12.6] [Indexed: 01/23/2023]
Abstract
Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that the assessment of the robustness of results to biological factors, that may systematically mislead (bias) the outcomes of statistical estimation, will be a key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
Affiliation(s)
- Sudhir Kumar
- Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University, Arizona, USA.
|
17
|
Abstract
Summary: Population genomics is the study of the amount and causes of genome-wide variability in natural populations, a topic that has been under discussion since Darwin. This paper first briefly reviews the early development of molecular approaches to the subject: the pioneering unbiased surveys of genetic variability at multiple loci by means of gel electrophoresis and restriction enzyme mapping. The results of surveys of levels of genome-wide variability using DNA resequencing studies are then discussed. Studies of the extent to which variability for different classes of variants (non-synonymous, synonymous and non-coding) are affected by natural selection, or other directional forces such as biased gene conversion, are also described. Finally, the effects of deleterious mutations on population fitness and the possible role of Hill–Robertson interference in shaping patterns of sequence variability are discussed.
|
18
|
Baer CF, Joyner-Matos J, Ostrow D, Grigaltchik V, Salomon MP, Upadhyay A. Rapid decline in fitness of mutation accumulation lines of gonochoristic (outcrossing) Caenorhabditis nematodes. Evolution 2011; 64:3242-53. [PMID: 20649813 DOI: 10.1111/j.1558-5646.2010.01061.x] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 12/11/2022]
Abstract
Evolutionary theory predicts that the strength of natural selection to reduce the mutation rate should be stronger in self-fertilizing than in outcrossing taxa. However, the relative efficacy of selection on mutation rate relative to the many other factors influencing the evolution of any species is poorly understood. To address this question, we allowed mutations to accumulate for ∼100 generations in several sets of "mutation accumulation" (MA) lines in three species of gonochoristic (dioecious) Caenorhabditis (C. remanei, C. brenneri, C. sp. 5) as well as in a dioecious strain of the historically self-fertile hermaphrodite C. elegans. In every case, the rate of mutational decay is substantially greater in the gonochoristic taxa than in C. elegans (∼4× greater on average). Residual heterozygosity in the ancestral controls of these MA lines introduces some complications in interpreting the results, but circumstantial evidence suggests the results are not primarily due to inbreeding depression resulting from residual segregating variation. The results suggest that natural selection operates to optimize the mutation rate in Caenorhabditis and that the strength (or efficiency) of selection differs consistently on the basis of mating system, as predicted by theory. However, context-dependent environmental and/or synergistic epistasis could also explain the results.
Affiliation(s)
- Charles F Baer
- Department of Biology, University of Florida, Gainesville, Florida 32611-8525, USA.
|
19
|
Misawa K, Kikuno RF. Relationship between amino acid composition and gene expression in the mouse genome. BMC Res Notes 2011; 4:20. [PMID: 21272306 PMCID: PMC3038927 DOI: 10.1186/1756-0500-4-20] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/25/2010] [Accepted: 01/27/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Codon bias is a phenomenon that refers to the differences in the frequencies of synonymous codons among different genes. In many organisms, natural selection is considered to be a cause of codon bias because codon usage in highly expressed genes is biased toward optimal codons. Methods have previously been developed to predict the expression level of genes from their nucleotide sequences, which is based on the observation that synonymous codon usage shows an overall bias toward a few codons called major codons. However, the relationship between codon bias and gene expression level, as proposed by the translation-selection model, is less evident in mammals. FINDINGS We investigated the correlations between the expression levels of 1,182 mouse genes and amino acid composition, as well as between gene expression and codon preference. We found that a weak but significant correlation exists between gene expression levels and amino acid composition in mouse. In total, less than 10% of variation of expression levels is explained by amino acid components. We found the effect of codon preference on gene expression was weaker than the effect of amino acid composition, because no significant correlations were observed with respect to codon preference. CONCLUSION These results suggest that it is difficult to predict expression level from amino acid components or from codon bias in mouse.
Affiliation(s)
- Kazuharu Misawa
- Research Program for Computational Science, Research and Development Group for Next-Generation Integrated Living Matter Simulation, Fusion of Data and Analysis Research and Development Team, RIKEN, 4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan.
|
20
|
CpG island clusters and pro-epigenetic selection for CpGs in protein-coding exons of HOX and other transcription factors. Proc Natl Acad Sci U S A 2010; 107:15485-90. [PMID: 20716685 DOI: 10.1073/pnas.1010506107] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 02/07/2023]
Abstract
CpG dinucleotides contribute to epigenetic mechanisms by being the only site for DNA methylation in mammalian somatic cells. They are also mutation hotspots and approximately 5-fold depleted genome-wide. We report here a study focused on CpG sites in the coding regions of Hox and other transcription factor genes, comparing methylated genomes of Homo sapiens, Mus musculus, and Danio rerio with nonmethylated genomes of Drosophila melanogaster and Caenorhabditis elegans. We analyzed 4-fold degenerate, synonymous codons with the potential for CpG. That is, we studied "silent" changes that do not affect protein products but could damage epigenetic marking. We find that DNA-binding transcription factors and other developmentally relevant genes show, only in methylated genomes, a bimodal distribution of CpG usage. Several genetic code-based tests indicate, again for methylated genomes only, that the frequency of silent CpGs in Hox genes is much greater than expectation. Also informative are NCG-GNN and NCC-GNN codon doublets, for which an unusually high rate of G to C and C to G transversions was observed at the third (silent) position of the first codon. Together these results are interpreted as evidence for strong "pro-epigenetic" selection acting to preserve CpG sites in coding regions of many genes controlling development. We also report that DNA-binding transcription factors and developmentally important genes are dramatically overrepresented in or near clusters of three or more CpG islands, suggesting a possible relationship between evolutionary preservation of CpG dinucleotides in both coding regions and CpG islands.
|
21
|
Abstract
The accumulation of base substitutions (mutations) not subject to natural selection is the neutral mutation rate. Because this rate reflects the in vivo processes involved in maintaining the integrity of genetic information, the factors that affect the neutral mutation rate are of considerable interest. Mammals exhibit two dramatically different neutral mutation rates: the CpG mutation rate, wherein the C of most CpGs (i.e., methyl-CpG) mutates at 10-50 times the rate of C in any other context or of any other base. The latter mutations constitute the non-CpG rate. The high CpG rate results from the spontaneous deamination of methyl-C to T and incomplete restoration of the ensuing T:G mismatches to C:Gs. Here, we determined the neutral non-CpG mutation rate as a function of CpG content by comparing sequence divergence of thousands of pairs of neutrally evolving chimpanzee and human orthologs that differ primarily in CpG content. Both the mutation rate and the mutational spectrum (transition/transversion ratio) of non-CpG residues change in parallel as sigmoidal (logistic) functions of CpG content. As different mechanisms generate transitions and transversions, these results indicate that both mutation rate and mutational processes are contingent on the local CpG content. We consider several possible mechanisms that might explain how CpG exerts these effects.
Affiliation(s)
- Jean-Claude Walser
- Section on Genomic Structure and Function, Laboratory of Molecular and Cellular Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892-0830, USA
|
22
|
Kondrashov FA, Kondrashov AS. Measurements of spontaneous rates of mutations in the recent past and the near future. Philos Trans R Soc Lond B Biol Sci 2010; 365:1169-76. [PMID: 20308091 PMCID: PMC2871817 DOI: 10.1098/rstb.2009.0286] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Indexed: 12/13/2022]
Abstract
The rate of spontaneous mutation in natural populations is a fundamental parameter for many evolutionary phenomena. Because the rate of mutation is generally low, most of what is currently known about mutation has been obtained through indirect, complex and imprecise methodological approaches. However, in the past few years genome-wide sequencing of closely related individuals has made it possible to estimate the rates of mutation directly at the level of the DNA, avoiding most of the problems associated with using indirect methods. Here, we review the methods used in the past with an emphasis on next generation sequencing, which may soon make the accurate measurement of spontaneous mutation rates a matter of routine.
Affiliation(s)
- Fyodor A Kondrashov
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation, C/Dr. Aiguader 88, Barcelona Biomedical Research Park Building 08003, Barcelona, Spain.
|
23
|
Abstract
Although mutation provides the fuel for phenotypic evolution, it also imposes a substantial burden on fitness through the production of predominantly deleterious alleles, a matter of concern from a human-health perspective. Here, recently established databases on de novo mutations for monogenic disorders are used to estimate the rate and molecular spectrum of spontaneously arising mutations and to derive a number of inferences with respect to eukaryotic genome evolution. Although the human per-generation mutation rate is exceptionally high, on a per-cell division basis, the human germline mutation rate is lower than that recorded for any other species. Comparison with data from other species demonstrates a universal mutational bias toward A/T composition, and leads to the hypothesis that genome-wide nucleotide composition generally evolves to the point at which the power of selection in favor of G/C is approximately balanced by the power of random genetic drift, such that variation in equilibrium genome-wide nucleotide composition is largely defined by variation in mutation biases. Quantification of the hazards associated with introns reveals that mutations at key splice-site residues are a major source of human mortality. Finally, a consideration of the long-term consequences of current human behavior for deleterious-mutation accumulation leads to the conclusion that a substantial reduction in human fitness can be expected over the next few centuries in industrialized societies unless novel means of genetic intervention are developed.
Affiliation(s)
- Michael Lynch
- Department of Biology, Indiana University, Bloomington, IN 47405, USA.
|
24
|
Medvedeva YA, Fridman MV, Oparina NJ, Malko DB, Ermakova EO, Kulakovskiy IV, Heinzel A, Makeev VJ. Intergenic, gene terminal, and intragenic CpG islands in the human genome. BMC Genomics 2010; 11:48. [PMID: 20085634 PMCID: PMC2817693 DOI: 10.1186/1471-2164-11-48] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Received: 02/26/2009] [Accepted: 01/19/2010] [Indexed: 11/10/2022]
Abstract
Background Recently, it has been discovered that the human genome contains many transcription start sites for non-coding RNA. Regulatory regions related to transcription of these non-coding RNAs are poorly studied. Some of these regulatory regions may be associated with CpG islands located far from transcription start-sites of any protein coding gene. The human genome contains many such CpG islands; however, until now their properties were not systematically studied. Results We studied CpG islands located in different regions of the human genome using methods of bioinformatics and comparative genomics. We have observed that CpG islands have a preference to overlap with exons, including exons located far from transcription start site, but usually extend well into introns. Synonymous substitution rate of CpG-containing codons becomes substantially reduced in regions where CpG islands overlap with protein-coding exons, even if they are located far downstream from transcription start site. CAGE tag analysis displayed frequent transcription start sites in all CpG islands, including those found far from transcription start sites of protein coding genes. Computational prediction and analysis of published ChIP-chip data revealed that CpG islands contain an increased number of sites recognized by Sp1 protein. CpG islands containing more CAGE tags usually also contain more Sp1 binding sites. This is especially relevant for CpG islands located in 3' gene regions. Various examples of transcription, confirmed by mRNAs or ESTs, but with no evidence of protein coding genes, were found in CAGE-enriched CpG islands located far from transcription start site of any known protein coding gene. Conclusions CpG islands located far from transcription start sites of protein coding genes have transcription initiation activity and display Sp1 binding properties. In exons, overlapping with these islands, the synonymous substitution rate of CpG containing codons is decreased. This suggests that these CpG islands are involved in transcription initiation, possibly of some non-coding RNAs.
Affiliation(s)
- Yulia A Medvedeva
- Research Institute for Genetics and Selection of Industrial Microorganisms, Genetika, 1st Dorozhny proezd, 1, Moscow, 117545, Russia.
|
25
|
Eyre-Walker A, Keightley PD. Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change. Mol Biol Evol 2009; 26:2097-108. [PMID: 19535738 DOI: 10.1093/molbev/msp119] [Citation(s) in RCA: 298] [Impact Index Per Article: 19.9] [Indexed: 12/19/2022]
Abstract
The prevalence of adaptive evolution relative to genetic drift is a central problem in molecular evolution. Methods to estimate the fraction of adaptive nucleotide substitutions (alpha) have been developed, based on the McDonald-Kreitman test, that contrast polymorphism and divergence between selectively and neutrally evolving sites. However, these methods are expected to give downwardly biased estimates of alpha if there are slightly deleterious mutations, because these inflate polymorphism relative to divergence. Here, we estimate alpha by simultaneously estimating the distribution of fitness effects of new mutations at selected sites from the site frequency spectrum and the number of adaptive substitutions. We test the method using simulations. If data meet the assumptions of the analysis model, estimates of alpha show little bias, even when there is little or no recombination. However, population size differences between the divergence and polymorphism phases may cause alpha to be over or underestimated by a predictable factor that depends on the magnitude of the population size change and the shape of the distribution of effects of deleterious mutations. We analyze several data sets of protein-coding genes and noncoding regions from hominids and Drosophila. In Drosophila genes, we estimate that approximately 50% of amino acid substitutions and approximately 20% of substitutions in introns are adaptive. In protein-coding and noncoding data sets of humans, comparison to macaque sequences reveals little evidence for adaptive substitutions. However, the true frequency of adaptive substitutions in human-coding DNA could be as high as 40%, because estimates based on current polymorphism may be strongly downwardly biased by a decrease in the effective population size along the human lineage.
Affiliation(s)
- Adam Eyre-Walker
- Centre for the Study of Evolution and School of Life Sciences, University of Sussex, Brighton, United Kingdom
|
26
|
Li JB, Gao Y, Aach J, Zhang K, Kryukov GV, Xie B, Ahlford A, Yoon JK, Rosenbaum AM, Zaranek AW, LeProust E, Sunyaev SR, Church GM. Multiplex padlock targeted sequencing reveals human hypermutable CpG variations. Genome Res 2009; 19:1606-15. [PMID: 19525355 DOI: 10.1101/gr.092213.109] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Indexed: 11/24/2022]
Abstract
Utilizing the full power of next-generation sequencing often requires the ability to perform large-scale multiplex enrichment of many specific genomic loci in multiple samples. Several technologies have been recently developed but await substantial improvements. We report the 10,000-fold improvement of a previously developed padlock-based approach, and apply the assay to identifying genetic variations in hypermutable CpG regions across human chromosome 21. From approximately 3 million reads derived from a single Illumina Genome Analyzer lane, approximately 94% (approximately 50,500) target sites can be observed with at least one read. The uniformity of coverage was also greatly improved; up to 93% and 57% of all targets fell within a 100- and 10-fold coverage range, respectively. Alleles at >400,000 target base positions were determined across six subjects and examined for single nucleotide polymorphisms (SNPs), and the concordance with independently obtained genotypes was 98.4%-100%. We detected >500 SNPs not currently in dbSNP, 362 of which were in targeted CpG locations. Transitions in CpG sites were at least 13.7 times more abundant than non-CpG transitions. Fractions of polymorphic CpG sites are lower in CpG-rich regions and show higher correlation with human-chimpanzee divergence within CpG versus non-CpG sites. This is consistent with the hypothesis that methylation rate heterogeneity along chromosomes contributes to mutation rate variation in humans. Our success suggests that targeted CpG resequencing is an efficient way to identify common and rare genetic variations. In addition, the significantly improved padlock capture technology can be readily applied to other projects that require multiplex sample preparation.
Affiliation(s)
- Jin Billy Li
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.
|
27
|
Phillips N, Salomon M, Custer A, Ostrow D, Baer CF. Spontaneous mutational and standing genetic (co)variation at dinucleotide microsatellites in Caenorhabditis briggsae and Caenorhabditis elegans. Mol Biol Evol 2008; 26:659-69. [PMID: 19109257 DOI: 10.1093/molbev/msn287] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 01/04/2023]
Abstract
Understanding the evolutionary processes responsible for shaping genetic variation within and between species requires separating the effects of mutation and selection. Differences between the patterns of genetic variation observed in nature and when mutations are allowed to accumulate in the relative absence of selection can reveal biases imposed by selection. We characterize the genetic variation at dinucleotide microsatellite repeats in four sets of 250-generation mutation accumulation (MA) lines, two in the species Caenorhabditis briggsae and two in Caenorhabditis elegans, and compare the mutational variation with the standing variation in those species. We also compare the mutational properties of microsatellites with the cumulative effects of mutations on fitness in the same lines. Integrated over the whole genome, we infer that the mutation rate of C. briggsae is about twice that of C. elegans, consistent with the cumulative mutational effects on fitness. The mutational spectrum (ratio of insertions to deletions) differs between repeat types and, in some cases, between species. The per-locus mutation rate is significantly positively correlated with the standing genetic variation at the same locus in both species, providing justification for the common practice of using the standing genetic variance as a surrogate for the mutation rate.
|
28
|
Roy M, Kim N, Xing Y, Lee C. The effect of intron length on exon creation ratios during the evolution of mammalian genomes. RNA (New York, N.Y.) 2008; 14:2261-73. [PMID: 18796579 PMCID: PMC2578852 DOI: 10.1261/rna.1024908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/10/2023]
Abstract
Recent studies report that alternatively spliced exons tend to occur in longer introns, which is attributed to the length constraints for splice site pairing for the two major splicing mechanisms, intron definition versus exon definition. Using genome-wide studies of EST and microarray data from human and mouse, we have analyzed the distribution of various subsets of alternatively spliced exons, based on their inclusion level and evolutionary history, versus increasing intron length. Alternative exons may be included in either a major or minor fraction of all transcripts (known as major-form and minor-form exons, respectively). We find that major-form exons are seven- to eightfold more likely to be contained in short introns (<400 nt) than minor-form exons, which occur preferentially in longer introns. Since minor-form exons are more likely to be novel (approximately 75%), this implied that novel exons arise more frequently in longer introns. To test this hypothesis, we used whole genome alignments to classify exons according to their phylogenetic age. We find that older exons, i.e., exons that are conserved in all mammals, predominate at shorter intron lengths, for both major- and minor-form exons. In contrast, exons that arose recently during primate evolution are more prevalent at longer intron lengths (>1000 nt). This suggests that the observed correlation of longer intron lengths with alternatively spliced exons may be at least partly due to biases in the probability of exon creation, which is higher in long introns.
Affiliation(s)
- Meenakshi Roy
- Molecular Biology Institute, University of California, Los Angeles, California 90024, USA
|
29
|
Schmidt S, Gerasimova A, Kondrashov FA, Adzuhbei IA, Kondrashov AS, Sunyaev S. Hypermutable non-synonymous sites are under stronger negative selection. PLoS Genet 2008; 4:e1000281. [PMID: 19043566 PMCID: PMC2583910 DOI: 10.1371/journal.pgen.1000281] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 12/08/2007] [Accepted: 10/27/2008] [Indexed: 12/04/2022]
Abstract
Mutation rate varies greatly between nucleotide sites of the human genome and depends both on the global genomic location and the local sequence context of a site. In particular, CpG context elevates the mutation rate by an order of magnitude. Mutations also vary widely in their effect on the molecular function, phenotype, and fitness. Independence of the probability of occurrence of a new mutation from its effect has been a fundamental premise in genetics. However, highly mutable contexts may be preserved by negative selection at important sites but destroyed by mutation at sites under no selection. Thus, there may be a positive correlation between the rate of mutations at a nucleotide site and the magnitude of their effect on fitness. We studied the impact of CpG context on the rate of human-chimpanzee divergence and on intrahuman nucleotide diversity at non-synonymous coding sites. We compared nucleotides that occupy identical positions within codons of identical amino acids and only differ by being within versus outside CpG context. Nucleotides within CpG context are under stronger negative selection, as revealed by their rate of evolution and nucleotide diversity, which are lower in proportion to the mutation rate. In particular, the probability of fixation of a non-synonymous transition at a CpG site is two times lower than at a non-CpG site. Thus, sites with different mutation rates are not necessarily selectively equivalent. This suggests that the mutation rate may complement sequence conservation as a characteristic predictive of functional importance of nucleotide sites.
Affiliation(s)
- Steffen Schmidt
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Department of Biochemistry, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Anna Gerasimova
- Life Sciences Institute, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Fyodor A. Kondrashov
- Section on Ecology, Behavior, and Evolution, Division of Biological Sciences, University of California San Diego, La Jolla, California, United States of America
- Ivan A. Adzuhbei
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Alexey S. Kondrashov
- Life Sciences Institute, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Shamil Sunyaev
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
|
30
|
Gaffney DJ, Keightley PD. Effect of the assignment of ancestral CpG state on the estimation of nucleotide substitution rates in mammals. BMC Evol Biol 2008; 8:265. [PMID: 18826599 PMCID: PMC2576242 DOI: 10.1186/1471-2148-8-265] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 04/29/2008] [Accepted: 09/30/2008] [Indexed: 12/03/2022]
Abstract
BACKGROUND Molecular evolutionary studies in mammals often estimate nucleotide substitution rates within and outside CpG dinucleotides separately. Frequently, in alignments of two sequences, the division of sites into CpG and non-CpG classes is based simply on the presence or absence of a CpG dinucleotide in either sequence, a procedure that we refer to as CpG/non-CpG assignment. Although it is likely that this procedure is biased, it is generally assumed that the bias is negligible if species are very closely related. RESULTS Using simulations of DNA sequence evolution we show that assignment of the ancestral CpG state based on the simple presence/absence of the CpG dinucleotide can seriously bias estimates of the substitution rate, because many true non-CpG changes are misassigned as CpG. Paradoxically, this bias is most severe between closely related species, because a minimum of two substitutions are required to misassign a true ancestral CpG site as non-CpG whereas only a single substitution is required to misassign a true ancestral non-CpG site as CpG in a two branch tree. We also show that CpG misassignment bias differentially affects fourfold degenerate and noncoding sites due to differences in base composition such that fourfold degenerate sites can appear to be evolving more slowly than noncoding sites. We demonstrate that the effects predicted by our simulations occur in a real evolutionary setting by comparing substitution rates estimated from human-chimp coding and intronic sequence using CpG/non-CpG assignment with estimates derived from a method that is largely free from bias. CONCLUSION Our study demonstrates that a common method of assigning sites into CpG and non-CpG classes in pairwise alignments is seriously biased and recommends against the adoption of ad hoc methods of ancestral state assignment.
Affiliation(s)
- Daniel J Gaffney
- McGill University and Genome Québec Innovation Centre, 740 ave Dr Penfield Rm 7208, Montréal (Québec), H3A 1A4, Canada
- Peter D Keightley
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
|
31
|
Ramensky VE, Nurtdinov RN, Neverov AD, Mironov AA, Gelfand MS. Positive selection in alternatively spliced exons of human genes. Am J Hum Genet 2008; 83:94-8. [PMID: 18571144 DOI: 10.1016/j.ajhg.2008.05.017] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2008] [Revised: 04/08/2008] [Accepted: 05/30/2008] [Indexed: 10/21/2022] Open
Abstract
Alternative splicing is a well-recognized mechanism of accelerated genome evolution. We have studied single-nucleotide polymorphisms and human-chimpanzee divergence in the exons of 6672 alternatively spliced human genes, with the aim of understanding the forces driving the evolution of alternatively spliced sequences. Here, we show that alternatively spliced exons and exon fragments (alternative exons) from minor isoforms experience lower selective pressure at the amino acid level, accompanied by selection against synonymous sequence variation. The results of the McDonald-Kreitman test suggest that alternatively spliced exons, unlike exons constitutively included in the mRNA, are also subject to positive selection, with up to 27% of amino acids fixed by positive selection.
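For readers unfamiliar with how a McDonald-Kreitman table yields a "fraction fixed by positive selection", a minimal sketch follows. It uses the widely used estimator alpha = 1 - (Ds * Pn) / (Dn * Ps), where Dn and Ds are nonsynonymous and synonymous fixed differences and Pn and Ps the corresponding polymorphism counts. The abstract does not state which estimator the authors used, and the function name `mk_alpha` and the counts below are invented for illustration.

```python
def mk_alpha(dn, ds, pn, ps):
    """Estimate the proportion of nonsynonymous substitutions fixed by
    positive selection from a 2x2 McDonald-Kreitman table.

    dn, ds: nonsynonymous / synonymous fixed differences (divergence)
    pn, ps: nonsynonymous / synonymous polymorphisms
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero for alpha to be defined")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical counts, not taken from the paper.
alpha = mk_alpha(dn=50, ds=100, pn=30, ps=80)
print(f"alpha = {alpha:.2f}")
```

A positive alpha indicates an excess of nonsynonymous divergence relative to polymorphism; a negative value is usually read as a sign of segregating slightly deleterious variants rather than as a negative proportion.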
|
32
|
Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, White TJ, Nielsen R, Clark AG, Bustamante CD. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 2008; 4:e1000083. [PMID: 18516229 PMCID: PMC2377339 DOI: 10.1371/journal.pgen.1000083] [Citation(s) in RCA: 462] [Impact Index Per Article: 28.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2007] [Accepted: 04/29/2008] [Indexed: 11/19/2022] Open
Abstract
Quantifying the distribution of fitness effects among newly arising mutations in the human genome is key to resolving important debates in medical and evolutionary genetics. Here, we present a method for inferring this distribution using Single Nucleotide Polymorphism (SNP) data from a population with non-stationary demographic history (such as that of modern humans). Application of our method to 47,576 coding SNPs found by direct resequencing of 11,404 protein coding-genes in 35 individuals (20 European Americans and 15 African Americans) allows us to assess the relative contribution of demographic and selective effects to patterning amino acid variation in the human genome. We find evidence of an ancient population expansion in the sample with African ancestry and a relatively recent bottleneck in the sample with European ancestry. After accounting for these demographic effects, we find strong evidence for great variability in the selective effects of new amino acid replacing mutations. In both populations, the patterns of variation are consistent with a leptokurtic distribution of selection coefficients (e.g., gamma or log-normal) peaked near neutrality. Specifically, we predict 27-29% of amino acid changing (nonsynonymous) mutations are neutral or nearly neutral (|s|<0.01%), 30-42% are moderately deleterious (0.01%<|s|<1%), and nearly all the remainder are highly deleterious or lethal (|s|>1%). Our results are consistent with 10-20% of amino acid differences between humans and chimpanzees having been fixed by positive selection with the remainder of differences being neutral or nearly neutral. Our analysis also predicts that many of the alleles identified via whole-genome association mapping may be selectively neutral or (formerly) positively selected, implying that deleterious genetic variation affecting disease phenotype may be missed by this widely used approach for mapping genes underlying complex traits.
Affiliation(s)
- Adam R. Boyko
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Scott H. Williamson
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Amit R. Indap
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Jeremiah D. Degenhardt
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Ryan D. Hernandez
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Kirk E. Lohmueller
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Mark D. Adams
- Department of Genetics, BRB-624, Case Western Reserve University, Cleveland, Ohio, United States of America
- Steffen Schmidt
- Division of Genetics, Department of Medicine, Brigham & Women's Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
- John J. Sninsky
- Celera Diagnostics, Alameda, California, United States of America
- Shamil R. Sunyaev
- Division of Genetics, Department of Medicine, Brigham & Women's Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
- Thomas J. White
- Celera Diagnostics, Alameda, California, United States of America
- Rasmus Nielsen
- Center for Comparative Genomics, University of Copenhagen, Copenhagen, Denmark
- Andrew G. Clark
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Carlos D. Bustamante
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
|
33
|
Antezana MA, Jordan IK. Highly conserved regimes of neighbor-base-dependent mutation generated the background primary-structural heterogeneities along vertebrate chromosomes. PLoS One 2008; 3:e2145. [PMID: 18478116 PMCID: PMC2366069 DOI: 10.1371/journal.pone.0002145] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2007] [Accepted: 03/17/2008] [Indexed: 01/01/2023] Open
Abstract
The content of guanine+cytosine varies markedly along the chromosomes of homeotherms and great effort has been devoted to studying this heterogeneity and its biological implications. Already before the DNA-sequencing era, however, it was established that the dinucleotides in the DNA of mammals in particular, and of most organisms in general, show striking over- and under-representations that cannot be explained by the base composition. Here we show that in the coding regions of vertebrates both GC content and codon occurrences are strongly correlated with such "motif preferences" even though we quantify the latter using an index that is not affected by the base composition, codon usage, and protein-sequence encoding. These correlations are likely to be the result of the long-term shaping of the primary structure of genic and non-genic DNA by a regime of mutation of which central features have been maintained by natural selection. We find indeed that these preferences are conserved in vertebrates even more rigidly than codon occurrences and we show that the occurrence-preference correlations are stronger in intronic and non-genic DNA, with the R² values reaching 99% when GC content is approximately 0.5. The mutation regime appears to be characterized by rates that depend markedly on the bases present at the site preceding and at that following each mutating site, because when we estimate such rates of neighbor-base-dependent mutation (NBDM) from substitutions retrieved from alignments of coding, intronic, and non-genic mammalian DNA sorted and grouped by GC content, they suffice to simulate DNA sequences in which motif occurrences and preferences as well as the correlations of motif preferences with GC content and with motif occurrences, are very similar to the mammalian ones.
The best fit, however, is obtained with NBDM regimes lacking strand effects, which indicates that over the long term NBDM switches strands in the germline as one would expect for effects due to loosely contained background transcription. Finally, we show that human coding regions are less mutable under the estimated NBDM regimes than under matched context-independent mutation and that this entails marked differences between the spectra of amino-acid mutations that either mutation regime should generate. In the Discussion we examine the mechanisms likely to underlie NBDM heterogeneity along chromosomes and propose that it reflects how the diversity and activity of lesion-bypass polymerases (LBPs) track the landscapes of scheduled and non-scheduled genome repair, replication, and transcription during the cell cycle. We conclude that the primary structure of vertebrate genic DNA at and below the trinucleotide level has been governed over the long term by highly conserved regimes of NBDM which should be under direct natural selection because they alter drastically missense-mutation rates and hence the somatic and the germline mutational loads. Therefore, the non-coding DNA of vertebrates may have been shaped by NBDM only epiphenomenally, with non-genic DNA being affected mainly when found in the proximity of genes.
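The dinucleotide over- and under-representation mentioned at the start of this abstract is classically measured with an observed/expected ratio, f(XY) / (f(X) * f(Y)). The sketch below computes that simple ratio on a single strand. It is not the composition-independent index the authors describe, which the abstract does not specify; the function name `dinucleotide_odds_ratios` and the toy sequences are hypothetical, and the input is assumed to contain only A, C, G, and T.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: observed/expected frequency ratio} for every
    dinucleotide that occurs in seq (overlapping windows, one strand)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    ratios = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        ratios[pair] = (count / n_di) / expected
    return ratios


# Toy sequence; ratios below 1 indicate under-representation
# relative to what base composition alone would predict.
for pair, ratio in sorted(dinucleotide_odds_ratios("ACGTTGCACGTA").items()):
    print(pair, round(ratio, 2))
```

Dinucleotides that never occur are simply absent from the returned dictionary, so a caller interested in strongly depleted motifs such as CpG should treat a missing key as a ratio of zero.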
Affiliation(s)
- Marcos A Antezana
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, United States of America.
|
34
|
Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 2008; 177:2251-61. [PMID: 18073430 DOI: 10.1534/genetics.107.080663] [Citation(s) in RCA: 261] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The distribution of fitness effects of new mutations (DFE) is important for addressing several questions in genetics, including the nature of quantitative variation and the evolutionary fate of small populations. Properties of the DFE can be inferred by comparing the distributions of the frequencies of segregating nucleotide polymorphisms at selected and neutral sites in a population sample, but demographic changes alter the spectrum of allele frequencies at both neutral and selected sites, so can bias estimates of the DFE if not accounted for. We have developed a maximum-likelihood approach, based on the expected allele-frequency distribution generated by transition matrix methods, to estimate parameters of the DFE while simultaneously estimating parameters of a demographic model that allows a population size change at some time in the past. We tested the method using simulations and found that it accurately recovers simulated parameter values, even if the simulated demography differs substantially from that assumed in our analysis. We use our method to estimate parameters of the DFE for amino acid-changing mutations in humans and Drosophila melanogaster. For a model of unconditionally deleterious mutations, with effects sampled from a gamma distribution, the mean estimate for the distribution shape parameter is approximately 0.2 for human populations, which implies that the DFE is strongly leptokurtic. For Drosophila populations, we estimate that the shape parameter is approximately 0.35. Differences in the shape of the distribution and the mean selection coefficient between humans and Drosophila result in significantly more strongly deleterious mutations in Drosophila than in humans, and, conversely, nearly neutral mutations are significantly less frequent.
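The link between a small gamma shape parameter and a "strongly leptokurtic" DFE can be checked directly, because the skewness and excess kurtosis of a gamma distribution depend only on its shape k: skewness = 2 / sqrt(k) and excess kurtosis = 6 / k. The sketch below evaluates both for the two shape estimates quoted in the abstract; the function name `gamma_shape_summary` is hypothetical and nothing here reproduces the authors' inference procedure.

```python
import math


def gamma_shape_summary(shape):
    """Return skewness and excess kurtosis of a gamma distribution.
    Both depend only on the shape parameter, not on the scale (mean effect)."""
    if shape <= 0:
        raise ValueError("shape must be positive")
    return {
        "skewness": 2.0 / math.sqrt(shape),
        "excess_kurtosis": 6.0 / shape,
    }


# Shape estimates quoted in the abstract above.
for label, shape in (("human", 0.2), ("Drosophila", 0.35)):
    summary = gamma_shape_summary(shape)
    print(label, {k: round(v, 2) for k, v in summary.items()})
```

A normal distribution has excess kurtosis 0, so values of roughly 30 (shape 0.2) and 17 (shape 0.35) both indicate a sharp peak near zero with a long tail of strongly deleterious effects, more extreme for the human estimate.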
|
35
|
Asthana S, Roytberg M, Stamatoyannopoulos J, Sunyaev S. Analysis of sequence conservation at nucleotide resolution. PLoS Comput Biol 2007; 3:e254. [PMID: 18166073 PMCID: PMC2230682 DOI: 10.1371/journal.pcbi.0030254] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2007] [Accepted: 11/13/2007] [Indexed: 12/02/2022] Open
Abstract
One of the major goals of comparative genomics is to understand the evolutionary history of each nucleotide in the human genome sequence, and the degree to which it is under selective pressure. Ascertainment of selective constraint at nucleotide resolution is particularly important for predicting the functional significance of human genetic variation and for analyzing the sequence substructure of cis-regulatory sequences and other functional elements. Current methods for analysis of sequence conservation are focused on delineation of conserved regions comprising tens or even hundreds of consecutive nucleotides. We therefore developed a novel computational approach designed specifically for scoring evolutionary conservation at individual base-pair resolution. Our approach estimates the rate at which each nucleotide position is evolving, computes the probability of neutrality given this rate estimate, and summarizes the result in a Sequence CONservation Evaluation (SCONE) score. We computed SCONE scores in a continuous fashion across 1% of the human genome for which high-quality sequence information from up to 23 genomes are available. We show that SCONE scores are clearly correlated with the allele frequency of human polymorphisms in both coding and noncoding regions. We find that the majority of noncoding conserved nucleotides lie outside of longer conserved elements predicted by other conservation analyses, and are experiencing ongoing selection in modern humans as evident from the allele frequency spectrum of human polymorphism. We also applied SCONE to analyze the distribution of conserved nucleotides within functional regions. These regions are markedly enriched in individually conserved positions and short (<15 bp) conserved “chunks.” Our results collectively suggest that the majority of functionally important noncoding conserved positions are highly fragmented and reside outside of canonically defined long conserved noncoding sequences. 
A small subset of these fragmented positions may be identified with high confidence. The structure of the human genome remains largely unknown, including which parts of the genome are functionally relevant and which parts are “junk.” The availability of genomic sequence from a large number of mammals allows a more detailed exploration of this structure, using comparison of related sequences from different species to identify portions of the genome that have remained unchanged, conserved by the action of natural selection, and thus likely to be functionally significant. To date, most efforts focused on localizing the functional fraction of the human genome have been based on identifying contiguous stretches of positions conserved in multiple species. Here, we present an analysis that is based instead on a single-position measure of conservation called SCONE. Our analysis suggests that the majority of conserved and putatively functional positions are highly fragmented and lie outside contiguous regions of conserved sequence. A subset of these fragmented positions may be identified based on local clustering.
Affiliation(s)
- Saurabh Asthana
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Mikhail Roytberg
- Computational Biology Group, Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino, Russia
- John Stamatoyannopoulos
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- * To whom correspondence should be addressed.
- Shamil Sunyaev
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- * To whom correspondence should be addressed.
|
36
|
Minovitsky S, Stegmaier P, Kel A, Kondrashov AS, Dubchak I. Short sequence motifs, overrepresented in mammalian conserved non-coding sequences. BMC Genomics 2007; 8:378. [PMID: 17945028 PMCID: PMC2176071 DOI: 10.1186/1471-2164-8-378] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2007] [Accepted: 10/18/2007] [Indexed: 12/22/2022] Open
Abstract
Background A substantial fraction of non-coding DNA sequences of multicellular eukaryotes is under selective constraint. In particular, ~5% of the human genome consists of conserved non-coding sequences (CNSs). CNSs differ from other genomic sequences in their nucleotide composition and must play important functional roles, which mostly remain obscure. Results We investigated relative abundances of short sequence motifs in all human CNSs present in the human/mouse whole-genome alignments vs. three background sets of sequences: (i) weakly conserved or unconserved non-coding sequences (non-CNSs); (ii) near-promoter sequences (located between nucleotides -500 and -1500, relative to a start of transcription); and (iii) random sequences with the same nucleotide composition as that of CNSs. When compared to non-CNSs and near-promoter sequences, CNSs possess an excess of AT-rich motifs, often containing runs of identical nucleotides. In contrast, when compared to random sequences, CNSs contain an excess of GC-rich motifs which, however, lack CpG dinucleotides. Thus, abundance of short sequence motifs in human CNSs, taken as a whole, is mostly determined by their overall compositional properties and not by overrepresentation of any specific short motifs. These properties are: (i) high AT-content of CNSs, (ii) a tendency, probably due to context-dependent mutation, of A's and T's to clump, (iii) presence of short GC-rich regions, and (iv) avoidance of CpG contexts, due to their hypermutability. Only a small number of short motifs, overrepresented in all human CNSs are similar to binding sites of transcription factors from the FOX family. Conclusion Human CNSs as a whole appear to be too broad a class of sequences to possess strong footprints of any short sequence-specific functions. Such footprints should be studied at the level of functional subclasses of CNSs, such as those which flank genes with a particular pattern of expression. 
Overall properties of CNSs are affected by patterns in mutation, suggesting that selection which causes their conservation is not always very strong.
Affiliation(s)
- Simon Minovitsky
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
|
37
|
|
38
|
Baer CF, Miyamoto MM, Denver DR. Mutation rate variation in multicellular eukaryotes: causes and consequences. Nat Rev Genet 2007; 8:619-31. [PMID: 17637734 DOI: 10.1038/nrg2158] [Citation(s) in RCA: 294] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
A basic knowledge about mutation rates is central to our understanding of a myriad of evolutionary phenomena, including the maintenance of sex and rates of molecular evolution. Although there is substantial evidence that mutation rates vary among taxa, relatively little is known about the factors that underlie this variation at an empirical level, particularly in multicellular eukaryotes. Here we integrate several disparate lines of theoretical and empirical inquiry into a unified framework to guide future studies that are aimed at understanding why and how mutation rates evolve in multicellular species.
Affiliation(s)
- Charles F Baer
- Department of Zoology, University of Florida, Gainesville, Florida 32611, USA.
|
39
|
Bazykin GA, Kondrashov FA, Brudno M, Poliakov A, Dubchak I, Kondrashov AS. Extensive parallelism in protein evolution. Biol Direct 2007; 2:20. [PMID: 17705846 PMCID: PMC2020468 DOI: 10.1186/1745-6150-2-20] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2007] [Accepted: 08/16/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Independently evolving lineages mostly accumulate different changes, which leads to their gradual divergence. However, parallel accumulation of identical changes is also common, especially in traits with only a small number of possible states. RESULTS We characterize parallelism in evolution of coding sequences in three four-species sets of genomes of mammals, Drosophila, and yeasts. Each such set contains two independent evolutionary paths, which we call paths I and II. An amino acid replacement which occurred along path I also occurs along path II with the probability 50-80% of that expected under selective neutrality. Thus, the per site rate of parallel evolution of proteins is several times higher than their average rate of evolution, but still lower than the rate of evolution of neutral sequences. This deficit may be caused by changes in the fitness landscape, leading to a replacement being possible along path I but not along path II. However, constant, weak selection assumed by the nearly neutral model of evolution appears to be a more likely explanation. Then, the average coefficient of selection associated with an amino acid replacement, in the units of the effective population size, must exceed approximately 0.4, and the fraction of effectively neutral replacements must be below approximately 30%. At a majority of evolvable amino acid sites, only a relatively small number of different amino acids is permitted. CONCLUSION High, but below-neutral, rates of parallel amino acid replacements suggest that a majority of amino acid replacements that occur in evolution are subject to weak, but non-trivial, selection, as predicted by Ohta's nearly-neutral theory.
Affiliation(s)
- Georgii A Bazykin
- Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Bolshoi Karetny pereulok 19, Moscow, 127994, Russia
- Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA
- Fyodor A Kondrashov
- Section on Ecology, Behavior and Evolution, University of California at San Diego, La Jolla, CA 92093, USA
- Michael Brudno
- Department of Computer Science and Banting & Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3J4, Canada
- Alexander Poliakov
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Inna Dubchak
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
- Alexey S Kondrashov
- Life Sciences Institute and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109-2216, USA
|
40
|
Abstract
While it has often been assumed that, in humans, synonymous mutations would have no effect on fitness, let alone cause disease, this position has been questioned over the last decade. There is now considerable evidence that such mutations can, for example, disrupt splicing and interfere with miRNA binding. Two recent publications suggest involvement of additional mechanisms: modification of protein abundance most probably mediated by alteration in mRNA stability and modification of protein structure and activity, probably mediated by induction of translational pausing. These case histories put a further nail into the coffin of the assumption that synonymous mutations must be neutral.
Affiliation(s)
- Joanna L Parmley
- Department of Biology and Biochemistry, University of Bath, Bath, UK
|
41
|
Artamonova II, Gelfand MS. Comparative Genomics and Evolution of Alternative Splicing: The Pessimists' Science. Chem Rev 2007; 107:3407-30. [PMID: 17645315 DOI: 10.1021/cr068304c] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Irena I Artamonova
- Group of Bioinformatics, Vavilov Institute of General Genetics, RAS, Gubkina 3, Moscow 119991, Russia
|
42
|
Resch AM, Carmel L, Mariño-Ramírez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV. Widespread positive selection in synonymous sites of mammalian genes. Mol Biol Evol 2007; 24:1821-31. [PMID: 17522087 PMCID: PMC2632937 DOI: 10.1093/molbev/msm100] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in approximately 28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in approximately 12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. 
Thus, mRNA destabilization could be an important factor driving positive selection in nonsynonymous sites, probably, through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
Affiliation(s)
- Alissa M Resch
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
|
43
|
Haddrill PR, Halligan DL, Tomaras D, Charlesworth B. Reduced efficacy of selection in regions of the Drosophila genome that lack crossing over. Genome Biol 2007; 8:R18. [PMID: 17284312 PMCID: PMC1852418 DOI: 10.1186/gb-2007-8-2-r18] [Citation(s) in RCA: 125] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2006] [Revised: 12/18/2006] [Accepted: 02/06/2007] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The recombinational environment is predicted to influence patterns of protein sequence evolution through the effects of Hill-Robertson interference among linked sites subject to selection. In freely recombining regions of the genome, selection should more effectively incorporate new beneficial mutations, and eliminate deleterious ones, than in regions with low rates of genetic recombination. RESULTS We examined the effects of recombinational environment on patterns of evolution using a genome-wide comparison of Drosophila melanogaster and D. yakuba. In regions of the genome with no crossing over, we find elevated divergence at nonsynonymous sites and in long introns, a virtual absence of codon usage bias, and an increase in gene length. However, we find little evidence for differences in patterns of evolution between regions with high, intermediate, and low crossover frequencies. In addition, genes on the fourth chromosome exhibit more extreme deviations from regions with crossing over than do other, no crossover genes outside the fourth chromosome. CONCLUSION All of the patterns observed are consistent with a severe reduction in the efficacy of selection in the absence of crossing over, resulting in the accumulation of deleterious mutations in these regions. Our results also suggest that even a very low frequency of crossing over may be enough to maintain the efficacy of selection.
Affiliation(s)
- Penelope R Haddrill
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JT, UK
- Daniel L Halligan
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JT, UK
- Dimitris Tomaras
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JT, UK
- 15 Smirnis St, 15669, Papagou, Athens, Greece
- Brian Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JT, UK
|
44
|
Goodstadt L, Ponting CP. Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput Biol 2006; 2:e133. [PMID: 17009864 PMCID: PMC1584324 DOI: 10.1371/journal.pcbi.0020133] [Citation(s) in RCA: 125] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2006] [Accepted: 08/21/2006] [Indexed: 01/22/2023] Open
Abstract
Accurate predictions of orthology and paralogy relationships are necessary to infer human molecular function from experiments in model organisms. Previous genome-scale approaches to predicting these relationships have been limited by their use of protein similarity and their failure to take into account multiple splicing events and gene prediction errors. We have developed PhyOP, a new phylogenetic orthology prediction pipeline based on synonymous rate estimates, which accurately predicts orthology and paralogy relationships for transcripts, genes, exons, or genomic segments between closely related genomes. We were able to identify orthologue relationships to human genes for 93% of all dog genes from Ensembl. Among 1:1 orthologues, the alignments covered a median of 97.4% of protein sequences, and 92% of orthologues shared essentially identical gene structures. PhyOP accurately recapitulated genomic maps of conserved synteny. Benchmarking against predictions from Ensembl and Inparanoid showed that PhyOP is more accurate, especially in its predictions of paralogy. Nearly half (46%) of PhyOP paralogy predictions are unique. Using PhyOP to investigate orthologues and paralogues in the human and dog genomes, we found that the human assembly contains 3-fold more gene duplications than the dog. Species-specific duplicate genes, or “in-paralogues,” are generally shorter and have fewer exons than 1:1 orthologues, which is consistent with selective constraints and mutation biases based on the sizes of duplicated genes. In-paralogues have experienced elevated amino acid and synonymous nucleotide substitution rates. Duplicates possess similar biological functions for either the dog or human lineages. Having accounted for 2,954 likely pseudogenes and gene fragments, and after separating 346 erroneously merged genes, we estimated that the human genome encodes a minimum of 19,700 protein-coding genes, similar to the gene count of nematode worms. 
PhyOP is a fast and robust approach to orthology prediction that will be applicable to whole genomes from multiple closely related species. PhyOP will be particularly useful in predicting orthology for mammalian genomes that have been incompletely sequenced, and for large families of rapidly duplicating genes.
Biologists often exploit the evolutionary relationships between proteins in order to explain how their findings are relevant to the biology of other species, including Homo sapiens. The most natural way to define these relationships is to draw family trees showing, for example, which human protein is the counterpart (“orthologue”) of a protein in dog, and which human proteins have arisen by recent duplication of existing genes (“paralogues”). On a small scale this is relatively straightforward, but it is difficult to do this automatically on a genome-wide scale. In this paper the authors describe a new approach to drawing a giant family tree of all proteins from humans and dogs. They show how this tree allows them to refine some protein predictions and discard others that are likely to be nonfunctional dead sequences. Family relationships can show how the dog and human genomes have been rearranged since their last common ancestor. In addition, they help to identify the proteins that are specific to either dog or human, and which contribute to these species' biological differences. Giant trees, drawn from this method, will help to associate the differences, duplications, and evolution of proteins in different mammals with their distinctive physiologies and behaviours.
Affiliation(s)
- Leo Goodstadt
- Medical Research Council Functional Genetics Unit, University of Oxford, Department of Physiology, Anatomy, and Genetics, Oxford, United Kingdom.
45
|
Ponting CP, Lunter G. Signatures of adaptive evolution within human non-coding sequence. Hum Mol Genet 2006; 15 Spec No 2:R170-5. [PMID: 16987880 DOI: 10.1093/hmg/ddl182] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.1] [Indexed: 01/13/2023]
Abstract
The human genome is often portrayed as consisting of three sequence types, each distinguished by their mode of evolution. Purifying selection is estimated to act on 2.5-5.0% of the genome, whereas virtually all remaining sequence is considered to have evolved neutrally and to be devoid of functionality. The third mode of evolution, positive selection of advantageous changes, is considered rare. Such instances have been inferred only for a handful of sites, and these lie almost exclusively within protein-coding genes. Nevertheless, the majority of positively selected sequence is expected to lie within the wealth of functional 'dark matter' present outside of the coding sequence. Here, we review the evolutionary evidence for the majority of human-conserved DNA lying outside of the protein-coding sequence. We argue that within this non-coding fraction lies at least 1 Mb of functional sequence that has accumulated many beneficial nucleotide replacements. Illuminating the functions of this adaptive dark matter will lead to a better understanding of the sequence changes that have shaped the innovative biology of our species.
Affiliation(s)
- Chris P Ponting
- MRC Functional Genetics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK.
46
|
Gaffney DJ, Keightley PD. Genomic selective constraints in murid noncoding DNA. PLoS Genet 2006; 2:e204. [PMID: 17166057 PMCID: PMC1657059 DOI: 10.1371/journal.pgen.0020204] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Received: 06/28/2006] [Accepted: 10/18/2006] [Indexed: 02/04/2023]
Abstract
Recent work has suggested that there are many more selectively constrained, functional noncoding than coding sites in mammalian genomes. However, little is known about how selective constraint varies amongst different classes of noncoding DNA. We estimated the magnitude of selective constraint on a large dataset of mouse-rat gene orthologs and their surrounding noncoding DNA. Our analysis indicates that there are more than three times as many selectively constrained, nonrepetitive sites within noncoding DNA as in coding DNA in murids. The majority of these constrained noncoding sites appear to be located within intergenic regions, at distances greater than 5 kilobases from known genes. Our study also shows that in murids, intron length and mean intronic selective constraint are negatively correlated with intron ordinal number. Our results therefore suggest that functional intronic sites tend to accumulate toward the 5′ end of murid genes. Our analysis also reveals that mean number of selectively constrained noncoding sites varies substantially with the function of the adjacent gene. We find that, among others, developmental and neuronal genes are associated with the greatest numbers of putatively functional noncoding sites compared with genes involved in electron transport and a variety of metabolic processes. Combining our estimates of the total number of constrained coding and noncoding bases we calculate that over twice as many deleterious mutations have occurred in intergenic regions as in known genic sequence and that the total genomic deleterious point mutation rate is 0.91 per diploid genome, per generation. This estimated rate is over twice as large as a previous estimate in murids.
Most DNA can typically be divided into two categories: regions that encode the instructions for the assembly of a protein molecule (protein-coding genes) and those that do not (noncoding). Although mammalian genomes are primarily noncoding, relatively little is known about how much of this is functional, where such regions are found in the genome, and what functions they are likely to perform. In this study, the authors investigated the quantity and location of functional noncoding DNA in mice and rats. They estimate that functional noncoding DNA is at least three times as common as coding DNA in rodents, and the majority is located large distances from known protein-coding genes. Putatively functional intronic DNA tends to be clustered towards the gene 5′ end, suggesting that much intronic sequence is instrumental in regulating gene expression. This study also finds that genes involved in development and the nervous system are typically associated with much higher quantities of functional noncoding DNA, suggesting that these genes require more finely tuned control of their expression. One implication of this study is the finding that disease-causing mutations have occurred more frequently in noncoding regions and may have affected gene expression, rather than protein structure.
Affiliation(s)
- Daniel J Gaffney
- Institute of Evolutionary Biology, Ashworth Laboratories, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom.
47
|
Xing Y, Wang Q, Lee C. Evolutionary divergence of exon flanks: a dissection of mutability and selection. Genetics 2006; 173:1787-91. [PMID: 16702427 PMCID: PMC1526697 DOI: 10.1534/genetics.106.057919] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 03/06/2006] [Accepted: 05/01/2006] [Indexed: 01/29/2023]
Abstract
The intronic sequences flanking exon-intron junctions (i.e., exon flanks) are important for splice site recognition and pre-mRNA splicing. Recent studies show a higher degree of sequence conservation at flanks of alternative exons, compared to flanks of constitutive exons. In this article we performed a detailed analysis on the evolutionary divergence of exon flanks between human and chimpanzee, aiming to dissect the impact of mutability and selection on their evolution. Inside exon flanks, sites that might reside in ancestral CpG dinucleotides evolved significantly faster than sites outside of ancestral CpG dinucleotides. This result reflects a systematic variation of mutation rates (mutability) at exon flanks, depending on the local CpG contexts. Remarkably, we observed a significant reduction of the nucleotide substitution rate in flanks of alternatively spliced exons, independent of the site-by-site variation in mutability due to different CpG contexts. Our data provide concrete evidence for increased purifying selection at exon flanks associated with regulation of alternative splicing.
Affiliation(s)
- Yi Xing
- Molecular Biology Institute, Center for Genomics and Proteomics, Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095-1570, USA
48
|
Hurst LD. Preliminary assessment of the impact of microRNA-mediated regulation on coding sequence evolution in mammals. J Mol Evol 2006; 63:174-82. [PMID: 16786435 DOI: 10.1007/s00239-005-0273-2] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Received: 11/14/2005] [Accepted: 03/02/2006] [Indexed: 10/24/2022]
Abstract
Despite prior claims to the contrary, several lines of evidence suggest that selection acts on synonymous mutations in mammals. What might be the mechanisms for such selection? Here I attempt to quantify the constraints on the evolution of the coding sequence resulting from regulation of mRNA by microRNAs (miRNAs) that antisense-bind to the coding region of mRNAs. I employ a set of genes recently experimentally verified to be the target of a miRNA, all with putative antisense pairing domains within the coding sequence. Although very small (approximately 22 nucleotides), 2 of 13 pairing domains show evidence of significantly slow sequence evolution. This, along with evidence that these genes are regulated by the miRNA under consideration, provides the first good candidate domains for intra-CDS pairing of a miRNA in mammals. When analyzed en masse, the putative pairing domains have a significantly reduced rate of synonymous evolution (approximately 35% lower than null). However, given the size and rarity of pairing domains within the coding sequence, the effects that such constraint has on estimates of the mutation rate are small enough to be ignored (probably less than 1% reduction). The pairing sites also have low Ka values and the selection on the synonymous sites is unlikely to lead to misleading reports of localized high Ka/Ks ratios.
Affiliation(s)
- Laurence D Hurst
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK.
49
|
Shabalina SA, Ogurtsov AY, Spiridonov NA. A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res 2006; 34:2428-37. [PMID: 16682450 PMCID: PMC1458515 DOI: 10.1093/nar/gkl287] [Citation(s) in RCA: 151] [Impact Index Per Article: 8.4] [Indexed: 12/16/2022]
Abstract
Single-stranded mRNA molecules form secondary structures through complementary self-interactions. Several hypotheses have been proposed on the relationship between the nucleotide sequence, encoded amino acid sequence and mRNA secondary structure. We performed the first transcriptome-wide in silico analysis of the human and mouse mRNA foldings and found a pronounced periodic pattern of nucleotide involvement in mRNA secondary structure. We show that this pattern is created by the structure of the genetic code, and the dinucleotide relative abundances are important for the maintenance of mRNA secondary structure. Although synonymous codon usage contributes to this pattern, it is intrinsic to the structure of the genetic code and manifests itself even in the absence of synonymous codon usage bias at the 4-fold degenerate sites. While all codon sites are important for the maintenance of mRNA secondary structure, degeneracy of the code allows regulation of stability and periodicity of mRNA secondary structure. We demonstrate that the third degenerate codon sites contribute most strongly to mRNA stability. These results convincingly support the hypothesis that redundancies in the genetic code allow transcripts to satisfy requirements for both protein structure and RNA structure. Our data show that selection may be operating on synonymous codons to maintain a more stable and ordered mRNA secondary structure, which is likely to be important for transcript stability and translation. We also demonstrate that functional domains of the mRNA [5′-untranslated region (5′-UTR), CDS and 3′-UTR] preferentially fold onto themselves, while the start codon and stop codon regions are characterized by relaxed secondary structures, which may facilitate initiation and termination of translation.
Affiliation(s)
- Svetlana A Shabalina
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
50
|
Vinogradov AE. "Genome design" model: evidence from conserved intronic sequence in human-mouse comparison. Genome Res 2006; 16:347-54. [PMID: 16461636 PMCID: PMC1415212 DOI: 10.1101/gr.4318206] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.2] [Received: 06/20/2005] [Accepted: 12/15/2005] [Indexed: 11/25/2022]
Abstract
Introns are shorter in housekeeping genes than in tissue- or development-specific genes. Differing explanations have been offered for this phenomenon: selection for economy (in housekeeping genes), mutation bias or "genomic design." The large-scale implementation in this present paper of a rigorous local sequence alignment algorithm revealed an unprecedented fraction of evolutionarily conserved DNA in human-mouse introns (approximately 60% of human and approximately 70% of mouse intron length remained after masking for lineage-specific repeats). The length distributions of both conserved and nonconserved regions are very broad but show peaks close to nucleosomal and di-nucleosomal DNA. Both the fraction of conserved sequence and its absolute length were higher in introns of tissue-specific genes than housekeeping genes. This difference remained after control for between-species identity of the conserved fraction, mutation rate, and GC content. In a more direct control, the product of the conserved sequence fraction and the between-species identity of this fraction (which can be considered to be the fraction of conserved nucleotides) was greater in introns of tissue-specific genes than housekeeping genes. Neither the fraction of intron length covered by repeats nor the balance of small insertions and deletions (indels) can explain the greater length of introns in tissue-specific genes. The length of the conserved intronic DNA in a gene is correlated with the number of functional domains in the protein encoded by that gene. These results suggest that the greater length of introns in tissue-specific genes is not due to selection for economy or mutation bias but instead is related to functional complexity (probably mediated by chromatin condensation), and that the evolution of the bulk of noncoding DNA is not completely neutral.