1
Zhang J, Hou W, Zhao Q, Xiao S, Linghu H, Zhang L, Du J, Cui H, Yang X, Ling S, Su J, Kong Q. Deep annotation of long noncoding RNAs by assembling RNA-seq and small RNA-seq data. J Biol Chem 2023; 299:105130. [PMID: 37543366 PMCID: PMC10498003 DOI: 10.1016/j.jbc.2023.105130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 07/20/2023] [Accepted: 07/31/2023] [Indexed: 08/07/2023]
Abstract
Long noncoding RNAs (lncRNAs) are increasingly being recognized as modulators in various biological processes. However, because of their low expression, they are difficult to characterize systematically. Here, we performed transcript annotation by a newly developed computational pipeline, termed RNA-seq and small RNA-seq combined strategy (RSCS), in a wide variety of cellular contexts. Thousands of high-confidence potential novel transcripts were identified by the RSCS, and the reliability of the transcriptome was verified by analysis of transcript structure, base composition, and sequence complexity. As evidenced by the length comparison, the frequency of the core promoter and the polyadenylation signal motifs, and the locations of transcription start and end sites, the transcripts appear to be full length. Furthermore, taking advantage of our strategy, we identified a large number of endogenous retrovirus-associated lncRNAs, and identified a novel endogenous retrovirus-lncRNA that was functionally involved in control of Yap1 expression and essential for early embryogenesis. In summary, the RSCS can generate a more complete and precise transcriptome, and our findings greatly expand the transcriptome annotation for the mammalian community.
Affiliation(s)
- Jiaming Zhang
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China; Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Weibo Hou
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Qi Zhao
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Songling Xiao
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Hongye Linghu
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Lixin Zhang
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Jiawei Du
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Hongdi Cui
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Xu Yang
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China
- Shukuan Ling
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou Medical University, Wenzhou, Zhejiang Province, China.
- Jianzhong Su
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou Medical University, Wenzhou, Zhejiang Province, China.
- Qingran Kong
- Oujiang Laboratory, Zhejiang Provincial Key Laboratory of Medical Genetics, Key Laboratory of Laboratory Medicine, Ministry of Education, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, Zhejiang Province, China.
2
On the Base Composition of Transposable Elements. Int J Mol Sci 2022; 23:ijms23094755. [PMID: 35563146 PMCID: PMC9099904 DOI: 10.3390/ijms23094755] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/24/2022] [Revised: 04/22/2022] [Accepted: 04/23/2022] [Indexed: 01/27/2023]
Abstract
Transposable elements exhibit a base composition that is often different from the genomic average and from hosts’ genes. The most common compositional bias is towards Adenosine and Thymine, although this bias is not universal, and elements with drastically different base composition can coexist within the same genome. The AT-richness of transposable elements is apparently maladaptive because it results in poor transcription and sub-optimal translation of proteins encoded by the elements. The cause(s) of this unusual base composition remain unclear and have yet to be investigated. Here, I review what is known about the nucleotide content of transposable elements and how this content can affect the genome of their host as well as their own replication. The compositional bias of transposable elements could result from several non-exclusive processes including horizontal transfer, mutational bias, and selection. It appears that mutation alone cannot explain the high AT-content of transposons and that selection plays a major role in the evolution of the compositional bias. The reason why selection would favor a maladaptive nucleotide content remains however unexplained and is an area of investigation that clearly deserves attention.
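The compositional bias discussed in this abstract can be made concrete with a small calculation. The following is a minimal sketch, not taken from the review: the function names `at_content` and `at_enrichment` are hypothetical, and ambiguous bases such as N are simply ignored as an assumption.

```python
def at_content(seq: str) -> float:
    """Return the fraction of A/T among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc
    # Sequences with no unambiguous bases are reported as 0.0 by convention here.
    return at / total if total else 0.0


def at_enrichment(element: str, background: str) -> float:
    """Difference in AT fraction between an element and a background sequence.

    A positive value means the element is more AT-rich than the background.
    """
    return at_content(element) - at_content(background)
```

For example, `at_enrichment(te_sequence, genome_sequence)` would be positive for an AT-rich element, where both arguments are placeholder strings supplied by the reader.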
3
Genome-Wide Prediction of Transcription Start Sites in Conifers. Int J Mol Sci 2022; 23:ijms23031735. [PMID: 35163661 PMCID: PMC8836283 DOI: 10.3390/ijms23031735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/30/2021] [Revised: 01/30/2022] [Accepted: 02/01/2022] [Indexed: 02/04/2023]
Abstract
The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.
4
Crump NT, Milne TA. Why are so many MLL lysine methyltransferases required for normal mammalian development? Cell Mol Life Sci 2019; 76:2885-2898. [PMID: 31098676 PMCID: PMC6647185 DOI: 10.1007/s00018-019-03143-z] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 04/30/2019] [Accepted: 05/10/2019] [Indexed: 12/12/2022]
Abstract
The mixed lineage leukemia (MLL) family of proteins became known initially for the leukemia link of its founding member. Over the decades, the MLL family has been recognized as an important class of histone H3 lysine 4 (H3K4) methyltransferases that control key aspects of normal cell physiology and development. Here, we provide a brief history of the discovery and study of this family of proteins. We address two main questions: why are there so many H3K4 methyltransferases in mammals; and is H3K4 methylation their key function?
Affiliation(s)
- Nicholas T Crump
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, NIHR Oxford Biomedical Research Centre Haematology Theme, Radcliffe Department of Medicine, University of Oxford, Oxford, UK
- Thomas A Milne
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, NIHR Oxford Biomedical Research Centre Haematology Theme, Radcliffe Department of Medicine, University of Oxford, Oxford, UK.
5
Cencini M, Pigolotti S. Energetic funnel facilitates facilitated diffusion. Nucleic Acids Res 2019; 46:558-567. [PMID: 29216364 PMCID: PMC5778461 DOI: 10.1093/nar/gkx1220] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/14/2017] [Accepted: 11/24/2017] [Indexed: 01/25/2023]
Abstract
Transcription factors (TFs) are able to associate to their binding sites on DNA faster than the physical limit posed by diffusion. Such high association rates can be achieved by alternating between three-dimensional diffusion and one-dimensional sliding along the DNA chain, a mechanism dubbed facilitated diffusion. By studying a collection of TF binding sites of Escherichia coli from the RegulonDB database and of Bacillus subtilis from DBTBS, we reveal a funnel in the binding energy landscape around the target sequences. We show that such a funnel is linked to the presence of gradients of AT in the base composition of the DNA region around the binding sites. An extensive computational study of the stochastic sliding process along the energetic landscapes obtained from the database shows that the funnel can significantly enhance the probability of TFs to find their target sequences when sliding in their proximity. We demonstrate that this enhancement leads to a speed-up of the association process.
Affiliation(s)
- Massimo Cencini
- Istituto dei Sistemi Complessi, Consiglio Nazionale delle Ricerche, via dei Taurini 19, 00185 Rome, Italy
- Simone Pigolotti
- Biological Complexity Unit, Okinawa Institute of Science and Technology and Graduate University, Onna, Okinawa 904-0495, Japan; Max Planck Institute for the Physics of Complex Systems, Nöthnitzerstraße 38, 01187 Dresden, Germany; Departament de Fisica, Universitat Politecnica de Catalunya Edif. GAIA, Rambla Sant Nebridi 22, 08222 Terrassa, Barcelona, Spain
6
Jeong H, Wu X, Smith B, Yi SV. Genomic Landscape of Methylation Islands in Hymenopteran Insects. Genome Biol Evol 2018; 10:2766-2776. [PMID: 30239702 PMCID: PMC6195173 DOI: 10.1093/gbe/evy203] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Accepted: 09/13/2018] [Indexed: 01/31/2023]
Abstract
Recent genome-wide DNA methylation analyses of insect genomes accentuate an intriguing contrast compared with those in mammals. In mammals, most CpGs are heavily methylated, with the exceptions of clusters of hypomethylated sites referred to as CpG islands. In contrast, DNA methylation in insects is localized to a small number of CpG sites. Here, we refer to clusters of methylated CpGs as “methylation islands (MIs),” and investigate their characteristics in seven hymenopteran insects with high-quality bisulfite sequencing data. Methylation islands were primarily located within gene bodies. They were significantly overrepresented in exon–intron boundaries, indicating their potential roles in splicing. Methylated CpGs within MIs exhibited stronger evolutionary conservation compared with those outside of MIs. Additionally, genes harboring MIs exhibited higher and more stable levels of gene expression compared with those that do not harbor MIs. The effects of MIs on evolutionary conservation and gene expression are independent and stronger than the effect of DNA methylation alone. These results indicate that MIs may be useful to gain additional insights into understanding the role of DNA methylation in gene expression and evolutionary conservation in invertebrate genomes.
Affiliation(s)
- Hyeonsoo Jeong
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, Georgia
- Xin Wu
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, Georgia
- Brandon Smith
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, Georgia
- Soojin V Yi
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, Georgia
7
A damaged genome's transcriptional landscape through multilayered expression profiling around in situ-mapped DNA double-strand breaks. Nat Commun 2017; 8:15656. [PMID: 28561034 PMCID: PMC5499205 DOI: 10.1038/ncomms15656] [Citation(s) in RCA: 73] [Impact Index Per Article: 10.4] [Received: 12/02/2016] [Accepted: 04/18/2017] [Indexed: 12/19/2022]
Abstract
Of the many types of DNA damage, DNA double-strand breaks (DSBs) are probably the most deleterious. Mounting evidence points to an intricate relationship between DSBs and transcription. A cell system in which the impact on transcription can be investigated at precisely mapped genomic DSBs is essential to study this relationship. Here in a human cell line, we map genome-wide and at high resolution the DSBs induced by a restriction enzyme, and we characterize their impact on gene expression by four independent approaches by monitoring steady-state RNA levels, rates of RNA synthesis, transcription initiation and RNA polymerase II elongation. We consistently observe transcriptional repression in proximity to DSBs. Downregulation of transcription depends on ATM kinase activity and on the distance from the DSB. Our study couples for the first time, to the best of our knowledge, high-resolution mapping of DSBs with multilayered transcriptomics to dissect the events shaping gene expression after DSB induction at multiple endogenous sites.
DNA double strand breaks (DSBs) are among the most deleterious types of damage and there is strong evidence indicating a relationship between breaks and transcription. Here the authors provide a high-resolution, genome-wide map of induced DSBs and observe ATM-dependent transcriptional repression.
8
Marsh AG, Hoadley KD, Warner ME. Distribution of CpG Motifs in Upstream Gene Domains in a Reef Coral and Sea Anemone: Implications for Epigenetics in Cnidarians. PLoS One 2016; 11:e0150840. [PMID: 26950882 PMCID: PMC4780780 DOI: 10.1371/journal.pone.0150840] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 10/26/2015] [Accepted: 02/20/2016] [Indexed: 12/26/2022]
Abstract
Coral reefs are under assault from stressors including global warming, ocean acidification, and urbanization. Knowing how these factors impact the future fate of reefs requires delineating stress responses across ecological, organismal and cellular scales. Recent advances in coral reef biology have integrated molecular processes with ecological fitness and have identified putative suites of temperature acclimation genes in a Scleractinian coral Acropora hyacinthus. We wondered what unique characteristics of these genes determined their coordinate expression in response to temperature acclimation, and whether or not other corals and cnidarians would likewise possess these features. Here, we focus on cytosine methylation as an epigenetic DNA modification that is responsive to environmental stressors. We identify common conserved patterns of cytosine-guanosine dinucleotide (CpG) motif frequencies in upstream promoter domains of different functional gene groups in two cnidarian genomes: a coral (Acropora digitifera) and an anemone (Nematostella vectensis). Our analyses show that CpG motif frequencies are prominent in the promoter domains of functional genes associated with environmental adaptation, particularly those identified in A. hyacinthus. Densities of CpG sites in upstream promoter domains near the transcriptional start site (TSS) are 1.38x higher than genomic background levels upstream of -2000 bp from the TSS. The increase in CpG usage suggests selection to allow for DNA methylation events to occur more frequently within 1 kb of the TSS. In addition, observed shifts in CpG densities among functional groups of genes suggest a potential role for epigenetic DNA methylation within promoter domains to impact functional gene expression responses in A. digitifera and N. vectensis.
Identifying promoter epigenetic sequence motifs among genes within specific functional groups establishes an approach to describe integrated cellular responses to environmental stress in reef corals and potential roles of epigenetics on survival and fitness in the face of global climate change.
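The comparison of CpG density near the TSS against a distal upstream background, as described in this abstract, can be sketched as follows. This is an illustrative sketch only and not the authors' pipeline: the function names, the default window sizes (`proximal_bp=1000`, `distal_bp=2000`), and the convention that the input string ends at the TSS are all assumptions chosen to mirror the "within 1 kb" and "upstream of -2000 bp" wording.

```python
def cpg_density(seq: str) -> float:
    """CpG dinucleotides per base pair in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number of CpG sites.
    return seq.count("CG") / len(seq)


def proximal_to_distal_ratio(upstream: str,
                             proximal_bp: int = 1000,
                             distal_bp: int = 2000) -> float:
    """Ratio of CpG density near the TSS to density in the distal background.

    `upstream` is assumed to be the upstream sequence on the coding strand,
    with its last base at position -1 relative to the TSS.
    """
    proximal = upstream[-proximal_bp:]           # bases -proximal_bp .. -1
    distal = upstream[:-distal_bp] if len(upstream) > distal_bp else ""
    background = cpg_density(distal)
    # Undefined when there is no distal sequence or it contains no CpG.
    return cpg_density(proximal) / background if background else float("nan")
```

A ratio above 1 would indicate CpG enrichment close to the TSS relative to the chosen background; the abstract reports a value of 1.38x for its own definition of the two regions.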
Affiliation(s)
- Adam G. Marsh
- Marine Biosciences, School of Marine Science and Policy, University of Delaware, Lewes, DE, United States of America
- Center for Bioinformatics and Computational Biology/Delaware Biotechnology Institute/University of Delaware, Newark, DE, United States of America
- Kenneth D. Hoadley
- Marine Biosciences, School of Marine Science and Policy, University of Delaware, Lewes, DE, United States of America
- Mark E. Warner
- Marine Biosciences, School of Marine Science and Policy, University of Delaware, Lewes, DE, United States of America
9
Keller TE, Han P, Yi SV. Evolutionary Transition of Promoter and Gene Body DNA Methylation across Invertebrate-Vertebrate Boundary. Mol Biol Evol 2015; 33:1019-28. [PMID: 26715626 PMCID: PMC4776710 DOI: 10.1093/molbev/msv345] [Citation(s) in RCA: 74] [Impact Index Per Article: 8.2] [Indexed: 12/12/2022]
Abstract
Genomes of invertebrates and vertebrates exhibit highly divergent patterns of DNA methylation. Invertebrate genomes tend to be sparsely methylated, and DNA methylation is mostly targeted to a subset of transcription units (gene bodies). In a drastic contrast, vertebrate genomes are generally globally and heavily methylated, punctuated by the limited local hypo-methylation of putative regulatory regions such as promoters. These genomic differences also translate into functional differences in DNA methylation and gene regulation. Although promoter DNA methylation is an important regulatory component of vertebrate gene expression, its role in invertebrate gene regulation has been little explored. Instead, gene body DNA methylation is associated with expression of invertebrate genes. However, the evolutionary steps leading to the differentiation of invertebrate and vertebrate genomic DNA methylation remain unresolved. Here we analyzed experimentally determined DNA methylation maps of several species across the invertebrate–vertebrate boundary, to elucidate how vertebrate gene methylation has evolved. We show that, in contrast to the prevailing idea, a substantial number of promoters in an invertebrate basal chordate Ciona intestinalis are methylated. Moreover, gene expression data indicate significant, epigenomic context-dependent associations between promoter methylation and expression in C. intestinalis. However, there is no evidence that promoter methylation in invertebrate chordate has been evolutionarily maintained across the invertebrate–vertebrate boundary. Rather, body-methylated invertebrate genes preferentially obtain hypo-methylated promoters among vertebrates. Conversely, promoter methylation is preferentially found in lineage- and tissue-specific vertebrate genes. These results provide important insights into the evolutionary origin of epigenetic regulation of vertebrate gene expression.
Affiliation(s)
- Soojin V Yi
- School of Biology, Georgia Institute of Technology
10
Hartono SR, Korf IF, Chédin F. GC skew is a conserved property of unmethylated CpG island promoters across vertebrates. Nucleic Acids Res 2015; 43:9729-41. [PMID: 26253743 PMCID: PMC4787789 DOI: 10.1093/nar/gkv811] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 03/15/2015] [Accepted: 07/29/2015] [Indexed: 01/18/2023]
Abstract
GC skew is a measure of the strand asymmetry in the distribution of guanines and cytosines. GC skew favors R-loops, a type of three-stranded nucleic acid structure that forms upon annealing of an RNA strand to one strand of DNA, creating a persistent RNA:DNA hybrid. Previous studies show that GC skew is prevalent at thousands of human CpG island (CGI) promoters and transcription termination regions, which correspond to hotspots of R-loop formation. Here, we investigated the conservation of GC skew patterns in 60 sequenced chordate genomes. We report that GC skew is a conserved sequence characteristic of the CGI promoter class in vertebrates. Furthermore, we reveal that promoter GC skew peaks at the exon 1/intron 1 junction and that it is highly correlated with gene age and CGI promoter strength. Our data also show that GC skew is predictive of unmethylated CGI promoters in a range of vertebrate species and that it imparts significant DNA hypomethylation for promoters with intermediate CpG densities. Finally, we observed that terminal GC skew is conserved for a subset of vertebrate genes that tend to be located significantly closer to their downstream neighbors, consistent with a role for R-loop formation in transcription termination.
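GC skew is commonly computed per window as (G - C) / (G + C) on one strand. The sketch below uses that standard definition to illustrate the measure named in this abstract; it is not the authors' method, and the function names, window size, and step are hypothetical choices.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one strand; 0.0 if the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int = 100, step: int = 50):
    """Compute GC skew in sliding windows.

    Returns a list of (window_start, skew) tuples. A sequence shorter than
    `window` yields a single entry covering the whole sequence.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [(start, gc_skew(seq[start:start + window]))
            for start in range(0, last_start, step)]
```

Positive values mark windows where G exceeds C on the given strand, and negative values the reverse, so a profile along a promoter shows where the asymmetry peaks.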
Affiliation(s)
- Stella R Hartono
- Department of Molecular and Cellular Biology and Genome Center, University of California, Davis, CA 95616, United States
- Ian F Korf
- Department of Molecular and Cellular Biology and Genome Center, University of California, Davis, CA 95616, United States
- Frédéric Chédin
- Department of Molecular and Cellular Biology and Genome Center, University of California, Davis, CA 95616, United States
11
Genome-wide analysis of promoters: clustering by alignment and analysis of regular patterns. PLoS One 2014; 9:e85260. [PMID: 24465517 PMCID: PMC3898993 DOI: 10.1371/journal.pone.0085260] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 07/19/2013] [Accepted: 11/26/2013] [Indexed: 01/08/2023]
Abstract
In this paper we perform a genome-wide analysis of H. sapiens promoters. To this aim, we developed and combined two mathematical methods that allow us to (i) classify promoters into groups characterized by specific global structural features, and (ii) recover, in full generality, any regular sequence in the different classes of promoters. One of the main findings of this analysis is that H. sapiens promoters can be classified into three main groups. Two of them are distinguished by the prevalence of weak or strong nucleotides and are characterized by short compositionally biased sequences, while the most frequent regular sequences in the third group are strongly correlated with transposons. Taking advantage of the generality of these mathematical procedures, we have compared the promoter database of H. sapiens with those of other species. We have found that the above-mentioned features characterize also the evolutionary content appearing in mammalian promoters, at variance with ancestral species in the phylogenetic tree, that exhibit a definitely lower level of differentiation among promoters.
12
Kumari S, Ware D. Genome-wide computational prediction and analysis of core promoter elements across plant monocots and dicots. PLoS One 2013; 8:e79011. [PMID: 24205361 PMCID: PMC3812177 DOI: 10.1371/journal.pone.0079011] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 02/11/2013] [Accepted: 09/18/2013] [Indexed: 01/22/2023]
Abstract
Transcription initiation, essential to gene expression regulation, involves recruitment of basal transcription factors to the core promoter elements (CPEs). The distribution of currently known CPEs across plant genomes is largely unknown. This is the first large scale genome-wide report on the computational prediction of CPEs across eight plant genomes to help better understand the transcription initiation complex assembly. The distribution of thirteen known CPEs across four monocots (Brachypodium distachyon, Oryza sativa ssp. japonica, Sorghum bicolor, Zea mays) and four dicots (Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera, Glycine max) reveals the structural organization of the core promoter in relation to the TATA-box as well as with respect to other CPEs. The distribution of known CPE motifs with respect to transcription start site (TSS) exhibited positional conservation within monocots and dicots with slight differences across all eight genomes. Further, a more refined subset of annotated genes based on orthologs of the model monocot (O. sativa ssp. japonica) and dicot (A. thaliana) genomes supported the positional distribution of these thirteen known CPEs. DNA free energy profiles provided evidence that the structural properties of promoter regions are distinctly different from that of the non-regulatory genome sequence. It also showed that monocot core promoters have lower DNA free energy than dicot core promoters. The comparison of monocot and dicot promoter sequences highlights both the similarities and differences in the core promoter architecture irrespective of the species-specific nucleotide bias. This study will be useful for future work related to genome annotation projects and can inspire research efforts aimed to better understand regulatory mechanisms of transcription.
Affiliation(s)
- Sunita Kumari
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Doreen Ware
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- United States Department of Agriculture-Agriculture Research Service, Robert W. Holley Center for Agriculture and Health, Ithaca, New York, United States of America
13
Kenigsberg E, Tanay A. Drosophila functional elements are embedded in structurally constrained sequences. PLoS Genet 2013; 9:e1003512. [PMID: 23750124 PMCID: PMC3671938 DOI: 10.1371/journal.pgen.1003512] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/31/2012] [Accepted: 03/04/2013] [Indexed: 12/22/2022]
Abstract
Modern functional genomics uncovered numerous functional elements in metazoan genomes. Nevertheless, only a small fraction of the typical non-exonic genome contains elements that code for function directly. On the other hand, a much larger fraction of the genome is associated with significant evolutionary constraints, suggesting that much of the non-exonic genome is weakly functional. Here we show that in flies, local (30–70 bp) conserved sequence elements that are associated with multiple regulatory functions serve as focal points to a pattern of punctuated regional increase in G/C nucleotide frequencies. We show that this pattern, which covers a region tenfold larger than the conserved elements themselves, is an evolutionary consequence of a shift in the balance between gain and loss of G/C nucleotides and that it is correlated with nucleosome occupancy across multiple classes of epigenetic state. Evidence for compensatory evolution and analysis of SNP allele frequencies show that the evolutionary regime underlying this balance shift is likely to be non-neutral. These data suggest that current gaps in our understanding of genome function and evolutionary dynamics are explicable by a model of sparse sequence elements directly encoding for function, embedded into structural sequences that help to define the local and global epigenomic context of such functional elements.
A key challenge in functional genomics is to predict evolutionary dynamics from functional annotation of the genome and vice versa. Modern epigenomic studies helped assign function to numerous new sequence elements, but left most of the genome essentially uncharacterized. Evolutionary genomics, on the other hand, consistently suggests that a much larger fraction of the un-annotated genome evolves under selective pressure. We hypothesize that this function-selection gap can be attributed to sequences that facilitate the physical organization of functional elements, such as transcription factor binding sites, within chromosomes. We exemplify this by studying in detail the sequences embedding small conserved elements (CEs) in Drosophila. We show that, while CEs have typically high AT content, high GC content levels around them are maintained by a non-neutral evolutionary balance between gain and loss of GC nucleotides. This non-uniform pattern is highly correlated with nucleosome organization around CEs, potentially imposing an evolutionary constraint on as much as one quarter of the genome. We suggest this can at least partly explain the above function-selection gap. Weak evolutionary constraints on “structural” sequences (at scales ranging from one nucleosome to recently described multi-megabase topological domains) may affect genome evolution just like structural motifs shape protein evolution.
Affiliation(s)
- Ephraim Kenigsberg
- Department of Computer Science and Applied Mathematics and Department of Biological Regulation, Weizmann Institute, Rehovot, Israel
- Amos Tanay
- Department of Computer Science and Applied Mathematics and Department of Biological Regulation, Weizmann Institute, Rehovot, Israel
|
14
|
Position-dependent correlations between DNA methylation and the evolutionary rates of mammalian coding exons. Proc Natl Acad Sci U S A 2012; 109:15841-6. [PMID: 23019368 DOI: 10.1073/pnas.1208214109] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 12/11/2022] Open
Abstract
DNA cytosine methylation is a central epigenetic marker that is usually mutagenic and may increase the level of sequence divergence. However, methylated genes have been reported to evolve more slowly than unmethylated genes. Hence, there is a controversy on whether DNA methylation is correlated with increased or decreased protein evolutionary rates. We hypothesize that this controversy has resulted from the differential correlations between DNA methylation and the evolutionary rates of coding exons in different genic positions. To test this hypothesis, we compare human-mouse and human-macaque exonic evolutionary rates against experimentally determined single-base resolution DNA methylation data derived from multiple human cell types. We show that DNA methylation is significantly related to within-gene variations in evolutionary rates. First, DNA methylation level is more strongly correlated with C-to-T mutations at CpG dinucleotides in the first coding exons than in the internal and last exons, although it is positively correlated with the synonymous substitution rate in all exon positions. Second, for the first exons, DNA methylation level is negatively correlated with exonic expression level, but positively correlated with both nonsynonymous substitution rate and the sample specificity of DNA methylation level. For the internal and last exons, however, we observe the opposite correlations. Our results imply that DNA methylation level is differentially correlated with the biological (and evolutionary) features of coding exons in different genic positions. The first exons appear more prone to the mutagenic effects, whereas the other exons are more influenced by the regulatory effects of DNA methylation.
|
15
|
R-loop formation is a distinctive characteristic of unmethylated human CpG island promoters. Mol Cell 2012; 45:814-25. [PMID: 22387027 DOI: 10.1016/j.molcel.2012.01.017] [Citation(s) in RCA: 574] [Impact Index Per Article: 47.8] [Received: 09/06/2011] [Revised: 12/08/2011] [Accepted: 01/10/2012] [Indexed: 12/31/2022]
Abstract
CpG islands (CGIs) function as promoters for approximately 60% of human genes. Most of these elements remain protected from CpG methylation, a prevalent epigenetic modification associated with transcriptional silencing. Here, we report that methylation-resistant CGI promoters are characterized by significant strand asymmetry in the distribution of guanines and cytosines (GC skew) immediately downstream from their transcription start sites. Using innovative genomics methodologies, we show that transcription through regions of GC skew leads to the formation of long R loop structures. Furthermore, we show that GC skew and R loop formation potential is correlated with and predictive of the unmethylated state of CGIs. Finally, we provide evidence that R loop formation protects from DNMT3B1, the primary de novo DNA methyltransferase in early development. Altogether, these results suggest that protection from DNA methylation is a built-in characteristic of the DNA sequence of CGI promoters that is revealed by the cotranscriptional formation of R loop structures.
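The GC skew mentioned in this abstract can be illustrated with a short sketch. The abstract does not give a formula, so the code below assumes the commonly used definition, (G − C)/(G + C), computed on one strand. The function names, window size and step are hypothetical choices for illustration, not the authors' method.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one strand; 0.0 if the window has no G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int, step: int) -> list[float]:
    """GC skew in sliding windows of length `window`, advancing by `step` bases."""
    return [
        gc_skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

A positive value means the strand carries more G than C in that window. Scanning the region downstream of a transcription start site with `windowed_gc_skew` would show where such an asymmetry begins and ends.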
|
16
|
Arhondakis S, Auletta F, Bernardi G. Isochores and the regulation of gene expression in the human genome. Genome Biol Evol 2012; 3:1080-9. [PMID: 21979159 PMCID: PMC3227402 DOI: 10.1093/gbe/evr017] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 01/10/2023] Open
Abstract
It is well established that changes in the phenotype depend much more on changes in gene expression than on changes in protein-coding genes, and that cis-regulatory sequences and chromatin structure are two major factors influencing gene expression. Here, we investigated these factors at the genome-wide level by focusing on the trinucleotide patterns in the 0.1- to 25-kb regions flanking the human genes that are present in the GC-poorest L1 and GC-richest H3 isochore families, the other families exhibiting intermediate patterns. We could show 1) that the trinucleotide patterns of the 25-kb gene-flanking regions are representative of the very different patterns already reported for the whole isochores from the L1 and H3 families and, as expected, identical in upstream and downstream locations; 2) that the patterns of the 0.1- to 0.5-kb regions in the L1 and H3 isochores are remarkably more divergent and more specific when compared with those of the 25-kb regions, as well as different in the upstream and downstream locations; and 3) that these patterns fade into the 25-kb patterns around 5 kb in both upstream and downstream locations. The 25-kb findings indicate differences in nucleosome positioning and density in different isochore families, whereas those of the 0.1- to 0.5-kb sequences indicate differences in the transcription factors that bind upstream and downstream of genes. These results indicate differences in the regulation of genes located in different isochore families, a point of functional and evolutionary relevance.
Affiliation(s)
- Stilianos Arhondakis
- Bioinformatics and Medical Informatics Team, Biomedical Research Foundation of the Academy of Athens, Athens, Greece
|
17
|
McLean MA, Tirosh I. Opposite GC skews at the 5' and 3' ends of genes in unicellular fungi. BMC Genomics 2011; 12:638. [PMID: 22208287 PMCID: PMC3315797 DOI: 10.1186/1471-2164-12-638] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/15/2011] [Accepted: 12/30/2011] [Indexed: 11/24/2022] Open
Abstract
Background: GC-skews have previously been linked to transcription in some eukaryotes. They have been associated with transcription start sites, with the coding strand G-biased in mammals and C-biased in fungi and invertebrates. Results: We show a consistent and highly significant pattern of GC-skew within genes of almost all unicellular fungi. The pattern of GC-skew is asymmetrical: the coding strand of genes is typically C-biased at the 5' ends but G-biased at the 3' ends, with intermediate skews at the middle of genes. Thus, the initiation, elongation, and termination phases of transcription are associated with different skews. This pattern influences the encoded proteins by generating differential usage of amino acids at the 5' and 3' ends of genes. These biases also affect fourfold-degenerate positions and extend into promoters and 3' UTRs, indicating that skews cannot be accounted for by selection for protein function or translation. Conclusions: We propose two explanations, the mutational pressure hypothesis and the adaptive hypothesis. The mutational pressure hypothesis is that different co-factors bind to RNA pol II at different phases of transcription, producing different mutational regimes. The adaptive hypothesis is that cytidine triphosphate deficiency may lead to C-avoidance at the 3' ends of transcripts to control the flow of RNA pol II molecules and reduce their frequency of collisions.
Affiliation(s)
- Malcolm A McLean
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel.
|
18
|
Hu XS, Yeh FC, Wang Z. Structural genomics: correlation blocks, population structure, and genome architecture. Curr Genomics 2011; 12:55-70. [PMID: 21886455 PMCID: PMC3129043 DOI: 10.2174/138920211794520141] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/25/2010] [Revised: 01/06/2011] [Accepted: 01/06/2011] [Indexed: 11/27/2022] Open
Abstract
An integration of the pattern of genome-wide inter-site associations with evolutionary forces is important for gaining insights into the genomic evolution in natural or artificial populations. Here, we assess the inter-site correlation blocks and their distributions along chromosomes. A correlation block is broadly defined as the DNA segment within which strong correlations exist between genetic diversities at any two sites. We bring together the population genetic structure and the genomic diversity structure that have been independently built on different scales and synthesize the existing theories and methods for characterizing genomic structure at the population level. We discuss how population structure could shape correlation blocks and their patterns within and between populations. Effects of evolutionary forces (selection, migration, genetic drift, and mutation) on the pattern of genome-wide correlation blocks are discussed. We briefly discuss the associations between the pattern of correlation blocks and genome assembly features in eukaryote organisms, including the impacts of multigene families, the perturbation of transposable elements, and the repetitive nongenic sequences and GC-rich isochores. Our review suggests that the observable pattern of correlation blocks can refine our understanding of the ecological and evolutionary processes underlying the genomic evolution at the population level.
Affiliation(s)
- Xin-Sheng Hu
- 1400 College Plaza, Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB T6J 2C8, Canada
- Department of Renewable Resources, 751 General Service Building, University of Alberta, Edmonton, Alberta, T6G 2H1, Canada
- Francis C. Yeh
- Department of Renewable Resources, 751 General Service Building, University of Alberta, Edmonton, Alberta, T6G 2H1, Canada
- Zhiquan Wang
- 1400 College Plaza, Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB T6J 2C8, Canada
|
19
|
Abstract
CpG islands mark CpG-enriched regions in otherwise CpG-depleted vertebrate genomes. While the regulatory importance of CpG islands is widely accepted, it is little appreciated that CpG islands vary greatly in length. For example, CpG islands in the human genome vary ∼30-fold in length. Here we report findings suggesting that the lengths of CpG islands have functional consequences. Specifically, we show that promoters associated with long CpG islands (long-CGI promoters) are distinct from other promoters. First, long-CGI promoters are uniquely associated with genes with an intermediate level of gene expression breadth. Notably, intermediate expression breadths require the most complex mode of gene regulation, from the standpoint of information content. Second, long-CGI promoters encode more RNA polymerase II (Polr2a) binding sites than other promoters. Third, the actual binding patterns of Polr2a occur in a more tissue-specific manner in long-CGI promoters compared to other CGI promoters. Moreover, long-CGI promoters contain the largest numbers of experimentally characterized transcription start sites compared to other promoters, and the types of transcription start sites in them are biased toward tissue-specific patterns of gene expression. Finally, long-CGI promoters are preferentially associated with genes involved in development and regulation. Together, these findings indicate that functionally relevant variations of CpG islands exist. By investigating consequences of certain CpG island traits, we can gain additional insights into the mechanism and evolution of regulatory complexity of gene expression.
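To make "CpG-enriched" concrete, the sketch below computes the widely used observed/expected CpG ratio, (number of CpG × N) / (number of C × number of G), together with G+C content. The abstract does not state which criteria were used to call CpG islands, so this is an illustrative assumption, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G); 0.0 if C or G is absent."""
    s = seq.upper()
    c, g = s.count("C"), s.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    return s.count("CG") * len(s) / (c * g)
```

A ratio near 1 means CpG dinucleotides occur about as often as the base composition predicts, while values well below 1 indicate CpG depletion.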
|
20
|
Medvedeva YA, Kulakovskii IV, Oparina NY, Favorov AV, Makeev VY. The GC skew near Pol II start sites and its association with SP1-binding site variants. Biophysics (Nagoya-shi) 2010. [DOI: 10.1134/s0006350910060023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022] Open
|
21
|
Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]
|
22
|
Sharif J, Endo TA, Toyoda T, Koseki H. Divergence of CpG island promoters: A consequence or cause of evolution? Dev Growth Differ 2010; 52:545-54. [DOI: 10.1111/j.1440-169x.2010.01193.x] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Indexed: 12/28/2022]
|
23
|
Polak P, Querfurth R, Arndt PF. The evolution of transcription-associated biases of mutations across vertebrates. BMC Evol Biol 2010; 10:187. [PMID: 20565875 PMCID: PMC2927911 DOI: 10.1186/1471-2148-10-187] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 07/07/2009] [Accepted: 06/18/2010] [Indexed: 02/03/2024] Open
Abstract
Background: The interplay between transcription and mutational processes can lead to particular mutation patterns in transcribed regions of the genome. Transcription introduces several biases in mutational patterns; in particular it invokes strand-specific mutations. In order to understand the forces that have shaped transcripts during evolution, one has to study mutation patterns associated with transcription across animals. Results: Using multiple alignments of related species we estimated the regional single-nucleotide substitution patterns along genes in four vertebrate taxa: primates, rodents, laurasiatheria and bony fishes. Our analysis is focused on intronic and intergenic regions and reveals differences in the patterns of substitution asymmetries between mammals and fishes. In mammals, the levels of asymmetries are stronger for genes starting within CpG islands than in genes lacking this property. In contrast to all other species analyzed, we found a mutational pressure in dog and stickleback, promoting an increase of GC content in the proximity of transcriptional start sites. Conclusions: We propose that the asymmetric patterns in transcribed regions are results of transcription-associated mutagenic processes and transcription-coupled repair, which both seem to evolve in a taxon-related manner. We also discuss alternative mechanisms that can generate strand biases and involve error-prone DNA polymerases and reverse transcription. A localized increase of the GC content near the transcription start site is a signature of biased gene conversion (BGC) that occurs during recombination and heteroduplex formation. Since dog and stickleback are known to be subject to rapid adaptations due to population bottlenecks and breeding, we further hypothesize that an increase in recombination rates near gene starts has been part of an adaptive process.
Affiliation(s)
- Paz Polak
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
24
|
Tillo D, Hughes TR. G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics 2009; 10:442. [PMID: 20028554 PMCID: PMC2808325 DOI: 10.1186/1471-2105-10-442] [Citation(s) in RCA: 208] [Impact Index Per Article: 13.9] [Received: 06/15/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022] Open
Abstract
Background: The relative preference of nucleosomes to form on individual DNA sequences plays a major role in genome packaging. A wide variety of DNA sequence features are believed to influence nucleosome formation, including periodic dinucleotide signals, poly-A stretches and other short motifs, and sequence properties that influence DNA structure, including base content. It was recently shown by Kaplan et al. that a probabilistic model using composition of all 5-mers within a nucleosome-sized tiling window accurately predicts intrinsic nucleosome occupancy across an entire genome in vitro. However, the model is complicated, and it is not clear which specific DNA sequence properties are most important for intrinsic nucleosome-forming preferences. Results: We find that a simple linear combination of only 14 simple DNA sequence attributes (G+C content, two transformations of dinucleotide composition, and the frequency of eleven 4-bp sequences) explains nucleosome occupancy in vitro and in vivo in a manner comparable to the Kaplan model. G+C content and frequency of AAAA are the most important features. G+C content is dominant, alone explaining ~50% of the variation in nucleosome occupancy in vitro. Conclusions: Our findings provide a dramatically simplified means to predict and understand intrinsic nucleosome occupancy. G+C content may dominate because it both reduces frequency of poly-A-like stretches and correlates with many other DNA structural characteristics. Since G+C content is enriched or depleted at many types of features in diverse eukaryotic genomes, our results suggest that variation in nucleotide composition may have a widespread and direct influence on chromatin structure.
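The two features this abstract singles out, G+C content and the frequency of AAAA, are simple to compute for a sequence window. The sketch below shows one plausible way to do so. The function names and the per-position normalisation of the k-mer frequency are assumptions, and no model coefficients are given because the abstract does not report them.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the window (0.0 for an empty window)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0


def kmer_frequency(seq: str, kmer: str = "AAAA") -> float:
    """Overlapping occurrences of `kmer` divided by the number of possible start positions."""
    s, k = seq.upper(), len(kmer)
    positions = len(s) - k + 1
    if positions <= 0:
        return 0.0
    hits = sum(1 for i in range(positions) if s[i:i + k] == kmer)
    return hits / positions
```

Computed over nucleosome-sized windows, these values could serve as two of the inputs to a linear model of the kind described, with the weights fitted to occupancy data.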
Affiliation(s)
- Desiree Tillo
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada.
|
25
|
Polak P, Arndt PF. Long-range bidirectional strand asymmetries originate at CpG islands in the human genome. Genome Biol Evol 2009; 1:189-97. [PMID: 20333189 PMCID: PMC2817419 DOI: 10.1093/gbe/evp024] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Accepted: 07/22/2009] [Indexed: 12/24/2022] Open
Abstract
In the human genome, CpG islands (CGIs), which are GC- and CpG-rich sequences, are associated with transcription start sites (TSSs); in addition, there is evidence that CGIs harbor origins of bidirectional replication (OBRs) and are preferred sites for heteroduplex formation during recombination. Transcription, replication, and recombination processes are known to induce specific mutational patterns in various genomes, and therefore, these patterns are expected to be found around CGIs. We use triple alignments of human, chimp, and macaque to compute the rates of nucleotide substitutions in up to 1 Mbp long intergenic regions on both sides of CGIs. Our analysis revealed that around a CGI there is an asymmetry between complementary substitution rates that is similar to the one found around the OBR in bacteria. We hypothesize that these asymmetries are induced by differences in the replication of the leading and lagging strand and that a significant number of CGIs overlap OBRs. Within CGIs, we observed a mutational signature of GC-biased gene conversion that is associated with recombination. We suggest that recombination has played a major role in the creation of CGIs.
Affiliation(s)
- Paz Polak
- Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
26
|
Expression Vector Engineering for Recombinant Protein Production. 2009. [DOI: 10.1007/978-90-481-2245-5_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
|
27
|
Zhang B, Stellwag EJ, Pan X. Large-scale genome analysis reveals unique features of microRNAs. Gene 2009; 443:100-9. [PMID: 19422892 DOI: 10.1016/j.gene.2009.04.027] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.9] [Received: 01/07/2009] [Revised: 04/27/2009] [Accepted: 04/29/2009] [Indexed: 02/07/2023]
Abstract
Although great progress has been made in identifying microRNAs (miRNAs) and their functions, their essential functional features remain largely unknown. In this study, we systematically investigated the nucleotide and thermodynamic folding distribution characteristics of 3853 miRNAs currently reported for metazoans. We determined that uracil is the dominant nucleotide in both mature and precursor sequences, and that it is particularly enriched at three sites in mature miRNAs: the first, ninth, and the five terminal 3' nucleotides. The location of these enriched uracil nucleotides is particularly interesting because positions one and nine are the edges of the "seed region", which is responsible for targeting mRNAs for gene regulation. The prevalence of U residues at these sites may contribute to the mechanism whereby miRNAs target and bind to their corresponding mRNAs. A comparison of the overall lengths of metazoan pre-miRNAs revealed that they ranged from 53 to 215 nt in length with an average of 88.10 ± 14.14 nt, significantly higher than previously reported. Comparisons of miRNA diversity at different taxonomic levels revealed that the 12 features investigated in this study varied significantly among miRNAs represented by different phyla, with particularly high levels of divergence in platyhelminths relative to nematodes, arthropods or vertebrates. By comparison, lower levels of diversity were observed at lower taxonomic levels such that there was a direct relationship between divergence in miRNA features and taxonomic level. We conclude that large-scale genome analysis shows that miRNAs have many more unique features than previously reported. In particular, the distribution of nucleotides suggests an important role for uracil at the boundaries of the "seed" region and at their termini. These results will facilitate the design of new computational programs for identifying novel miRNAs and investigating the mechanism of miRNA-mediated gene regulation.
Affiliation(s)
- Baohong Zhang
- Department of Biology, East Carolina University, Greenville, NC 27858, USA.
|
28
|
Ma X, Li-Ling J, Huang Q, Chen X, Hou L, Ma F. Systematic analysis of alternative promoters correlated with alternative splicing in human genes. Genomics 2009; 93:420-5. [PMID: 19442634 DOI: 10.1016/j.ygeno.2009.01.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 08/25/2008] [Revised: 01/22/2009] [Accepted: 01/28/2009] [Indexed: 11/17/2022]
Abstract
Interactions between various events are essential for complex and delicate transcriptional regulation. To delineate the features and potential roles of alternative promoters (APs) correlated with alternative splicing (AS), we have systematically analyzed 9908 putative alternative promoters (PAPs) from 3797 human genes. Our results showed that approximately 65% of AS events are associated with PAPs. Intriguingly, PAPs per human AS gene only averaged 2.6 for our dataset, which was significantly lower than previously reported. This seems to imply that the human genome contains a small pool of appropriable PAPs for AS genes. Exploration of the characteristics of PAPs such as CpG islands, TATA boxes, GC-content, transcription factor binding sites (TFBSs) and repetitive elements suggested that, respectively, 87% and 90% of PAPs of human AS genes are CpG- and TATA box-poor. The GC-content is significantly higher downstream of transcription start sites (TSSs) than upstream (58% vs. 53%), and there is a strong negative correlation between the GC-content and the number of PAPs. These findings suggest that GC-content around the TSSs plays an important role in the regulation of AS. Moreover, different APs contain distinct densities of repetitive elements and TFBSs, indicating that such sequences have an intrinsic role in the divergent regulation of PAPs and AS. Substantial differences were also found between human AS genes in terms of PAP numbers. A close connection between PAPs and AS may play a critical role in the choice of APs and regulation of AS genes. Furthermore, the distribution of AS genes on different human chromosomes also influences the numbers of PAPs and isoforms of AS genes. Our results may provide important clues for further studies on the regulatory network of transcription-related events.
Affiliation(s)
- Xiaojuan Ma
- College of Life Science, Liaoning Normal University, Dalian 116029, China
|
29
|
Comparative analysis of distinct non-coding characteristics potentially contributing to the divergence of human tissue-specific genes. Genetica 2008; 136:127-34. [DOI: 10.1007/s10709-008-9323-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 10/04/2007] [Accepted: 08/25/2008] [Indexed: 10/21/2022]
|
30
|
Abeel T, Saeys Y, Rouzé P, Van de Peer Y. ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. Bioinformatics 2008; 24:i24-31. [PMID: 18586720 PMCID: PMC2718650 DOI: 10.1093/bioinformatics/btn172] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Indexed: 01/19/2023] Open
Abstract
Motivation: More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Because the identification of the transcription initiation region is such a challenging problem, it is not yet a common practice to integrate transcription start site prediction in genome annotation projects. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work. Results: Comparing the average structural profile based on base stacking energy of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as the software currently available, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, resulting in a better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, which demonstrates the high precision. Availability: Predictions for the human genome, the validation datasets and the program (ProSOM) are available upon request.
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
|
31
|
Abstract
A regional analysis of nucleotide substitution rates along human genes and their flanking regions allows us to quantify the effect of mutational mechanisms associated with transcription in germ line cells. Our analysis reveals three distinct patterns of substitution rates. First, a sharp decline in the deamination rate of methylated CpG dinucleotides is observed in the vicinity of the 5' end of genes. Second, a strand asymmetry in complementary substitution rates, associated with transcription-coupled repair, extends from the 5' end to 1 kbp downstream from the 3' end. Finally, a localized strand asymmetry, an excess of C→T over G→A substitutions in the nontemplate strand, is confined to the first 1-2 kbp downstream of the 5' end of genes. We hypothesize that higher exposure of the nontemplate strand near the 5' end of genes leads to a higher cytosine deamination rate. Up to now, only the somatic hypermutation (SHM) pathway has been known to mediate localized and strand-specific mutagenic processes associated with transcription in mammals. The mutational patterns in SHM are induced by cytosine deaminase, which targets only single-stranded DNA. This DNA conformation is induced by R-loops, which preferentially occur at the 5' ends of genes. We predict that R-loops are extensively formed in the beginning of transcribed regions in germ line cells.
|
32
|
Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res 2008; 18:1180-9. [PMID: 18411406 DOI: 10.1101/gr.076117.108] [Citation(s) in RCA: 143] [Impact Index Per Article: 8.9] [Indexed: 12/28/2022]
Abstract
We present a threefold contribution to the computational task of motif discovery, a key component in the effort of delineating the regulatory map of a genome: (1) We constructed a comprehensive large-scale, publicly-available compendium of transcription factor and microRNA target gene sets derived from diverse high-throughput experiments in several metazoans. We used the compendium as a benchmark for motif discovery tools. (2) We developed Amadeus, a highly efficient, user-friendly software platform for genome-scale detection of novel motifs, applicable to a wide range of motif discovery tasks. Amadeus improves upon extant tools in terms of accuracy, running time, output information, and ease of use and is the only program that attained a high success rate on the metazoan compendium. (3) We demonstrate that by searching for motifs based on their genome-wide localization or chromosomal distributions (without using a predefined target set), Amadeus uncovers diverse known phenomena, as well as novel regulatory motifs.
|
33
|
Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008; 18:310-23. [PMID: 18096745 PMCID: PMC2203629 DOI: 10.1101/gr.6991408] [Citation(s) in RCA: 133] [Impact Index Per Article: 8.3] [Received: 08/03/2007] [Accepted: 11/14/2007] [Indexed: 11/24/2022]
Abstract
Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Yvan Saeys
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Eric Bonnet
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Pierre Rouzé
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Laboratoire Associé de l’INRA (France), Ghent University, 9052 Gent, Belgium
- Yves Van de Peer
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
|
34
|
Evans KJ. Genomic DNA from animals shows contrasting strand bias in large and small subsequences. BMC Genomics 2008; 9:43. [PMID: 18221531 PMCID: PMC2267173 DOI: 10.1186/1471-2164-9-43] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/07/2007] [Accepted: 01/25/2008] [Indexed: 01/09/2023] Open
Abstract
Background For eukaryotes, there is almost no strand bias with regard to base composition, with exceptions for origins of replication and transcription start sites and transcribed regions. This paper revisits the question for subsequences of DNA taken at random from the genome. Results For a typical mammal, for example mouse or human, there is a small strand bias throughout the genomic DNA: there is a correlation between (G - C) and (A - T) on the same strand, (that is between the difference in the number of guanine and cytosine bases and the difference in the number of adenine and thymine bases). For small subsequences – up to 1 kb – this correlation is weak but positive; but for large windows – around 50 kb to 2 Mb – the correlation is strong and negative. This effect is largely independent of GC%. Transcribed and untranscribed regions give similar correlations both for small and large subsequences, but there is a difference in these regions for intermediate sized subsequences. An analysis of the human genome showed that position within the isochore structure did not affect these correlations. An analysis of available genomes of different species shows that this contrast between large and small windows is a general feature of mammals and birds. Further down the evolutionary tree, other organisms show a similar but smaller effect. Except for the nematode, all the animals analysed showed at least a small effect. Conclusion The correlations on the large scale may be explained by DNA replication. Transcription may be a modifier of these effects but is not the fundamental cause. These results cast light on how DNA mutations affect the genome over evolutionary time. At least for vertebrates, there is a broad relationship between body temperature and the size of the correlation. The genome of mammals and birds has a structure marked by strand bias segments.
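The correlation described in this abstract, between (G - C) and (A - T) counted on one strand across many subsequences, can be illustrated with a minimal Python sketch. This is not the author's code: the function names are hypothetical, the windows are taken as non-overlapping tiles rather than random samples, and only unambiguous A/C/G/T characters are counted.

```python
from math import sqrt


def skews(window):
    """Return the (G - C, A - T) base-count differences for one window."""
    w = window.upper()
    return w.count("G") - w.count("C"), w.count("A") - w.count("T")


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def strand_bias_correlation(sequence, window_size):
    """Correlate (G - C) with (A - T) over non-overlapping windows.

    Assumption: tiling windows stand in for the randomly drawn
    subsequences mentioned in the abstract.
    """
    windows = [sequence[i:i + window_size]
               for i in range(0, len(sequence) - window_size + 1, window_size)]
    if len(windows) < 2:
        raise ValueError("need at least two full windows")
    gc_skew, at_skew = zip(*(skews(w) for w in windows))
    return pearson(gc_skew, at_skew)
```

Running `strand_bias_correlation` on the same sequence with a small and a large `window_size` is the comparison the abstract reports; the window sizes to use (up to 1 kb versus 50 kb to 2 Mb) come from the abstract, everything else here is illustrative.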
Affiliation(s)
- Kenneth J Evans
- School of Crystallography, Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK.
|
35
|
Evans KJ. Strand bias structure in mouse DNA gives a glimpse of how chromatin structure affects gene expression. BMC Genomics 2008; 9:16. [PMID: 18194530 PMCID: PMC2266913 DOI: 10.1186/1471-2164-9-16] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/16/2007] [Accepted: 01/14/2008] [Indexed: 12/20/2022] Open
Abstract
Background On a single strand of genomic DNA the number of As is usually about equal to the number of Ts (and similarly for Gs and Cs), but deviations have been noted for transcribed regions and origins of replication. Results The mouse genome is shown to have a segmented structure defined by strand bias. Transcription is known to cause a strand bias and numerous analyses are presented to show that the strand bias in question is not caused by transcription. However, these strand bias segments influence the position of genes and their unspliced length. The position of genes within the strand bias structure affects the probability that a gene is switched on and its expression level. Transcription has a highly directional flow within this structure and the peak volume of transcription is around 20 kb from the A-rich/T-rich segment boundary on the T-rich side, directed away from the boundary. The A-rich/T-rich boundaries are SATB1 binding regions, whereas the T-rich/A-rich boundary regions are not. Conclusion The direct cause of the strand bias structure may be DNA replication. The strand bias segments represent a further biological feature, the chromatin structure, which in turn influences the ease of transcription.
Affiliation(s)
- Kenneth J Evans
- School of Crystallography, Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK.
|
36
|
DNA sequence and structural properties as predictors of human and mouse promoters. Gene 2007; 410:165-76. [PMID: 18234453 PMCID: PMC2672154 DOI: 10.1016/j.gene.2007.12.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Received: 05/23/2007] [Revised: 11/30/2007] [Accepted: 12/05/2007] [Indexed: 11/21/2022]
Abstract
Promoters play a central role in gene regulation, yet our power to discriminate them from non-promoter sequences in higher eukaryotes is mainly restricted to those associated with CpG islands. Here, we examined in silico the promoters of 30,954 human and 18,083 mouse transcripts in the DBTSS database, to assess the impact of particular sequence and structural features (propeller twist, bendability and nucleosome positioning preference) on promoter classification and prediction. Our analysis showed that a stricter-than-traditional definition of CpG islands captures low and high CpG count promoter classes more accurately than the traditional one. We observed that both human and mouse promoter sequences are flexible with the exception of the TATA box and TSS, which are rigid regions irrespective of association with a CpG island. Therefore varying levels of structural flexibility in promoters may affect their accessibility to proteins, and hence their specificity. For all features investigated, averaged values across core promoters discriminated CpG island associated promoters from background, whereas the same did not hold for promoters without a CpG island. However, local changes around - 34 to - 23 (expected position of TATA box) and the TSS were informative in discriminating promoters (both classes) from non-promoter sequences. Additionally, we investigated ATG deserts and observed that they occur in all promoter sets except those with a TATA-box and without a CpG island in human. Interestingly, all mouse promoter sets showed ATG codon depletion irrespective of the presence of a TATA-box, possibly reflecting a weaker contribution to TSS specificity in mouse.
|
37
|
Gazave E, Marqués-Bonet T, Fernando O, Charlesworth B, Navarro A. Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol 2007; 8:R21. [PMID: 17309804 PMCID: PMC1852421 DOI: 10.1186/gb-2007-8-2-r21] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.1] [Received: 08/02/2006] [Revised: 12/08/2006] [Accepted: 02/19/2007] [Indexed: 01/08/2023] Open
Abstract
An analysis of human-chimpanzee intron divergence shows strong correlations between intron length and divergence and GC-content. Background Introns, which constitute the largest fraction of eukaryotic genes and which had been considered to be neutral sequences, are increasingly acknowledged as having important functions. Several studies have investigated levels of evolutionary constraint along introns and across classes of introns of different length and location within genes. However, thus far these studies have yielded contradictory results. Results We present the first analysis of human-chimpanzee intron divergence, in which differences in the number of substitutions per intronic site (Ki) can be interpreted as the footprint of different intensities and directions of the pressures of natural selection. Our main findings are as follows: there was a strong positive correlation between intron length and divergence; there was a strong negative correlation between intron length and GC content; and divergence rates vary along introns and depending on their ordinal position within genes (for instance, first introns are more GC rich, longer and more divergent, and divergence is lower at the 3' and 5' ends of all types of introns). Conclusion We show that the higher divergence of first introns is related to their larger size. Also, the lower divergence of short introns suggests that they may harbor a relatively greater proportion of regulatory elements than long introns. Moreover, our results are consistent with the presence of functionally relevant sequences near the 5' and 3' ends of introns. Finally, our findings suggest that other parts of introns may also be under selective constraints.
Affiliation(s)
- Elodie Gazave
- Unitat de Biologia Evolutiva, Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Carrer Dr Aiguader 88, 08003 Barcelona, Catalonia, Spain
- Tomàs Marqués-Bonet
- Unitat de Biologia Evolutiva, Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Carrer Dr Aiguader 88, 08003 Barcelona, Catalonia, Spain
- Olga Fernando
- Unitat de Biologia Evolutiva, Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Carrer Dr Aiguader 88, 08003 Barcelona, Catalonia, Spain
- Instituto de Tecnologia Química e Biológica (ITQB), Universidade Nova de Lisboa, Av. da República (EAN) 2781-901 Oeiras, Lisboa, Portugal
- Brian Charlesworth
- Institute of Evolutionary Biology, University of Edinburgh, West Mains Road, Edinburgh, Scotland, EH7 3JT, UK
- Arcadi Navarro
- Institucio Catalana de Recerca i Estudis Avancats (ICREA), Unitat de Biologia Evolutiva, Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Carrer Dr Aiguader 88, 08003 Barcelona, Catalonia, Spain
|
38
|
Jiang C, Han L, Su B, Li WH, Zhao Z. Features and Trend of Loss of Promoter-Associated CpG Islands in the Human and Mouse Genomes. Mol Biol Evol 2007; 24:1991-2000. [PMID: 17591602 DOI: 10.1093/molbev/msm128] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.3] [Indexed: 12/28/2022] Open
Abstract
CpG islands (CGIs) are often considered as gene markers, but the number of CGIs varies among mammalian genomes that have similar numbers of genes. In this study, we investigated the distribution of CGIs in the promoter regions of 3,197 human-mouse orthologous gene pairs and found that the mouse genome has notably fewer CGIs in the promoter regions and less pronounced CGI characteristics than does the human genome. We further inferred CGI's ancestral state using the dog genome as a reference and examined the nucleotide substitution pattern and the mutational direction in the conserved regions of human and mouse CGIs. The results reveal many losses of CGIs in both genomes but the loss rate in the mouse lineage is two to four times the rate in the human lineage. We found an intriguing feature of CGI loss, namely that the loss of a CGI usually starts from erosion at the both edges and gradually moves towards the center. We found functional bias in the genes that have lost promoter-associated CGIs in the human or mouse lineage. Finally, our analysis indicates that the association of CGIs with housekeeping genes is not as strong as previously estimated. Our study provides a detailed view of the evolution of promoter-associated CGIs in the human and mouse genomes and our findings are helpful for understanding the evolution of mammalian genomes and the role of CGIs in gene function.
Affiliation(s)
- Cizhong Jiang
- Department of Psychiatry and Center for the Study of Biological Complexity, Virginia Commonwealth, USA
|
39
|
Kim TM, Chung YJ, Rhyu MG, Jung MH. Germline methylation patterns inferred from local nucleotide frequency of repetitive sequences in the human genome. Mamm Genome 2007; 18:277-85. [PMID: 17514347 DOI: 10.1007/s00335-007-9016-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 02/01/2007] [Accepted: 03/12/2007] [Indexed: 12/31/2022]
Abstract
Given the genomic abundance and susceptibility to DNA methylation, interspersed repetitive sequences in the human genome can be exploited as valuable resources in genome-wide methylation studies. To learn about the relationships between DNA methylation and repeat sequences, we performed a global measurement of CpG dinucleotide frequencies for interspersed repetitive sequences and inferred germline methylation patterns in the human genome. Although extensive CpG depletion was observed for most repeat sequences, those in the proximity to CpG islands have been relatively removed from germline methylation being the potential source of germline activation. We also investigated the CpG depletion patterns of Alu pairs to see whether they might play an active role in germline methylation. Two kinds of Alu pairs, direct or inverted pairs classified according to the orientation, showed contrast CpG depletion patterns with respect to separating distance of Alus, i.e., as two Alu elements are more closely spaced in a pair, a higher extent of CpG depletion was observed in inverted orientation and vice versa for directly repetitive Alu pairs. This suggests that specific organization of repetitive sequences, such as inverted Alu pairs, might play a role in triggering DNA methylation consistent with a homology-dependent methylation hypothesis.
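CpG depletion of the kind discussed in this abstract is commonly quantified with the observed/expected CpG ratio. The Python sketch below shows that standard ratio as an illustration only; the abstract does not state which statistic the authors used, and the function name is hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).

    Values well below 1 indicate CpG depletion. Returns 0.0 when the
    sequence has no C or no G, since the expectation is then zero.
    """
    s = seq.upper()
    c_count, g_count = s.count("C"), s.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number.
    cpg_count = s.count("CG")
    return cpg_count * len(s) / (c_count * g_count)
```

Applied to each repeat copy, a low ratio would be read as a sign of past germline methylation (methylated CpGs tend to mutate to TpG/CpA), while copies near CpG islands would be expected to keep ratios closer to 1.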
Affiliation(s)
- Tae-Min Kim
- Division of Metabolic Disease, Center for Biomedical Science, National Institute of Health, Nokbun-dong 5, Eunpyung-gu, Seoul 122-701, Korea
|
40
|
Montgomery SB, Griffith OL, Schuetz JM, Brooks-Wilson A, Jones SJM. A survey of genomic properties for the detection of regulatory polymorphisms. PLoS Comput Biol 2007; 3:e106. [PMID: 17559298 PMCID: PMC1892352 DOI: 10.1371/journal.pcbi.0030106] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 10/16/2006] [Accepted: 04/25/2007] [Indexed: 11/18/2022] Open
Abstract
Advances in the computational identification of functional noncoding polymorphisms will aid in cataloging novel determinants of health and identifying genetic variants that explain human evolution. To date, however, the development and evaluation of such techniques has been limited by the availability of known regulatory polymorphisms. We have attempted to address this by assembling, from the literature, a computationally tractable set of regulatory polymorphisms within the ORegAnno database (http://www.oreganno.org). We have further used 104 regulatory single-nucleotide polymorphisms from this set and 951 polymorphisms of unknown function, from 2-kb and 152-bp noncoding upstream regions of genes, to investigate the discriminatory potential of 23 properties related to gene regulation and population genetics. Among the most important properties detected in this region are distance to transcription start site, local repetitive content, sequence conservation, minor and derived allele frequencies, and presence of a CpG island. We further used the entire set of properties to evaluate their collective performance in detecting regulatory polymorphisms. Using a 10-fold cross-validation approach, we were able to achieve a sensitivity and specificity of 0.82 and 0.71, respectively, and we show that this performance is strongly influenced by the distance to the transcription start site.
Affiliation(s)
- Stephen B Montgomery
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada.
|
41
|
Müller F, Demény MA, Tora L. New problems in RNA polymerase II transcription initiation: matching the diversity of core promoters with a variety of promoter recognition factors. J Biol Chem 2007; 282:14685-9. [PMID: 17395580 DOI: 10.1074/jbc.r700012200] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Indexed: 01/05/2023] Open
Affiliation(s)
- Ferenc Müller
- Institute of Toxicology and Genetics, Forschungszentrum, Karlsruhe, D-76021 Germany.
|
42
|
Reddy DA, Mitra CK. Comparative analysis of transcription start sites using mutual information. Genomics Proteomics Bioinformatics 2007; 4:189-95. [PMID: 17127217 PMCID: PMC5054067 DOI: 10.1016/s1672-0229(06)60032-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/21/2022]
Abstract
The transcription start site (TSS) region shows greater variability compared with other promoter elements. We are interested to search for its variability by using information content as a measure. We note in this study that the variability is significant in the block of 5 nucleotides (nt) surrounding the TSS region compared with the block of 15 nt. This suggests that the actual region that may be involved is in the range of 5-10 nt in size. For Escherichia coli, we note that the information content from dinucleotide substitution matrices clearly shows a better discrimination, suggesting the presence of some correlations. However, for human this effect is much less, and for mouse it is practically absent. We can conclude that the presence of short-range correlations within the TSS region is species-dependent and is not universal. We further observe that there are other variable regions in the mitochondrial control element apart from TSS. It is also noted that effective comparisons can only be made on blocks, while single nucleotide comparisons do not give us any detectable signals.
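To make the idea of measuring variability by information content over a block of positions concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the authors' method: it uses per-column Shannon information (2 bits minus entropy, uniform background, no small-sample correction) on pre-aligned sequences, whereas the abstract refers to substitution matrices and mutual information. All names are hypothetical.

```python
from collections import Counter
from math import log2


def column_information(column):
    """Information content (bits) of one alignment column of A/C/G/T.

    2.0 means fully conserved; 0.0 means all four bases equally frequent.
    """
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((k / total) * log2(k / total) for k in counts.values())
    return 2.0 - entropy


def block_information(aligned, start, width):
    """Sum of column information over positions start .. start + width - 1.

    `aligned` is a list of equal-length sequences aligned on the TSS.
    """
    return sum(column_information([seq[i] for seq in aligned])
               for i in range(start, start + width))
```

Comparing `block_information` for a 5-position block and a 15-position block centred on the same coordinate mirrors the block-wise comparison the abstract describes; a lower value per position means a more variable block.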
|
43
|
Appanah R, Dickerson DR, Goyal P, Groudine M, Lorincz MC. An unmethylated 3' promoter-proximal region is required for efficient transcription initiation. PLoS Genet 2007; 3:e27. [PMID: 17305432 PMCID: PMC1797817 DOI: 10.1371/journal.pgen.0030027] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.1] [Received: 09/01/2006] [Accepted: 12/28/2006] [Indexed: 11/30/2022] Open
Abstract
The promoter regions of approximately 40% of genes in the human genome are embedded in CpG islands, CpG-rich regions that frequently extend on the order of one kb 3′ of the transcription start site (TSS) region. CpGs 3′ of the TSS of actively transcribed CpG island promoters typically remain methylation-free, indicating that maintaining promoter-proximal CpGs in an unmethylated state may be important for efficient transcription. Here we utilize recombinase-mediated cassette exchange to introduce a Moloney Murine Leukemia Virus (MoMuLV)-based reporter, in vitro methylated 1 kb downstream of the TSS, into a defined genomic site. In a subset of clones, methylation spreads to within ∼320 bp of the TSS, yielding a dramatic decrease in transcript level, even though the promoter/TSS region remains unmethylated. Chromatin immunoprecipitation analyses reveal that such promoter-proximal methylation results in loss of RNA polymerase II and TATA-box-binding protein (TBP) binding in the promoter region, suggesting that repression occurs at the level of transcription initiation. While DNA methylation-dependent trimethylation of H3 lysine (K)9 is confined to the intragenic methylated region, the promoter and downstream regions are hypo-acetylated on H3K9/K14. Furthermore, DNase I hypersensitivity and methylase-based single promoter analysis (M-SPA) experiments reveal that a nucleosome is positioned over the unmethylated TATA-box in these clones, indicating that dense DNA methylation downstream of the promoter region is sufficient to alter the chromatin structure of an unmethylated promoter. Based on these observations, we propose that a DNA methylation-free region extending several hundred bases downstream of the TSS may be a prerequisite for efficient transcription initiation. This model provides a biochemical explanation for the typical positioning of TSSs well upstream of the 3′ end of the CpG islands in which they are embedded. 
Genes, the functional units of heredity, are made up of DNA, which is packaged inside the nuclei of eukaryotic cells in association with a number of proteins in a structure called chromatin. In order for transcription, the process of transferring genetic information from DNA to RNA, to take place, chromatin must be decondensed to allow the transcription machinery to bind the genes that are to be transcribed. In mammals, promoters, the starting position of genes, are frequently embedded in “CpG islands,” regions with a relatively high density of the CpG dinucleotide. Paradoxically, while cytosines in the context of the CpG dinucleotide are generally methylated, CpGs flanking the start sites of genes typically remain methylation-free. As CpG methylation is associated with condensed chromatin, it is generally believed that promoter regions must remain free of methylation to allow for binding of the transcription machinery. Here, using a novel method for introducing methylated DNA into a defined genomic site, we demonstrate that DNA methylation in the promoter-proximal region of a gene is sufficient to block transcription via the generation of a chromatin structure that inhibits binding of the transcription machinery. Thus, methylation may inhibit transcription even when present outside the promoter region.
Affiliation(s)
- Ruth Appanah
- Department of Medical Genetics, The University of British Columbia, Vancouver, British Columbia, Canada
- David R Dickerson
- Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Preeti Goyal
- Department of Medical Genetics, The University of British Columbia, Vancouver, British Columbia, Canada
- Mark Groudine
- Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Department of Radiation Oncology, University of Washington School of Medicine, Seattle, Washington, United States of America
- Matthew C Lorincz
- Department of Medical Genetics, The University of British Columbia, Vancouver, British Columbia, Canada
- * To whom correspondence should be addressed.
|
44
|
Down TA, Bergman CM, Su J, Hubbard TJP. Large-scale discovery of promoter motifs in Drosophila melanogaster. PLoS Comput Biol 2006; 3:e7. [PMID: 17238282 PMCID: PMC1779301 DOI: 10.1371/journal.pcbi.0030007] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.4] [Received: 08/17/2006] [Accepted: 12/01/2006] [Indexed: 11/28/2022] Open
Abstract
A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.
In contrast to the genomic sequences that encode proteins, little is known about the regulatory elements that instruct the cell as to when and where a given gene should be active. Regulatory elements are thought to consist of clusters of short DNA words (motifs), each of which acts as a binding site for sequence-specific DNA binding protein. Thus, building a comprehensive dictionary of such motifs is an important step towards a broader understanding of gene regulation. Using the recently published NestedMICA method for detecting overrepresented motifs in a set of sequences, we build a dictionary of 120 motifs from regulatory sequences in the fruitfly genome, 87 of which are novel. Analysis of positional biases, conservation across species, and association with specific patterns of gene expression in fruitfly embryos suggest that the great majority of these newly discovered motifs represent functional regulatory elements. In addition to providing an initial motif dictionary for one of the most intensively studied model organisms, this work provides an analytical framework for the comprehensive discovery of regulatory motifs in complex animal genomes.
Affiliation(s)
- Thomas A Down
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.
|
45
|
Yang C, Bolotin E, Jiang T, Sladek FM, Martinez E. Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene 2006; 389:52-65. [PMID: 17123746 PMCID: PMC1955227 DOI: 10.1016/j.gene.2006.09.029] [Citation(s) in RCA: 255] [Impact Index Per Article: 14.2] [Received: 06/24/2006] [Revised: 09/12/2006] [Accepted: 09/22/2006] [Indexed: 10/24/2022]
Abstract
The core promoter of eukaryotic genes is the minimal DNA region that recruits the basal transcription machinery to direct efficient and accurate transcription initiation. The fraction of human and yeast genes that contain specific core promoter elements such as the TATA box and the initiator (INR) remains unclear and core promoter motifs specific for TATA-less genes remain to be identified. Here, we present genome-scale computational analyses indicating that approximately 76% of human core promoters lack TATA-like elements, have a high GC content, and are enriched in Sp1-binding sites. We further identify two motifs - M3 (SCGGAAGY) and M22 (TGCGCANK) - that occur preferentially in human TATA-less core promoters. About 24% of human genes have a TATA-like element and their promoters are generally AT-rich; however, only approximately 10% of these TATA-containing promoters have the canonical TATA box (TATAWAWR). In contrast, approximately 46% of human core promoters contain the consensus INR (YYANWYY) and approximately 30% are INR-containing TATA-less genes. Significantly, approximately 46% of human promoters lack both TATA-like and consensus INR elements. Surprisingly, mammalian-type INR sequences are present - and tend to cluster - in the transcription start site (TSS) region of approximately 40% of yeast core promoters and the frequency of specific core promoter types appears to be conserved in yeast and human genomes. Gene Ontology analyses reveal that TATA-less genes in humans, as in yeast, are frequently involved in basic "housekeeping" processes, while TATA-containing genes are more often highly regulated, such as by biotic or stress stimuli. These results reveal unexpected similarities in the occurrence of specific core promoter types and in their associated biological processes in yeast and humans and point to novel vertebrate-specific DNA motifs that might play a selective role in TATA-independent transcription.
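The consensus strings quoted in this abstract (TATAWAWR, YYANWYY, SCGGAAGY, TGCGCANK) use IUPAC nucleotide codes. As a rough illustration of how such a consensus can be scanned against a sequence, the Python sketch below expands the codes into a regular expression. It is a hypothetical helper, not the authors' pipeline: it searches the forward strand only and requires exact consensus matches.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(consensus, sequence):
    """Return 0-based start positions of all (overlapping) consensus matches."""
    body = "".join(IUPAC[ch] for ch in consensus.upper())
    # A lookahead keeps each match zero-width so overlapping hits are reported.
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

For example, `find_motif("TATAWAWR", "GGTATAAAAGCC")` reports a single hit starting at position 2. Counting promoters with at least one hit in a chosen window relative to the TSS would give fractions of the kind the abstract reports, although the window and mismatch rules used in the paper are not given here.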
Affiliation(s)
- Chuhu Yang
- Genetics Genomics and Bioinformatics Graduate Program, University of California, Riverside, CA 92521, USA
|
46
|
Abstract
TATA-binding protein-associated factor 1 (TAF1) is an essential component of the general transcription factor IID (TFIID), which nucleates assembly of the preinitiation complex for transcription by RNA polymerase II. TATA-binding protein and TAF1.TAF2 heterodimers are the only components of TFIID shown to bind specific DNA sequences (the TATA box and initiator, respectively), raising the question of how TFIID localizes to gene promoters that lack binding sites for these proteins. Here we demonstrate that Drosophila TAF1 protein isoforms TAF1-2 and TAF1-4 directly bind DNA independently of TAF2. DNA binding by TAF1 isoforms is mediated by cooperative interactions of two identical AT-hook motifs, one of which is encoded by an alternatively spliced exon. Electrophoretic mobility shift assays revealed that TAF1-2 bound the minor groove of adenine-thymine-rich DNA with a preference for the sequence AAT. Alanine-scanning mutagenesis of the alternatively spliced AT-hook indicated that Lys and Arg residues made essential DNA contacts, whereas Gly and Pro residues within the Arg-Gly-Arg-Pro core sequence were less important for DNA binding, suggesting that AT-hooks are more divergent than previously predicted. TAF1-2 bound with variable affinity to the transcription start site of several Drosophila genes, and binding to the hsp70 promoter was reduced by mutation of a single base pair at the transcription start site. Collectively, these data indicate that AT-hooks serve to anchor TAF1 isoforms to the minor groove of adenine-thymine-rich Drosophila gene promoters and suggest a model in which regulated expression of TAF1 isoforms by alternative splicing contributes to gene-specific transcription.
Affiliation(s)
- Chad E Metcalf
- Department of Biomolecular Chemistry, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin 53706, USA
|
47
|
Wang J, Zhang S, Schultz RM, Tseng H. Search for basonuclin target genes. Biochem Biophys Res Commun 2006; 348:1261-71. [PMID: 16919236 PMCID: PMC1630671 DOI: 10.1016/j.bbrc.2006.07.198] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2006] [Accepted: 07/25/2006] [Indexed: 11/20/2022]
Abstract
Basonuclin (Bnc 1) is a transcription factor that has an unusual ability to interact with promoters of both RNA polymerases I and II. The action of basonuclin is mediated through three pairs of evolutionarily conserved zinc fingers, which produce three DNase I footprints on the promoters of rDNA and the basonuclin gene. Using these DNase footprints, we built a computational model for the basonuclin DNA-binding module, which was used to identify in silico potential RNA polymerase II target genes in the human and mouse promoter databases. The target genes of basonuclin show that it regulates the expression of proteins involved in chromatin structure, transcription/DNA-binding, ion-channels, adhesion/cell-cell junction, signal transduction, and intracellular transport. Our results suggest that basonuclin, like MYC, may coordinate transcriptional activities among the three RNA polymerases. However, basonuclin regulates a distinctive set of pathways that differs from the set regulated by MYC.
Affiliation(s)
- Junwen Wang
- Center for Bioinformatics, University of Pennsylvania
- Department of Computer and Information Science, University of Pennsylvania
- Richard M. Schultz
- Department of Biology, University of Pennsylvania
- Center for Research on Reproduction and Women’s Health, University of Pennsylvania
- Hung Tseng
- Department of Dermatology, University of Pennsylvania
- Cell and Developmental Biology, University of Pennsylvania
- Center for Research on Reproduction and Women’s Health, University of Pennsylvania
|
48
|
Bultrini E, Pizzi E. A new parameter to study compositional properties of non-coding regions in eukaryotic genomes. Gene 2006; 385:75-82. [PMID: 16978802 DOI: 10.1016/j.gene.2006.05.030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2006] [Revised: 05/04/2006] [Accepted: 05/19/2006] [Indexed: 10/24/2022]
Abstract
Genomes are characterized by global and local compositional properties that are interesting in an evolutionary perspective but also provide useful information for the identification of some functional elements. Following previous studies, in this work we investigated compositional properties of non-coding sequences in four eukaryotic genomes (C. elegans, D. melanogaster, M. musculus, H. sapiens). We developed a procedure based on Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to identify pentamers that are over-represented in introns (intron vocabulary) and to define a new parameter (LD) that reflects oligonucleotide composition of a given sequence. We analyzed genomic sequences and we found that all non-coding parts of a genome are characterized by similar LD values. Furthermore, we used the new parameter to analyze potentially regulatory regions. We extracted non-redundant sets of promoter sequences for D. melanogaster and H. sapiens and we studied their compositional (G+C content and LD parameter) and conformational (bendability propensity) properties. We found that regions immediately surrounding transcription start sites are distinguishable because of their %G+C, LD and bendability values.
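Of the properties named in this abstract, %G+C is the simplest to compute. The sketch below, with hypothetical function names, shows one way to profile G+C content in sliding windows along a sequence such as a region around a transcription start site. It does not attempt the LD parameter or the bendability measure, whose definitions are not given here.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a non-empty sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window, step=1):
    """%G+C for each full window of length `window`, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        gc_percent(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Calling `sliding_gc(promoter, 50, 10)` on a promoter sequence would give one value per 50-base window, which can then be plotted against position relative to the start site. The window and step sizes here are arbitrary choices.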
Affiliation(s)
- Emanuele Bultrini
- Dipartimento di Malattie Infettive, Parassitarie ed Immunomediate, Istituto Superiore di Sanità, Viale Regina Elena, 299, 00161 Roma, Italy
|
49
|
Reddy DA, Prasad BVLS, Mitra CK. Comparative analysis of core promoter region: information content from mono and dinucleotide substitution matrices. Comput Biol Chem 2006; 30:58-62. [PMID: 16321573 DOI: 10.1016/j.compbiolchem.2005.10.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2005] [Revised: 10/04/2005] [Accepted: 10/04/2005] [Indexed: 10/25/2022]
Abstract
We have studied the core promoter region in five sets of promoter sequences by calculating the average mutual information content H (relative entropy). We used specially constructed substitution matrices to calculate mono- and dinucleotide replacements in a given block of aligned sequences. These substitution matrices use scores in log-odds form, expressed in bits of information. Here, we constructed and applied nucleotide substitution matrices for the core promoter region to calculate the information content and to study the Transcription Start Site (TSS), TATA-box and downstream regions. As expected, the information content decreases with increasing block size. This clearly implies that the TSS region is likely to be 5-10 bases in length. We also note that, in both mouse and human, the TATA-box and TSS regions are likely to play important roles in proper transcriptional initiation.
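To make the idea of information content in bits concrete, the sketch below computes the per-column relative entropy of an aligned block against a background distribution and averages it over the block. This is a simplified stand-in, not the substitution-matrix procedure the abstract describes. The uniform background and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_relative_entropy(column, background=None):
    """Relative entropy, in bits, of one alignment column vs. a background."""
    if background is None:
        # Assumption: uniform background; a genome-specific one could be passed in.
        background = {base: 0.25 for base in BASES}
    counts = Counter(ch.upper() for ch in column)
    total = sum(counts[base] for base in BASES)
    if total == 0:
        raise ValueError("column contains no A, C, G or T")
    entropy = 0.0
    for base in BASES:
        p = counts[base] / total
        if p > 0:
            entropy += p * math.log2(p / background[base])
    return entropy


def average_information(block):
    """Mean per-column relative entropy of equal-length aligned sequences."""
    columns = list(zip(*block))
    return sum(column_relative_entropy(col) for col in columns) / len(columns)
```

With a uniform background, a fully conserved column scores 2 bits and a column with all four bases equally frequent scores 0 bits, so the average falls as less-conserved flanking positions are added to the block.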
Affiliation(s)
- D Ashok Reddy
- Department of Biochemistry, University of Hyderabad, Hyderabad 500046, India
|
50
|
Bajic VB, Tan SL, Christoffels A, Schönbach C, Lipovich L, Yang L, Hofmann O, Kruger A, Hide W, Kai C, Kawai J, Hume DA, Carninci P, Hayashizaki Y. Mice and men: their promoter properties. PLoS Genet 2006; 2:e54. [PMID: 16683032 PMCID: PMC1449896 DOI: 10.1371/journal.pgen.0020054] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2005] [Accepted: 02/27/2006] [Indexed: 12/28/2022] Open
Abstract
Using the two largest collections of Mus musculus and Homo sapiens transcription start sites (TSSs) determined based on CAGE tags, ditags, full-length cDNAs, and other transcript data, we describe the compositional landscape surrounding TSSs with the aim of gaining better insight into the properties of mammalian promoters. We classified TSSs into four types based on compositional properties of regions immediately surrounding them. These properties highlighted distinctive features in the extended core promoters that helped us delineate boundaries of the transcription initiation domain space for both species. The TSS types were analyzed for associations with initiating dinucleotides, CpG islands, TATA boxes, and an extensive collection of statistically significant cis-elements in mouse and human. We found that different TSS types show preferences for different sets of initiating dinucleotides and cis-elements. Through Gene Ontology and eVOC categories and tissue expression libraries we linked TSS characteristics to expression. Moreover, we show a link of TSS characteristics to very specific genomic organization in an example of immune-response-related genes (GO:0006955). Our results shed light on the global properties of the two transcriptomes not revealed before and therefore provide the framework for better understanding of the transcriptional mechanisms in the two species, as well as a framework for development of new and more efficient promoter- and gene-finding tools.
Affiliation(s)
- Vladimir B Bajic
- Knowledge Extraction Laboratory, Institute for Infocomm Research, Singapore.
|