26
Brown JB, Boley N, Eisman R, May GE, Stoiber MH, Duff MO, Booth BW, Wen J, Park S, Suzuki AM, Wan KH, Yu C, Zhang D, Carlson JW, Cherbas L, Eads BD, Miller D, Mockaitis K, Roberts J, Davis CA, Frise E, Hammonds AS, Olson S, Shenker S, Sturgill D, Samsonova AA, Weiszmann R, Robinson G, Hernandez J, Andrews J, Bickel PJ, Carninci P, Cherbas P, Gingeras TR, Hoskins RA, Kaufman TC, Lai EC, Oliver B, Perrimon N, Graveley BR, Celniker SE. Diversity and dynamics of the Drosophila transcriptome. Nature 2014; 512:393-9. [PMID: 24670639 PMCID: PMC4152413 DOI: 10.1038/nature12962] [Citation(s) in RCA: 474] [Impact Index Per Article: 47.4] [Received: 04/20/2013] [Accepted: 12/18/2013] [Indexed: 01/10/2023]
Abstract
Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.
27
Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, Pei B, Harmanci AO, Duff MO, Djebali S, Alexander RP, Alver BH, Auerbach R, Bell K, Bickel PJ, Boeck ME, Boley NP, Booth BW, Cherbas L, Cherbas P, Di C, Dobin A, Drenkow J, Ewing B, Fang G, Fastuca M, Feingold EA, Frankish A, Gao G, Good PJ, Guigó R, Hammonds A, Harrow J, Hoskins RA, Howald C, Hu L, Huang H, Hubbard TJP, Huynh C, Jha S, Kasper D, Kato M, Kaufman TC, Kitchen RR, Ladewig E, Lagarde J, Lai E, Leng J, Lu Z, MacCoss M, May G, McWhirter R, Merrihew G, Miller DM, Mortazavi A, Murad R, Oliver B, Olson S, Park PJ, Pazin MJ, Perrimon N, Pervouchine D, Reinke V, Reymond A, Robinson G, Samsonova A, Saunders GI, Schlesinger F, Sethi A, Slack FJ, Spencer WC, Stoiber MH, Strasbourger P, Tanzer A, Thompson OA, Wan KH, Wang G, Wang H, Watkins KL, Wen J, Wen K, Xue C, Yang L, Yip K, Zaleski C, Zhang Y, Zheng H, Brenner SE, Graveley BR, Celniker SE, Gingeras TR, Waterston R. Comparative analysis of the transcriptome across distant species. Nature 2014; 512:445-8. [PMID: 25164755 PMCID: PMC4155737 DOI: 10.1038/nature13424] [Citation(s) in RCA: 204] [Impact Index Per Article: 20.4] [Received: 04/10/2013] [Accepted: 04/30/2014] [Indexed: 12/30/2022]
Abstract
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
28
Bassett AR, Akhtar A, Barlow DP, Bird AP, Brockdorff N, Duboule D, Ephrussi A, Ferguson-Smith AC, Gingeras TR, Haerty W, Higgs DR, Miska EA, Ponting CP. Considerations when investigating lncRNA function in vivo. eLife 2014; 3:e03058. [PMID: 25124674 PMCID: PMC4132285 DOI: 10.7554/elife.03058] [Citation(s) in RCA: 269] [Impact Index Per Article: 26.9] [Indexed: 12/15/2022] Open
Abstract
Although a small number of the vast array of animal long non-coding RNAs (lncRNAs) have known effects on cellular processes examined in vitro, the extent of their contributions to normal cell processes throughout development, differentiation and disease for the most part remains less clear. Phenotypes arising from deletion of an entire genomic locus cannot be unequivocally attributed either to the loss of the lncRNA per se or to the associated loss of other overlapping DNA regulatory elements. The distinction between cis- or trans-effects is also often problematic. We discuss the advantages and challenges associated with the current techniques for studying the in vivo function of lncRNAs in the light of different models of lncRNA molecular mechanism, and reflect on the design of experiments to mutate lncRNA loci. These considerations should assist in the further investigation of these transcriptional products of the genome. DOI:http://dx.doi.org/10.7554/eLife.03058.001
29
Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'-complete cDNAs. Curr Protoc Mol Biol 2013; 104:Unit 25B.11. [PMID: 24510412 DOI: 10.1002/0471142727.mb25b11s104] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Indexed: 11/09/2022]
Abstract
RNA annotation and mapping of promoters for analysis of gene expression (RAMPAGE) is a method that harnesses highly specific sequencing of 5'-complete complementary DNAs to identify transcription start sites (TSSs) genome-wide. Although TSS mapping has historically relied on detection of 5'-complete cDNAs, current genome-wide approaches typically have limited specificity and provide only scarce information regarding transcript structure. RAMPAGE allows for highly stringent selection of 5'-complete molecules, thus allowing base-resolution TSS identification with a high signal-to-noise ratio. Paired-end sequencing of medium-length cDNAs yields transcript structure information that is essential to interpreting the relationship of TSSs to annotated genes and transcripts. As opposed to standard RNA-seq, RAMPAGE explicitly yields accurate and highly reproducible expression level estimates for individual promoters. Moreover, this approach offers a streamlined 2- to 3-day protocol that is optimized for extensive sample multiplexing, and is therefore adapted for large-scale projects. This method has been applied successfully to human and Drosophila samples, and in principle should be applicable to any eukaryotic system.
30
Schlesinger F, Smith AD, Gingeras TR, Hannon GJ, Hodges E. De novo DNA demethylation and noncoding transcription define active intergenic regulatory elements. Genome Res 2013; 23:1601-14. [PMID: 23811145 PMCID: PMC3787258 DOI: 10.1101/gr.157271.113] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Indexed: 11/24/2022]
Abstract
Deep sequencing of mammalian DNA methylomes has uncovered a previously unpredicted number of discrete hypomethylated regions in intergenic space (iHMRs). Here, we combined whole-genome bisulfite sequencing data with extensive gene expression and chromatin-state data to define functional classes of iHMRs, and to reconstruct the dynamics of their establishment in a developmental setting. Comparing HMR profiles in embryonic stem and primary blood cells, we show that iHMRs mark an exclusive subset of active DNase hypersensitive sites (DHS), and that both developmentally constitutive and cell-type-specific iHMRs display chromatin states typical of distinct regulatory elements. We also observe that iHMR changes are more predictive of nearby gene activity than the promoter HMR itself, and that expression of noncoding RNAs within the iHMR accompanies full activation and complete demethylation of mature B cell enhancers. Conserved sequence features corresponding to iHMR transcript start sites, including a discernible TATA motif, suggest a conserved, functional role for transcription in these regions. Similarly, we explored both primate-specific and human population variation at iHMRs, finding that while enhancer iHMRs are more variable in sequence and methylation status than any other functional class, conservation of the TATA box is highly predictive of iHMR maintenance, reflecting the impact of sequence plasticity and transcriptional signals on iHMR establishment. Overall, our analysis allowed us to construct a three-step timeline in which (1) intergenic DHS are pre-established in the stem cell, (2) partial demethylation of blood-specific intergenic DHSs occurs in blood progenitors, and (3) complete iHMR formation and transcription coincide with enhancer activation in lymphoid-specified cells.
31
Livyatan I, Harikumar A, Nissim-Rafinia M, Duttagupta R, Gingeras TR, Meshorer E. Non-polyadenylated transcription in embryonic stem cells reveals novel non-coding RNA related to pluripotency and differentiation. Nucleic Acids Res 2013; 41:6300-15. [PMID: 23630323 PMCID: PMC3695530 DOI: 10.1093/nar/gkt316] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022] Open
Abstract
The transcriptional landscape in embryonic stem cells (ESCs) and during ESC differentiation has received considerable attention, albeit mostly confined to the polyadenylated fraction of RNA, whereas the non-polyadenylated (NPA) fraction remained largely unexplored. Notwithstanding, the NPA RNA super-family has every potential to participate in the regulation of pluripotency and stem cell fate. We conducted a comprehensive analysis of NPA RNA in ESCs using a combination of whole-genome tiling arrays and deep sequencing technologies. In addition to identifying previously characterized and new non-coding RNA members, we describe a group of novel conserved RNAs (snacRNAs: small NPA conserved), some of which are differentially expressed between ESC and neuronal progenitor cells, providing the first evidence of a novel group of potentially functional NPA RNA involved in the regulation of pluripotency and stem cell fate. We further show that minor spliceosomal small nuclear RNAs, which are NPA, are almost completely absent in ESCs and are upregulated in differentiation. Finally, we show differential processing of the minor intron of the polycomb group gene Eed. Our data suggest that NPA RNA, both known and novel, play important roles in ESCs.
32
Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L, Ruan X, Ruan Y, Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, Davis CA, Shiekhattar R, Gingeras TR, Hubbard TJ, Notredame C, Harrow J, Guigó R. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 2013; 22:1775-89. [PMID: 22955988 PMCID: PMC3431493 DOI: 10.1101/gr.132159.111] [Citation(s) in RCA: 3752] [Impact Index Per Article: 341.1] [Indexed: 11/24/2022]
Abstract
The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to those of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally expressed at lower levels than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.
33
Cheng C, Alexander R, Min R, Leng J, Yip KY, Rozowsky J, Yan KK, Dong X, Djebali S, Ruan Y, Davis CA, Carninci P, Lassman T, Gingeras TR, Guigó R, Birney E, Weng Z, Snyder M, Gerstein M. Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res 2013; 22:1658-67. [PMID: 22955978 PMCID: PMC3431483 DOI: 10.1101/gr.136838.111] [Citation(s) in RCA: 138] [Impact Index Per Article: 12.5] [Indexed: 12/31/2022]
Abstract
Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.
34
Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res 2013; 23:169-80. [PMID: 22936248 PMCID: PMC3530677 DOI: 10.1101/gr.139618.112] [Citation(s) in RCA: 135] [Impact Index Per Article: 12.3] [Received: 02/23/2012] [Accepted: 08/29/2012] [Indexed: 12/20/2022]
Abstract
Many eukaryotic genes possess multiple alternative promoters with distinct expression specificities. Therefore, comprehensively annotating promoters and deciphering their individual regulatory dynamics is critical for gene expression profiling applications and for our understanding of regulatory complexity. We introduce RAMPAGE, a novel promoter activity profiling approach that combines extremely specific 5'-complete cDNA sequencing with an integrated data analysis workflow, to address the limitations of current techniques. RAMPAGE features a streamlined protocol for fast and easy generation of highly multiplexed sequencing libraries, offers very high transcription start site specificity, generates accurate and reproducible promoter expression measurements, and yields extensive transcript connectivity information through paired-end cDNA sequencing. We used RAMPAGE in a genome-wide study of promoter activity throughout 36 stages of the life cycle of Drosophila melanogaster, and describe here a comprehensive data set that represents the first available developmental time-course of promoter usage. We found that >40% of developmentally expressed genes have at least two promoters and that alternative promoters generally implement distinct regulatory programs. Transposable elements, long proposed to play a central role in the evolution of their host genomes through their ability to regulate gene expression, contribute at least 1300 promoters shaping the developmental transcriptome of D. melanogaster. Hundreds of these promoters drive the expression of annotated genes, and transposons often impart their own expression specificity upon the genes they regulate. These observations provide support for the theory that transposons may drive regulatory innovation through the distribution of stereotyped cis-regulatory modules throughout their host genomes.
35
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2012; 29:15-21. [PMID: 23104886 DOI: 10.1093/bioinformatics/bts635] [Citation(s) in RCA: 26725] [Impact Index Per Article: 2227.1] [Indexed: 02/06/2023]
Abstract
MOTIVATION: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. RESULTS: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. AVAILABILITY AND IMPLEMENTATION: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
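The sequential seed search mentioned in the abstract can be illustrated with a minimal Python sketch. It is not STAR's implementation: the suffix array is built naively, only exact matches are considered, and the clustering and stitching stages are left out. The function names and the toy genome are assumptions made for this illustration.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def maximal_mappable_prefix(read, text, sa):
    """Return (length, positions) of the longest prefix of `read` found in `text`."""
    # Binary search for the position where `read` would be inserted
    # among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < read:
            lo = mid + 1
        else:
            hi = mid
    # The longest match is shared with one of the two neighbouring suffixes.
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            best = max(best, common_prefix_len(read, text[sa[idx]:]))
    if best == 0:
        return 0, []
    # All suffixes starting with that prefix form a contiguous block around `lo`.
    prefix = read[:best]
    positions = []
    i = lo - 1
    while i >= 0 and text.startswith(prefix, sa[i]):
        positions.append(sa[i])
        i -= 1
    i = lo
    while i < len(sa) and text.startswith(prefix, sa[i]):
        positions.append(sa[i])
        i += 1
    return best, sorted(positions)


def seed_search(read, text, sa):
    """Split a read into consecutive exact seeds: (read offset, length, positions)."""
    seeds = []
    offset = 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], text, sa)
        if length == 0:
            offset += 1  # base not present in the reference; skip it
            continue
        seeds.append((offset, length, positions))
        offset += length  # restart the search from the unmapped remainder
    return seeds


# Toy reference: two "exons" separated by an "intron" (hypothetical sequence).
genome = "ACGTTGCA" + "GTAAAAAAAG" + "CCGGATTC"
suffix_array = build_suffix_array(genome)
# A read spanning the exon-exon junction yields one seed per exon.
print(seed_search("TTGCACCGGA", genome, suffix_array))
# [(0, 5, [3]), (5, 5, [18])]
```

In this toy case the gap between the two seed positions corresponds to the intron; a real aligner would go on to cluster and stitch such seeds and to tolerate mismatches.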
36
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi AM, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, Fejes-Toth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LH, Shahab A, Skancke J, Suzuki AM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R, Gingeras TR. Landscape of transcription in human cells. Nature 2012; 489:101-8. [PMID: 22955620 PMCID: PMC3684276 DOI: 10.1038/nature11233] [Citation(s) in RCA: 3730] [Impact Index Per Article: 310.8] [Received: 12/10/2011] [Accepted: 05/15/2012] [Indexed: 02/07/2023]
Abstract
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
37
Lasa I, Toledo-Arana A, Gingeras TR. An effort to make sense of antisense transcription in bacteria. RNA Biol 2012; 9:1039-44. [PMID: 22858676 PMCID: PMC3551857 DOI: 10.4161/rna.21167] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Indexed: 12/26/2022] Open
Abstract
Analyses of bacterial transcriptomes have shown the existence of a genome-wide process of overlapping transcription due to the presence of antisense RNAs, as well as mRNAs that overlap along their entire length or in some portion of the 5′- and 3′-UTR regions. The biological advantages of such overlapping transcription are unclear, but it may play important regulatory roles at the level of transcription, RNA stability and translation. In a recent report, the human pathogen Staphylococcus aureus was observed to generate genome-wide overlapping transcription in the same bacterial cells, leading to a collection of short RNA fragments generated by the endoribonuclease RNase III. This processing appears most prominently in Gram-positive bacteria. The implications of both the use of pervasive overlapping transcription and the processing of these double-stranded templates into short RNAs are explored and the consequences discussed.
38
Dong X, Greven MC, Kundaje A, Djebali S, Brown JB, Cheng C, Gingeras TR, Gerstein M, Guigó R, Birney E, Weng Z. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol 2012; 13:R53. [PMID: 22950368 PMCID: PMC3491397 DOI: 10.1186/gb-2012-13-9-r53] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Received: 03/29/2012] [Revised: 06/13/2012] [Accepted: 06/19/2012] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. RESULTS: We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. CONCLUSIONS: Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.
39
Djebali S, Lagarde J, Kapranov P, Lacroix V, Borel C, Mudge JM, Howald C, Foissac S, Ucla C, Chrast J, Ribeca P, Martin D, Murray RR, Yang X, Ghamsari L, Lin C, Bell I, Dumais E, Drenkow J, Tress ML, Gelpí JL, Orozco M, Valencia A, van Berkum NL, Lajoie BR, Vidal M, Stamatoyannopoulos J, Batut P, Dobin A, Harrow J, Hubbard T, Dekker J, Frankish A, Salehi-Ashtiani K, Reymond A, Antonarakis SE, Guigó R, Gingeras TR. Evidence for transcript networks composed of chimeric RNAs in human cells. PLoS One 2012; 7:e28213. [PMID: 22238572 PMCID: PMC3251577 DOI: 10.1371/journal.pone.0028213] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Received: 07/06/2011] [Accepted: 11/03/2011] [Indexed: 12/03/2022] Open
Abstract
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well-annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
40
Deng X, Hiatt JB, Nguyen DK, Ercan S, Sturgill D, Hillier LW, Schlesinger F, Davis CA, Reinke VJ, Gingeras TR, Shendure J, Waterston RH, Oliver B, Lieb JD, Disteche CM. Evidence for compensatory upregulation of expressed X-linked genes in mammals, Caenorhabditis elegans and Drosophila melanogaster. Nat Genet 2011; 43:1179-85. [PMID: 22019781 DOI: 10.1038/ng.948] [Citation(s) in RCA: 208] [Impact Index Per Article: 16.0] [Received: 02/07/2011] [Accepted: 08/25/2011] [Indexed: 12/12/2022]
Abstract
Many animal species use a chromosome-based mechanism of sex determination, which has led to the coordinate evolution of dosage-compensation systems. Dosage compensation not only corrects the imbalance in the number of X chromosomes between the sexes but also is hypothesized to correct dosage imbalance within cells that is due to monoallelic X-linked expression and biallelic autosomal expression, by upregulating X-linked genes twofold (termed 'Ohno's hypothesis'). Although this hypothesis is well supported by expression analyses of individual X-linked genes and by microarray-based transcriptome analyses, it was challenged by a recent study using RNA sequencing and proteomics. We obtained new, independent RNA-seq data, measured RNA polymerase distribution and reanalyzed published expression data in mammals, C. elegans and Drosophila. Our analyses, which take into account the skewed gene content of the X chromosome, support the hypothesis of upregulation of expressed X-linked genes to balance expression of the genome.
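One simple way to see why restricting the comparison to expressed genes matters is an X-to-autosome expression ratio. The sketch below is an illustration only, not the analysis performed in the paper: the function name, the expression threshold and the toy values are assumptions, and every non-X gene is treated as autosomal for simplicity.

```python
from statistics import median


def x_to_autosome_ratio(expression, chromosome, min_expression=1.0):
    """Median expression of X-linked genes divided by that of all other genes.

    Genes below `min_expression` are dropped from both groups before the
    medians are taken.
    """
    x_values, autosomal_values = [], []
    for gene, level in expression.items():
        if level < min_expression:
            continue
        if chromosome[gene] == "X":
            x_values.append(level)
        else:
            autosomal_values.append(level)
    if not x_values or not autosomal_values:
        raise ValueError("no expressed genes in at least one group")
    return median(x_values) / median(autosomal_values)


# Hypothetical expression values in arbitrary units.
expression = {"xa": 10, "xb": 12, "xc": 0, "xd": 0,
              "a1": 9, "a2": 11, "a3": 13, "a4": 0}
chromosome = {"xa": "X", "xb": "X", "xc": "X", "xd": "X",
              "a1": "2", "a2": "3", "a3": "2", "a4": "3"}

print(x_to_autosome_ratio(expression, chromosome))                      # 1.0
print(x_to_autosome_ratio(expression, chromosome, min_expression=0.0))  # 0.5
```

In this toy data set the X chromosome carries more silent genes than the autosomes, so including them pulls the ratio down to 0.5, whereas the expressed genes alone give a ratio of 1.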
41
Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, Oliver B. Synthetic spike-in standards for RNA-seq experiments. Genome Res 2011; 21:1543-51. [PMID: 21816910 DOI: 10.1101/gr.121095.111] [Citation(s) in RCA: 446] [Impact Index Per Article: 34.3] [Indexed: 01/28/2023]
Abstract
High-throughput sequencing of cDNA (RNA-seq) is a widely deployed transcriptome profiling and annotation technique, but questions about the performance of different protocols and platforms remain. We used a newly developed pool of 96 synthetic RNAs with various lengths and GC content, covering a 2^20 concentration range, as spike-in controls to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts. We observed linearity between read density and RNA input over the entire detection range and excellent agreement between replicates, but we observed significantly larger imprecision than expected under pure Poisson sampling errors. We use the control RNAs to directly measure reproducible protocol-dependent biases due to GC content and transcript length as well as stereotypic heterogeneity in coverage across transcripts correlated with position relative to RNA termini and priming sequence bias. These effects lead to biased quantification for short transcripts and individual exons, which is a serious problem for measurements of isoform abundances, but one that can be partially corrected using appropriate models of bias. By using the control RNAs, we derive limits for the discovery and detection of rare transcripts in RNA-seq experiments. By using data collected as part of the model organism and human Encyclopedia of DNA Elements projects (ENCODE and modENCODE), we demonstrate that external RNA controls are a useful resource for evaluating sensitivity and accuracy of RNA-seq experiments for transcriptome discovery and quantification. These quality metrics facilitate comparable analysis across different samples, protocols, and platforms.
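The idea of a spike-in standard curve can be sketched as a least-squares line fitted in log2-log2 space and then inverted to estimate the abundance of an unknown transcript. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names and the numbers are hypothetical, and no bias correction is applied.

```python
import math


def fit_standard_curve(concentrations, read_densities):
    """Least-squares fit of log2(read density) = slope * log2(concentration) + intercept."""
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(r) for r in read_densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(read_density, slope, intercept):
    """Invert the fitted curve to estimate the input concentration."""
    return 2 ** ((math.log2(read_density) - intercept) / slope)


# Hypothetical spike-ins: known input amounts and observed read densities.
spike_concentrations = [1, 2, 4, 8, 16]
spike_read_densities = [3, 6, 12, 24, 48]

slope, intercept = fit_standard_curve(spike_concentrations, spike_read_densities)
# Estimate the input amount of a transcript observed at a read density of 96.
estimate = estimate_concentration(96, slope, intercept)
```

A slope close to 1 in log-log space corresponds to the linearity between read density and RNA input described in the abstract; in the toy data the relationship is exactly linear.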
|
42
|
Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, Booth BW, Brooks AN, Dai Q, Davis CA, Duff MO, Feng X, Gorchakov AA, Gu T, Henikoff JG, Kapranov P, Li R, MacAlpine HK, Malone J, Minoda A, Nordman J, Okamura K, Perry M, Powell SK, Riddle NC, Sakai A, Samsonova A, Sandler JE, Schwartz YB, Sher N, Spokony R, Sturgill D, van Baren M, Wan KH, Yang L, Yu C, Feingold E, Good P, Guyer M, Lowdon R, Ahmad K, Andrews J, Berger B, Brenner SE, Brent MR, Cherbas L, Elgin SCR, Gingeras TR, Grossman R, Hoskins RA, Kaufman TC, Kent W, Kuroda MI, Orr-Weaver T, Perrimon N, Pirrotta V, Posakony JW, Ren B, Russell S, Cherbas P, Graveley BR, Lewis S, Micklem G, Oliver B, Park PJ, Celniker SE, Henikoff S, Karpen GH, Lai EC, MacAlpine DM, Stein LD, White KP, Kellis M. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 2010; 330:1787-97. [PMID: 21177974 PMCID: PMC3192495 DOI: 10.1126/science.1198374] [Citation(s) in RCA: 911] [Impact Index Per Article: 65.1] [Indexed: 12/15/2022]
Abstract
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
|
43
|
Cherbas L, Willingham A, Zhang D, Yang L, Zou Y, Eads BD, Carlson JW, Landolin JM, Kapranov P, Dumais J, Samsonova A, Choi JH, Roberts J, Davis CA, Tang H, van Baren MJ, Ghosh S, Dobin A, Bell K, Lin W, Langton L, Duff MO, Tenney AE, Zaleski C, Brent MR, Hoskins RA, Kaufman TC, Andrews J, Graveley BR, Perrimon N, Celniker SE, Gingeras TR, Cherbas P. The transcriptional diversity of 25 Drosophila cell lines. Genome Res 2010; 21:301-14. [PMID: 21177962 DOI: 10.1101/gr.112961.110] [Citation(s) in RCA: 213] [Impact Index Per Article: 15.2] [Indexed: 12/17/2022]
Abstract
Drosophila melanogaster cell lines are important resources for cell biologists. Here, we catalog the expression of exons, genes, and unannotated transcriptional signals for 25 lines. Unannotated transcription is substantial (typically 19% of euchromatic signal). Conservatively, we identify 1405 novel transcribed regions; 684 of these appear to be new exons of neighboring, often distant, genes. Sixty-four percent of genes are expressed detectably in at least one line, but only 21% are detected in all lines. Each cell line expresses, on average, 5885 genes, including a common set of 3109. Expression levels vary over several orders of magnitude. Major signaling pathways are well represented: most differentiation pathways are "off" and survival/growth pathways "on." Roughly 50% of the genes expressed by each line are not part of the common set, and these show considerable individuality. Thirty-one percent are expressed at a higher level in at least one cell line than in any single developmental stage, suggesting that each line is enriched for genes characteristic of small sets of cells. Most remarkable is that imaginal disc-derived lines can generally be assigned, on the basis of expression, to small territories within developing discs. These mappings reveal unexpected stability of even fine-grained spatial determination. No two cell lines show identical transcription factor expression. We conclude that each line has retained features of an individual founder cell superimposed on a common "cell line" gene expression pattern.
|
44
|
Yang A, Zhu Z, Kettenbach A, Kapranov P, McKeon F, Gingeras TR, Struhl K. Genome-wide mapping indicates that p73 and p63 co-occupy target sites and have similar DNA-binding profiles in vivo. PLoS One 2010; 5:e11572. [PMID: 20644729 PMCID: PMC2904373 DOI: 10.1371/journal.pone.0011572] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 02/12/2010] [Accepted: 06/21/2010] [Indexed: 11/19/2022]
Abstract
Background: The p53 homologs, p63 and p73, share ∼85% amino acid identity in their DNA-binding domains, but they have distinct biological functions. Principal Findings: Using chromatin immunoprecipitation and high-resolution tiling arrays covering the human genome, we identify p73 DNA binding sites on a genome-wide level in ME180 human cervical carcinoma cells. Strikingly, the p73 binding profile is indistinguishable from the previously described binding profile for p63 in the same cells. Moreover, the p73∶p63 binding ratio is similar at all genomic loci tested, suggesting that there are few, if any, targets that are specific for one of these factors. As assayed by sequential chromatin immunoprecipitation, p63 and p73 co-occupy DNA target sites in vivo, suggesting that p63 and p73 bind primarily as heterotetrameric complexes in ME180 cells. Conclusions: The observation that p63 and p73 associate with the same genomic targets suggests that their distinct biological functions are due to cell-type specific expression and/or protein domains that involve functions other than DNA binding.
|
45
|
Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, Lassmann T, Vitezic M, Severin J, Olivarius S, Lazarevic D, Hornig N, Orlando V, Bell I, Gao H, Dumais J, Kapranov P, Wang H, Davis CA, Gingeras TR, Kawai J, Daub CO, Hayashizaki Y, Gustincich S, Carninci P. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nat Methods 2010; 7:528-34. [PMID: 20543846 PMCID: PMC2906222 DOI: 10.1038/nmeth.1470] [Citation(s) in RCA: 116] [Impact Index Per Article: 8.3] [Received: 12/07/2009] [Accepted: 05/05/2010] [Indexed: 01/18/2023]
Abstract
Large-scale sequencing projects have revealed an unexpected complexity in the origins, structures and functions of mammalian transcripts. Many loci are known to produce overlapping coding and noncoding RNAs with capped 5' ends that vary in size. Methods to identify the 5' ends of transcripts will facilitate the discovery of new promoters and 5' ends derived from secondary capping events. Such methods often require high input amounts of RNA not obtainable from highly refined samples such as tissue microdissections and subcellular fractions. Therefore, we developed nano-cap analysis of gene expression (nanoCAGE), a method that captures the 5' ends of transcripts from as little as 10 ng of total RNA, and CAGEscan, a mate-pair adaptation of nanoCAGE that captures the transcript 5' ends linked to a downstream region. Both of these methods allow further annotation-agnostic studies of the complex human transcriptome.
|
46
|
Makrythanasis P, Kapranov P, Bartoloni L, Reymond A, Deutsch S, Guigó R, Denoeud F, Drenkow J, Rossier C, Ariani F, Capra V, Excoffier L, Renieri A, Gingeras TR, Antonarakis SE. Variation in novel exons (RACEfrags) of the MECP2 gene in Rett syndrome patients and controls. Hum Mutat 2009; 30:E866-79. [PMID: 19562714 DOI: 10.1002/humu.21073] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/10/2023]
Abstract
The study of transcription using genomic tiling arrays has led to the identification of numerous additional exons. One example is the MECP2 gene on the X chromosome; using 5'RACE and RT-PCR in human tissues and cell lines, we have found more than 70 novel exons (RACEfrags) connecting to at least one annotated exon. We sequenced all MECP2-connected exons and flanking sequences in 3 groups: 46 patients with Rett syndrome and without mutations in the currently annotated exons of the MECP2 and CDKL5 genes; 32 patients with Rett syndrome and identified mutations in the MECP2 gene; and 100 control individuals from the same geoethnic group. Approximately 13 kb were sequenced per sample (2.4 Mb of DNA resequencing). A total of 75 individuals had novel rare variants (mostly private variants), but no statistically significant difference was found among the 3 groups. These results suggest that variants in the newly discovered exons may not contribute to Rett syndrome. Interestingly, however, there are about twice as many variants in the novel exons as in the flanking sequences (44 vs. 21 for approximately 1.3 Mb sequenced for each class of sequences, p=0.0025). Thus the evolutionary forces that shape these novel exons may be different from those of neighboring sequences.
|
47
|
Abstract
Deep sequencing of 'transcriptomes' (the collection of all RNA transcripts produced at a given time) from worms to humans reveals that some transcripts are composed of sequence segments that are not co-linear, with pieces of sequence coming from distant regions of DNA, even different chromosomes. Some of these 'chimaeric' transcripts are formed by genetic rearrangements, but others arise during post-transcriptional events. The 'trans-splicing' process in lower eukaryotes is well understood, but events in higher eukaryotes are not. The existence of such chimaeric RNAs has far-reaching implications for the potential information content of genomes and the way it is arranged.
|
49
|
Efroni S, Duttagupta R, Cheng J, Dehghani H, Hoeppner DJ, Dash C, Bazett-Jones DP, Le Grice S, McKay RDG, Buetow KH, Gingeras TR, Misteli T, Meshorer E. Global transcription in pluripotent embryonic stem cells. Cell Stem Cell 2008; 2:437-47. [PMID: 18462694 DOI: 10.1016/j.stem.2008.03.021] [Citation(s) in RCA: 505] [Impact Index Per Article: 33.7] [Received: 06/18/2007] [Revised: 11/09/2007] [Accepted: 03/28/2008] [Indexed: 12/21/2022]
Abstract
The molecular mechanisms underlying pluripotency and lineage specification from embryonic stem cells (ESCs) are largely unclear. Differentiation pathways may be determined by the targeted activation of lineage-specific genes or by selective silencing of genome regions. Here we show that the ESC genome is transcriptionally globally hyperactive and undergoes large-scale silencing as cells differentiate. Normally silent repeat regions are active in ESCs, and tissue-specific genes are sporadically expressed at low levels. Whole-genome tiling arrays demonstrate widespread transcription in coding and noncoding regions in ESCs, whereas the transcriptional landscape becomes more discrete as differentiation proceeds. The transcriptional hyperactivity in ESCs is accompanied by disproportionate expression of chromatin-remodeling genes and the general transcription machinery. We propose that global transcription is a hallmark of pluripotent ESCs, contributing to their plasticity, and that lineage specification is driven by reduction of the transcribed portion of the genome.
|
50
|
Djebali S, Kapranov P, Foissac S, Lagarde J, Reymond A, Ucla C, Wyss C, Drenkow J, Dumais E, Murray RR, Lin C, Szeto D, Denoeud F, Calvo M, Frankish A, Harrow J, Makrythanasis P, Vidal M, Salehi-Ashtiani K, Antonarakis SE, Gingeras TR, Guigó R. Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat Methods 2008; 5:629-35. [PMID: 18500348 PMCID: PMC2713501 DOI: 10.1038/nmeth.1216] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Received: 03/12/2008] [Accepted: 04/24/2008] [Indexed: 11/09/2022]
Abstract
RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
|