51
Efroni S, Duttagupta R, Cheng J, Dehghani H, Hoeppner DJ, Dash C, Bazett-Jones DP, Le Grice S, McKay RDG, Buetow KH, Gingeras TR, Misteli T, Meshorer E. Global transcription in pluripotent embryonic stem cells. Cell Stem Cell 2008; 2:437-47. [PMID: 18462694 DOI: 10.1016/j.stem.2008.03.021] [Citation(s) in RCA: 505] [Impact Index Per Article: 33.7] [Received: 06/18/2007] [Revised: 11/09/2007] [Accepted: 03/28/2008] [Indexed: 12/21/2022]
Abstract
The molecular mechanisms underlying pluripotency and lineage specification from embryonic stem cells (ESCs) are largely unclear. Differentiation pathways may be determined by the targeted activation of lineage-specific genes or by selective silencing of genome regions. Here we show that the ESC genome is transcriptionally globally hyperactive and undergoes large-scale silencing as cells differentiate. Normally silent repeat regions are active in ESCs, and tissue-specific genes are sporadically expressed at low levels. Whole-genome tiling arrays demonstrate widespread transcription in coding and noncoding regions in ESCs, whereas the transcriptional landscape becomes more discrete as differentiation proceeds. The transcriptional hyperactivity in ESCs is accompanied by disproportionate expression of chromatin-remodeling genes and the general transcription machinery. We propose that global transcription is a hallmark of pluripotent ESCs, contributing to their plasticity, and that lineage specification is driven by reduction of the transcribed portion of the genome.
52
Djebali S, Kapranov P, Foissac S, Lagarde J, Reymond A, Ucla C, Wyss C, Drenkow J, Dumais E, Murray RR, Lin C, Szeto D, Denoeud F, Calvo M, Frankish A, Harrow J, Makrythanasis P, Vidal M, Salehi-Ashtiani K, Antonarakis SE, Gingeras TR, Guigó R. Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat Methods 2008; 5:629-35. [PMID: 18500348 PMCID: PMC2713501 DOI: 10.1038/nmeth.1216] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Received: 03/12/2008] [Accepted: 04/24/2008] [Indexed: 11/09/2022]
Abstract
RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
53
Gingeras TR. Genome-Wide Maps of the Human Transcriptome Reveal an Interleaved Organization and Novel Short and Long Classes of RNAs. FASEB J 2008. [DOI: 10.1096/fasebj.22.1_supplement.397.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
54
Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, Koch CM, Dunham I, Bieda M, Xu X, Farnham PJ, Kapranov P, Nix DA, Gingeras TR, Zhang X, Holster H, Jiang N, Green RD, Song JS, McCuine SA, Anton E, Nguyen L, Trinklein ND, Ye Z, Ching K, Hawkins D, Ren B, Scacheri PC, Rozowsky J, Karpikov A, Euskirchen G, Weissman S, Gerstein M, Snyder M, Yang A, Moqtaderi Z, Hirsch H, Shulha HP, Fu Y, Weng Z, Struhl K, Myers RM, Lieb JD, Liu XS. Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res 2008; 18:393-403. [PMID: 18258921 DOI: 10.1101/gr.7080508] [Citation(s) in RCA: 108] [Impact Index Per Article: 6.8] [Indexed: 02/07/2023]
Abstract
The most widely used method for detecting genome-wide protein-DNA interactions is chromatin immunoprecipitation on tiling microarrays, commonly known as ChIP-chip. Here, we conducted the first objective analysis of tiling array platforms, amplification procedures, and signal detection algorithms in a simulated ChIP-chip experiment. Mixtures of human genomic DNA and "spike-ins" comprised of nearly 100 human sequences at various concentrations were hybridized to four tiling array platforms by eight independent groups. Blind to the number of spike-ins, their locations, and the range of concentrations, each group made predictions of the spike-in locations. We found that microarray platform choice is not the primary determinant of overall performance. In fact, variation in performance between labs, protocols, and algorithms within the same array platform was greater than the variation in performance between array platforms. However, each array platform had unique performance characteristics that varied with tiling resolution and the number of replicates, which have implications for cost versus detection power. Long oligonucleotide arrays were slightly more sensitive at detecting very low enrichment. On all platforms, simple sequence repeats and genome redundancy tended to result in false positives. LM-PCR and WGA, the most popular sample amplification techniques, reproduced relative enrichment levels with high fidelity. Performance among signal detection algorithms was heavily dependent on array platform. The spike-in DNA samples and the data presented here provide a stable benchmark against which future ChIP platforms, protocol improvements, and analysis methods can be evaluated.
55
Rada-Iglesias A, Ameur A, Kapranov P, Enroth S, Komorowski J, Gingeras TR, Wadelius C. Whole-genome maps of USF1 and USF2 binding and histone H3 acetylation reveal new aspects of promoter structure and candidate genes for common human disorders. Genome Res 2008; 18:380-92. [PMID: 18230803 DOI: 10.1101/gr.6880908] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.9] [Indexed: 12/15/2022]
Abstract
Transcription factors and histone modifications are crucial regulators of gene expression that mutually influence each other. We present the DNA binding profiles of upstream stimulatory factors 1 and 2 (USF1, USF2) and acetylated histone H3 (H3ac) in a liver cell line for the whole human genome using ChIP-chip at a resolution of 35 base pairs. We determined that these three proteins bind mostly near the transcription start sites (TSSs) of protein-coding genes, and their binding is positively correlated with gene expression levels. Based on the spatial and functional relationship between USFs and H3ac at protein-coding gene promoters, we found similar promoter architecture for known genes and for the novel and less-characterized transcripts (human mRNAs and spliced ESTs). Furthermore, our analysis revealed a previously underestimated abundance of genes in a bidirectional conformation, where USFs are bound in between TSSs. After taking into account this promoter conformation, the results indicate that H3ac is mainly located downstream of the TSS, and it is at this genomic location where it positively correlates with gene expression. Finally, USF1, which is associated with familial combined hyperlipidemia, was found to bind and potentially regulate nuclear mitochondrial genes as well as genes for lipid and cholesterol metabolism, frequently in collaboration with GA binding protein transcription factor alpha (GABPA, nuclear respiratory factor 2 [NRF-2]). This expands our understanding of the transcriptional control of metabolic processes and its alteration in metabolic disorders.
56
Ghosh S, Hirsch HA, Sekinger EA, Kapranov P, Struhl K, Gingeras TR. Differential analysis for high density tiling microarray data. BMC Bioinformatics 2007; 8:359. [PMID: 17892592 PMCID: PMC2231405 DOI: 10.1186/1471-2105-8-359] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 01/24/2007] [Accepted: 09/24/2007] [Indexed: 11/18/2022] [Open Access]
Abstract
Background: High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. These arrays are being increasingly used to study the associated processes of transcription, transcription factor binding, chromatin structure and their association. Studies of differential expression and/or regulation provide critical insight into the mechanics of transcription and regulation that occurs during the developmental program of a cell. The time-course experiment, which comprises an in-vivo system and the proposed analyses, is used to determine if annotated and un-annotated portions of genome manifest coordinated differential response to the induced developmental program.
Results: We have proposed a novel approach, based on a piece-wise function, to analyze genome-wide differential response. This enables segmentation of the response based on protein-coding and non-coding regions; for genes the methodology also partitions differential response with a 5' versus 3' versus intra-genic bias.
Conclusion: The algorithm, built upon the framework of Significance Analysis of Microarrays, uses a generalized logic to define regions/patterns of coordinated differential change. By not adhering to the gene-centric paradigm, discordant differential expression patterns between exons and introns have been identified at a FDR of less than 12 percent. A co-localization of differential binding between RNA Polymerase II and tetra-acetylated histone has been quantified at a p-value < 0.003; it is most significant at the 5' end of genes, at a p-value < 10^-13. The prototype R code has been made available as supplementary material [see Additional file 1].
57
Abstract
While the concept of a gene has been helpful in defining the relationship of a portion of a genome to a phenotype, this traditional term may not be as useful as it once was. Currently, "gene" has come to refer principally to a genomic region producing a polyadenylated mRNA that encodes a protein. However, the recent emergence of a large collection of unannotated transcripts with apparently little protein coding capacity, collectively called transcripts of unknown function (TUFs), has begun to blur the physical boundaries and genomic organization of genic regions, with noncoding transcripts often overlapping protein-coding genes on the same (sense) and opposite (antisense) strands. Moreover, they are often located in intergenic regions, making the genic portions of the human genome an interleaved network of both annotated polyadenylated and nonpolyadenylated transcripts, including splice variants with novel 5' ends extending hundreds of kilobases. This complex transcriptional organization and other recently observed features of genomes argue for the reconsideration of the term "gene" and suggest that transcripts may be used to define the operational unit of a genome.
58
Rozowsky JS, Newburger D, Sayward F, Wu J, Jordan G, Korbel JO, Nagalakshmi U, Yang J, Zheng D, Guigó R, Gingeras TR, Weissman S, Miller P, Snyder M, Gerstein MB. The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci. Genome Res 2007; 17:732-45. [PMID: 17567993 PMCID: PMC1891334 DOI: 10.1101/gr.5696007] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
Abstract
For the approximately 1% of the human genome in the ENCODE regions, only about half of the transcriptionally active regions (TARs) identified with tiling microarrays correspond to annotated exons. Here we categorize this large amount of "unannotated transcription." We use a number of disparate features to classify the 6988 novel TARs: array expression profiles across cell lines and conditions, sequence composition, phylogenetic profiles (presence/absence of syntenic conservation across 17 species), and locations relative to genes. In the classification, we first filter out TARs with unusual sequence composition and those likely resulting from cross-hybridization. We then associate some of those remaining with proximal exons having correlated expression profiles. Finally, we cluster unclassified TARs into putative novel loci, based on similar expression and phylogenetic profiles. To encapsulate our classification, we construct a Database of Active Regions and Tools (DART.gersteinlab.org). DART has special facilities for rapidly handling and comparing many sets of TARs and their heterogeneous features, synchronizing across builds, and interfacing with other resources. Overall, we find that approximately 14% of the novel TARs can be associated with known genes, while approximately 21% can be clustered into approximately 200 novel loci. We observe that TARs associated with genes are enriched in the potential to form structural RNAs and many novel TAR clusters are associated with nearby promoters. To benchmark our classification, we design a set of experiments for testing the connectivity of novel TARs. Overall, we find that 18 of the 46 connections tested validate by RT-PCR and four of five sequenced PCR products confirm connectivity unambiguously.
59
Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei CL, Gingeras TR, Guigó R, Harrow J, Gerstein MB. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res 2007; 17:839-51. [PMID: 17568002 PMCID: PMC1891343 DOI: 10.1101/gr.5586307] [Citation(s) in RCA: 152] [Impact Index Per Article: 8.9] [Indexed: 10/23/2022]
Abstract
Arising from either retrotransposition or genomic duplication of functional genes, pseudogenes are "genomic fossils" valuable for exploring the dynamics and evolution of genes and genomes. Pseudogene identification is an important problem in computational genomics, and is also critical for obtaining an accurate picture of a genome's structure and function. However, no consensus computational scheme for defining and detecting pseudogenes has been developed thus far. As part of the ENCyclopedia Of DNA Elements (ENCODE) project, we have compared several distinct pseudogene annotation strategies and found that different approaches and parameters often resulted in rather distinct sets of pseudogenes. We subsequently developed a consensus approach for annotating pseudogenes (derived from protein coding genes) in the ENCODE regions, resulting in 201 pseudogenes, two-thirds of which originated from retrotransposition. A survey of orthologs for these pseudogenes in 28 vertebrate genomes showed that a significant fraction (approximately 80%) of the processed pseudogenes are primate-specific sequences, highlighting the increasing retrotransposition activity in primates. Analysis of sequence conservation and variation also demonstrated that most pseudogenes evolve neutrally, and processed pseudogenes appear to have lost their coding potential immediately or soon after their emergence. In order to explore the functional implication of pseudogene prevalence, we have extensively examined the transcriptional activity of the ENCODE pseudogenes. We performed a systematic series of pseudogene-specific RACE analyses. These, together with complementary evidence derived from tiling microarrays and high throughput sequencing, demonstrated that at least a fifth of the 201 pseudogenes are transcribed in one or more cell lines or tissues.
60
Rozowsky J, Wu J, Lian Z, Nagalakshmi U, Korbel JO, Kapranov P, Zheng D, Dyke S, Newburger P, Miller P, Gingeras TR, Weissman S, Gerstein M, Snyder M. Novel transcribed regions in the human genome. Cold Spring Harb Symp Quant Biol 2007; 71:111-6. [PMID: 17381286 DOI: 10.1101/sqb.2006.71.054] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
We have used genomic tiling arrays to identify transcribed regions throughout the human genome. Analysis of the mapping results of RNA isolated from five cell/tissue types, NB4 cells, NB4 cells treated with retinoic acid (RA), NB4 cells treated with 12-O-tetradecanoylphorbol-13 acetate (TPA), neutrophils, and placenta, throughout the ENCODE region reveals a large number of novel transcribed regions. Interestingly, neutrophils exhibit a great deal of novel expression in several intronic regions. Comparison of the hybridization results of NB4 cells treated with different stimuli relative to untreated cells reveals that many new regions are expressed upon cell differentiation. One such region is the Hox locus, which contains a large number of novel regions expressed in a number of cell types. Analysis of the trinucleotide composition of the novel transcribed regions reveals that it is similar to that of known exons. These results suggest that many of the novel transcribed regions may have a functional role.
61
Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, Batzoglou S, Goldman N, Hardison RC, Haussler D, 
Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CWH, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JNS, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PIW, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B, de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007; 447:799-816. [PMID: 17571346 PMCID: PMC2212820 DOI: 10.1038/nature05874] [Citation(s) in RCA: 3828] [Impact Index Per Article: 225.2] [Indexed: 01/17/2023]
Abstract
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
62
Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, Hertel J, Lindemeyer M, Reiche K, Tanzer A, Ucla C, Wyss C, Antonarakis SE, Denoeud F, Lagarde J, Drenkow J, Kapranov P, Gingeras TR, Guigó R, Snyder M, Gerstein MB, Reymond A, Hofacker IL, Stadler PF. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res 2007; 17:852-64. [PMID: 17568003 PMCID: PMC1891344 DOI: 10.1101/gr.5650707] [Citation(s) in RCA: 136] [Impact Index Per Article: 8.0] [Received: 06/16/2006] [Accepted: 12/12/2006] [Indexed: 12/16/2022]
Abstract
Functional RNA structures play an important role both in the context of noncoding RNA transcripts and as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic-stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70%, many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
63
Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J, Dike S, Wyss C, Henrichsen CN, Holroyd N, Dickson MC, Taylor R, Hance Z, Foissac S, Myers RM, Rogers J, Hubbard T, Harrow J, Guigó R, Gingeras TR, Antonarakis SE, Reymond A. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res 2007; 17:746-59. [PMID: 17567994 PMCID: PMC1891335 DOI: 10.1101/gr.5660607] [Citation(s) in RCA: 162] [Impact Index Per Article: 9.5] [Received: 06/19/2006] [Accepted: 01/22/2007] [Indexed: 11/24/2022]
Abstract
This report presents a systematic empirical annotation of transcript products from 399 annotated protein-coding loci across the 1% of the human genome targeted by the Encyclopedia of DNA elements (ENCODE) pilot project using a combination of 5' rapid amplification of cDNA ends (RACE) and high-density resolution tiling arrays. We identified previously unannotated and often tissue- or cell-line-specific transcribed fragments (RACEfrags), both 5' distal to the annotated 5' terminus and internal to the annotated gene bounds for the vast majority (81.5%) of the tested genes. Half of the distal RACEfrags span large segments of genomic sequences away from the main portion of the coding transcript and often overlap with the upstream-annotated gene(s). Notably, at least 20% of the resultant novel transcripts have changes in their open reading frames (ORFs), most of them fusing ORFs of adjacent transcripts. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they are not part of very minor splice forms. These results have significant implications concerning (1) our current understanding of the architecture of protein-coding genes; (2) our views on locations of regulatory regions in the genome; and (3) the interpretation of sequence polymorphisms mapping to regions hitherto considered to be "noncoding," ultimately relating to the identification of disease-related sequence alterations.
64
Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 2007; 316:1484-8. [PMID: 17510325 DOI: 10.1126/science.1138341] [Citation(s) in RCA: 1870] [Impact Index Per Article: 110.0] [Indexed: 12/24/2022]
Abstract
Significant fractions of eukaryotic genomes give rise to RNA, much of which is unannotated and has reduced protein-coding potential. The genomic origins and the associations of human nuclear and cytosolic polyadenylated RNAs longer than 200 nucleotides (nt) and whole-cell RNAs less than 200 nt were investigated in this genome-wide study. Subcellular addresses for nucleotides present in detected RNAs were assigned, and their potential processing into short RNAs was investigated. Taken together, these observations suggest a novel role for some unannotated RNAs as primary transcripts for the production of short RNAs. Three potentially functional classes of RNAs have been identified, two of which are syntenically conserved and correlate with the expression state of protein-coding genes. These data support a highly interleaved organization of the human transcriptome.
65
Kapranov P, Willingham AT, Gingeras TR. Genome-wide transcription and the implications for genomic organization. Nat Rev Genet 2007; 8:413-23. [PMID: 17486121 DOI: 10.1038/nrg2083] [Citation(s) in RCA: 529] [Impact Index Per Article: 31.1] [Indexed: 12/16/2022]
Abstract
Recent evidence of genome-wide transcription in several species indicates that the amount of transcription that occurs cannot be entirely accounted for by current sets of genome-wide annotations. Evidence indicates that most of both strands of the human genome might be transcribed, implying extensive overlap of transcriptional units and regulatory elements. These observations suggest that genomic architecture is not colinear, but is instead interleaved and modular, and that the same genomic sequences are multifunctional: that is, used for multiple independently regulated transcripts and as regulatory regions. What are the implications and consequences of such an interleaved genomic architecture in terms of increased information content, transcriptional complexity, evolution and disease states?
66
Sasaki D, Kondo S, Maeda N, Gingeras TR, Hasegawa Y, Hayashizaki Y. Characteristics of oligonucleotide tiling arrays measured by hybridizing full-length cDNA clones: causes of signal variation and false positive signals. Genomics 2007; 89:541-51. [PMID: 17292583 DOI: 10.1016/j.ygeno.2006.12.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 07/11/2006] [Revised: 11/14/2006] [Accepted: 12/29/2006] [Indexed: 10/23/2022]
Abstract
An assessment of the hybridization characteristics of oligonucleotide tiling arrays was carried out using 162 full-length sequenced cDNA clones in spike-in experiments. The properties of array probes that influence signal intensity were investigated, and their capability in the detection of the cDNA exons was evaluated. The signal intensities detected in exonic and nonexonic genomic regions were examined by focusing on the features of probe sequences that raise or lower the level of intensity and on the causes of false positive signals found in nonexonic regions. The effectiveness of measures used in published protocols to improve the separation between signal and background intensity distributions, including the use of replicates and threshold parameterization of signal intensity, was assessed. Sensitivity and specificity in the detection of exons were measured using various sets of threshold parameters, and the effects of each parameter on the detection efficiency and the rate of false positives were evaluated. It was also demonstrated that hybridization of full-length cDNA clones is an excellent method to investigate the characteristics of oligonucleotide tiling arrays.
|
67
|
Yang A, Zhu Z, Kapranov P, McKeon F, Church GM, Gingeras TR, Struhl K. Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Mol Cell 2007; 24:593-602. [PMID: 17188034 DOI: 10.1016/j.molcel.2006.10.018] [Citation(s) in RCA: 221] [Impact Index Per Article: 13.0] [Received: 01/04/2006] [Revised: 08/08/2006] [Accepted: 10/12/2006] [Indexed: 01/10/2023]
Abstract
Using tiled microarrays covering the entire human genome, we identify approximately 5800 target sites for p63, a p53 homolog essential for stratified epithelial development. p63 targets are enriched for genes involved in cell adhesion, proliferation, death, and signaling pathways. The quality of the derived DNA sequence motif for p63 targets correlates with binding strength in vivo, but only a small minority of motifs in the genome is bound by p63. Conversely, many p63 targets have motif scores expected for random genomic regions. Thus, p63 binding in vivo is highly selective and often requires additional factors beyond the simple protein-DNA interaction. There is a significant, but complex, relationship between p63 target sites and p63-responsive genes, with DeltaNp63 isoforms being linked to transcriptional activation. Many p63 binding regions are evolutionarily conserved and/or associated with sequence motifs for other transcription factors, suggesting that a substantial portion of p63 sites is biologically relevant.
|
68
|
Ghosh S, Hirsch HA, Sekinger E, Struhl K, Gingeras TR. Rank-statistics based enrichment-site prediction algorithm developed for chromatin immunoprecipitation on chip experiments. BMC Bioinformatics 2006; 7:434. [PMID: 17022824 PMCID: PMC1615882 DOI: 10.1186/1471-2105-7-434] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 06/14/2006] [Accepted: 10/05/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. Tiling arrays are increasingly used in chromatin immunoprecipitation (IP) experiments (ChIP on chip). ChIP on chip facilitates the generation of genome-wide maps of in-vivo interactions between DNA-associated proteins including transcription factors and DNA. Analysis of the hybridization of an immunoprecipitated sample to a tiling array facilitates the identification of ChIP-enriched segments of the genome. These enriched segments are putative targets of antibody assayable regulatory elements. The enrichment response is not ubiquitous across the genome. Typically 5 to 10% of tiled probes manifest some significant enrichment. Depending upon the factor being studied, this response can drop to less than 1%. The detection and assessment of significance for interactions that emanate from non-canonical and/or un-annotated regions of the genome is especially challenging. This is the motivation behind the proposed algorithm.
RESULTS We have proposed a novel rank and replicate statistics-based methodology for identifying and ascribing statistical confidence to regions of ChIP-enrichment. The algorithm is optimized for identification of sites that manifest low levels of enrichment but are true positives, as validated by alternative biochemical experiments. Although the method is described here in the context of ChIP on chip experiments, it can be generalized to any treatment-control experimental design. The results of the algorithm show a high degree of concordance with independent biochemical validation methods. The sensitivity and specificity of the algorithm have been characterized via quantitative PCR and independent computational approaches.
CONCLUSION The algorithm ranks all enrichment sites based on their intra-replicate ranks and inter-replicate rank consistency. Following the ranking, the method allows segmentation of sites based on a meta p-value, a composite array signal enrichment criterion, or a composite of these two measures. The sensitivities obtained subsequent to the segmentation of data using a meta p-value of 10^-5, an array signal enrichment of 0.2 and a composite of these two values are 88%, 87% and 95%, respectively.
|
69
|
Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M. Genome-wide analysis of estrogen receptor binding sites. Nat Genet 2006; 38:1289-97. [PMID: 17013392 DOI: 10.1038/ng1901] [Citation(s) in RCA: 1043] [Impact Index Per Article: 57.9] [Received: 07/19/2006] [Accepted: 09/08/2006] [Indexed: 11/09/2022]
Abstract
The estrogen receptor is the master transcriptional regulator of breast cancer phenotype and the archetype of a molecular therapeutic target. We mapped all estrogen receptor and RNA polymerase II binding sites on a genome-wide scale, identifying the authentic cis binding sites and target genes, in breast cancer cells. Combining this unique resource with gene expression data demonstrates distinct temporal mechanisms of estrogen-mediated gene regulation, particularly in the case of estrogen-suppressed genes. Furthermore, this resource has allowed the identification of cis-regulatory sites in previously unexplored regions of the genome and the cooperating transcription factors underlying estrogen signaling in breast cancer.
|
70
|
Manak JR, Dike S, Sementchenko V, Kapranov P, Biemar F, Long J, Cheng J, Bell I, Ghosh S, Piccolboni A, Gingeras TR. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat Genet 2006; 38:1151-8. [PMID: 16951679 DOI: 10.1038/ng1875] [Citation(s) in RCA: 157] [Impact Index Per Article: 8.7] [Received: 02/01/2006] [Accepted: 08/03/2006] [Indexed: 11/09/2022]
Abstract
Many animal and plant genomes are transcribed much more extensively than current annotations predict. However, the biological function of these unannotated transcribed regions is largely unknown. Approximately 7% and 23% of the detected transcribed nucleotides during D. melanogaster embryogenesis map to unannotated intergenic and intronic regions, respectively. Based on computational analysis of coordinated transcription, we conservatively estimate that 29% of all unannotated transcribed sequences function as missed or alternative exons of well-characterized protein-coding genes. We estimate that 15.6% of intergenic transcribed regions function as missed or alternative transcription start sites (TSS) used by 11.4% of the expressed protein-coding genes. Identification of P element mutations within or near newly identified 5' exons provides a strategy for mapping previously uncharacterized mutations to their respective genes. Collectively, these data indicate that at least 85% of the fly genome is transcribed and processed into mature transcripts representing at least 30% of the fly genome.
|
71
|
|
72
|
Abstract
The widespread occurrence of noncoding (nc) RNAs--unannotated eukaryotic transcripts with reduced protein coding potential--suggests that they are functionally important. Study of ncRNAs is increasing our understanding of the organization and regulation of genomes.
|
73
|
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006; 7 Suppl 1:S2.1-31. [PMID: 16925836 PMCID: PMC1810551 DOI: 10.1186/gb-2006-7-s1-s2] [Citation(s) in RCA: 198] [Impact Index Per Article: 11.0] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
RESULTS The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
CONCLUSION This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
|
74
|
Liu W, Putnam AL, Xu-Yu Z, Szot GL, Lee MR, Zhu S, Gottlieb PA, Kapranov P, Gingeras TR, Fazekas de St Groth B, Clayberger C, Soper DM, Ziegler SF, Bluestone JA. CD127 expression inversely correlates with FoxP3 and suppressive function of human CD4+ T reg cells. J Exp Med 2006; 203:1701-11. [PMID: 16818678 PMCID: PMC2118339 DOI: 10.1084/jem.20060772] [Citation(s) in RCA: 2024] [Impact Index Per Article: 112.4] [Indexed: 12/13/2022]
Abstract
Regulatory T (T reg) cells are critical regulators of immune tolerance. Most T reg cells are defined based on expression of CD4, CD25, and the transcription factor, FoxP3. However, these markers have proven problematic for uniquely defining this specialized T cell subset in humans. We found that the IL-7 receptor (CD127) is down-regulated on a subset of CD4+ T cells in peripheral blood. We demonstrate that the majority of these cells are FoxP3+, including those that express low levels or no CD25. A combination of CD4, CD25, and CD127 resulted in a highly purified population of T reg cells accounting for significantly more cells than previously identified based on other cell surface markers. These cells were highly suppressive in functional suppressor assays. In fact, cells separated based solely on CD4 and CD127 expression were anergic and, although representing at least three times the number of cells (including both CD25+CD4+ and CD25−CD4+ T cell subsets), were as suppressive as the “classic” CD4+CD25hi T reg cell subset. Finally, we show that CD127 can be used to quantitate T reg cell subsets in individuals with type 1 diabetes, supporting the use of CD127 as a biomarker for human T reg cells.
|
75
|
Zhang M, Drenkow J, Lankford CSR, Frucht DM, Rabin RL, Gingeras TR, Venkateshan C, Schwartzkopff F, Clouse KA, Dayton AI. HIV regulation of the IL-7R: a viral mechanism for enhancing HIV-1 replication in human macrophages in vitro. J Leukoc Biol 2006; 79:1328-38. [PMID: 16614257 DOI: 10.1189/jlb.0704424] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/24/2022] Open
Abstract
We report a novel mechanism, involving up-regulation of the interleukin (IL)-7 cytokine receptor, by which human immunodeficiency virus (HIV) enhances its own production in monocyte-derived macrophages (MDM) in vitro. HIV-1 infection or treatment of MDM cultures with exogenous HIV-1 Tat(86) protein up-regulates the IL-7 receptor (IL-7R) alpha-chain at the levels of steady-state RNA, protein, and functional IL-7R on the cell surface (as measured by ligand-induced receptor signaling). This IL-7R up-regulation is associated with increased amounts of HIV-1 virions in the supernatants of infected MDM cultures treated with exogenous IL-7 cytokine. The overall effect of IL-7 stimulation on HIV replication in MDM culture supernatants is typically in the range of one log and greater. The results are consistent with a model in which HIV infection produces the Tat protein, which in turn up-regulates IL-7R in a paracrine manner. This results in increased IL-7R signaling in response to the IL-7 cytokine, which ultimately promotes early events in HIV replication, including binding/entry and possibly other steps prior to reverse transcription. The results suggest that the effects of IL-7 on HIV replication in MDM should be considered when analyzing and designing clinical trials involving treatment of patients with IL-7 or Tat vaccines.
|