3851
|
Lin JM, Collins PJ, Trinklein ND, Fu Y, Xi H, Myers RM, Weng Z. Transcription factor binding and modified histones in human bidirectional promoters. Genome Res 2007; 17:818-27. [PMID: 17568000 PMCID: PMC1891341 DOI: 10.1101/gr.5623407] [Citation(s) in RCA: 106] [Impact Index Per Article: 5.9] [Indexed: 10/23/2022]
Abstract
Bidirectional promoters have received considerable attention because of their ability to regulate two downstream genes (divergent genes). They are also highly abundant, directing the transcription of approximately 11% of genes in the human genome. We categorized the presence of DNA sequence motifs, binding of transcription factors, and modified histones as overrepresented, shared, or underrepresented in bidirectional promoters with respect to unidirectional promoters. We found that a small set of motifs, including GABPA, MYC, E2F1, E2F4, NRF-1, CCAAT, YY1, and ACTACAnnTCC are overrepresented in bidirectional promoters, while the majority (73%) of known vertebrate motifs are underrepresented. We performed chromatin-immunoprecipitation (ChIP), followed by quantitative PCR for GABPA, on 118 regions in the human genome and showed that it binds to bidirectional promoters more frequently than unidirectional promoters, and its position-specific scoring matrix is highly predictive of binding. Signatures of active transcription, such as occupancy of RNA polymerase II and the modified histones H3K4me2, H3K4me3, and H3ac, are overrepresented in regions around bidirectional promoters, suggesting that a higher fraction of divergent genes are transcribed in a given cell than the fraction of other genes. Accordingly, analysis of whole-genome microarray data indicates that 68% of divergent genes are transcribed compared with 44% of all human genes. By combining the analysis of publicly available ENCODE data and a detailed study of GABPA, we survey bidirectional promoters with breadth and depth, leading to biological insights concerning their motif composition and bidirectional regulatory mode.
Affiliation(s)
- Jane M. Lin
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, 02215, USA
- Patrick J. Collins
  - Department of Genetics, Stanford University, School of Medicine, Stanford, California 94305-5120, USA
- Nathan D. Trinklein
  - Department of Genetics, Stanford University, School of Medicine, Stanford, California 94305-5120, USA
- Yutao Fu
  - Program in Bioinformatics and Systems Biology, Boston University, Boston, Massachusetts, 02215, USA
- Hualin Xi
  - Program in Bioinformatics and Systems Biology, Boston University, Boston, Massachusetts, 02215, USA
- Richard M. Myers
  - Department of Genetics, Stanford University, School of Medicine, Stanford, California 94305-5120, USA
- Zhiping Weng
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, 02215, USA
  - Program in Bioinformatics and Systems Biology, Boston University, Boston, Massachusetts, 02215, USA
  - Corresponding author. Fax: (617) 353-6766
|
3852
|
Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res 2007; 17:910-6. [PMID: 17568006 PMCID: PMC1891349 DOI: 10.1101/gr.5574907] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.1] [Indexed: 12/24/2022]
Abstract
Identifying the genome-wide binding sites of transcription factors is important in deciphering transcriptional regulatory networks. ChIP-chip (Chromatin immunoprecipitation combined with microarrays) has been widely used to map transcription factor binding sites in the human genome. However, whole genome ChIP-chip analysis is still technically challenging in vertebrates. We recently developed STAGE as an unbiased method for identifying transcription factor binding sites in the genome. STAGE is conceptually based on SAGE, except that the input is ChIP-enriched DNA. In this study, we implemented an improved sequencing strategy and analysis methods and applied STAGE to map the genomic binding profile of the transcription factor STAT1 after interferon treatment. STAT1 is mainly responsible for mediating the cellular responses to interferons, such as cell proliferation, apoptosis, immune surveillance, and immune responses. We present novel algorithms for STAGE tag analysis to identify enriched loci with high specificity, as verified by quantitative ChIP. STAGE identified several previously unknown STAT1 target genes, many of which are involved in mediating the response to interferon-gamma signaling. STAGE is thus a viable method for identifying the chromosomal targets of transcription factors and generating meaningful biological hypotheses that further our understanding of transcriptional regulatory networks.
Affiliation(s)
- Akshay A. Bhinge
  - Institute for Cellular and Molecular Biology, Center for Systems and Synthetic Biology, Section of Molecular Genetics and Microbiology, University of Texas at Austin, Austin, Texas 78712, USA
- Jonghwan Kim
  - Institute for Cellular and Molecular Biology, Center for Systems and Synthetic Biology, Section of Molecular Genetics and Microbiology, University of Texas at Austin, Austin, Texas 78712, USA
- Ghia M. Euskirchen
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
- Michael Snyder
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
- Vishwanath R. Iyer
  - Institute for Cellular and Molecular Biology, Center for Systems and Synthetic Biology, Section of Molecular Genetics and Microbiology, University of Texas at Austin, Austin, Texas 78712, USA
  - Corresponding author. Fax: (512) 232-3472
|
3853
|
Amin AR, Thompson SD, Amin SA. Future of genomics in diagnosis of human arthritis: the hype, hope and metamorphosis for tomorrow. Future Rheumatol 2007. [DOI: 10.2217/17460816.2.4.385] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022]
|
3854
|
Geuten K, Massingham T, Darius P, Smets E, Goldman N. Experimental Design Criteria in Phylogenetics: Where to Add Taxa. Syst Biol 2007; 56:609-22. [PMID: 17654365 DOI: 10.1080/10635150701499563] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.0] [Indexed: 10/23/2022] Open
Abstract
Accurate phylogenetic inference is a topic of intensive research and debate and has been studied in response to many different factors: for example, differences in the method of reconstruction, the shape of the underlying tree, the substitution model, and varying quantities and types of data. Investigating whether the conditions used might lead to inaccurate inference has been attempted through elaborate data exploration but less attention has been given to creating a unified methodology to enable experimental designs in phylogenetic analysis to be improved and so avoid suboptimal conditions. Experimental design has been part of the field of statistics since the seminal work of Fisher in the early 20th century and a large body of literature exists on how to design optimum experiments. Here we investigate the use of the Fisher information matrix to decide between candidate positions for adding a taxon to a fixed topology, and introduce a parameter transformation that permits comparison of these different designs. This extension to Goldman (1998. Proc. R. Soc. Lond. B. 265: 1779-1786) thus allows investigation of "where to add taxa" in a phylogeny. We compare three different measures of the total information for selecting the position to add a taxon to a tree. Our methods are illustrated by investigating the behavior of the three criteria when adding a branch to model trees, and by applying the different criteria to two biological examples: a simplified taxon-sampling problem in the balsaminoid Ericales and the phylogeny of seed plants.
Affiliation(s)
- Koen Geuten
  - Laboratory of Plant Systematics, KU Leuven, Belgium.
|
3855
|
Abstract
Recent studies of developmental biology have shown that the genes controlling phenotypic characters expressed in the early stage of development are highly conserved and that recent evolutionary changes have occurred primarily in the characters expressed in later stages of development. Even the genes controlling the latter characters are generally conserved, but there is a large component of neutral or nearly neutral genetic variation within and between closely related species. Phenotypic evolution occurs primarily by mutation of genes that interact with one another in the developmental process. The enormous amount of phenotypic diversity among different phyla or classes of organisms is a product of accumulation of novel mutations and their conservation that have facilitated adaptation to different environments. Novel mutations may be incorporated into the genome by natural selection (elimination of preexisting genotypes) or by random processes such as genetic and genomic drift. However, once the mutations are incorporated into the genome, they may generate developmental constraints that will affect the future direction of phenotypic evolution. It appears that the driving force of phenotypic evolution is mutation, and natural selection is of secondary importance.
Affiliation(s)
- Masatoshi Nei
  - Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, 328 Mueller Laboratory, University Park, PA 16802, USA.
|
3856
|
Asthana S, Noble WS, Kryukov G, Grant CE, Sunyaev S, Stamatoyannopoulos JA. Widely distributed noncoding purifying selection in the human genome. Proc Natl Acad Sci U S A 2007; 104:12410-5. [PMID: 17640883 PMCID: PMC1941483 DOI: 10.1073/pnas.0705140104] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Indexed: 11/18/2022] Open
Abstract
It is widely assumed that human noncoding sequences comprise a substantial reservoir for functional variants impacting gene regulation and other chromosomal processes. Evolutionarily conserved noncoding sequences (CNSs) in the human genome have attracted considerable attention for their potential to simplify the search for functional elements and phenotypically important human alleles. A major outstanding question is whether functionally significant human noncoding variation is concentrated in CNSs or distributed more broadly across the genome. Here, we combine whole genome sequence data from four nonhuman species (chimp, dog, mouse, and rat) with recently available comprehensive human polymorphism data to analyze selection at single-nucleotide resolution. We show that a substantial fraction of active purifying selection in human noncoding sequences occurs outside of CNSs and is diffusely distributed across the genome. This finding suggests the existence of a large complement of human noncoding variants that may impact gene expression and phenotypic traits, the majority of which will escape detection with current approaches to genome analysis.
Affiliation(s)
- Saurabh Asthana
  - Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115
- William S. Noble
  - Department of Genome Sciences, University of Washington, 1705 Northeast Pacific Street, Seattle, WA 98195
- Gregory Kryukov
  - Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115
- Charles E. Grant
  - Department of Genome Sciences, University of Washington, 1705 Northeast Pacific Street, Seattle, WA 98195
- Shamil Sunyaev
  - Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115
  - To whom correspondence may be addressed.
- John A. Stamatoyannopoulos
  - Department of Genome Sciences, University of Washington, 1705 Northeast Pacific Street, Seattle, WA 98195
  - To whom correspondence may be addressed.
|
3857
|
Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RDG, Chenoweth JG, Tesar PJ, Furey TS, Ren B, Weng Z, Crawford GE. Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genet 2007; 3:e136. [PMID: 17708682 PMCID: PMC1950163 DOI: 10.1371/journal.pgen.0030136] [Citation(s) in RCA: 174] [Impact Index Per Article: 9.7] [Received: 05/10/2007] [Accepted: 06/27/2007] [Indexed: 11/18/2022] Open
Abstract
The identification of regulatory elements from different cell types is necessary for understanding the mechanisms controlling cell type-specific and housekeeping gene expression. Mapping DNaseI hypersensitive (HS) sites is an accurate method for identifying the location of functional regulatory elements. We used a high throughput method called DNase-chip to identify 3,904 DNaseI HS sites from six cell types across 1% of the human genome. A significant number (22%) of DNaseI HS sites from each cell type are ubiquitously present among all cell types studied. Surprisingly, nearly all of these ubiquitous DNaseI HS sites correspond to either promoters or insulator elements: 86% of them are located near annotated transcription start sites and 10% are bound by CTCF, a protein with known enhancer-blocking insulator activity. We also identified a large number of DNaseI HS sites that are cell type specific (only present in one cell type); these regions are enriched for enhancer elements and correlate with cell type-specific gene expression as well as cell type-specific histone modifications. Finally, we found that approximately 8% of the genome overlaps a DNaseI HS site in at least one of the six cell lines studied, indicating that a significant percentage of the genome is potentially functional.
Affiliation(s)
- Hualin Xi
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Hennady P Shulha
  - Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
- Jane M Lin
  - Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
- Teresa R Vales
  - Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America
- Yutao Fu
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- David M Bodine
  - Hematopoiesis Section, Genetics and Molecular Biology Branch, National Human Genome Research Institute, Bethesda, Maryland, United States of America
- Ronald D. G McKay
  - National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, United States of America
- Josh G Chenoweth
  - National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, United States of America
- Paul J Tesar
  - National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, United States of America
- Terrence S Furey
  - Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America
- Bing Ren
  - Ludwig Institute for Cancer Research, University of California San Diego, La Jolla, California, United States of America
- Zhiping Weng
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
  - Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
  - To whom correspondence should be addressed.
- Gregory E Crawford
  - Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America
  - To whom correspondence should be addressed.
|
3858
|
News in brief. Nat Methods 2007. [DOI: 10.1038/nmeth0707-539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
|
3859
|
Research highlights. Nat Biotechnol 2007. [DOI: 10.1038/nbt0707-753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
3860
|
Skipper M. Resourcing the genome. Nat Rev Genet 2007. [DOI: 10.1038/nrg2151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
3861
|
|
3862
|
|
3863
|
Karnani N, Taylor C, Malhotra A, Dutta A. Pan-S replication patterns and chromosomal domains defined by genome-tiling arrays of ENCODE genomic areas. Genome Res 2007; 17:865-76. [PMID: 17568004 PMCID: PMC1891345 DOI: 10.1101/gr.5427007] [Citation(s) in RCA: 88] [Impact Index Per Article: 4.9] [Received: 04/21/2006] [Accepted: 10/30/2006] [Indexed: 12/24/2022]
Abstract
In eukaryotes, accurate control of replication time is required for the efficient completion of S phase and maintenance of genome stability. We present a high-resolution genome-tiling array-based profile of replication timing for approximately 1% of the human genome studied by The ENCODE Project Consortium. Twenty percent of the investigated segments replicate asynchronously (pan-S). These areas are rich in genes and CpG islands, features they share with early-replicating loci. Interphase FISH showed that pan-S replication is a consequence of interallelic variation in replication time and is not an artifact derived from a specific cell cycle synchronization method or from aneuploidy. The interallelic variation in replication time is likely due to interallelic variation in chromatin environment, because while the early- or late-replicating areas were exclusively enriched in activating or repressing histone modifications, respectively, the pan-S areas had both types of histone modification. The replication profile of the chromosomes identified contiguous chromosomal segments of hundreds of kilobases separated by smaller segments where the replication time underwent an acute transition. Close examination of one such segment demonstrated that the delay of replication time was accompanied by a decrease in level of gene expression and appearance of repressive chromatin marks, suggesting that the transition segments are boundary elements separating chromosomal domains with different chromatin environments.
Affiliation(s)
- Neerja Karnani
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22908, USA
- Christopher Taylor
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22908, USA
  - Department of Computer Science, University of Virginia, Charlottesville, Virginia 22908, USA
- Ankit Malhotra
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22908, USA
  - Department of Computer Science, University of Virginia, Charlottesville, Virginia 22908, USA
- Anindya Dutta
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22908, USA
|
3864
|
Thurman RE, Day N, Noble WS, Stamatoyannopoulos JA. Identification of higher-order functional domains in the human ENCODE regions. Genome Res 2007; 17:917-27. [PMID: 17568007 PMCID: PMC1891350 DOI: 10.1101/gr.6081407] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.7] [Received: 10/29/2006] [Accepted: 03/27/2007] [Indexed: 11/25/2022]
Abstract
It has long been posited that human and other large genomes are organized into higher-order (i.e., greater than gene-sized) functional domains. We hypothesized that diverse experimental data types generated by The ENCODE Project Consortium could be combined to delineate active and quiescent or repressed functional domains and thereby illuminate the higher-order functional architecture of the genome. To address this, we coupled wavelet analysis with hidden Markov models for unbiased discovery of "domain-level" behavior in high-resolution functional genomic data, including activating and repressive histone modifications, RNA output, and DNA replication timing. We find that higher-order patterns in these data types are largely concordant and may be analyzed collectively in the context of HeLa cells to delineate 53 active and 62 repressed functional domains within the ENCODE regions. Active domains comprise approximately 44% of the ENCODE regions but contain approximately 75%-80% of annotated genes, transcripts, and CpG islands. Repressed domains are enriched in certain classes of repetitive elements and, surprisingly, in evolutionarily conserved nonexonic sequences. The functional domain structure of the ENCODE regions appears to be largely stable across different cell types. Taken together, our results suggest that higher-order functional domains represent a fundamental organizing principle of human genome architecture.
Affiliation(s)
- Robert E. Thurman
  - Division of Medical Genetics, University of Washington, Seattle, Washington 98195, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Nathan Day
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- William S. Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
|
3865
|
Greenbaum JA, Parker SC, Tullius TD. Detection of DNA structural motifs in functional genomic elements. Genome Res 2007; 17:940-6. [PMID: 17568009 PMCID: PMC1891352 DOI: 10.1101/gr.5602807] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 06/06/2006] [Accepted: 11/22/2006] [Indexed: 01/03/2023]
Abstract
The completion of the human genome project has fueled the search for regulatory elements by a variety of different approaches. Many successful analyses have focused on examining primary DNA sequence and/or chromatin structure. However, it has been difficult to detect common sequence motifs within the feature of chromatin structure most closely associated with regulatory elements, DNase I hypersensitive sites (DHSs). Considering just the nucleotide sequence and/or the chromatin structure of regulatory elements may neglect a critical feature of what is recognized by the regulatory machinery--DNA structure. We introduce a new computational method to detect common DNA structural motifs in a large collection of DHSs that are found in the ENCODE regions of the human genome. We show that DHSs have common DNA structural motifs that show no apparent sequence consensus. One such structural motif is much more highly enriched in experimentally identified DHSs that are in CpG islands and near transcription start sites (TSSs), compared to DHSs not in CpG islands and farther from TSSs, suggesting that DNA structural motifs may participate in the formation of functional regulatory elements. We propose that studies of the conservation of DNA structure, independent of sequence conservation, will provide new information about the link between the nucleotide sequence of a DNA molecule and its experimentally demonstrated function.
Affiliation(s)
- Jason A. Greenbaum
  - Program in Bioinformatics, Boston University, Boston, Massachusetts 02215, USA
- Stephen C.J. Parker
  - Program in Bioinformatics, Boston University, Boston, Massachusetts 02215, USA
- Thomas D. Tullius
  - Program in Bioinformatics, Boston University, Boston, Massachusetts 02215, USA
  - Department of Chemistry, Boston University, Boston, Massachusetts 02215, USA
|
3866
|
Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J, Dike S, Wyss C, Henrichsen CN, Holroyd N, Dickson MC, Taylor R, Hance Z, Foissac S, Myers RM, Rogers J, Hubbard T, Harrow J, Guigó R, Gingeras TR, Antonarakis SE, Reymond A. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res 2007; 17:746-59. [PMID: 17567994 PMCID: PMC1891335 DOI: 10.1101/gr.5660607] [Citation(s) in RCA: 151] [Impact Index Per Article: 8.4] [Received: 06/19/2006] [Accepted: 01/22/2007] [Indexed: 11/24/2022]
Abstract
This report presents systematic empirical annotation of transcript products from 399 annotated protein-coding loci across the 1% of the human genome targeted by the Encyclopedia of DNA elements (ENCODE) pilot project using a combination of 5' rapid amplification of cDNA ends (RACE) and high-density resolution tiling arrays. We identified previously unannotated and often tissue- or cell-line-specific transcribed fragments (RACEfrags), both 5' distal to the annotated 5' terminus and internal to the annotated gene bounds for the vast majority (81.5%) of the tested genes. Half of the distal RACEfrags span large segments of genomic sequences away from the main portion of the coding transcript and often overlap with the upstream-annotated gene(s). Notably, at least 20% of the resultant novel transcripts have changes in their open reading frames (ORFs), most of them fusing ORFs of adjacent transcripts. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they are not part of very minority splice forms. These results have significant implications concerning (1) our current understanding of the architecture of protein-coding genes; (2) our views on locations of regulatory regions in the genome; and (3) the interpretation of sequence polymorphisms mapping to regions hitherto considered to be "noncoding," ultimately relating to the identification of disease-related sequence alterations.
Affiliation(s)
- France Denoeud
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
- Catherine Ucla
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Adam Frankish
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Robert Castelo
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
- Jorg Drenkow
  - Affymetrix, Inc., Santa Clara, California 95051, USA
- Julien Lagarde
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
- Tyler Alioto
  - Center for Genomic Regulation, 08003 Barcelona, Catalonia, Spain
- Caroline Manzano
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Jacqueline Chrast
  - Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Sujit Dike
  - Affymetrix, Inc., Santa Clara, California 95051, USA
- Carine Wyss
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Nancy Holroyd
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Mark C. Dickson
  - Department of Genetics, Stanford Human Genome Center, Stanford University School of Medicine, Stanford, California 94305-5120, USA
- Ruth Taylor
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Zahra Hance
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Sylvain Foissac
  - Center for Genomic Regulation, 08003 Barcelona, Catalonia, Spain
- Richard M. Myers
  - Department of Genetics, Stanford Human Genome Center, Stanford University School of Medicine, Stanford, California 94305-5120, USA
- Jane Rogers
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Tim Hubbard
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Jennifer Harrow
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom
- Roderic Guigó
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
  - Center for Genomic Regulation, 08003 Barcelona, Catalonia, Spain
- Stylianos E. Antonarakis
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Alexandre Reymond
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
  - Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
|
3867
|
Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, Hertel J, Lindemeyer M, Reiche K, Tanzer A, Ucla C, Wyss C, Antonarakis SE, Denoeud F, Lagarde J, Drenkow J, Kapranov P, Gingeras TR, Guigó R, Snyder M, Gerstein MB, Reymond A, Hofacker IL, Stadler PF. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res 2007; 17:852-64. [PMID: 17568003 PMCID: PMC1891344 DOI: 10.1101/gr.5650707] [Citation(s) in RCA: 124] [Impact Index Per Article: 6.9] [Received: 06/16/2006] [Accepted: 12/12/2006] [Indexed: 12/16/2022]
Abstract
Functional RNA structures play an important role both in the context of noncoding RNA transcripts as well as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic-stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70%, many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
Affiliation(s)
- Stefan Washietl
- Institute for Theoretical Chemistry, University of Vienna, A-1090 Wien, Austria.
|
3868
|
Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei CL, Gingeras TR, Guigó R, Harrow J, Gerstein MB. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res 2007; 17:839-51. [PMID: 17568002 PMCID: PMC1891343 DOI: 10.1101/gr.5586307] [Citation(s) in RCA: 156] [Impact Index Per Article: 8.7] [Received: 06/01/2006] [Accepted: 10/03/2006] [Indexed: 10/23/2022]
Abstract
Arising from either retrotransposition or genomic duplication of functional genes, pseudogenes are "genomic fossils" valuable for exploring the dynamics and evolution of genes and genomes. Pseudogene identification is an important problem in computational genomics, and is also critical for obtaining an accurate picture of a genome's structure and function. However, no consensus computational scheme for defining and detecting pseudogenes has been developed thus far. As part of the ENCyclopedia Of DNA Elements (ENCODE) project, we have compared several distinct pseudogene annotation strategies and found that different approaches and parameters often resulted in rather distinct sets of pseudogenes. We subsequently developed a consensus approach for annotating pseudogenes (derived from protein coding genes) in the ENCODE regions, resulting in 201 pseudogenes, two-thirds of which originated from retrotransposition. A survey of orthologs for these pseudogenes in 28 vertebrate genomes showed that a significant fraction (approximately 80%) of the processed pseudogenes are primate-specific sequences, highlighting the increasing retrotransposition activity in primates. Analysis of sequence conservation and variation also demonstrated that most pseudogenes evolve neutrally, and processed pseudogenes appear to have lost their coding potential immediately or soon after their emergence. In order to explore the functional implication of pseudogene prevalence, we have extensively examined the transcriptional activity of the ENCODE pseudogenes. We performed a systematic series of pseudogene-specific RACE analyses. These, together with complementary evidence derived from tiling microarrays and high throughput sequencing, demonstrated that at least a fifth of the 201 pseudogenes are transcribed in one or more cell lines or tissues.
Affiliation(s)
- Deyou Zheng
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Adam Frankish
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1HH, United Kingdom
- Robert Baertsch
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA
- Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Siew Woh Choo
- Genome Institute of Singapore, Singapore 138672, Singapore
- Yontao Lu
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA
- France Denoeud
- Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, Passeig Marítim de la Barceloneta, 37-49, 08003, Barcelona, Catalonia, Spain
- Stylianos E. Antonarakis
- Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Michael Snyder
- Molecular, Cellular & Developmental Biology Department, Yale University, New Haven, Connecticut 06520, USA
- Yijun Ruan
- Genome Institute of Singapore, Singapore 138672, Singapore
- Chia-Lin Wei
- Genome Institute of Singapore, Singapore 138672, Singapore
- Roderic Guigó
- Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, Passeig Marítim de la Barceloneta, 37-49, 08003, Barcelona, Catalonia, Spain
- Center for Genomic Regulation, Passeig Marítim de la Barceloneta, 37-49, 08003, Barcelona, Catalonia, Spain
- Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1HH, United Kingdom
- Mark B. Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
|
3869
|
Reiche K, Stadler PF. RNAstrand: reading direction of structured RNAs in multiple sequence alignments. Algorithms Mol Biol 2007; 2:6. [PMID: 17540014 PMCID: PMC1892782 DOI: 10.1186/1748-7188-2-6] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 04/12/2007] [Accepted: 05/31/2007] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION: Genome-wide screens for structured ncRNA genes in mammals, urochordates, and nematodes have predicted thousands of putative ncRNA genes and other structured RNA motifs. A prerequisite for their functional annotation is to determine the reading direction with high precision. RESULTS: While folding energies of an RNA and its reverse complement are similar, the differences are sufficient, at least in conjunction with substitution patterns, to discriminate between structured RNAs and their complements. We present here a support vector machine that reliably classifies the reading direction of a structured RNA from a multiple sequence alignment and provides a considerable improvement in classification accuracy over previous approaches. SOFTWARE: RNAstrand is freely available as a stand-alone tool from http://www.bioinf.uni-leipzig.de/Software/RNAstrand and is also included in the latest release of RNAz, a part of the Vienna RNA Package.
Affiliation(s)
- Kristin Reiche
- Bioinformatics Group, Dept. of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
- Peter F Stadler
- Bioinformatics Group, Dept. of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
|
3870
|
Clark TG, Andrew T, Cooper GM, Margulies EH, Mullikin JC, Balding DJ. Functional constraint and small insertions and deletions in the ENCODE regions of the human genome. Genome Biol 2007; 8:R180. [PMID: 17784950 PMCID: PMC2375018 DOI: 10.1186/gb-2007-8-9-r180] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 11/15/2006] [Revised: 09/04/2007] [Accepted: 09/04/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: We describe the distribution of indels in the 44 Encyclopedia of DNA Elements (ENCODE) regions (about 1% of the human genome) and evaluate the potential contributions of small insertion and deletion polymorphisms (indels) to human genetic variation. We relate indels to known genomic annotation features and measures of evolutionary constraint. RESULTS: Indel rates are observed to be reduced approximately 20-fold to 60-fold in exonic regions, 5-fold to 10-fold in sequence that exhibits high evolutionary constraint in mammals, and up to 2-fold in some classes of regulatory elements (for instance, formaldehyde assisted isolation of regulatory elements [FAIRE] and hypersensitive sites). In addition, some noncoding transcription and other chromatin mediated regulatory sites also have reduced indel rates. Overall indel rates for these data are estimated to be smaller than single nucleotide polymorphism (SNP) rates by a factor of approximately 2, with both rates measured as base pairs per 100 kilobases to facilitate comparison. CONCLUSION: Indel rates exhibit a broadly similar distribution across genomic features compared with SNP density rates, with a reduction in rates in coding transcription and evolutionarily constrained sequence. However, unlike indels, SNP rates do not appear to be reduced in some noncoding functional sequences, such as pseudo-exons, and FAIRE and hypersensitive sites. We conclude that indel rates are greatly reduced in transcribed and evolutionarily constrained DNA, and discuss why indel (but not SNP) rates appear to be constrained at some regulatory sites.
Affiliation(s)
- Taane G Clark
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London, W2 1PG, UK
- Toby Andrew
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London, W2 1PG, UK
- Gregory M Cooper
- Department of Genetics, Stanford University, Stanford, California 94305, USA
- Elliott H Margulies
- National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, Maryland 20892, USA
- James C Mullikin
- National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, Maryland 20892, USA
- David J Balding
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London, W2 1PG, UK
|
3871
|
DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project. PLoS Biol 2004; 18:1518-29. [PMID: 15550986 DOI: 10.1101/gr.077479.108] [Citation(s) in RCA: 288] [Impact Index Per Article: 13.7] [Indexed: 01/13/2023] Open
Abstract
The Human Epigenome Project aims to identify, catalogue, and interpret genome-wide DNA methylation phenomena. Occurring naturally on cytosine bases at cytosine-guanine dinucleotides, DNA methylation is intimately involved in diverse biological processes and the aetiology of many diseases. Differentially methylated cytosines give rise to distinct profiles, thought to be specific for gene activity, tissue type, and disease state. The identification of such methylation variable positions will significantly improve our understanding of genome biology and our ability to diagnose disease. Here, we report the results of the pilot study for the Human Epigenome Project entailing the methylation analysis of the human major histocompatibility complex. This study involved the development of an integrated pipeline for high-throughput methylation analysis using bisulphite DNA sequencing, discovery of methylation variable positions, epigenotyping by matrix-assisted laser desorption/ionisation mass spectrometry, and development of an integrated public database available at http://www.epigenome.org. Our analysis of DNA methylation levels within the major histocompatibility complex, including regulatory exonic and intronic regions associated with 90 genes in multiple tissues and individuals, reveals a bimodal distribution of methylation profiles (i.e., the vast majority of the analysed regions were either hypo- or hypermethylated), tissue specificity, inter-individual variation, and correlation with independent gene expression data.
|