1
|
Hoang M, Marçais G, Kingsford C. Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. J Comput Biol 2024; 31:2-20. [PMID: 37975802 PMCID: PMC10794853 DOI: 10.1089/cmb.2023.0212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2023] Open
Abstract
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
Affiliation(s)
- Minh Hoang
- Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Guillaume Marçais
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
|
2
|
Bo S, Sun Q, Ning P, Yuan N, Weng Y, Liang Y, Wang H, Lu Z, Li Z, Zhao X. A novel approach to analyze the association characteristics between post-spliced introns and their corresponding mRNA. Front Genet 2023; 14:1151172. [PMID: 36923795 PMCID: PMC10008863 DOI: 10.3389/fgene.2023.1151172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2023] [Accepted: 02/15/2023] [Indexed: 03/03/2023] Open
Abstract
Studies have shown that post-spliced introns promote cell survival when nutrients are scarce, and intron loss/gain can influence many stages of mRNA metabolism. However, few approaches are currently available to study the correlation between intron sequences and their corresponding mature mRNA sequences. Here, based on the results of the improved Smith-Waterman local alignment-based algorithm method (SW method) and binding free energy weighted local alignment algorithm method (BFE method), the optimal matched segments between introns and their corresponding mature mRNAs in Caenorhabditis elegans (C. elegans) and their relative matching frequency (RF) distributions were obtained. The results showed that although the distributions of relative matching frequencies on mRNAs obtained by the BFE method were similar to the SW method, the interaction intensity in 5' and 3' untranslated regions (UTRs) was weaker than the SW method. The RF distributions in the exon-exon junction regions were comparable, and the effects of long and short introns on mRNA and on the five functional sites with BFE method were similar to the SW method. However, the interaction intensity in 5' and 3' UTR regions with BFE method was weaker than with SW method. Although the matching rate and length distribution shape of the optimal matched fragment were consistent with the SW method, an increase in length was observed. The matching rates and the length of the optimal matched fragments were mainly in the range of 60%-80% and 20-30 bp, respectively. Although we found that there were still matching preferences in the 5' and 3' UTR regions of the mRNAs with BFE, the matching intensities were significantly lower than the matching intensities between introns and their corresponding mRNAs with SW method. Overall, our findings suggest that the interaction between introns and mRNAs results from synergism among different types of sequences during the evolutionary process.
Affiliation(s)
- Suling Bo
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Qiuying Sun
- Department of Oncology, Inner Mongolia Cancer Hospital and The Affiliated People's Hospital of Inner Mongolia Medical University, Hohhot, China
| | - Pengfei Ning
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Ningping Yuan
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Yujie Weng
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Ying Liang
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Huitao Wang
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Zhanyuan Lu
- Inner Mongolia Academy of Agricultural and Animal Husbandry Sciences, Hohhot, China; School of Life Science, Inner Mongolia University, Hohhot, China; Key Laboratory of Black Soil Protection And Utilization (Hohhot), Ministry of Agriculture and Rural Affairs, Hohhot, China; Inner Mongolia Key Laboratory of Degradation Farmland Ecological Restoration and Pollution Control, Hohhot, China
| | - Zhongxian Li
- College of Computer Information, Inner Mongolia Medical University, Hohhot, China
| | - Xiaoqing Zhao
- Inner Mongolia Academy of Agricultural and Animal Husbandry Sciences, Hohhot, China; School of Life Science, Inner Mongolia University, Hohhot, China; Key Laboratory of Black Soil Protection And Utilization (Hohhot), Ministry of Agriculture and Rural Affairs, Hohhot, China; Inner Mongolia Key Laboratory of Degradation Farmland Ecological Restoration and Pollution Control, Hohhot, China
| |
|
3
|
Hoang M, Zheng H, Kingsford C. Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer. J Comput Biol 2022; 29:1288-1304. [PMID: 36095142 PMCID: PMC9807081 DOI: 10.1089/cmb.2022.0275] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/12/2023] Open
Abstract
Minimizers are widely used to sample representative k-mers from biological sequences in many applications, such as read mapping and taxonomy prediction. In most scenarios, having the minimizer scheme select as few k-mer positions as possible (i.e., having a low density) is desirable to reduce computation and memory cost. Despite the growing interest in minimizers, learning an effective scheme with optimal density is still an open question, as it requires solving an apparently challenging discrete optimization problem on the permutation space of k-mer orderings. Most existing schemes are designed to work well in expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, only approximate the original objective with likewise discrete surrogate tasks that are not able to significantly improve the density performance. This article introduces the first continuous relaxation of the density minimizing objective, DeepMinimizer, which employs a novel Deep Learning twin architecture to simultaneously ensure both validity and performance of the minimizer scheme. Our surrogate objective is fully differentiable and, therefore, amenable to efficient gradient-based optimization using GPU computing. Finally, we demonstrate that DeepMinimizer discovers minimizer schemes that significantly outperform state-of-the-art constructions on human genomic sequences.
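To make the notion of density used in this abstract concrete, the following is a minimal sketch of a classic window minimizer, not the DeepMinimizer method itself. It assumes a plain lexicographic k-mer ordering with leftmost tie-breaking; the function names `minimizer_positions` and `density` and the parameters `k` (k-mer length) and `w` (number of consecutive k-mers per window) are illustrative choices, not taken from the paper.

```python
def minimizer_positions(seq, k, w, key=None):
    """Return the sorted start positions of k-mers selected as window minimizers.

    Assumption: `key` maps a k-mer to a sortable rank; by default the k-mer
    itself is used, i.e. lexicographic ordering. Ties go to the leftmost k-mer.
    """
    key = key or (lambda kmer: kmer)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (key(kmers[i]), i)))
    return sorted(selected)


def density(seq, k, w, key=None):
    """Fraction of k-mer positions that the scheme selects (lower is sparser)."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, key)) / n_kmers


positions = minimizer_positions("ACGTACGT", k=2, w=3)
print(positions, density("ACGTACGT", k=2, w=3))
```

Swapping in a different `key` changes the k-mer ordering, which is the degree of freedom that sequence-specific schemes tune; the window rule also means consecutive selected positions are never more than `w` apart.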
Affiliation(s)
- Minh Hoang
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Hongyu Zheng
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
|
4
|
Du Z, D’Alessandro E, Zheng Y, Wang M, Chen C, Wang X, Song C. Retrotransposon Insertion Polymorphisms (RIPs) in Pig Coat Color Candidate Genes. Animals (Basel) 2022; 12:ani12080969. [PMID: 35454216 PMCID: PMC9031378 DOI: 10.3390/ani12080969] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/10/2022] [Revised: 03/28/2022] [Accepted: 04/05/2022] [Indexed: 12/17/2022] Open
Abstract
The diversity of livestock coat color results from human positive selection and represents an indispensable part of breed identity. As an important biodiversity resource, pigs have many special characteristics, including the most visualized feature, coat color, and excellent adaptation, and the coat color represents an important phenotypic characteristic of the pig breed. Exploring the genetic mechanisms of phenotypic characteristics and the melanocortin system is of considerable interest in domestic animals because their energy metabolism and pigmentation have been under strong selection. In this study, 20 genes related to coat color in mammals were selected, and the structural variations (SVs) in these genic regions were identified by sequence alignment across 17 assembled pig genomes representing different types of pigs (miniature, lean, and fat type). A total of 167 large structural variations (>50 bp) of coat-color genes, which overlap with retrotransposon insertions (>50 bp), were obtained and designated as putative RIPs. Finally, 42 RIPs were confirmed by PCR detection. Additionally, eleven RIP sites were further evaluated for their genotypic distributions by PCR in more individuals of eleven domesticated breeds representing different coat color groups. Differential distributions of these RIPs were observed across populations, and some RIPs may be associated with breed differences.
Affiliation(s)
- Zhanyu Du
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
| | - Enrico D’Alessandro
- Department of Veterinary Sciences, University of Messina, Via Palatucci, 98168 Messina, Italy
| | - Yao Zheng
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
| | - Mengli Wang
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
| | - Cai Chen
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
| | - Xiaoyan Wang
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
| | - Chengyi Song
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
| |
|
5
|
Gu A, Cho HJ, Sheffield NC. Bedshift: perturbation of genomic interval sets. Genome Biol 2021; 22:238. [PMID: 34416909 PMCID: PMC8379854 DOI: 10.1186/s13059-021-02440-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/12/2020] [Accepted: 07/26/2021] [Indexed: 12/25/2022] Open
Abstract
Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.
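As an illustration of the kind of similarity metric this abstract benchmarks, here is a minimal sketch of a base-pair-level Jaccard score between two region sets. It assumes regions are `(chrom, start, end)` tuples with half-open coordinates as in BED files; the helper names `covered_bp` and `jaccard_bp` are hypothetical, and the exact Jaccard definition used in the Bedshift paper may differ from this one.

```python
from collections import defaultdict


def covered_bp(regions):
    """Total number of base pairs covered by a region set, counting overlaps once."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged interval.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def jaccard_bp(a, b):
    """Base-pair Jaccard: covered intersection divided by covered union."""
    len_a, len_b = covered_bp(a), covered_bp(b)
    union = covered_bp(list(a) + list(b))
    if union == 0:
        return 0.0
    # Inclusion-exclusion gives the intersection from the three coverages.
    return (len_a + len_b - union) / union


reference = [("chr1", 0, 100)]
shifted = [("chr1", 50, 150)]  # the same region shifted by 50 bp
print(jaccard_bp(reference, shifted))
```

Comparing a reference set against shifted, added, or dropped copies of itself in this way is the benchmarking pattern the abstract describes.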
Affiliation(s)
- Aaron Gu
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Department of Computer Science, University of Virginia School of Engineering, Charlottesville, VA, USA
| | - Hyun Jae Cho
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Department of Computer Science, University of Virginia School of Engineering, Charlottesville, VA, USA
| | - Nathan C Sheffield
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA.
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA.
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA.
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA.
| |
|
6
|
SINE Insertion in the Intron of Pig GHR May Decrease Its Expression by Acting as a Repressor. Animals (Basel) 2021; 11:ani11071871. [PMID: 34201672 PMCID: PMC8300111 DOI: 10.3390/ani11071871] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/11/2021] [Revised: 06/15/2021] [Accepted: 06/19/2021] [Indexed: 11/17/2022] Open
Abstract
Simple Summary: GH/IGF axis genes play a central role in the regulation of skeletal accretion during development and growth, and thus represent candidate genes for growth traits. Retrotransposon insertion polymorphisms are major contributors to structural variations. They tend to generate large effect mutations resulting in variations in target gene activity and phenotype due to the fact that they carry functional elements, such as enhancers, insulators, or promoters. In the present study, RIPs in four GH/IGF axis genes (GH, GHR, IGF1, and IGF1R) were investigated by comparative genomics and PCR. Four RIPs in the GHR gene and one RIP in the IGF1 gene were identified. Further analysis revealed that one RIP in the first intron of GHR might play a role in the regulation of GHR expression by acting as a repressor. These findings contribute to the understanding of the role of RIPs in the genetic variation of GH/IGF axis genes and phenotypic variation in pigs.
Abstract: The genetic diversity of the GH/IGF axis genes and their association with the variation of gene expression and phenotypic traits, principally represented by SNPs, have been extensively reported. Nevertheless, the impact of retrotransposon insertion polymorphisms (RIPs) on the GH/IGF axis gene activity has not been reported. In the present study, bioinformatic prediction and PCR verification were performed to screen RIPs in four GH/IGF axis genes (GH, GHR, IGF1 and IGF1R). In total, five RIPs, including one SINE RIP in intron 3 of IGF1, one L1 RIP in intron 7 of GHR, and three SINE RIPs in intron 1, intron 5 and intron 9 of GHR, were confirmed by PCR, displaying polymorphisms in diverse breeds. Dual luciferase reporter assay revealed that the SINE insertion in intron 1 of GHR significantly repressed the GHR promoter activity in PK15, HeLa, C2C12 and 3T3-L1 cells. Furthermore, qPCR results confirmed that this SINE insertion was associated with a decreased expression of GHR in the leg muscle and longissimus dorsi, indicating that it may act as a repressor involved in the regulation of GHR expression. In summary, our data revealed that RIPs contribute to the genetic variation of GH/IGF axis genes, whereby one SINE RIP in the intron 1 of GHR may decrease the expression of GHR by acting as a repressor.
|
7
|
Nikitin D, Kolosov N, Murzina A, Pats K, Zamyatin A, Tkachev V, Sorokin M, Kopylov P, Buzdin A. Retroelement-Linked H3K4me1 Histone Tags Uncover Regulatory Evolution Trends of Gene Enhancers and Feature Quickly Evolving Molecular Processes in Human Physiology. Cells 2019; 8:cells8101219. [PMID: 31597351 PMCID: PMC6830109 DOI: 10.3390/cells8101219] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/23/2019] [Revised: 09/25/2019] [Accepted: 10/01/2019] [Indexed: 12/20/2022] Open
Abstract
Background: Retroelements (REs) are mobile genetic elements comprising ~40% of human DNA. They can reshape expression patterns of nearby genes by providing various regulatory sequences. The proportion of regulatory sequences held by REs can serve as a measure of regulatory evolution rate of the respective genes and molecular pathways. Methods: We calculated RE-linked enrichment scores for individual genes and molecular pathways based on ENCODE project epigenome data for enhancer-specific histone modification H3K4me1 in five human cell lines. We identified consensus groups of molecular processes that are enriched and deficient in RE-linked H3K4me1 regulation. Results: We calculated H3K4me1 RE-linked enrichment scores for 24,070 human genes and 3095 molecular pathways. We ranked genes and pathways and identified those statistically significantly enriched and deficient in H3K4me1 RE-linked regulation. Conclusion: Non-coding RNA genes were statistically significantly enriched by RE-linked H3K4me1 regulatory modules, thus suggesting their high regulatory evolution rate. The processes of gene silencing by small RNAs, DNA metabolism/chromatin structure, sensory perception/neurotransmission and lipids metabolism showed signs of the fastest regulatory evolution, while the slowest processes were connected with immunity, protein ubiquitination/degradation, cell adhesion, migration and interaction, metals metabolism/ion transport, cell death, intracellular signaling pathways.
Affiliation(s)
- Daniil Nikitin
- Group for genomic analysis of cell signaling systems, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, 117997 Moscow, Russia.
- Omicsway Corp., Walnut, CA 91789, USA.
| | | | | | - Karina Pats
- ITMO University, 195251 Saint-Petersburg, Russia.
| | | | | | - Maxim Sorokin
- Omicsway Corp., Walnut, CA 91789, USA.
- Institute of Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia.
| | - Philippe Kopylov
- Institute of Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia.
| | - Anton Buzdin
- Group for genomic analysis of cell signaling systems, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, 117997 Moscow, Russia.
- Omicsway Corp., Walnut, CA 91789, USA.
- Institute of Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia.
| |
|
8
|
Kanduri C, Bock C, Gundersen S, Hovig E, Sandve GK. Colocalization analyses of genomic elements: approaches, recommendations and challenges. Bioinformatics 2019; 35:1615-1624. [PMID: 30307532 PMCID: PMC6499241 DOI: 10.1093/bioinformatics/bty835] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 05/22/2018] [Revised: 09/03/2018] [Accepted: 10/10/2018] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION Many high-throughput methods produce sets of genomic regions as one of their main outputs. Scientists often use genomic colocalization analysis to interpret such region sets, for example to identify interesting enrichments and to understand the interplay between the underlying biological processes. Although widely used, there is little standardization in how these analyses are performed. Different practices can substantially affect the conclusions of colocalization analyses. RESULTS Here, we describe the different approaches and provide recommendations for performing genomic colocalization analysis, while also discussing common methodological challenges that may influence the conclusions. As illustrated by concrete example cases, careful attention to analysis details is needed in order to meet these challenges and to obtain a robust and biologically meaningful interpretation of genomic region set data. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chakravarthi Kanduri
- Department of Informatics, University of Oslo, Oslo, Norway
- K. G. Jebsen Coeliac Disease Research Centre, Oslo, Norway
| | - Christoph Bock
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
- Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria
- Max Planck Institute for Informatics, Saarbrücken, Germany
| | - Sveinung Gundersen
- Department of Informatics, University of Oslo, Oslo, Norway
- Elixir Norway, Oslo Node, University of Oslo, Oslo, Norway
| | - Eivind Hovig
- Department of Informatics, University of Oslo, Oslo, Norway
- Elixir Norway, Oslo Node, University of Oslo, Oslo, Norway
- Department of Tumor Biology, Institute for Cancer Research, Oslo, Norway
- Institute for Cancer Genetics and Informatics, The Norwegian Radium Hospital, Oslo, Norway, UK
| | - Geir Kjetil Sandve
- Department of Informatics, University of Oslo, Oslo, Norway
- K. G. Jebsen Coeliac Disease Research Centre, Oslo, Norway
| |
|
9
|
Dozmorov MG. Epigenomic annotation-based interpretation of genomic data: from enrichment analysis to machine learning. Bioinformatics 2018; 33:3323-3330. [PMID: 29028263 DOI: 10.1093/bioinformatics/btx414] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 01/05/2017] [Accepted: 06/22/2017] [Indexed: 12/12/2022] Open
Abstract
Motivation One of the goals of functional genomics is to understand the regulatory implications of experimentally obtained genomic regions of interest (ROIs). Most sequencing technologies now generate ROIs distributed across the whole genome. The interpretation of these genome-wide ROIs represents a challenge as the majority of them lie outside of functionally well-defined protein coding regions. Recent efforts by the members of the International Human Epigenome Consortium have generated volumes of functional/regulatory data (reference epigenomic datasets), effectively annotating the genome with epigenomic properties. Consequently, a wide variety of computational tools has been developed utilizing these epigenomic datasets for the interpretation of genomic data. Results The purpose of this review is to provide a structured overview of practical solutions for the interpretation of ROIs with the help of epigenomic data. Starting with epigenomic enrichment analysis, we discuss leading tools and machine learning methods utilizing epigenomic and 3D genome structure data. The hierarchy of tools and methods reviewed here presents a practical guide for the interpretation of genome-wide ROIs within an epigenomic context. Contact mikhail.dozmorov@vcuhealth.org. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mikhail G Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA
| |
|
10
|
Naidoo T, Sjödin P, Schlebusch C, Jakobsson M. Patterns of variation in cis-regulatory regions: examining evidence of purifying selection. BMC Genomics 2018; 19:95. [PMID: 29373957 PMCID: PMC5787233 DOI: 10.1186/s12864-017-4422-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/19/2017] [Accepted: 12/27/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND With only 2 % of the human genome consisting of protein coding genes, functionality across the rest of the genome has been the subject of much debate. This has gained further impetus in recent years due to a rapidly growing catalogue of genomic elements, based primarily on biochemical signatures (e.g. the ENCODE project). While the assessment of functionality is a complex task, the presence of selection acting on a genomic region is a strong indicator of importance. In this study, we apply population genetic methods to investigate signals overlaying several classes of regulatory elements. RESULTS We disentangle signals of purifying selection acting directly on regulatory elements from the confounding factors of demography and purifying selection linked to e.g. nearby protein coding regions. We confirm the importance of regulatory regions proximal to coding sequence, while also finding differential levels of selection at distal regions. We note differences in purifying selection among transcription factor families. Signals of constraint at some genomic classes were also strongly dependent on their physical location relative to coding sequence. In addition, levels of selection efficacy across genomic classes differed between African and non-African populations. CONCLUSIONS In order to assign a valid signal of selection to a particular class of genomic sequence, we show that it is crucial to isolate the signal by accounting for the effects of demography and linked-purifying selection. Our study highlights the intricate interplay of factors affecting signals of selection on functional elements.
Affiliation(s)
- Thijessen Naidoo
- Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | - Per Sjödin
- Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | - Carina Schlebusch
- Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | - Mattias Jakobsson
- Department of Organismal Biology, Uppsala University, Uppsala, Sweden; Science for Life Lab, Uppsala, Sweden
| |
|
11
|
Lin Z, Guo H, Cao Y, Zohrabian S, Zhou P, Ma Q, VanDusen N, Guo Y, Zhang J, Stevens SM, Liang F, Quan Q, van Gorp PR, Li A, Dos Remedios C, He A, Bezzerides VJ, Pu WT. Acetylation of VGLL4 Regulates Hippo-YAP Signaling and Postnatal Cardiac Growth. Dev Cell 2016; 39:466-479. [PMID: 27720608 DOI: 10.1016/j.devcel.2016.09.005] [Citation(s) in RCA: 80] [Impact Index Per Article: 10.0] [Received: 04/20/2016] [Revised: 07/12/2016] [Accepted: 09/08/2016] [Indexed: 11/28/2022]
Abstract
Binding of the transcriptional co-activator YAP with the transcription factor TEAD stimulates growth of the heart and other organs. YAP overexpression potently stimulates fetal cardiomyocyte (CM) proliferation, but YAP's mitogenic potency declines postnatally. While investigating factors that limit YAP's postnatal mitogenic activity, we found that the CM-enriched TEAD1 binding protein VGLL4 inhibits CM proliferation by inhibiting TEAD1-YAP interaction and by targeting TEAD1 for degradation. Importantly, VGLL4 acetylation at lysine 225 negatively regulated its binding to TEAD1. This developmentally regulated acetylation event critically governs postnatal heart growth, since overexpression of an acetylation-refractory VGLL4 mutant enhanced TEAD1 degradation, limited neonatal CM proliferation, and caused CM necrosis. Our study defines an acetylation-mediated, VGLL4-dependent switch that regulates TEAD stability and YAP-TEAD activity. These insights may improve targeted modulation of TEAD-YAP activity in applications from cardiac regeneration to cancer.
Affiliation(s)
- Zhiqiang Lin
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA.
| | - Haidong Guo
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA; Department of Anatomy, School of Basic Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
| | - Yuan Cao
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA; Peking University, Fifth School of Clinical Medicine, Beijing 100730, China
| | - Sylvia Zohrabian
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Pingzhu Zhou
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Qing Ma
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Nathan VanDusen
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Yuxuan Guo
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Jin Zhang
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Sean M Stevens
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Feng Liang
- Rowland Institute at Harvard, Harvard University, Cambridge, MA 02142, USA
| | - Qimin Quan
- Rowland Institute at Harvard, Harvard University, Cambridge, MA 02142, USA
| | - Pim R van Gorp
- Department of Cardiology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands
| | - Amy Li
- Department of Anatomy & Histology, Bosch Institute, University of Sydney, Sydney, NSW 2006, Australia
| | - Cristobal Dos Remedios
- Department of Anatomy & Histology, Bosch Institute, University of Sydney, Sydney, NSW 2006, Australia
| | - Aibin He
- Institute of Molecular Medicine, Peking University, PKU-Tsinghua U Joint Center for Life Sciences, Beijing 100871, China
| | - Vassilios J Bezzerides
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA
| | - William T Pu
- Department of Cardiology, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA; Harvard Stem Cell Institute, Harvard University, Cambridge, MA 02138, USA.
| |
|
12
|
Castellanos-Martín A, Castillo-Lluva S, Sáez-Freire MDM, Blanco-Gómez A, Hontecillas-Prieto L, Patino-Alonso C, Galindo-Villardon P, Pérez Del Villar L, Martín-Seisdedos C, Isidoro-Garcia M, Abad-Hernández MDM, Cruz-Hernández JJ, Rodríguez-Sánchez CA, González-Sarmiento R, Alonso-López D, De Las Rivas J, García-Cenador B, García-Criado J, Lee DY, Bowen B, Reindl W, Northen T, Mao JH, Pérez-Losada J. Unraveling heterogeneous susceptibility and the evolution of breast cancer using a systems biology approach. Genome Biol 2015; 16:40. [PMID: 25853295 PMCID: PMC4389302 DOI: 10.1186/s13059-015-0599-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 07/07/2014] [Accepted: 01/27/2015] [Indexed: 12/16/2022] Open
Abstract
Background: An essential question in cancer is why individuals with the same disease have different clinical outcomes. Progress toward a more personalized medicine in cancer patients requires taking into account the underlying heterogeneity at different molecular levels. Results: Here, we present a model in which there are complex interactions at different cellular and systemic levels that account for the heterogeneity of susceptibility to and evolution of ERBB2-positive breast cancers. Our model is based on our analyses of a cohort of mice that are characterized by heterogeneous susceptibility to ERBB2-positive breast cancers. Our analysis reveals that there are similarities between ERBB2 tumors in humans and those of backcross mice at clinical, genomic, expression, and signaling levels. We also show that mice that have tumors with intrinsically high levels of active AKT and ERK are more resistant to tumor metastasis. Our findings suggest for the first time that a site-specific phosphorylation at the serine 473 residue of AKT1 modifies the capacity for tumors to disseminate. Finally, we present two predictive models that can explain the heterogeneous behavior of the disease in the mouse population when we consider simultaneously certain genetic markers, liver cell signaling and serum biomarkers that are identified before the onset of the disease. Conclusions: Considering simultaneously tumor pathophenotypes and several molecular levels, we show the heterogeneous behavior of ERBB2-positive breast cancer in terms of disease progression. This and similar studies should help to better understand disease variability in patient populations. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0599-z) contains supplementary material, which is available to authorized users.
13
Kravatsky YV, Chechetkin VR, Tchurikov NA, Kravatskaya GI. Genome-wide study of correlations between genomic features and their relationship with the regulation of gene expression. DNA Res 2015; 22:109-19. [PMID: 25627242 PMCID: PMC4379982 DOI: 10.1093/dnares/dsu044] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 02/05/2023]
Abstract
A broad class of tasks in genetics and epigenetics can be reduced to the study of various features that are distributed over the genome (genome tracks). Rapid and efficient processing of the huge amount of data stored in genome-scale databases cannot be achieved without software packages based on analytical criteria. However, the strong inhomogeneity of genome tracks hampers the development of relevant statistics. We developed criteria for assessing genome track inhomogeneity and correlations between two genome tracks. We also developed a software package, Genome Track Analyzer, based on this theory. The theory and software were tested on simulated data and were applied to the study of correlations between CpG islands and transcription start sites in the Homo sapiens genome, between profiles of protein-binding sites in chromosomes of Drosophila melanogaster, and between DNA double-strand breaks and histone marks in the H. sapiens genome. Significant correlations between transcription start sites on the forward and the reverse strands were observed in genomes of D. melanogaster, Caenorhabditis elegans, Mus musculus, H. sapiens, and Danio rerio. The observed correlations may be related to the regulation of gene expression in eukaryotes. Genome Track Analyzer is freely available at http://ancorr.eimb.ru/.
Affiliation(s)
- Yuri V Kravatsky
- Engelhardt Institute of Molecular Biology of Russian Academy of Sciences, Moscow 119991, Russia
- Vladimir R Chechetkin
- Engelhardt Institute of Molecular Biology of Russian Academy of Sciences, Moscow 119991, Russia
- Nikolai A Tchurikov
- Engelhardt Institute of Molecular Biology of Russian Academy of Sciences, Moscow 119991, Russia
- Galina I Kravatskaya
- Engelhardt Institute of Molecular Biology of Russian Academy of Sciences, Moscow 119991, Russia
14
Li WC, Zhong ZJ, Zhu PP, Deng EZ, Ding H, Chen W, Lin H. Sequence analysis of origins of replication in the Saccharomyces cerevisiae genomes. Front Microbiol 2014; 5:574. [PMID: 25477864 PMCID: PMC4235382 DOI: 10.3389/fmicb.2014.00574] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 09/05/2014] [Accepted: 10/11/2014] [Indexed: 12/26/2022]
Abstract
DNA replication is a highly precise process that is initiated from origins of replication (ORIs) and is regulated by a set of regulatory proteins. The mining of DNA sequence information will be beneficial not only for understanding the regulatory mechanism of replication initiation but also for accurately identifying ORIs. In this study, the GC profile and GC skew were calculated to analyze the compositional bias in the Saccharomyces cerevisiae genome. We found that the GC profile in the region of ORIs is significantly lower than that in the flanking regions. By calculating the information redundancy, an estimate of the correlation between nucleotides, we found that the intensity of adjoining correlation in ORIs is dramatically higher than that in flanking regions. Furthermore, the relationships between ORIs and nucleosomes as well as transcription start sites were investigated. Results showed that ORIs are usually not occupied by nucleosomes. Finally, we calculated the distribution of ORIs in yeast chromosomes and found that most ORIs are in transcription terminal regions. We hope that these results will contribute to the identification of ORIs and the study of DNA replication mechanisms.
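The GC profile and GC skew named in this abstract are standard compositional statistics. The following is a minimal sketch only, assuming non-overlapping windows and the common definitions GC content = (G + C) / window length and GC skew = (G - C) / (G + C). The function name `gc_profile_and_skew` and the window size are hypothetical and are not taken from the paper, whose exact procedure may differ.

```python
def gc_profile_and_skew(seq, window=100):
    """Return a list of (gc_content, gc_skew) tuples, one per window.

    Assumed definitions (illustrative, not taken from the paper):
      GC content = (G + C) / window length
      GC skew    = (G - C) / (G + C), set to 0.0 if the window has no G or C
    Windows are non-overlapping; a trailing partial window is ignored.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((gc_content, gc_skew))
    return results
```

Comparing the values for windows that overlap an ORI with those for flanking windows would be one way to examine the kind of compositional bias the abstract describes.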
Affiliation(s)
- Wen-Chao Li
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China Chengdu, China
- Zhe-Jin Zhong
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China Chengdu, China
- Pan-Pan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China Chengdu, China
- En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China Chengdu, China
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China Chengdu, China
- Wei Chen
- Department of Physics, School of Sciences and Center for Genomics and Computational Biology, Hebei United University Tangshan, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China Chengdu, China
15
Budden DM, Hurley DG, Crampin EJ. Predictive modelling of gene expression from transcriptional regulatory elements. Brief Bioinform 2014; 16:616-28. [PMID: 25231769 DOI: 10.1093/bib/bbu034] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 07/02/2014] [Accepted: 08/20/2014] [Indexed: 12/15/2022]
Abstract
Predictive modelling of gene expression provides a powerful framework for exploring the regulatory logic underpinning transcriptional regulation. Recent studies have demonstrated the utility of such models in identifying dysregulation of gene and miRNA expression associated with abnormal patterns of transcription factor (TF) binding or nucleosomal histone modifications (HMs). Despite the growing popularity of such approaches, a comparative review of the various modelling algorithms and feature extraction methods is lacking. We define and compare three methods of quantifying pairwise gene-TF/HM interactions and discuss their suitability for integrating the heterogeneous chromatin immunoprecipitation (ChIP)-seq binding patterns exhibited by TFs and HMs. We then construct log-linear and ϵ-support vector regression models from various mouse embryonic stem cell (mESC) and human lymphoblastoid (GM12878) data sets, considering both ChIP-seq- and position weight matrix- (PWM)-derived in silico TF-binding. The two algorithms are evaluated both in terms of their modelling prediction accuracy and ability to identify the established regulatory roles of individual TFs and HMs. Our results demonstrate that TF-binding and HMs are highly predictive of gene expression as measured by mRNA transcript abundance, irrespective of algorithm or cell type selection and considering both ChIP-seq and PWM-derived TF-binding. As we encourage other researchers to explore and develop these results, our framework is implemented using open-source software and made available as a preconfigured bootable virtual environment.
16
Shen L, Choi I, Nestler EJ, Won KJ. Human Transcriptome and Chromatin Modifications: An ENCODE Perspective. Genomics Inform 2013; 11:60-7. [PMID: 23843771 PMCID: PMC3704928 DOI: 10.5808/gi.2013.11.2.60] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/14/2013] [Revised: 03/05/2013] [Accepted: 03/13/2013] [Indexed: 11/22/2022]
Abstract
A decade-long project, led by several international research groups, called the Encyclopedia of DNA Elements (ENCODE), recently released an unprecedented amount of data. The ambitious project covers transcriptome, cistrome, epigenome, and interactome data from more than 1,600 sets of experiments in human. To make use of this valuable resource, it is important to understand the information it represents and the techniques that were used to generate these data. In this review, we introduce the data that ENCODE generated, summarize the observations from the data analysis, and revisit a computational approach that ENCODE used to predict gene expression, with a focus on the human transcriptome and its association with chromatin modifications.
Affiliation(s)
- Li Shen
- Department of Neuroscience, Mount Sinai School of Medicine, New York, NY 10029, USA
17
Affiliation(s)
- Kelly A Frazer
- Moores UCSD Cancer Center, Department of Pediatrics and Rady Children's Hospital, University of California at San Diego, La Jolla, California 92093, USA.
18
Zhang J, Parvin J, Huang K. Redistribution of H3K4me2 on neural tissue specific genes during mouse brain development. BMC Genomics 2012; 13 Suppl 8:S5. [PMID: 23281639 PMCID: PMC3535709 DOI: 10.1186/1471-2164-13-s8-s5] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 12/02/2022]
Abstract
Background: Histone modification plays an important role in cell differentiation and tissue development. A recent study has shown that dimethylation of the lysine 4 residue on histone 3 (H3K4me2) marks the gene body area of tissue-specific genes in human CD4+ T cells and neural cells. However, little is known about the dynamics of H3K4me2 distribution during cell differentiation and tissue development. Results: We applied several clustering methods, including K-means, hierarchical clustering and principal component analysis, to H3K4me2 ChIP-seq data from mouse embryonic stem cells, neural progenitor cells and whole brain, to identify genes with H3K4me2 binding in the gene body region at different cell development stages and to study their redistribution across tissue development stages. A cluster of 356 genes with heavy H3K4me2 labeling in the gene body region was identified in mouse whole brain tissue using K-means clustering. These genes are highly enriched in neural system-related functions and pathways, and are involved in several central neural system diseases. The distribution of H3K4me2 on neural function-related genes follows three distinctive patterns: one group of genes carries constant heavy H3K4me2 marks in the gene body from the embryonic stem cell stage through the neural progenitor stage to the matured brain tissue stage; another group of genes has few H3K4me2 marks until cells mature into brain cells; and the majority of the genes acquire H3K4me2 marks in the neural progenitor cell stage and gain heavy labeling in the matured brain cell stage. Gene ontology enrichment analysis also revealed corresponding gene ontology terms that fit the scenario of each cell developmental stage. Conclusions: We investigated the process of H3K4me2 mark redistribution during the development of tissue specificity in mouse brain tissue. Our analysis confirmed the previous report that heavy labeling of H3K4me2 downstream of the TSS marks tissue-specific genes. These genes show remarkable enrichment in central neural system-related diseases. Furthermore, we have shown that H3K4me2 labeling can be acquired as early as the embryonic stem cell stage, and its distribution is dynamic and progressive throughout cell differentiation and tissue development.
Affiliation(s)
- Jie Zhang
- The CCC Biomedical Informatics Shared Resource, The Ohio State University, USA
19
Stojnic R, Fu AQ, Adryan B. A graphical modelling approach to the dissection of highly correlated transcription factor binding site profiles. PLoS Comput Biol 2012; 8:e1002725. [PMID: 23144600 PMCID: PMC3493460 DOI: 10.1371/journal.pcbi.1002725] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/05/2012] [Accepted: 08/01/2012] [Indexed: 11/18/2022]
Abstract
Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that directly regulate gene expression from those that are indirectly associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop a novel computational method, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package ddgraph, available as part of Bioconductor (http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html). Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. 
Our analysis also found depletion of the early transcription factor Twist binding at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.
Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression. Recent technological advances make it possible to map TF binding patterns across the whole genome. Multiple single-gene studies showed that combinatorial binding of multiple transcription factors determines the gene transcriptional output. A common naive assumption is that correlated binding profiles may indicate combinatorial binding. However, it has been found that many TFs bind to distinct hotspots whose role is currently unclear. It is thus of great interest to find transcription factor combinations whose correlated binding is causally most immediate to gene expression. Building upon theories of statistical dependence and causality, we develop novel graphical model-based algorithms that handle highly correlated transcription factor binding profiles more efficiently and reliably than existing algorithms do. These algorithms can also be applied to other biological areas involving highly correlated variables, such as the analysis of high-throughput gene knock-down experiments.
Affiliation(s)
- Robert Stojnic
- Cambridge Systems Biology Centre, University of Cambridge, Cambridge, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge, United Kingdom
- Audrey Qiuyan Fu
- Cambridge Systems Biology Centre, University of Cambridge, Cambridge, United Kingdom
- Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, United Kingdom
- Boris Adryan
- Cambridge Systems Biology Centre, University of Cambridge, Cambridge, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge, United Kingdom
20
Dawson M, Foster S, Bannister A, Robson S, Hannah R, Wang X, Xhemalce B, Wood A, Green A, Göttgens B, Kouzarides T. Three distinct patterns of histone H3Y41 phosphorylation mark active genes. Cell Rep 2012; 2:470-7. [PMID: 22999934 PMCID: PMC3607218 DOI: 10.1016/j.celrep.2012.08.016] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 04/23/2012] [Revised: 07/16/2012] [Accepted: 08/16/2012] [Indexed: 02/02/2023]
Abstract
The JAK2 tyrosine kinase is a critical mediator of cytokine-induced signaling. It plays a role in the nucleus, where it regulates transcription by phosphorylating histone H3 at tyrosine 41 (H3Y41ph). We used chromatin immunoprecipitation coupled to massively parallel DNA sequencing (ChIP-seq) to define the genome-wide pattern of H3Y41ph in human erythroid leukemia cells. Our results indicate that H3Y41ph is located at three distinct sites: (1) at a subset of active promoters, where it overlaps with H3K4me3, (2) at distal cis-regulatory elements, where it coincides with the binding of STAT5, and (3) throughout the transcribed regions of active, tissue-specific hematopoietic genes. Together, these data extend our understanding of this conserved and essential signaling pathway and provide insight into the mechanisms by which extracellular stimuli may lead to the coordinated regulation of transcription.
Affiliation(s)
- Mark A. Dawson
- Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge, CB2 1QN, UK; Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK; Addenbrooke’s Hospital, University of Cambridge, Cambridge, CB2 0XY, UK
- Samuel D. Foster
- Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK
- Andrew J. Bannister
- Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge, CB2 1QN, UK
- Samuel C. Robson
- Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge, CB2 1QN, UK
- Rebecca Hannah
- Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK
- Xiaonan Wang
- Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK
- Blerta Xhemalce
- Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge, CB2 1QN, UK
- Andrew D. Wood
- Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK
- Anthony R. Green
- Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK; Addenbrooke’s Hospital, University of Cambridge, Cambridge, CB2 0XY, UK
- Berthold Göttgens
- Department of Haematology, Cambridge Institute for Medical Research and The Wellcome Trust and MRC Stem Cell Institute, University of Cambridge, Cambridge, CB2 0XY, UK; Corresponding author
- Tony Kouzarides
- Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge, CB2 1QN, UK; Corresponding author
21
Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol 2012; 13:R48. [PMID: 22950945 PMCID: PMC3491392 DOI: 10.1186/gb-2012-13-9-r48] [Citation(s) in RCA: 187] [Impact Index Per Article: 15.6] [Received: 12/21/2011] [Revised: 05/06/2012] [Accepted: 06/08/2012] [Indexed: 01/22/2023]
Abstract
Background: Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors. Results: As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions. Conclusions: Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
22
Manic G, Maurin-Marlin A, Galluzzi L, Subra F, Mouscadet JF, Bury-Moné S. 3' self-inactivating long terminal repeat inserts for the modulation of transgene expression from lentiviral vectors. Hum Gene Ther Methods 2012; 23:84-97. [PMID: 22456436 DOI: 10.1089/hgtb.2011.154] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/18/2023]
Abstract
Gene transfer for research or gene therapy requires the design of vectors that allow for adequate and safe transgene expression. Current methods to modulate the safety and expression profile of retroviral vectors can involve the insertion of insulators or scaffold/matrix-attachment regions in self-inactivating long terminal repeats (SIN-LTRs). Here, we generated a set of lentiviral vectors (with internal CMV or PGK promoter) in which we inserted (at the level of SIN-LTRs) sequences of avian (i.e., chicken hypersensitive site-4, cHS4), human (i.e., putative insulator and desert sequence), or bacterial origin. We characterized them with respect to viral titer, integration, transduction efficiency and transgene expression levels, in both integrase-proficient and -deficient contexts. We found that the cHS4 insulator enhanced transgene expression by a factor of 1.5 only when cloned in the antisense orientation. On the other hand, cHS4 in the sense orientation as well as all other inserts decreased transgene expression. This attenuation phenomenon persisted over long periods of time and did not correspond to extinction or variegation. Decreased transgene expression was associated with lower mRNA levels, yet RNA stability was not affected. Insertions within the SIN-LTRs may negatively affect transgene transcription in a direct fashion through topological rearrangements. The lentiviral vectors that we generated constitute valuable genetic tools for manipulating the level of transgene expression. Moreover, this study demonstrates that SIN-LTR inserts can decrease transgene expression, a phenomenon that might be overcome by modifying insert orientation, thereby highlighting the importance of careful vector design for gene therapy.
Affiliation(s)
- Gwenola Manic
- Laboratoire de Biologie et de Pharmacologie Appliquée, UMR 8113 CNRS, Ecole Normale Supérieure de Cachan, FR-94230 Cachan, France
23
Chikina MD, Troyanskaya OG. An effective statistical evaluation of ChIPseq dataset similarity. Bioinformatics 2012; 28:607-13. [PMID: 22262674 DOI: 10.1093/bioinformatics/bts009] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Indexed: 12/22/2022]
Abstract
MOTIVATION: ChIPseq is rapidly becoming a common technique for investigating protein-DNA interactions. However, results from individual experiments provide a limited understanding of chromatin structure, as various chromatin factors cooperate in complex ways to orchestrate transcription. In order to quantify chromatin interactions, it is thus necessary to devise a robust similarity metric applicable to ChIPseq data. Unfortunately, moving past simple overlap calculations to give statistically rigorous comparisons of ChIPseq datasets often involves arbitrary choices of distance metrics, with significance being estimated by computationally intensive permutation tests whose statistical power may be sensitive to non-biological experimental and post-processing variation. RESULTS: We show that it is in fact possible to compare ChIPseq datasets through the efficient computation of exact P-values for proximity. Our method is insensitive to non-biological variation in datasets such as peak width, and can rigorously model peak location biases by evaluating similarity conditioned on a restricted set of genomic regions (such as mappable genome or promoter regions). Applying our method to the well-studied dataset of Chen et al. (2008), we elucidate novel interactions which conform well with our biological understanding. By comparing ChIPseq data in an asymmetric way, we are able to observe clear interaction differences between cofactors such as p300 and factors that bind DNA directly. AVAILABILITY: Source code is available for download at http://sonorus.princeton.edu/IntervalStats/IntervalStats.tar.gz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maria D Chikina
- Department of Neurology, Mount Sinai School of Medicine, New York, NY 10029, USA
24
Abstract
Bacterial artificial chromosomes (BACs) are widely used in studies of vertebrate gene regulation and function because they often closely recapitulate the expression patterns of endogenous genes. Here we report a step-by-step protocol for efficient BAC transgenesis in zebrafish using the medaka Tol2 transposon. Using recombineering in Escherichia coli, we introduce the iTol2 cassette in the BAC plasmid backbone, which contains the inverted minimal cis-sequences required for Tol2 transposition, and a reporter gene to replace a target locus in the BAC. Microinjection of the Tol2-BAC and a codon-optimized transposase mRNA into fertilized eggs results in clean integrations in the genome and transmission to the germline at a rate of ∼15%. A single person can prepare a dozen constructs within 3 weeks, and obtain transgenic fish within approximately 3-4 months. Our protocol drastically reduces the labor involved in BAC transgenesis and will greatly facilitate biological and biomedical studies in model vertebrates.
25
Cheng C, Min R, Gerstein M. TIP: a probabilistic method for identifying transcription factor target genes from ChIP-seq binding profiles. Bioinformatics 2011; 27:3221-7. [PMID: 22039215 DOI: 10.1093/bioinformatics/btr552] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 12/22/2022]
Abstract
MOTIVATION: ChIP-seq and ChIP-chip experiments have been widely used to identify transcription factor (TF) binding sites and target genes. Conventionally, a fairly 'simple' approach is employed for target gene identification, e.g., finding genes with binding sites within 2 kb of a transcription start site (TSS). However, this does not take into account the number of sites upstream of the TSS, their exact positioning or the fact that different TFs appear to act at different characteristic distances from the TSS. RESULTS: Here we propose a probabilistic model called target identification from profiles (TIP) that quantitatively measures the regulatory relationships between TFs and target genes. For each TF, our model builds a characteristic, averaged profile of binding around the TSS and then uses this to weight the sites associated with a given gene, providing a continuous-valued 'regulatory' score relating each TF and potential target. Moreover, the score can readily be turned into a ranked list of target genes and an estimate of significance, which is useful for case-dependent downstream analysis. CONCLUSION: We show the advantages of TIP by comparing it to the 'simple' approach on several representative datasets, using motif occurrence and relationship to knock-out experiments as metrics of validation. Moreover, we show that the probabilistic model is not as sensitive to various experimental parameters (including sequencing depth and peak-calling method) as the simple approach; in fact, the lesser dependence on sequencing depth potentially utilizes the result of a ChIP-seq experiment in a more 'cost-effective' manner. CONTACT: mark.gerstein@yale.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
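The contrast drawn in this abstract, between a fixed distance cutoff and a profile-weighted score, can be illustrated with a toy sketch. This is not the published TIP model: the function names, the binning scheme, and the choice to ignore strand are all assumptions made for illustration, and the weight profile is simply passed in rather than estimated from data.

```python
def simple_targets(sites, tss, max_dist=2000):
    """'Simple' approach: genes with at least one site within max_dist bp of the TSS."""
    return {gene for gene, pos in tss.items()
            if any(abs(s - pos) <= max_dist for s in sites)}


def profile_scores(sites, tss, profile, bin_size=100):
    """Toy weighted score (hypothetical, not the TIP formula).

    `profile` is a list of weights for consecutive offset bins centred on
    the TSS; each site contributes the weight of the bin its offset falls
    in, and sites outside the profile contribute nothing. Strand is ignored.
    """
    half = len(profile) // 2
    scores = {}
    for gene, pos in tss.items():
        total = 0.0
        for s in sites:
            idx = (s - pos) // bin_size + half  # offset bin relative to the TSS
            if 0 <= idx < len(profile):
                total += profile[idx]
        scores[gene] = total
    return scores
```

Sorting the dictionary returned by `profile_scores` by value would give a ranked list of candidate targets, in the spirit of the continuous-valued score the abstract describes, whereas `simple_targets` yields only a yes/no set.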
Affiliation(s)
- Chao Cheng
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
|
26
|
Jee J, Rozowsky J, Yip KY, Lochovsky L, Bjornson R, Zhong G, Zhang Z, Fu Y, Wang J, Weng Z, Gerstein M. ACT: aggregation and correlation toolbox for analyses of genome tracks. Bioinformatics 2011; 27:1152-4. [PMID: 21349863 PMCID: PMC3072554 DOI: 10.1093/bioinformatics/btr092] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022] Open
Abstract
We have implemented the aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 genomes project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation, i.e. how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts. AVAILABILITY ACT is available at http://act.gersteinlab.org CONTACT pi@gersteinlab.org.
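As a rough illustration of what aggregating a track around anchor points means, the following Python sketch averages a per-base signal in a fixed window around each anchor. It does not use ACT's code or interface; the function name, the list-based track, and the `radius` parameter are hypothetical.

```python
from typing import List


def aggregate_profile(track: List[float], anchors: List[int], radius: int) -> List[float]:
    """Average a per-base signal track in a window of +/- radius around anchors.

    Positions that fall outside the track are skipped, so windows near the
    ends contribute only where data exist.
    """
    width = 2 * radius + 1
    sums = [0.0] * width
    counts = [0] * width
    for anchor in anchors:
        for offset in range(-radius, radius + 1):
            pos = anchor + offset
            if 0 <= pos < len(track):
                sums[offset + radius] += track[pos]
                counts[offset + radius] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Toy track with two identical peaks centred on the two anchors.
toy_track = [0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]
toy_profile = aggregate_profile(toy_track, anchors=[2, 7], radius=1)
```

For real genome tracks the anchors would be transcription start sites or similar features, and strand orientation would also need handling, which this sketch omits.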
Affiliation(s)
- Justin Jee
- Program in Computational Biology and Bioinformatics, Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, USA
|
27
|
Sakabe NJ, Nobrega MA. Genome-wide maps of transcription regulatory elements. Wiley Interdiscip Rev Syst Biol Med 2010; 2:422-437. [PMID: 20836039 DOI: 10.1002/wsbm.70] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 01/25/2023]
Abstract
Expression of eukaryotic genes with complex spatial-temporal regulation during development requires finer regulation than that of genes with simpler expression patterns. Given the high degree of conservation of the developmental gene set across distantly related phylogenetic taxa, it is argued that evolutionary variation has occurred by tweaking regulation of expression of developmental genes, rather than by changes in genes themselves. Complex regulation is often achieved through the coordinated action of transcription regulatory elements spread across the genome up to tens of kilobases from the promoters of their target genes. Disruption of regulatory elements has been implicated in several diseases, and studies showing associations between disease traits and nonprotein coding variation hint at a role for regulatory elements as a cause of disease. Therefore, the identification and mapping of regulatory elements at genome scale is crucial to understand how gene expression is regulated, how organisms evolve, and to identify sequence variation causing diseases. Previously developed experimental techniques have been adapted to identify regulatory elements at genome scale and in high throughput, allowing a global view of their biological roles. We review methods such as chromatin immunoprecipitation, DNase I hypersensitivity, and computational approaches and how they have been employed to generate maps of histone modifications, open chromatin, nucleosome positioning, and transcription factor binding regions in whole mammalian genomes. Given the importance of non-promoter elements in gene regulation and the recent explosion in the number of studies devoted to them, we focus on these elements and discuss the insights into gene regulation being obtained by these studies.
Affiliation(s)
- Noboru J Sakabe
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
- Marcelo A Nobrega
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
|
28
|
When needles look like hay: how to find tissue-specific enhancers in model organism genomes. Dev Biol 2010; 350:239-54. [PMID: 21130761 DOI: 10.1016/j.ydbio.2010.11.026] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/14/2010] [Revised: 11/11/2010] [Accepted: 11/22/2010] [Indexed: 01/22/2023]
Abstract
A major prerequisite for the investigation of tissue-specific processes is the identification of cis-regulatory elements. No generally applicable technique is available to distinguish them from any other type of genomic non-coding sequence. Therefore, researchers often have to identify these elements by elaborate in vivo screens, testing individual regions until the right one is found. Here, based on many examples from the literature, we summarize how functional enhancers have been isolated from other elements in the genome and how they have been characterized in transgenic animals. Covering computational and experimental studies, we provide an overview of the global properties of cis-regulatory elements, like their specific interactions with promoters and target gene distances. We describe conserved non-coding elements (CNEs) and their internal structure, nucleotide composition, binding site clustering and overlap, with a special focus on developmental enhancers. Conflicting data and unresolved questions on the nature of these elements are highlighted. Our comprehensive overview of the experimental shortcuts that have been found in the different model organism communities and the new field of high-throughput assays should help during the preparation phase of a screen for enhancers. The review is accompanied by a list of general guidelines for such a project.
|
29
|
Cooper DN, Chen JM, Ball EV, Howells K, Mort M, Phillips AD, Chuzhanova N, Krawczak M, Kehrer-Sawatzki H, Stenson PD. Genes, mutations, and human inherited disease at the dawn of the age of personalized genomics. Hum Mutat 2010; 31:631-55. [PMID: 20506564 DOI: 10.1002/humu.21260] [Citation(s) in RCA: 117] [Impact Index Per Article: 8.4] [Indexed: 12/12/2022]
Abstract
The number of reported germline mutations in human nuclear genes, either underlying or associated with inherited disease, has now exceeded 100,000 in more than 3,700 different genes. The availability of these data has both revolutionized the study of the morbid anatomy of the human genome and facilitated "personalized genomics." With approximately 300 new "inherited disease genes" (and approximately 10,000 new mutations) being identified annually, it is pertinent to ask how many "inherited disease genes" there are in the human genome, how many mutations reside within them, and where such lesions are likely to be located. To address these questions, it is necessary not only to reconsider how we define human genes but also to explore notions of gene "essentiality" and "dispensability." Answers to these questions are now emerging from recent novel insights into genome structure and function and through complete genome sequence information derived from multiple individual human genomes. However, a change in focus toward screening functional genomic elements as opposed to genes sensu stricto will be required if we are to capitalize fully on recent technical and conceptual advances and identify new types of disease-associated mutation within noncoding regions remote from the genes whose function they disrupt.
Affiliation(s)
- David N Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, United Kingdom.
|
30
|
Carstensen L, Sandelin A, Winther O, Hansen NR. Multivariate Hawkes process models of the occurrence of regulatory elements. BMC Bioinformatics 2010; 11:456. [PMID: 20828413 PMCID: PMC2949889 DOI: 10.1186/1471-2105-11-456] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 03/01/2010] [Accepted: 09/09/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A central question in molecular biology is how transcriptional regulatory elements (TREs) act in combination. Recent high-throughput data provide us with the location of multiple regulatory regions for multiple regulators, and thus with the possibility of analyzing the multivariate distribution of the occurrences of these TREs along the genome. RESULTS We present a model of TRE occurrences known as the Hawkes process. We illustrate the use of this model by analyzing two different publicly available data sets. We are able to model, in detail, how the occurrence of one TRE is affected by the occurrences of others, and we can test a range of natural hypotheses about the dependencies among the TRE occurrences. In contrast to earlier efforts, pre-processing steps such as clustering or binning are not needed, and we thus retain information about the dependencies among the TREs that is otherwise lost. For each of the two data sets we provide two results: first, a qualitative description of the dependencies among the occurrences of the TREs, and second, quantitative results on the favored or avoided distances between the different TREs. CONCLUSIONS The Hawkes process is a novel way of modeling the joint occurrences of multiple TREs along the genome that is capable of providing new insights into dependencies among elements involved in transcriptional regulation. The method is available as an R package from http://www.math.ku.dk/~richard/ppstat/.
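To make the dependence structure of a multivariate Hawkes process concrete, here is a small Python sketch of the intensity for one target element type. It is not taken from the ppstat package: the exponential decay kernel, the one-directional influence of occurrences at smaller positions, and all parameter names are assumptions chosen for brevity.

```python
import math
from typing import Dict, List


def hawkes_intensity(
    x: float,
    occurrences: Dict[str, List[float]],
    baseline: float,
    alpha: Dict[str, float],
    decay: float,
) -> float:
    """Intensity of the target element type at genomic position x.

    Every occurrence of a source type at a position s < x adds
    alpha[source] * exp(-decay * (x - s)) to the baseline rate.
    A positive alpha models a favored distance range, a negative alpha an
    avoided one; the result is clamped at zero so it stays a valid rate.
    """
    rate = baseline
    for source, positions in occurrences.items():
        weight = alpha.get(source, 0.0)
        for s in positions:
            if s < x:
                rate += weight * math.exp(-decay * (x - s))
    return max(rate, 0.0)


# Toy example: one occurrence of type "A" before x, one of type "B" after x.
toy_rate = hawkes_intensity(
    x=1.0,
    occurrences={"A": [0.0], "B": [2.0]},
    baseline=0.5,
    alpha={"A": 1.0, "B": 3.0},
    decay=1.0,
)
```

Because the intensity is evaluated directly from occurrence positions, no binning or clustering of the data is needed, which matches the point made in the abstract.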
Affiliation(s)
- Lisbeth Carstensen
- Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen Ø, Denmark
|
31
|
Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB. Annotating non-coding regions of the genome. Nat Rev Genet 2010; 11:559-71. [PMID: 20628352 DOI: 10.1038/nrg2814] [Citation(s) in RCA: 326] [Impact Index Per Article: 23.3] [Indexed: 02/06/2023]
Abstract
Most of the human genome consists of non-protein-coding DNA. Recently, progress has been made in annotating these non-coding regions through the interpretation of functional genomics experiments and comparative sequence analysis. One can conceptualize functional genomics analysis as involving a sequence of steps: turning the output of an experiment into a 'signal' at each base pair of the genome; smoothing this signal and segmenting it into small blocks of initial annotation; and then clustering these small blocks into larger derived annotations and networks. Finally, one can relate functional genomics annotations to conserved units and measures of conservation derived from comparative sequence analysis.
Affiliation(s)
- Roger P Alexander
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
|
32
|
Cai X, Hu H, Li X. A new measurement of sequence conservation. BMC Genomics 2009; 10:623. [PMID: 20028539 PMCID: PMC2807881 DOI: 10.1186/1471-2164-10-623] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/15/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022] Open
Abstract
Background Understanding sequence conservation is important for the study of sequence evolution and for the identification of functional regions of the genome. Current studies often measure sequence conservation based on every position in contiguous regions. Therefore, a large number of functional regions that contain conserved segments separated by relatively long divergent segments are ignored. Our goal in this paper is to define a new measurement of sequence conservation such that both contiguously conserved regions and discontiguously conserved regions can be detected based on this new measurement. Here and in the following, conserved regions are those regions that share similarity higher than a pre-specified similarity threshold with their homologous regions in other species. That is, conserved regions are good candidates of functional regions and may not be always functional. Moreover, conserved regions may contain long and divergent segments. Results To identify both discontiguously and contiguously conserved regions, we proposed a new measurement of sequence conservation, which measures sequence similarity based only on the conserved segments within the regions. By defining conserved segments using the local alignment tool CHAOS, under the new measurement, we analyzed the conservation of 1642 experimentally verified human functional non-coding regions in the mouse genome. We found that the conservation in at least 11% of these functional regions could be missed by the current conservation analysis methods. We also found that 72% of the mouse homologous regions identified based on the new measurement are more similar to the human functional sequences than the aligned mouse sequences from the UCSC genome browser. We further compared BLAST and discontiguous MegaBLAST with our method. We found that our method picks up many more conserved segments than BLAST and discontiguous MegaBLAST in these regions. 
Conclusions It is critical to have a new measurement of sequence conservation that is based only on the conserved segments in one region. Such a new measurement can aid the identification of better local "orthologous" regions. It will also shed light on the identification of new types of conserved functional regions in vertebrate genomes [1].
Affiliation(s)
- Xiaohui Cai
- Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr. MC0446, La Jolla, CA 92093, USA
|
33
|
ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci U S A 2009; 106:21521-6. [PMID: 19995984 DOI: 10.1073/pnas.0904863106] [Citation(s) in RCA: 243] [Impact Index Per Article: 16.2] [Indexed: 01/07/2023] Open
Abstract
Next-generation sequencing has greatly increased the scope and the resolution of transcriptional regulation study. RNA sequencing (RNA-Seq) and ChIP-Seq experiments are now generating comprehensive data on transcript abundance and on regulator-DNA interactions. We propose an approach for an integrated analysis of these data based on feature extraction of ChIP-Seq signals, principal component analysis, and regression-based component selection. Compared with traditional methods, our approach not only offers higher power in predicting gene expression from ChIP-Seq data but also provides a way to capture cooperation among regulators. In mouse embryonic stem cells (ESCs), we find that a remarkably high proportion of variation in gene expression (65%) can be explained by the binding signals of 12 transcription factors (TFs). Two groups of TFs are identified. Whereas the first group (E2f1, Myc, Mycn, and Zfx) act as activators in general, the second group (Oct4, Nanog, Sox2, Smad1, Stat3, Tcfcp2l1, and Esrrb) may serve as either activator or repressor depending on the target. The two groups of TFs cooperate tightly to activate genes that are differentially up-regulated in ESCs. In the absence of binding by the first group, the binding of the second group is associated with genes that are repressed in ESCs and derepressed upon early differentiation.
|
34
|
He X, Chen CC, Hong F, Fang F, Sinha S, Ng HH, Zhong S. A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS One 2009; 4:e8155. [PMID: 19956545 PMCID: PMC2780727 DOI: 10.1371/journal.pone.0008155] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Received: 09/30/2009] [Accepted: 11/10/2009] [Indexed: 11/19/2022] Open
Abstract
Background How transcription factors (TFs) interact with cis-regulatory sequences and interact with each other is a fundamental, but not well understood, aspect of gene regulation. Methodology/Principal Findings We present a computational method to address this question, relying on established biophysical principles. This method, STAP (sequence to affinity prediction), takes into account all combinations and configurations of strong and weak binding sites to analyze large scale transcription factor (TF)-DNA binding data to discover cooperative interactions among TFs, infer sequence rules of interaction and predict TF target genes in new conditions with no TF-DNA binding data. The distinctions between STAP and other statistical approaches for analyzing cis-regulatory sequences include the utility of physical principles and the treatment of the DNA binding data as quantitative representation of binding strengths. Applying this method to the ChIP-seq data of 12 TFs in mouse embryonic stem (ES) cells, we found that the strength of TF-DNA binding could be significantly modulated by cooperative interactions among TFs with adjacent binding sites. However, further analysis on five putatively interacting TF pairs suggests that such interactions may be relatively insensitive to the distance and orientation of binding sites. Testing a set of putative Nanog motifs, STAP showed that a novel Nanog motif could better explain the ChIP-seq data than previously published ones. We then experimentally tested and verified the new Nanog motif. A series of comparisons showed that STAP has more predictive power than several state-of-the-art methods for cis-regulatory sequence analysis. We took advantage of this power to study the evolution of TF-target relationship in Drosophila. By learning the TF-DNA interaction models from the ChIP-chip data of D. melanogaster (Mel) and applying them to the genome of D. pseudoobscura (Pse), we found that only about half of the sequences strongly bound by TFs in Mel have high binding affinities in Pse. We show that prediction of functional TF targets from ChIP-chip data can be improved by using the conservation of STAP predicted affinities as an additional filter. Conclusions/Significance STAP is an effective method to analyze binding site arrangements, TF cooperativity, and TF target genes from genome-wide TF-DNA binding data.
Affiliation(s)
- Xin He
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Chieh-Chun Chen
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Feng Hong
- Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Fang Fang
- Gene Regulation Laboratory, Genome Institute of Singapore, Singapore, Singapore
- Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Huck-Hui Ng
- Gene Regulation Laboratory, Genome Institute of Singapore, Singapore, Singapore
- Sheng Zhong
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
|
35
|
Fu AQ, Adryan B. Scoring overlapping and adjacent signals from genome-wide ChIP and DamID assays. Mol Biosyst 2009; 5:1429-38. [PMID: 19763325 PMCID: PMC3475982 DOI: 10.1039/b906880e] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
Much of the research utilising genome-wide ChIP and DamID assays aims to understand the combinatorial feature of transcription factor binding and the chromatin modification code. With these experimental methods becoming more affordable and widespread, the focus of research is shifting to making sense of the data. Amongst the many challenges arising from data analyses, we are concerned with identifying biologically meaningful co-occurrences of transcription factor binding or chromatin modifications, using genome-wide profiles generated from ChIP and DamID assays. Co-occurrences are reflected in overlapping and adjacent signals in multiple ChIP or DamID profiles. We review existing quantitative methods to score overlaps and to cluster binding events in ChIP and DamID profiles. For pairwise comparison, existing methods either are based on a single score at the genome level or take a genomic, region-specific view. To draw inference from many profiles simultaneously, methods exist to cluster regions by their regulatory importance or to infer cis-regulatory modules for a particular region. We provide a simple guide to some of the statistical tools used by these methods.
Affiliation(s)
- Audrey Qiuyan Fu
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge, UK.
|
36
|
Lister R, Gregory BD, Ecker JR. Next is now: new technologies for sequencing of genomes, transcriptomes, and beyond. Curr Opin Plant Biol 2009; 12:107-18. [PMID: 19157957 PMCID: PMC2723731 DOI: 10.1016/j.pbi.2008.11.004] [Citation(s) in RCA: 138] [Impact Index Per Article: 9.2] [Received: 10/19/2008] [Revised: 11/17/2008] [Accepted: 11/20/2008] [Indexed: 05/18/2023]
Abstract
The sudden availability of DNA sequencing technologies that rapidly produce vast amounts of sequence information has triggered a paradigm shift in genomics, enabling massively parallel surveying of complex nucleic acid populations. The diversity of applications to which these technologies have already been applied demonstrates the immense range of cellular processes and properties that can now be studied at the single-base resolution. These include genome resequencing and polymorphism discovery, mutation mapping, DNA methylation, histone modifications, transcriptome sequencing, gene discovery, alternative splicing identification, small RNA profiling, DNA-protein, and possibly even protein-protein interactions. Thus, these deep sequencing technologies offer plant biologists unprecedented opportunities to increase the understanding of the functions and dynamics of plant cells and populations.
Affiliation(s)
- Ryan Lister
- Plant Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Brian D. Gregory
- Plant Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Joseph R. Ecker
- Plant Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Corresponding author: Joseph R. Ecker, Plant Biology Laboratory and Genomic Analysis Laboratory, The Salk Institute for Biological Studies, 10010 N. Torrey Pines Rd., La Jolla, CA 92037, Telephone: (858) 453-4100 x1795, Fax: (858) 558-6379
|
37
|
Smith JJ, Putta S, Zhu W, Pao GM, Verma IM, Hunter T, Bryant SV, Gardiner DM, Harkins TT, Voss SR. Genic regions of a large salamander genome contain long introns and novel genes. BMC Genomics 2009; 10:19. [PMID: 19144141 PMCID: PMC2633012 DOI: 10.1186/1471-2164-10-19] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.9] [Received: 07/18/2008] [Accepted: 01/13/2009] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND The basis of genome size variation remains an outstanding question because DNA sequence data are lacking for organisms with large genomes. Sixteen BAC clones from the Mexican axolotl (Ambystoma mexicanum: c-value = 32 × 10^9 bp) were isolated and sequenced to characterize the structure of genic regions. RESULTS Annotation of genes within BACs showed that axolotl introns are on average 10x longer than orthologous vertebrate introns and they are predicted to contain more functional elements, including miRNAs and snoRNAs. Loci were discovered within BACs for two novel EST transcripts that are differentially expressed during spinal cord regeneration and skin metamorphosis. Unexpectedly, a third novel gene was also discovered while manually annotating BACs. Analysis of human-axolotl protein-coding sequences suggests there are 2% more lineage specific genes in the axolotl genome than the human genome, but the great majority (86%) of genes between axolotl and human are predicted to be 1:1 orthologs. Considering that axolotl genes are on average 5x larger than human genes, the genic component of the salamander genome is estimated to be incredibly large, approximately 2.8 gigabases! CONCLUSION This study shows that a large salamander genome has a correspondingly large genic component, primarily because genes have incredibly long introns. These intronic sequences may harbor novel coding and non-coding sequences that regulate biological processes that are unique to salamanders.
Affiliation(s)
- Jeramiah J Smith
- Department of Biology and Spinal Cord and Brain Injury Research Center, University of Kentucky, Lexington, KY 40506, USA
- University of Washington, Department of Genome Sciences, Seattle, WA 98195, USA
- Benaroya Research Institute at Virginia Mason, Seattle, WA 98101, USA
- Srikrishna Putta
- Department of Biology and Spinal Cord and Brain Injury Research Center, University of Kentucky, Lexington, KY 40506, USA
- Wei Zhu
- The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Gerald M Pao
- The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Inder M Verma
- The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Tony Hunter
- The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Susan V Bryant
- Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA 92697, USA
- The Developmental Biology Center, University of California Irvine, Irvine, CA 92697, USA
- David M Gardiner
- Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA 92697, USA
- The Developmental Biology Center, University of California Irvine, Irvine, CA 92697, USA
- S Randal Voss
- Department of Biology and Spinal Cord and Brain Injury Research Center, University of Kentucky, Lexington, KY 40506, USA
|
38
|
McGaughey DM, Stine ZE, Huynh JL, Vinton RM, McCallion AS. Asymmetrical distribution of non-conserved regulatory sequences at PHOX2B is reflected at the ENCODE loci and illuminates a possible genome-wide trend. BMC Genomics 2009; 10:8. [PMID: 19128492 PMCID: PMC2630312 DOI: 10.1186/1471-2164-10-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Received: 07/08/2008] [Accepted: 01/07/2009] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND Transcriptional regulatory elements are central to development and interspecific phenotypic variation. Current regulatory element prediction tools rely heavily upon conservation for prediction of putative elements. Recent in vitro observations from the ENCODE project combined with in vivo analyses at the zebrafish phox2b locus suggest that a significant fraction of regulatory elements may fall below commonly applied metrics of conservation. We propose to explore these observations in vivo at the human PHOX2B locus, and also evaluate the potential evidence for genome-wide applicability of these observations through a novel analysis of extant data. RESULTS Transposon-based transgenic analysis utilizing a tiling path proximal to human PHOX2B in zebrafish recapitulates the observations at the zebrafish phox2b locus of both conserved and non-conserved regulatory elements. Analysis of human sequences conserved with previously identified zebrafish phox2b regulatory elements demonstrates that the orthologous sequences exhibit overlapping regulatory control. Additionally, analysis of non-conserved sequences scattered over 135 kb 5' to PHOX2B provides evidence of non-conserved regulatory elements positively biased with close proximity to the gene. Furthermore, we provide a novel analysis of data from the ENCODE project, finding a non-uniform distribution of regulatory elements consistent with our in vivo observations at PHOX2B. These observations remain largely unchanged when one accounts for the sequence repeat content of the assayed intervals, when the intervals are sub-classified by biological role (developmental versus non-developmental), or by gene density (gene desert versus non-gene desert). CONCLUSION While regulatory elements frequently display evidence of evolutionary conservation, a fraction appears to be undetected by current metrics of conservation. In vivo observations at the PHOX2B locus, supported by our analyses of in vitro data from the ENCODE project, suggest that the risk of excluding non-conserved sequences in a search for regulatory elements may decrease as distance from the gene increases. Our data combined with the ENCODE data suggest that this may represent a genome-wide trend.
Affiliation(s)
- David M McGaughey
- McKusick - Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 N, Broadway, BRB Suite 449, Baltimore, MD 21205, USA.
|
39
|
Miele A, Dekker J. Long-range chromosomal interactions and gene regulation. Mol Biosyst 2008; 4:1046-57. [PMID: 18931780 PMCID: PMC2653627 DOI: 10.1039/b803580f] [Citation(s) in RCA: 130] [Impact Index Per Article: 8.1] [Indexed: 02/06/2023]
Abstract
Over the last few years important new insights into the process of long-range gene regulation have been obtained. Gene regulatory elements are found to engage in direct physical interactions with distant target genes and with loci on other chromosomes to modulate transcription. An overview of recently discovered long-range chromosomal interactions is presented, and a network approach is proposed to unravel gene-element relationships. Gene expression is controlled by regulatory elements that can be located far away along the chromosome or in some cases even on other chromosomes. Genes and regulatory elements physically associate with each other resulting in complex genome-wide networks of chromosomal interactions. Here we describe several well-characterized cases of long-range interactions involved in the activation and repression of transcription. We speculate on how these interactions may affect gene expression and outline possible mechanisms that may facilitate encounters between distant elements. Finally, we propose that a genome-wide network analysis may provide new insights into the logic of long-range gene regulation.
Affiliation(s)
- Adriana Miele
- Program in Gene Function and Expression and Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, 364 Plantation Street, Worcester MA 01605-0103
- Job Dekker
- Program in Gene Function and Expression and Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, 364 Plantation Street, Worcester MA 01605-0103
|
40
|
Identification of nuclear and cytoplasmic mRNA targets for the shuttling protein SF2/ASF. PLoS One 2008; 3:e3369. [PMID: 18841201 PMCID: PMC2556390 DOI: 10.1371/journal.pone.0003369] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.7] [Received: 07/26/2008] [Accepted: 07/31/2008] [Indexed: 12/15/2022] Open
Abstract
The serine and arginine-rich protein family (SR proteins) are highly conserved regulators of pre-mRNA splicing. SF2/ASF, a prototype member of the SR protein family, is a multifunctional RNA binding protein with roles in pre-mRNA splicing, mRNA export and mRNA translation. These observations suggest the intriguing hypothesis that SF2/ASF may couple splicing and translation of specific mRNA targets in vivo. Unfortunately the paucity of endogenous mRNA targets for SF2/ASF has hindered testing of this hypothesis. Here, we identify endogenous mRNAs directly cross-linked to SF2/ASF in different sub-cellular compartments. Cross-Linking Immunoprecipitation (CLIP) captures the in situ specificity of protein-RNA interaction and allows for the simultaneous identification of endogenous RNA targets as well as the locations of binding sites within the RNA transcript. Using the CLIP method we identified 326 binding sites for SF2/ASF in RNA transcripts from 180 protein coding genes. A purine-rich consensus motif was identified in binding sites located within exon sequences but not introns. Furthermore, 72 binding sites were occupied by SF2/ASF in different sub-cellular fractions suggesting that these binding sites may influence the splicing or translational control of endogenous mRNA targets. We demonstrate that ectopic expression of SF2/ASF regulates the splicing and polysome association of transcripts derived from the SFRS1, PABC1, NETO2 and ENSA genes. Taken together the data presented here indicate that SF2/ASF has the capacity to co-regulate the nuclear and cytoplasmic processing of specific mRNAs and provide further evidence that the nuclear history of an mRNA may influence its cytoplasmic fate.
|
41
|
Pashos EE, Kague E, Fisher S. Evaluation of cis-regulatory function in zebrafish. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2008; 7:465-73. [PMID: 18820318 DOI: 10.1093/bfgp/eln045] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
As increasing numbers of vertebrate genomes are sequenced, comparative genomics offers tremendous promise to unveil mechanisms of transcriptional gene regulation on a large scale. However, the challenge of analysing immense amounts of sequence data and relating primary sequence to function is daunting. Several teleost species occupy crucial niches in the world of comparative genomics, as experimental model organisms of wide utility and living roadmaps of molecular evolution. Extant species have evolved after a teleost-specific genome duplication, and offer the opportunity to examine the evolution of thousands of duplicate gene pairs. Transgenesis in zebrafish is being increasingly employed to functionally examine non-coding sequences, from fish and mammals. Here, we discuss current approaches to the study of gene regulation in teleosts, and the promise of future research.
|
42
|
Zhou H, Lin K. Excess of microRNAs in large and very 5' biased introns. Biochem Biophys Res Commun 2008; 368:709-15. [PMID: 18249189 DOI: 10.1016/j.bbrc.2008.01.117] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 01/13/2008] [Accepted: 01/27/2008] [Indexed: 11/29/2022]
Abstract
Many microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs) are located within the introns of genes in eukaryotes. In contrast to intronic snoRNAs, intronic miRNAs are processed from unspliced intronic regions before the catalysis of splicing in vertebrates. By analyzing the distribution patterns of the length and position of the introns hosting these two groups of small RNA genes, we observed that both human and mouse intronic miRNAs tended to be present in large introns, and that miRNA host introns have a more 5'-biased position distribution compared with all other introns in the two genomes. These observations indicate that the negative selection of functional constraints might affect the intron size in both genomes. Interestingly, the very 5'-biased positions of miRNA host introns may be necessary for the transcription and regulation of intronic miRNAs to utilize the regulatory signals within the 5'-UTRs of their host genes.
Affiliation(s)
- Hongjun Zhou
- MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University, No. 19, Xinjiekouwai Street, Beijing 100875, China
|
43
|
Abstract
Epigenetic research aims to understand heritable gene regulation that is not directly encoded in the DNA sequence. Epigenetic mechanisms such as DNA methylation and histone modifications modulate the packaging of the DNA in the nucleus and thereby influence gene expression. Patterns of epigenetic information are faithfully propagated over multiple cell divisions, which makes epigenetic regulation a key mechanism for cellular differentiation and cell fate decisions. In addition, incomplete erasure of epigenetic information can lead to complex patterns of non-Mendelian inheritance. Stochastic and environment-induced epigenetic defects are known to play a major role in cancer and ageing, and they may also contribute to mental disorders and autoimmune diseases. Recent technical advances such as ChIP-on-chip and ChIP-seq have started to convert epigenetic research into a high-throughput endeavor, to which bioinformatics is expected to make significant contributions. Here, we review pioneering computational studies that have contributed to epigenetic research. In addition, we give a brief introduction to epigenetics (targeted at bioinformaticians who are new to the field), and we outline future challenges in computational epigenetics.
Affiliation(s)
- Christoph Bock
- Max-Planck-Institut für Informatik, Saarbrücken, Germany.
|
44
|
Adams D, Karolak M, Robertson E, Oxburgh L. Control of kidney, eye and limb expression of Bmp7 by an enhancer element highly conserved between species. Dev Biol 2007; 311:679-90. [PMID: 17936743 DOI: 10.1016/j.ydbio.2007.08.036] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Received: 01/12/2007] [Revised: 08/10/2007] [Accepted: 08/20/2007] [Indexed: 01/04/2023]
Abstract
Bmp7 is expressed in numerous tissues throughout development and is required for morphogenesis of the eye, hindlimb and kidney. In this study we show that the majority, if not all, of the cis-regulatory sequence governing expression at these anatomical sites during development is present in approximately 20 kb surrounding exon 1. In eye, limb and kidney, multiple distinct enhancer elements drive Bmp7 expression within each organ. In the eye, the elements driving expression in the pigmented epithelium and iris are spatially separated. In the kidney, Bmp7 expression in collecting ducts and nephron progenitors is driven by separate enhancer elements. Similarly, limb mesenchyme and apical ectodermal ridge expression are governed by separate elements. Although enhancers for pigmented epithelium, nephrogenic mesenchyme and apical ectodermal ridge are distributed across the approximately 20 kb region, an element of approximately 480 base pairs within intron 1 governs expression within the developing iris, collecting duct system of the kidney and limb mesenchyme. This element is remarkably conserved both in sequence and position in the Bmp7 locus between different vertebrates, ranging from Xenopus tropicalis to Homo sapiens, demonstrating that there is strong selective pressure for Bmp7 expression at these tissue sites. Furthermore, we show that the frog enhancer functions appropriately in transgenic mice. Interestingly, the intron 1 element cannot be found in the Bmp7 genes of vertebrates such as Danio rerio and Takifugu rubripes, indicating that this modification of the Bmp7 gene might have arisen during the adaptation from aquatic to terrestrial life. Mutational analysis demonstrates that the enhancer activity of the intron 1 element is entirely dependent on the presence of a 10 base pair site within the intron 1 enhancer containing a predicted binding site for the FOXD3 transcription factor.
Affiliation(s)
- Derek Adams
- Maine Medical Center Research Institute, 81 Research Drive, Scarborough, Maine 04074, USA
|
45
|
Affiliation(s)
- George M Weinstock
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
|
46
|
Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, Batzoglou S, Goldman N, Hardison RC, Haussler D, 
Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CWH, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JNS, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PIW, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B, de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007; 447:799-816. [PMID: 17571346 PMCID: PMC2212820 DOI: 10.1038/nature05874] [Citation(s) in RCA: 3826] [Impact Index Per Article: 225.1] [Indexed: 01/17/2023]
Abstract
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
|
47
|
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M. What is a gene, post-ENCODE? History and updated definition. Genome Res 2007; 17:669-81. [PMID: 17567988 DOI: 10.1101/gr.6339607] [Citation(s) in RCA: 457] [Impact Index Per Article: 26.9] [Indexed: 11/24/2022]
Abstract
While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century, from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.
Affiliation(s)
- Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06511, USA.
|