Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles

78
(from Reference Citation Analysis)

Article PDFs (31)

Cited by > 0 (68)

Searched Name

Ivan Ovcharenko

Number	Citation Analysis
1	Detection of new pioneer transcription factors as cell-type-specific nucleosome binders. eLife 2024;12:RP88936. [PMID: 38293962 PMCID: PMC10945518 DOI: 10.7554/elife.88936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024] Abstract: Wrapping of DNA into nucleosomes restricts accessibility to DNA and may affect the recognition of binding motifs by transcription factors. A certain class of transcription factors, the pioneer transcription factors, can specifically recognize their DNA binding sites on nucleosomes, initiate local chromatin opening, and facilitate the binding of co-factors in a cell-type-specific manner. For the majority of human pioneer transcription factors, the locations of their binding sites, mechanisms of binding, and regulation remain unknown. We have developed a computational method to predict the cell-type-specific ability of transcription factors to bind nucleosomes by integrating ChIP-seq, MNase-seq, and DNase-seq data with details of nucleosome structure. We have demonstrated the ability of our approach in discriminating pioneer from canonical transcription factors and predicted new potential pioneer transcription factors in H1, K562, HepG2, and HeLa-S3 cell lines. Last, we systematically analyzed the interaction modes between various pioneer transcription factors and detected several clusters of distinctive binding sites on nucleosomal DNA.
Key Words: chromatin; computational; computational biology; human; nucleosome; nucleosome binding; pioneer transcription factor; systems biology; transcription factor
MeSH Headings: Humans; Nucleosomes/genetics; Transcription Factors/genetics; Transcription Factors/metabolism; Chromatin; DNA/metabolism; Binding Sites
Grants: National Library of Medicine; NIH HHS; Canada Research Chairs; Ontario Institute for Cancer Research; Natural Sciences and Engineering Research Council of Canada; National Natural Science Foundation of China; Cancer Research UK Cambridge Institute, University of Cambridge; National Institutes of Health
2	Sequence characteristics and an accurate model of abundant hyperactive loci in the human genome. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.05.527203. [PMID: 36945558 PMCID: PMC10028745 DOI: 10.1101/2023.02.05.527203] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023] Abstract: Enhancers and promoters are classically considered to be bound by a small set of TFs in a sequence-specific manner. This assumption has come under increasing skepticism as the datasets of ChIP-seq assays of TFs have expanded. In particular, high-occupancy target (HOT) loci attract hundreds of TFs with seemingly no detectable correlation between ChIP-seq peaks and DNA-binding motif presence. Here, we used a set of 1,003 TF ChIP-seq datasets (HepG2, K562, H1) to analyze the patterns of ChIP-seq peak co-occurrence in combination with functional genomics datasets. We identified 43,891 HOT loci forming at the promoter (53%) and enhancer (47%) regions. HOT promoters regulate housekeeping genes, whereas HOT enhancers are involved in tissue-specific process regulation. HOT loci form the foundation of human super-enhancers and evolve under strong negative selection, with some of these loci being located in ultraconserved regions. Sequence-based classification analysis of HOT loci suggested that their formation is driven by the sequence features, and the density of mapped ChIP-seq peaks across TF-bound loci correlates with sequence features and the expression level of flanking genes. Based on the affinities to bind to promoters and enhancers, we detected 5 distinct clusters of TFs that form the core of the HOT loci.
We report an abundance of HOT loci in the human genome and a commitment of 51% of all TF ChIP-seq binding events to HOT locus formation, thus challenging the classical model of enhancer activity, and propose a model of HOT locus formation based on the existence of large transcriptional condensates.
Key Words: Computational genomics; epigenetics; gene regulation; transcription factors; transcriptional regulation
3	Detection of new pioneer transcription factors as cell-type specific nucleosome binders. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.10.540098. [PMID: 37425841 PMCID: PMC10327179 DOI: 10.1101/2023.05.10.540098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023] Abstract: Wrapping of DNA into nucleosomes restricts accessibility to the DNA and may affect the recognition of binding motifs by transcription factors. A certain class of transcription factors, the pioneer transcription factors, can specifically recognize their DNA binding sites on nucleosomes, may initiate local chromatin opening and facilitate the binding of co-factors in a cell-type-specific manner. For the majority of human pioneer transcription factors, the locations of their binding sites, mechanisms of binding and regulation remain unknown. We have developed a computational method to predict the cell-type-specific ability of transcription factors to bind nucleosomes by integrating ChIP-seq, MNase-seq and DNase-seq data with details of nucleosome structure. We have demonstrated the ability of our approach in discriminating pioneer from canonical transcription factors and predicted new potential pioneer transcription factors in H1, K562, HepG2 and HeLa cell lines. Lastly, we systematically analyzed the interaction modes between various pioneer transcription factors and detected several clusters of distinctive binding sites on nucleosomal DNA.
4	Modeling islet enhancers using deep learning identifies candidate causal variants at loci associated with T2D and glycemic traits. Proc Natl Acad Sci U S A 2023;120:e2206612120. [PMID: 37603758 PMCID: PMC10469333 DOI: 10.1073/pnas.2206612120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Accepted: 07/19/2023] [Indexed: 08/23/2023] Abstract: Genetic association studies have identified hundreds of independent signals associated with type 2 diabetes (T2D) and related traits. Despite these successes, the identification of specific causal variants underlying a genetic association signal remains challenging. In this study, we describe a deep learning (DL) method to analyze the impact of sequence variants on enhancers. Focusing on pancreatic islets, a T2D relevant tissue, we show that our model learns islet-specific transcription factor (TF) regulatory patterns and can be used to prioritize candidate causal variants. At 101 genetic signals associated with T2D and related glycemic traits where multiple variants occur in linkage disequilibrium, our method nominates a single causal variant for each association signal, including three variants previously shown to alter reporter activity in islet-relevant cell types. For another signal associated with blood glucose levels, we biochemically test all candidate causal variants from statistical fine-mapping using a pancreatic islet beta cell line and show biochemical evidence of allelic effects on TF binding for the model-prioritized variant. To aid in future research, we publicly distribute our model and islet enhancer perturbation scores across ~67 million genetic variants. We anticipate that DL methods like the one presented in this study will enhance the prioritization of candidate causal variants for functional studies.
Key Words: deep learning; enhancer; epigenomics; pancreatic islets; type 2 diabetes
MeSH Headings: Diabetes Mellitus, Type 2/genetics; Diabetes Mellitus, Type 2/metabolism; Diabetes Mellitus, Type 2/pathology; Deep Learning; Enhancer Elements, Genetic; Islets of Langerhans/metabolism; Islets of Langerhans/pathology; Genetic Variation; Humans; Computer Simulation
Grants: ZIA HG000024 Intramural NIH HHS; ZIA LM200881 Intramural NIH HHS; R01 DK118011 NIDDK NIH HHS; HHS | National Institutes of Health (NIH); U.S. Department of Defense (DOD); HHS | NIH | National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
5	ChromDL: a next-generation regulatory DNA classifier. Bioinformatics 2023;39:i377-i385. [PMID: 37387183 DOI: 10.1093/bioinformatics/btad217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Abstract: MOTIVATION: Predicting the regulatory function of non-coding DNA using only the DNA sequence continues to be a major challenge in genomics. With the advent of improved optimization algorithms, faster GPU speeds, and more intricate machine-learning libraries, hybrid convolutional and recurrent neural network architectures can be constructed and applied to extract crucial information from non-coding DNA. RESULTS: Using a comparative analysis of the performance of thousands of Deep Learning architectures, we developed ChromDL, a neural network architecture combining bidirectional gated recurrent units, convolutional neural networks, and bidirectional long short-term memory units, which significantly improves upon a range of prediction metrics compared to its predecessors in transcription factor binding site, histone modification, and DNase-I hyper-sensitive site detection. Combined with a secondary model, it can be utilized for accurate classification of gene regulatory elements. The model can also detect weak transcription factor binding as compared to previously developed methods and has the potential to help delineate transcription factor binding motif specificities. AVAILABILITY AND IMPLEMENTATION: The ChromDL source code can be found at https://github.com/chrishil1/ChromDL.
Grants: NLM NIH HHS; NIH HHS
6	De novo human brain enhancers created by single-nucleotide mutations. SCIENCE ADVANCES 2023;9:eadd2911. [PMID: 36791193 PMCID: PMC9931207 DOI: 10.1126/sciadv.add2911] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/01/2022] [Accepted: 01/12/2023] [Indexed: 05/30/2023] Abstract: Advanced human cognition is attributed to increased neocortex size and complexity, but the underlying evolutionary and regulatory mechanisms are largely unknown. Using human and macaque embryonic neocortical H3K27ac data coupled with a deep learning model of enhancers, we identified ~4000 enhancer gains in humans, which, per our model, can often be attributed to single-nucleotide essential mutations. Our analyses suggest that functional gains in embryonic brain development are associated with de novo enhancers whose putative target genes exhibit increased expression in progenitor cells and interneurons and partake in critical neural developmental processes. Essential mutations alter enhancer activity through altered binding of key transcription factors (TFs) of embryonic neocortex, including ISL1, POU3F2, PITX1/2, and several SOX TFs, and are associated with central nervous system disorders. Overall, our results suggest that essential mutations lead to gain of embryonic neocortex enhancers, which orchestrate expression of genes involved in critical developmental processes associated with human cognition.
MeSH Headings: Humans; Enhancer Elements, Genetic; Nucleotides; Transcription Factors/genetics; Brain; Mutation; Gene Expression Regulation, Developmental
Grants: ZIA BC011979 Intramural NIH HHS; ZIA LM200881 Intramural NIH HHS; National Institutes of Health; National Cancer Institute
7	ChromDL: A Next-Generation Regulatory DNA Classifier. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.27.525971. [PMID: 36789431 PMCID: PMC9928050 DOI: 10.1101/2023.01.27.525971] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023] Abstract: MOTIVATION: Predicting the regulatory function of non-coding DNA using only the DNA sequence continues to be a major challenge in genomics. With the advent of improved optimization algorithms, faster GPU speeds, and more intricate machine learning libraries, hybrid convolutional and recurrent neural network architectures can be constructed and applied to extract crucial information from non-coding DNA. RESULTS: Using a comparative analysis of the performance of thousands of Deep Learning (DL) architectures, we developed ChromDL, a neural network architecture combining bidirectional gated recurrent units (BiGRU), convolutional neural networks (CNNs), and bidirectional long short-term memory units (BiLSTM), which significantly improves upon a range of prediction metrics compared to its predecessors in transcription factor binding site (TFBS), histone modification (HM), and DNase-I hypersensitive site (DHS) detection. Combined with a secondary model, it can be utilized for accurate classification of gene regulatory elements. The model can also detect weak transcription factor (TF) binding with higher accuracy as compared to previously developed methods and has the potential to accurately delineate TF binding motif specificities. AVAILABILITY: The ChromDL source code can be found at https://github.com/chrishil1/ChromDL.
8	A regulatory network of Sox and Six transcription factors initiate a cell fate transformation during hearing regeneration in adult zebrafish. CELL GENOMICS 2022;2. [PMID: 36212030 PMCID: PMC9540346 DOI: 10.1016/j.xgen.2022.100170] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 12/24/2022] Abstract: Using adult zebrafish inner ears as a model for sensorineural regeneration, we ablated the mechanosensory receptors and characterized the single-cell epigenome and transcriptome at consecutive time points during hair cell regeneration. We utilized deep learning on the regeneration-induced open chromatin sequences and identified cell-specific transcription factor (TF) motif patterns. Enhancer activity correlated with gene expression and identified potential gene regulatory networks. A pattern of overlapping Sox- and Six-family TF gene expression and binding motifs was detected, suggesting a combinatorial program of TFs driving regeneration and cell identity. Pseudotime analysis of single-cell transcriptomic data suggested that support cells within the sensory epithelium changed cell identity to a “progenitor” cell population that could differentiate into hair cells. We identified a 2.6 kb DNA enhancer upstream of the sox2 promoter that, when deleted, showed a dominant phenotype that resulted in a hair-cell-regeneration-specific deficit in both the lateral line and adult inner ear. Jimenez et al. interrogate the epigenomic and transcriptomic landscape of regenerating adult zebrafish inner-ear sensory epithelia. They show that the support-cell population transitions to an intermediate “progenitor” cell state that becomes new hair cells, and they demonstrate that the cell fate decisions may be driven by the coordinate regulation and spatial co-binding of Sox and Six transcription factors.
By functionally validating a predicted regeneration-responsive enhancer upstream of sox2, they show that precise timing of sox2 expression is critical for hearing regeneration in zebrafish.
Grants: ZIA HG200386 Intramural NIH HHS
9	Heterogeneity of enhancers embodies shared and representative functional groups underlying developmental and cell type-specific gene regulation. Gene 2022;834:146640. [PMID: 35680026 PMCID: PMC9235925 DOI: 10.1016/j.gene.2022.146640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 04/20/2022] [Accepted: 06/02/2022] [Indexed: 11/04/2022] Abstract: While enhancers in a particular tissue coordinately fulfill regulatory functions, these functions are heterogeneous in nature and comprise multiple enhancer subclasses and the associated regulatory mechanisms. In this work, we used multiple cell lines to identify enhancer subclasses linked to development, differentiation, and cellular identity. We found that enhancer functional heterogeneity during development encompasses subclasses of ubiquitous functions (11%), development-specific regulatory activity (62%), and chromatin interactions (12%). In differentiated cell lines, ubiquitous enhancers (10%) stay active across multiple cell lines. They are accompanied by a large enhancer subclass (ranging from 33% to 63%) with functions specific to the corresponding lineage. The remaining enhancers (27-40%) establish regulatory chromatin structure and facilitate interactions of cell type-specific enhancers with their target promoters. In addition to specialized functions of cell type-specific enhancers, we show that proper accounting of enhancer heterogeneity leads to a 10% increase in accuracy of enhancer classification, which significantly improves the modeling of enhancers and identification of underlying regulatory mechanisms.
In summary, our observations suggest that although cell type-specific enhancers are heterogeneous and coordinate different regulatory programs, enhancers from different cell lines maintain common categories of functional groups across developmental and differentiation stages, indicating a higher order rule followed by enhancer-gene regulation.
10	Corrigendum: Enhancer–silencer transitions in the human genome. Genome Res 2022. [PMCID: PMC9248891 DOI: 10.1101/gr.276950.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
11	Enhancer-silencer transitions in the human genome. Genome Res 2022;32:437-448. [PMID: 35105669 PMCID: PMC8896465 DOI: 10.1101/gr.275992.121] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 07/12/2021] [Accepted: 01/27/2022] [Indexed: 11/24/2022] Abstract: Dual-function regulatory elements (REs), acting as enhancers in some cellular contexts and as silencers in others, have been reported to facilitate the precise gene regulatory response to developmental signals in Drosophila melanogaster. However, with few isolated examples detected, dual-function REs in mammals have yet to be systematically studied. We herein investigated this class of REs in the human genome and profiled their activity across multiple cell types. Focusing on enhancer–silencer transitions specific to the development of T cells, we built an accurate deep learning classifier of REs and identified about 12,000 silencers active in primary peripheral blood T cells that act as enhancers in embryonic stem cells. Compared with regular silencers, these dual-function REs are evolving under stronger purifying selection and are enriched for mutations associated with disease phenotypes and altered gene expression. In addition, they are enriched in the loci of transcriptional regulators, such as transcription factors (TFs) and chromatin remodeling genes. Dual-function REs consist of two intertwined but largely distinct sets of binding sites bound by either activating or repressing TFs, depending on the type of RE function in a given cell line. This indicates the recruitment of different TFs for different regulatory modes and a complex DNA sequence composition of these REs with dual activating and repressive encoding.
With an estimated >6% of cell type–specific human silencers acting as dual-function REs, this overlooked class of REs requires a specific investigation on how their inherent functional plasticity might be a contributing factor to human diseases.
12	A model of active transcription hubs that unifies the roles of active promoters and enhancers. Nucleic Acids Res 2021;49:4493-4505. [PMID: 33872375 DOI: 10.1093/nar/gkab235] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 11/22/2020] [Revised: 01/27/2021] [Accepted: 03/22/2021] [Indexed: 12/31/2022] Abstract: An essential question of gene regulation is how large numbers of enhancers and promoters organize into gene regulatory loops. Using transcription-factor binding enrichment as an indicator of enhancer strength, we identified a portion of H3K27ac peaks as potentially strong enhancers and found a universal pattern of promoter and enhancer distribution: At actively transcribed regions of length of ∼200-300 kb, the numbers of active promoters and enhancers are inversely related. Enhancer clusters are associated with isolated active promoters, regardless of the gene's cell-type specificity. As the number of nearby active promoters increases, the number of enhancers decreases. At regions where multiple active genes are closely located, there are few distant enhancers. With Hi-C analysis, we demonstrate that the interactions among the regulatory elements (active promoters and enhancers) occur predominantly in clusters and multiway among linearly close elements and the distance between adjacent elements shows a preference of ∼30 kb. We propose a simple rule of spatial organization of active promoters and enhancers: Gene transcriptions and regulations mainly occur at local active transcription hubs contributed dynamically by multiple elements from linearly close enhancers and/or active promoters. The hub model can be represented with a flower-shaped structure and implies an enhancer-like role of active promoters.
13	Stable enhancers are active in development, and fragile enhancers are associated with evolutionary adaptation. Genome Biol 2019;20:140. [PMID: 31307522 PMCID: PMC6631995 DOI: 10.1186/s13059-019-1750-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/25/2019] [Accepted: 06/28/2019] [Indexed: 12/13/2022] Abstract: Background: Despite continual progress in the identification and characterization of trait- and disease-associated variants that disrupt transcription factor (TF)-DNA binding, little is known about the distribution of TF binding deactivating mutations (deMs) in enhancer sequences. Here, we focus on elucidating the mechanism underlying the different densities of deMs in human enhancers. Results: We identify two classes of enhancers based on the density of nucleotides prone to deMs. Firstly, fragile enhancers with abundant deM nucleotides are associated with the immune system and regular cellular maintenance. Secondly, stable enhancers with only a few deM nucleotides are associated with the development and regulation of TFs and are evolutionarily conserved. These two classes of enhancers feature different regulatory programs: the binding sites of pioneer TFs of the FOX family are specifically enriched in stable enhancers, while tissue-specific TFs are enriched in fragile enhancers. Moreover, stable enhancers are more tolerant of deMs due to their dominant employment of homotypic TF binding site (TFBS) clusters, as opposed to the larger-extent usage of heterotypic TFBS clusters in fragile enhancers. Notably, the sequence environment and chromatin context of the cognate motif, other than the motif itself, contribute more to the susceptibility to deMs of TF binding.
Conclusions: This dichotomy of enhancer activity is conserved across different tissues, has a specific footprint in epigenetic profiles, and argues for a bimodal evolution of gene regulatory programs in vertebrates. Specifically encoded stable enhancers are evolutionarily conserved and associated with development, while differently encoded fragile enhancers are associated with the adaptation of species.
Key Words: Causal regulatory variants; Enhancer; Evolution of gene regulation; Transcription factor interaction; Transgenic mouse reporter assay
14	Identification of human silencers by correlating cross-tissue epigenetic profiles and gene expression. Genome Res 2019;29:657-667. [PMID: 30886051 PMCID: PMC6442386 DOI: 10.1101/gr.247007.118] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 11/29/2018] [Accepted: 02/14/2019] [Indexed: 12/22/2022] Abstract: Compared to enhancers, silencers are notably difficult to identify and validate experimentally. In search for human silencers, we utilized H3K27me3-DNase I hypersensitive site (DHS) peaks with tissue specificity negatively correlated with the expression of nearby genes across 25 diverse cell lines. These regions are predicted to be silencers since they are physically linked, using Hi-C loops, or associated, using expression quantitative trait loci (eQTL) results, with a decrease in gene expression much more frequently than general H3K27me3-DHSs. Also, these regions are enriched for the binding sites of transcriptional repressors (such as CTCF, MECOM, SMAD4, and SNAI3) and depleted of the binding sites of transcriptional activators. Using sequence signatures of these regions, we constructed a computational model and predicted approximately 10,000 additional silencers per cell line and demonstrated that the majority of genes linked to these silencers are expressed at a decreased level. Furthermore, single nucleotide polymorphisms (SNPs) in predicted silencers are significantly associated with disease phenotypes. Finally, our results show that silencers commonly interact with enhancers to affect the transcriptional dynamics of tissue-specific genes and to facilitate fine-tuning of transcription in the human genome.
15	Dichotomy in redundant enhancers points to presence of initiators of gene regulation. BMC Genomics 2018;19:947. [PMID: 30563465 PMCID: PMC6299655 DOI: 10.1186/s12864-018-5335-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 08/23/2018] [Accepted: 11/29/2018] [Indexed: 12/31/2022] Abstract: Background: The regulatory landscape of a gene locus often consists of several functionally redundant enhancers establishing phenotypic robustness and evolutionary stability of its regulatory program. However, it is unclear what mechanisms are employed by redundant enhancers to cooperatively orchestrate gene expression. Results: By comparing redundant enhancers to single enhancers (enhancers present in a single copy in a gene locus), we observed that the DNA sequence encryption differs between these two classes of enhancers, suggesting a difference in their regulatory mechanisms. Initiator enhancers, which are a subset of redundant enhancers and show similar sequence encryption to single enhancers, differ from the rest of redundant enhancers in their sequence encryption, evolutionary conservation and proximity to target genes. Genes hosting initiator enhancers in their loci feature elevated levels of expression. Initiator enhancers show a high level of 3D chromatin contacts with both transcription start sites and regular enhancers, suggesting their roles as primary activators and intermediate catalysts of gene expression, through which the regulatory signals of redundant enhancers are propagated to the target genes. In addition, GWAS and eQTLs variants are significantly enriched in initiator enhancers compared to redundant enhancers, suggesting a key functional role these sequences play in gene regulation.
Conclusions: The specific characteristics and widespread abundance of initiator enhancers advocate for a possible universal hierarchical mechanism of tissue-specific gene regulation involving multiple redundant enhancers acting through initiator enhancers.
Key Words: Gene regulation; Redundant enhancers
16	Enhancer reprogramming in mammalian genomes. BMC Bioinformatics 2018;19:316. [PMID: 30200877 PMCID: PMC6131754 DOI: 10.1186/s12859-018-2343-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/19/2018] [Accepted: 08/28/2018] [Indexed: 12/18/2022] Abstract: Background: Transcription factor binding site (TFBS) loss, gain, and reshuffling within the sequence of a regulatory element could alter the function of that regulatory element. Some of the changes will be detrimental to the fitness of the species and will result in gradual removal from a population, while other changes would be either beneficial or just a part of genetic drift and end up being fixed in a population. This “reprogramming” of regulatory elements results in modification of the gene regulatory landscape during evolution. Results: We identified reprogrammed enhancers (RPEs) by comparing the distribution of tissue-specific enhancers in the human and mouse genomes. We observed that around 30% of mammalian enhancers have been reprogrammed after the human-mouse speciation. In 79% of cases, the reprogramming of an enhancer resulted in a quantifiably different expression of a flanking gene. In the case of the Thy-1 cell surface antigen gene, for example, enhancer reprogramming is associated with cortex to thymus change in gene expression. To understand the mechanisms of enhancer reprogramming, we profiled the evolutionary changes in the TFBS enhancer content and found that enhancer reprogramming took place through the acquisition of new TFBSs in 72% of reprogramming events. Conclusions: Our results suggest that enhancer reprogramming takes place within well-established regulatory loci with RPEs contributing additively to fine-tuning of the gene regulatory program in mammals.
We also found evidence for acquisition of novel gene function through enhancer reprogramming, which allows expansion of gene regulatory landscapes into new regulatory domains. Electronic supplementary material The online version of this article (10.1186/s12859-018-2343-7) contains supplementary material, which is available to authorized users. Key Words: Enhancers; Evolution; Gene regulation; Transcription factor binding sites
17	The hypothesis of ultraconserved enhancer dispensability overturned. Genome Biol 2018;19:57. [PMID: 29739466 PMCID: PMC5938802 DOI: 10.1186/s13059-018-1433-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022] Open Abstract Two recent studies explore how redundant enhancers in mice really are.
18	SNPDelScore: combining multiple methods to score deleterious effects of noncoding mutations in the human genome. Bioinformatics 2017;34:289-291. [PMID: 28968739 DOI: 10.1093/bioinformatics/btx583] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 06/29/2017] [Revised: 09/11/2017] [Accepted: 09/13/2017] [Indexed: 11/12/2022] Open Abstract SUMMARY Addressing deleterious effects of noncoding mutations is an essential step towards the identification of disease-causal mutations of gene regulatory elements. Several methods for quantifying the deleteriousness of noncoding mutations using artificial intelligence, deep learning and other approaches have been recently proposed. Although the majority of the proposed methods have demonstrated excellent accuracy on different test sets, there is rarely a consensus. In addition, advanced statistical and artificial learning approaches used by these methods make it difficult porting these methods outside of the labs that have developed them. To address these challenges and to transform the methodological advances in predicting deleterious noncoding mutations into a practical resource available for the broader functional genomics and population genetics communities, we developed SNPDelScore, which uses a panel of proposed methods for quantifying deleterious effects of noncoding mutations to precompute and compare the deleteriousness scores of all common SNPs in the human genome in 44 cell lines. The panel of deleteriousness scores of a SNP computed using different methods is supplemented by functional information from the GWAS Catalog, libraries of transcription factor-binding sites, and genic characteristics of mutations. 
SNPDelScore comes with a genome browser capable of displaying and comparing large sets of SNPs in a genomic locus and rapidly identifying consensus SNPs with the highest deleteriousness scores making those prime candidates for phenotype-causal polymorphisms. AVAILABILITY AND IMPLEMENTATION https://www.ncbi.nlm.nih.gov/research/snpdelscore/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
19	Quantifying deleterious effects of regulatory variants. Nucleic Acids Res 2017;45:2307-2317. [PMID: 27980060 PMCID: PMC5389506 DOI: 10.1093/nar/gkw1263] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/21/2016] [Accepted: 12/05/2016] [Indexed: 12/13/2022] Open Abstract The majority of genome-wide association study (GWAS) risk variants reside in non-coding DNA sequences. Understanding how these sequence modifications lead to transcriptional alterations and cell-to-cell variability can help unraveling genotype-phenotype relationships. Here, we describe a computational method, dubbed CAPE, which calculates the likelihood of a genetic variant deactivating enhancers by disrupting the binding of transcription factors (TFs) in a given cellular context. CAPE learns sequence signatures associated with putative enhancers originating from large-scale sequencing experiments (such as ChIP-seq or DNase-seq) and models the change in enhancer signature upon a single nucleotide substitution. CAPE accurately identifies causative cis-regulatory variation including expression quantitative trait loci (eQTLs) and DNase I sensitivity quantitative trait loci (dsQTLs) in a tissue-specific manner with precision superior to several currently available methods. The presented method can be trained on any tissue-specific dataset of enhancers and known functional variants and applied to prioritize disease-associated variants in the corresponding tissue.
20	Epigenetic and genetic alterations and their influence on gene regulation in chronic lymphocytic leukemia. BMC Genomics 2017;18:236. [PMID: 28302063 PMCID: PMC5353786 DOI: 10.1186/s12864-017-3617-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/13/2016] [Accepted: 03/10/2017] [Indexed: 01/07/2023] Open Abstract BACKGROUND To understand the changes of gene regulation in carcinogenesis, we explored signals of DNA methylation - a stable epigenetic mark of gene regulatory elements - and designed a computational model to profile loss and gain of regulatory elements (REs) during carcinogenesis. We also utilized sequencing data to analyze the allele frequency of single nucleotide polymorphisms (SNPs) and detected the cancer-associated SNPs, i.e., the SNPs displaying the significant allele frequency difference between cancer and normal samples. RESULTS After applying this model to chronic lymphocytic leukemia (CLL) data, we identified REs differentially activated (dREs) between normal and CLL cells, consisting of 6,802 dREs gained and 4,606 dREs lost in CLL. The identified regulatory perturbations coincide with changes in the expression of target genes. In particular, the genes encoding DNA methyltransferases harbor multiple lost-in-cancer dREs and zero gained-in-cancer dREs, indicating that the damaged regulation of these genes might be one of the key causes of tumor formation. dREs display a significantly elevated density of the genome-wide association study (GWAS) SNPs associated with CLL and CLL-related traits. We observed that most of dRE GWAS SNPs associated with CLL and CLL-related traits (83%) display a significant haplotype association among the identified cancer-associated alleles and the risk alleles that have been reported in GWAS. 
Also dREs are enriched for the binding sites of the well-established B-cell and CLL transcription factors (TFs) NF-kB, AP2, P53, E2F1, PAX5, and SP1. We also identified CLL-associated SNPs and demonstrated that the mutations at these SNPs change the binding sites of key TFs much more frequently than expected. CONCLUSIONS Through exploring sequencing data measuring DNA methylation, we identified the epigenetic alterations (more specifically, DNA methylation) and genetic mutations along non-coding genomic regions CLL, and demonstrated that these changes play a critical role in carcinogenesis through damaging the regulation of key genes and alternating the binding of key TFs in B and CLL cells. Key Words: DNA methylation; Genetic mutation; Genome-wide association study; Regulatory elements; Transcription factor binding site
21	Human Enhancers Are Fragile and Prone to Deactivating Mutations. Mol Biol Evol 2015;32:2161-80. [PMID: 25976354 DOI: 10.1093/molbev/msv118] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 12/24/2022] Open Abstract To explore the underlying mechanisms whereby noncoding variants affect transcriptional regulation, we identified nucleotides capable of disrupting binding of transcription factors and deactivating enhancers if mutated (dubbed candidate killer mutations or KMs) in HepG2 enhancers. On average, approximately 11% of enhancer positions are prone to KMs. A comparable number of enhancer positions are capable of creating de novo binding sites via a single-nucleotide mutation (dubbed candidate restoration mutations or RSs). Both KM and RS positions are evolutionarily conserved and tend to form clusters within an enhancer. We observed that KMs have the most deleterious effect on enhancer activity. In contrast, RSs have a smaller effect in increasing enhancer activity. Additionally, the KMs are strongly associated with liver-related Genome Wide Association Study traits compared with other HepG2 enhancer regions. By applying our framework to lymphoblastoid cell lines, we found that KMs underlie differential binding of transcription factors and differential local chromatin accessibility. The gene expression quantitative trait loci associated with the tissue-specific genes are strongly enriched in KM positions. In summary, we conclude that the KMs have the greatest impact on the level of gene expression and are likely to be the causal variants of tissue-specific gene expression and disease predisposition. Key Words: causal mutations; enhancers; gene regulation; transcription factor binding sites
22	Enhancer modeling uncovers transcriptional signatures of individual cardiac cell states in Drosophila. Nucleic Acids Res 2015;43:1726-39. [PMID: 25609699 PMCID: PMC4330375 DOI: 10.1093/nar/gkv011] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 12/22/2022] Open Abstract Here we used discriminative training methods to uncover the chromatin, transcription factor (TF) binding and sequence features of enhancers underlying gene expression in individual cardiac cells. We used machine learning with TF motifs and ChIP data for a core set of cardiogenic TFs and histone modifications to classify Drosophila cell-type-specific cardiac enhancer activity. We show that the classifier models can be used to predict cardiac cell subtype cis-regulatory activities. Associating the predicted enhancers with an expression atlas of cardiac genes further uncovered clusters of genes with transcription and function limited to individual cardiac cell subtypes. Further, the cell-specific enhancer models revealed chromatin, TF binding and sequence features that distinguish enhancer activities in distinct subsets of heart cells. Collectively, our results show that computational modeling combined with empirical testing provides a powerful platform to uncover the enhancers, TF motifs and gene expression profiles which characterize individual cardiac cell fates.
23	Identification and computational analysis of gene regulatory elements. Cold Spring Harb Protoc 2015;2015:pdb.top083642. [PMID: 25561628 PMCID: PMC5885252 DOI: 10.1101/pdb.top083642] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/04/2023] Abstract Over the last two decades, advances in experimental and computational technologies have greatly facilitated genomic research. Next-generation sequencing technologies have made de novo sequencing of large genomes affordable, and powerful computational approaches have enabled accurate annotations of genomic DNA sequences. Charting functional regions in genomes must account for not only the coding sequences, but also noncoding RNAs, repetitive elements, chromatin states, epigenetic modifications, and gene regulatory elements. A mix of comparative genomics, high-throughput biological experiments, and machine learning approaches has played a major role in this truly global effort. Here we describe some of these approaches and provide an account of our current understanding of the complex landscape of the human genome. We also present overviews of different publicly available, large-scale experimental data sets and computational tools, which we hope will prove beneficial for researchers working with large and complex genomes. MESH Headings: Animals; Computational Biology; Epigenesis, Genetic; Genome/genetics; Genomics/methods; Humans; Regulatory Elements, Transcriptional/genetics. Grants: 500188-Z-11-Z DBT-Wellcome Trust India Alliance; ZIA LM200881-02 Intramural NIH HHS
24	Identifying causal regulatory SNPs in ChIP-seq enhancers. Nucleic Acids Res 2015;43:225-36. [PMID: 25520196 PMCID: PMC4288203 DOI: 10.1093/nar/gku1318] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 09/19/2014] [Revised: 12/04/2014] [Accepted: 12/05/2014] [Indexed: 01/19/2023] Open Abstract Thousands of non-coding SNPs have been linked to human diseases in the past. The identification of causal alleles within this pool of disease-associated non-coding SNPs is largely impossible due to the inability to accurately quantify the impact of non-coding variation. To overcome this challenge, we developed a computational model that uses ChIP-seq intensity variation in response to non-coding allelic change as a proxy to the quantification of the biological role of non-coding SNPs. We applied this model to HepG2 enhancers and detected 4796 enhancer SNPs capable of disrupting enhancer activity upon allelic change. These SNPs are significantly over-represented in the binding sites of HNF4 and FOXA families of liver transcription factors and liver eQTLs. In addition, these SNPs are strongly associated with liver GWAS traits, including type I diabetes, and are linked to the abnormal levels of HDL and LDL cholesterol. Our model is directly applicable to any enhancer set for mapping causal regulatory SNPs. MESH Headings: Alleles; Binding Sites; Cell Line; Cell Line, Tumor; Chromatin Immunoprecipitation; Enhancer Elements, Genetic; Genome-Wide Association Study; Humans; Liver/metabolism; Polymorphism, Single Nucleotide; Quantitative Trait Loci; Sequence Analysis, DNA; Transcription Factors/metabolism. Grants: Intramural NIH HHS
25	Abstract 348: Multi-Species Genome-Wide Analyses of the Specification of Individual Cardiac Cell Fates. Circ Res 2014. [DOI: 10.1161/res.115.suppl_1.348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Abstract There are remarkable molecular and embryological similarities in cardiogenesis between Drosophila and vertebrates. Cells comprising the Drosophila heart can be subdivided into individual identities based on differences in morphology, function and gene expression patterns. Recent studies have shown that differential modifications of histone proteins, in vivo transcription factor (TF) binding, and the presence of particular TF binding motifs can be used as predictive signatures of the enhancers that govern cell-specific gene expression. Here we used discriminative training methods within an integrative, multi-species framework to uncover the motifs, enhancers and genes underlying cardiac cell fate decisions. As an initial step, we undertook a large-scale validation of Drosophila heart enhancers, which revealed enhancer activities in distinct subpopulations of cardiac cells. To identify related cell-specific regulatory elements, we used the validated enhancers as a training set in a machine learning approach that integrated TF motifs with ChIP data for both TF binding and histone modifications. Empirical validation of candidate enhancers predicted by this method confirmed activity in the appropriate cardiac cells. By clustering the motifs derived from the individual cardiac classifiers, we identified and validated sequence features which discriminate specific cellular identities. Next, we asked if similar predictive signatures underlie mouse and human cardiomyocyte (CM) differentiation from embryonic stem cells (ESCs). 
We show that the distribution of histone marks found within differentiating human and mouse ESCs indeed predict genes potentially critical for CM differentiation, with the best predictions provided by the overlapping mouse and human candidates. We evaluated this result in a large-scale RNAi-based screen of Drosophila orthologs of the mammalian genes, which uncovered dozens of novel cardiogenic regulators whose function is being tested in differentiating human ESCs. In total, these results document the utility of computational modeling combined with empirical testing to uncover the enhancers, TF motifs and genes which characterize individual cardiac cell fates in both invertebrate and mammalian species.
26	Machine learning classification of cell-specific cardiac enhancers uncovers developmental subnetworks regulating progenitor cell division and cell fate specification. Development 2014;141:878-88. [PMID: 24496624 PMCID: PMC3912831 DOI: 10.1242/dev.101709] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Indexed: 11/21/2022] Abstract The Drosophila heart is composed of two distinct cell types, the contractile cardial cells (CCs) and the surrounding non-muscle pericardial cells (PCs), development of which is regulated by a network of conserved signaling molecules and transcription factors (TFs). Here, we used machine learning with array-based chromatin immunoprecipitation (ChIP) data and TF sequence motifs to computationally classify cell type-specific cardiac enhancers. Extensive testing of predicted enhancers at single-cell resolution revealed the added value of ChIP data for modeling cell type-specific activities. Furthermore, clustering the top-scoring classifier sequence features identified novel cardiac and cell type-specific regulatory motifs. For example, we found that the Myb motif learned by the classifier is crucial for CC activity, and the Myb TF acts in concert with two forkhead domain TFs and Polo kinase to regulate cardiac progenitor cell divisions. In addition, differential motif enrichment and cis-trans genetic studies revealed that the Notch signaling pathway TF Suppressor of Hairless [Su(H)] discriminates PC from CC enhancer activities. Collectively, these studies elucidate molecular pathways used in the regulatory decisions for proliferation and differentiation of cardiac progenitor cells, implicate Su(H) in regulating cell fate decisions of these progenitors, and document the utility of enhancer modeling in uncovering developmental regulatory subnetworks. 
Key Words: Cell division; Drosophila; Gene regulation; Machine learning; Organogenesis; Progenitor specification; Transcription factors
27	Interrogating transcriptional regulatory sequences in Tol2-mediated Xenopus transgenics. PLoS One 2013;8:e68548. [PMID: 23874664 PMCID: PMC3713029 DOI: 10.1371/journal.pone.0068548] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/18/2013] [Accepted: 05/30/2013] [Indexed: 12/13/2022] Open Abstract Identifying gene regulatory elements and their target genes in vertebrates remains a significant challenge. It is now recognized that transcriptional regulatory sequences are critical in orchestrating dynamic controls of tissue-specific gene expression during vertebrate development and in adult tissues, and that these elements can be positioned at great distances in relation to the promoters of the genes they control. While significant progress has been made in mapping DNA binding regions by combining chromatin immunoprecipitation and next generation sequencing, functional validation remains a limiting step in improving our ability to correlate in silico predictions with biological function. We recently developed a computational method that synergistically combines genome-wide gene-expression profiling, vertebrate genome comparisons, and transcription factor binding-site analysis to predict tissue-specific enhancers in the human genome. We applied this method to 270 genes highly expressed in skeletal muscle and predicted 190 putative cis-regulatory modules. Furthermore, we optimized Tol2 transgenic constructs in Xenopus laevis to interrogate 20 of these elements for their ability to function as skeletal muscle-specific transcriptional enhancers during embryonic development. We found 45% of these elements expressed only in the fast muscle fibers that are oriented in highly organized chevrons in the Xenopus laevis tadpole. 
Transcription factor binding site analysis identified >2 Mef2/MyoD sites within ∼200 bp regions in 6 of the validated enhancers, and systematic mutagenesis of these sites revealed that they are critical for the enhancer function. The data described herein introduces a new reporter system suitable for interrogating tissue-specific cis-regulatory elements which allows monitoring of enhancer activity in real time, throughout early stages of embryonic development, in Xenopus. MESH Headings: Animals; Animals, Genetically Modified; Chromatin Immunoprecipitation; Larva/metabolism; Molecular Sequence Data; Muscle, Skeletal/metabolism; Regulatory Sequences, Nucleic Acid/genetics; Xenopus laevis. Grants: R01 HG003963 NHGRI NIH HHS; Intramural NIH HHS; HG003963 NHGRI NIH HHS
28	Effects of gene regulatory reprogramming on gene expression in human and mouse developing hearts. Philos Trans R Soc Lond B Biol Sci 2013;368:20120366. [PMID: 23650638 PMCID: PMC3682729 DOI: 10.1098/rstb.2012.0366] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022] Open Abstract Lineage-specific regulatory elements underlie adaptation of species and play a role in disease susceptibility. We compared functionally conserved and lineage-specific enhancers by cross-mapping 5042 human and 6564 mouse heart enhancers. Of these, 79 per cent are lineage-specific, lacking a functional orthologue. Heart enhancers tend to cluster and, commonly, there are multiple heart enhancers in a heart locus providing a regulatory stability to the locus. We observed little cross-clustering, however, between lineage-specific and functionally conserved heart enhancers suggesting regulatory function acquisition and development in loci previously lacking heart activity. We also identified 862 human-specific heart enhancers: 417 featuring sequence conservation with mouse (class II) and 445 with neither sequence nor function conservation (class III). Ninety-eight per cent of class III enhancers were deleted from the mouse genome, and we estimated a similar-sized enhancer gain in the human lineage. Human-specific enhancers display no detectable decrease in the negative selection pressure and are strongly associated with genes partaking in the heart regulatory programmes. The loss of a heart enhancer could be compensated by activity of a redundant heart enhancer; however, we observed redundancy in only 15 per cent of class II and III enhancer loci indicating a large-scale reprogramming of the heart regulatory programme in mammals. 
Key Words: cis-regulatory evolution; gene regulation; lineage-specific heart enhancers
29	High mobility group N proteins modulate the fidelity of the cellular transcriptional profile in a tissue- and variant-specific manner. J Biol Chem 2013;288:16690-16703. [PMID: 23620591 DOI: 10.1074/jbc.m113.463315] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Indexed: 11/06/2022] Open Abstract The nuclei of most vertebrate cells contain members of the high mobility group N (HMGN) protein family, which bind specifically to nucleosome core particles and affect chromatin structure and function, including transcription. Here, we study the biological role of this protein family by systematic analysis of phenotypes and tissue transcription profiles in mice lacking functional HMGN variants. Phenotypic analysis of Hmgn1(tm1/tm1), Hmgn3(tm1/tm1), and Hmgn5(tm1/tm1) mice and their wild type littermates with a battery of standardized tests uncovered variant-specific abnormalities. Gene expression analysis of four different tissues in each of the Hmgn(tm1/tm1) lines reveals very little overlap between genes affected by specific variants in different tissues. Pathway analysis reveals that loss of an HMGN variant subtly affects expression of numerous genes in specific biological processes. We conclude that within the biological framework of an entire organism, HMGNs modulate the fidelity of the cellular transcriptional profile in a tissue- and HMGN variant-specific manner. Key Words: Chromatin; Chromosomes/Non-histone Chromosomal Proteins; Gene Regulation; HMG Proteins; Mouse; Physiology; Transcriptomics; Transgenic Mice
30	Using an ensemble of statistical metrics to quantify large sets of plant transcription factor binding sites. PLANT METHODS 2013;9:12. [PMID: 23578135 PMCID: PMC3639912 DOI: 10.1186/1746-4811-9-12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2012] [Accepted: 03/28/2013] [Indexed: 05/07/2023] Abstract BACKGROUND From initial seed germination through reproduction, plants continuously reprogram their transcriptional repertoire to facilitate growth and development. This dynamic is mediated by a diverse but inextricably-linked catalog of regulatory proteins called transcription factors (TFs). Statistically quantifying TF binding site (TFBS) abundance in promoters of differentially expressed genes can be used to identify binding site patterns in promoters that are closely related to stress-response. Output from today's transcriptomic assays necessitates statistically-oriented software to handle large promoter-sequence sets in a computationally tractable fashion. RESULTS We present Marina, an open-source software for identifying over-represented TFBSs from amongst large sets of promoter sequences, using an ensemble of 7 statistical metrics and binding-site profiles. Through software comparison, we show that Marina can identify considerably more over-represented plant TFBSs compared to a popular software alternative. CONCLUSIONS Marina was used to identify over-represented TFBSs in a two time-point RNA-Seq study exploring the transcriptomic interplay between soybean (Glycine max) and soybean rust (Phakopsora pachyrhizi). Marina identified numerous abundant TFBSs recognized by transcription factors that are associated with defense-response such as WRKY, HY5 and MYB2. 
Comparing results from Marina to that of a popular software alternative suggests that regardless of the number of promoter-sequences, Marina is able to identify significantly more over-represented TFBSs.
31	A high-resolution enhancer atlas of the developing telencephalon. Cell 2013;152:895-908. [PMID: 23375746 DOI: 10.1016/j.cell.2012.12.041] [Citation(s) in RCA: 181] [Impact Index Per Article: 16.5] [Received: 05/11/2012] [Revised: 10/31/2012] [Accepted: 12/20/2012] [Indexed: 11/25/2022] Abstract The mammalian telencephalon plays critical roles in cognition, motor function, and emotion. Though many of the genes required for its development have been identified, the distant-acting regulatory sequences orchestrating their in vivo expression are mostly unknown. Here, we describe a digital atlas of in vivo enhancers active in subregions of the developing telencephalon. We identified more than 4,600 candidate embryonic forebrain enhancers and studied the in vivo activity of 329 of these sequences in transgenic mouse embryos. We generated serial sets of histological brain sections for 145 reproducible forebrain enhancers, resulting in a publicly accessible web-based data collection comprising more than 32,000 sections. We also used epigenomic analysis of human and mouse cortex tissue to directly compare the genome-wide enhancer architecture in these species. These data provide a primary resource for investigating gene regulatory mechanisms of telencephalon development and enable studies of the role of distant-acting enhancers in neurodevelopmental disorders.
32	Sequence signatures extracted from proximal promoters can be used to predict distal enhancers. Genome Biol 2013;14:R117. [PMID: 24156763 PMCID: PMC3983659 DOI: 10.1186/gb-2013-14-10-r117] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 06/05/2013] [Accepted: 10/24/2013] [Indexed: 01/22/2023] Open Abstract BACKGROUND Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. RESULTS We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% of the tissues, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning distal non-coding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity in the mouse liver, whereas a control set of five sequences was completely negative. 
CONCLUSIONS We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific signatures that can be learned and used for the de novo prediction of enhancers. MESH Headings: Animals; Binding Sites; Enhancer Elements, Genetic; Gene Expression Regulation; Genome-Wide Association Study; Genomics/methods; Humans; Mice; Nucleotide Motifs; Organ Specificity/genetics; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid; Reproducibility of Results; Support Vector Machine; Transcription Factors. Grants: P30 DK026743 NIDDK NIH HHS; 1R01NS079231 NINDS NIH HHS; CIHR; R01 HD059862 NICHD NIH HHS; GM61390 NIGMS NIH HHS; U01 GM061390 NIGMS NIH HHS; Intramural NIH HHS; R01 DK090382 NIDDK NIH HHS; T32 GM007175 NIGMS NIH HHS; 1R01HG005058 NHGRI NIH HHS; 1R01HG006768 NHGRI NIH HHS; 1R01DK090382 NIDDK NIH HHS; R01 NS079231 NINDS NIH HHS; U19 GM061390 NIGMS NIH HHS; R01HD059862 NICHD NIH HHS
33	Systematic elucidation and in vivo validation of sequences enriched in hindbrain transcriptional control. Genome Res 2012;22:2278-89. [PMID: 22759862 PMCID: PMC3483557 DOI: 10.1101/gr.139717.112] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/25/2022] Abstract Illuminating the primary sequence encryption of enhancers is central to understanding the regulatory architecture of genomes. We have developed a machine learning approach to decipher motif patterns of hindbrain enhancers and identify 40,000 sequences in the human genome that we predict display regulatory control that includes the hindbrain. Consistent with their roles in hindbrain patterning, MEIS1, NKX6-1, as well as HOX and POU family binding motifs contributed strongly to this enhancer model. Predicted hindbrain enhancers are overrepresented at genes expressed in hindbrain and associated with nervous system development, and primarily reside in the areas of open chromatin. In addition, 77 (0.2%) of these predictions are identified as hindbrain enhancers on the VISTA Enhancer Browser, and 26,000 (60%) overlap enhancer marks (H3K4me1 or H3K27ac). To validate these putative hindbrain enhancers, we selected 55 elements distributed throughout our predictions and six low scoring controls for evaluation in a zebrafish transgenic assay. When assayed in mosaic transgenic embryos, 51/55 elements directed expression in the central nervous system. Furthermore, 30/34 (88%) predicted enhancers analyzed in stable zebrafish transgenic lines directed expression in the larval zebrafish hindbrain. Subsequent analysis of sequence fragments selected based upon motif clustering further confirmed the critical role of the motifs contributing to the classifier. Our results demonstrate the existence of a primary sequence code characteristic to hindbrain enhancers.
This code can be accurately extracted using machine-learning approaches and applied successfully for de novo identification of hindbrain enhancers. This study represents a critical step toward the dissection of regulatory control in specific neuronal subtypes.
34	A machine learning approach for identifying novel cell type-specific transcriptional regulators of myogenesis. PLoS Genet 2012;8:e1002531. [PMID: 22412381 PMCID: PMC3297574 DOI: 10.1371/journal.pgen.1002531] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 03/30/2011] [Accepted: 12/23/2011] [Indexed: 12/22/2022] Open Abstract Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in largely similar patterns as their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes; and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs.
Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers. In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type-specific developmental gene expression patterns.
MeSH Headings: Animals; Artificial Intelligence; Binding Sites; Cell Lineage; Drosophila melanogaster/cytology; Drosophila melanogaster/genetics; Drosophila melanogaster/growth & development; Enhancer Elements, Genetic; Evolution, Molecular; Gene Expression Regulation, Developmental; Mesoderm/cytology; Mesoderm/growth & development; Muscles/cytology; Phylogeny; Transcription Factors/genetics; Transcription, Genetic
Grants: Intramural NIH HHS
35	Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs. BMC Bioinformatics 2012;13:25. [PMID: 22313678 PMCID: PMC3359238 DOI: 10.1186/1471-2105-13-25] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 09/16/2011] [Accepted: 02/07/2012] [Indexed: 12/26/2022] Open Abstract Background Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled by mainly non-coding elements known as cis-regulatory modules (CRMs) and epigenetic factors. CRMs modulating related genes share the regulatory signature which consists of transcription factor (TF) binding sites (TFBSs). Identifying such CRMs is a challenging problem due to the prohibitive number of sequence sets that need to be analyzed. Results We formulated the challenge as a supervised classification problem even though experimentally validated CRMs were not required. Our efforts resulted in a software system named CrmMiner. The system mines for CRMs in the vicinity of related genes. CrmMiner requires two sets of sequences: a mixed set and a control set. Sequences in the vicinity of the related genes comprise the mixed set, whereas the control set includes random genomic sequences. CrmMiner assumes that a large percentage of the mixed set is made of background sequences that do not include CRMs. The system identifies pairs of closely located motifs representing vertebrate TFBSs that are enriched in the training mixed set consisting of 50% of the gene loci. In addition, CrmMiner selects a group of the enriched pairs to represent the tissue-specific regulatory signature. The mixed and the control sets are searched for candidate sequences that include any of the selected pairs.
Next, an optimal Bayesian classifier is used to distinguish candidates found in the mixed set from their control counterparts. Our study proposes 62 tissue-specific regulatory signatures and putative CRMs for different human tissues and cell types. These signatures consist of assortments of ubiquitously expressed TFs and tissue-specific TFs. Under controlled settings, CrmMiner identified known CRMs in noisy sets up to 1:25 signal-to-noise ratio. CrmMiner was 21-75% more precise than a related CRM predictor. The sensitivity of the system to locate known human heart enhancers reached up to 83%. CrmMiner precision reached 82% while mining for CRMs specific to human CD4⁺ T cells. On several data sets, the system achieved 99% specificity. Conclusion These results suggest that CrmMiner predictions are accurate and likely to be tissue-specific CRMs. We expect that the predicted tissue-specific CRMs and the regulatory signatures broaden our knowledge of gene transcription regulation.
36	CLARE: Cracking the LAnguage of Regulatory Elements. Bioinformatics 2011;28:581-3. [PMID: 22199387 DOI: 10.1093/bioinformatics/btr704] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 12/14/2022] Abstract CLARE is a computational method designed to reveal sequence encryption of tissue-specific regulatory elements. Starting with a set of regulatory elements known to be active in a particular tissue/process, it learns the sequence code of the input set and builds a predictive model from features specific to those elements. The resulting model can then be applied to user-supplied genomic regions to identify novel candidate regulatory elements. CLARE's model also provides a detailed analysis of transcription factors that most likely bind to the elements, making it an invaluable tool for understanding mechanisms of tissue-specific gene regulation. AVAILABILITY CLARE is freely accessible at http://clare.dcode.org/.
37	Global gene expression analysis of murine limb development. PLoS One 2011;6:e28358. [PMID: 22174793 PMCID: PMC3235105 DOI: 10.1371/journal.pone.0028358] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 08/11/2011] [Accepted: 11/07/2011] [Indexed: 01/11/2023] Open Abstract Detailed information about stage-specific changes in gene expression is crucial for understanding the gene regulatory networks underlying development and the various signal transduction pathways contributing to morphogenesis. Here we describe the global gene expression dynamics during early murine limb development, when cartilage, tendons, muscle, joints, vasculature and nerves are specified and the musculoskeletal system of limbs is established. We used whole-genome microarrays to identify genes with differential expression at 5 stages of limb development (E9.5 to 13.5), during fore- and hind-limb patterning. We found that the onset of limb formation is characterized by an up-regulation of transcription factors, which is followed by a massive activation of genes during E10.5 and E11.5 which levels off at later time points. Among the 3520 genes identified as significantly up-regulated in the limb, we find ∼30% to be novel, dramatically expanding the repertoire of candidate genes likely to function in the limb. Hierarchical and stage-specific clustering identified expression profiles that are likely to correlate with functional programs during limb development and further characterization of these transcripts will provide new insights into specific tissue patterning processes. Here, we provide for the first time a comprehensive analysis of developmentally regulated genes during murine limb development, and provide some novel insights into the expression dynamics governing limb morphogenesis.
MeSH Headings: Animals; Extremities/embryology; Gene Expression Profiling; Gene Expression Regulation, Developmental; Limb Buds/anatomy & histology; Limb Buds/embryology; Mice; Organ Specificity/genetics; Organogenesis/genetics; Promoter Regions, Genetic/genetics; Time Factors; Transcriptome/genetics; Up-Regulation/genetics
Grants: DK075730 NIDDK NIH HHS; R01 HG003963 NHGRI NIH HHS; R01 DK075730 NIDDK NIH HHS; Intramural NIH HHS; R01 HG003963-03 NHGRI NIH HHS; R01 DK075730-03 NIDDK NIH HHS; HG003963 NHGRI NIH HHS
38	Genome-wide identification of conserved regulatory function in diverged sequences. Genome Res 2011;21:1139-49. [PMID: 21628450 PMCID: PMC3129256 DOI: 10.1101/gr.119016.110] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.5] [Received: 12/08/2010] [Accepted: 04/19/2011] [Indexed: 01/16/2023] Abstract Plasticity of gene regulatory encryption can permit DNA sequence divergence without loss of function. Functional information is preserved through conservation of the composition of transcription factor binding sites (TFBS) in a regulatory element. We have developed a method that can accurately identify pairs of functional noncoding orthologs at evolutionarily diverged loci by searching for conserved TFBS arrangements. With an estimated 5% false-positive rate (FPR) in approximately 3000 human and zebrafish syntenic loci, we detected approximately 300 pairs of diverged elements that are likely to share common ancestry and have similar regulatory activity. By analyzing a pool of experimentally validated human enhancers, we demonstrated that 7/8 (88%) of their predicted functional orthologs retained in vivo regulatory control. Moreover, in 5/7 (71%) of assayed enhancer pairs, we observed concordant expression patterns. We argue that TFBS composition is often necessary to retain and sufficient to predict regulatory function in the absence of overt sequence conservation, revealing an entire class of functionally conserved, evolutionarily diverged regulatory elements that we term "covert."
MeSH Headings: Animals; Animals, Genetically Modified/genetics; Computational Biology/methods; Conserved Sequence; Enhancer Elements, Genetic; Evolution, Molecular; Gene Expression Regulation, Developmental; Genetic Loci; Genome, Human; Humans; Models, Genetic; Oligonucleotide Array Sequence Analysis; Sequence Alignment; Sequence Analysis, DNA/methods; Synteny; Transcription Factors/genetics; Zebrafish/genetics
Grants: R01 HG004428 NHGRI NIH HHS; R01 HL088393 NHLBI NIH HHS; R01 NS062972 NINDS NIH HHS; Intramural NIH HHS
39	Effects of HMGN variants on the cellular transcription profile. Nucleic Acids Res 2011;39:4076-87. [PMID: 21278158 PMCID: PMC3105402 DOI: 10.1093/nar/gkq1343] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 11/17/2022] Open Abstract High mobility group N (HMGN) is a family of intrinsically disordered nuclear proteins that bind to nucleosomes, alter the structure of chromatin and affect transcription. A major unresolved question is the extent of functional specificity, or redundancy, between the various members of the HMGN protein family. Here, we analyze the transcriptional profile of cells in which the expression of various HMGN proteins has been either deleted or doubled. We find that both up- and downregulation of HMGN expression altered the cellular transcription profile. Most, but not all, of the changes were variant specific, suggesting limited redundancy in transcriptional regulation. Analysis of point and swap HMGN mutants revealed that the transcriptional specificity is determined by a unique combination of a functional nucleosome-binding domain and C-terminal domain. Doubling the amount of HMGN had a significantly larger effect on the transcription profile than total deletion, suggesting that the intrinsically disordered structure of HMGN proteins plays an important role in their function. The results reveal an HMGN-variant-specific effect on the fidelity of the cellular transcription profile, indicating that functionally the various HMGN subtypes are not fully redundant.
40	Tissue-specific and ubiquitous expression patterns from alternative promoters of human genes. PLoS One 2010;5:e12274. [PMID: 20806066 PMCID: PMC2923625 DOI: 10.1371/journal.pone.0012274] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Received: 01/11/2010] [Accepted: 06/18/2010] [Indexed: 01/17/2023] Open Abstract BACKGROUND Transcriptome diversity provides the key to cellular identity. One important contribution to expression diversity is the use of alternative promoters, which creates mRNA isoforms by expanding the choice of transcription initiation sites of a gene. The proximity of the basal promoter to the transcription initiation site enables prediction of a promoter's location based on the gene annotations. We show that annotation of alternative promoters regulating expression of transcripts with distinct first exons enables a novel methodology to quantify expression levels and tissue specificity of mRNA isoforms. PRINCIPAL FINDINGS The use of distinct alternative first exons in 3,296 genes was examined using exon-microarray data from 11 human tissues. Comparing two transcripts from each gene we found that the activity of alternative promoters (i.e., P1 and P2) was not correlated through tissue specificity or level of expression. Furthermore neither P1 nor P2 conferred any bias for tissue-specific or ubiquitous expression. Genes associated with specific diseases produced transcripts whose limited expression patterns were consistent with the tissue affected in disease. Notably, genes that were historically designated as tissue-specific or housekeeping had alternative isoforms that showed differential expression.
Furthermore, only a small number of alternative promoters showed expression exclusive to a single tissue, indicating that "tissue preference" provides a better description of promoter activity than tissue specificity. When compared to gene expression data in public databases, as few as 22% of the genes had detailed information for more than one isoform, whereas the remainder collapsed the expression patterns from individual transcripts into one profile. CONCLUSIONS We describe a computational pipeline that uses microarray data to assess the level of expression and breadth of tissue profiles for transcripts with distinct first exons regulated by alternative promoters. We conclude that alternative promoters provide individualized regulation that is confirmed through expression levels, tissue preference and chromatin modifications. Although the selective use of alternative promoters often goes uncharacterized in gene expression analyses, transcripts produced in this manner make unique contributions to the cell that require further exploration.
MeSH Headings: Computational Biology; Databases, Genetic; Disease/genetics; Entropy; Epigenesis, Genetic/genetics; Exons/genetics; Gene Expression Profiling; Genomics; Hepatocyte Nuclear Factor 4/genetics; Humans; Mutation; Nucleic Acid Hybridization; Oligonucleotide Array Sequence Analysis; Organ Specificity; Phenotype; Promoter Regions, Genetic/genetics; RNA, Messenger/genetics
Grants: Intramural NIH HHS
41	The genome of the Western clawed frog Xenopus tropicalis. Science 2010;328:633-6. [PMID: 20431018 PMCID: PMC2994648 DOI: 10.1126/science.1183670] [Citation(s) in RCA: 574] [Impact Index Per Article: 41.0] [Indexed: 12/13/2022] Abstract The western clawed frog Xenopus tropicalis is an important model for vertebrate development that combines experimental advantages of the African clawed frog Xenopus laevis with more tractable genetics. Here we present a draft genome sequence assembly of X. tropicalis. This genome encodes more than 20,000 protein-coding genes, including orthologs of at least 1700 human disease genes. Over 1 million expressed sequence tags validated the annotation. More than one-third of the genome consists of transposable elements, with unusually prevalent DNA transposons. Like that of other tetrapods, the genome of X. tropicalis contains gene deserts enriched for conserved noncoding elements. The genome exhibits substantial shared synteny with human and chicken over major parts of large chromosomes, broken by lineage-specific chromosome fusions and fissions, mainly in the mammalian lineage.
MeSH Headings: Animals; Chickens/genetics; Chromosome Mapping; Chromosomes/genetics; Computational Biology; Conserved Sequence; DNA Transposable Elements; DNA, Complementary; Embryo, Nonmammalian/metabolism; Evolution, Molecular; Expressed Sequence Tags; Gene Duplication; Genes; Genome; Humans; Phylogeny; Sequence Analysis, DNA; Synteny; Vertebrates/genetics; Xenopus/embryology; Xenopus/genetics; Xenopus Proteins/genetics
Grants: R01 AI027877-20 NIAID NIH HHS; U01 HG002155-05 NHGRI NIH HHS; P41 HD064556 NICHD NIH HHS; P41 HD064556-01 NICHD NIH HHS; R21 HD065713 NICHD NIH HHS; R01 MH079381 NIMH NIH HHS; MC_U117560482 Medical Research Council; R01 DK070858-05 NIDDK NIH HHS; R01 GM060572 NIGMS NIH HHS; R01 GM086321-03 NIGMS NIH HHS; R01 GM060572-05 NIGMS NIH HHS; HHSN261200800001E NCI NIH HHS; R01 HD046661-03 NICHD NIH HHS; R01 AI027877 NIAID NIH HHS; R01 DK070858 NIDDK NIH HHS; U01 HG002155 NHGRI NIH HHS; R24 RR015088-03 NCRR NIH HHS; R01 HD042294-05 NICHD NIH HHS; R24 RR015088 NCRR NIH HHS; R01 HD042294 NICHD NIH HHS; R01 GM086321 NIGMS NIH HHS; U01 HG02155 NHGRI NIH HHS; P41 HD064556-02 NICHD NIH HHS; R01 EY018000 NEI NIH HHS; R01 EY018000-03 NEI NIH HHS; R24 AI059830-08 NIAID NIH HHS; Intramural NIH HHS; R01 HD045776 NICHD NIH HHS; R01 HD045776-05 NICHD NIH HHS; R01 MH079381-02 NIMH NIH HHS; R24 AI059830 NIAID NIH HHS
42	Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res 2010;20:565-77. [PMID: 20363979 DOI: 10.1101/gr.104471.109] [Citation(s) in RCA: 169] [Impact Index Per Article: 12.1] [Indexed: 11/25/2022] Abstract Clustering of multiple transcription factor binding sites (TFBSs) for the same transcription factor (TF) is a common feature of cis-regulatory modules in invertebrate animals, but the occurrence of such homotypic clusters of TFBSs (HCTs) in the human genome has remained largely unknown. To explore whether HCTs are also common in human and other vertebrates, we used known binding motifs for vertebrate TFs and a hidden Markov model-based approach to detect HCTs in the human, mouse, chicken, and fugu genomes, and examined their association with cis-regulatory modules. We found that evolutionarily conserved HCTs occupy nearly 2% of the human genome, with experimental evidence for individual TFs supporting their binding to predicted HCTs. More than half of the promoters of human genes contain HCTs, with a distribution around the transcription start site in agreement with the experimental data from the ENCODE project. In addition, almost half of the 487 experimentally validated developmental enhancers contain them as well, a number more than 25-fold larger than expected by chance. We also found evidence of negative selection acting on TFBSs within HCTs, as the conservation of TFBSs is stronger than the conservation of sequences separating them. The important role of HCTs as components of developmental enhancers is additionally supported by a strong correlation between HCTs and the binding of the enhancer-associated coactivator protein Ep300 (also known as p300).
Experimental validation of HCT-containing elements in both zebrafish and mouse suggests that HCTs could be used to predict both the presence of enhancers and their tissue specificity, and are thus a feature that can be effectively used in deciphering the gene regulatory code. In conclusion, our results indicate that HCTs are a pervasive feature of human cis-regulatory modules and suggest that they play an important role in gene regulation in the human and other vertebrate genomes.
43	Human variation in short regions predisposed to deep evolutionary conservation. Mol Biol Evol 2010;27:1279-88. [PMID: 20093432 PMCID: PMC2872621 DOI: 10.1093/molbev/msq011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/24/2023] Open Abstract The landscape of the human genome consists of millions of short islands of conservation that are 100% conserved across multiple vertebrate genomes (termed “bricks”), the majority of which are located in noncoding regions. Several hundred thousand bricks are deeply conserved, reaching the genomes of amphibians and fish. Deep phylogenetic conservation of noncoding DNA has been reported to be strongly associated with the presence of gene regulatory elements, introducing bricks as a proxy to the functional noncoding landscape of the human genome. Here, we report a significant overrepresentation of bricks in the promoters of transcription factors and developmental genes, where the high level of phylogenetic conservation correlates with an increase in brick overrepresentation. We also found that the presence of a brick dictates a predisposition to evolutionary constraint, with only 0.7% of the amniota brick central nucleotides being diverged within the primate lineage, an 11-fold reduction in the divergence rate compared with random expectation. Human single-nucleotide polymorphism (SNP) data explain only 3% of primate-specific variation in amniota bricks, thus arguing for a widespread fixation of brick mutations within the primate lineage and prior to human radiation. This variation, in turn, might have been utilized as a driving force for primate- and hominoid-specific adaptation. We also discovered a pronounced deviation from the evolutionary predisposition in the human lineage, with over 20-fold increase in the substitution rate at brick SNP sites over expected values.
In addition, contrary to typical brick mutations, brick variation commonly encountered in the human population displays limited, if any, signatures of negative selection as measured by minor allele frequency and population differentiation (F-statistic). These observations argue for the plasticity of gene regulatory mechanisms in vertebrates, with evidence of strong purifying selection acting on the gene regulatory landscape of the human genome, where widespread advantageous mutations in putative regulatory elements are likely utilized in functional diversification and adaptation of species.
44	Genome-wide discovery of human heart enhancers. Genome Res 2010;20:381-92. [PMID: 20075146 DOI: 10.1101/gr.098657.109] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Indexed: 11/24/2022] Abstract The various organogenic programs deployed during embryonic development rely on the precise expression of a multitude of genes in time and space. Identifying the cis-regulatory elements responsible for this tightly orchestrated regulation of gene expression is an essential step in understanding the genetic pathways involved in development. We describe a strategy to systematically identify tissue-specific cis-regulatory elements that share combinations of sequence motifs. Using heart development as an experimental framework, we employed a combination of Gibbs sampling and linear regression to build a classifier that identifies heart enhancers based on the presence and/or absence of various sequence features, including known and putative transcription factor (TF) binding specificities. In distinguishing heart enhancers from a large pool of random noncoding sequences, the performance of our classifier is vastly superior to four commonly used methods, with an accuracy reaching 92% in cross-validation. Furthermore, most of the binding specificities learned by our method resemble the specificities of TFs widely recognized as key players in heart development and differentiation, such as SRF, MEF2, ETS1, SMAD, and GATA. Using our classifier as a predictor, a genome-wide scan identified over 40,000 novel human heart enhancers. Although the classifier used no gene expression information, these novel enhancers are strongly associated with genes expressed in the heart. Finally, in vivo tests of our predictions in mouse and zebrafish achieved a validation rate of 62%, significantly higher than what is expected by chance.
These results support the existence of underlying cis-regulatory codes dictating tissue-specific transcription in mammalian genomes and validate our enhancer classifier strategy as a method to uncover these regulatory codes.
45	Identifying regulatory elements in eukaryotic genomes. Briefings in Functional Genomics and Proteomics 2009;8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.9] [Indexed: 12/30/2022] Abstract Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. We highlight available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information, such as gene expression data, chromatin structure, DNA-binding specificities of transcription factors, and cooperativity of transcription factors. We also discuss the relevance of regulatory elements to human health through examples of mutations in these regions that lead to misregulation of genes and are strongly associated with human disorders.
46	Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics 2009;25:578-84. [PMID: 19168912 PMCID: PMC2647827 DOI: 10.1093/bioinformatics/btp043] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 12/31/2022] Abstract MOTIVATION Several functional gene annotation databases have been developed in recent years, and are widely used to infer the biological function of gene sets by scrutinizing the attributes that appear over- and underrepresented. However, this strategy is not directly applicable to the study of non-coding DNA, as the non-coding sequence span varies greatly among different gene loci in the human genome and longer loci have a higher likelihood of being selected purely by chance. Therefore, conclusions involving the function of non-coding elements that are drawn based on the annotation of neighboring genes are often biased. We assessed the systematic bias in several particular Gene Ontology (GO) categories using the standard hypergeometric test, by randomly sampling non-coding elements from the human genome and inferring their function based on the functional annotation of the closest genes. While no category is expected to occur significantly over- or underrepresented for a random selection of elements, categories such as 'cell adhesion', 'nervous system development' and 'transcription factor activities' appeared to be systematically overrepresented, while others, such as 'olfactory receptor activity', appeared underrepresented. RESULTS Our results suggest that functional inference for non-coding elements using gene annotation databases requires a special correction.
We introduce a set of correction coefficients for the probabilities of the GO categories that accounts for the variability in the length of the non-coding DNA across different loci and effectively eliminates the ascertainment bias from the functional characterization of non-coding elements. Our approach can be easily generalized to any other gene annotation database.
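The length bias described in entry 46 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy loci, the function names, and the use of a simple length-weighted fraction as the "corrected" category probability are all assumptions made for illustration. The hypergeometric tail is the standard test named in the abstract.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def category_probabilities(loci, category):
    """Return (gene-count fraction, length-weighted fraction) for a category.

    loci maps gene name -> (non-coding span in bp, set of annotation terms).
    The length-weighted fraction is an assumed stand-in for the paper's
    correction, not its published coefficients.
    """
    total_len = sum(length for length, _ in loci.values())
    in_cat = [length for length, terms in loci.values() if category in terms]
    return len(in_cat) / len(loci), sum(in_cat) / total_len


# Hypothetical toy loci: one long locus and two short ones.
TOY_LOCI = {
    "geneA": (900_000, {"cell adhesion"}),
    "geneB": (50_000, {"olfactory receptor activity"}),
    "geneC": (50_000, {"olfactory receptor activity"}),
}

if __name__ == "__main__":
    naive, weighted = category_probabilities(TOY_LOCI, "cell adhesion")
    print(f"gene-count fraction: {naive:.3f}, length-weighted: {weighted:.3f}")
    print(f"ratio (weighted / naive): {weighted / naive:.2f}")
    print(f"hypergeometric tail: {hypergeom_sf(1, 10, 3, 4):.4f}")
```

In this toy setting a randomly placed non-coding element lands in the long locus far more often than the gene-count fraction suggests, which is the kind of overrepresentation the abstract reports for categories attached to long loci.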
47	DiRE: identifying distant regulatory elements of co-expressed genes. Nucleic Acids Res 2008;36:W133-9. [PMID: 18487623 PMCID: PMC2447744 DOI: 10.1093/nar/gkn300] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.8] [Received: 12/17/2007] [Revised: 04/23/2008] [Accepted: 04/29/2008] [Indexed: 11/13/2022] Abstract Regulation of gene expression in eukaryotic genomes is established through a complex cooperative activity of proximal promoters and distant regulatory elements (REs) such as enhancers, repressors and silencers. We have developed a web server named DiRE, based on the Enhancer Identification (EI) method, for predicting distant regulatory elements in higher eukaryotic genomes, namely for determining their chromosomal location and functional characteristics. The server uses gene co-expression data, comparative genomics and profiles of transcription factor binding sites (TFBSs) to determine TFBS-association signatures that can be used for discriminating specific regulatory functions. DiRE's unique feature is its ability to detect REs outside of proximal promoter regions, as it takes advantage of the full gene locus to conduct the search. DiRE can predict common REs for any set of input genes for which the user has prior knowledge of co-expression, co-function or other biologically meaningful grouping. The server predicts function-specific REs consisting of clusters of specifically-associated TFBSs and it also scores the association of individual transcription factors (TFs) with the biological function shared by the group of input genes. Its integration with the Array2BIO server allows users to start their analysis with raw microarray expression data. The DiRE web server is freely available at http://dire.dcode.org. 
MeSH Headings: Animals; Binding Sites; Enhancer Elements, Genetic; Gene Expression Profiling; Gene Expression Regulation; Genomics; Humans; Internet; Promoter Regions, Genetic; Regulatory Elements, Transcriptional; Software; Transcription Factors/metabolism; User-Computer Interface. Grants: Intramural NIH HHS.
48	Widespread ultraconservation divergence in primates. Mol Biol Evol 2008;25:1668-76. [PMID: 18492662 PMCID: PMC2464743 DOI: 10.1093/molbev/msn116] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 12/28/2022] Abstract The distribution and evolution of ultraconserved elements (UCEs, DNA stretches that are perfectly identical in primates and rodents) were examined in genomes of 3 primate species (human, chimpanzee, and rhesus macaque). It was found that the number of UCEs has decreased throughout primate evolution. At least 26% of ancestral UCEs have diverged in hominoids, whereas an additional 17% have accumulated one or more single nucleotide polymorphisms in the human genome. Sequence polymorphism analyses indicate that mutation fixation within a UCE can trigger a relaxation in the selective constraint on that element. Homogeneous mutation accumulations in UCEs served as a template by which purifying selection acted more effectively on protein-coding UCEs. Gene ontology annotation suggests that UCE sequence variation, primarily occurring in noncoding regions, might be linked to the reprogramming of the expression pattern of transcription factors and developmentally important genes. Many of these genes are expressed in the central nervous system. Finally, UCE sequence variability within human populations has been identified, including population-specific nonsynonymous changes in protein-coding regions.
49	Comparative analysis of chicken chromosome 28 provides new clues to the evolutionary fragility of gene-rich vertebrate regions. Genome Res 2007;17:1603-13. [PMID: 17921355 DOI: 10.1101/gr.6775107] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Indexed: 01/31/2023] Abstract The chicken genome draft sequence has provided a valuable resource for studies of an important agricultural and experimental model species and an important data set for comparative analysis. However, some of the most gene-rich segments are missing from chicken genome draft assemblies, limiting the analysis of a substantial number of genes and preventing a closer look at regions that are especially prone to syntenic rearrangements. To facilitate the functional and evolutionary analysis of one especially gene-rich, rearrangement-prone genomic region, we analyzed sequence from BAC clones spanning chicken microchromosome GGA28; as a complement we also analyzed a gene-sparse, stable region from GGA11. In these two regions we documented the conservation and lineage-specific gain and loss of protein-coding genes and precisely mapped the locations of 31 major human-chicken syntenic breakpoints. Altogether, we identified 72 lineage-specific genes, many of which are found at or near syntenic breaks, implicating evolutionary breakpoint regions as major sites of genetic innovation and change. Twenty-two of the 31 breakpoint regions have been reused repeatedly as rearrangement breakpoints in vertebrate evolution. Compared with stable GC-matched regions, GGA28 is highly enriched in CpG islands, as are break-prone intervals identified elsewhere in the chicken genome; evolutionary breakpoints are further enriched in GC content and CpG islands, highlighting a potential role for these features in genome instability. 
These data support the hypothesis that chromosome rearrangements have not occurred randomly over the course of vertebrate evolution but are focused preferentially within "fragile" regions with unusual DNA sequence characteristics.
50	Extent of linkage disequilibrium in chicken. Cytogenet Genome Res 2007;117:338-45. [PMID: 17675876 DOI: 10.1159/000103196] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 07/11/2006] [Accepted: 10/25/2006] [Indexed: 01/27/2023] Abstract Many of the economically important traits in chicken are multifactorial and governed by multiple genes located at different quantitative trait loci (QTLs). The optimal marker density to identify these QTLs in linkage and association studies is largely determined by the extent of linkage disequilibrium (LD) around them. In this study, we investigated the extent of LD on two chromosomes in a white layer and two broiler chicken breeds. Pairwise levels of LD were calculated for 33 and 36 markers on chromosomes 10 and 28, respectively. We found that useful LD (i.e. an r² value higher than 0.3) in Nutreco chicken breed E5 (inbred) can extend to around 1 cM on chromosomes 10 and 28, although in a second region on chromosome 28 it extends to about 2.5 cM. The extent in breed Nutreco E3 (outbred) was very short on chromosome 10 (15 kb) but very much larger on chromosome 28, particularly in one region of depressed heterozygosity. The layer breed E2 (inbred) showed an extent of useful LD up to 4 cM on chromosome 10; the extent on chromosome 28 could not be assessed due to an erratic pattern of LD on that chromosome, although in one region LD appears to be in the order of 0.8 cM. This indicates that there may be very large differences in patterns of LD between different chicken breeds and different genomic regions. MeSH Headings: Animals; Breeding; Chickens/genetics; Female; Genetic Markers; Linkage Disequilibrium/genetics; Male; Polymorphism, Single Nucleotide/genetics.
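The pairwise r² statistic used in entry 50 can be sketched as follows. This is a minimal illustration that assumes phased haplotype counts for two biallelic markers are already available; the abstract does not say how the study estimated haplotype frequencies, and the function names and example counts are hypothetical. Only the 0.3 threshold for "useful LD" comes from the abstract.

```python
USEFUL_LD_THRESHOLD = 0.3  # "useful LD" cutoff quoted in the abstract


def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise r^2 from counts of the four haplotypes at two biallelic markers."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at the first marker
    p_B = (n_AB + n_aB) / n  # frequency of allele B at the second marker
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    d = n_AB / n - p_A * p_B  # disequilibrium coefficient D
    return d * d / denom


def is_useful_ld(r2, threshold=USEFUL_LD_THRESHOLD):
    """True when r^2 exceeds the threshold (strictly higher, as in the abstract)."""
    return r2 > threshold


if __name__ == "__main__":
    # Hypothetical counts: AB, Ab, aB, ab haplotypes in a sample of 100.
    r2 = r_squared(40, 10, 10, 40)
    print(f"r^2 = {r2:.2f}, useful LD: {is_useful_ld(r2)}")
```

Applying such a function to every marker pair and recording the largest map distance at which r² stays above the threshold gives one way to express the "extent of useful LD" that the abstract reports in cM.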