1
Enhanced Detection and Genotyping of Disease-Associated Tandem Repeats Using HMMSTR and Targeted Long-Read Sequencing. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.05.01.24306681. [PMID: 38746091 PMCID: PMC11092683 DOI: 10.1101/2024.05.01.24306681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Tandem repeat sequences comprise approximately 8% of the human genome and are linked to more than 50 neurodegenerative disorders. Accurate characterization of disease-associated repeat loci remains resource intensive and often lacks high-resolution genotype calls. We introduce a multiplexed, targeted nanopore sequencing panel and HMMSTR, a sequence-based tandem repeat copy number caller. HMMSTR outperforms current signal- and sequence-based callers relative to two assemblies, and we show it performs with high accuracy in heterozygous regions and at low read coverage. The flexible panel allows us to capture disease-associated regions at an average coverage of >150x. Using these tools, we successfully characterize known or suspected repeat expansions in patient-derived samples. In these samples we also identify unexpected expanded alleles at tandem repeat loci not previously associated with the underlying diagnosis. This genotyping approach for tandem repeat expansions is scalable, simple, flexible, and accurate, offering significant potential for diagnostic applications and investigation of expansion co-occurrence in neurodegenerative disorders.

2
HaplotagLR: An efficient and configurable utility for haplotagging long reads. PLoS One 2024; 19:e0298688. [PMID: 38478504 PMCID: PMC10936807 DOI: 10.1371/journal.pone.0298688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 01/30/2024] [Indexed: 03/17/2024]
Abstract
Understanding the functional effects of sequence variation is crucial in genomics. Individual human genomes contain millions of variants that contribute to phenotypic variability and disease risks at the population level. Because variants rarely act in isolation, we must consider potential interactions of neighboring variants to accurately predict functional effects. We can accomplish this using haplotagging, which matches sequencing reads to their parental haplotypes using alleles observed at known heterozygous variants. However, few published tools for haplotagging exist, and these share several technical and usability-related shortcomings that limit applicability, in particular a lack of insight or control over error rates and a lack of key metrics on the underlying sources of haplotagging error. Here we present HaplotagLR: a user-friendly tool that haplotags long sequencing reads based on a multinomial model and existing phased variant lists. HaplotagLR is user-configurable and includes a basic error model to control the empirical FDR in its output. We show that HaplotagLR outperforms the leading haplotagging method in simulated datasets, especially at high levels of specificity, and displays 7% greater sensitivity in haplotagging real data. HaplotagLR both advances the immediate utility of haplotagging and paves the way for further improvements to this important method.
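The haplotagging idea described in this abstract (assigning a read to a parental haplotype from the alleles it carries at known heterozygous sites) can be illustrated with a minimal sketch. This is not HaplotagLR's implementation: the function name `haplotag_read`, the uniform per-base `error_rate`, and the `min_llr` threshold are all assumptions made for illustration.

```python
import math

def haplotag_read(observed, phased_variants, error_rate=0.01, min_llr=3.0):
    """Assign a read to 'hap1', 'hap2', or 'untagged'.

    observed:        dict mapping position -> base seen in the read
    phased_variants: dict mapping position -> (hap1_allele, hap2_allele)
    error_rate:      assumed uniform per-base sequencing error (hypothetical)
    min_llr:         assumed log-likelihood-ratio threshold (hypothetical)
    """
    log_l1 = 0.0
    log_l2 = 0.0
    informative = 0
    for pos, base in observed.items():
        if pos not in phased_variants:
            continue
        a1, a2 = phased_variants[pos]
        if a1 == a2:
            continue  # homozygous sites carry no phase information
        informative += 1
        # A base matches the true allele with probability 1 - e and is
        # any one specific wrong base with probability e / 3.
        log_l1 += math.log(1 - error_rate if base == a1 else error_rate / 3)
        log_l2 += math.log(1 - error_rate if base == a2 else error_rate / 3)
    llr = log_l1 - log_l2
    if llr >= min_llr:
        label = "hap1"
    elif llr <= -min_llr:
        label = "hap2"
    else:
        label = "untagged"
    return label, llr, informative

variants = {100: ("A", "G"), 200: ("C", "T"), 300: ("G", "A")}
print(haplotag_read({100: "A", 200: "C", 300: "G"}, variants))
```

Raising `min_llr` trades sensitivity for specificity, which is the kind of error-rate control the abstract says is missing from earlier tools.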

3
CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 2024; 25:53. [PMID: 38389099 PMCID: PMC10882881 DOI: 10.1186/s13059-023-03113-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2023] [Accepted: 11/17/2023] [Indexed: 02/24/2024]
Abstract
BACKGROUND The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. RESULTS Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. CONCLUSIONS Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.

4
Systematic investigation of allelic regulatory activity of schizophrenia-associated common variants. CELL GENOMICS 2023; 3:100404. [PMID: 37868037 PMCID: PMC10589626 DOI: 10.1016/j.xgen.2023.100404] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/22/2022] [Revised: 02/23/2023] [Accepted: 08/21/2023] [Indexed: 10/24/2023]
Abstract
Genome-wide association studies (GWASs) have successfully identified 145 genomic regions that contribute to schizophrenia risk, but linkage disequilibrium makes it challenging to discern causal variants. We performed a massively parallel reporter assay (MPRA) on 5,173 fine-mapped schizophrenia GWAS variants in primary human neural progenitors and identified 439 variants with allelic regulatory effects (MPRA-positive variants). Transcription factor binding had modest predictive power, while fine-map posterior probability, enhancer overlap, and evolutionary conservation failed to predict MPRA-positive variants. Furthermore, 64% of MPRA-positive variants did not exhibit an expression quantitative trait loci signature, suggesting that MPRA could identify as yet unexplored variants with regulatory potential. To predict the combinatorial effect of MPRA-positive variants on gene regulation, we propose an accessibility-by-contact model that combines MPRA-measured allelic activity with neuronal chromatin architecture.

5
Mechanisms of Intrinsic Radioresistance in Breast Cancer Identify Potential Therapeutic Vulnerabilities. Int J Radiat Oncol Biol Phys 2023; 117:e250. [PMID: 37784974 DOI: 10.1016/j.ijrobp.2023.06.1191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/04/2023]
Abstract
PURPOSE/OBJECTIVE(S) Clinical management of breast cancer (BC) includes radiation therapy (RT) for most women, though the molecular mechanisms that underlie RT response and intrinsic radioresistance are poorly understood. Both in vitro and in vivo models aid in our understanding of radiobiology, and we hypothesized that transcriptional changes caused by radiation in vitro in BC cell lines would be recapitulated in an in vivo mouse xenograft model and uncover targetable mechanisms of radioresistance in BC. MATERIALS/METHODS Radiosensitivity was measured with clonogenic survival assays in 16 cell lines. RNA-seq experiments in vitro and in vivo were performed in an RT-resistant (SUM-159) and an RT-sensitive (ZR-75) cell line 24 hrs after 4 Gy or after 2 Gy x 6 fractions, respectively. Differentially expressed genes (DEGs) were identified from RNA-seq data with DESeq2 followed by pathway analysis with iPathwayGuide. RESULTS RT sensitivity was subtype independent in 16 BC cell lines, with SUM-159 radioresistant (SF 0.88) and ZR-75-1 radiosensitive (SF 0.29). There were 75 unique pathways that were significantly altered after RT in SUM-159 cells (53 pathways in vivo only, 36 pathways in vitro only, 14 both conditions; adjusted p-value < 0.05) and 85 unique pathways that were significantly altered after RT in ZR-75-1 cells (16 pathways in vivo only, 72 in vitro only, 3 both conditions; adjusted p-value < 0.05). Pathways that were significantly affected in both cell lines exclusively in the in vitro condition include canonical RT response pathways such as cell cycle, cellular senescence, and DNA replication, though the direction of DEG changes was opposite in the two cell lines for each of these pathways. The IL-17 signaling pathway was significantly altered for both cell lines in vivo.
Of the pathways that were significantly altered in both conditions for SUM-159 cells, inflammation-related pathways, including the chemokine signaling pathway and cytokine-cytokine receptor interaction, were among the most significant. Significantly more cytokines were upregulated following RT in vivo than in vitro. Cytokines were not upregulated in ZR-75-1 cells in vitro or in vivo. CONCLUSION Taken together, the significant changes in the IL-17 pathway and the upregulation of cytokines only in vivo indicate a potential role for the tumor microenvironment in the in vivo condition that the in vitro condition lacks. Increased heterogeneity in vivo relative to in vitro may also explain the absence of several canonical RT response pathways in the in vivo conditions for each cell line. Notably, the opposite direction of DEG changes in the canonical RT response pathways between the 2 cell lines with disparate radiosensitivity levels may point to important biologic vulnerabilities that may be targeted in the resistant SUM-159 cells. Future studies are underway using additional BC cell lines and single-cell analysis to better understand RT response heterogeneity.

6
Sperm chromatin structure and reproductive fitness are altered by substitution of a single amino acid in mouse protamine 1. Nat Struct Mol Biol 2023; 30:1077-1091. [PMID: 37460896 PMCID: PMC10833441 DOI: 10.1038/s41594-023-01033-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/09/2021] [Accepted: 06/12/2023] [Indexed: 08/11/2023]
Abstract
Conventional dogma presumes that protamine-mediated DNA compaction in sperm is achieved by electrostatic interactions between DNA and the arginine-rich core of protamines. Phylogenetic analysis reveals several non-arginine residues conserved within, but not across, species. The significance of these residues and their post-translational modifications is poorly understood. Here, we investigated the role of K49, a rodent-specific lysine residue in protamine 1 (P1) that is acetylated early in spermiogenesis and retained in sperm. In sperm, alanine substitution (P1(K49A)) decreases sperm motility and male fertility, defects that are not rescued by arginine substitution (P1(K49R)). In zygotes, P1(K49A) leads to premature male pronuclear decompaction, altered DNA replication, and embryonic arrest. In vitro, P1(K49A) decreases protamine-DNA binding and alters DNA compaction and decompaction kinetics. Hence, a single amino acid substitution outside the P1 arginine core is sufficient to profoundly alter protein function and developmental outcomes, suggesting that protamine non-arginine residues are essential for reproductive fitness.

7
Challenges in screening for de novo noncoding variants contributing to genetically complex phenotypes. HGG ADVANCES 2023; 4:100210. [PMID: 37305558 PMCID: PMC10248550 DOI: 10.1016/j.xhgg.2023.100210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 05/15/2023] [Indexed: 06/13/2023]
Abstract
Understanding the genetic basis for complex, heterogeneous disorders, such as autism spectrum disorder (ASD), is a persistent challenge in human medicine. Owing to their phenotypic complexity, the genetic mechanisms underlying these disorders may be highly variable across individual patients. Furthermore, much of their heritability is unexplained by known regulatory or coding variants. Indeed, there is evidence that much of the causal genetic variation stems from rare and de novo variants arising from ongoing mutation. These variants occur mostly in noncoding regions, likely affecting regulatory processes for genes linked to the phenotype of interest. However, because there is no uniform code for assessing regulatory function, it is difficult to separate these mutations into likely functional and nonfunctional subsets. This makes finding associations between complex diseases and potentially causal de novo single-nucleotide variants (dnSNVs) a difficult task. To date, most published studies have struggled to find any significant associations between dnSNVs from ASD patients and any class of known regulatory elements. We sought to identify the underlying reasons for this and present strategies for overcoming these challenges. We show that, contrary to previous claims, the main reason for failure to find robust statistical enrichments is not only the number of families sampled, but also the quality and relevance to ASD of the annotations used to prioritize dnSNVs, and the reliability of the set of dnSNVs itself. We present a list of recommendations for designing future studies of this sort that will help researchers avoid common pitfalls.

8
Multiplexed long-read plasmid validation and analysis using OnRamp. Genome Res 2023:gr.277369.122. [PMID: 37156622 DOI: 10.1101/gr.277369.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Accepted: 05/03/2023] [Indexed: 05/10/2023]
Abstract
Recombinant plasmid vectors are versatile tools that have facilitated discoveries in molecular biology, genetics, proteomics, and many other fields. As the enzymatic and bacterial processes used to create recombinant DNA can introduce errors, sequence validation is an essential step in plasmid assembly. Sanger sequencing is the current standard for plasmid validation; however, this method is limited by an inability to sequence through complex secondary structure, and lacks scalability when applied to full-plasmid sequencing of multiple plasmids due to read-length limits. While high-throughput sequencing does provide full-plasmid sequencing at scale, it is impractical and costly when utilized outside of library-scale validation. Here we present OnRamp (Oxford nanopore-based Rapid Analysis of Multiplexed Plasmids), an alternative method for routine plasmid validation which combines the advantages of high-throughput sequencing's full-plasmid coverage and scalability with Sanger's affordability and accessibility by leveraging nanopore's long-read sequencing technology. We include customized wet-lab protocols for plasmid preparation along with a pipeline designed for analysis of read data obtained using these protocols. This analysis pipeline is deployed on the OnRamp web app, which generates alignments between actual and predicted plasmid sequences, quality scores, and read-level views. OnRamp is designed to be broadly accessible regardless of programming experience to facilitate more widespread adoption of long-read sequencing for routine plasmid validation. Here we describe the OnRamp protocols and pipeline and demonstrate our ability to obtain full sequences from pooled plasmids while detecting sequence variation even in regions of high secondary structure at less than half the cost of equivalent Sanger sequencing.

9

10
Abstract 2401: Molecular mechanisms of intrinsic radioresistance in breast cancer. Cancer Res 2023. [DOI: 10.1158/1538-7445.am2023-2401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/07/2023]
Abstract
Background: Clinical management of BC includes radiation therapy (RT), with most women receiving RT as part of their treatment. Although effective, many women develop locoregional recurrence, including a disproportionate number of women with triple-negative or inflammatory BC. Unfortunately, the molecular mechanisms that underlie RT response and intrinsic radioresistance are poorly understood. We hypothesized that transcriptomic and proteomic changes that occur after ionizing radiation in intrinsically radiosensitive and resistant BC models would offer mechanistic insight into mediators of this differential response.
Methods: Intrinsic radiosensitivity across all 10 cell lines was measured with clonogenic survival assays as the surviving fraction (SF) after 2 Gy RT. Gene expression changes were assessed by RNA-Seq 24 hours after 4 Gy RT. For long-course RT, cell lines were treated with fractionated RT (2 Gy x 5 fractions). For in vivo mouse xenograft experiments mice received fractionated RT (2 Gy x 6 fractions). Differential gene expression analysis with DESeq2 was performed on all samples, followed by pathway analysis with Advaita Bioinformatics’ iPathwayGuide. Protein was collected 1, 12, and 24 hours after RT for RPPA analysis evaluating expression changes in 100 proteins and phospho-proteins with SuperCurve.
Results: Clonogenic survival identified a wide range of radiation sensitivity in human BC cell lines (SF 83% - 19%) with no significant correlation (r < 0.3) to intrinsic BC subtype. The most highly affected pathways in both resistant and sensitive cell lines 24 hours after RT include cell cycle, cellular senescence, and estrogen signaling pathways. For the long-course RT samples, several pathways were significantly altered in fractionated samples only, including MAPK and Hippo signaling and EGFR tyrosine kinase inhibitor resistance. From the in vivo experiments, pathways uniquely affected in the in vivo samples include IL-17 signaling and transcriptional misregulation in cancer. From the proteomic data, we found that proteins including p53, Bcl-2 family proteins, and cell cycle proteins exhibit expression changes after 1 hour. A significant number of pathways (N=69, p < 0.01, FDR 0.05) were affected in radioresistant BC models compared to radiosensitive cell lines and these pathways may underlie intrinsic radioresistance.
Conclusions: Ionizing radiation induces transcriptomic and proteomic expression changes that differ between intrinsically sensitive and resistant BC models in both single fraction and fractionated studies. Pathways identified in these analyses offer potential insight into the mechanisms underlying intrinsic radioresistance and suggest biologic vulnerabilities that may be targeted to more effectively treat women at a high risk of local BC recurrence. Genome-wide CRISPR-Cas9 screens are currently underway in these breast cancer models to confirm these vulnerability targets.
Citation Format: Breanna N. McBean, Anna R. Michmerhuizen, Kari Wilder-Romans, Benjamin C. Chandler, Lynn M. Lerner, Connor Ward, Meilan Liu, Alan P. Boyle, Corey W. Speers. Molecular mechanisms of intrinsic radioresistance in breast cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 2401.

11
LRphase: an efficient method for assigning haplotype identity to long reads. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.18.524565. [PMID: 36712073 PMCID: PMC9882277 DOI: 10.1101/2023.01.18.524565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
Abstract
Understanding the functional effects of sequence variation is among the primary goals of contemporary genomics. Individual human genomes contain millions of variants which are thought to contribute to phenotypic variability and differential disease risks at the population level. However, because variants rarely act in isolation, we cannot accurately predict functional effects without first considering the potential effects of other interacting variants on the same chromosome. This information can be obtained by phasing the read data from sequencing experiments. However, no standalone tools are available to simply phase reads based on known haplotypes. Here we present LRphase: a user-friendly utility for simple phasing of long sequencing reads.

12
SEMplMe: a tool for integrating DNA methylation effects in transcription factor binding affinity predictions. BMC Bioinformatics 2022; 23:317. [PMID: 35927613 PMCID: PMC9351228 DOI: 10.1186/s12859-022-04865-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2021] [Accepted: 07/28/2022] [Indexed: 12/02/2022]
Abstract
MOTIVATION Aberrant DNA methylation in transcription factor binding sites has been shown to lead to anomalous gene regulation that is strongly associated with human disease. However, the majority of methylation-sensitive positions within transcription factor binding sites remain unknown. Here we introduce SEMplMe, a computational tool to generate predictions of the effect of methylation on transcription factor binding strength in every position within a transcription factor's motif. RESULTS SEMplMe uses ChIP-seq and whole genome bisulfite sequencing to predict effects of methylation within binding sites. SEMplMe validates known methylation sensitive and insensitive positions within a binding motif, identifies cell type specific transcription factor binding driven by methylation, and outperforms SELEX-based predictions for CTCF. These predictions can be used to identify aberrant sites of DNA methylation contributing to human disease. AVAILABILITY AND IMPLEMENTATION SEMplMe is available from https://github.com/Boyle-Lab/SEMplMe .

13
Comprehensive enhancer-target gene assignments improve gene set level interpretation of genome-wide regulatory data. Genome Biol 2022; 23:105. [PMID: 35473573 PMCID: PMC9044877 DOI: 10.1186/s13059-022-02668-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/09/2020] [Accepted: 04/06/2022] [Indexed: 11/16/2022]
Abstract
BACKGROUND Revealing the gene targets of distal regulatory elements is challenging yet critical for interpreting regulome data. Experiment-derived enhancer-gene links are restricted to a small set of enhancers and/or cell types, while the accuracy of genome-wide approaches remains elusive due to the lack of a systematic evaluation. We combined multiple spatial and in silico approaches for defining enhancer locations and linking them to their target genes aggregated across >500 cell types, generating 1860 human genome-wide distal enhancer-to-target gene definitions (EnTDefs). To evaluate performance, we used gene set enrichment (GSE) testing on 87 independent ENCODE ChIP-seq datasets of 34 transcription factors (TFs) and assessed concordance of results with known TF Gene Ontology annotations, and other benchmarks. RESULTS The top ranked 741 (40%) EnTDefs significantly outperform the common, naïve approach of linking distal regions to the nearest genes, and the top 10 EnTDefs perform well when applied to ChIP-seq data of other cell types. The GSE-based ranking of EnTDefs is highly concordant with ranking based on overlap with curated benchmarks of enhancer-gene interactions. Both our top general EnTDef and cell-type-specific EnTDefs significantly outperform seven independent computational and experiment-based enhancer-gene pair datasets. We show that using our top EnTDefs for GSE with either genome-wide DNA methylation or ATAC-seq data is able to better recapitulate the biological processes changed in gene expression data performed in parallel for the same experiment than our lower-ranked EnTDefs. CONCLUSIONS Our findings illustrate the power of our approach to provide genome-wide interpretation regardless of cell type.

14
Abstract
We present SquiggleNet, the first deep-learning model that can classify nanopore reads directly from their electrical signals. SquiggleNet operates faster than DNA passes through the pore, allowing real-time classification and read ejection. Using 1 s of sequencing data, the classifier achieves significantly higher accuracy than base calling followed by sequence alignment. Our approach is also faster and requires an order of magnitude less memory than alignment-based approaches. SquiggleNet distinguished human from bacterial DNA with over 90% accuracy, generalized to unseen bacterial species in a human respiratory metagenome sample, and accurately classified sequences containing human long interspersed repeat elements.

15
Prioritization of regulatory variants with tissue-specific function in the non-coding regions of human genome. Nucleic Acids Res 2021; 50:e6. [PMID: 34648033 PMCID: PMC8754628 DOI: 10.1093/nar/gkab924] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/18/2021] [Revised: 09/21/2021] [Accepted: 09/27/2021] [Indexed: 02/06/2023]
Abstract
Understanding the functional consequences of genetic variation in the non-coding regions of the human genome remains a challenge. We introduce here a computational tool, TURF, to prioritize regulatory variants with tissue-specific function by leveraging evidence from functional genomics experiments, including over 3000 functional genomics datasets from the ENCODE project provided in the RegulomeDB database. TURF is able to generate prediction scores at both organism and tissue/organ-specific levels for any non-coding variant on the genome. We show that TURF achieves top overall prediction performance on validated variants from MPRA experiments. We also demonstrate how TURF can pick out the regulatory variants with tissue-specific function from a candidate list from association studies. Furthermore, we found that various GWAS traits showed enrichment of regulatory variants predicted by TURF scores in the trait-relevant organs, which indicates that these variants can be a valuable source for future studies.

16
The Inducible lac Operator-Repressor System Is Functional in Zebrafish Cells. Front Genet 2021; 12:683394. [PMID: 34220959 PMCID: PMC8249864 DOI: 10.3389/fgene.2021.683394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2021] [Accepted: 05/20/2021] [Indexed: 11/13/2022]
Abstract
Background Zebrafish are a foundational model organism for studying the spatio-temporal activity of genes and their regulatory sequences. A variety of approaches are currently available for editing genes and modifying gene expression in zebrafish, including RNAi, Cre/lox, and CRISPR-Cas9. However, the lac operator-repressor system, an E. coli lac operon component which has been adapted for use in many other species and is a valuable, flexible tool for inducible modulation of gene expression, has not been previously tested in zebrafish. Results Here we demonstrate that the lac operator-repressor system robustly decreases expression of firefly luciferase in cultured zebrafish fibroblast cells. Our work establishes the lac operator-repressor system as a promising tool for the manipulation of gene expression in whole zebrafish. Conclusion Our results lay the groundwork for the development of lac-based reporter assays in zebrafish, and add to the tools available for investigating dynamic gene expression in embryogenesis. We believe this work will catalyze the development of new reporter assay systems to investigate uncharacterized regulatory elements and their cell-type specific activities.

17
Abstract
Mobile element insertions (MEIs) are repetitive genomic sequences that contribute to genetic variation and can lead to genetic disorders. Targeted and whole-genome approaches using short-read sequencing have been developed to identify reference and non-reference MEIs; however, the read length hampers detection of these elements in complex genomic regions. Here, we pair Cas9-targeted nanopore sequencing with computational methodologies to capture active MEIs in human genomes. We demonstrate parallel enrichment for distinct classes of MEIs, averaging 44% of reads on-targeted signals and exhibiting a 13.4-54x enrichment over whole-genome approaches. We show an individual flow cell can recover most MEIs (97% L1Hs, 93% AluYb, 51% AluYa, 99% SVA_F, and 65% SVA_E). We identify seventeen non-reference MEIs in GM12878 overlooked by modern, long-read analysis pipelines, primarily in repetitive genomic regions. This work introduces the utility of nanopore sequencing for MEI enrichment and lays the foundation for rapid discovery of elusive, repetitive genetic elements.

18
F-Seq2: improving the feature density based peak caller with dynamic statistics. NAR Genom Bioinform 2021; 3:lqab012. [PMID: 33655209 PMCID: PMC7902237 DOI: 10.1093/nargab/lqab012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/06/2020] [Revised: 01/06/2021] [Accepted: 02/04/2021] [Indexed: 01/15/2023]
Abstract
Genomic and epigenomic features are captured at a genome-wide level by using high-throughput sequencing (HTS) technologies. Peak calling delineates features identified in HTS experiments, such as open chromatin regions and transcription factor binding sites, by comparing the observed read distributions to a random expectation. Since its introduction, F-Seq has been widely used and shown to be the most sensitive and accurate peak caller for DNase I hypersensitive site (DNase-seq) data. However, the first release (F-Seq1) has two key limitations: lack of support for user-input control datasets, and poor test statistic reporting. These constrain its ability to capture systematic and experimental biases inherent to the background distributions in peak prediction, and to subsequently rank predicted peaks by confidence. To address these limitations, we present F-Seq2, which combines kernel density estimation and a dynamic 'continuous' Poisson test to account for local biases and accurately rank candidate peaks. The output of F-Seq2 is suitable for irreproducible discovery rate analysis as test statistics are calculated for individual candidate summits, allowing direct comparison of predictions across replicates. These improvements significantly boost the performance of F-Seq2 for ATAC-seq and ChIP-seq datasets, outperforming competing peak callers used by the ENCODE Consortium in terms of precision and recall.
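The two ingredients named in this abstract, kernel density estimation over read positions and a Poisson test against a background rate, can be sketched as follows. This is a simplified illustration, not F-Seq2's code: it uses a plain Gaussian kernel and an ordinary integer-count Poisson tail rather than the 'continuous' Poisson test the abstract describes, and the function names, bandwidth, and window size are assumptions.

```python
import math

def gaussian_kde_signal(read_positions, grid, bandwidth):
    """Sum of Gaussian kernels centred on each read, evaluated on `grid`."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return [
        sum(norm * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
            for p in read_positions)
        for x in grid
    ]

def window_count(read_positions, center, half_width):
    """Number of reads within `half_width` bases of `center`."""
    return sum(1 for p in read_positions if abs(p - center) <= half_width)

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the CDF."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# Toy data: five reads clustered near position 100 and one stray read.
reads = [98, 99, 100, 101, 102, 500]
grid = list(range(0, 601, 100))
signal = gaussian_kde_signal(reads, grid, bandwidth=10.0)
summit = grid[signal.index(max(signal))]

# Hypothetical background rate: 0.5 reads expected per 41-bp window.
k = window_count(reads, summit, half_width=20)
print(summit, k, poisson_sf(k, lam=0.5))
```

In a control-aware caller the background rate would be estimated locally, for example from a control dataset, instead of being fixed as it is here.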
|
19
|
MapGL: inferring evolutionary gain and loss of short genomic sequence features by phylogenetic maximum parsimony. BMC Bioinformatics 2020; 21:416. [PMID: 32962625 PMCID: PMC7510305 DOI: 10.1186/s12859-020-03742-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/13/2019] [Accepted: 09/07/2020] [Indexed: 12/26/2022] Open
Abstract
Background Comparative genomics studies are growing in number partly because of their unique ability to provide insight into shared and divergent biology between species. Of particular interest is the use of phylogenetic methods to infer the evolutionary history of cis-regulatory sequence features, which contribute strongly to phenotypic divergence and are frequently gained and lost in eutherian genomes. Understanding the mechanisms by which cis-regulatory element turnover generates emergent phenotypes is crucial to our understanding of adaptive evolution. Ancestral reconstruction methods can place species-specific cis-regulatory features in their evolutionary context, thus increasing our understanding of the process of regulatory sequence turnover. However, applying these methods to gain and loss of cis-regulatory features historically required complex workflows, preventing widespread adoption by the broad scientific community. Results MapGL simplifies phylogenetic inference of the evolutionary history of short genomic sequence features by combining the necessary steps into a single piece of software with a simple set of inputs and outputs. We show that MapGL can reliably disambiguate the mechanisms underlying differential regulatory sequence content across a broad range of phylogenetic topologies and evolutionary distances. Thus, MapGL provides the necessary context to evaluate how genomic sequence gain and loss contribute to species-specific divergence. Conclusions MapGL makes phylogenetic inference of species-specific sequence gain and loss easy for both expert and non-expert users, making it a powerful tool for gaining novel insights into genome evolution.
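As a rough illustration of how maximum parsimony can separate gain from loss, the sketch below runs the bottom-up pass of Fitch's algorithm on presence/absence states. It is not taken from MapGL: the tree, the species names, the nested-tuple encoding, and the choice of Fitch's algorithm as the parsimony method are assumptions made for this example.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a binary tree of nested 2-tuples.

    Leaves are species names (str); `states` maps each name to 1 (feature
    present) or 0 (feature absent). Returns the set of most-parsimonious
    states at the root and the minimum number of state changes required.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical topology with dog as the outgroup.
tree = ((("human", "chimp"), "mouse"), "dog")
presence = {"human": 1, "chimp": 0, "mouse": 0, "dog": 0}

root_states, changes = fitch(tree, presence)
print(root_states, changes)
```

Here the root is reconstructed as absent with a single change, so the feature is most simply explained as a gain on the human lineage rather than as independent losses in the other three species.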
|
20
|
TRACE: transcription factor footprinting using chromatin accessibility data and DNA sequence. Genome Res 2020; 30:1040-1046. [PMID: 32660981 PMCID: PMC7397869 DOI: 10.1101/gr.258228.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/10/2019] [Accepted: 06/26/2020] [Indexed: 02/06/2023]
Abstract
Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors (TFs) can bind. Thus, identification of TF binding sites (TFBSs) is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches used for TFBS prediction, such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq), are widely used but have their drawbacks, including high false-positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns; however, these also have limitations. We have developed a footprinting method to predict TF footprints in active chromatin elements (TRACE) to improve the prediction of TFBS footprints. TRACE incorporates DNase-seq data and PWMs within a multivariate hidden Markov model (HMM) to detect footprint-like regions with matching motifs. TRACE is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement for pregenerated candidate binding sites or ChIP-seq training data. Compared with published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.
|
21
|
Transposable elements contribute to cell and species-specific chromatin looping and gene regulation in mammalian genomes. Nat Commun 2020; 11:1796. [PMID: 32286261 PMCID: PMC7156512 DOI: 10.1038/s41467-020-15520-5] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 07/02/2019] [Accepted: 03/14/2020] [Indexed: 01/12/2023] Open
Abstract
Chromatin looping is important for gene regulation, and studies of 3D chromatin structure across species and cell types have improved our understanding of the principles governing chromatin looping. However, 3D genome evolution and its relationship with natural selection remains largely unexplored. In mammals, the CTCF protein defines the boundaries of most chromatin loops, and variations in CTCF occupancy are associated with looping divergence. While many CTCF binding sites fall within transposable elements (TEs), their contribution to 3D chromatin structural evolution is unknown. Here we report the relative contributions of TE-driven CTCF binding site expansions to conserved and divergent chromatin looping in human and mouse. We demonstrate that TE-derived CTCF binding divergence may explain a large fraction of variable loops. These variable loops contribute significantly to corresponding gene expression variability across cells and species, possibly by refining sub-TAD-scale loop contacts responsible for cell-type-specific enhancer-promoter interactions.
|
22
|
Poly-Enrich: count-based methods for gene set enrichment testing with genomic regions. NAR Genom Bioinform 2020; 2:lqaa006. [PMID: 32051932 PMCID: PMC7003681 DOI: 10.1093/nargab/lqaa006] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 07/25/2019] [Revised: 12/12/2019] [Accepted: 01/30/2020] [Indexed: 12/13/2022] Open
Abstract
Gene set enrichment (GSE) testing enhances the biological interpretation of ChIP-seq data and other large sets of genomic regions. Our group has previously introduced two GSE methods for genomic regions: ChIP-Enrich for narrow regions and Broad-Enrich for broad regions. Here, we introduce Poly-Enrich, which has wider applicability, additional capabilities and models the number of peaks assigned to a gene using a generalized additive model with a negative binomial family to determine gene set enrichment, while adjusting for gene locus length. As opposed to ChIP-Enrich, Poly-Enrich works well even when nearly all genes have a peak, illustrated by using Poly-Enrich to characterize pathways and types of genic regions enriched with different families of repetitive elements. By comparing Poly-Enrich and ChIP-Enrich results with ENCODE ChIP-seq data, we found that the optimal test depends more on the pathway being regulated than on properties of the transcription factors. Using known transcription factor functions, we discovered clusters of related biological processes consistently better modeled with Poly-Enrich. This suggests that the regulation of certain processes may be modified by multiple binding events, better modeled by a count-based method. Our new hybrid method automatically uses the optimal method for each gene set, with correct FDR-adjustment.
|
23
|
Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay. Hum Mutat 2019; 40:1280-1291. [PMID: 31106481 PMCID: PMC6879779 DOI: 10.1002/humu.23797] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 01/18/2019] [Revised: 04/17/2019] [Accepted: 05/15/2019] [Indexed: 12/25/2022]
Abstract
The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated human enhancers and nine disease-associated promoters. A library of mutations covering all bases was generated by saturation mutagenesis and altered activity was assessed in a massively parallel reporter assay (MPRA) in relevant cell lines. Reporter expression was measured relative to plasmid DNA to determine the impact of variants. The challenge was to predict the functional effects of variants on reporter expression. Comparative analysis of the full range of submitted prediction results identifies the most successful models of transcription factor binding sites, machine learning algorithms, and ways to choose among or incorporate diverse datatypes and cell-types for training computational models. These results have the potential to improve the design of future studies on more diverse sets of regulatory elements and aid the interpretation of disease-associated genetic variation.
|
24
|
Predicting the effects of SNPs on transcription factor binding affinity. Bioinformatics 2019; 36:364-372. [PMID: 31373606 PMCID: PMC7999143 DOI: 10.1093/bioinformatics/btz612] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 03/27/2019] [Revised: 07/15/2019] [Accepted: 08/01/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Genome-wide association studies have revealed that 88% of disease-associated single-nucleotide polymorphisms (SNPs) reside in noncoding regions. However, noncoding SNPs remain understudied, partly because they are challenging to prioritize for experimental validation. To address this deficiency, we developed the SNP effect matrix pipeline (SEMpl). RESULTS SEMpl estimates transcription factor-binding affinity by observing differences in chromatin immunoprecipitation followed by deep sequencing signal intensity for SNPs within functional transcription factor-binding sites (TFBSs) genome-wide. By cataloging the effects of every possible mutation within the TFBS motif, SEMpl can predict the consequences of SNPs to transcription factor binding. This knowledge can be used to identify potential disease-causing regulatory loci. AVAILABILITY AND IMPLEMENTATION SEMpl is available from https://github.com/Boyle-Lab/SEM_CPP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
25
|
CGIMP: Real-time exploration and covariate projection for self-organizing map datasets. JOURNAL OF OPEN SOURCE SOFTWARE 2019; 4:1520. [PMID: 32500114 PMCID: PMC7272009 DOI: 10.21105/joss.01520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
|
26
|
Predicting functional variants in enhancer and promoter elements using RegulomeDB. Hum Mutat 2019; 40:1292-1298. [PMID: 31228310 DOI: 10.1002/humu.23791] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 01/03/2019] [Revised: 04/26/2019] [Accepted: 05/09/2019] [Indexed: 01/07/2023]
Abstract
Here we present a computational model, Score of Unified Regulatory Features (SURF), that predicts functional variants in enhancer and promoter elements. SURF is trained on data from massively parallel reporter assays and predicts the effect of variants on reporter expression levels. It achieved the top performance in the Fifth Critical Assessment of Genome Interpretation "Regulation Saturation" challenge. We also show that features queried through RegulomeDB, which are direct annotations from functional genomics data, help improve prediction accuracy beyond transfer learning features from DNA sequence-based deep learning models. Some of the most important features include DNase footprints, especially when coupled with complementary ChIP-seq data. Furthermore, we found our model achieved good performance in predicting allele-specific transcription factor binding events. As an extension to the current scoring system in RegulomeDB, we expect our computational model to prioritize variants in regulatory regions, thus helping the understanding of functional variants in noncoding regions that lead to disease.
|
27
|
Conserved and species-specific transcription factor co-binding patterns drive divergent gene regulation in human and mouse. Nucleic Acids Res 2018; 46:1878-1894. [PMID: 29361190 PMCID: PMC5829737 DOI: 10.1093/nar/gky018] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 09/26/2017] [Revised: 12/15/2017] [Accepted: 01/08/2018] [Indexed: 12/24/2022] Open
Abstract
The mouse is widely used as a system to study human genetic mechanisms. However, extensive rewiring of transcriptional regulatory networks often confounds translation of findings between human and mouse. Site-specific gain and loss of individual transcription factor binding sites (TFBS) has caused functional divergence of orthologous regulatory loci, and so we must look beyond this positional conservation to understand common themes of regulatory control. Fortunately, transcription factor co-binding patterns shared across species often perform conserved regulatory functions. These can be compared to 'regulatory sentences' that retain the same meanings regardless of sequence and species context. By analyzing TFBS co-occupancy patterns observed in four human and mouse cell types, we learned a regulatory grammar: the rules by which TFBS are combined into meaningful regulatory sentences. Different parts of this grammar associate with specific sets of functional annotations regardless of sequence conservation and predict functional signatures more accurately than positional conservation. We further show that both species-specific and conserved portions of this grammar are involved in gene expression divergence and human disease risk. These findings expand our understanding of transcriptional regulatory mechanisms, suggesting that phenotypic divergence and disease risk are driven by a complex interplay between deeply conserved and species-specific transcriptional regulatory pathways.
|
28
|
Genome-wide Study of Atrial Fibrillation Identifies Seven Risk Loci and Highlights Biological Pathways and Regulatory Elements Involved in Cardiac Development. Am J Hum Genet 2018; 102:103-115. [PMID: 29290336 DOI: 10.1016/j.ajhg.2017.12.003] [Citation(s) in RCA: 63] [Impact Index Per Article: 10.5] [Received: 03/02/2017] [Accepted: 12/04/2017] [Indexed: 01/03/2023] Open
Abstract
Atrial fibrillation (AF) is a common cardiac arrhythmia and a major risk factor for stroke, heart failure, and premature death. The pathogenesis of AF remains poorly understood, which contributes to the current lack of highly effective treatments. To understand the genetic variation and biology underlying AF, we undertook a genome-wide association study (GWAS) of 6,337 AF individuals and 61,607 AF-free individuals from Norway, including replication in an additional 30,679 AF individuals and 278,895 AF-free individuals. Through genotyping and dense imputation mapping from whole-genome sequencing, we tested almost nine million genetic variants across the genome and identified seven risk loci, including two novel loci. One novel locus (lead single-nucleotide variant [SNV] rs12614435; p = 6.76 × 10⁻¹⁸) comprised intronic and several highly correlated missense variants situated in the I-, A-, and M-bands of titin, which is the largest protein in humans and responsible for the passive elasticity of heart and skeletal muscle. The other novel locus (lead SNV rs56202902; p = 1.54 × 10⁻¹¹) covered a large, gene-dense chromosome 1 region that has previously been linked to cardiac conduction. Pathway and functional enrichment analyses suggested that many AF-associated genetic variants act through a mechanism of impaired muscle cell differentiation and tissue formation during fetal heart development.
|
29
|
A proximity-based graph clustering method for the identification and application of transcription factor clusters. BMC Bioinformatics 2017; 18:530. [PMID: 29187152 PMCID: PMC5706350 DOI: 10.1186/s12859-017-1935-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/26/2017] [Accepted: 11/14/2017] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Transcription factors (TFs) form a complex regulatory network within the cell that is crucial to cell functioning and human health. While methods to establish where a TF binds to DNA are well established, these methods provide no information describing how TFs interact with one another when they do bind. TFs tend to bind the genome in clusters, and current methods to identify these clusters are either limited in scope, unable to detect relationships beyond motif similarity, or not applied to TF-TF interactions. METHODS Here, we present a proximity-based graph clustering approach to identify TF clusters using either ChIP-seq or motif search data. We use TF co-occurrence to construct a filtered, normalized adjacency matrix and use the Markov Clustering Algorithm to partition the graph while maintaining TF-cluster and cluster-cluster interactions. We then apply our graph structure beyond clustering, using it to increase the accuracy of motif-based TFBS searching for an example TF. RESULTS We show that our method produces small, manageable clusters that encapsulate many known, experimentally validated transcription factor interactions and that our method is capable of capturing interactions that motif similarity methods might miss. Our graph structure is able to significantly increase the accuracy of motif TFBS searching, demonstrating that the TF-TF connections within the graph correlate with biological TF-TF interactions. CONCLUSION The interactions identified by our method correspond to biological reality and allow for fast exploration of TF clustering and regulatory dynamics.
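The Markov Clustering Algorithm mentioned in the methods alternates two matrix operations, expansion and inflation, on a column-normalized adjacency matrix. The sketch below is a minimal pure-Python version under stated assumptions: the toy adjacency matrix, the inflation value of 2, the fixed iteration count, and the helper names are illustrative and are not taken from the paper, which builds its matrix from filtered TF co-occurrence data.

```python
def normalize_columns(m):
    """Scale every column so it sums to 1 (a column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion step: square the matrix (two steps of a random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[v ** power for v in row] for row in m])

def mcl(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Return clusters as sorted lists of node indices."""
    n = len(adjacency)
    # Self-loops are commonly added so that every node retains some flow.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > eps) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)

# Hypothetical co-occurrence graph: two triangles of factors with no links between them.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(adjacency))
```

Higher inflation values generally yield smaller, tighter clusters, which is the usual knob for tuning cluster granularity in this algorithm.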
|
30
|
Protein-altering and regulatory genetic variants near GATA4 implicated in bicuspid aortic valve. Nat Commun 2017; 8:15481. [PMID: 28541271 PMCID: PMC5458508 DOI: 10.1038/ncomms15481] [Citation(s) in RCA: 73] [Impact Index Per Article: 10.4] [Received: 10/05/2016] [Accepted: 03/31/2017] [Indexed: 01/09/2023] Open
Abstract
Bicuspid aortic valve (BAV) is a heritable congenital heart defect and an important risk factor for valvulopathy and aortopathy. Here we report a genome-wide association scan of 466 BAV cases and 4,660 age, sex and ethnicity-matched controls with replication in up to 1,326 cases and 8,103 controls. We identify association with a noncoding variant 151 kb from the gene encoding the cardiac-specific transcription factor, GATA4, and near-significance for p.Ser377Gly in GATA4. GATA4 was interrupted by CRISPR-Cas9 in induced pluripotent stem cells from healthy donors. The disruption of GATA4 significantly impaired the transition from endothelial cells into mesenchymal cells, a critical step in heart valve development.
|
31
|
Mining the Unknown: Assigning Function to Noncoding Single Nucleotide Polymorphisms. Trends Genet 2016; 33:34-45. [PMID: 27939749 DOI: 10.1016/j.tig.2016.10.008] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 09/12/2016] [Revised: 10/30/2016] [Accepted: 10/31/2016] [Indexed: 11/18/2022]
Abstract
One of the formative goals of genetics research is to understand how genetic variation leads to phenotypic differences and human disease. Genome-wide association studies (GWASs) bring us closer to this goal by linking variation with disease faster than ever before. Despite this, GWASs alone are unable to pinpoint disease-causing single nucleotide polymorphisms (SNPs). Noncoding SNPs, which represent the majority of GWAS SNPs, present a particular challenge. To address this challenge, an array of computational tools designed to prioritize and predict the function of noncoding GWAS SNPs have been developed. However, fewer than 40% of GWAS publications from 2015 utilized these tools. We discuss several leading methods for annotating noncoding variants and how they can be integrated into research pipelines in hopes that they will be broadly applied in future GWAS analyses.
|
32
|
Pattern matching approach to pseudosymmetry problems in electron backscatter diffraction. Ultramicroscopy 2016; 160:146-154. [DOI: 10.1016/j.ultramic.2015.10.010] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/03/2015] [Revised: 10/09/2015] [Accepted: 10/11/2015] [Indexed: 10/22/2022]
|
33
|
Mango: a bias-correcting ChIA-PET analysis pipeline. Bioinformatics 2015; 31:3092-8. [PMID: 26034063 DOI: 10.1093/bioinformatics/btv336] [Citation(s) in RCA: 91] [Impact Index Per Article: 10.1] [Received: 01/19/2015] [Accepted: 05/26/2015] [Indexed: 01/04/2023] Open
Abstract
MOTIVATION Chromatin Interaction Analysis by Paired-End Tag sequencing (ChIA-PET) is an established method for detecting genome-wide looping interactions at high resolution. Current ChIA-PET analysis software packages either fail to correct for non-specific interactions due to genomic proximity or only address a fraction of the steps required for data processing. We present Mango, a complete ChIA-PET data analysis pipeline that provides statistical confidence estimates for interactions and corrects for major sources of bias including differential peak enrichment and genomic proximity. RESULTS Comparison to the existing software packages, ChIA-PET Tool and ChiaSig, revealed that Mango interactions exhibit much better agreement with high-resolution Hi-C data. Importantly, Mango executes all steps required for processing ChIA-PET datasets, whereas ChiaSig only completes 20% of the required steps. Application of Mango to multiple available ChIA-PET datasets permitted the independent rediscovery of known trends in chromatin loops including enrichment of CTCF, RAD21, SMC3 and ZNF143 at the anchor regions of interactions and strong bias for convergent CTCF motifs. AVAILABILITY AND IMPLEMENTATION Mango is open source and distributed through GitHub at https://github.com/dphansti/mango. CONTACT mpsnyder@stanford.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
34
|
A comparative encyclopedia of DNA elements in the mouse genome. Nature 2015; 515:355-64. [PMID: 25409824 PMCID: PMC4266106 DOI: 10.1038/nature13992] [Citation(s) in RCA: 1135] [Impact Index Per Article: 126.1] [Received: 02/03/2014] [Accepted: 10/24/2014] [Indexed: 12/11/2022]
Abstract
The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.
|
35
|
Principles of regulatory information conservation between mouse and human. Nature 2015; 515:371-375. [PMID: 25409826 PMCID: PMC4343047 DOI: 10.1038/nature13985] [Citation(s) in RCA: 189] [Impact Index Per Article: 21.0] [Received: 02/05/2014] [Accepted: 10/21/2014] [Indexed: 11/09/2022]
Abstract
To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.
|
36
|
Sushi.R: flexible, quantitative and integrative genomic visualizations for publication-quality multi-panel figures. Bioinformatics 2014; 30:2808-10. [PMID: 24903420 PMCID: PMC4173017 DOI: 10.1093/bioinformatics/btu379] [Citation(s) in RCA: 122] [Impact Index Per Article: 12.2] [Indexed: 11/14/2022] Open
Abstract
Motivation: Interpretation and communication of genomic data require flexible and quantitative tools to analyze and visualize diverse data types, and yet, a comprehensive tool to display all common genomic data types in publication quality figures does not exist to date. To address this shortcoming, we present Sushi.R, an R/Bioconductor package that allows flexible integration of genomic visualizations into highly customizable, publication-ready, multi-panel figures from common genomic data formats including Browser Extensible Data (BED), bedGraph and Browser Extensible Data Paired-End (BEDPE). Sushi.R is open source and made publicly available through GitHub (https://github.com/dphansti/Sushi) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/Sushi.html). Contact: mpsnyder@stanford.edu or dphansti@stanford.edu
|
37
|
Abstract
The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.
|
38
|
Dynamic trans-acting factor colocalization in human cells. Cell 2013; 155:713-24. [PMID: 24243024 DOI: 10.1016/j.cell.2013.09.043] [Citation(s) in RCA: 105] [Impact Index Per Article: 9.5] [Received: 05/07/2013] [Revised: 07/13/2013] [Accepted: 08/27/2013] [Indexed: 01/02/2023]
Abstract
Different trans-acting factors (TFs) collaborate and act in concert at distinct loci to perform accurate regulation of their target genes. To date, the cobinding of TF pairs has been investigated in a limited context both in terms of the number of factors within a cell type and across cell types and the extent of combinatorial colocalizations. Here, we use an approach to analyze TF colocalization within a cell type and across multiple cell lines at an unprecedented level. We extend this approach with large-scale mass spectrometry analysis of immunoprecipitations of 50 TFs. Our combined approach reveals large numbers of interesting TF-TF associations. We observe extensive change in TF colocalizations both within a cell type exposed to different conditions and across multiple cell types. We show distinct functional annotations and properties of different TF cobinding patterns and provide insights into the complex regulatory landscape of the cell.
|
39
|
Abstract
Genome-wide association studies have been successful in identifying single nucleotide polymorphisms (SNPs) associated with a large number of phenotypes. However, an associated SNP is likely part of a larger region of linkage disequilibrium. This makes it difficult to precisely identify the SNPs that have a biological link with the phenotype. We have systematically investigated the association of multiple types of ENCODE data with disease-associated SNPs and show that there is significant enrichment for functional SNPs among the currently identified associations. This enrichment is strongest when integrating multiple sources of functional information and when highest confidence disease-associated SNPs are used. We propose an approach that integrates multiple types of functional data generated by the ENCODE Consortium to help identify “functional SNPs” that may be associated with the disease phenotype. Our approach generates putative functional annotations for up to 80% of all previously reported associations. We show that for most associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself. Our results show that the experimental data sets generated by the ENCODE Consortium can be successfully used to suggest functional hypotheses for variants associated with diseases and other phenotypes.
|
40
|
Abstract
As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 full sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences.
|
41
|
Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012; 148:1293-307. [PMID: 22424236 DOI: 10.1016/j.cell.2012.02.009] [Citation(s) in RCA: 845] [Impact Index Per Article: 70.4] [Received: 10/11/2011] [Revised: 01/27/2012] [Accepted: 02/04/2012] [Indexed: 12/18/2022]
Abstract
Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.
|
42
|
High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Res 2010; 21:456-64. [PMID: 21106903 DOI: 10.1101/gr.112656.110] [Citation(s) in RCA: 235] [Impact Index Per Article: 16.8] [Indexed: 01/11/2023]
Abstract
Regulation of gene transcription in diverse cell types is determined largely by varied sets of cis-elements where transcription factors bind. Here we demonstrate that data from a single high-throughput DNase I hypersensitivity assay can delineate hundreds of thousands of base-pair resolution in vivo footprints in human cells that precisely mark individual transcription factor-DNA interactions. These annotations provide a unique resource for the investigation of cis-regulatory elements. We find that footprints for specific transcription factors correlate with ChIP-seq enrichment and can accurately identify functional versus nonfunctional transcription factor motifs. We also find that footprints reveal a unique evolutionary conservation pattern that differentiates functional footprinted bases from surrounding DNA. Finally, detailed analysis of CTCF footprints suggests multiple modes of binding and a novel DNA binding motif upstream of the primary binding site.
|
43
|
Global epigenomic analysis of primary human pancreatic islets provides insights into type 2 diabetes susceptibility loci. Cell Metab 2010; 12:443-55. [PMID: 21035756 PMCID: PMC3026436 DOI: 10.1016/j.cmet.2010.09.012] [Citation(s) in RCA: 144] [Impact Index Per Article: 10.3] [Received: 05/07/2010] [Revised: 07/22/2010] [Accepted: 08/26/2010] [Indexed: 01/17/2023]
Abstract
Identifying cis-regulatory elements is important to understanding how human pancreatic islets modulate gene expression in physiologic or pathophysiologic (e.g., diabetic) conditions. We conducted genome-wide analysis of DNase I hypersensitive sites, histone H3 lysine methylation modifications (K4me1, K4me3, K79me2), and CCCTC factor (CTCF) binding in human islets. This identified ∼18,000 putative promoters (several hundred unannotated and islet-active). Surprisingly, active promoter modifications were absent at genes encoding islet-specific hormones, suggesting a distinct regulatory mechanism. Of 34,039 distal (nonpromoter) regulatory elements, 47% are islet unique and 22% are CTCF bound. In the 18 type 2 diabetes (T2D)-associated loci, we identified 118 putative regulatory elements and confirmed enhancer activity for 12 of 33 tested. Among six regulatory elements harboring T2D-associated variants, two exhibit significant allele-specific differences in activity. These findings present a global snapshot of the human islet epigenome and should provide functional context for noncoding variants emerging from genetic studies of T2D and other islet disorders.
|
44
|
Abstract
The extent to which variation in chromatin structure and transcription factor binding may influence gene expression, and thus underlie or contribute to variation in phenotype, is unknown. To address this question, we cataloged both individual-to-individual variation and differences between homologous chromosomes within the same individual (allele-specific variation) in chromatin structure and transcription factor binding in lymphoblastoid cells derived from individuals of geographically diverse ancestry. Ten percent of active chromatin sites were individual-specific; a similar proportion were allele-specific. Both individual-specific and allele-specific sites were commonly transmitted from parent to child, which suggests that they are heritable features of the human genome. Our study shows that heritable chromatin status and transcription factor binding differ as a result of genetic variation and may underlie phenotypic variation in humans.
|
45
|
Abstract
cERMIT is a computationally efficient motif discovery tool based on analyzing genome-wide quantitative regulatory evidence. Instead of pre-selecting promising candidate sequences, it utilizes information across all sequence regions to search for high-scoring motifs. We apply cERMIT on a range of direct binding and overexpression datasets; it substantially outperforms state-of-the-art approaches on curated ChIP-chip datasets, and easily scales to current mammalian ChIP-seq experiments with data on thousands of non-coding regions.
|
46
|
Abstract
A subtelomeric region, 4q35.2, is implicated in facioscapulohumeral muscular dystrophy (FSHD), a dominant disease thought to involve local pathogenic changes in chromatin. FSHD patients have too few copies of a tandem 3.3-kb repeat (D4Z4) at 4q35.2. No phenotype is associated with having few copies of an almost identical repeat at 10q26.3. Standard expression analyses have not given definitive answers as to the genes involved. To investigate the pathogenic effects of short D4Z4 arrays on gene expression in the very gene-poor 4q35.2 and to find chromatin landmarks there for transcription control, unannotated genes and chromatin structure, we mapped DNaseI-hypersensitive (DH) sites in FSHD and control myoblasts. Using custom tiling arrays (DNase-chip), we found unexpectedly many DH sites in the two large gene deserts in this 4-Mb region. One site was seen preferentially in FSHD myoblasts. Several others were mapped >0.7 Mb from genes known to be active in the muscle lineage and were also observed in cultured fibroblasts, but not in lymphoid, myeloid or hepatic cells. Their selective occurrence in cells derived from mesoderm suggests functionality. Our findings indicate that the gene desert regions of 4q35.2 may have functional significance, possibly also to FSHD, despite their paucity of known genes.
|
47
|
Both noncoding and protein-coding RNAs contribute to gene expression evolution in the primate brain. Genome Biol Evol 2010; 2:67-79. [PMID: 20333225 PMCID: PMC2839352 DOI: 10.1093/gbe/evq002] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Accepted: 01/11/2010] [Indexed: 12/22/2022] Open
Abstract
Despite striking differences in cognition and behavior between humans and our closest primate relatives, several studies have found little evidence for adaptive change in protein-coding regions of genes expressed primarily in the brain. Instead, changes in gene expression may underlie many cognitive and behavioral differences. Here, we used digital gene expression: tag profiling (here called Tag-Seq, also called DGE:tag profiling) to assess changes in global transcript abundance in the frontal cortex of the brains of 3 humans, 3 chimpanzees, and 3 rhesus macaques. A substantial fraction of transcripts we identified as differentially transcribed among species were not assayed in previous studies based on microarrays. Differentially expressed tags within coding regions are enriched for gene functions involved in synaptic transmission, transport, oxidative phosphorylation, and lipid metabolism. Importantly, because Tag-Seq technology provides strand-specific information about all polyadenylated transcripts, we were able to assay expression in noncoding intragenic regions, including both sense and antisense noncoding transcripts (relative to nearby genes). We find that many noncoding transcripts are conserved in both location and expression level between species, suggesting a possible functional role. Lastly, we examined the overlap between differential gene expression and signatures of positive selection within putative promoter regions, a sign that these differences represent adaptations during human evolution. Comparative approaches may provide important insights into genes responsible for differences in cognitive functions between humans and nonhuman primates, as well as highlighting new candidate genes for studies investigating neurological disorders.
|
48
|
Abstract
Microarray and high-throughput sequencing technologies have enabled the development of comprehensive assays to identify locations of particular chromatin structures and regulatory elements. It is now possible to create genome-wide maps of DNA methylation, trans-factor binding sites, histone variants and histone tail modifications, nucleosome positions, regions of open chromatin, and chromosome locations and interactions. This review provides a summary of these new assays that are changing the way in which molecular biology research is being performed. While the generation of large amounts of data from these experiments is becoming increasingly easy, the development of corresponding analysis methods has progressed more slowly. It will likely be years before the full extent of the information contained in these data is appreciated.
|
49
|
Abstract
Tag sequencing using high-throughput sequencing technologies is now regularly employed to identify specific sequence features, such as transcription factor binding sites (ChIP-seq) or regions of open chromatin (DNase-seq). To intuitively summarize and display individual sequence data as an accurate and interpretable signal, we developed F-Seq, a software package that generates a continuous tag sequence density estimation allowing identification of biologically meaningful sites whose output can be displayed directly in the UCSC Genome Browser. Availability: The software is written in the Java language and is available on all major computing platforms for download at http://www.genome.duke.edu/labs/furey/software/fseq.
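The idea of a continuous tag density can be illustrated with a minimal Python sketch. This is not F-Seq's implementation (which is written in Java); it assumes a plain Gaussian kernel with a fixed bandwidth, and the function name `tag_density`, the bandwidth value, and the tag positions are all hypothetical choices made for illustration.

```python
import math


def tag_density(tag_positions, start, end, bandwidth=50.0):
    """Gaussian kernel density of tag positions at each base in [start, end).

    Assumption: a fixed-bandwidth Gaussian kernel; real tools may choose
    the kernel and bandwidth differently.
    """
    if not tag_positions:
        raise ValueError("at least one tag position is required")
    # Normalisation so that the density integrates to roughly 1.
    norm = 1.0 / (len(tag_positions) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in range(start, end):
        # Sum the kernel contribution of every tag at coordinate x.
        total = sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in tag_positions
        )
        density.append(norm * total)
    return density


# Hypothetical tag (read start) coordinates: a cluster near 1000 plus
# two isolated background tags.
tags = [980, 990, 1000, 1005, 1010, 1020, 1500, 1800]
density = tag_density(tags, 800, 2000, bandwidth=30.0)

# Coordinate with the highest estimated density.
peak = 800 + max(range(len(density)), key=density.__getitem__)
print("highest density at position", peak)
```

In this toy input the clustered tags reinforce one another, so the highest density falls inside the cluster, while isolated tags produce only low, separate bumps. A smaller bandwidth gives sharper but noisier signal; a larger one smooths neighbouring sites together.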
|
50
|
High-resolution mapping and characterization of open chromatin across the genome. Cell 2008; 132:311-22. [PMID: 18243105 DOI: 10.1016/j.cell.2007.12.014] [Citation(s) in RCA: 1003] [Impact Index Per Article: 62.7] [Received: 08/13/2007] [Revised: 10/22/2007] [Accepted: 12/04/2007] [Indexed: 12/28/2022]
Abstract
Mapping DNase I hypersensitive (HS) sites is an accurate method of identifying the location of genetic regulatory elements, including promoters, enhancers, silencers, insulators, and locus control regions. We employed high-throughput sequencing and whole-genome tiled array strategies to identify DNase I HS sites within human primary CD4+ T cells. Combining these two technologies, we have created a comprehensive and accurate genome-wide open chromatin map. Surprisingly, only 16%-21% of the identified 94,925 DNase I HS sites are found in promoters or first exons of known genes, but nearly half of the most open sites are in these regions. In conjunction with expression, motif, and chromatin immunoprecipitation data, we find evidence of cell-type-specific characteristics, including the ability to identify transcription start sites and locations of different chromatin marks utilized in these cells. In addition, and unexpectedly, our analyses have uncovered detailed features of nucleosome structure.
|