1
Knight HR, Ketter E, Ung T, Weiss A, Ajit J, Chen Q, Shen J, Ip KM, Chiang CY, Barreiro L, Esser-Kahn A. High-throughput screen identifies non inflammatory small molecule inducers of trained immunity. Proc Natl Acad Sci U S A 2024; 121:e2400413121. [PMID: 38976741 PMCID: PMC11260140 DOI: 10.1073/pnas.2400413121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 05/29/2024] [Indexed: 07/10/2024] Open
Abstract
Trained immunity is characterized by epigenetic and metabolic reprogramming in response to specific stimuli. This rewiring can result in increased cytokine and effector responses to pathogenic challenges, providing nonspecific protection against disease. It may also improve immune responses to established immunotherapeutics and vaccines. Despite its promise for next-generation therapeutic design, most current understanding and experimentation is conducted with complex and heterogeneous biologically derived molecules, such as β-glucan or the Bacillus Calmette-Guérin (BCG) vaccine. This limited collection of training compounds also limits the study of the genes most involved in training responses as each molecule has both training and nontraining effects. Small molecules with tunable pharmacokinetics and delivery modalities would both assist in the study of trained immunity and its future applications. To identify small molecule inducers of trained immunity, we screened a library of 2,000 drugs and drug-like compounds. Identification of well-defined compounds can improve our understanding of innate immune memory and broaden the scope of its clinical applications. We identified over two dozen small molecules in several chemical classes that induce a training phenotype in the absence of initial immune activation, a current limitation of reported inducers of training. A surprising result was the identification of glucocorticoids, traditionally considered immunosuppressive, providing an unprecedented link between glucocorticoids and trained innate immunity. We chose seven of these top candidates to characterize and establish training activity in vivo. In this work, we expand the number of compounds known to induce trained immunity, creating alternative avenues for studying and applying innate immune training.
Affiliation(s)
- Hannah Riley Knight
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
- Ellen Ketter
  - Biological Sciences Division, University of Chicago, Chicago, IL 60637
- Trevor Ung
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
- Adam Weiss
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
- Jainu Ajit
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
- Qing Chen
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
- Jingjing Shen
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
- Ka Man Ip
  - Biological Sciences Division, University of Chicago, Chicago, IL 60637
- Chun-yi Chiang
  - Biological Sciences Division, University of Chicago, Chicago, IL 60637
- Luis Barreiro
  - Biological Sciences Division, University of Chicago, Chicago, IL 60637
- Aaron Esser-Kahn
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637
2
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
3
Xu J, Gao J, Ni P, Gerstein M. Less-is-more: selecting transcription factor binding regions informative for motif inference. Nucleic Acids Res 2024; 52:e20. [PMID: 38214231 PMCID: PMC10899791 DOI: 10.1093/nar/gkad1240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 12/06/2023] [Accepted: 12/17/2023] [Indexed: 01/13/2024] Open
Abstract
Numerous statistical methods have emerged for inferring DNA motifs for transcription factors (TFs) from genomic regions. However, the process of selecting informative regions for motif inference remains understudied. Current approaches select regions with strong ChIP-seq signal for a given TF, assuming that such strong signal primarily results from specific interactions between the TF and its motif. Additionally, these selection approaches do not account for non-target motifs, i.e. motifs of other TFs; they presume the occurrence of these non-target motifs infrequent compared to that of the target motif, and thus assume these have minimal interference with the identification of the target. Leveraging extensive ChIP-seq datasets, we introduced the concept of TF signal 'crowdedness', referred to as C-score, for each genomic region. The C-score helps in highlighting TF signals arising from non-specific interactions. Moreover, by considering the C-score (and adjusting for the length of genomic regions), we can effectively mitigate interference of non-target motifs. Using these tools, we find that in many instances, strong ChIP-seq signal stems mainly from non-specific interactions, and the occurrence of non-target motifs significantly impacts the accurate inference of the target motif. Prioritizing genomic regions with reduced crowdedness and short length markedly improves motif inference. This 'less-is-more' effect suggests that ChIP-seq region selection warrants more attention.
Affiliation(s)
- Jinrui Xu
  - Department of Biology, Howard University, Washington, DC 20059, USA
  - Center for Applied Data Science and Analytics, Howard University, Washington, DC 20059, USA
- Jiahao Gao
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Pengyu Ni
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Mark Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Computer Science, Yale University, New Haven, CT 06520, USA
  - Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA
4
Fan K, Pfister E, Weng Z. Toward a comprehensive catalog of regulatory elements. Hum Genet 2023; 142:1091-1111. [PMID: 36935423 DOI: 10.1007/s00439-023-02519-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2022] [Accepted: 01/03/2023] [Indexed: 03/21/2023]
Abstract
Regulatory elements are the genomic regions that interact with transcription factors to control cell-type-specific gene expression in different cellular environments. A precise and complete catalog of functional elements encoded by the human genome is key to understanding mammalian gene regulation. Here, we review the current state of regulatory element annotation. We first provide an overview of assays for characterizing functional elements, including genome, epigenome, transcriptome, three-dimensional chromatin interaction, and functional validation assays. We then discuss computational methods for defining regulatory elements, including peak-calling and other statistical modeling methods. Finally, we introduce several high-quality lists of regulatory element annotations and suggest potential future directions.
Affiliation(s)
- Kaili Fan
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, 368 Plantation Street, ASC5-1069, Worcester, MA, 01605, USA
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, 02138, USA
- Edith Pfister
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, 368 Plantation Street, ASC5-1069, Worcester, MA, 01605, USA
- Zhiping Weng
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, 368 Plantation Street, ASC5-1069, Worcester, MA, 01605, USA
5
Jalili V, Cremona MA, Palluzzi F. Rescuing biologically relevant consensus regions across replicated samples. BMC Bioinformatics 2023; 24:240. [PMID: 37286963 PMCID: PMC10246347 DOI: 10.1186/s12859-023-05340-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 05/16/2023] [Indexed: 06/09/2023] Open
Abstract
BACKGROUND: Protein-DNA binding sites of ChIP-seq experiments are identified where the binding affinity is significant based on a given threshold. The choice of the threshold is a trade-off between conservative region identification and discarding weak, but true binding sites. RESULTS: We rescue weak binding sites using MSPC, which efficiently exploits replicates to lower the threshold required to identify a site while keeping a low false-positive rate, and we compare it to IDR, a widely used post-processing method for identifying highly reproducible peaks across replicates. We observe several master transcription regulators (e.g., SP1 and GATA3) and HDAC2-GATA1 regulatory networks on rescued regions in K562 cell line. CONCLUSIONS: We argue the biological relevance of weak binding sites and the information they add when rescued by MSPC. An implementation of the proposed extended MSPC methodology and the scripts to reproduce the performed analysis are freely available at https://genometric.github.io/MSPC/ ; MSPC is distributed as a command-line application and an R package available from Bioconductor ( https://doi.org/doi:10.18129/B9.bioc.rmspc ).
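The idea of using replicates to lower the per-replicate threshold can be illustrated with a minimal sketch. It assumes that overlapping peaks from independent replicates are combined with Fisher's combined probability test; that choice, the function names, and the cutoff values are illustrative assumptions and are not taken from the abstract, so the MSPC documentation remains the reference for the actual procedure.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose survival function has a closed form for
    even degrees of freedom, so no external library is needed.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, len(p_values)):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def rescue_weak_peak(p_values, weak_cutoff=1e-4, stringent_cutoff=1e-8):
    """Hypothetical rescue rule: every replicate must pass the weak cutoff,
    and the combined evidence must pass the stringent cutoff."""
    if any(p > weak_cutoff for p in p_values):
        return False
    return fisher_combined_p(p_values) <= stringent_cutoff
```

With these assumed cutoffs, a peak seen at p = 1e-6 in a single replicate would not pass the stringent cutoff, while the same peak observed at p = 1e-6 in two replicates would be rescued.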
Affiliation(s)
- Vahid Jalili
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Marzia A Cremona
  - Department of Operations and Decision Systems, Université Laval, Quebec, Canada
  - CHU de Québec - Université Laval Research Center, Quebec, Canada
- Fernando Palluzzi
  - Department of Brain and Behavioral Sciences, Università di Pavia, Pavia, Italy
6
Kanoh Y, Ueno M, Hayano M, Kudo S, Masai H. Aberrant association of chromatin with nuclear periphery induced by Rif1 leads to mitotic defect. Life Sci Alliance 2023; 6:e202201603. [PMID: 36750367 PMCID: PMC9909590 DOI: 10.26508/lsa.202201603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Revised: 01/23/2023] [Accepted: 01/24/2023] [Indexed: 02/09/2023] Open
Abstract
The architecture and nuclear location of chromosomes affect chromatin events. Rif1, a crucial regulator of replication timing, recognizes G-quadruplex and inhibits origin firing over the 50-100-kb segment in fission yeast, Schizosaccharomyces pombe, leading us to postulate that Rif1 may generate chromatin higher order structures inhibitory for initiation. However, the effects of Rif1 on chromatin localization in nuclei have not been known. We show here that Rif1 overexpression causes growth inhibition and eventually, cell death in fission yeast. Chromatin-binding activity of Rif1, but not recruitment of phosphatase PP1, is required for growth inhibition. Overexpression of a PP1-binding site mutant of Rif1 does not delay the S-phase, but still causes cell death, indicating that cell death is caused not by S-phase problems but by issues in other phases of the cell cycle, most likely the M-phase. Indeed, Rif1 overexpression generates cells with unequally segregated chromosomes. Rif1 overexpression relocates chromatin near nuclear periphery in a manner dependent on its chromatin-binding ability, and this correlates with growth inhibition. Thus, coordinated progression of S- and M-phases may require regulated Rif1-mediated chromatin association with the nuclear periphery.
Affiliation(s)
- Yutaka Kanoh
  - Department of Basic Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Masaru Ueno
  - Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Motoshi Hayano
  - Department of Neuropsychiatry, Keio University, Tokyo, Japan
- Satomi Kudo
  - Department of Basic Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Hisao Masai
  - Department of Basic Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
7
Teng M. Statistical Analysis in ChIP-seq-Related Applications. Methods Mol Biol 2023; 2629:169-181. [PMID: 36929078 DOI: 10.1007/978-1-0716-2986-4_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/27/2023]
Abstract
Chromatin immunoprecipitation sequencing (ChIP-seq) has been widely performed to identify protein binding information along the genome. The sequencing protocol is quite flexible and mature to measure different types of protein binding as long as sequencing parameters are properly tailored to accommodate protein features. Two distinct types of protein binding are point-source-like binding by transcription factors and diffused-distribution binding by histone modifications. Consequently, statistical approaches have been proposed to address ChIP-seq-related questions according to different protein features. In this chapter, we briefly summarize statistical principles, approaches, and tools that are widely implemented in modeling ChIP-seq data, from raw data quality control to final result reporting. We discuss the key solutions in addressing eight routine questions in ChIP-seq applications. We also include discussion on approaches fitting unique data features in different ChIP-seq types. We hope this chapter will serve as a brief guide, especially for ChIP-seq beginners, to provide them with a high-level overview to understand and design processing plans for their ChIP-seq experiments.
Affiliation(s)
- Mingxiang Teng
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
8
Hentges LD, Sergeant MJ, Cole CB, Downes DJ, Hughes JR, Taylor S. LanceOtron: a deep learning peak caller for genome sequencing experiments. Bioinformatics 2022; 38:4255-4263. [PMID: 35866989 PMCID: PMC9477537 DOI: 10.1093/bioinformatics/btac525] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/14/2021] [Revised: 05/10/2022] [Accepted: 07/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Genome sequencing experiments have revolutionized molecular biology by allowing researchers to identify important DNA-encoded elements genome wide. Regions where these elements are found appear as peaks in the analog signal of an assay's coverage track, and despite the ease with which humans can visually categorize these patterns, the size of many genomes necessitates algorithmic implementations. Commonly used methods focus on statistical tests to classify peaks, discounting that the background signal does not completely follow any known probability distribution and reducing the information-dense peak shapes to simply maximum height. Deep learning has been shown to be highly accurate for many pattern recognition tasks, on par or even exceeding human capabilities, providing an opportunity to reimagine and improve peak calling. RESULTS: We present the peak calling framework LanceOtron, which combines deep learning for recognizing peak shape with multifaceted enrichment calculations for assessing significance. In benchmarking ATAC-seq, ChIP-seq and DNase-seq, LanceOtron outperforms long-standing, gold-standard peak callers through its improved selectivity and near-perfect sensitivity. AVAILABILITY AND IMPLEMENTATION: A fully featured web application is freely available from LanceOtron.molbiol.ox.ac.uk, command line interface via python is pip installable from PyPI at https://pypi.org/project/lanceotron/, and source code and benchmarking tests are available at https://github.com/LHentges/LanceOtron. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lance D Hentges
  - MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Martin J Sergeant
  - MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Christopher B Cole
  - MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Damien J Downes
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Jim R Hughes
  - MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Stephen Taylor
  - MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
9
A review on method entities in the academic literature: extraction, evaluation, and application. Scientometrics 2022. [DOI: 10.1007/s11192-022-04332-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
10
Molina-Sánchez MD, García-Rodríguez FM, Andrés-León E, Toro N. Identification of Group II Intron RmInt1 Binding Sites in a Bacterial Genome. Front Mol Biosci 2022; 9:834020. [PMID: 35281263 PMCID: PMC8914252 DOI: 10.3389/fmolb.2022.834020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2021] [Accepted: 02/07/2022] [Indexed: 11/13/2022] Open
Abstract
RmInt1 is a group II intron encoding a reverse transcriptase protein (IEP) lacking the C-terminal endonuclease domain. RmInt1 is an efficient mobile retroelement that predominantly reverse splices into the transient single-stranded DNA at the template for lagging strand DNA synthesis during host replication, a process facilitated by the interaction of the RmInt1 IEP with DnaN at the replication fork. It has been suggested that group II intron ribonucleoprotein particles bind DNA nonspecifically, and then scan for their correct target site. In this study, we investigated RmInt1 binding sites throughout the Sinorhizobium meliloti genome, by chromatin-immunoprecipitation coupled with next-generation sequencing. We found that RmInt1 binding sites cluster around the bidirectional replication origin of each of the three replicons comprising the S. meliloti genome. Our results provide new evidence linking group II intron mobility to host DNA replication.
Affiliation(s)
- María Dolores Molina-Sánchez
  - Structure, Dynamics and Function of Rhizobacterial Genomes, Estación Experimental del Zaidín, Department of Soil Microbiology and Symbiotic Systems, Spanish National Research Council (CSIC), Granada, Spain
- Fernando Manuel García-Rodríguez
  - Structure, Dynamics and Function of Rhizobacterial Genomes, Estación Experimental del Zaidín, Department of Soil Microbiology and Symbiotic Systems, Spanish National Research Council (CSIC), Granada, Spain
- Eduardo Andrés-León
  - Bioinformatics Unit, Institute of Parasitology and Biomedicine “López-Neyra” (IPBLN), Spanish National Research Council (CSIC), Granada, Spain
- Nicolás Toro
  - Structure, Dynamics and Function of Rhizobacterial Genomes, Estación Experimental del Zaidín, Department of Soil Microbiology and Symbiotic Systems, Spanish National Research Council (CSIC), Granada, Spain
  - *Correspondence: Nicolás Toro,
11
Adetunji MO, Abraham BJ. SEAseq: a portable and cloud-based chromatin occupancy analysis suite. BMC Bioinformatics 2022; 23:77. [PMID: 35193506 PMCID: PMC8864840 DOI: 10.1186/s12859-022-04588-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2021] [Accepted: 01/28/2022] [Indexed: 11/26/2022] Open
Abstract
Background: Genome-wide protein-DNA binding is popularly assessed using specific antibody pulldown in Chromatin Immunoprecipitation Sequencing (ChIP-Seq) or Cleavage Under Targets and Release Using Nuclease (CUT&RUN) sequencing experiments. These technologies generate high-throughput sequencing data that necessitate the use of multiple sophisticated, computationally intensive genomic tools to make discoveries, but these genomic tools often have a high barrier to use because of computational resource constraints. Results: We present a comprehensive, infrastructure-independent, computational pipeline called SEAseq, which leverages field-standard, open-source tools for processing and analyzing ChIP-Seq/CUT&RUN data. SEAseq performs extensive analyses from the raw output of the experiment, including alignment, peak calling, motif analysis, promoters and metagene coverage profiling, peak annotation distribution, clustered/stitched peaks (e.g. super-enhancer) identification, and multiple relevant quality assessment metrics, as well as automatic interfacing with data in GEO/SRA. SEAseq enables rapid and cost-effective resource for analysis of both new and publicly available datasets as demonstrated in our comparative case studies. Conclusions: The easy-to-use and versatile design of SEAseq makes it a reliable and efficient resource for ensuring high quality analysis. Its cloud implementation enables a broad suite of analyses in environments with constrained computational resources. SEAseq is platform-independent and is aimed to be usable by everyone with or without programming skills. It is available on the cloud at https://platform.stjude.cloud/workflows/seaseq and can be locally installed from the repository at https://github.com/stjude/seaseq. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04588-z.
Affiliation(s)
- Modupeore O Adetunji
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
- Brian J Abraham
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
12
Suryatenggara J, Yong KJ, Tenen DE, Tenen DG, Bassal MA. ChIP-AP: an integrated analysis pipeline for unbiased ChIP-seq analysis. Brief Bioinform 2021; 23:6489109. [PMID: 34965583 PMCID: PMC8769893 DOI: 10.1093/bib/bbab537] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/16/2021] [Revised: 11/02/2021] [Accepted: 11/19/2021] [Indexed: 12/15/2022] Open
Abstract
Chromatin immunoprecipitation coupled with sequencing (ChIP-seq) is a technique used to identify protein–DNA interaction sites through antibody pull-down, sequencing and analysis; with enrichment ‘peak’ calling being the most critical analytical step. Benchmarking studies have consistently shown that peak callers have distinct selectivity and specificity characteristics that are not additive and seldom completely overlap in many scenarios, even after parameter optimization. We therefore developed ChIP-AP, an integrated ChIP-seq analysis pipeline utilizing four independent peak callers, which seamlessly processes raw sequencing files to final result. This approach enables (1) better gauging of peak confidence through detection by multiple algorithms, and (2) more thoroughly surveys the binding landscape by capturing peaks not detected by individual callers. Final analysis results are then integrated into a single output table, enabling users to explore their data by applying selectivity and sensitivity thresholds that best address their biological questions, without needing any additional reprocessing. ChIP-AP therefore presents investigators with a more comprehensive coverage of the binding landscape without requiring additional wet-lab observations.
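Gauging peak confidence by the number of callers that detect a region can be sketched as follows. This is a minimal illustration, not ChIP-AP's implementation: the caller names, the half-open `(chrom, start, end)` interval convention, and the function name are hypothetical. Overlapping peaks from all callers are merged into union regions, and each region records which callers support it.

```python
from collections import defaultdict


def consensus_regions(peaks_by_caller):
    """Merge overlapping peaks across callers and list the supporting callers.

    peaks_by_caller maps a caller name to a list of (chrom, start, end)
    tuples with half-open coordinates. Returns a list of
    (chrom, start, end, sorted_caller_names) tuples.
    """
    by_chrom = defaultdict(list)
    for caller, peaks in peaks_by_caller.items():
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end, caller))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        callers = set()
        for start, end, caller in sorted(by_chrom[chrom]):
            if cur_end is not None and start < cur_end:
                # Overlaps the open region: extend it and record the caller.
                cur_end = max(cur_end, end)
                callers.add(caller)
            else:
                if cur_end is not None:
                    merged.append((chrom, cur_start, cur_end, sorted(callers)))
                cur_start, cur_end, callers = start, end, {caller}
        if cur_end is not None:
            merged.append((chrom, cur_start, cur_end, sorted(callers)))
    return merged


# Hypothetical input: two callers, one shared region and one caller-specific peak.
example = {
    "callerA": [("chr1", 100, 200)],
    "callerB": [("chr1", 150, 250), ("chr1", 500, 600)],
}
regions = consensus_regions(example)
```

Filtering `regions` by the length of the caller list then gives a simple trade-off: requiring more callers favors confidence, while accepting single-caller regions favors coverage of the binding landscape.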
Affiliation(s)
- Jeremiah Suryatenggara
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, 117599, Singapore
- Kol Jia Yong
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, 117599, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117597, Singapore
- Daniel G Tenen
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, 117599, Singapore
  - Harvard Stem Cell Institute, Boston, 02138, USA
- Mahmoud A Bassal
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, 117599, Singapore
  - Harvard Stem Cell Institute, Boston, 02138, USA
13
Ferré Q, Chèneby J, Puthier D, Capponi C, Ballester B. Anomaly detection in genomic catalogues using unsupervised multi-view autoencoders. BMC Bioinformatics 2021; 22:460. [PMID: 34563116 PMCID: PMC8467021 DOI: 10.1186/s12859-021-04359-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2020] [Revised: 06/04/2021] [Accepted: 08/09/2021] [Indexed: 11/13/2022] Open
Abstract
Background: Accurate identification of Transcriptional Regulator binding locations is essential for analysis of genomic regions, including Cis Regulatory Elements. The customary NGS approaches, predominantly ChIP-Seq, can be obscured by data anomalies and biases which are difficult to detect without supervision. Results: Here, we develop a method to leverage the usual combinations between many experimental series to mark such atypical peaks. We use deep learning to perform a lossy compression of the genomic regions’ representations with multiview convolutions. Using artificial data, we show that our method correctly identifies groups of correlating series and evaluates CRE according to group completeness. It is then applied to the ReMap database’s large volume of curated ChIP-seq data. We show that peaks lacking known biological correlators are singled out and less confirmed in real data. We propose normalization approaches useful in interpreting black-box models. Conclusion: Our approach detects peaks that are less corroborated than average. It can be extended to other similar problems, and can be interpreted to identify correlation groups. It is implemented in an open-source tool called atyPeak. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04359-2.
Affiliation(s)
- Quentin Ferré
  - INSERM, TAGC, Aix Marseille University, Marseille, France
  - Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France
- Jeanne Chèneby
  - INSERM, TAGC, Aix Marseille University, Marseille, France
- Denis Puthier
  - INSERM, TAGC, Aix Marseille University, Marseille, France
- Cécile Capponi
  - Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France
14
Meiler A, Marchiano F, Haering M, Weitkunat M, Schnorrer F, Habermann BH. AnnoMiner is a new web-tool to integrate epigenetics, transcription factor occupancy and transcriptomics data to predict transcriptional regulators. Sci Rep 2021; 11:15463. [PMID: 34326396 PMCID: PMC8322331 DOI: 10.1038/s41598-021-94805-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/16/2021] [Accepted: 07/14/2021] [Indexed: 11/23/2022] Open
Abstract
Gene expression regulation requires precise transcriptional programs, led by transcription factors in combination with epigenetic events. Recent advances in epigenomic and transcriptomic techniques provided insight into different gene regulation mechanisms. However, to date it remains challenging to understand how combinations of transcription factors together with epigenetic events control cell-type specific gene expression. We have developed the AnnoMiner web-server, an innovative and flexible tool to annotate and integrate epigenetic, and transcription factor occupancy data. First, AnnoMiner annotates user-provided peaks with gene features. Second, AnnoMiner can integrate genome binding data from two different transcriptional regulators together with gene features. Third, AnnoMiner offers to explore the transcriptional deregulation of genes nearby, or within a specified genomic region surrounding a user-provided peak. AnnoMiner’s fourth function performs transcription factor or histone modification enrichment analysis for user-provided gene lists by utilizing hundreds of public, high-quality datasets from ENCODE for the model organisms human, mouse, Drosophila and C. elegans. Thus, AnnoMiner can predict transcriptional regulators for a studied process without the strict need for chromatin data from the same process. We compared AnnoMiner to existing tools and experimentally validated several transcriptional regulators predicted by AnnoMiner to indeed contribute to muscle morphogenesis in Drosophila. AnnoMiner is freely available at http://chimborazo.ibdm.univ-mrs.fr/AnnoMiner/.
Affiliation(s)
- Arno Meiler
- Max Planck Institute of Biochemistry, Am Klopferspitz 18, 82152, Martinsried, Germany
| | - Fabio Marchiano
- Aix-Marseille University, CNRS, IBDM UMR 7288, The Turing Centre for Living systems (CENTURI), Aix-Marseille University, Parc Scientifique de Luminy Case 907, 163, Avenue de Luminy, 13009, Marseille, France
| | - Margaux Haering
- Aix-Marseille University, CNRS, IBDM UMR 7288, The Turing Centre for Living systems (CENTURI), Aix-Marseille University, Parc Scientifique de Luminy Case 907, 163, Avenue de Luminy, 13009, Marseille, France
| | - Manuela Weitkunat
- Max Planck Institute of Biochemistry, Am Klopferspitz 18, 82152, Martinsried, Germany
| | - Frank Schnorrer
- Max Planck Institute of Biochemistry, Am Klopferspitz 18, 82152, Martinsried, Germany; Aix-Marseille University, CNRS, IBDM UMR 7288, The Turing Centre for Living systems (CENTURI), Aix-Marseille University, Parc Scientifique de Luminy Case 907, 163, Avenue de Luminy, 13009, Marseille, France
| | - Bianca H Habermann
- Max Planck Institute of Biochemistry, Am Klopferspitz 18, 82152, Martinsried, Germany; Aix-Marseille University, CNRS, IBDM UMR 7288, The Turing Centre for Living systems (CENTURI), Aix-Marseille University, Parc Scientifique de Luminy Case 907, 163, Avenue de Luminy, 13009, Marseille, France
| |
|
15
|
Piao Y, Xu W, Park KH, Ryu KH, Xiang R. Comprehensive Evaluation of Differential Methylation Analysis Methods for Bisulfite Sequencing Data. Int J Environ Res Public Health 2021; 18:ijerph18157975. [PMID: 34360271 PMCID: PMC8345583 DOI: 10.3390/ijerph18157975] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/10/2021] [Revised: 07/19/2021] [Accepted: 07/20/2021] [Indexed: 12/13/2022]
Abstract
Background: With advances in next-generation sequencing technologies, the bisulfite conversion of genomic DNA followed by sequencing has become the predominant technique for quantifying genome-wide DNA methylation at single-base resolution. A large number of computational approaches are available in literature for identifying differentially methylated regions in bisulfite sequencing data, and more are being developed continuously. Results: Here, we focused on a comprehensive evaluation of commonly used differential methylation analysis methods and describe the potential strengths and limitations of each method. We found that there are large differences among methods, and no single method consistently ranked first in all benchmarking. Moreover, smoothing seemed not to improve the performance greatly, and a small number of replicates created more difficulties in the computational analysis of BS-seq data than low sequencing depth. Conclusions: Data analysis and interpretation should be performed with great care, especially when the number of replicates or sequencing depth is limited.
Affiliation(s)
- Yongjun Piao
- School of Medicine, Nankai University, Tianjin 300071, China
- Tianjin Key Laboratory of Human Development and Reproductive Regulation, Tianjin Central Hospital of Gynecology Obstetrics, Tianjin 300199, China
| | - Wanxue Xu
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing 100191, China
| | - Kwang Ho Park
- Department of Computer Science, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
| | - Keun Ho Ryu
- Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
- Correspondence: (K.H.R.); (R.X.)
| | - Rong Xiang
- School of Medicine, Nankai University, Tianjin 300071, China
- Correspondence: (K.H.R.); (R.X.)
| |
|
16
|
Serra F, Bottini S, Pratella D, Stathopoulou MG, Sebille W, El-Hami L, Repetto E, Mauduit C, Benahmed M, Grandjean V, Trabucchi M. Systemic CLIP-seq analysis and game theory approach to model microRNA mode of binding. Nucleic Acids Res 2021; 49:e66. [PMID: 33823551 PMCID: PMC8216473 DOI: 10.1093/nar/gkab198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/08/2020] [Revised: 02/19/2021] [Accepted: 03/10/2021] [Indexed: 12/18/2022] Open
Abstract
microRNAs (miRNAs) associate with Ago proteins to post-transcriptionally silence gene expression by targeting mRNAs. To characterize the modes of miRNA-binding, we developed a novel computational framework, called optiCLIP, which considers the reproducibility of the identified peaks among replicates based on the peak overlap. We identified 98 999 binding sites for mouse and human miRNAs, from eleven Ago2 CLIP-seq datasets. Clustering the binding preferences, we found heterogeneity of the mode of binding for different miRNAs. Finally, we set up a quantitative model, named miRgame, based on an adaptation of game theory. We have developed a new algorithm to translate the miRgame into a score that corresponds to a miRNA degree of occupancy for each Ago2 peak. The degree of occupancy summarizes the number of miRNA-binding sites and miRNAs targeting each binding site, and the binding energy of each miRNA::RNA heteroduplex in each peak. Ago peaks were stratified according to the degree of occupancy. Target repression correlates with a higher degree-of-occupancy score and a higher number of miRNA-binding sites within each Ago peak. We validated the biological performance of our new method on miR-155-5p. In conclusion, our data demonstrate that miRNA-binding sites within each Ago2 CLIP-seq peak synergistically interplay to enhance target repression.
Affiliation(s)
- Fabrizio Serra
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Silvia Bottini
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - David Pratella
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Maria G Stathopoulou
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Wanda Sebille
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Loubna El-Hami
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Emanuela Repetto
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Claire Mauduit
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Mohamed Benahmed
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Valerie Grandjean
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| | - Michele Trabucchi
- Inserm U1065, C3M, Team Control of Gene Expression (10), Nice, France; Université Côte d'Azur, Inserm, C3M, Nice, France
| |
|
17
|
Beacon TH, Delcuve GP, López C, Nardocci G, Kovalchuk I, van Wijnen AJ, Davie JR. The dynamic broad epigenetic (H3K4me3, H3K27ac) domain as a mark of essential genes. Clin Epigenetics 2021; 13:138. [PMID: 34238359 PMCID: PMC8264473 DOI: 10.1186/s13148-021-01126-1] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Received: 02/25/2021] [Accepted: 06/30/2021] [Indexed: 02/06/2023] Open
Abstract
Transcriptionally active chromatin is marked by tri-methylation of histone H3 at lysine 4 (H3K4me3) located after first exons and around transcription start sites. This epigenetic mark is typically restricted to narrow regions at the 5′ end of the gene body, though a small subset of genes have a broad H3K4me3 domain which extensively covers the coding region. Although most studies focus on the H3K4me3 mark, the broad H3K4me3 domain is associated with a plethora of histone modifications (e.g., H3 acetylated at K27) and is therein termed broad epigenetic domain. Genes marked with the broad epigenetic domain are involved in cell identity and essential cell functions and have clinical potential as biomarkers for patient stratification. Reducing expression of genes with the broad epigenetic domain may increase the metastatic potential of cancer cells. Enhancers and super-enhancers interact with the broad epigenetic domain marked genes forming a hub of interactions involving nucleosome-depleted regions. Together, the regulatory elements coalesce with transcription factors, chromatin modifying/remodeling enzymes, coactivators, and the Mediator and/or Integrator complex into a transcription factory which may be analogous to a liquid–liquid phase-separated condensate. The broad epigenetic domain has a dynamic chromatin structure which supports frequent transcription bursts. In this review, we present the current knowledge of broad epigenetic domains.
Affiliation(s)
- Tasnim H Beacon
- CancerCare Manitoba Research Institute, CancerCare Manitoba, Winnipeg, MB, R3E 0V9, Canada; Department of Biochemistry and Medical Genetics, University of Manitoba, 745 Bannatyne Avenue, Room 333A, Winnipeg, MB, Canada
| | - Geneviève P Delcuve
- Department of Biochemistry and Medical Genetics, University of Manitoba, 745 Bannatyne Avenue, Room 333A, Winnipeg, MB, Canada
| | - Camila López
- CancerCare Manitoba Research Institute, CancerCare Manitoba, Winnipeg, MB, R3E 0V9, Canada; Department of Biochemistry and Medical Genetics, University of Manitoba, 745 Bannatyne Avenue, Room 333A, Winnipeg, MB, Canada
| | - Gino Nardocci
- Faculty of Medicine, Universidad de Los Andes, Santiago, Chile; Molecular Biology and Bioinformatics Lab, Program in Molecular Biology and Bioinformatics, Center for Biomedical Research and Innovation (CIIB), Universidad de Los Andes, Santiago, Chile
| | - Igor Kovalchuk
- Department of Biological Sciences, University of Lethbridge, Lethbridge, AB, Canada
| | - Andre J van Wijnen
- Department of Orthopedic Surgery, Mayo Clinic, Rochester, MN, USA; Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
| | - James R Davie
- CancerCare Manitoba Research Institute, CancerCare Manitoba, Winnipeg, MB, R3E 0V9, Canada; Department of Biochemistry and Medical Genetics, University of Manitoba, 745 Bannatyne Avenue, Room 333A, Winnipeg, MB, Canada
| |
|
18
|
Menzel M, Hurka S, Glasenhardt S, Gogol-Döring A. NoPeak: k-mer-based motif discovery in ChIP-Seq data without peak calling. Bioinformatics 2021; 37:596-602. [PMID: 32991679 DOI: 10.1093/bioinformatics/btaa845] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/05/2020] [Accepted: 09/14/2020] [Indexed: 01/30/2023] Open
Abstract
Motivation: The discovery of sequence motifs mediating DNA-protein binding usually implies the determination of binding sites using high-throughput sequencing and peak calling. The determination of peaks, however, depends strongly on data quality and is susceptible to noise. Results: Here, we present a novel approach to reliably identify transcription factor-binding motifs from ChIP-Seq data without peak detection. By evaluating the distributions of sequencing reads around the different k-mers in the genome, we are able to identify binding motifs in ChIP-Seq data that yield no results in traditional pipelines. Availability and implementation: NoPeak is published under the GNU General Public License and available as a standalone console-based Java application at https://github.com/menzel/nopeak. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Michael Menzel
- MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
| | - Sabine Hurka
- Institute for Insect Biotechnology, Justus Liebig University, Giessen 35392, Germany
| | - Stefan Glasenhardt
- MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
| | - Andreas Gogol-Döring
- MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
| |
|
19
|
Lee BH, Rhie SK. Molecular and computational approaches to map regulatory elements in 3D chromatin structure. Epigenetics Chromatin 2021; 14:14. [PMID: 33741028 PMCID: PMC7980343 DOI: 10.1186/s13072-021-00390-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/09/2020] [Accepted: 03/08/2021] [Indexed: 12/19/2022] Open
Abstract
Epigenetic marks do not change the sequence of DNA but affect gene expression in a cell-type specific manner by altering the activities of regulatory elements. Development of new molecular biology assays, sequencing technologies, and computational approaches enables us to profile the human epigenome in three-dimensional structure genome-wide. Here we describe various molecular biology techniques and bioinformatic tools that have been developed to measure the activities of regulatory elements and their chromatin interactions. Moreover, we list currently available three-dimensional epigenomic data sets that are generated in various human cell types and tissues to assist in the design and analysis of research projects.
Affiliation(s)
- Beoung Hun Lee
- Department of Biochemistry and Molecular Medicine and the Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90089, USA
| | - Suhn K Rhie
- Department of Biochemistry and Molecular Medicine and the Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90089, USA
| |
|
20
|
Ohnuki H, Venzon DJ, Lobanov A, Tosato G. Iterative epigenomic analyses in the same single cell. Genome Res 2021; 31:1819-1830. [PMID: 33627472 DOI: 10.1101/gr.269068.120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/20/2020] [Accepted: 01/14/2021] [Indexed: 11/24/2022]
Abstract
Gene expression in individual cells is epigenetically regulated by DNA modifications, histone modifications, transcription factors, and other DNA-binding proteins. It has been shown that multiple histone modifications can predict gene expression and reflect future responses of bulk cells to extracellular cues. However, the predictive ability of epigenomic analysis is still limited for mechanistic research at a single cell level. To overcome this limitation, it would be useful to acquire reliable signals from multiple epigenetic marks in the same single cell. Here, we propose a new approach and a new method for analysis of several components of the epigenome in the same single cell. The new method allows reanalysis of the same single cell. We found that reanalysis of the same single cell is feasible, provides confirmation of the epigenetic signals, and allows application of statistical analysis to identify reproduced reads using data sets generated only from the single cell. Reanalysis of the same single cell is also useful to acquire multiple epigenetic marks from the same single cells. The method can acquire at least five epigenetic marks: H3K27ac, H3K27me3, mediator complex subunit 1, a DNA modification, and a DNA-interacting protein. We can predict active signaling pathways in K562 single cells using the epigenetic data and confirm that the predicted results strongly correlate with actual active signaling pathways identified by RNA-seq results. These results suggest that the new method provides mechanistic insights for cellular phenotypes through multilayered epigenome analysis in the same single cells.
Affiliation(s)
- Hidetaka Ohnuki
- Laboratory of Cellular Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - David J Venzon
- Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Rockville, Maryland 20850, USA
| | - Alexei Lobanov
- CCR Collaborative Bioinformatics Resource (CCBR), Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research sponsored by the National Cancer Institute, Frederick, Maryland 21702, USA
| | - Giovanna Tosato
- Laboratory of Cellular Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| |
|
21
|
Awdeh A, Turcotte M, Perkins TJ. WACS: improving ChIP-seq peak calling by optimally weighting controls. BMC Bioinformatics 2021; 22:69. [PMID: 33588754 PMCID: PMC7885521 DOI: 10.1186/s12859-020-03927-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/04/2019] [Accepted: 12/09/2020] [Indexed: 01/21/2023] Open
Abstract
Background: Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results. Results: We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses. Conclusions: This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.
Affiliation(s)
- Aseel Awdeh
- School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada; Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, K1H8L6, Canada
| | - Marcel Turcotte
- School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada
| | - Theodore J Perkins
- School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada; Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, K1H8L6, Canada; Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, K1H8M5, Canada
| |
|
22
|
Jeon H, Lee H, Kang B, Jang I, Roh TY. Comparative analysis of commonly used peak calling programs for ChIP-Seq analysis. Genomics Inform 2021; 18:e42. [PMID: 33412758 PMCID: PMC7808876 DOI: 10.5808/gi.2020.18.4.e42] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/06/2020] [Accepted: 11/22/2020] [Indexed: 11/20/2022] Open
Abstract
Chromatin immunoprecipitation coupled with high-throughput DNA sequencing (ChIP-Seq) is a powerful technology to profile the location of proteins of interest on a whole-genome scale. To identify the enrichment location of proteins, many programs and algorithms have been proposed. However, none of the commonly used peak calling programs could accurately explain the binding features of target proteins detected by ChIP-Seq. Here, publicly available data on 12 histone modifications, including H3K4ac/me1/me2/me3, H3K9ac/me3, H3K27ac/me3, H3K36me3, H3K56ac, and H3K79me1/me2, generated from a human embryonic stem cell line (H1), were profiled with five peak callers (CisGenome, MACS1, MACS2, PeakSeq, and SISSRs). The performance of the peak calling programs was compared in terms of reproducibility between replicates, examination of enriched regions to variable sequencing depths, the specificity-to-noise signal, and sensitivity of peak prediction. There were no major differences among peak callers when analyzing point source histone modifications. The peak calling results from histone modifications with low fidelity, such as H3K4ac, H3K56ac, and H3K79me1/me2, showed low performance in all parameters, which indicates that their peak positions might not be located accurately. Our comparative results could provide a helpful guide to choose a suitable peak calling program for specific histone modifications.
Affiliation(s)
- Hyeongrin Jeon
- Department of Life Sciences, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
| | - Hyunji Lee
- Department of Life Sciences, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
| | - Byunghee Kang
- Department of Life Sciences, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
| | - Insoon Jang
- Department of Life Sciences, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
| | - Tae-Young Roh
- Department of Life Sciences, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea; Division of Integrative Biosciences and Biotechnology, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea; SysGenLab Inc., Pohang 37613, Korea
| |
|
23
|
Xing Z, Carbonetto P, Stephens M. Flexible Signal Denoising via Flexible Empirical Bayes Shrinkage. J Mach Learn Res 2021; 22:93. [PMID: 38149302 PMCID: PMC10751020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2023]
Abstract
Signal denoising (also known as non-parametric regression) is often performed through shrinkage estimation in a transformed (e.g., wavelet) domain; shrinkage in the transformed domain corresponds to smoothing in the original domain. A key question in such applications is how much to shrink, or, equivalently, how much to smooth. Empirical Bayes shrinkage methods provide an attractive solution to this problem; they use the data to estimate a distribution of underlying "effects," hence automatically select an appropriate amount of shrinkage. However, most existing implementations of empirical Bayes shrinkage are less flexible than they could be, both in their assumptions on the underlying distribution of effects and in their ability to handle heteroskedasticity, which limits their signal denoising applications. Here we address this by adopting a particularly flexible, stable and computationally convenient empirical Bayes shrinkage method and applying it to several signal denoising problems. These applications include smoothing of Poisson data and heteroskedastic Gaussian data. We show through empirical comparisons that the results are competitive with other methods, including both simple thresholding rules and purpose-built empirical Bayes procedures. Our methods are implemented in the R package smashr, "SMoothing by Adaptive SHrinkage in R," available at https://www.github.com/stephenslab/smashr.
Affiliation(s)
- Zhengrong Xing
- Department of Statistics, University of Chicago, Chicago, IL 60637, USA
| | - Peter Carbonetto
- Research Computing Center and Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
| | - Matthew Stephens
- Department of Statistics and Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
| |
|
24
|
Choudhury SR, Ashby C, Tytarenko R, Bauer M, Wang Y, Deshpande S, Den J, Schinke C, Zangari M, Thanendrarajan S, Davies FE, van Rhee F, Morgan GJ, Walker BA. The functional epigenetic landscape of aberrant gene expression in molecular subgroups of newly diagnosed multiple myeloma. J Hematol Oncol 2020; 13:108. [PMID: 32762714 PMCID: PMC7409490 DOI: 10.1186/s13045-020-00933-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/17/2019] [Accepted: 02/24/2020] [Indexed: 02/07/2023] Open
Abstract
Background: Multiple Myeloma (MM) is a hematological malignancy with genomic heterogeneity and poor survival outcome. Apart from the central role of genetic lesions, epigenetic anomalies have been identified as drivers in the development of the disease. Methods: Alterations in the DNA methylome were mapped in 52 newly diagnosed MM (NDMM) patients of six molecular subgroups and matched with loci-specific chromatin marks to define their impact on gene expression. Differential DNA methylation analysis was performed using DMAP with a ≥10% increase (hypermethylation) or decrease (hypomethylation) in NDMM subgroups, compared to control samples, considered significant for all the subsequent analyses with p<0.05 after adjusting for a false discovery rate. Results: We identified differentially methylated regions (DMRs) within the etiological cytogenetic subgroups of myeloma, compared to control plasma cells. Using gene expression data we identified genes that are dysregulated and correlate with DNA methylation levels, indicating a role for DNA methylation in their transcriptional control. We demonstrated that 70% of DMRs in the MM epigenome were hypomethylated and overlapped with repressive H3K27me3. In contrast, differentially expressed genes containing hypermethylated DMRs within the gene body or hypomethylated DMRs at the promoters overlapped with H3K4me1, H3K4me3, or H3K36me3 marks. Additionally, enrichment of BRD4 or MED1 at the H3K27ac enriched DMRs functioned as super-enhancers (SE), controlling the overexpression of genes or gene-cassettes. Conclusions: Therefore, this study presents the underlying epigenetic regulatory networks of gene expression dysregulation in NDMM patients and identifies potential targets for future therapies.
Affiliation(s)
- Samrat Roy Choudhury
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Cody Ashby
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Ruslana Tytarenko
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Michael Bauer
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Yan Wang
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Shayu Deshpande
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Judith Den
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Carolina Schinke
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Maurizio Zangari
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | | | - Faith E Davies
- Department of Medicine, NYU Langone Health, New York, NY, 10016, USA
| | - Frits van Rhee
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Gareth J Morgan
- Department of Medicine, NYU Langone Health, New York, NY, 10016, USA
| | - Brian A Walker
- Myeloma Center, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA; Division of Hematology Oncology, Melvin and Bren Simon Comprehensive Cancer Center, Indiana University, Indianapolis, IN, 46202, USA
| |
|
25
|
Benner P, Vingron M. ModHMM: A Modular Supra-Bayesian Genome Segmentation Method. J Comput Biol 2020; 27:442-457. [DOI: 10.1089/cmb.2019.0280] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/29/2022] Open
Affiliation(s)
- Philipp Benner
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| |
|
26
|
Hall TJ, Vernimmen D, Browne JA, Mullen MP, Gordon SV, MacHugh DE, O’Doherty AM. Alveolar Macrophage Chromatin Is Modified to Orchestrate Host Response to Mycobacterium bovis Infection. Front Genet 2020; 10:1386. [PMID: 32117424 PMCID: PMC7020904 DOI: 10.3389/fgene.2019.01386] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 06/06/2019] [Accepted: 12/18/2019] [Indexed: 12/29/2022] Open
Abstract
Bovine tuberculosis is caused by infection with Mycobacterium bovis, which can also cause disease in a range of other mammals, including humans. Alveolar macrophages are the key immune effector cells that first encounter M. bovis and how the macrophage epigenome responds to mycobacterial pathogens is currently not well understood. Here, we have used chromatin immunoprecipitation sequencing (ChIP-seq), RNA-seq and miRNA-seq to examine the effect of M. bovis infection on the bovine alveolar macrophage (bAM) epigenome. We show that H3K4me3 is more prevalent, at a genome-wide level, in chromatin from M. bovis-infected bAM compared to control non-infected bAM; this was particularly evident at the transcriptional start sites of genes that determine programmed macrophage responses to mycobacterial infection (e.g. M1/M2 macrophage polarisation). This pattern was also supported by the distribution of RNA Polymerase II (Pol II) ChIP-seq results, which highlighted significantly increased transcriptional activity at genes demarcated by permissive chromatin. Identification of these genes enabled integration of high-density genome-wide association study (GWAS) data, which revealed genomic regions associated with resilience to infection with M. bovis in cattle. Through integration of these data, we show that bAM transcriptional reprogramming occurs through differential distribution of H3K4me3 and Pol II at key immune genes. Furthermore, this subset of genes can be used to prioritise genomic variants from a relevant GWAS data set.
Affiliation(s)
- Thomas J. Hall
- Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
| | - Douglas Vernimmen
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian, United Kingdom
| | - John A. Browne
- Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
| | - Michael P. Mullen
- Bioscience Research Institute, Athlone Institute of Technology, Athlone, Ireland
| | - Stephen V. Gordon
- UCD School of Veterinary Medicine, University College Dublin, Dublin, Ireland
- UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
| | - David E. MacHugh
- Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
| | - Alan M. O’Doherty
- Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
|
27
|
Yan F, Powell DR, Curtis DJ, Wong NC. From reads to insight: a hitchhiker's guide to ATAC-seq data analysis. Genome Biol 2020; 21:22. [PMID: 32014034 PMCID: PMC6996192 DOI: 10.1186/s13059-020-1929-3] [Citation(s) in RCA: 204] [Impact Index Per Article: 51.0] [Received: 07/22/2019] [Accepted: 01/08/2020] [Indexed: 12/16/2022]
Abstract
Assay of Transposase Accessible Chromatin sequencing (ATAC-seq) is widely used in studying chromatin biology, but a comprehensive review of the analysis tools has not been completed yet. Here, we discuss the major steps in ATAC-seq data analysis, including pre-analysis (quality check and alignment), core analysis (peak calling), and advanced analysis (peak differential analysis and annotation, motif enrichment, footprinting, and nucleosome position analysis). We also review the reconstruction of transcriptional regulatory networks with multiomics data and highlight the current challenges of each step. Finally, we describe the potential of single-cell ATAC-seq and highlight the necessity of developing ATAC-seq specific analysis tools to obtain biologically meaningful insights.
Affiliation(s)
- Feng Yan
- Australian Centre for Blood Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia
| | - David R Powell
- Monash Bioinformatics Platform, Monash University, Melbourne, VIC, Australia
| | - David J Curtis
- Australian Centre for Blood Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia; Department of Clinical Haematology, Alfred Health, Melbourne, VIC, Australia
| | - Nicholas C Wong
- Australian Centre for Blood Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia; Monash Bioinformatics Platform, Monash University, Melbourne, VIC, Australia
|
28
|
Hiranuma N, Lundberg SM, Lee SI. AIControl: replacing matched control experiments with machine learning improves ChIP-seq peak identification. Nucleic Acids Res 2019; 47:e58. [PMID: 30869146 PMCID: PMC6547432 DOI: 10.1093/nar/gkz156] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/16/2018] [Revised: 02/15/2019] [Accepted: 02/28/2019] [Indexed: 01/24/2023]
Abstract
ChIP-seq is a technique to determine binding locations of transcription factors, which remains a central challenge in molecular biology. Current practice is to use a 'control' dataset to remove background signals from an immunoprecipitation (IP) 'target' dataset. We introduce the AIControl framework, which eliminates the need to obtain a control dataset and instead identifies binding peaks by estimating the distributions of background signals from many publicly available control ChIP-seq datasets. We thereby avoid the cost of running control experiments while simultaneously increasing the accuracy of binding location identification. Specifically, AIControl can (i) estimate background signals at fine resolution, (ii) systematically weigh the most appropriate control datasets in a data-driven way, (iii) capture sources of potential biases that may be missed by one control dataset and (iv) remove the need for costly and time-consuming control experiments. We applied AIControl to 410 IP datasets in the ENCODE ChIP-seq database, using 440 control datasets from 107 cell types to impute background signal. Without using matched control datasets, AIControl identified peaks that were more enriched for putative binding sites than those identified by other popular peak callers that used a matched control dataset. We also demonstrated that our framework identifies binding sites that recover documented protein interactions more accurately.
Affiliation(s)
- Naozumi Hiranuma
- Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195-2350
| | - Scott M Lundberg
- Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195-2350
| | - Su-In Lee
- Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195-2350
|
29
|
Kimes PK, Reyes A. Reproducible and replicable comparisons using SummarizedBenchmark. Bioinformatics 2019; 35:137-139. [PMID: 30016409 DOI: 10.1093/bioinformatics/bty627] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/11/2018] [Accepted: 07/12/2018] [Indexed: 11/14/2022]
Abstract
Summary: Benchmark studies are widely used to compare and evaluate tools developed for answering various biological questions. Despite the popularity of these comparisons, the implementation is often ad hoc, with little consistency across studies. To address this problem, we developed SummarizedBenchmark, an R package and framework for organizing and structuring benchmark comparisons. SummarizedBenchmark defines a general grammar for benchmarking and allows for easier setup and execution of benchmark comparisons, while improving the reproducibility and replicability of such comparisons. We demonstrate the wide applicability of our framework using four examples from different applications. Availability and implementation: SummarizedBenchmark is an R package available through Bioconductor (http://bioconductor.org/packages/SummarizedBenchmark). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Patrick K Kimes
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Alejandro Reyes
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
|
30
|
Gheorghe M, Sandve GK, Khan A, Chèneby J, Ballester B, Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res 2019; 47:e21. [PMID: 30517703 PMCID: PMC6393237 DOI: 10.1093/nar/gky1210] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 08/18/2018] [Revised: 10/31/2018] [Accepted: 11/20/2018] [Indexed: 12/11/2022]
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the most popular assay to identify genomic regions, called ChIP-seq peaks, that are bound in vivo by transcription factors (TFs). These regions are derived from direct TF-DNA interactions, indirect binding of the TF to the DNA (through a co-binding partner), nonspecific binding to the DNA, and noise/bias/artifacts. Delineating the bona fide direct TF-DNA interactions within the ChIP-seq peaks remains challenging. We developed a dedicated software, ChIP-eat, that combines computational TF binding models and ChIP-seq peaks to automatically predict direct TF-DNA interactions. Our work culminated with predicted interactions covering >4% of the human genome, obtained by uniformly processing 1983 ChIP-seq peak data sets from the ReMap database for 232 unique TFs. The predictions were a posteriori assessed using protein binding microarray and ChIP-exo data, and were predominantly found in high quality ChIP-seq peaks. The set of predicted direct TF-DNA interactions suggested that high-occupancy target regions are likely not derived from direct binding of the TFs to the DNA. Our predictions derived co-binding TFs supported by protein-protein interaction data and defined cis-regulatory modules enriched for disease- and trait-associated SNPs. We provide this collection of direct TF-DNA interactions and cis-regulatory modules through the UniBind web-interface (http://unibind.uio.no).
Affiliation(s)
- Marius Gheorghe
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
| | | | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
| | - Jeanne Chèneby
- Aix Marseille Université, INSERM, TAGC, Marseille, France
| | | | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway; Department of Cancer Genetics, Institute for Cancer Research, Radiumhospitalet, Oslo, Norway
|
31
|
Rioualen C, Charbonnier-Khamvongsa L, Collado-Vides J, van Helden J. Integrating Bacterial ChIP-seq and RNA-seq Data With SnakeChunks. Current Protocols in Bioinformatics 2019; 66:e72. [PMID: 30786165 PMCID: PMC7302399 DOI: 10.1002/cpbi.72] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
Abstract
Next-generation sequencing (NGS) is becoming a routine approach in most domains of the life sciences. To ensure reproducibility of results, there is a crucial need to improve the automation of NGS data processing and enable forthcoming studies relying on big datasets. Although user-friendly interfaces now exist, there remains a strong need for accessible solutions that allow experimental biologists to analyze and explore their results in an autonomous and flexible way. The protocols here describe a modular system that enables a user to compose and fine-tune workflows based on SnakeChunks, a library of rules for the Snakemake workflow engine. They are illustrated using a study combining ChIP-seq and RNA-seq to identify target genes of the global transcription factor FNR in Escherichia coli, which has the advantage that results can be compared with the most up-to-date collection of existing knowledge about transcriptional regulation in this model organism, extracted from the RegulonDB database. © 2019 by John Wiley & Sons, Inc.
Affiliation(s)
- Claire Rioualen
- Aix-Marseille University, INSERM, Laboratory of Theory and Approaches of Genome Complexity (TAGC), Marseille, France
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
| | - Lucie Charbonnier-Khamvongsa
- Aix-Marseille University, INSERM, Laboratory of Theory and Approaches of Genome Complexity (TAGC), Marseille, France
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts
| | - Jacques van Helden
- Aix-Marseille University, INSERM, Laboratory of Theory and Approaches of Genome Complexity (TAGC), Marseille, France
- Institut Français de Bioinformatique (IFB), UMS 3601-CNRS, Université Paris-Saclay, Orsay, France
|
32
|
Berger S, Pachkov M, Arnold P, Omidi S, Kelley N, Salatino S, van Nimwegen E. Crunch: integrated processing and modeling of ChIP-seq data in terms of regulatory motifs. Genome Res 2019; 29:1164-1177. [PMID: 31138617 PMCID: PMC6633267 DOI: 10.1101/gr.239319.118] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/07/2018] [Accepted: 05/14/2019] [Indexed: 01/10/2023]
Abstract
Although ChIP-seq has become a routine experimental approach for quantitatively characterizing the genome-wide binding of transcription factors (TFs), computational analysis procedures remain far from standardized, making it difficult to compare ChIP-seq results across experiments. In addition, although genome-wide binding patterns must ultimately be determined by local constellations of DNA-binding sites, current analysis is typically limited to identifying enriched motifs in ChIP-seq peaks. Here we present Crunch, a completely automated computational method that performs all ChIP-seq analysis from quality control through read mapping and peak detecting and that integrates comprehensive modeling of the ChIP signal in terms of known and novel binding motifs, quantifying the contribution of each motif and annotating which combinations of motifs explain each binding peak. By applying Crunch to 128 data sets from the ENCODE Project, we show that Crunch outperforms current peak finders and find that TFs naturally separate into "solitary TFs," for which a single motif explains the ChIP-peaks, and "cobinding TFs," for which multiple motifs co-occur within peaks. Moreover, for most data sets, the motifs that Crunch identified de novo outperform known motifs, and both the set of cobinding motifs and the top motif of solitary TFs are consistent across experiments and cell lines. Crunch is implemented as a web server, enabling standardized analysis of any collection of ChIP-seq data sets by simply uploading raw sequencing data. Results are provided both in a graphical web interface and as downloadable files.
Affiliation(s)
- Severin Berger
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
| | - Mikhail Pachkov
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
| | - Phil Arnold
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
| | - Saeed Omidi
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
| | - Nicholas Kelley
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
| | - Silvia Salatino
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
| | - Erik van Nimwegen
- Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
|
33
|
Grytten I, Rand KD, Nederbragt AJ, Storvik GO, Glad IK, Sandve GK. Graph Peak Caller: Calling ChIP-seq peaks on graph-based reference genomes. PLoS Comput Biol 2019; 15:e1006731. [PMID: 30779737 PMCID: PMC6396939 DOI: 10.1371/journal.pcbi.1006731] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 04/06/2018] [Revised: 03/01/2019] [Accepted: 12/19/2018] [Indexed: 11/30/2022]
Abstract
Graph-based representations are considered to be the future for reference genomes, as they allow integrated representation of the steadily increasing data on individual variation. Currently available tools allow de novo assembly of graph-based reference genomes, alignment of new read sets to the graph representation as well as certain analyses like variant calling and haplotyping. We here present a first method for calling ChIP-Seq peaks on read data aligned to a graph-based reference genome. The method is a graph generalization of the peak caller MACS2, and is implemented in an open source tool, Graph Peak Caller. By using the existing tool vg to build a pan-genome of Arabidopsis thaliana, we validate our approach by showing that Graph Peak Caller with a pan-genome reference graph can trace variants within peaks that are not part of the linear reference genome, and find peaks that in general are more motif-enriched than those found by MACS2.
The expression of genes is a tightly regulated process. A key regulatory mechanism is the modulation of transcription by a class of proteins called transcription factors that bind to DNA in the spatial proximity of regulated genes. Determining the binding locations of transcription factors for specific cell types and settings is thus a key step in understanding the dynamics of normal cells as well as disease states. Binding sites for a given transcription factor are typically obtained through an experimental technique called ChIP-seq, in which DNA binding locations are obtained by sequencing DNA fragments attached to the transcription factor and aligning these sequences to a reference genome. A computational technique known as peak calling is then used to separate signal from noise and predict where the protein binds. Current peak callers are based on linear reference genomes that do not contain known genetic variants from the population. They thus potentially miss cases where proteins bind to such alternative genome sequences. Recently, a new type of reference genome based on graph representations has become popular, as it is able to also incorporate alternative genome sequences. We here present Graph Peak Caller, the first peak caller that is able to exploit such graph representations for the detection of transcription factor binding locations. Using a graph-based reference genome for Arabidopsis thaliana, we show that our peak caller can lead to better detection of transcription factor binding locations as compared to a similar existing peak caller that uses a linear reference genome representation.
Affiliation(s)
- Ivar Grytten
- Department of informatics, University of Oslo, Oslo, Norway
| | - Knut D. Rand
- Department of Mathematics, University of Oslo, Oslo, Norway
| | - Alexander J. Nederbragt
- Department of informatics, University of Oslo, Oslo, Norway
- Department of Biosciences, University of Oslo, Oslo, Norway
| | | | - Ingrid K. Glad
- Department of Mathematics, University of Oslo, Oslo, Norway
| | - Geir K. Sandve
- Department of informatics, University of Oslo, Oslo, Norway
|
34
|
Fu S, Wang Q, Moore JE, Purcaro MJ, Pratt HE, Fan K, Gu C, Jiang C, Zhu R, Kundaje A, Lu A, Weng Z. Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers. Nucleic Acids Res 2018; 46:11184-11201. [PMID: 30137428 PMCID: PMC6265487 DOI: 10.1093/nar/gky753] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 02/02/2018] [Revised: 07/15/2018] [Accepted: 08/08/2018] [Indexed: 12/11/2022]
Abstract
Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks.
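The reranking idea in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the abstract does not say how the differential signal or the PR-AUC was computed, so the subtraction in the hypothetical `differential_scores` and the use of average precision as the PR-AUC estimate are both assumptions made for illustration, and the toy numbers are invented.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision, one common estimate of the area under the PR curve.

    scores: ranking score per candidate peak (higher = more likely an enhancer).
    labels: 1 if the peak overlaps a validated enhancer, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank peaks from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


def differential_scores(target: Sequence[float], contrast: Sequence[float]) -> List[float]:
    """Assumed differential signal: target-tissue signal minus contrast-tissue signal."""
    return [t - c for t, c in zip(target, contrast)]


# Toy data (hypothetical): four peaks, two of which are validated enhancers.
target_signal = [0.9, 0.8, 0.7, 0.6]
contrast_signal = [0.1, 0.7, 0.1, 0.5]
validated = [1, 0, 1, 0]

print(average_precision(target_signal, validated))
print(average_precision(differential_scores(target_signal, contrast_signal), validated))
```

In this toy case the peaks that are also strong in the contrast tissue drop down the ranking, so the reranked list places both validated enhancers first.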
Affiliation(s)
- Shaliu Fu
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Qin Wang
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Jill E Moore
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
| | - Michael J Purcaro
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
| | - Henry E Pratt
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
| | - Kaili Fan
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Cuihua Gu
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Cizhong Jiang
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Ruixin Zhu
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Anshul Kundaje
- Department of Genetics, School of Medicine, Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Aiping Lu
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Zhiping Weng
- Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
|
35
|
Wiegreffe D, Müller L, Steuck J, Zeckzer D, Stadler PF. The Sierra Platinum Service for generating peak-calls for replicated ChIP-seq experiments. BMC Res Notes 2018; 11:512. [PMID: 30055643 PMCID: PMC6064048 DOI: 10.1186/s13104-018-3633-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/28/2018] [Accepted: 07/20/2018] [Indexed: 11/10/2022]
Abstract
Objective: Sierra Platinum is a fast and robust peak-caller for replicated ChIP-seq experiments with visual quality-control and -steering. The required computing resources are optimized but still may exceed the resources available to researchers at biological research institutes. Results: Sierra Platinum Service provides the full functionality of Sierra Platinum: using a web interface, a new instance of the service can be generated. Then experimental data is uploaded and the computation of the peaks is started. Upon completion, the results can be inspected interactively and then downloaded for further analysis, at which point the service terminates.
Affiliation(s)
- Daniel Wiegreffe
- Image and Signal Processing Group, Department of Computer Science, University of Leipzig, Augustusplatz 10, 04109, Leipzig, Germany.
| | - Lydia Müller
- Natural Language Processing Department, Department of Computer Science, University of Leipzig, Augustusplatz 10, 04109, Leipzig, Germany
| | - Jens Steuck
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
| | - Dirk Zeckzer
- Image and Signal Processing Group, Department of Computer Science, University of Leipzig, Augustusplatz 10, 04109, Leipzig, Germany
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany; Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany; Max Planck Institute MIS, Inselstraße 22, 04103, Leipzig, Germany; Fraunhofer Institute for Cell Therapy and Immunology IZI, Perlickstraße 1, 04103, Leipzig, Germany; Institute for Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090, Vienna, Austria; Center for Non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, 1870, Copenhagen, Denmark; The Santa Fe Institute, 1399 Hyde Park Road, 87501, Santa Fe, NM, USA
|
36
|
Lichtenberg J, Elnitski L, Bodine DM. SigSeeker: a peak-calling ensemble approach for constructing epigenetic signatures. Bioinformatics 2018; 33:2615-2621. [PMID: 28449120 DOI: 10.1093/bioinformatics/btx276] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/16/2016] [Accepted: 04/20/2017] [Indexed: 11/14/2022]
Abstract
Motivation: Epigenetic data are invaluable when determining the regulatory programs governing a cell. Based on use of next-generation sequencing data for characterizing epigenetic marks and transcription factor binding, numerous peak-calling approaches have been developed to determine sites of genomic significance in these data. Such analyses can produce a large number of false positive predictions, suggesting that sites supported by multiple algorithms provide a stronger foundation for inferring and characterizing regulatory programs associated with the epigenetic data. Few methodologies integrate epigenetic based predictions of multiple approaches when combining profiles generated by different tools. Results: The SigSeeker peak-calling ensemble uses multiple tools to identify peaks, and with user-defined thresholds for peak overlap and signal strength it retains only those peaks that are concordant across multiple tools. Peaks predicted to be co-localized by only a very small number of tools, discovered to be only marginally overlapping, or found to represent significant outliers to the approximation model are removed from the results, providing concise and high quality epigenetic datasets. SigSeeker has been validated using established benchmarks for transcription factor binding and histone modification ChIP-Seq data. These comparisons indicate that the quality of our ensemble technique exceeds that of single tool approaches, enhances existing peak-calling ensembles, and results in epigenetic profiles of higher confidence. Availability and implementation: http://sigseeker.org. Contact: lichtenbergj@mail.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
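The ensemble filter described in this abstract (keep only peaks supported by several tools with sufficient overlap) can be sketched roughly as follows. This is not SigSeeker's implementation: the function names, the `min_tools` and `min_overlap` parameters, and the choice to measure overlap relative to the shorter peak are illustrative assumptions, and the signal-strength and outlier filters mentioned in the abstract are omitted.

```python
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def overlap_fraction(a: Peak, b: Peak) -> float:
    """Fraction of the shorter peak covered by the intersection of a and b."""
    if a[0] != b[0]:
        return 0.0
    intersection = min(a[2], b[2]) - max(a[1], b[1])
    if intersection <= 0:
        return 0.0
    return intersection / min(a[2] - a[1], b[2] - b[1])


def consensus_peaks(calls: Dict[str, List[Peak]], min_tools: int,
                    min_overlap: float) -> List[Peak]:
    """Keep peaks supported by at least `min_tools` tools (the calling tool included)."""
    kept = set()
    for tool, peaks in calls.items():
        for peak in peaks:
            support = 1  # the tool that called this peak
            for other_tool, other_peaks in calls.items():
                if other_tool == tool:
                    continue
                if any(overlap_fraction(peak, q) >= min_overlap for q in other_peaks):
                    support += 1
            if support >= min_tools:
                kept.add(peak)
    return sorted(kept)


# Toy calls from three hypothetical tools.
calls = {
    "toolA": [("chr1", 100, 200), ("chr1", 1000, 1100)],
    "toolB": [("chr1", 120, 220)],
    "toolC": [("chr1", 150, 250), ("chr2", 100, 200)],
}
print(consensus_peaks(calls, min_tools=2, min_overlap=0.5))
```

Raising `min_tools` or `min_overlap` makes the filter stricter; with `min_tools=3` and `min_overlap=0.6` only the toolB peak survives in this toy example, because it is the only one that overlaps both other calls by at least 60%.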
Affiliation(s)
- Jens Lichtenberg
- Genetics and Molecular Biology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Laura Elnitski
- Translational and Functional Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - David M Bodine
- Genetics and Molecular Biology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
|
37
|
Girimurugan SB, Liu Y, Lung PY, Vera DL, Dennis JH, Bass HW, Zhang J. iSeg: an efficient algorithm for segmentation of genomic and epigenomic data. BMC Bioinformatics 2018; 19:131. [PMID: 29642840 PMCID: PMC5896135 DOI: 10.1186/s12859-018-2140-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/05/2017] [Accepted: 03/26/2018] [Indexed: 11/16/2022]
Abstract
Background: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. Results: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences.
Conclusions: We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2140-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yuhang Liu
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Pei-Yau Lung
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Daniel L Vera
  - Center for Genomics and Personalized Medicine, Florida State University, Tallahassee, FL, USA
- Jonathan H Dennis
  - Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Hank W Bass
  - Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
|
38
|
Bishop SM, Ercole A. Multi-Scale Peak and Trough Detection Optimised for Periodic and Quasi-Periodic Neuroscience Data. ACTA NEUROCHIRURGICA. SUPPLEMENT 2018; 126:189-195. [PMID: 29492559 DOI: 10.1007/978-3-319-65798-1_39] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 12/01/2022]
Abstract
OBJECTIVES The reliable detection of peaks and troughs in physiological signals is essential to many investigative techniques in medicine and computational biology. Analysis of the intracranial pressure (ICP) waveform is a particular challenge due to multi-scale features, a changing morphology over time and signal-to-noise limitations. Here we present an efficient peak and trough detection algorithm that extends the scalogram approach of Scholkmann et al., and results in greatly improved algorithm runtime performance. MATERIALS AND METHODS Our improved algorithm (modified Scholkmann) was developed and analysed in MATLAB R2015b. Synthesised waveforms (periodic, quasi-periodic and chirp sinusoids) were degraded with white Gaussian noise to achieve signal-to-noise ratios down to 5 dB and were used to compare the performance of the original Scholkmann and modified Scholkmann algorithms. RESULTS The modified Scholkmann algorithm has false-positive (0%) and false-negative (0%) detection rates identical to the original Scholkmann when applied to our test suite. Actual compute time for a 200-run Monte Carlo simulation over a multicomponent noisy test signal was 40.96 ± 0.020 s (mean ± 95%CI) for the original Scholkmann and 1.81 ± 0.003 s (mean ± 95%CI) for the modified Scholkmann, demonstrating the expected improvement in runtime complexity from [Formula: see text] to [Formula: see text]. CONCLUSIONS The accurate interpretation of waveform data to identify peaks and troughs is crucial in signal parameterisation, feature extraction and waveform identification tasks. Modification of a standard scalogram technique has produced a robust algorithm with linear computational complexity that is particularly suited to the challenges presented by large, noisy physiological datasets. The algorithm is optimised through a single parameter and can identify sub-waveform features with minimal additional overhead, and is easily adapted to run in real time on commodity hardware.
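To make the scalogram idea concrete, here is a minimal sketch of multi-scale peak detection. It is a simplified, deterministic reading of the approach of Scholkmann et al., not the modified algorithm reported in this abstract: it builds the full scale-by-sample table, so its cost grows quadratically with signal length. The function name `multiscale_peaks` and the triangular test signal are assumptions made for illustration.

```python
def multiscale_peaks(x):
    """Indices that are local maxima at every scale up to the dominant one."""
    n = len(x)
    max_scale = n // 2 - 1
    if max_scale < 1:
        return []
    # rows[k-1][i] is True when x[i] exceeds both neighbours k samples away.
    rows = []
    for k in range(1, max_scale + 1):
        row = [False] * n
        for i in range(k, n - k):
            row[i] = x[i] > x[i - k] and x[i] > x[i + k]
        rows.append(row)
    # Dominant scale: the row of the scalogram with the most local maxima.
    counts = [sum(row) for row in rows]
    best = counts.index(max(counts))
    # A peak must be a local maximum at every scale from 1 to the dominant one.
    return [i for i in range(n) if all(rows[k][i] for k in range(best + 1))]

# Hypothetical test signal: a triangular wave with period 6 and crests at 3, 9, 15, ...
signal = [0, 1, 2, 3, 2, 1] * 5 + [0]
peaks = multiscale_peaks(signal)
troughs = multiscale_peaks([-v for v in signal])  # troughs are peaks of the negated signal
print(peaks, troughs)
```

Samples closer to either end of the signal than the dominant scale cannot be compared on both sides, so extrema there are not reported by this sketch.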
Affiliation(s)
- Steven M Bishop
  - Division of Anaesthesia, University of Cambridge, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Ari Ercole
  - Division of Anaesthesia, University of Cambridge, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
|
39
|
Jordán-Pla A, Visa N. Considerations on Experimental Design and Data Analysis of Chromatin Immunoprecipitation Experiments. Methods Mol Biol 2018; 1689:9-28. [PMID: 29027161 DOI: 10.1007/978-1-4939-7380-4_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 01/03/2023]
Abstract
Arguably one of the most valuable techniques to study chromatin organization, ChIP is the method of choice to map the contacts established between proteins and genomic DNA. Ever since its inception, more than 30 years ago, ChIP has been constantly evolving, improving, and expanding its capabilities and reach. Despite its widespread use by many laboratories across a wide variety of disciplines, ChIP assays can sometimes be challenging to design, and are often sensitive to variations in practical implementation. In this chapter, we provide a general overview of the ChIP method and its most common variations, with a special focus on ChIP-seq. We try to address some of the most important aspects that need to be taken into account in order to design and perform experiments that generate the most reproducible, high-quality data. Some of the main topics covered include the use of properly characterized antibodies, alternatives to chromatin preparation, the need for proper controls, and some recommendations about ChIP-seq data analysis.
Affiliation(s)
- Antonio Jordán-Pla
  - Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Svante Arrhenius väg 20c, 10691, Stockholm, Sweden
- Neus Visa
  - Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Svante Arrhenius väg 20c, 10691, Stockholm, Sweden
|
40
|
Patten DK, Corleone G, Magnani L. Chromatin Immunoprecipitation and High-Throughput Sequencing (ChIP-Seq): Tips and Tricks Regarding the Laboratory Protocol and Initial Downstream Data Analysis. Methods Mol Biol 2018; 1767:271-288. [PMID: 29524141 DOI: 10.1007/978-1-4939-7774-1_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2024]
Abstract
Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become an essential tool for epigenetic scientists. ChIP-seq is used to map protein-DNA interactions and epigenetic marks such as histone modifications at the genome-wide level. Here we describe a complete ChIP-seq laboratory protocol (tailored toward processing tissue samples as well as cell lines) and the bioinformatic pipelines utilized for handling raw sequencing files through to peak calling.
Affiliation(s)
- Darren K Patten
  - Department of Surgery and Cancer, Imperial College London, London, UK
  - Department of Bariatric and Emergency General Surgery, Homerton University Hospital, London, UK
- Giacomo Corleone
  - Department of Surgery and Cancer, Imperial College London, London, UK
- Luca Magnani
  - Department of Surgery and Cancer, Imperial College London, London, UK
|
41
|
Nakato R, Shirahige K. Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Brief Bioinform 2017; 18:279-290. [PMID: 26979602 PMCID: PMC5444249 DOI: 10.1093/bib/bbw023] [Citation(s) in RCA: 78] [Impact Index Per Article: 11.1] [Received: 11/24/2015] [Indexed: 02/06/2023] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) analysis can detect protein/DNA-binding and histone-modification sites across an entire genome. Recent advances in sequencing technologies and analyses enable us to compare hundreds of samples simultaneously; such large-scale analysis has potential to reveal the high-dimensional interrelationship level for regulatory elements and annotate novel functional genomic regions de novo. Because many experimental considerations are relevant to the choice of a method in a ChIP-seq analysis, the overall design and quality management of the experiment are of critical importance. This review offers guiding principles of computation and sample preparation for ChIP-seq analyses, highlighting the validity and limitations of the state-of-the-art procedures at each step. We also discuss the latest challenges of single-cell analysis that will encourage a new era in this field.
Affiliation(s)
- Ryuichiro Nakato
  - Research Center for Epigenetic Disease, Institute of Molecular and Cellular Biosciences, The University of Tokyo, Tokyo, Japan
- Katsuhiko Shirahige
  - Research Center for Epigenetic Disease, Institute of Molecular and Cellular Biosciences, The University of Tokyo, Tokyo, Japan; Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency, Kawaguchi, Japan
|
42
|
Bottini S, Hamouda-Tekaya N, Tanasa B, Zaragosi LE, Grandjean V, Repetto E, Trabucchi M. From benchmarking HITS-CLIP peak detection programs to a new method for identification of miRNA-binding sites from Ago2-CLIP data. Nucleic Acids Res 2017; 45:e71. [PMID: 28108660 PMCID: PMC5435922 DOI: 10.1093/nar/gkx007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 08/22/2016] [Accepted: 01/03/2017] [Indexed: 12/20/2022] Open
Abstract
Experimental evidence indicates that about 60% of miRNA-binding activity does not follow the canonical rule about the seed matching between miRNA and target mRNAs, but rather a non-canonical miRNA targeting activity outside the seed or with seed-like motifs. Here, we propose a new unbiased method to identify canonical and non-canonical miRNA-binding sites from peaks identified by Ago2 Cross-Linked ImmunoPrecipitation associated to high-throughput sequencing (CLIP-seq). Since the quality of peaks is of pivotal importance for the final output of the proposed method, we provide a comprehensive benchmarking of four peak detection programs, namely CIMS, PIPE-CLIP, Piranha and Pyicoclip, on four publicly available Ago2-HITS-CLIP datasets and one unpublished in-house Ago2-dataset in stem cells. We measured the sensitivity, the specificity and the position accuracy toward miRNA binding sites identification, and the agreement with TargetScan. Secondly, we developed a new pipeline, called miRBShunter, to identify canonical and non-canonical miRNA-binding sites based on de novo motif identification from Ago2 peaks and prediction of miRNA::RNA heteroduplexes. miRBShunter was tested and experimentally validated on the in-house Ago2-dataset and on an Ago2-PAR-CLIP dataset in human stem cells. Overall, we provide guidelines to choose a suitable peak detection program and a new method for miRNA-target identification.
Affiliation(s)
- Silvia Bottini
  - Université Côte d'Azur, Inserm, C3M, Nice, 06204, France
- Bogdan Tanasa
  - Stanford University School of Medicine, 265 Campus Drive, LLSCR Building, Stanford, CA 94305, USA
|
43
|
An introduction to computational tools for differential binding analysis with ChIP-seq data. QUANTITATIVE BIOLOGY 2017. [DOI: 10.1007/s40484-017-0111-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/19/2022]
|
44
|
Yang A, Troup M, Ho JWK. Scalability and Validation of Big Data Bioinformatics Software. Comput Struct Biotechnol J 2017; 15:379-386. [PMID: 28794828 PMCID: PMC5537105 DOI: 10.1016/j.csbj.2017.07.002] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 03/30/2017] [Revised: 06/30/2017] [Accepted: 07/17/2017] [Indexed: 12/20/2022] Open
Abstract
This review examines two important aspects that are central to modern big data bioinformatics analysis – software scalability and validity. We argue that not only are the issues of scalability and validation common to all big data bioinformatics analyses, they can be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability for a program to scale based on workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless the surge of volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to effectively implement divide-and-conquer in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult due to the large input space and complex algorithms involved. We discuss how state-of-the-art software testing techniques that are based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
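The multiple-executions idea behind metamorphic testing can be sketched in a few lines. The program under test is run twice, once on a source input and once on a transformed input, and the two outputs are checked against a relation that must hold even when the correct output itself is unknown. The k-mer counter, the reverse-complement relation, and all names below are hypothetical examples, not taken from the review.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]

def count_kmers(seq: str, k: int) -> dict:
    """Hypothetical program under test: count every k-mer in seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def check_revcomp_relation(seq: str, k: int) -> bool:
    """Metamorphic relation: counting k-mers on the reverse-complemented
    sequence must give the reverse-complemented k-mers with the same counts."""
    source = count_kmers(seq, k)                         # first execution
    follow_up = count_kmers(reverse_complement(seq), k)  # second execution
    expected = {reverse_complement(kmer): n for kmer, n in source.items()}
    return follow_up == expected

# Random test input with a fixed seed so the check is reproducible.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(200))
print(check_revcomp_relation(seq, 4))
```

A violated relation signals a defect without needing a reference answer, which is why the technique suits software whose correct output is hard to determine for large inputs.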
Affiliation(s)
- Andrian Yang
  - Victor Chang Cardiac Research Institute, Darlinghurst, NSW 2010, Australia; St. Vincent's Clinical School, University of New South Wales, Darlinghurst, NSW 2010, Australia
- Michael Troup
  - Victor Chang Cardiac Research Institute, Darlinghurst, NSW 2010, Australia
- Joshua W K Ho
  - Victor Chang Cardiac Research Institute, Darlinghurst, NSW 2010, Australia; St. Vincent's Clinical School, University of New South Wales, Darlinghurst, NSW 2010, Australia
|
45
|
Xiong X, Yi C, Peng J. Epitranscriptomics: Toward A Better Understanding of RNA Modifications. GENOMICS PROTEOMICS & BIOINFORMATICS 2017; 15:147-153. [PMID: 28533024 PMCID: PMC5487522 DOI: 10.1016/j.gpb.2017.03.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 11/03/2016] [Revised: 02/18/2017] [Accepted: 03/22/2017] [Indexed: 12/11/2022]
Affiliation(s)
- Xushen Xiong
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China; Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Chengqi Yi
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China; Synthetic and Functional Biomolecules Center, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Jinying Peng
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China
|
46
|
Thomas R, Thomas S, Holloway AK, Pollard KS. Features that define the best ChIP-seq peak calling algorithms. Brief Bioinform 2017; 18:441-450. [PMID: 27169896 PMCID: PMC5429005 DOI: 10.1093/bib/bbw035] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 01/20/2016] [Revised: 03/01/2016] [Indexed: 12/20/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an important tool for studying gene regulatory proteins, such as transcription factors and histones. Peak calling is one of the first steps in the analysis of these data. Peak calling consists of two sub-problems: identifying candidate peaks and testing candidate peaks for statistical significance. We surveyed 30 methods and identified 12 features of the two sub-problems that distinguish methods from each other. We picked six methods [GEM, MACS2, MUSIC, BCP, Threshold-based method (TM) and ZINBA] that span this feature space and used a combination of 300 simulated ChIP-seq data sets, 3 real data sets and mathematical analyses to identify features of methods that allow some to perform better than the others. We prove that methods that explicitly combine the signals from ChIP and input samples are less powerful than methods that do not. Methods that use windows of different sizes are more powerful than the ones that do not. For statistical testing of candidate peaks, methods that use a Poisson test to rank their candidate peaks are more powerful than those that use a Binomial test. BCP and MACS2 have the best operating characteristics on simulated transcription factor binding data. GEM has the highest fraction of the top 500 peaks containing the binding motif of the immunoprecipitated factor, with 50% of its peaks within 10 base pairs of a motif. BCP and MUSIC perform best on histone data. These findings provide guidance and rationale for selecting the best peak caller for a given application.
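The second sub-problem, testing a candidate peak for significance, can be illustrated with a minimal Poisson upper-tail test. This is a generic sketch under assumed inputs, not the implementation of any of the surveyed peak callers. The function name `poisson_upper_tail`, the read count of 25, and the expected background of 10 reads are hypothetical.

```python
import math

def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

# Hypothetical window: 25 ChIP reads observed where the background expects 10.
p_value = poisson_upper_tail(25, 10.0)
print(f"p = {p_value:.3g}")
```

Candidate peaks could then be ranked by this p-value. Computing the tail as one minus the CDF loses precision for extremely small p-values, so a production tool would more likely work in log space.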
Affiliation(s)
- Sean Thomas
  - Gladstone Institutes, San Francisco, CA, USA
  - Division of Biostatistics, University of California, San Francisco, CA, USA
- Alisha K Holloway
  - Gladstone Institutes, San Francisco, CA, USA
  - Division of Biostatistics, University of California, San Francisco, CA, USA
  - Phylos Biosciences, Portland, OR, USA
- Katherine S Pollard
  - Gladstone Institutes, San Francisco, CA, USA
  - Division of Biostatistics, University of California, San Francisco, CA, USA
  - Institute for Human Genetics and Institute for Computational Health Sciences, University of California, San Francisco, CA, USA
|
47
|
Soleymani A, Pennekamp F, Dodge S, Weibel R. Characterizing change points and continuous transitions in movement behaviours using wavelet decomposition. Methods Ecol Evol 2017. [DOI: 10.1111/2041-210x.12755] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
Affiliation(s)
- Ali Soleymani
  - Department of Geography, University of Zurich, Zurich, Switzerland
- Frank Pennekamp
  - Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Somayeh Dodge
  - Department of Geography, Environment, and Society, University of Minnesota, Twin Cities, MN, USA
- Robert Weibel
  - Department of Geography, University of Zurich, Zurich, Switzerland
|
48
|
Hung JH, Weng Z. Peak-Finding Algorithms. Cold Spring Harb Protoc 2017; 2017:pdb.top093179. [PMID: 27574196 DOI: 10.1101/pdb.top093179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Microarray and next-generation sequencing technologies have greatly expedited the discovery of genomic DNA that can be enriched using various biochemical methods. Chromatin immunoprecipitation (ChIP) is a general method for enriching chromatin fragments that are specifically recognized by an antibody. The resulting DNA fragments can be assayed by microarray (ChIP-chip) or sequencing (ChIP-seq). This introduction focuses on ChIP-seq data analysis. The first step of analyzing ChIP-seq data is identifying regions in the genome that are enriched in a ChIP sample; these regions are called peaks.
|
49
|
Loh YH, Feng J, Nestler E, Shen L. Bioinformatic Analysis for Profiling Drug-induced Chromatin Modification Landscapes in Mouse Brain Using ChIP-seq Data. Bio Protoc 2017; 7:e2123. [DOI: 10.21769/bioprotoc.2123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022] Open
|
50
|
Han Y, He X. Integrating Epigenomics into the Understanding of Biomedical Insight. Bioinform Biol Insights 2016; 10:267-289. [PMID: 27980397 PMCID: PMC5138066 DOI: 10.4137/bbi.s38427] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 06/17/2016] [Revised: 11/01/2016] [Accepted: 11/06/2016] [Indexed: 12/13/2022] Open
Abstract
Epigenetics is one of the most rapidly expanding fields in biomedical research, and the popularity of high-throughput next-generation sequencing (NGS) highlights the accelerating speed of epigenomics discovery over the past decade. Epigenetics studies the heritable phenotypes resulting from chromatin changes without alteration of the DNA sequence. Epigenetic factors and their interactive network regulate almost all of the fundamental biological procedures, and incorrect epigenetic information may lead to complex diseases. A comprehensive understanding of epigenetic mechanisms, their interactions, and their genome-wide alterations in health and disease has become a priority in biological research. Bioinformatics is expected to make a remarkable contribution for this purpose, especially in processing and interpreting the large-scale NGS datasets. In this review, we introduce the pioneering achievements of epigenetics in health status and complex diseases; next, we give a systematic review of epigenomics data generation, summarize public resources and integrative analysis approaches, and finally outline the challenges and future directions in computational epigenomics.
Affiliation(s)
- Yixing Han
  - Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, USA; Present address: Genetics and Biochemistry Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Ximiao He
  - Laboratory of Metabolism, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA; Present address: Department of Medical Genetics, School of Basic Medicine, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
|