1
Hocking TD, Goerner-Potvin P, Morin A, Shao X, Pastinen T, Bourque G. Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning. Bioinformatics 2017; 33:491-499. [PMID: 27797775 PMCID: PMC5408812 DOI: 10.1093/bioinformatics/btw672] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 06/16/2016] [Accepted: 10/20/2016] [Indexed: 11/13/2022]
Abstract
Motivation: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given dataset. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome.
Results: We created 7 new histone mark datasets with 12 826 visually determined labels, and analyzed 3 existing transcription factor datasets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms.
Availability and Implementation: Labeled histone mark data http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/ , R package to compute the label error of predicted peaks https://github.com/tdhock/PeakError.
Contacts: toby.hocking@mail.mcgill.ca or guil.bourque@mcgill.ca.
Supplementary information: Supplementary data are available at Bioinformatics online.
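The idea of scoring predicted peaks against visual labels can be illustrated with a small sketch. This is not the PeakError package's implementation; it is a simplified Python illustration that assumes only two label types (hypothetically named `"peaks"` for a region that should contain at least one peak and `"noPeaks"` for a region that should contain none) and half-open `(start, end)` intervals on a single chromosome. The function and variable names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Peak = Tuple[int, int]        # (start, end), half-open, one chromosome
Label = Tuple[int, int, str]  # (start, end, annotation)

# Assumed, simplified label vocabulary for this sketch.
VALID_ANNOTATIONS = ("peaks", "noPeaks")


def overlaps(peak: Peak, start: int, end: int) -> bool:
    """Return True if the peak shares at least one base with [start, end)."""
    return peak[0] < end and start < peak[1]


def label_errors(peaks: List[Peak], labels: List[Label]) -> Dict[str, int]:
    """Count labels that the predicted peaks contradict.

    A "noPeaks" label overlapped by any peak is a false positive.
    A "peaks" label overlapped by no peak is a false negative.
    """
    false_positives = 0
    false_negatives = 0
    for start, end, annotation in labels:
        if annotation not in VALID_ANNOTATIONS:
            raise ValueError(f"unknown annotation: {annotation!r}")
        n_overlapping = sum(overlaps(p, start, end) for p in peaks)
        if annotation == "noPeaks" and n_overlapping > 0:
            false_positives += 1
        elif annotation == "peaks" and n_overlapping == 0:
            false_negatives += 1
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "errors": false_positives + false_negatives,
        "labels": len(labels),
    }


# Toy data, invented for illustration only.
example_peaks = [(100, 200), (900, 950)]
example_labels = [
    (50, 250, "peaks"),       # satisfied: one peak overlaps
    (300, 600, "noPeaks"),    # satisfied: no peak overlaps
    (880, 1000, "noPeaks"),   # false positive: a peak overlaps
    (1500, 1800, "peaks"),    # false negative: no peak overlaps
]
example_result = label_errors(example_peaks, example_labels)
```

Under these assumptions, a peak detector's parameters could be tuned by choosing the setting that minimizes `errors` on a labeled training set, then checking the error on held-out labels.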
Affiliation(s)
- Toby Dylan Hocking, Department of Human Genetics, McGill University, H3A-1A4, Montréal, Canada
- Andreanne Morin, Department of Human Genetics, McGill University, H3A-1A4, Montréal, Canada
- Xiaojian Shao, Department of Human Genetics, McGill University, H3A-1A4, Montréal, Canada
- Tomi Pastinen, Department of Human Genetics, McGill University, H3A-1A4, Montréal, Canada
- Guillaume Bourque, Department of Human Genetics, McGill University, H3A-1A4, Montréal, Canada
2
Lizio M, Ishizu Y, Itoh M, Lassmann T, Hasegawa A, Kubosaki A, Severin J, Kawaji H, Nakamura Y, Suzuki H, Hayashizaki Y, Carninci P, Forrest ARR. Mapping Mammalian Cell-type-specific Transcriptional Regulatory Networks Using KD-CAGE and ChIP-seq Data in the TC-YIK Cell Line. Front Genet 2015; 6:331. [PMID: 26635867 PMCID: PMC4650373 DOI: 10.3389/fgene.2015.00331] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 09/21/2015] [Accepted: 10/30/2015] [Indexed: 12/22/2022]
Abstract
Mammals are composed of hundreds of different cell types with specialized functions. Each of these cellular phenotypes is controlled by a different combination of transcription factors. Using a human non-islet-cell insulinoma cell line (TC-YIK), which expresses insulin and the majority of known pancreatic beta-cell-specific genes, as an example, we describe a general approach to identify key cell-type-specific transcription factors (TFs) and their direct and indirect targets. By ranking all human TFs by their level of enriched expression in TC-YIK relative to a broad collection of samples (FANTOM5), we confirmed known key regulators of pancreatic function and development. Systematic siRNA-mediated perturbation of these TFs followed by qRT-PCR revealed their interconnections, with NEUROD1 at the top of the regulation hierarchy and its depletion drastically reducing insulin levels. For 15 of the TF knock-downs (KD), we then used Cap Analysis of Gene Expression (CAGE) to identify thousands of their targets genome-wide (KD-CAGE). The data confirm NEUROD1 as a key positive regulator in the transcriptional regulatory network (TRN), and ISL1 and PROX1 as antagonists. As a complementary approach, we used ChIP-seq on four of these factors to identify NEUROD1, LMX1A, PAX6, and RFX6 binding sites in the human genome. Examining the overlap between genes perturbed in the KD-CAGE experiments and genes with a ChIP-seq peak within 50 kb of their promoter, we identified direct transcriptional targets of these TFs. Integration of KD-CAGE and ChIP-seq data shows that both NEUROD1 and LMX1A work as the main transcriptional activators. In the core TRN (i.e., TF-TF only), NEUROD1 directly transcriptionally activates the pancreatic TFs HSF4, INSM1, MLXIPL, MYT1, NKX6-3, ONECUT2, PAX4, PROX1, RFX6, ST18, DACH1, and SHOX2, while LMX1A directly transcriptionally activates DACH1, SHOX2, PAX6, and PDX1.
Analysis of these complementary datasets suggests the need for caution in interpreting ChIP-seq datasets. (1) A large fraction of binding sites are at distal enhancer sites and cannot be directly associated with their targets without chromatin conformation data. (2) Many peaks may be non-functional: even when there is a peak at a promoter, the expression of the gene may not be affected in the matching perturbation experiment.
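The direct-target step described in this abstract (genes perturbed in a knock-down that also have a ChIP-seq peak within 50 kb of their promoter) can be sketched as a set intersection. The sketch below is an illustration only, not the authors' pipeline: it assumes each promoter is reduced to a single coordinate, peaks are half-open `(chrom, start, end)` intervals, and the gene names and coordinates are invented placeholders.

```python
from typing import Dict, List, Set, Tuple

WINDOW_BP = 50_000  # the 50 kb window mentioned in the abstract

Promoters = Dict[str, Tuple[str, int]]  # gene -> (chromosome, promoter position)
Peak = Tuple[str, int, int]             # (chromosome, start, end)


def genes_near_peaks(promoters: Promoters, peaks: List[Peak],
                     window: int = WINDOW_BP) -> Set[str]:
    """Return genes whose promoter lies within `window` bp of any peak."""
    near: Set[str] = set()
    for gene, (chrom, position) in promoters.items():
        for peak_chrom, peak_start, peak_end in peaks:
            if peak_chrom != chrom:
                continue
            # Distance from the promoter to the peak interval; 0 if inside it.
            distance = max(peak_start - position, position - peak_end, 0)
            if distance <= window:
                near.add(gene)
                break
    return near


def direct_targets(perturbed: Set[str], promoters: Promoters,
                   peaks: List[Peak], window: int = WINDOW_BP) -> Set[str]:
    """Genes that are both perturbed in the knock-down and near a peak."""
    return perturbed & genes_near_peaks(promoters, peaks, window)


# Hypothetical toy data; GENE_A, GENE_B and GENE_C are placeholders.
toy_promoters = {
    "GENE_A": ("chr1", 100_000),
    "GENE_B": ("chr1", 500_000),
    "GENE_C": ("chr2", 100_000),
}
toy_peaks = [("chr1", 140_000, 140_400), ("chr1", 600_000, 600_300)]
toy_perturbed = {"GENE_A", "GENE_C"}
toy_targets = direct_targets(toy_perturbed, toy_promoters, toy_peaks)
```

In this toy case only GENE_A qualifies: GENE_C is perturbed but has no nearby peak, which mirrors the abstract's caution that perturbation and binding evidence do not always coincide. The nested loop is quadratic; a real analysis over thousands of peaks would more likely use an interval index.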
Affiliation(s)
- Marina Lizio, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan
- Yuri Ishizu, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan
- Masayoshi Itoh, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan; RIKEN Preventive Medicine and Diagnosis Innovation Program, Yokohama, Japan
- Timo Lassmann, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan; Telethon Kids Institute, The University of Western Australia, Subiaco, WA, Australia
- Akira Hasegawa, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan
- Jessica Severin, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan
- Hideya Kawaji, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan; RIKEN Preventive Medicine and Diagnosis Innovation Program, Yokohama, Japan
- Yukio Nakamura, Cell Engineering Division, RIKEN BioResource Center, Ibaraki, Japan
- Harukazu Suzuki, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan
- Yoshihide Hayashizaki, RIKEN Center for Life Science Technologies, Yokohama, Japan; RIKEN Preventive Medicine and Diagnosis Innovation Program, Yokohama, Japan
- Piero Carninci, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan
- Alistair R R Forrest, RIKEN Center for Life Science Technologies, Yokohama, Japan; Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Japan; QEII Medical Centre and Centre for Medical Research, Harry Perkins Institute of Medical Research, The University of Western Australia, Nedlands, WA, Australia