1
Almeida da Paz M, Warger S, Taher L. Disregarding multimappers leads to biases in the functional assessment of NGS data. BMC Genomics 2024; 25:455. [PMID: 38720252 PMCID: PMC11078754 DOI: 10.1186/s12864-024-10344-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Accepted: 04/24/2024] [Indexed: 05/12/2024]
Abstract
BACKGROUND Standard ChIP-seq and RNA-seq processing pipelines typically disregard sequencing reads whose origin is ambiguous ("multimappers"). This usual practice has potentially important consequences for the functional interpretation of the data: genomic elements belonging to clusters composed of highly similar members are left unexplored. RESULTS In particular, disregarding multimappers leads to the underrepresentation in epigenetic studies of recently active transposable elements, such as AluYa5, L1HS and SVAs. Furthermore, this common strategy also has implications for transcriptomic analysis: members of repetitive gene families, such as the ones including major histocompatibility complex (MHC) class I and II genes, are under-quantified. CONCLUSION Revealing inherent biases that permeate routine tasks such as functional enrichment analysis, our results underscore the urgency of broadly adopting multimapper-aware bioinformatic pipelines (currently restricted to specific contexts or communities) to ensure the reliability of genomic and transcriptomic studies.
Affiliation(s)
- Sarah Warger
  - Institute of Biomedical Informatics, Graz University of Technology, Graz, Austria
- Leila Taher
  - Institute of Biomedical Informatics, Graz University of Technology, Graz, Austria
2
Zeng C, Takeda A, Sekine K, Osato N, Fukunaga T, Hamada M. Bioinformatics Approaches for Determining the Functional Impact of Repetitive Elements on Non-coding RNAs. Methods Mol Biol 2022; 2509:315-340. [PMID: 35796972 DOI: 10.1007/978-1-0716-2380-0_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
With a large number of annotated non-coding RNAs (ncRNAs), repetitive sequences are found to constitute functional components (termed as repetitive elements) in ncRNAs that perform specific biological functions. Bioinformatics analysis is a powerful tool for improving our understanding of the role of repetitive elements in ncRNAs. This chapter summarizes recent findings that reveal the role of repetitive elements in ncRNAs. Furthermore, relevant bioinformatics approaches are systematically reviewed, which promises to provide valuable resources for studying the functional impact of repetitive elements on ncRNAs.
Affiliation(s)
- Chao Zeng
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
- Atsushi Takeda
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Kotaro Sekine
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Naoki Osato
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Tsukasa Fukunaga
  - Waseda Institute for Advanced Study, Waseda University, Tokyo, Japan
- Michiaki Hamada
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
3
Shah RN, Ruthenburg AJ. Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads. PLoS Comput Biol 2021; 17:e1008926. [PMID: 33872311 PMCID: PMC8084338 DOI: 10.1371/journal.pcbi.1008926] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/22/2020] [Revised: 04/29/2021] [Accepted: 03/30/2021] [Indexed: 11/18/2022]
Abstract
Next-generation sequencing (NGS) has transformed molecular biology and contributed to many seminal insights into genomic regulation and function. Apart from whole-genome sequencing, an NGS workflow involves alignment of the sequencing reads to the genome of study, after which the resulting alignments can be used for downstream analyses. However, alignment is complicated by the repetitive sequences; many reads align to more than one genomic locus, with 15-30% of the genome not being uniquely mappable by short-read NGS. This problem is typically addressed by discarding reads that do not uniquely map to the genome, but this practice can lead to systematic distortion of the data. Previous studies that developed methods for handling ambiguously mapped reads were often of limited applicability or were computationally intensive, hindering their broader usage. In this work, we present SmartMap: an algorithm that augments industry-standard aligners to enable usage of ambiguously mapped reads by assigning weights to each alignment with Bayesian analysis of the read distribution and alignment quality. SmartMap is computationally efficient, utilizing far fewer weighting iterations than previously thought necessary to process alignments and, as such, analyzing more than a billion alignments of NGS reads in approximately one hour on a desktop PC. By applying SmartMap to peak-type NGS data, including MNase-seq, ChIP-seq, and ATAC-seq in three organisms, we can increase read depth by up to 53% and increase the mapped proportion of the genome by up to 18% compared to analyses utilizing only uniquely mapped reads. We further show that SmartMap enables the analysis of more than 140,000 repetitive elements that could not be analyzed by traditional ChIP-seq workflows, and we utilize this method to gain insight into the epigenetic regulation of different classes of repetitive elements. 
These data emphasize both the dangers of discarding ambiguously mapped reads and their power for driving biological discovery.
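The idea of weighting each candidate alignment of an ambiguously mapped read can be illustrated with a deliberately simplified sketch. This is not SmartMap's actual algorithm or code: the function name, the input layout, and the update rule (weight proportional to alignment score times current local coverage) are assumptions chosen only to show how a few reweighting iterations pull a multi-read toward the locus that other reads support.

```python
from collections import defaultdict


def reweight_multireads(reads, iterations=5):
    """Iteratively assign fractional weights to candidate alignments.

    reads: list of reads; each read is a list of (bin_id, alignment_score)
    candidate alignments (hypothetical input layout for this sketch).
    Returns one list of weights per read; each list sums to 1.
    """
    # Start by splitting every read across its candidates by alignment score.
    weights = []
    for candidates in reads:
        total = sum(score for _, score in candidates)
        weights.append([score / total for _, score in candidates])

    for _ in range(iterations):
        # Fractional coverage of each bin under the current weights.
        coverage = defaultdict(float)
        for candidates, ws in zip(reads, weights):
            for (bin_id, _), w in zip(candidates, ws):
                coverage[bin_id] += w

        # Update each weight in proportion to score times local coverage,
        # then renormalize so each read still contributes exactly one count.
        new_weights = []
        for candidates in reads:
            raw = [score * coverage[bin_id] for bin_id, score in candidates]
            norm = sum(raw)
            new_weights.append([r / norm for r in raw])
        weights = new_weights

    return weights


# Two reads map uniquely to bin "A"; a third is ambiguous between "A" and "B".
example = [[("A", 1.0)], [("A", 1.0)], [("A", 1.0), ("B", 1.0)]]
result = reweight_multireads(example)
```

In this toy input the ambiguous read ends up weighted mostly toward bin "A", because the uniquely mapped reads already support that bin.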
Affiliation(s)
- Rohan N. Shah
  - Pritzker School of Medicine, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
  - Department of Molecular Biology and Cell Genetics, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- Alexander J. Ruthenburg
  - Department of Molecular Biology and Cell Genetics, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
  - Department of Biochemistry and Molecular Biology, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
4
Li M, Tang L, Wu FX, Pan Y, Wang J. CSA: a web service for the complete process of ChIP-Seq analysis. BMC Bioinformatics 2019; 20:515. [PMID: 31874601 PMCID: PMC6929326 DOI: 10.1186/s12859-019-3090-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/28/2019] [Accepted: 09/10/2019] [Indexed: 01/10/2023]
Abstract
BACKGROUND Chromatin immunoprecipitation sequencing (ChIP-seq) is a technology that combines chromatin immunoprecipitation (ChIP) with next-generation sequencing (NGS) technology to analyze protein interactions with DNA. At present, most ChIP-seq analysis tools adopt the command line, which lacks user-friendly interfaces. Although some web services with graphical interfaces have been developed for ChIP-seq analysis, these sites cannot provide a comprehensive analysis of ChIP-seq from raw data to downstream analysis. RESULTS In this study, we develop a web service for the whole process of ChIP-Seq Analysis (CSA), which covers mapping, quality control, peak calling, and downstream analysis. In addition, CSA provides a customization function for users to define their own workflows. CSA also visualizes the results of mapping, peak calling, motif finding, and pathway analysis. For the different types of ChIP-seq datasets, CSA can provide the corresponding tool to perform the analysis. Moreover, CSA can detect differences in ChIP signals between ChIP samples and controls to identify absolute binding sites. CONCLUSIONS The two case studies demonstrate the effectiveness of CSA, which can complete the whole procedure of ChIP-seq analysis. CSA provides a web interface for users, and implements the visualization of every analysis step. The website of CSA is available at http://CompuBio.csu.edu.cn.
Affiliation(s)
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Li Tang
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, SK S7N 5A9, Saskatoon, Canada
- Yi Pan
  - Department of Computer Science, Georgia State University, GA 30303, Atlanta, USA
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
5
Zhang H, Chan Y, Fan K, Schmidt B, Liu W. Fast and efficient short read mapping based on a succinct hash index. BMC Bioinformatics 2018. [PMID: 29523083 PMCID: PMC5845352 DOI: 10.1186/s12859-018-2094-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 01/08/2023]
Abstract
Background Various indexing techniques have been applied by next generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time. Results We present the succinct hash index – a novel data structure for read mapping which is a variant of the classical q-gram index with a particularly small memory footprint occupying between 3.5 and 5.3 GB for a human reference genome for typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented to design the FEM (Fast(F) and Efficient(E) read Mapper(M)) mapper. FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in terms of both speed and memory footprint. Compared to Masai, FEM is an order-of-magnitude faster using a single thread and two orders-of-magnitude faster when using multiple threads. Furthermore, we observe an up to 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3. Conclusions The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM.
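For readers unfamiliar with the classical q-gram index that the abstract builds on, the following minimal sketch shows the basic idea: every length-q substring of the reference is mapped to its positions, and seeds taken from a read vote for candidate mapping locations. It is a plain dictionary-based illustration, not FEM's succinct hash index or its seeding algorithms; the function names and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def build_qgram_index(reference, q):
    """Map every q-gram of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def candidate_positions(read, index, q):
    """Count, per candidate read start, how many non-overlapping seeds agree."""
    votes = Counter()
    for offset in range(0, len(read) - q + 1, q):
        seed = read[offset:offset + q]
        for pos in index.get(seed, ()):
            start = pos - offset  # implied start of the read on the reference
            if start >= 0:
                votes[start] += 1
    return votes


# Toy reference containing the read twice, so the read is a multimapper.
reference = "AAACGTACGGTTTACGTACGGCCC"
index = build_qgram_index(reference, q=4)
votes = candidate_positions("ACGTACGG", index, q=4)
```

Here both seeds of the read support two distinct start positions, which is exactly the situation an all-mapper has to report rather than resolve. A real mapper would follow this seeding step with verification against an edit-distance threshold.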
Affiliation(s)
- Haowen Zhang
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
- Yuandong Chan
  - School of Software, Shandong University, Shunhua Road 1500, Jinan, Shandong, China
- Kaichao Fan
  - School of Software, Shandong University, Shunhua Road 1500, Jinan, Shandong, China
- Weiguo Liu
  - School of Software, Shandong University, Shunhua Road 1500, Jinan, Shandong, China
  - Laboratory for Regional Oceanography and Numerical Modeling, Qingdao National Laboratory for Marine Science and Technology, Qingdao, 266237, Shandong, China
6
Stanton KP, Jin J, Lederman RR, Weissman SM, Kluger Y. Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling. Nucleic Acids Res 2017; 45:e173. [PMID: 28981893 PMCID: PMC5716106 DOI: 10.1093/nar/gkx799] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/23/2015] [Accepted: 08/30/2017] [Indexed: 02/03/2023]
Abstract
With the advent of next generation high-throughput DNA sequencing technologies, omics experiments have become the mainstay for studying diverse biological effects on a genome wide scale. Chromatin immunoprecipitation (ChIP-seq) is the omics technique that enables genome wide localization of transcription factor (TF) binding or epigenetic modification events. Since the inception of ChIP-seq in 2007, many methods have been developed to infer ChIP-target binding loci from the resultant reads after mapping them to a reference genome. However, interpreting these data has proven challenging, and as such these algorithms have several shortcomings, including susceptibility to false positives due to artifactual peaks, poor localization of binding sites and the requirement for a total DNA input control which increases the cost of performing these experiments. We present Ritornello, a new approach for finding TF-binding sites in ChIP-seq, with roots in digital signal processing that addresses all of these problems. We show that Ritornello generally performs equally or better than the peak callers tested and recommended by the ENCODE consortium, but in contrast, Ritornello does not require a matched total DNA input control to avoid false positives, effectively decreasing the sequencing cost to perform ChIP-seq. Ritornello is freely available at https://github.com/KlugerLab/Ritornello.
Affiliation(s)
- Kelly P Stanton
  - Department of Pathology, Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06520, USA
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
- Jiaqi Jin
  - Department of Genetics, Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06520, USA
- Roy R Lederman
  - Program of Applied Mathematics, Yale University, 51 Prospect Street, New Haven, CT 06511, USA
  - Department of Mathematics and PACM, Princeton University, Fine Hall, Washington Road, Princeton, NJ 08544-1000, USA
- Sherman M Weissman
  - Department of Genetics, Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06520, USA
- Yuval Kluger
  - Department of Pathology, Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06520, USA
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
  - Program of Applied Mathematics, Yale University, 51 Prospect Street, New Haven, CT 06511, USA
7
Newkirk DA, Chen YY, Chien R, Zeng W, Biesinger J, Flowers E, Kawauchi S, Santos R, Calof AL, Lander AD, Xie X, Yokomori K. The effect of Nipped-B-like (Nipbl) haploinsufficiency on genome-wide cohesin binding and target gene expression: modeling Cornelia de Lange syndrome. Clin Epigenetics 2017; 9:89. [PMID: 28855971 PMCID: PMC5574093 DOI: 10.1186/s13148-017-0391-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 05/22/2017] [Accepted: 08/15/2017] [Indexed: 12/20/2022]
Abstract
BACKGROUND Cornelia de Lange syndrome (CdLS) is a multisystem developmental disorder frequently associated with heterozygous loss-of-function mutations of Nipped-B-like (NIPBL), the human homolog of Drosophila Nipped-B. NIPBL loads cohesin onto chromatin. Cohesin mediates sister chromatid cohesion important for mitosis but is also increasingly recognized as a regulator of gene expression. In CdLS patient cells and animal models, expression changes of multiple genes with little or no sister chromatid cohesion defect suggests that disruption of gene regulation underlies this disorder. However, the effect of NIPBL haploinsufficiency on cohesin binding, and how this relates to the clinical presentation of CdLS, has not been fully investigated. Nipbl haploinsufficiency causes CdLS-like phenotype in mice. We examined genome-wide cohesin binding and its relationship to gene expression using mouse embryonic fibroblasts (MEFs) from Nipbl+/- mice that recapitulate the CdLS phenotype. RESULTS We found a global decrease in cohesin binding, including at CCCTC-binding factor (CTCF) binding sites and repeat regions. Cohesin-bound genes were found to be enriched for histone H3 lysine 4 trimethylation (H3K4me3) at their promoters; were disproportionately downregulated in Nipbl mutant MEFs; and displayed evidence of reduced promoter-enhancer interaction. The results suggest that gene activation is the primary cohesin function sensitive to Nipbl reduction. Over 50% of significantly dysregulated transcripts in mutant MEFs come from cohesin target genes, including genes involved in adipogenesis that have been implicated in contributing to the CdLS phenotype. CONCLUSIONS Decreased cohesin binding at the gene regions is directly linked to disease-specific expression changes. Taken together, our Nipbl haploinsufficiency model allows us to analyze the dosage effect of cohesin loading on CdLS development.
Affiliation(s)
- Daniel A. Newkirk
  - Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697 USA
  - Department of Computer Sciences, University of California, Irvine, CA 92697 USA
- Yen-Yun Chen
  - Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697 USA
  - Current address: ResearchDx Inc., 5 Mason, Irvine, CA 92618 USA
- Richard Chien
  - Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697 USA
  - Current address: Thermo Fisher Scientific, Inc., 180 Oyster Point Blvd South, San Francisco, CA 94080 USA
- Weihua Zeng
  - Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697 USA
  - Current address: Department of Developmental & Cell Biology, School of Biological Sciences, University of California, Irvine, CA 92697 USA
- Jacob Biesinger
  - Department of Computer Sciences, University of California, Irvine, CA 92697 USA
  - Current address: Verily Life Sciences, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA
- Ebony Flowers
  - Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697 USA
  - California State University Long Beach, Long Beach, CA 90840 USA
  - Current address: UT Southwestern Medical Center, 5323 Harry Hines Blvd, NA8.124, Dallas, TX 75390 USA
- Shimako Kawauchi
  - Department of Anatomy & Neurobiology, School of Medicine, University of California, Irvine, CA 92697 USA
- Rosaysela Santos
  - Department of Anatomy & Neurobiology, School of Medicine, University of California, Irvine, CA 92697 USA
- Anne L. Calof
  - Department of Anatomy & Neurobiology, School of Medicine, University of California, Irvine, CA 92697 USA
- Arthur D. Lander
  - Department of Developmental & Cell Biology, School of Biological Sciences, University of California, Irvine, CA 92697 USA
- Xiaohui Xie
  - Department of Computer Sciences, University of California, Irvine, CA 92697 USA
- Kyoko Yokomori
  - Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697 USA
8
Yin Z, Lan H, Tan G, Lu M, Vasilakos AV, Liu W. Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges. Comput Struct Biotechnol J 2017; 15:403-411. [PMID: 28883909 PMCID: PMC5581845 DOI: 10.1016/j.csbj.2017.07.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 03/11/2017] [Revised: 06/30/2017] [Accepted: 07/28/2017] [Indexed: 12/25/2022]
Abstract
The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the biological data amount is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. At last we discuss the open issues in big biological data analytics.
Affiliation(s)
- Zekun Yin
  - Shandong University, Jinan, Shandong, China
- Guangming Tan
  - Institute of Computing Technology, Chinese Academy of Sciences, China
- Mian Lu
  - Huawei Singapore Research Centre, Singapore
- Athanasios V Vasilakos
  - Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Skellefteå SE-931 87, Sweden
- Weiguo Liu
  - Shandong University, Jinan, Shandong, China
9
Zeng X, Li B, Welch R, Rojo C, Zheng Y, Dewey CN, Keleş S. Perm-seq: Mapping Protein-DNA Interactions in Segmental Duplication and Highly Repetitive Regions of Genomes with Prior-Enhanced Read Mapping. PLoS Comput Biol 2015; 11:e1004491. [PMID: 26484757 PMCID: PMC4618727 DOI: 10.1371/journal.pcbi.1004491] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 11/13/2014] [Accepted: 08/06/2015] [Indexed: 11/19/2022]
Abstract
Segmental duplications and other highly repetitive regions of genomes contribute significantly to cells' regulatory programs. Advancements in next generation sequencing enabled genome-wide profiling of protein-DNA interactions by chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq). However, interactions in highly repetitive regions of genomes have proven difficult to map since short reads of 50-100 base pairs (bps) from these regions map to multiple locations in reference genomes. Standard analytical methods discard such multi-mapping reads and the few that can accommodate them are prone to large false positive and negative rates. We developed Perm-seq, a prior-enhanced read allocation method for ChIP-seq experiments, that can allocate multi-mapping reads in highly repetitive regions of the genomes with high accuracy. We comprehensively evaluated Perm-seq, and found that our prior-enhanced approach significantly improves multi-read allocation accuracy over approaches that do not utilize additional data types. The statistical formalism underlying our approach facilitates supervising of multi-read allocation with a variety of data sources including histone ChIP-seq. We applied Perm-seq to 64 ENCODE ChIP-seq datasets from GM12878 and K562 cells and identified many novel protein-DNA interactions in segmental duplication regions. Our analysis reveals that although the protein-DNA interactions sites are evolutionarily less conserved in repetitive regions, they share the overall sequence characteristics of the protein-DNA interactions in non-repetitive regions.
Affiliation(s)
- Xin Zeng
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Bo Li
  - California Institute for Quantitative Biosciences, University of California, Berkeley, California, United States of America
- Rene Welch
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Constanza Rojo
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Ye Zheng
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Colin N. Dewey
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
- Sündüz Keleş
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
10
Hansen P, Hecht J, Ibrahim DM, Krannich A, Truss M, Robinson PN. Saturation analysis of ChIP-seq data for reproducible identification of binding peaks. Genome Res 2015; 25:1391-400. [PMID: 26163319 PMCID: PMC4561497 DOI: 10.1101/gr.189894.115] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 01/21/2015] [Accepted: 07/06/2015] [Indexed: 11/24/2022]
Abstract
Chromatin immunoprecipitation coupled with next-generation sequencing (ChIP-seq) is a powerful technology to identify the genome-wide locations of transcription factors and other DNA binding proteins. Computational ChIP-seq peak calling infers the location of protein–DNA interactions based on various measures of enrichment of sequence reads. In this work, we introduce an algorithm, Q, that uses an assessment of the quadratic enrichment of reads to center candidate peaks followed by statistical analysis of saturation of candidate peaks by 5′ ends of reads. We show that our method not only is substantially faster than several competing methods but also demonstrates statistically significant advantages with respect to reproducibility of results and in its ability to identify peaks with reproducible binding site motifs. We show that Q has superior performance in the delineation of double RNAPII and H3K4me3 peaks surrounding transcription start sites related to a better ability to resolve individual peaks. The method is implemented in C++ and is freely available under an open source license.
Affiliation(s)
- Peter Hansen
  - Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
  - Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Jochen Hecht
  - Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
  - Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Daniel M Ibrahim
  - Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Alexander Krannich
  - Department of Biostatistics, Clinical Research Unit, Berlin Institute of Health, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Matthias Truss
  - Labor für Pädiatrische Molekularbiologie, Charité-Universitätsmedizin Berlin, 10117, Berlin, Germany
- Peter N Robinson
  - Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
  - Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
  - Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
11
Kelley DR, Hendrickson DG, Tenen D, Rinn JL. Transposable elements modulate human RNA abundance and splicing via specific RNA-protein interactions. Genome Biol 2014; 15:537. [PMID: 25572935 PMCID: PMC4272801 DOI: 10.1186/s13059-014-0537-5] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Received: 09/17/2014] [Accepted: 11/07/2014] [Indexed: 12/13/2022]
Abstract
BACKGROUND Transposable elements (TEs) have significantly influenced the evolution of transcriptional regulatory networks in the human genome. Post-transcriptional regulation of human genes by TE-derived sequences has been observed in specific contexts, but has yet to be systematically and comprehensively investigated. Here, we study a collection of 75 CLIP-Seq experiments mapping the RNA binding sites for a diverse set of 51 human proteins to explore the role of TEs in post-transcriptional regulation of human mRNAs and lncRNAs via RNA-protein interactions. RESULTS We detect widespread interactions between RNA binding proteins (RBPs) and many families of TE-derived sequence in the CLIP-Seq data. Further, alignment coverage peaks on specific positions of the TE consensus sequences, illuminating a diversity of TE-specific RBP binding motifs. Evidence of binding and conservation of these motifs in the nonrepetitive transcriptome suggests that TEs have generally appropriated existing sequence preferences of the RBPs. Depletion assays for numerous RBPs show that TE-derived binding sites affect transcript abundance and splicing similarly to nonrepetitive sites. However, in a few cases the effect of RBP binding depends on the specific TE family bound; for example, the ubiquitously expressed RBP HuR confers transcript stability unless bound to an Alu element. CONCLUSIONS Our meta-analysis suggests a widespread role for TEs in shaping RNA-protein regulatory networks in the human genome.
12
Abstract
MOTIVATION In chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and other short-read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighborhood of the alignment locations and ignore the variation in the copy numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads. RESULTS We propose cnvCSEM (CNV-guided ChIP-Seq by expectation-maximization algorithm), a flexible framework that incorporates CNV in multi-read allocation. cnvCSEM eliminates the CNV bias in multi-read allocation by initializing the read allocation algorithm with CNV-aware initial values. Our data-driven simulations illustrate that cnvCSEM leads to higher read coverage with satisfactory accuracy and lower loss in read-depth recovery (estimation). We evaluate the biological relevance of the cnvCSEM-allocated reads and the resultant peaks with the analysis of several ENCODE ChIP-seq datasets. AVAILABILITY AND IMPLEMENTATION Available at http://www.stat.wisc.edu/~qizhang/ CONTACT: qizhang@stat.wisc.edu or keles@stat.wisc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
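One way to picture a "CNV-aware initial value" is to normalize the local read count of each candidate region by its copy number before a multi-read is split across regions. The sketch below is only an illustration of that reading of the abstract, not the cnvCSEM implementation: the function name, the pseudocount, and the default diploid copy number of 2 are all assumptions.

```python
def cnv_aware_initial_weights(candidates, unique_counts, copy_number,
                              pseudocount=1.0, default_copies=2.0):
    """Split one multi-read across candidate regions (hypothetical helper).

    candidates: region ids the read aligns to.
    unique_counts: dict of region id -> number of uniquely mapped reads.
    copy_number: dict of region id -> estimated copy number.
    Returns initial allocation weights that sum to 1.
    """
    raw = []
    for region in candidates:
        count = unique_counts.get(region, 0) + pseudocount
        copies = copy_number.get(region, default_copies)
        # Per-copy read density, so an amplified region does not attract
        # multi-reads merely because more copies of it were sequenced.
        raw.append(count / copies)
    total = sum(raw)
    return [value / total for value in raw]


# Both regions have the same unique-read count, but r1 is amplified (4 copies).
init = cnv_aware_initial_weights(
    ["r1", "r2"],
    unique_counts={"r1": 19, "r2": 19},
    copy_number={"r1": 4.0, "r2": 2.0},
)
```

With equal raw counts, the amplified region receives the smaller initial share, which is the bias correction the abstract describes in words. These values would then seed an iterative allocation step.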
Affiliation(s)
- Qi Zhang
- Department of Biostatistics and Medical Informatics, 425 Henry Mall and Department of Statistics, 1300 University Avenue, Madison, WI 53706, USA
- Sündüz Keleş
- Department of Biostatistics and Medical Informatics, 425 Henry Mall and Department of Statistics, 1300 University Avenue, Madison, WI 53706, USA
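The abstract above describes allocating multi-reads by expectation-maximization, started from CNV-aware initial values. The following is a minimal Python sketch of that general idea, not the cnvCSEM implementation: the function names `cnv_aware_init` and `allocate_multireads` are hypothetical, and dividing unique-read counts by copy number is an assumed way of making the starting weights CNV-aware, since the abstract does not say how they are derived.

```python
from collections import defaultdict


def cnv_aware_init(unique_counts, copy_number, pseudocount=1.0):
    """Hypothetical CNV-aware starting weights (an assumption, not cnvCSEM's rule):
    unique-read count per locus plus a pseudocount, divided by the locus copy number."""
    return {
        locus: (unique_counts.get(locus, 0) + pseudocount) / copy_number[locus]
        for locus in copy_number
    }


def allocate_multireads(reads, init_weights, n_iter=50):
    """Toy EM allocation. `reads` is a list of candidate-locus lists, one per read;
    a uniquely mapped read has a single candidate. Returns, for each read,
    a dict mapping locus to the fraction of the read allocated there."""
    weights = dict(init_weights)
    allocations = []
    for _ in range(n_iter):
        # E-step: split each read across its candidate loci in proportion
        # to the current locus weights.
        allocations = []
        for loci in reads:
            total = sum(weights[locus] for locus in loci)
            allocations.append({locus: weights[locus] / total for locus in loci})
        # M-step: a locus's new weight is its expected read count.
        expected = defaultdict(float)
        for alloc in allocations:
            for locus, frac in alloc.items():
                expected[locus] += frac
        # A tiny pseudocount keeps every weight strictly positive.
        weights = {locus: expected[locus] + 1e-9 for locus in weights}
    return allocations
```

In this sketch the initial weights only matter when the EM iteration has more than one fixed point or is stopped early; a faithful treatment would follow the model described in the paper itself.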
|
13
|
DeVilbiss AW, Sanalkumar R, Johnson KD, Keles S, Bresnick EH. Hematopoietic transcriptional mechanisms: from locus-specific to genome-wide vantage points. Exp Hematol 2014; 42:618-29. [PMID: 24816274 DOI: 10.1016/j.exphem.2014.05.004] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 05/01/2014] [Accepted: 05/04/2014] [Indexed: 12/12/2022]
Abstract
Hematopoiesis is an exquisitely regulated process in which stem cells in the developing embryo and the adult generate progenitor cells that give rise to all blood lineages. Master regulatory transcription factors control hematopoiesis by integrating signals from the microenvironment and dynamically establishing and maintaining genetic networks. One of the most rudimentary aspects of cell type-specific transcription factor function, how they occupy a highly restricted cohort of cis-elements in chromatin, remains poorly understood. Transformative technologic advances involving the coupling of next-generation DNA sequencing technology with the chromatin immunoprecipitation assay (ChIP-seq) have enabled genome-wide mapping of factor occupancy patterns. However, formidable problems remain; notably, ChIP-seq analysis yields hundreds to thousands of chromatin sites occupied by a given transcription factor, and only a fraction of the sites appear to be endowed with critical, non-redundant function. It has become en vogue to map transcription factor occupancy patterns genome-wide, while using powerful statistical tools to establish correlations to inform biology and mechanisms. With the advent of revolutionary genome editing technologies, one can now reach beyond correlations to conduct definitive hypothesis testing. This review focuses on key discoveries that have emerged during the path from single loci to genome-wide analyses, specifically in the context of hematopoietic transcriptional mechanisms.
Affiliation(s)
- Andrew W DeVilbiss
- Carbone Cancer Center, Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA; University of Wisconsin-Madison Blood Research Program, Madison, Wisconsin, USA
- Rajendran Sanalkumar
- Carbone Cancer Center, Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA; University of Wisconsin-Madison Blood Research Program, Madison, Wisconsin, USA
- Kirby D Johnson
- Carbone Cancer Center, Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA; University of Wisconsin-Madison Blood Research Program, Madison, Wisconsin, USA
- Sunduz Keles
- University of Wisconsin-Madison Blood Research Program, Madison, Wisconsin, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA
- Emery H Bresnick
- Carbone Cancer Center, Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA; University of Wisconsin-Madison Blood Research Program, Madison, Wisconsin, USA.
|
14
|
Kim J, Li C, Xie X. Improving read mapping using additional prefix grams. BMC Bioinformatics 2014; 15:42. [PMID: 24499321 PMCID: PMC3927682 DOI: 10.1186/1471-2105-15-42] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 09/12/2013] [Accepted: 02/03/2014] [Indexed: 11/20/2022] Open
Abstract
Background Next-generation sequencing (NGS) enables rapid production of billions of bases at a relatively low cost. Mapping reads from next-generation sequencers to a given reference genome is an important first step in many sequencing applications. Popular read mappers, such as Bowtie and BWA, are optimized to return top one or a few candidate locations of each read. However, identifying all mapping locations of each read, instead of just one or a few, is also important in some sequencing applications such as ChIP-seq for discovering binding sites in repeat regions, and RNA-seq for transcript abundance estimation. Results Here we present Hobbes2, a software package designed for fast and accurate alignment of NGS reads and specialized in identifying all mapping locations of each read. Hobbes2 efficiently identifies all mapping locations of reads using a novel technique that utilizes additional prefix q-grams to improve filtering. We extensively compare Hobbes2 with state-of-the-art read mappers, and show that Hobbes2 can be an order of magnitude faster than other read mappers while consuming less memory space and achieving similar accuracy. Conclusions We propose Hobbes2 to improve the accuracy of read mapping, specialized in identifying all mapping locations of each read. Hobbes2 is implemented in C++, and the source code is freely available for download at http://hobbes.ics.uci.edu.
Affiliation(s)
- Xiaohui Xie
- Department of Computer Science, University of California, Irvine, USA.
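The Hobbes2 abstract above concerns q-gram filtering for finding all mapping locations of a read. The Python sketch below illustrates only the generic pigeonhole q-gram filter that gram-based mappers build on; it is not the additional-prefix-gram technique of Hobbes2, whose details the abstract does not give. The names `build_qgram_index`, `hamming` and `find_all_mappings` are hypothetical, and only Hamming distance is handled.

```python
from collections import defaultdict


def build_qgram_index(reference, q):
    """Map every q-gram of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_all_mappings(read, reference, index, q, max_mismatches):
    """Return every start position where `read` matches `reference`
    with at most `max_mismatches` substitutions.

    Pigeonhole principle: if the read is cut into max_mismatches + 1
    non-overlapping q-grams, any location within the mismatch budget
    must contain at least one of those q-grams exactly."""
    n_grams = max_mismatches + 1
    if n_grams * q > len(read):
        raise ValueError("read too short for this q and mismatch budget")
    candidates = set()
    for g in range(n_grams):
        offset = g * q
        for pos in index.get(read[offset:offset + q], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Verification: keep only candidates that truly satisfy the threshold.
    return sorted(
        s for s in candidates
        if hamming(read, reference[s:s + len(read)]) <= max_mismatches
    )
```

The filter never discards a true location; its cost depends on how many false candidates reach the verification step, which is the part that improved filtering schemes aim to reduce.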
|
15
|
Conway T, Wazny J, Bromage A, Tymms M, Sooraj D, Williams ED, Beresford-Smith B. Xenome--a tool for classifying reads from xenograft samples. Bioinformatics 2013; 28:i172-8. [PMID: 22689758 PMCID: PMC3371868 DOI: 10.1093/bioinformatics/bts236] [Citation(s) in RCA: 189] [Impact Index Per Article: 17.2] [Indexed: 11/13/2022] Open
Abstract
Motivation: Shotgun sequence read data derived from xenograft material contains a mixture of reads arising from the host and reads arising from the graft. Classifying the read mixture to separate the two allows for more precise analysis to be performed. Results: We present a technique, with an associated tool Xenome, which performs fast, accurate and specific classification of xenograft-derived sequence read data. We have evaluated it on RNA-Seq data from human, mouse and human-in-mouse xenograft datasets. Availability: Xenome is available for non-commercial use from http://www.nicta.com.au/bioinformatics Contact: tom.conway@nicta.com.au
Affiliation(s)
- Thomas Conway
- NICTA Victoria Research Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Parkville and Monash Institute of Medical Research, Monash University, Clayton, Australia.
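The Xenome abstract above describes separating graft reads from host reads but does not spell out the method. As a rough illustration of how such a classification could work, the Python sketch below labels each read by the k-mers it shares with two reference sequences. This is an assumed, simplified scheme and not Xenome's actual algorithm; the names `kmers` and `classify_read` and the five class labels are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, graft_kmers, host_kmers, k):
    """Classify a read from the reference k-mer sets it hits.

    'graft'     : hits graft-specific k-mers, no host-specific ones
    'host'      : hits host-specific k-mers, no graft-specific ones
    'ambiguous' : hits k-mers specific to each reference
    'both'      : hits only k-mers shared by the two references
    'neither'   : hits no reference k-mer at all
    """
    read_kmers = kmers(read, k)
    graft_only = read_kmers & (graft_kmers - host_kmers)
    host_only = read_kmers & (host_kmers - graft_kmers)
    shared = read_kmers & graft_kmers & host_kmers
    if graft_only and host_only:
        return "ambiguous"
    if graft_only:
        return "graft"
    if host_only:
        return "host"
    if shared:
        return "both"
    return "neither"
```

Holding whole-genome k-mer sets in Python sets would not scale to human and mouse references; the sketch is meant only to make the classification logic concrete.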
|
16
|
Kelley D, Rinn J. Transposable elements reveal a stem cell-specific class of long noncoding RNAs. Genome Biol 2012. [PMID: 23181609 PMCID: PMC3580499 DOI: 10.1186/gb-2012-13-11-r107] [Citation(s) in RCA: 386] [Impact Index Per Article: 32.2] [Indexed: 01/29/2023] Open
Abstract
Background Numerous studies over the past decade have elucidated a large set of long intergenic noncoding RNAs (lincRNAs) in the human genome. Research since has shown that lincRNAs constitute an important layer of genome regulation across a wide spectrum of species. However, the factors governing their evolution and origins remain relatively unexplored. One possible factor driving lincRNA evolution and biological function is transposable element (TE) insertions. Here, we comprehensively characterize the TE content of lincRNAs relative to genomic averages and protein coding transcripts. Results Our analysis of the TE composition of 9,241 human lincRNAs revealed that, in sharp contrast to protein coding genes, 83% of lincRNAs contain a TE, and TEs comprise 42% of lincRNA sequence. lincRNA TE composition varies significantly from genomic averages - L1 and Alu elements are depleted and broad classes of endogenous retroviruses are enriched. TEs occur in biased positions and orientations within lincRNAs, particularly at their transcription start sites, suggesting a role in lincRNA transcriptional regulation. Accordingly, we observed a dramatic example of HERVH transcriptional regulatory signals correlating strongly with stem cell-specific expression of lincRNAs. Conversely, lincRNAs devoid of TEs are expressed at greater levels than lincRNAs with TEs in all tissues and cell lines, particularly in the testis. Conclusions TEs pervade lincRNAs, dividing them into classes, and may have shaped lincRNA evolution and function by conferring tissue-specific expression from extant transcriptional regulatory signals.
|
17
|
Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X. Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res 2011; 40:e41. [PMID: 22199254 PMCID: PMC3315303 DOI: 10.1093/nar/gkr1246] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Indexed: 01/22/2023] Open
Abstract
Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer amount of reads, and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads, supporting Hamming and edit distance. Hobbes implements two novel techniques, which yield substantial performance improvements: an optimized gram-selection procedure for reads, and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read mapping programs we have tested while maintaining high mapping quality. Hobbes is about five times faster than Bowtie and about 2–10 times faster than BWA, depending on read length and error rate, when asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, respectively. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu.
Affiliation(s)
- Athena Ahmadi
- Department of Computer Science, University of California, Irvine, CA 92697, USA
|