1
Yassi M, Shams Davodly E, Hajebi Khaniki S, Kerachian MA. HBCR_DMR: A Hybrid Method Based on Beta-Binomial Bayesian Hierarchical Model and Combination of Ranking Method to Detect Differential Methylation Regions in Bisulfite Sequencing Data. J Pers Med 2024; 14:361. [PMID: 38672987 PMCID: PMC11051304 DOI: 10.3390/jpm14040361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Revised: 10/20/2023] [Accepted: 01/09/2024] [Indexed: 04/28/2024] Open
Abstract
DNA methylation is a key epigenetic modification involved in gene regulation, contributing to both physiological and pathological conditions. Understanding it more deeply requires a precise comparison of DNA methylation patterns between sample groups that represent distinct statuses. Analysis of differentially methylated regions (DMRs) using computational approaches can help uncover the precise relationships between these phenomena. This paper describes HBCR_DMR, a hybrid method that combines a beta-binomial Bayesian hierarchical model with a combination of ranking methods. In the first stage, we model the true methylation proportions of the CpG sites (CpGs) within the replicates using a beta-binomial distribution parameterized by a group mean and a dispersion parameter. In the second stage, we select distinguishing CpG sites based on their methylation status, employing multiple ranking techniques. Finally, we combine the ranking lists of differentially methylated CpG sites through a voting system. Our analyses, encompassing simulations and real data, show a sensitivity of 0.72, a specificity of 0.89, and an F1 score of 0.76, yielding an overall accuracy of 0.82 and an AUC of 0.94. These findings underscore HBCR_DMR's capacity to distinguish methylated regions and support its utility as a tool for DNA methylation analysis.
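The two ingredients named in this abstract, a beta-binomial model with a group mean and a dispersion, and a vote over several ranking lists, can be illustrated with a minimal Python sketch. The mean/dispersion parameterization (alpha = mu(1 - phi)/phi, beta = (1 - mu)(1 - phi)/phi) and the Borda-style vote are assumptions chosen for illustration, and the function names are hypothetical; this is not the HBCR_DMR implementation.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) for group mean mu and dispersion phi.

    Assumed parameterization (illustrative): alpha + beta = (1 - phi) / phi,
    with 0 < mu < 1 and 0 < phi < 1.
    """
    alpha = mu * (1.0 - phi) / phi
    beta = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def borda_vote(rankings):
    """Merge ranked lists (best item first) by summing Borda points per item."""
    scores = {}
    for ranking in rankings:
        size = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (size - position)
    # Highest total first; ties are broken by item name to stay deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

With this parameterization the expected number of methylated reads is n * mu, while phi controls how much replicates scatter around that mean. A call such as `borda_vote([["cpg1", "cpg2"], ["cpg2", "cpg1"]])` merges two hypothetical ranking lists into one consensus order.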
Affiliation(s)
- Maryam Yassi
  - Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad 9184156815, Iran
  - Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand
  - Department of Pathology, Dunedin School of Medicine, University of Otago, Dunedin 9054, New Zealand
- Ehsan Shams Davodly
  - Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad 9184156815, Iran
- Saeedeh Hajebi Khaniki
  - Student Research Committee, Department of Biostatistics, School of Health, Mashhad University of Medical Sciences, Mashhad 9177948564, Iran
- Mohammad Amin Kerachian
  - Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad 9184156815, Iran
  - Medical Genetics Research Center, Mashhad University of Medical Sciences, Mashhad 9177948564, Iran
  - Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad 9177948564, Iran
  - Department of Chemistry and Biology, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
2
Song B, Buckler ES, Stitzer MC. New whole-genome alignment tools are needed for tapping into plant diversity. Trends in Plant Science 2024; 29:355-369. [PMID: 37749022 DOI: 10.1016/j.tplants.2023.08.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/24/2023] [Revised: 07/19/2023] [Accepted: 08/23/2023] [Indexed: 09/27/2023]
Abstract
Genome alignment is one of the most foundational methods for genome sequence studies. With rapid advances in sequencing and assembly technologies, these newly assembled genomes present challenges for alignment tools to meet the increased complexity and scale. Plant genome alignment is technologically challenging because of frequent whole-genome duplications (WGDs) as well as chromosome rearrangements and fractionation, high nucleotide diversity, widespread structural variation, and high transposable element (TE) activity causing large proportions of repeat elements. We summarize classical pairwise and multiple genome alignment (MGA) methods, and highlight techniques that are widely used or are being developed by the plant research community. We also outline the remaining challenges for precise genome alignment and the interpretation of alignment results in plants.
Affiliation(s)
- Baoxing Song
  - National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong 261325, China; Key Laboratory of Maize Biology and Genetic Breeding in Arid Area of Northwest Region of the Ministry of Agriculture, College of Agronomy, Northwest A&F University, Yangling, Shaanxi 712100, China
- Edward S Buckler
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA; Agricultural Research Service, United States Department of Agriculture, Ithaca, NY 14853, USA
- Michelle C Stitzer
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
3
Fischer J, Schulz MH. Efficiently quantifying DNA methylation for bulk- and single-cell bisulfite data. Bioinformatics 2023; 39:btad386. [PMID: 37326968 PMCID: PMC10310462 DOI: 10.1093/bioinformatics/btad386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Revised: 05/17/2023] [Accepted: 06/14/2023] [Indexed: 06/17/2023] Open
Abstract
MOTIVATION: DNA CpG methylation (CpGm) has proven to be a crucial epigenetic factor in the mammalian gene regulatory system. Assessment of DNA CpG methylation values via whole-genome bisulfite sequencing (WGBS) is, however, computationally extremely demanding. RESULTS: We present FAst MEthylation calling (FAME), the first approach to quantify CpGm values directly from bulk or single-cell WGBS reads without intermediate output files. FAME is very fast but as accurate as standard methods, which first produce BS alignment files before computing CpGm values. We present experiments on bulk and single-cell bisulfite datasets in which we show that data analysis can be sped up significantly, helping to address the current WGBS analysis bottleneck for large-scale datasets without compromising accuracy. AVAILABILITY AND IMPLEMENTATION: An implementation of FAME is open source and licensed under GPL-3.0 at https://github.com/FischerJo/FAME.
Affiliation(s)
- Jonas Fischer
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken 66123, Germany
- Marcel H Schulz
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken 66123, Germany
  - Institute of Cardiovascular Regeneration, Department of Medicine, Goethe University, Frankfurt am Main 60590, Germany
  - Cardio-Pulmonary Institute, Goethe University, Frankfurt am Main, Germany
  - German Centre for Cardiovascular Research, Partner Site Rhein-Main, Frankfurt am Main 60590, Germany
4
Gong T, Borgard H, Zhang Z, Chen S, Gao Z, Deng Y. Analysis and Performance Assessment of the Whole Genome Bisulfite Sequencing Data Workflow: Currently Available Tools and a Practical Guide to Advance DNA Methylation Studies. Small Methods 2022; 6:e2101251. [PMID: 35064762 PMCID: PMC8963483 DOI: 10.1002/smtd.202101251] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 10/06/2021] [Revised: 11/30/2021] [Indexed: 05/09/2023]
Abstract
DNA methylation is associated with transcriptional repression, genomic imprinting, stem cell differentiation, embryonic development, and inflammation. Aberrant DNA methylation can indicate disease states, including cancer and neurological disorders. Therefore, the prevalence and location of 5-methylcytosine in the human genome is a topic of interest. Whole-genome bisulfite sequencing (WGBS) is a high-throughput method for analyzing DNA methylation. This technique involves library preparation, alignment, and quality control. Advancements in epigenetic technology have led to an increase in DNA methylation studies. This review compares the detailed experimental methodology of WGBS using accessible and up-to-date analysis tools. Practical codes for WGBS data processing are included as a general guide to assist progress in DNA methylation studies through a comprehensive case study.
Affiliation(s)
- Ting Gong
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu HI 96813, USA
- Heather Borgard
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu HI 96813, USA
- Zao Zhang
  - Department of Medicine, The Queen’s Medical Center, Honolulu HI 96813, USA
- Shaoqiu Chen
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu HI 96813, USA
- Zitong Gao
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu HI 96813, USA
- Youping Deng
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu HI 96813, USA
5
Okada T, Sun X, McIlfatrick S, St. John JC. Low guanine content and biased nucleotide distribution in vertebrate mtDNA can cause overestimation of non-CpG methylation. NAR Genom Bioinform 2022; 4:lqab119. [PMID: 35047811 PMCID: PMC8759572 DOI: 10.1093/nargab/lqab119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2021] [Revised: 11/24/2021] [Accepted: 01/09/2022] [Indexed: 11/12/2022] Open
Abstract
Mitochondrial DNA (mtDNA) methylation in vertebrates has been hotly debated for over 40 years. Most contrasting results have been reported following bisulfite sequencing (BS-seq) analyses. We addressed whether BS-seq experimental and analysis conditions influenced the estimation of the levels of methylation in specific mtDNA sequences. We found false positive non-CpG methylation in the CHH context (fpCHH) using unmethylated Sus scrofa mtDNA BS-seq data. fpCHH methylation was detected on the top/plus strand of mtDNA within low guanine content regions. These top/plus strand sequences of fpCHH regions would become extremely AT-rich sequences after BS-conversion, whilst bottom/minus strand sequences remained almost unchanged. These unique sequences caused BS-seq aligners to falsely assign the origin of each strand in fpCHH regions, resulting in false methylation calls. fpCHH methylation detection was enhanced by short sequence reads, short library inserts, skewed top/bottom read ratios and non-directional read mapping modes. We confirmed no detectable CHH methylation in fpCHH regions by BS-amplicon sequencing. The fpCHH peaks were located in the D-loop, ATP6, ND2, ND4L, ND5 and ND6 regions and identified in our S. scrofa ovary and oocyte data and human BS-seq data sets. We conclude that non-CpG methylation could potentially be overestimated in specific sequence regions by BS-seq analysis.
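The strand asymmetry described in this abstract can be reproduced with a small in-silico conversion sketch in Python. It assumes complete conversion of fully unmethylated DNA and expresses both strands in top-strand coordinates (C to T for the top strand, G to A for the bottom strand); the example sequence and function names are hypothetical and not taken from the paper.

```python
def bisulfite_convert_top(seq):
    """Unmethylated top strand after conversion: every C is read as T."""
    return seq.upper().replace("C", "T")


def bisulfite_convert_bottom(seq):
    """Unmethylated bottom strand, shown in top-strand coordinates: every G is read as A."""
    return seq.upper().replace("G", "A")


def at_fraction(seq):
    """Fraction of bases in the sequence that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


# Hypothetical guanine-free region, chosen only to illustrate the effect.
region = "ACCTACCATTCCACTA"
top_converted = bisulfite_convert_top(region)        # contains only A and T
bottom_converted = bisulfite_convert_bottom(region)  # identical to the input
```

In a guanine-poor region the converted top strand collapses to a two-letter A/T alphabet while the bottom-strand version is left almost untouched, which is the condition under which the abstract reports aligners assigning reads to the wrong strand.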
6
Lee H. Analysis of Bisulfite Sequencing Data Using Bismark and DMRcaller to Identify Differentially Methylated Regions. Methods Mol Biol 2022; 2443:451-463. [PMID: 35037220 DOI: 10.1007/978-1-0716-2067-0_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The addition of a methyl group to cytosine has been identified as one of several heritable epigenetic mechanisms. In plants, DNA methylation is involved in mediating response to stress, plant development, polyploidy, and domestication through regulation of gene expression. The correlation of epigenetic variation with phenotypic traits expands our understanding of plant evolution and provides a new source for targeted manipulation in crop improvement. To address the increasing interest in mapping the methylation landscape in plant species, this chapter describes methods to analyze bisulfite sequencing data and identify epigenetic variation between samples. We also detail guidelines that highlight possible optimizations, as well as ways to tailor parameters according to data and biological variability.
Affiliation(s)
- HueyTyng Lee
  - Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
7
Sharma M, Verma RK, Kumar S, Kumar V. Computational challenges in detection of cancer using cell-free DNA methylation. Comput Struct Biotechnol J 2021; 20:26-39. [PMID: 34976309 PMCID: PMC8669313 DOI: 10.1016/j.csbj.2021.12.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 09/02/2021] [Revised: 12/02/2021] [Accepted: 12/02/2021] [Indexed: 12/18/2022] Open
Abstract
Cell-free DNA (cfDNA) methylation profiling is considered promising and potentially reliable for liquid biopsy to study the progress of diseases and develop reliable and consistent diagnostic and prognostic biomarkers. Several different mechanisms are responsible for the release of cfDNA into blood plasma, and hence it can provide information regarding dynamic changes in the human body. Due to the fragmented nature and low concentration of cfDNA, and high background noise, there are several challenges in its analysis for regular use in the diagnosis of cancer. These challenges in the analysis of the methylation profile of cfDNA are further aggravated by heterogeneity, biomarker sensitivity, platform biases, and batch effects. This review delineates the origin of cfDNA methylation, its profiling, and associated computational problems in analysis for diagnosis. We also consider the multi-marker approach to handling cancer heterogeneity and explore the utility of markers for 5hmC-based cfDNA methylation patterns. Further, we provide a critical overview of deconvolution and machine learning methods for cfDNA methylation analysis. Our review of current methods reveals the potential for further improvement in analysis strategies for detecting early cancer using cfDNA methylation.
Key Words
- Cancer heterogeneity
- Cell free DNA
- Computation
- DMP, Differentially methylated base position
- DMR, Differentially methylated regions
- Diagnosis
- HELP-seq, HpaII-tiny fragment Enrichment by Ligation-mediated PCR sequencing
- MBD-seq, Methyl-CpG Binding Domain Protein Capture Sequencing
- MCTA-seq, Methylated CpG tandems amplification and sequencing
- MSCC, Methylation Sensitive Cut Counting
- MSRE, methylation sensitive restriction enzymes
- MeDIP-seq, Methylated DNA Immunoprecipitation Sequencing
- RRBS, Reduced-Representation Bisulfite Sequencing
- WGBS, Whole Genome Bisulfite Sequencing
- cfDNA, cell free DNA
- ctDNA, circulating tumor DNA
- dPCR, digital polymerase chain reaction
- ddMCP, droplet digital methylation-specific PCR
- ddPCR, droplet digital polymerase chain reaction
- scCGI, methylated CGIs at single cell level
Affiliation(s)
- Madhu Sharma
  - Department for Computational Biology, Indraprastha Institute of Information Technology, Delhi 110020, India
- Rohit Kumar Verma
  - Department for Computational Biology, Indraprastha Institute of Information Technology, Delhi 110020, India
- Sunil Kumar
  - Department of Surgical Oncology, All India Institute of Medical Sciences, New Delhi 110029, India
- Vibhor Kumar
  - Department for Computational Biology, Indraprastha Institute of Information Technology, Delhi 110020, India
8
Müller R, Nebel M. On the use of sequence-quality information in OTU clustering. PeerJ 2021; 9:e11717. [PMID: 34458017 PMCID: PMC8375510 DOI: 10.7717/peerj.11717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2020] [Accepted: 06/11/2021] [Indexed: 11/20/2022] Open
Abstract
Background: High-throughput sequencing has become an essential technology in life science research. Despite continuous improvements in technology, the produced sequences are still not entirely accurate. Consequently, the sequences are usually equipped with error probabilities. The quality information is already employed to find better solutions to a number of bioinformatics problems (e.g. read mapping). Data processing pipelines benefit in particular (especially when incorporating the quality information early), since enhanced outcomes of one step can improve all subsequent ones. Preprocessing steps, thus, quite regularly consider the sequence quality to fix errors or discard low-quality data. Other steps, however, like clustering sequences into operational taxonomic units (OTUs), a common task in the analysis of microbial communities, are typically performed without making use of the available quality information. Results: In this paper, we present quality-aware clustering methods inspired by quality-weighted alignments and model-based denoising, and explore their applicability to OTU clustering. We implemented the quality-aware methods in a revised version of our de novo clustering tool GeFaST and evaluated their clustering quality and performance on mock-community data sets. Quality-weighted alignments were able to improve the clustering quality of GeFaST by up to 10%. The examination of the model-supported methods provided a more diverse picture, hinting at a narrower applicability, but they were able to attain similar improvements. Considering the quality information increased both runtime and memory consumption, even though the increase of the former depended heavily on the applied method and clustering threshold. Conclusions: The quality-aware methods extend the iterative, de novo clustering approach with new clustering and cluster refinement methods. Our results indicate that OTU clustering constitutes yet another analysis step benefiting from the integration of quality information. Beyond the shown potential, the quality-aware methods offer a range of opportunities for fine-tuning and further extensions.
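The "error probabilities" attached to sequences are usually Phred quality scores, where Q = -10 log10(p). The Python sketch below converts scores to probabilities and shows one way a mismatch could be down-weighted by base quality. The weighting rule and function names are assumptions made for illustration; they are not the scoring functions implemented in GeFaST.

```python
def phred_to_error_prob(q):
    """Error probability for a Phred score q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string that uses the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def quality_weighted_mismatches(seq_a, qual_a, seq_b, qual_b):
    """Sum mismatch costs, each scaled by the chance that both bases are correct.

    Assumed weighting (illustrative only): a mismatch between two
    high-quality bases costs close to 1, while a mismatch that involves
    a low-quality base costs less.
    """
    total = 0.0
    for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b):
        if a != b:
            total += (1 - phred_to_error_prob(qa)) * (1 - phred_to_error_prob(qb))
    return total
```

Under this scheme, two reads that differ only at a position with a low score are treated as closer than two reads that differ at a confidently called position, which is the intuition behind quality-weighted alignment.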
Affiliation(s)
- Robert Müller
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Markus Nebel
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
9
Liu M, Xu Y. An effective method to resolve ambiguous bisulfite-treated reads. BMC Bioinformatics 2021; 22:283. [PMID: 34044763 PMCID: PMC8161933 DOI: 10.1186/s12859-021-04204-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2020] [Accepted: 05/17/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The combination of bisulfite treatment and next-generation sequencing is an important method for methylation analysis, and aligning the bisulfite-treated reads (BS-reads) is the critical step for downstream applications. As bisulfite treatment reduces the complexity of the sequences, a large portion of BS-reads may align ambiguously to multiple locations of the reference genome; these are called multireads. Multireads cannot be used in downstream applications since they are likely to introduce artifacts. To identify the best mapping location of each multiread, existing Bayesian-based methods calculate the probability of the read at each position by considering how it overlaps with uniquely mapped reads. However, [Formula: see text]% of multireads do not overlap with any unique reads and so cannot be resolved by existing methods. RESULTS: Here we propose a novel method (EM-MUL) that not only rescues multireads overlapped with unique reads, but also uses the overall coverage and accurate base-level alignment to resolve multireads that cannot be handled by current methods. We benchmark our method on both simulated and real datasets. Experimental results show that it is able to align more than 80% of multireads to the best mapping position with very high accuracy. CONCLUSIONS: EM-MUL is an effective method designed to accurately determine the best mapping position of multireads in BS-reads. For downstream applications, it is useful for improving the methylation resolution in the repetitive regions of the genome. EM-MUL is freely available at https://github.com/lmylynn/EM-MUL.
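The general idea of assigning multireads by coverage can be sketched as a simple expectation-maximization loop in Python. This is a generic coverage-weighted heuristic written for illustration, not the EM-MUL algorithm, which according to the abstract also uses base-level alignment; the function name, the locus labels, and the pseudocount are all assumptions.

```python
def resolve_multireads(unique_coverage, multireads, iterations=20, pseudocount=0.5):
    """Estimate, for each multiread, a weight for every candidate locus.

    unique_coverage: dict mapping locus -> number of uniquely mapped reads
    multireads: dict mapping read id -> list of distinct candidate loci
    Returns a dict mapping read id -> {locus: weight}, weights summing to 1.
    """
    # Start from a uniform split of each multiread over its candidates.
    weights = {
        read_id: {locus: 1.0 / len(candidates) for locus in candidates}
        for read_id, candidates in multireads.items()
    }
    for _ in range(iterations):
        # Expected coverage = unique reads plus fractional multiread assignments.
        coverage = dict(unique_coverage)
        for read_weights in weights.values():
            for locus, weight in read_weights.items():
                coverage[locus] = coverage.get(locus, 0.0) + weight
        # Re-weight every multiread in proportion to the expected coverage.
        for read_id, candidates in multireads.items():
            raw = {locus: coverage.get(locus, 0.0) + pseudocount for locus in candidates}
            total = sum(raw.values())
            weights[read_id] = {locus: value / total for locus, value in raw.items()}
    return weights
```

A read with candidates at a well-covered locus and an uncovered one, for example `resolve_multireads({"chr1:100": 10, "chr2:500": 0}, {"r1": ["chr1:100", "chr2:500"]})`, ends up with most of its weight on the well-covered locus. Because multireads contribute to the expected coverage themselves, reads that overlap only other multireads can still reinforce each other.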
Affiliation(s)
- Mengya Liu
  - School of Computer Science, University of Science and Technology of China, Hefei, 230027, Anhui, China; Key Laboratory on High Performance Computing of Anhui Province, Hefei, China
- Yun Xu
  - School of Computer Science, University of Science and Technology of China, Hefei, 230027, Anhui, China; Key Laboratory on High Performance Computing of Anhui Province, Hefei, China
10
Nunn A, Otto C, Stadler PF, Langenberger D. Comprehensive benchmarking of software for mapping whole genome bisulfite data: from read alignment to DNA methylation analysis. Brief Bioinform 2021; 22:6146770. [PMID: 33624017 PMCID: PMC8425420 DOI: 10.1093/bib/bbab021] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/10/2020] [Revised: 12/30/2020] [Indexed: 01/23/2023] Open
Abstract
Whole genome bisulfite sequencing is currently at the forefront of epigenetic analysis, facilitating the nucleotide-level resolution of 5-methylcytosine (5mC) on a genome-wide scale. Specialized software has been developed to accommodate the unique difficulties in aligning such sequencing reads to a given reference, building on the knowledge acquired from model organisms such as human or Arabidopsis thaliana. As the field of epigenetics expands its purview to non-model plant species, new challenges arise which bring into question the suitability of previously established tools. Herein, nine short-read aligners are evaluated: Bismark, BS-Seeker2, BSMAP, BWA-meth, ERNE-BS5, GEM3, GSNAP, Last and segemehl. Precision-recall of simulated alignments, in comparison to real sequencing data obtained from three natural accessions, reveals that, on balance, BWA-meth and BSMAP are able to make the best use of the data during mapping. The influence of difficult-to-map regions, characterized by deviations in sequencing depth over repeat annotations, is evaluated in terms of the mean absolute deviation of the resulting methylation calls in comparison to a realistic methylome. Downstream methylation analysis is responsive to the handling of multi-mapping reads relative to mapping quality (MAPQ), and potentially susceptible to bias arising from the increased sequence complexity of densely methylated reads.
Affiliation(s)
- Adam Nunn
  - ecSeq Bioinformatics GmbH, Sternwartenstraße 29, 04103, Saxony, Germany; Institut für Informatik, Universität Leipzig, Härtelstraße 16-18, 04107, Saxony, Germany
- Christian Otto
  - ecSeq Bioinformatics GmbH, Sternwartenstraße 29, 04103, Saxony, Germany
- Peter F Stadler
  - Institut für Informatik, Universität Leipzig, Härtelstraße 16-18, 04107, Saxony, Germany
11
Chung RH, Kang CY. pWGBSSimla: a profile-based whole-genome bisulfite sequencing data simulator incorporating methylation QTLs, allele-specific methylations and differentially methylated regions. Bioinformatics 2020; 36:660-665. [PMID: 31397839 DOI: 10.1093/bioinformatics/btz635] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/24/2018] [Revised: 08/05/2019] [Accepted: 08/08/2019] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION: DNA methylation plays an important role in regulating gene expression. DNA methylation is commonly analyzed using bisulfite sequencing (BS-seq)-based designs, such as whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS) and oxidative bisulfite sequencing (oxBS-seq). Furthermore, there has been growing interest in investigating the roles that genetic variants play in changing the methylation levels (i.e. methylation quantitative trait loci or meQTLs), how methylation regulates the imprinting of gene expression (i.e. allele-specific methylation or ASM) and the differentially methylated regions (DMRs) among different cell types. However, none of the current simulation tools can generate different BS-seq data types (e.g. WGBS, RRBS and oxBS-seq) while modeling meQTLs, ASM and DMRs. RESULTS: We developed the profile-based whole-genome bisulfite sequencing data simulator (pWGBSSimla), which simulates WGBS, RRBS and oxBS-seq data for different cell types based on real data. meQTLs and ASM are modeled based on the block structures of the methylation status at CpGs, whereas the simulation of DMRs is based on observations of methylation rates in real data. We demonstrated that pWGBSSimla adequately simulates data and allows performance comparisons among different methylation analysis methods. AVAILABILITY AND IMPLEMENTATION: pWGBSSimla is available at https://omicssimla.sourceforge.io. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ren-Hua Chung
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 350, Taiwan
- Chen-Yu Kang
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 350, Taiwan
12
Choi J, Chae H. methCancer-gen: a DNA methylome dataset generator for user-specified cancer type based on conditional variational autoencoder. BMC Bioinformatics 2020; 21:181. [PMID: 32393170 PMCID: PMC7216580 DOI: 10.1186/s12859-020-3516-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/10/2019] [Accepted: 04/29/2020] [Indexed: 12/31/2022] Open
Abstract
Background: Recently, DNA methylation has drawn great attention due to its strong correlation with abnormal gene activities and its informative representation of cancer status. As a number of studies focus on DNA methylation signatures in cancer, demand for publicly available methylome datasets has increased. To satisfy this demand, large-scale projects were launched to discover biological insights into cancer, providing collections of datasets. However, public cancer data, especially for certain cancer types, are still too limited for use in research. Several simulation tools for producing epigenetic datasets have been introduced to alleviate the issue; still, to date, generation of datasets for a user-specified cancer type has not been proposed. Results: In this paper, we present methCancer-gen, a tool for generating DNA methylome datasets for a given cancer type. Employing a conditional variational autoencoder, a neural network-based generative model, it estimates the conditional distribution with latent variables and data, and generates samples for the specified cancer type. Conclusions: To evaluate the simulation performance of methCancer-gen for the user-specified cancer type, our proposed model was compared to a benchmark method; it successfully reproduced cancer type-wise data with high accuracy, helping to alleviate the lack of condition-specific data. methCancer-gen is publicly available at https://github.com/cbi-bioinfo/methCancer-gen.
Affiliation(s)
- Joungmin Choi
  - Division of Computer Science, Sookmyung Women's University, Seoul, Republic of Korea
- Heejoon Chae
  - Division of Computer Science, Sookmyung Women's University, Seoul, Republic of Korea
13
Martín B, Pappa S, Díez-Villanueva A, Mallona I, Custodio J, Barrero MJ, Peinado MA, Jordà M. Tissue and cancer-specific expression of DIEXF is epigenetically mediated by an Alu repeat. Epigenetics 2020; 15:765-779. [PMID: 32041475 DOI: 10.1080/15592294.2020.1722398] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/06/2023] Open
Abstract
Alu repeats constitute a major fraction of the human genome and, for a small subset of them, a role in gene regulation has been described. The number of studies focused on the functional characterization of particular Alu elements is very limited. Most Alu elements are DNA methylated and are therefore assumed to lie in repressed chromatin domains. We hypothesize that Alu elements with low or variable DNA methylation are candidates for a functional role. In a genome-wide study in normal and cancer tissues, we pinpointed an Alu repeat (AluSq2) with differential methylation located upstream of the promoter region of the DIEXF gene. DIEXF encodes a highly conserved factor essential for the development of the zebrafish digestive tract. To characterize the contribution of the Alu element to the regulation of DIEXF, we analysed the epigenetic landscapes of the gene promoter and flanking regions in different cell types and cancers. Alternate epigenetic profiles (DNA methylation and histone modifications) of the AluSq2 element were associated with DIEXF transcript diversity as well as protein levels, while the epigenetic profile of the CpG island associated with the DIEXF promoter remained unchanged. These results suggest that AluSq2 might directly contribute to the regulation of DIEXF transcription and protein expression. Moreover, AluSq2 was DNA hypomethylated in different cancer types, pointing to its putative contribution to DIEXF alteration in cancer and its potential as a tumoural biomarker.
Affiliation(s)
- Berta Martín
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
- Stella Pappa
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
- Anna Díez-Villanueva
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
- Izaskun Mallona
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
- Joaquín Custodio
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
- María José Barrero
  - Center for Regenerative Medicine in Barcelona (CMRB), Avinguda de la Granvia de l'Hospitalet, Barcelona, Spain
- Miguel A Peinado
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
- Mireia Jordà
  - Program of Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Barcelona, Spain
14
Rauluseviciute I, Drabløs F, Rye MB. DNA methylation data by sequencing: experimental approaches and recommendations for tools and pipelines for data analysis. Clin Epigenetics 2019; 11:193. [PMID: 31831061 PMCID: PMC6909609 DOI: 10.1186/s13148-019-0795-x] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 10/21/2019] [Accepted: 12/04/2019] [Indexed: 02/06/2023] Open
Abstract
Sequencing technologies have changed not only our approaches to classical genetics, but also the field of epigenetics. Specific methods allow scientists to identify novel genome-wide epigenetic patterns of DNA methylation down to single-nucleotide resolution. DNA methylation is the most researched epigenetic mark involved in various processes in the human cell, including gene regulation and development of diseases, such as cancer. Increasing numbers of DNA methylation sequencing datasets from the human genome are produced using various platforms, from methylated DNA precipitation to whole genome bisulfite sequencing. Many of those datasets are fully accessible for repeated analyses. Sequencing experiments have become routine in laboratories around the world, while analysis of the resulting data is still a challenge for the majority of scientists, since in many cases it requires advanced computational skills. Even though various tools are being created and published, guidelines for their selection are often not clear, especially to non-bioinformaticians with limited experience in computational analyses. Separate tools are often used for individual steps in the analysis, and these can be challenging to manage and integrate. However, in some instances, tools are combined into pipelines that are capable of completing all the essential steps to achieve the result. In the case of DNA methylation sequencing analysis, the goal of such a pipeline is to map sequencing reads, calculate methylation levels, and distinguish differentially methylated positions and/or regions. The objective of this review is to describe basic principles and steps in the analysis of DNA methylation sequencing data that in particular have been used for mammalian genomes, and more importantly to present and discuss the most prominent computational pipelines that can be used to analyze such data. We aim to provide a good starting point for scientists with limited experience in computational analyses of DNA methylation and hydroxymethylation data, and recommend a few tools that are powerful, but still easy enough to use for their own data analysis.
Affiliation(s)
- Ieva Rauluseviciute: Department of Clinical and Molecular Medicine, NTNU - Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway
- Finn Drabløs: Department of Clinical and Molecular Medicine, NTNU - Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway
- Morten Beck Rye: Department of Clinical and Molecular Medicine, NTNU - Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway; Clinic of Surgery, St. Olavs Hospital, Trondheim University Hospital, NO-7030, Trondheim, Norway
|
15
|
Frith MC. How sequence alignment scores correspond to probability models. Bioinformatics 2019; 36:408-415. [PMID: 31329241 PMCID: PMC9883716 DOI: 10.1093/bioinformatics/btz576] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/27/2019] [Revised: 05/31/2019] [Accepted: 07/17/2019] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. RESULTS This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
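As a rough illustration of how scores can be read as probabilities, the sketch below handles only the simplest ungapped case, which is not the paper's full pair-HMM construction. It assumes a hypothetical +1/-1 match/mismatch scheme with uniform base frequencies and solves for the classic scale parameter `lam` such that the implied pair probabilities sum to 1; the "temperature" is then taken to be `1 / lam`. The function and variable names are illustrative, not taken from the paper.

```python
import math

def scale_parameter(scores, background, lo=1e-6, hi=10.0, iterations=200):
    """Solve sum_xy p_x * p_y * exp(lam * S_xy) = 1 for lam > 0 by bisection.

    Assumes the expected score is negative and at least one score is positive,
    so the left-hand side dips below 1 just above lam = 0 and then rises.
    """
    def excess(lam):
        total = sum(
            background[x] * background[y] * math.exp(lam * s)
            for (x, y), s in scores.items()
        )
        return total - 1.0

    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical scoring scheme: +1 for a match, -1 for a mismatch.
bases = "ACGT"
background = {b: 0.25 for b in bases}
scores = {(x, y): (1 if x == y else -1) for x in bases for y in bases}

lam = scale_parameter(scores, background)
temperature = 1.0 / lam

# Probabilities of aligned letter pairs implied by the scores.
pair_probs = {
    (x, y): background[x] * background[y] * math.exp(lam * s)
    for (x, y), s in scores.items()
}
```

For this particular scheme the equation reduces to a quadratic in `exp(lam)`, so `lam` should come out as ln 3. Gap scores are deliberately left out; handling them is where the partition-function treatment described in the abstract is needed.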
|
16
|
Frith MC, Shrestha AMS. A Simplified Description of Child Tables for Sequence Similarity Search. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:2067-2073. [PMID: 29994365 DOI: 10.1109/tcbb.2018.2796064] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
Finding related nucleotide or protein sequences is a fundamental, diverse, and incompletely-solved problem in bioinformatics. It is often tackled by seed-and-extend methods, which first find "seed" matches of diverse types, such as spaced seeds, subset seeds, or minimizers. Seeds are usually found using an index of the reference sequence(s), which stores seed positions in a suffix array or related data structure. A child table is a fundamental way to achieve fast lookup in an index, but previous descriptions have been overly complex. This paper aims to provide a more accessible description of child tables, and demonstrate their generality: they apply equally to all the above-mentioned seed types and more. We also show that child tables can be used without LCP (longest common prefix) tables, reducing the memory requirement.
|
17
|
Corso-Díaz X, Jaeger C, Chaitankar V, Swaroop A. Epigenetic control of gene regulation during development and disease: A view from the retina. Prog Retin Eye Res 2018; 65:1-27. [PMID: 29544768 PMCID: PMC6054546 DOI: 10.1016/j.preteyeres.2018.03.002] [Citation(s) in RCA: 82] [Impact Index Per Article: 13.7] [Received: 09/19/2017] [Revised: 02/01/2018] [Accepted: 03/08/2018] [Indexed: 12/20/2022]
Abstract
Complex biological processes, such as organogenesis and homeostasis, are stringently regulated by genetic programs that are fine-tuned by epigenetic factors to establish cell fates and/or to respond to the microenvironment. Gene regulatory networks that guide cell differentiation and function are modulated and stabilized by modifications to DNA, RNA and proteins. In this review, we focus on two key epigenetic changes - DNA methylation and histone modifications - and discuss their contribution to retinal development, aging and disease, especially in the context of age-related macular degeneration (AMD) and diabetic retinopathy. We highlight less-studied roles of DNA methylation and provide the RNA expression profiles of epigenetic enzymes in human and mouse retina in comparison to other tissues. We also review computational tools and emergent technologies to profile, analyze and integrate epigenetic information. We suggest implementation of editing tools and single-cell technologies to trace and perturb the epigenome for delineating its role in transcriptional regulation. Finally, we present our thoughts on exciting avenues for exploring epigenome in retinal metabolism, disease modeling, and regeneration.
Affiliation(s)
- Ximena Corso-Díaz: Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Catherine Jaeger: Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Vijender Chaitankar: Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Anand Swaroop: Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, Bethesda, MD, 20892, USA
|
18
|
Abstract
Background: Reliable detection of genome variations, especially insertions and deletions (indels), from single sample DNA sequencing data remains challenging, partially due to the inherent uncertainty involved in aligning sequencing reads to the reference genome. In practice a variety of ad hoc quality filtering methods are employed to produce more reliable lists of putative variants, but the resulting lists typically still include numerous false positives. Thus it would be desirable to be able to rigorously evaluate the degree to which each putative variant is supported by the data. Unfortunately, users who wish to do this, e.g. for the purpose of prioritizing validation experiments, have been faced with limited options. Results: Here we present EAGLE, a method for evaluating the degree to which sequencing data supports a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual's genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis. In comparison with methods which rely heavily on a particular alignment of the reads to the reference genome, EAGLE readily accounts for uncertainties that may arise from multi-mapping or local misalignment and uses the entire length of each read. We compared the scores assigned by several well-known variant callers to EAGLE for the task of ranking true putative variants on both simulated data and real genome sequencing based benchmarks. For indels, EAGLE obtained marked improvement on simulated data and a whole genome sequencing benchmark, and modest but statistically significant improvement on an exome sequencing benchmark. Conclusions: EAGLE ranked true variants higher than the scores reported by the callers and can be used to improve specificity in variant calling. EAGLE is freely available at https://github.com/tony-kuo/eagle.
Electronic supplementary material The online version of this article (10.1186/s12920-018-0342-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Tony Kuo: Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo, 135-0064, Japan; AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo, 152-8550, Japan
- Martin C Frith: Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo, 135-0064, Japan; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8562, Japan; AIST-Waseda CBBD-OIL, 3-4-1 Ookubo, Shinjuku-ku, Tokyo, 169-8555, Japan
- Jun Sese: Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo, 135-0064, Japan; AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo, 152-8550, Japan
- Paul Horton: Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo, 135-0064, Japan; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8562, Japan
|
19
|
Yassi M, Shams Davodly E, Mojtabanezhad Shariatpanahi A, Heidari M, Dayyani M, Heravi-Moussavi A, Moattar MH, Kerachian MA. DMRFusion: A differentially methylated region detection tool based on the ranked fusion method. Genomics 2018; 110:366-374. [PMID: 29309841 DOI: 10.1016/j.ygeno.2017.12.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 09/11/2017] [Revised: 11/05/2017] [Accepted: 12/11/2017] [Indexed: 12/11/2022]
Abstract
DNA methylation is an important epigenetic modification involved in many biological processes and diseases. Computational analysis of differentially methylated regions (DMRs) could explore the underlying reasons of methylation. DMRFusion is presented as a useful tool for comprehensive DNA methylation analysis of DMRs on methylation sequencing data. This tool is designed based on the integration of several ranking methods: information gain, between versus within class scatter ratio, Fisher ratio, Z-score and Welch's t-test. In this study, DMRFusion on reduced representation bisulfite sequencing (RRBS) data in chronic lymphocytic leukemia displayed 30 nominated regions and CpG sites with a maximum methylation difference detected in the hypermethylation DMRs. We found that DMRFusion is able to process methylation sequencing data in an efficient and accurate manner and to provide annotation and visualization for DMRs with high fold difference score (p-value and FDR<0.05 and type I error: 0.04).
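The general idea of fusing several per-site rankings can be sketched as below. This is a minimal illustration, not DMRFusion's code: it implements only two of the listed statistics (Welch's t and the Fisher ratio), adds a plain mean difference as a stand-in third scorer, and fuses by average rank, which is an assumed fusion rule. The site names and methylation values are invented.

```python
from statistics import mean, variance

EPS = 1e-12  # guards against division by zero when a group has no variance

def welch_t(a, b):
    """Absolute Welch t statistic for two groups of methylation levels."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / (se + EPS)

def fisher_ratio(a, b):
    """Squared mean difference over the summed within-group variances."""
    return (mean(a) - mean(b)) ** 2 / (variance(a) + variance(b) + EPS)

def mean_difference(a, b):
    """Stand-in scorer (an assumption, not one of the paper's listed methods)."""
    return abs(mean(a) - mean(b))

def ranks(scores):
    """Map each site to its rank, where rank 1 is the highest score."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {site: i + 1 for i, site in enumerate(order)}

def fuse(sites, scorers=(welch_t, fisher_ratio, mean_difference)):
    """Order sites by their average rank across all scorers (best first)."""
    per_method = [
        ranks({site: scorer(a, b) for site, (a, b) in sites.items()})
        for scorer in scorers
    ]
    fused = {site: mean(r[site] for r in per_method) for site in sites}
    return sorted(sites, key=fused.get)

# Hypothetical methylation proportions: (group 1 replicates, group 2 replicates).
sites = {
    "cpg_1": ([0.10, 0.12, 0.11], [0.80, 0.85, 0.82]),
    "cpg_2": ([0.50, 0.55, 0.45], [0.52, 0.48, 0.50]),
    "cpg_3": ([0.30, 0.35, 0.25], [0.60, 0.50, 0.70]),
}
ordering = fuse(sites)
```

Averaging ranks rather than raw scores keeps statistics on very different scales from dominating one another, which is one plausible reason to fuse at the rank level.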
Affiliation(s)
- Maryam Yassi: Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad, Iran
- Ehsan Shams Davodly: Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad, Iran
- Mehdi Heidari: Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad, Iran
- Mahdieh Dayyani: Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad, Iran
- Alireza Heravi-Moussavi: Canada's Michael Smith Genome Sciences Center, BC Cancer Agency, Vancouver, British Columbia, Canada
- Mohammad Amin Kerachian: Cancer Genetics Research Unit, Reza Radiotherapy and Oncology Center, Mashhad, Iran; Cancer Genetics Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
|
20
|
Strategies for analyzing bisulfite sequencing data. J Biotechnol 2017; 261:105-115. [PMID: 28822795 DOI: 10.1016/j.jbiotec.2017.08.007] [Citation(s) in RCA: 81] [Impact Index Per Article: 11.6] [Received: 02/17/2017] [Revised: 08/07/2017] [Accepted: 08/08/2017] [Indexed: 01/10/2023]
Abstract
DNA methylation is one of the main epigenetic modifications in the eukaryotic genome; it has been shown to play a role in cell-type specific regulation of gene expression, and therefore cell-type identity. Bisulfite sequencing is the gold-standard for measuring methylation over the genomes of interest. Here, we review several techniques used for the analysis of high-throughput bisulfite sequencing. We introduce specialized short-read alignment techniques as well as pre/post-alignment quality check methods to ensure data quality. Furthermore, we discuss subsequent analysis steps after alignment. We introduce various differential methylation methods and compare their performance using simulated and real bisulfite sequencing datasets. We also discuss the methods used to segment methylomes in order to pinpoint regulatory regions. We introduce annotation methods that can be used for further classification of regions returned by segmentation and differential methylation methods. Finally, we review software packages that implement strategies to efficiently deal with large bisulfite sequencing datasets locally and we discuss online analysis workflows that do not require any prior programming skills. The analysis strategies described in this review will guide researchers at any level to the best practices of bisulfite sequencing analysis.
|
21
|
Han Y, He X. Integrating Epigenomics into the Understanding of Biomedical Insight. Bioinform Biol Insights 2016; 10:267-289. [PMID: 27980397 PMCID: PMC5138066 DOI: 10.4137/bbi.s38427] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 06/17/2016] [Revised: 11/01/2016] [Accepted: 11/06/2016] [Indexed: 12/13/2022] Open
Abstract
Epigenetics is one of the most rapidly expanding fields in biomedical research, and the popularity of high-throughput next-generation sequencing (NGS) highlights the accelerating speed of epigenomics discovery over the past decade. Epigenetics studies the heritable phenotypes resulting from chromatin changes without alteration of the DNA sequence. Epigenetic factors and their interactive network regulate almost all of the fundamental biological processes, and incorrect epigenetic information may lead to complex diseases. A comprehensive, genome-wide understanding of epigenetic mechanisms, their interactions, and their alterations in health and disease has become a priority in biological research. Bioinformatics is expected to make a remarkable contribution to this purpose, especially in processing and interpreting the large-scale NGS datasets. In this review, we introduce the pioneering achievements of epigenetics in health and complex diseases; next, we give a systematic review of epigenomics data generation, summarize public resources and integrative analysis approaches, and finally outline the challenges and future directions in computational epigenomics.
Affiliation(s)
- Yixing Han: Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, USA; Present address: Genetics and Biochemistry Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Ximiao He: Laboratory of Metabolism, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA; Present address: Department of Medical Genetics, School of Basic Medicine, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
|
22
|
Tsuji J, Weng Z. Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data. Brief Bioinform 2016; 17:938-952. [PMID: 26628557 PMCID: PMC5142012 DOI: 10.1093/bib/bbv103] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 08/14/2015] [Revised: 10/02/2015] [Indexed: 01/03/2023] Open
Abstract
Cytosine methylation regulates many biological processes such as gene expression, chromatin structure and chromosome stability. The whole genome bisulfite sequencing (WGBS) technique measures the methylation level at each cytosine throughout the genome. There are an increasing number of publicly available pipelines for analyzing WGBS data, reflecting many choices of read mapping algorithms as well as preprocessing and postprocessing methods. We simulated single-end and paired-end reads based on three experimental data sets, and comprehensively evaluated 192 combinations of three preprocessing, five postprocessing and five widely used read mapping algorithms. We also compared paired-end data with single-end data at the same sequencing depth for performance of read mapping and methylation level estimation. Bismark and LAST were the most robust mapping algorithms. We found that Mott trimming and quality filtering individually improved the performance of both read mapping and methylation level estimation, but combining them did not lead to further improvement. Furthermore, we confirmed that paired-end sequencing reduced error rate and enhanced sensitivity for both read mapping and methylation level estimation, especially for short reads and in repetitive regions of the human genome.
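Mott trimming, mentioned above as one of the preprocessing options, is commonly described as keeping the read segment that maximizes the running sum of (error limit minus per-base error probability). The sketch below follows that common description; the default limit of 0.05 and the function name are assumptions, and the exact variant and parameters evaluated in the study may differ.

```python
def mott_trim(qualities, limit=0.05):
    """Return (start, end) of the segment to keep, as a half-open interval.

    Each base scores limit - 10**(-Q/10). The running sum is reset to zero
    whenever it is not positive, and the kept segment ends where the running
    sum peaks. Returns (0, 0) if no base is good enough to keep.
    """
    best_sum = 0.0
    best_span = (0, 0)
    running = 0.0
    start = 0
    for i, q in enumerate(qualities):
        running += limit - 10 ** (-q / 10)
        if running <= 0.0:
            running = 0.0
            start = i + 1          # a new candidate segment starts after base i
        elif running > best_sum:
            best_sum = running
            best_span = (start, i + 1)
    return best_span

# Hypothetical Phred scores: poor bases at both ends of the read.
quals = [2, 2, 30, 30, 30, 30, 2, 2]
read = "NNACGTNN"
start, end = mott_trim(quals)
trimmed = read[start:end]
```

Unlike a fixed-threshold cut from the 3' end, this keeps a single contiguous high-quality segment and can trim both ends in one pass.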
|
23
|
Abstract
Aberrant DNA methylation is considered to be one of the most common hallmarks of cancer. Several recent advances in assessing the DNA methylome provide great promise for deciphering cancer-specific DNA methylation patterns. Herein, we present the current key technologies used for high-throughput, genome-wide detection of DNA methylation, and the available cancer-associated methylation databases. Additionally, we focus on the computational methods for preprocessing, analyzing and interpreting cancer methylome data. We not only discuss the challenges of differentially methylated region calling and prediction model construction but also highlight biomarker investigation for cancer diagnosis, prognosis and response to treatment. Finally, some emerging challenges in the computational analysis of cancer methylome data are summarized.
|
24
|
Baheti S, Kanwar R, Goelzenleuchter M, Kocher JPA, Beutler AS, Sun Z. Targeted alignment and end repair elimination increase alignment and methylation measure accuracy for reduced representation bisulfite sequencing data. BMC Genomics 2016; 17:149. [PMID: 26922377 PMCID: PMC4769831 DOI: 10.1186/s12864-016-2494-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/28/2015] [Accepted: 02/17/2016] [Indexed: 11/10/2022] Open
Abstract
Background: DNA methylation is an important epigenetic modification involved in many biological processes. Reduced representation bisulfite sequencing (RRBS) is a cost-effective method for studying DNA methylation at single base resolution. Although several tools are available for RRBS data processing and analysis, it is not clear which strategy performs the best, and little attention has been paid to contamination from artificial cytosines incorporated during the end repair step of library preparation. To address these issues, we describe a new method, Targeted Alignment and Artificial Cytosine Elimination for RRBS (TRACE-RRBS), which aligns bisulfite sequence reads to an MspI digitally digested reference and specifically removes the end repair cytosines. We compared this approach on a simulated and a real dataset with 7 other RRBS analysis tools and the Illumina 450K microarray platform. Results: TRACE-RRBS aligns sequence reads to the small fraction of the genome that the RRBS protocol targets and was demonstrated to be the fastest, most sensitive and most specific tool for the simulated dataset. For the real dataset, TRACE-RRBS took about the same time as RRBSMAP, and a third to a sixth of the time needed for BISMARK and NOVOALIGN. TRACE-RRBS aligned more reads uniquely than other tools and achieved the highest correlation with 450K microarray data. The end repair artificial cytosine removal increased correlation between nearby CpGs and accuracy of methylation quantification. Conclusions: TRACE-RRBS is a fast and more accurate tool for RRBS data analysis. It is freely available for academic use at http://bioinformaticstools.mayo.edu/. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2494-8) contains supplementary material, which is available to authorized users.
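Building a digitally digested reference can be illustrated with a small in silico MspI digest. MspI recognizes CCGG and cuts after the first C; the sketch below records those cut positions on one strand and keeps fragments bounded by two cut sites whose length falls in a size-selection window. The 40 to 220 bp default window and the function name are assumptions for illustration, not values taken from TRACE-RRBS.

```python
import re

def mspi_fragments(sequence, min_len=40, max_len=220):
    """Return (start, end) half-open intervals of size-selected MspI fragments.

    Only fragments flanked by MspI sites on both sides are reported; the
    pieces before the first site and after the last site are ignored.
    """
    seq = sequence.upper()
    # MspI cuts C^CGG, so each cut falls one base into the recognition site.
    cuts = [match.start() + 1 for match in re.finditer("CCGG", seq)]
    return [
        (start, end)
        for start, end in zip(cuts, cuts[1:])
        if min_len <= end - start <= max_len
    ]

# Tiny hypothetical sequence with two MspI sites, using a narrow window.
toy = "AACCGGTTTTCCGGAA"
fragments = mspi_fragments(toy, min_len=5, max_len=10)
pieces = [toy[s:e] for s, e in fragments]
```

Aligning reads only against such fragments shrinks the search space considerably, which fits the speed advantage reported in the abstract.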
Affiliation(s)
- Saurabh Baheti: Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, 55905, USA
- Rahul Kanwar: Department of Medical Oncology, Mayo Clinic, Rochester, MN, 55905, USA
- Meike Goelzenleuchter: Department of Medical Oncology, Mayo Clinic, Rochester, MN, 55905, USA; Charité - Universitaetsmedizin Berlin, Berlin, Germany
- Jean-Pierre A Kocher: Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, 55905, USA
- Andreas S Beutler: Department of Medical Oncology, Mayo Clinic, Rochester, MN, 55905, USA
- Zhifu Sun: Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, 55905, USA
|
25
|
Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015; 16 Suppl 12:S3. [PMID: 26681544 PMCID: PMC4682380 DOI: 10.1186/1471-2164-16-s12-s3] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Detection of differential methylation between biological samples is an important task in bisulfite-seq data analysis. Several studies have attempted de novo finding of differentially methylated regions (DMRs) using hidden Markov models (HMMs). However, there is room for improvement in the design of HMMs, especially on emission functions that evaluate the likelihood of differential methylation at each cytosine site. RESULTS We describe a new HMM for DMR detection from bisulfite-seq data. Our method utilizes emission functions that combine binomial models for aligned read counts, and beta mixtures for incorporating genome-wide methylation level distributions. We also develop unsupervised learning algorithms to adjust parameters of the beta-binomial models depending on differential methylation types (up, down, and not changed). In experiments on both simulated and real datasets, the new HMM improves DMR detection accuracy compared with HMMs in our previous study. Furthermore, our method achieves better accuracy than other methods using Fisher's exact test and methylation level smoothing. CONCLUSIONS Our method enables accurate DMR detection from bisulfite-seq data. The implementation of our method is named ComMet, and distributed as a part of Bisulfighter package, which is available at http://epigenome.cbrc.jp/bisulfighter.
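Combining a binomial read-count model with a beta distribution over methylation levels gives a beta-binomial probability, and a beta mixture then gives a mixture of beta-binomials. The sketch below shows that calculation for a single cytosine site; it is not ComMet's emission function, and the mixture weights and shape parameters are hypothetical values chosen to mimic a bimodal methylome.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated reads out of n) when the level follows Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

def mixture_emission(k, n, components):
    """Mixture of beta-binomials; components are (weight, alpha, beta) tuples."""
    return sum(w * beta_binomial_pmf(k, n, a, b) for w, a, b in components)

# Hypothetical mixture: mostly unmethylated sites plus a highly methylated mode.
components = [(0.6, 0.5, 5.0), (0.4, 5.0, 0.5)]
coverage = 10
probabilities = [mixture_emission(k, coverage, components) for k in range(coverage + 1)]
```

Working with `lgamma` keeps the computation stable at high coverage, where the binomial coefficient and Beta functions would otherwise overflow or underflow.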
|
26
|
Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2015; 17:796-807. [DOI: 10.1093/bib/bbv095] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 06/29/2015] [Indexed: 12/24/2022] Open
|
27
|
Hénaff E, Zapata L, Casacuberta JM, Ossowski S. Jitterbug: somatic and germline transposon insertion detection at single-nucleotide resolution. BMC Genomics 2015; 16:768. [PMID: 26459856 PMCID: PMC4603299 DOI: 10.1186/s12864-015-1975-5] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 05/13/2015] [Accepted: 10/02/2015] [Indexed: 11/20/2022] Open
Abstract
Background: Transposable elements are major players in genome evolution. Transposon insertion polymorphisms can translate into phenotypic differences in plants and animals and are linked to different diseases including human cancer, making their characterization highly relevant to the study of genome evolution and genetic diseases. Results: Here we present Jitterbug, a novel tool that identifies transposable element insertion sites at single-nucleotide resolution based on the paired-end mapping and clipped-read signatures produced by NGS alignments. Jitterbug can be easily integrated into existing NGS analysis pipelines, using the standard BAM format produced by frequently applied alignment tools (e.g. bwa, bowtie2), with no need to realign reads to a set of consensus transposon sequences. Jitterbug is highly sensitive and able to recall transposon insertions with a very high specificity, as demonstrated by benchmarks in the human and Arabidopsis genomes, and validation using long PacBio reads. In addition, Jitterbug estimates the zygosity of transposon insertions with high accuracy and can also identify somatic insertions. Conclusions: We demonstrate that Jitterbug can identify mosaic somatic transposon movement using sequenced tumor-normal sample pairs and allows for estimating the cancer cell fraction of clones containing a somatic TE insertion. We suggest that the independent methods we use to evaluate performance are a step towards creating a gold standard dataset for benchmarking structural variant prediction tools. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1975-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Elizabeth Hénaff: Genomic and Epigenomic Variation in Disease Group, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003, Barcelona, Spain; Center for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Barcelona, Spain; current address: Weill Cornell Medical College, Institute for Computational Biomedicine, 1305 York Avenue, New York, NY, 10021, USA
- Luís Zapata: Genomic and Epigenomic Variation in Disease Group, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Josep M Casacuberta: Center for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Barcelona, Spain
- Stephan Ossowski: Genomic and Epigenomic Variation in Disease Group, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
|
28
|
Sheetlin S, Park Y, Frith MC, Spouge JL. ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 2015; 32:304-5. [PMID: 26428291 DOI: 10.1093/bioinformatics/btv575] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/21/2015] [Accepted: 09/28/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for an arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. AVAILABILITY AND IMPLEMENTATION To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, the C++ source code is available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. CONTACT spouge@nih.gov SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sergey Sheetlin: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
- Yonil Park: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
- Martin C Frith: Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan
- John L Spouge: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
|
29
|
Břinda K, Boeva V, Kucherov G. RNF: a general framework to evaluate NGS read mappers. Bioinformatics 2015; 32:136-9. [PMID: 26353839 PMCID: PMC4681991 DOI: 10.1093/bioinformatics/btv524] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/20/2015] [Accepted: 08/31/2015] [Indexed: 11/12/2022] Open
Abstract
Motivation: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. Results: To overcome this obstacle, we have created a generic format, Read Naming Format (RNF), for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package, RnfTools, containing two principal components. MIShmash applies one of several popular read simulators (among DwgSim, Art, Mason, CuReSim, etc.) and transforms the generated reads into RNF format. LAVEnder then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities, which serve to parametrize ROC curves, and to evaluating the effect of read sample contamination. Availability and implementation: RnfTools: http://karel-brinda.github.io/rnftools Spec. of RNF: http://karel-brinda.github.io/rnf-spec Contact: karel.brinda@univ-mlv.fr
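The idea of parametrizing an evaluation curve by mapping quality can be illustrated with a minimal Python sketch. It assumes each simulated read has already been reduced to a pair (mapping quality, whether the mapper recovered the simulated origin); the function name `roc_by_mapq` and the toy data are hypothetical, and the sketch neither parses the RNF format nor reproduces LAVEnder.

```python
from typing import List, Tuple


def roc_by_mapq(reads: List[Tuple[int, bool]]) -> List[Tuple[int, int, int]]:
    """Return (threshold, correct, incorrect) for each distinct MAPQ threshold.

    `reads` holds (mapping_quality, mapped_to_true_origin) per simulated read.
    A read is counted at threshold t when its mapping quality is >= t.
    """
    thresholds = sorted({q for q, _ in reads}, reverse=True)
    curve = []
    for t in thresholds:
        kept = [ok for q, ok in reads if q >= t]
        correct = sum(kept)
        curve.append((t, correct, len(kept) - correct))
    return curve


# Toy data (hypothetical): MAPQ and whether the simulated origin was recovered.
toy = [(60, True), (60, True), (30, True), (30, False), (0, False)]
curve = roc_by_mapq(toy)
```

Lowering the threshold admits more reads, so both counts can only grow; plotting correct against incorrect counts over all thresholds gives a ROC-like curve for one mapper.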
Affiliation(s)
- Karel Břinda
- LIGM/CNRS, Université Paris-Est, 77454 Marne-la-Vallée, France
- Valentina Boeva
- Inserm, U900, Bioinformatics, Biostatistics, Epidemiology and Computational Systems Biology of Cancer, 75248 Paris, France; Institut Curie, Centre de Recherche, 26 rue d'Ulm, 75248 Paris, France; and Mines ParisTech, 77300 Fontainebleau, France
|
30
|
Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014; 42:e45. [PMID: 24423865 PMCID: PMC3973284 DOI: 10.1093/nar/gkt1373] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Indexed: 11/30/2022] Open
Abstract
Analysis of bisulfite sequencing data usually requires two tasks: to call methylated cytosines (mCs) in a sample, and to detect differentially methylated regions (DMRs) between paired samples. Although numerous tools have been proposed for mC calling, methods for DMR detection have been largely limited. Here, we present Bisulfighter, a new software package for detecting mCs and DMRs from bisulfite sequencing data. Bisulfighter combines the LAST alignment tool for mC calling, and a novel framework for DMR detection based on hidden Markov models (HMMs). Unlike previous attempts that depend on empirical parameters, Bisulfighter can use the expectation-maximization algorithm for HMMs to adjust parameters for each data set. We conduct extensive experiments in which accuracy of mC calling and DMR detection is evaluated on simulated data with various mC contexts, read qualities, sequencing depths and DMR lengths, as well as on real data from a wide range of biological processes. We demonstrate that Bisulfighter consistently achieves better accuracy than other published tools, providing greater sensitivity for mCs with fewer false positives, more precise estimates of mC levels, more exact locations of DMRs and better agreement of DMRs with gene expression and DNase I hypersensitivity. The source code is available at http://epigenome.cbrc.jp/bisulfighter.
Affiliation(s)
- Yutaka Saito
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan; Japan Science and Technology Agency, CREST, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan; and Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 55 Lake Avenue North, Worcester, MA 01655, USA
|
31
|
Shrestha AMS, Frith MC, Horton P. A bioinformatician's guide to the forefront of suffix array construction algorithms. Brief Bioinform 2014; 15:138-54. [PMID: 24413184 PMCID: PMC3956071 DOI: 10.1093/bib/bbt081] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022] Open
Abstract
The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support 'spaced seeds' and 'subset seeds' used in many biological applications.
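For readers new to the data structure, the following Python sketch shows what a suffix array is and how it supports exact pattern lookup. It deliberately uses a naive construction (sorting all suffixes), not SA-IS, and it does not implement DisLex or spaced seeds; the function names and the toy sequence are illustrative assumptions.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by sorting suffixes; SA-IS achieves linear time instead.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """All start positions of `pattern` in `text`, via binary search on `sa`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


genome = "ACGTACGA"  # toy sequence
sa = suffix_array(genome)
hits = find_occurrences(genome, sa, "ACG")
```

Because all suffixes that begin with the pattern are adjacent in the sorted order, two binary searches are enough to delimit every occurrence.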
|
32
|
Hong C, Clement NL, Clement S, Hammoud SS, Carrell DT, Cairns BR, Snell Q, Clement MJ, Johnson WE. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013; 14:337. [PMID: 24261665 PMCID: PMC3924334 DOI: 10.1186/1471-2105-14-337] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/07/2013] [Accepted: 11/19/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: DNA methylation has been linked to many important biological phenomena. Researchers have recently begun to sequence bisulfite-treated DNA to determine its pattern of methylation. However, sequencing reads from bisulfite-converted DNA can vary significantly from the reference genome because of incomplete bisulfite conversion, genome variation, sequencing errors, and poor-quality bases. Therefore, it is often difficult to align reads to the correct locations in the reference genome. Furthermore, bisulfite sequencing experiments have the additional complexity of having to estimate the DNA methylation levels within the sample. RESULTS: Here, we present a highly accurate probabilistic algorithm, which is an extension of the Genomic Next-generation Universal MAPper to accommodate bisulfite sequencing data (GNUMAP-bs), that addresses the computational problems associated with aligning bisulfite sequencing data to a reference genome. GNUMAP-bs integrates uncertainty from read and mapping qualities to help resolve the difference between poor-quality bases and the ambiguity inherent in bisulfite conversion. We tested GNUMAP-bs and other commonly used bisulfite alignment methods using both simulated and real bisulfite reads and found that GNUMAP-bs and other dynamic programming methods were more accurate than the more heuristic methods. CONCLUSIONS: The GNUMAP-bs aligner is a highly accurate alignment approach for processing the data from bisulfite sequencing experiments. The GNUMAP-bs algorithm is freely available for download at: http://dna.cs.byu.edu/gnumap. The software runs on multiple threads and multiple processors to increase the alignment speed.
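The alignment ambiguity mentioned in this abstract can be made concrete with a small Python sketch. It assumes forward-strand reads, complete conversion, and no sequencing errors, and it is not the GNUMAP-bs algorithm; the helper names `bisulfite_convert` and `matches_bisulfite` and the toy sequences are hypothetical.

```python
from typing import Set


def bisulfite_convert(seq: str, methylated: Set[int]) -> str:
    """Convert every unmethylated C to T (forward strand, full conversion assumed)."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def matches_bisulfite(read: str, ref: str) -> bool:
    """True if `read` could derive from `ref`: a read T may pair with a reference C."""
    return len(read) == len(ref) and all(
        r == g or (r == "T" and g == "C") for r, g in zip(read, ref)
    )


reference = "ACGTCCGA"
# Only the C at index 1 is methylated, so the Cs at indices 4 and 5 become T.
read = bisulfite_convert(reference, methylated={1})
# A different locus that the same read is also compatible with.
decoy = "ACGTTCGA"
ambiguous = matches_bisulfite(read, reference) and matches_bisulfite(read, decoy)
```

Since a T in the read is compatible with either T or C in the reference, one read can match several loci equally well, which is why an aligner may need additional evidence such as base and mapping qualities.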
Affiliation(s)
- Changjin Hong
- Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA
- Nathan L Clement
- Department of Computer Science, University of Texas, Austin, TX, USA
- Spencer Clement
- Department of Computer Science, Brigham Young University, Provo, UT, USA
- Saher Sue Hammoud
- IVF and Andrology Laboratories, Departments of Surgery, Obstetrics and Gynecology, and Physiology, University of Utah School of Medicine, Salt Lake City, UT, USA
- Department of Oncological Sciences, Huntsman Cancer Institute, Salt Lake City, UT, USA
- Douglas T Carrell
- IVF and Andrology Laboratories, Departments of Surgery, Obstetrics and Gynecology, and Physiology, University of Utah School of Medicine, Salt Lake City, UT, USA
- Bradley R Cairns
- Department of Oncological Sciences, Huntsman Cancer Institute, Salt Lake City, UT, USA
- Quinn Snell
- Department of Computer Science, Brigham Young University, Provo, UT, USA
- Mark J Clement
- Department of Computer Science, Brigham Young University, Provo, UT, USA
- William Evan Johnson
- Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA
|
33
|
Barturen G, Rueda A, Oliver JL, Hackenberg M. MethylExtract: High-Quality methylation maps and SNV calling from whole genome bisulfite sequencing data. F1000Res 2013; 2:217. [PMID: 24627790 PMCID: PMC3938178 DOI: 10.12688/f1000research.2-217.v2] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Accepted: 02/19/2014] [Indexed: 01/10/2023] Open
Abstract
Whole genome methylation profiling at a single cytosine resolution is now feasible due to the advent of high-throughput sequencing techniques together with bisulfite treatment of the DNA. To obtain the methylation value of each individual cytosine, the bisulfite-treated sequence reads are first aligned to a reference genome, and then the profiling of the methylation levels is done from the alignments. A huge effort has been made to quickly and correctly align the reads, and many different algorithms and programs to do this have been created. However, the second step is just as crucial and non-trivial, but much less attention has been paid to the final inference of the methylation states. Important error sources do exist, such as sequencing errors, bisulfite failure, clonal reads, and single nucleotide variants. We developed MethylExtract, a user-friendly tool to: i) generate high-quality, whole genome methylation maps and ii) detect sequence variation within the same sample preparation. The program is implemented as a single script and takes into account all major error sources. MethylExtract detects variation (SNVs – Single Nucleotide Variants) in a similar way to VarScan, a very sensitive method extensively used in SNV and genotype calling based on non-bisulfite-treated reads. The usefulness of MethylExtract is shown by means of extensive benchmarking based on artificial bisulfite-treated reads and a comparison to a recently published method, called Bis-SNP. MethylExtract is able to detect SNVs within High-Throughput Sequencing experiments of bisulfite-treated DNA at the same time as it generates high-quality methylation maps. This simultaneous detection of DNA methylation and sequence variation is crucial for many downstream analyses, for example when deciphering the impact of SNVs on differential methylation. An exclusive feature of MethylExtract, in comparison with existing software, is the possibility to assess the bisulfite failure in a statistical way. The source code, tutorial and artificial bisulfite datasets are available at http://bioinfo2.ugr.es/MethylExtract/ and http://sourceforge.net/projects/methylextract/, and also permanently accessible from 10.5281/zenodo.7144.
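As a rough illustration of the second step described above (inferring a methylation state from aligned reads), the Python sketch below computes a per-cytosine methylation level and asks whether the unconverted reads could be explained by bisulfite failure alone. The one-sided binomial test and the 1% non-conversion rate are assumptions made for this example; the abstract does not specify the statistical procedure MethylExtract uses, and the function names are hypothetical.

```python
from math import comb


def methylation_level(c_reads: int, t_reads: int) -> float:
    """Fraction of reads reporting C (unconverted) at a cytosine position."""
    return c_reads / (c_reads + t_reads)


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical position: 8 reads show C and 12 show T.
level = methylation_level(8, 12)
# Probability of seeing at least 8 unconverted reads out of 20 if the site were
# unmethylated and each read failed to convert with an assumed probability of 0.01.
p_value = binomial_tail(8, 20, 0.01)
```

A very small tail probability suggests the observed C reads reflect methylation rather than conversion failure; a real pipeline would also have to account for sequencing errors, clonal reads, and SNVs, as the abstract notes.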
Affiliation(s)
- Guillermo Barturen
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- Antonio Rueda
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- José L Oliver
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- Michael Hackenberg
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
|
34
|
Barturen G, Rueda A, Oliver JL, Hackenberg M. MethylExtract: High-Quality methylation maps and SNV calling from whole genome bisulfite sequencing data. F1000Res 2013; 2:217. [PMID: 24627790 DOI: 10.12688/f1000research.2-217.v1] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Accepted: 10/09/2013] [Indexed: 01/30/2023] Open
|
35
|
Stevens M, Cheng JB, Li D, Xie M, Hong C, Maire CL, Ligon KL, Hirst M, Marra MA, Costello JF, Wang T. Estimating absolute methylation levels at single-CpG resolution from methylation enrichment and restriction enzyme sequencing methods. Genome Res 2013; 23:1541-53. [PMID: 23804401 PMCID: PMC3759729 DOI: 10.1101/gr.152231.112] [Citation(s) in RCA: 109] [Impact Index Per Article: 9.9] [Indexed: 12/02/2022]
Abstract
Recent advancements in sequencing-based DNA methylation profiling methods provide an unprecedented opportunity to map complete DNA methylomes. These include whole-genome bisulfite sequencing (WGBS, MethylC-seq, or BS-seq), reduced-representation bisulfite sequencing (RRBS), and enrichment-based methods such as MeDIP-seq, MBD-seq, and MRE-seq. These methods yield largely comparable results but differ significantly in extent of genomic CpG coverage, resolution, quantitative accuracy, and cost, at least while using current algorithms to interrogate the data. None of these existing methods provides single-CpG resolution, comprehensive genome-wide coverage, and cost feasibility for a typical laboratory. We introduce methylCRF, a novel conditional random fields–based algorithm that integrates methylated DNA immunoprecipitation (MeDIP-seq) and methylation-sensitive restriction enzyme (MRE-seq) sequencing data to predict DNA methylation levels at single-CpG resolution. Our method is a combined computational and experimental strategy to produce DNA methylomes of all 28 million CpGs in the human genome for a fraction (<10%) of the cost of whole-genome bisulfite sequencing methods. methylCRF was benchmarked for accuracy against Infinium arrays, RRBS, WGBS sequencing, and locus-specific bisulfite sequencing performed on the same human embryonic stem cell line. methylCRF transformation of MeDIP-seq/MRE-seq was equivalent to a biological replicate of WGBS in quantification, coverage, and resolution. We used conventional bisulfite conversion, PCR, cloning, and sequencing to validate loci where our predictions do not agree with whole-genome bisulfite data, and in 11 out of 12 cases, methylCRF predictions of methylation level agree better with validated results than does whole-genome bisulfite sequencing. Therefore, methylCRF transformation of MeDIP-seq/MRE-seq data provides an accurate, inexpensive, and widely accessible strategy to create full DNA methylomes.
Affiliation(s)
- Michael Stevens
- Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63108, USA
|
36
|
Hardcastle TJ. High-throughput sequencing of cytosine methylation in plant DNA. Plant Methods 2013; 9:16. [PMID: 23758782 PMCID: PMC3691832 DOI: 10.1186/1746-4811-9-16] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/11/2013] [Accepted: 03/29/2013] [Indexed: 05/26/2023]
Abstract
Cytosine methylation is a significant and widespread regulatory factor in plant systems. Methods for the high-throughput sequencing of methylation have allowed a greatly improved characterisation of the methylome. Here we discuss currently available methods for generation and analysis of high-throughput sequencing of methylation data. We also discuss the results previously acquired through sequencing plant methylomes, and highlight remaining challenges in this field.
Affiliation(s)
- Thomas J Hardcastle
- Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EA, UK.
|
37
|
Tretyakov K, Goldberg T, Jin VX, Horton P. Summary of talks and papers at ISCB-Asia/SCCG 2012. BMC Genomics 2013. [PMCID: PMC3639071 DOI: 10.1186/1471-2164-14-s2-i1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
The second ISCB-Asia conference of the International Society for Computational Biology took place December 17-19, 2012, in Shenzhen, China. The conference was co-hosted by BGI as the first Shenzhen Conference on Computational Genomics (SCCG).
45 talks were presented at ISCB-Asia/SCCG 2012. The topics covered included software tools, reproducible computing, next-generation sequencing data analysis, transcription and mRNA regulation, protein structure and function, cancer genomics and personalized medicine. Nine of the proceedings track talks are included as full papers in this supplement.
In this report we first give a short overview of the conference by listing some statistics and visualizing the talk abstracts as word clouds. Then we group the talks by topic and briefly summarize each one, providing references to related publications whenever possible. Finally, we close with a few comments on the success of this conference.
|
38
|
Abstract
DNA methylation is an epigenetic mark that has suspected regulatory roles in a broad range of biological processes and diseases. The technology is now available for studying DNA methylation genome-wide, at a high resolution and in a large number of samples. This Review discusses relevant concepts, computational methods and software tools for analysing and interpreting DNA methylation data. It focuses not only on the bioinformatic challenges of large epigenome-mapping projects and epigenome-wide association studies but also highlights software tools that make genome-wide DNA methylation mapping more accessible for laboratories with limited bioinformatics experience.
Affiliation(s)
- Christoph Bock
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, 1090 Vienna, Austria.
|
39
|
Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, Mason CE. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012; 13:R87. [PMID: 23034086 PMCID: PMC3491415 DOI: 10.1186/gb-2012-13-10-r87] [Citation(s) in RCA: 1259] [Impact Index Per Article: 104.9] [Received: 04/30/2012] [Accepted: 10/03/2012] [Indexed: 12/14/2022] Open
Abstract
DNA methylation is a chemical modification of cytosine bases that is pivotal for gene regulation, cellular specification and cancer development. Here, we describe an R package, methylKit, that rapidly analyzes genome-wide cytosine epigenetic profiles from high-throughput methylation and hydroxymethylation sequencing experiments. methylKit includes functions for clustering, sample quality visualization, differential methylation analysis and annotation features, thus automating and simplifying many of the steps for discerning statistically significant bases or regions of DNA methylation. Finally, we demonstrate methylKit on breast cancer data, in which we find statistically significant regions of differential methylation and stratify tumor subtypes. methylKit is available at http://code.google.com/p/methylkit.
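To make "differential methylation analysis" at a single base more tangible, the sketch below applies a two-sided Fisher's exact test to methylated and unmethylated read counts from two samples. This is one common choice for unreplicated comparisons, shown here in plain Python as an assumption for illustration; the abstract does not state which tests methylKit applies, and none of the names below belong to the methylKit R API.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# Hypothetical CpG: sample 1 has 18 methylated and 2 unmethylated reads,
# sample 2 has 5 methylated and 15 unmethylated reads.
p_value = fisher_exact_two_sided(18, 2, 5, 15)
```

In a genome-wide analysis this test would be repeated at every covered base, so the resulting p-values need multiple-testing correction before any base or region is called significant.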
|