1. Zhan X, Banerjee K, Chen J. Variant-set association test for generalized linear mixed model. Genet Epidemiol 2021; 45:402-412. [PMID: 33604919 DOI: 10.1002/gepi.22378] [Received: 08/20/2020] [Revised: 01/18/2021] [Accepted: 01/25/2021] [Indexed: 12/22/2022]
Abstract
Advances in high-throughput biotechnologies have culminated in a wide range of omics (such as genomics, epigenomics, transcriptomics, metabolomics, and metagenomics) studies, and increasing evidence in these studies indicates that the biological architecture of complex traits involves a large number of omics variants each with minor effects but collectively accounting for the full phenotypic variability. Thus, a major challenge in many "ome-wide" association analyses is to achieve adequate statistical power to identify multiple variants of small effect sizes, which is notoriously difficult for studies with relatively small-sample sizes. A small-sample adjustment incorporated in the kernel machine regression framework was proposed to solve this for association studies under various settings. However, such an adjustment in the generalized linear mixed model (GLMM) framework, which accounts for both sample relatedness and non-Gaussian outcomes, has not yet been attempted. In this study, we fill this gap by extending small-sample adjustment in kernel machine association test to GLMM. We propose a new Variant-Set Association Test (VSAT), a powerful and efficient analysis tool in GLMM, to examine the association between a set of omics variants and correlated phenotypes. The usefulness of VSAT is demonstrated using both numerical simulation studies and applications to data collected from multiple association studies. The software for implementing the proposed method in R is available at https://www.github.com/jchen1981/SSKAT.
Affiliation(s)
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, Pennsylvania, USA
- Kalins Banerjee
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, Pennsylvania, USA
- Jun Chen
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA
2. Milad M, Olbricht GR. Testing differentially methylated regions through functional principal component analysis. J Appl Stat 2021; 49:1677-1691. [DOI: 10.1080/02664763.2021.1877636] [Indexed: 10/22/2022]
Affiliation(s)
- Mohamed Milad
  - Department of Mathematics and Statistics, Arkansas State University, Jonesboro, AR, USA
- Gayla R. Olbricht
  - Department of Mathematics and Statistics, Missouri University of Science and Technology, Rolla, MO, USA
3. Halla-Aho V, Lähdesmäki H. LuxUS: DNA methylation analysis using generalized linear mixed model with spatial correlation. Bioinformatics 2020; 36:4535-4543. [PMID: 32484876 PMCID: PMC7750928 DOI: 10.1093/bioinformatics/btaa539] [Received: 09/20/2019] [Revised: 05/05/2020] [Accepted: 05/27/2020] [Indexed: 11/19/2022]
Abstract
Motivation: DNA methylation is an important epigenetic modification, which has multiple functions. DNA methylation and its connections to diseases have been extensively studied in recent years. It is known that DNA methylation levels of neighboring cytosines are correlated and that differential DNA methylation typically occurs rather as regions instead of individual cytosine level.
Results: We have developed a generalized linear mixed model, LuxUS, that makes use of the correlation between neighboring cytosines to facilitate analysis of differential methylation. LuxUS implements a likelihood model for bisulfite sequencing data that accounts for experimental variation in underlying biochemistry. LuxUS can model both binary and continuous covariates, and mixed model formulation enables including replicate and cytosine random effects. Spatial correlation is included to the model through a cytosine random effect correlation structure. We show with simulation experiments that using the spatial correlation, we gain more power to the statistical testing of differential DNA methylation. Results with real bisulfite sequencing dataset show that LuxUS is able to detect biologically significant differentially methylated cytosines.
Availability and implementation: The tool is available at https://github.com/hallav/LuxUS.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Viivi Halla-Aho
  - Department of Computer Science, Aalto University, FI-00076 Aalto, Finland
- Harri Lähdesmäki
  - Department of Computer Science, Aalto University, FI-00076 Aalto, Finland
4. Liu Y, Han Y, Zhou L, Pan X, Sun X, Liu Y, Liang M, Qin J, Lu Y, Liu P. A comprehensive evaluation of computational tools to identify differential methylation regions using RRBS data. Genomics 2020; 112:4567-4576. [PMID: 32712292 DOI: 10.1016/j.ygeno.2020.07.032] [Received: 07/16/2020] [Accepted: 07/20/2020] [Indexed: 01/01/2023]
Abstract
DNA methylation plays a vital role in transcription regulation. Reduced representation bisulfite sequencing (RRBS) is becoming common for analyzing genome-wide methylation profiles at the single nucleotide level. A major goal of RRBS studies is to detect differentially methylated regions (DMRs) between different biological conditions. The previous tools to predict DMRs lack consistency. Here, we simulated RRBS datasets with significant attributes of real sequencing data under a wide range of scenarios, and systematically evaluated seven DMR detection tools in terms of type I error rate, precision/recall (PR), and area under ROC curve (AUC) using different methylation levels, sequencing coverage depth, length of DMRs, read length, and sample sizes. DMRfinder, methylSig, and methylKit were our preferred tools for RRBS data analysis, in terms of their AUC and PR curves. Our comparison highlights the different applicability of DMR detection tools and provides information to guide researchers towards the advancement of sequence-based DMR analysis.
Affiliation(s)
- Yi Liu
  - Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China
- Yi Han
  - Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China
- Liyuan Zhou
  - Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China
- Xiaoqing Pan
  - Center of Systems Molecular Medicine, Department of Physiology, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Xiwei Sun
  - Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China
- Yong Liu
  - Center of Systems Molecular Medicine, Department of Physiology, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Mingyu Liang
  - Center of Systems Molecular Medicine, Department of Physiology, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Jiale Qin
  - Center for Uterine Cancer Diagnosis & Therapy Research of Zhejiang Province, Women's Reproductive Health Key Laboratory of Zhejiang Province, Department of Gynecologic Oncology, Women's Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310029, China
- Yan Lu
  - Center for Uterine Cancer Diagnosis & Therapy Research of Zhejiang Province, Women's Reproductive Health Key Laboratory of Zhejiang Province, Department of Gynecologic Oncology, Women's Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310029, China
- Pengyuan Liu
  - Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China
  - Center of Systems Molecular Medicine, Department of Physiology, Medical College of Wisconsin, Milwaukee, WI 53226, USA
5. Kreutz C, Can NS, Bruening RS, Meyberg R, Mérai Z, Fernandez-Pozo N, Rensing SA. A blind and independent benchmark study for detecting differentially methylated regions in plants. Bioinformatics 2020; 36:3314-3321. [PMID: 32181821 DOI: 10.1093/bioinformatics/btaa191] [Received: 08/20/2019] [Revised: 01/31/2020] [Accepted: 03/13/2020] [Indexed: 01/03/2023]
Abstract
MOTIVATION: Bisulfite sequencing (BS-seq) is a state-of-the-art technique for investigating methylation of the DNA to gain insights into the epigenetic regulation. Several algorithms have been published for identification of differentially methylated regions (DMRs). However, the performances of the individual methods remain unclear and it is difficult to optimally select an algorithm in application settings.
RESULTS: We analyzed BS-seq data from four plants covering three taxonomic groups. We first characterized the data using multiple summary statistics describing methylation levels, coverage, noise, as well as frequencies, magnitudes and lengths of methylated regions. Then, simulated datasets with most similar characteristics to real experimental data were created. Seven different algorithms (metilene, methylKit, MOABS, DMRcate, Defiant, BSmooth, MethylSig) for DMR identification were applied and their performances were assessed. A blind and independent study design was chosen to reduce bias and to derive practical method selection guidelines. Overall, metilene had superior performance in most settings. Data attributes, such as coverage and spread of the DMR lengths, were found to be useful for selecting the best method for DMR detection. A decision tree to select the optimal approach based on these data attributes is provided. The presented procedure might serve as a general strategy for deriving algorithm selection rules tailored to demands in specific application settings.
AVAILABILITY AND IMPLEMENTATION: Scripts that were used for the analyses and that can be used for prediction of the optimal algorithm are provided at https://github.com/kreutz-lab/DMR-DecisionTree. Simulated and experimental data are available at https://doi.org/10.6084/m9.figshare.11619045.
CONTACT: ckreutz@imbi.uni-freiburg.de.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Clemens Kreutz
  - Faculty of Medicine and Medical Center, Institute of Medical Biometry and Statistics, University of Freiburg, 79104 Freiburg, Germany
  - Centre for Integrative Biological Signalling Studies (CIBSS), University of Freiburg, 79104 Freiburg, Germany
- Nilay S Can
  - Plant Cell Biology, Faculty of Biology, University of Marburg, 35043 Marburg, Germany
- Ralf Schulze Bruening
  - Plant Cell Biology, Faculty of Biology, University of Marburg, 35043 Marburg, Germany
- Rabea Meyberg
  - Plant Cell Biology, Faculty of Biology, University of Marburg, 35043 Marburg, Germany
- Zsuzsanna Mérai
  - Gregor Mendel Institute of Molecular Plant Biology, Austrian Academy of Sciences, Vienna BioCenter (VBC), 1030 Vienna, Austria
- Noe Fernandez-Pozo
  - Plant Cell Biology, Faculty of Biology, University of Marburg, 35043 Marburg, Germany
- Stefan A Rensing
  - Plant Cell Biology, Faculty of Biology, University of Marburg, 35043 Marburg, Germany
  - Centre for Biological Signaling Studies (BIOSS), University of Freiburg, 79104 Freiburg, Germany
6.
Abstract
Motivation: High-throughput measurements of DNA methylation are increasingly becoming a mainstay of biomedical investigations. While the methylation status of individual cytosines can sometimes be informative, several recent papers have shown that the functional role of DNA methylation is better captured by a quantitative analysis of the spatial variation of methylation across a genomic region.
Results: Here, we present BPRMeth, a Bioconductor package that quantifies methylation profiles by generalized linear model regression. The original implementation has been enhanced in two important ways: we introduced a fast, variational inference approach that enables the quantification of Bayesian posterior confidence measures on the model, and we adapted the method to use several observation models, making it suitable for a diverse range of platforms including single-cell analyses and methylation arrays.
Availability and implementation: http://bioconductor.org/packages/BPRMeth
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guido Sanguinetti
  - School of Informatics, University of Edinburgh, Edinburgh, UK
  - Synthetic and Systems Biology, University of Edinburgh, Edinburgh, UK
7. Catoni M, Tsang JM, Greco AP, Zabet NR. DMRcaller: a versatile R/Bioconductor package for detection and visualization of differentially methylated regions in CpG and non-CpG contexts. Nucleic Acids Res 2019; 46:e114. [PMID: 29986099 PMCID: PMC6212837 DOI: 10.1093/nar/gky602] [Received: 06/18/2018] [Accepted: 06/25/2018] [Indexed: 12/27/2022]
Abstract
DNA methylation has been associated with transcriptional repression and detection of differential methylation is important in understanding the underlying causes of differential gene expression. Bisulfite-converted genomic DNA sequencing is the current gold standard in the field for building genome-wide maps at a base pair resolution of DNA methylation. Here we systematically investigate the underlying features of detecting differential DNA methylation in CpG and non-CpG contexts, considering both the case of mammalian systems and plants. In particular, we introduce DMRcaller, a highly efficient R/Bioconductor package, which implements several methods to detect differentially methylated regions (DMRs) between two samples. Most importantly, we show that different algorithms are required to compute DMRs and the most appropriate algorithm in each case depends on the sequence context and levels of methylation. Furthermore, we show that DMRcaller outperforms other available packages and we propose a new method to select the parameters for this tool and for other available tools. DMRcaller is a comprehensive tool for differential methylation analysis which displays high sensitivity and specificity for the detection of DMRs and performs entire genome wide analysis within a few hours.
Affiliation(s)
- Marco Catoni
  - The Sainsbury Laboratory, University of Cambridge, Cambridge CB2 1LR, UK
- Jonathan Mf Tsang
  - The Sainsbury Laboratory, University of Cambridge, Cambridge CB2 1LR, UK
  - DAMTP, University of Cambridge, Cambridge CB3 0WA, UK
- Alessandro P Greco
  - School of Biological Sciences, University of Essex, Colchester CO4 3SQ, UK
- Nicolae Radu Zabet
  - The Sainsbury Laboratory, University of Cambridge, Cambridge CB2 1LR, UK
  - School of Biological Sciences, University of Essex, Colchester CO4 3SQ, UK
8. Korthauer K, Chakraborty S, Benjamini Y, Irizarry RA. Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics 2019; 20:367-383. [PMID: 29481604 PMCID: PMC6587918 DOI: 10.1093/biostatistics/kxy007] [Received: 08/31/2017] [Accepted: 01/21/2018] [Indexed: 12/22/2022]
Abstract
With recent advances in sequencing technology, it is now feasible to measure DNA methylation at tens of millions of sites across the entire genome. In most applications, biologists are interested in detecting differentially methylated regions, composed of multiple sites with differing methylation levels among populations. However, current computational approaches for detecting such regions do not provide accurate statistical inference. A major challenge in reporting uncertainty is that a genome-wide scan is involved in detecting these regions, which needs to be accounted for. A further challenge is that sample sizes are limited due to the costs associated with the technology. We have developed a new approach that overcomes these challenges and assesses uncertainty for differentially methylated regions in a rigorous manner. Region-level statistics are obtained by fitting a generalized least squares regression model with a nested autoregressive correlated error structure for the effect of interest on transformed methylation proportions. We develop an inferential approach, based on a pooled null distribution, that can be implemented even when as few as two samples per population are available. Here, we demonstrate the advantages of our method using both experimental data and Monte Carlo simulation. We find that the new method improves the specificity and sensitivity of lists of regions and accurately controls the false discovery rate.
Affiliation(s)
- Keegan Korthauer
  - Department of Biostatistics & Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA
- Sutirtha Chakraborty
  - Novartis, Inorbit Mall Rd, Silpa Gram Craft Village, HITEC City, Hyderabad, Telangana, India
- Yuval Benjamini
  - The Statistics Department, Hebrew University, Mount Scopus, Jerusalem, Israel
- Rafael A Irizarry
  - Department of Biostatistics & Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA
9. Kapourani CA, Sanguinetti G. Melissa: Bayesian clustering and imputation of single-cell methylomes. Genome Biol 2019; 20:61. [PMID: 30898142 PMCID: PMC6427844 DOI: 10.1186/s13059-019-1665-8] [Received: 10/05/2018] [Accepted: 02/28/2019] [Indexed: 12/12/2022]
Abstract
Measurements of single-cell methylation are revolutionizing our understanding of epigenetic control of gene expression, yet the intrinsic data sparsity limits the scope for quantitative analysis of such data. Here, we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells. The clustering also acts as an effective regularization for data imputation on unassayed CpG sites, enabling transfer of information between individual cells. We show both on simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings and state-of-the-art imputation performance.
Affiliation(s)
- Chantriolnt-Andreas Kapourani
  - School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
  - MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, EH4 2XU, UK
- Guido Sanguinetti
  - School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
  - Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
10. Choudhary K, Lai YH, Tran EJ, Aviran S. dStruct: identifying differentially reactive regions from RNA structurome profiling data. Genome Biol 2019; 20:40. [PMID: 30791935 PMCID: PMC6385470 DOI: 10.1186/s13059-019-1641-3] [Received: 08/22/2018] [Accepted: 01/24/2019] [Indexed: 12/16/2022]
Abstract
RNA biology is revolutionized by recent developments of diverse high-throughput technologies for transcriptome-wide profiling of molecular RNA structures. RNA structurome profiling data can be used to identify differentially structured regions between groups of samples. Existing methods are limited in scope to specific technologies and/or do not account for biological variation. Here, we present dStruct which is the first broadly applicable method for differential analysis accounting for biological variation in structurome profiling data. dStruct is compatible with diverse profiling technologies, is validated with experimental data and simulations, and outperforms existing methods.
Affiliation(s)
- Krishna Choudhary
  - Department of Biomedical Engineering and Genome Center, University of California, Davis, One Shields Avenue, Davis, 95616 CA USA
- Yu-Hsuan Lai
  - Department of Biochemistry, Purdue University, BCHM 305, 175 S. University Street, West Lafayette, 47907-2063 IN USA
- Elizabeth J. Tran
  - Department of Biochemistry, Purdue University, BCHM 305, 175 S. University Street, West Lafayette, 47907-2063 IN USA
  - Purdue University Center for Cancer Research, Purdue University, Hansen Life Sciences Research Building, Room 141, 201 S. University Street, West Lafayette, 47907-2064 IN USA
- Sharon Aviran
  - Department of Biomedical Engineering and Genome Center, University of California, Davis, One Shields Avenue, Davis, 95616 CA USA
11. Zhao N, Zhan X, Huang YT, Almli LM, Smith A, Epstein MP, Conneely K, Wu MC. Kernel machine methods for integrative analysis of genome-wide methylation and genotyping studies. Genet Epidemiol 2017; 42:156-167. [DOI: 10.1002/gepi.22100] [Received: 05/29/2017] [Revised: 09/26/2017] [Accepted: 10/27/2017] [Indexed: 12/22/2022]
Affiliation(s)
- Ni Zhao
  - Department of Biostatistics; Johns Hopkins University; Baltimore Maryland 21205 United States of America
- Xiang Zhan
  - Department of Public Health Sciences; Pennsylvania State University; Hershey Pennsylvania 17033 United States of America
- Yen-Tsung Huang
  - Institute of Statistical Science; Academia Sinica; Taipei 11529 Taiwan
- Lynn M Almli
  - Department of Psychiatry and Behavioral Sciences; Emory University; Atlanta Georgia 30322 United States of America
- Alicia Smith
  - Department of Gynecology and Obstetrics; Emory University; Atlanta Georgia 30322 United States of America
- Michael P. Epstein
  - Department of Human Genetics; Emory University; Atlanta Georgia 30322 United States of America
- Karen Conneely
  - Department of Human Genetics; Emory University; Atlanta Georgia 30322 United States of America
- Michael C. Wu
  - Public Health Sciences; Fred Hutchinson Cancer Research Center; Seattle Washington 98109 United States of America
12. Wang Y, Teschendorff AE, Widschwendter M, Wang S. Accounting for differential variability in detecting differentially methylated regions. Brief Bioinform 2017; 20:47-57. [DOI: 10.1093/bib/bbx097] [Received: 04/13/2017] [Indexed: 12/11/2022]
Affiliation(s)
- Ya Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
- Andrew E Teschendorff
  - Department of Women's Cancer, University College London, London, UK
  - CAS Key Lab of Computational Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
  - Statistical Cancer Genomics, UCL Cancer Institute, University College London, London, UK
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
13. Kapourani CA, Sanguinetti G. Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics 2017; 32:i405-i412. [PMID: 27587656 DOI: 10.1093/bioinformatics/btw432] [Indexed: 12/17/2022]
Abstract
MOTIVATION: DNA methylation is an intensely studied epigenetic mark, yet its functional role is incompletely understood. Attempts to quantitatively associate average DNA methylation to gene expression yield poor correlations outside of the well-understood methylation-switch at CpG islands.
RESULTS: Here, we use probabilistic machine learning to extract higher order features associated with the methylation profile across a defined region. These features quantitate precisely notions of shape of a methylation profile, capturing spatial correlations in DNA methylation across genomic regions. Using these higher order features across promoter-proximal regions, we are able to construct a powerful machine learning predictor of gene expression, significantly improving upon the predictive power of average DNA methylation levels. Furthermore, we can use higher order features to cluster promoter-proximal regions, showing that five major patterns of methylation occur at promoters across different cell lines, and we provide evidence that methylation beyond CpG islands may be related to regulation of gene expression. Our results support previous reports of a functional role of spatial correlations in methylation patterns, and provide a mean to quantitate such features for downstream analyses.
AVAILABILITY AND IMPLEMENTATION: https://github.com/andreaskapou/BPRMeth
CONTACT: G.Sanguinetti@ed.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guido Sanguinetti
  - IANC, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
  - Synthetic and Systems Biology, University of Edinburgh, Edinburgh EH9 3JD, UK
14. Han Y, He X. Integrating Epigenomics into the Understanding of Biomedical Insight. Bioinform Biol Insights 2016; 10:267-289. [PMID: 27980397 PMCID: PMC5138066 DOI: 10.4137/bbi.s38427] [Received: 06/17/2016] [Revised: 11/01/2016] [Accepted: 11/06/2016] [Indexed: 12/13/2022]
Abstract
Epigenetics is one of the most rapidly expanding fields in biomedical research, and the popularity of the high-throughput next-generation sequencing (NGS) highlights the accelerating speed of epigenomics discovery over the past decade. Epigenetics studies the heritable phenotypes resulting from chromatin changes but without alteration on DNA sequence. Epigenetic factors and their interactive network regulate almost all of the fundamental biological procedures, and incorrect epigenetic information may lead to complex diseases. A comprehensive understanding of epigenetic mechanisms, their interactions, and alterations in health and diseases genome widely has become a priority in biological research. Bioinformatics is expected to make a remarkable contribution for this purpose, especially in processing and interpreting the large-scale NGS datasets. In this review, we introduce the epigenetics pioneering achievements in health status and complex diseases; next, we give a systematic review of the epigenomics data generation, summarize public resources and integrative analysis approaches, and finally outline the challenges and future directions in computational epigenomics.
Affiliation(s)
- Yixing Han
  - Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
  - Present address: Genetics and Biochemistry Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Ximiao He
  - Laboratory of Metabolism, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
  - Present address: Department of Medical Genetics, School of Basic Medicine, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
15. Wang F, Zhang N, Wang J, Wu H, Zheng X. Tumor purity and differential methylation in cancer epigenomics. Brief Funct Genomics 2016; 15:408-419. [PMID: 27199459 DOI: 10.1093/bfgp/elw016] [Indexed: 12/21/2022]
Abstract
DNA methylation is an epigenetic modification of DNA molecule that plays a vital role in gene expression regulation. It is not only involved in many basic biological processes, but also considered an important factor for tumorigenesis and other human diseases. Study of DNA methylation has been an active field in cancer epigenomics research. With the advances of high-throughput technologies and the accumulation of enormous amount of data, method development for analyzing these data has gained tremendous interests in the fields of computational biology and bioinformatics. In this review, we systematically summarize the recent developments of computational methods and software tools in high-throughput methylation data analysis with focus on two aspects: differential methylation analysis and tumor purity estimation in cancer studies.
16. Kishore K, de Pretis S, Lister R, Morelli MJ, Bianchi V, Amati B, Ecker JR, Pelizzola M. methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data. BMC Bioinformatics 2015; 16:313. [PMID: 26415965 PMCID: PMC4587815 DOI: 10.1186/s12859-015-0742-6] [Received: 02/25/2015] [Accepted: 09/16/2015] [Indexed: 12/31/2022]
Abstract
BACKGROUND: Numerous methods are available to profile several epigenetic marks, providing data with different genome coverage and resolution. Large epigenomic datasets are then generated, and often combined with other high-throughput data, including RNA-seq, ChIP-seq for transcription factors (TFs) binding and DNase-seq experiments. Despite the numerous computational tools covering specific steps in the analysis of large-scale epigenomics data, comprehensive software solutions for their integrative analysis are still missing. Multiple tools must be identified and combined to jointly analyze histone marks, TFs binding and other -omics data together with DNA methylation data, complicating the analysis of these data and their integration with publicly available datasets.
RESULTS: To overcome the burden of integrating various data types with multiple tools, we developed two companion R/Bioconductor packages. The former, methylPipe, is tailored to the analysis of high- or low-resolution DNA methylomes in several species, accommodating (hydroxy-)methyl-cytosines in both CpG and non-CpG sequence context. The analysis of multiple whole-genome bisulfite sequencing experiments is supported, while maintaining the ability of integrating targeted genomic data. The latter, compEpiTools, seamlessly incorporates the results obtained with methylPipe and supports their integration with other epigenomics data. It provides a number of methods to score these data in regions of interest, leading to the identification of enhancers, lncRNAs, and RNAPII stalling/elongation dynamics. Moreover, it allows a fast and comprehensive annotation of the resulting genomic regions, and the association of the corresponding genes with non-redundant GeneOntology terms. Finally, the package includes a flexible method based on heatmaps for the integration of various data types, combining annotation tracks with continuous or categorical data tracks.
CONCLUSIONS: methylPipe and compEpiTools provide a comprehensive Bioconductor-compliant solution for the integrative analysis of heterogeneous epigenomics data. These packages are instrumental in providing biologists with minimal R skills a complete toolkit facilitating the analysis of their own data, or in accelerating the analyses performed by more experienced bioinformaticians.
Affiliation(s)
- Kamal Kishore: Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), Milano, 20139, Italy.
- Stefano de Pretis: Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), Milano, 20139, Italy.
- Ryan Lister: Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Perth, WA, 6009, Australia; Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA.
- Marco J Morelli: Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), Milano, 20139, Italy.
- Valerio Bianchi: Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), Milano, 20139, Italy.
- Bruno Amati: Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), Milano, 20139, Italy; Department of Experimental Oncology, European Institute of Oncology (IEO), Milano, 20139, Italy.
- Joseph R Ecker: Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA; Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA.
- Mattia Pelizzola: Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), Milano, 20139, Italy; Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA.
17
Madrigal P, Krajewski P. Uncovering correlated variability in epigenomic datasets using the Karhunen-Loeve transform. BioData Min 2015; 8:20. [PMID: 26140054 PMCID: PMC4488123 DOI: 10.1186/s13040-015-0051-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 11/10/2014] [Accepted: 06/17/2015] [Indexed: 12/21/2022]
Abstract
BACKGROUND Larger variation exists in epigenomes than in genomes, as a single genome shapes the identity of multiple cell types. With the advent of next-generation sequencing, one of the key problems in computational epigenomics is the poor understanding of correlations and quantitative differences between large-scale data sets.
RESULTS Here we bring to genomics a scenario of functional principal component analysis, a finite Karhunen-Loève transform, and explicitly decompose the variation in the coverage profiles of 27 chromatin mark ChIP-seq datasets at transcription start sites for H1, one of the most used human embryonic stem cell lines. Using this approach we identify positive correlations between H3K4me3 and H3K36me3, as well as between H3K9ac and H3K36me3, so far undetected by the most commonly used Pearson correlation between read enrichment coverages. We uncover highly negative correlations between H2A.Z, H3K4me3, and several histone acetylation marks, but these occur only between principal components of first and second order. We also demonstrate that levels of gene expression correlate significantly with scores of components of order higher than one, showing that transcriptional regulation by histone marks escapes simple one-to-one relationships. These correlations were higher in significance and magnitude in protein-coding genes than in non-coding RNAs.
CONCLUSIONS In summary, we present a methodology to explore and uncover novel patterns of epigenomic variability and covariability in genomic data sets by using a functional eigenvalue decomposition of genomic data. R code is available at: http://github.com/pmb59/KLTepigenome.
Affiliation(s)
- Pedro Madrigal: Department of Biometry and Bioinformatics, Institute of Plant Genetics of the Polish Academy of Sciences, Strzeszyńska 34, Poznań, 60-479, Poland; Present address: Wellcome Trust-MRC Cambridge Stem Cell Institute, Anne McLaren Laboratory for Regenerative Medicine, Department of Surgery, University of Cambridge, West Forvie Building, Forvie Site, Robinson Way, Cambridge, CB2 0SZ, UK; Present address: Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
- Paweł Krajewski: Department of Biometry and Bioinformatics, Institute of Plant Genetics of the Polish Academy of Sciences, Strzeszyńska 34, Poznań, 60-479, Poland.
18
Robinson MD, Pelizzola M. Computational epigenomics: challenges and opportunities. Front Genet 2015; 6:88. [PMID: 25798147 PMCID: PMC4350413 DOI: 10.3389/fgene.2015.00088] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/18/2015] [Accepted: 02/18/2015] [Indexed: 12/31/2022]
Affiliation(s)
- Mark D Robinson: Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
- Mattia Pelizzola: Computational Epigenomics, Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia, Milano, Italy.
|