1
Han C, Park J, Lin S. BCurve: Bayesian Curve Credible Bands Approach for the Detection of Differentially Methylated Regions. Methods Mol Biol 2022; 2432:167-185. [PMID: 35505215 DOI: 10.1007/978-1-0716-1994-0_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
High-throughput assays have been developed to measure DNA methylation, among which bisulfite-based sequencing (BS-seq) and microarray technologies are the most popular for genome-wide profiling. A major goal in DNA methylation analysis is the detection of differentially methylated genomic regions under two different conditions. Many state-of-the-art methods have been proposed for this purpose in the past few years, but only a handful of them can analyze both types of data (BS-seq and microarray). In addition, covariates such as sex and age are known to be potentially influential on DNA methylation, so it is important to adjust for their effects in differential methylation analysis. In this chapter, we describe a Bayesian curve credible bands approach and the accompanying software, BCurve, for detecting differentially methylated regions in data generated from either microarray or BS-seq. The unified theme underlying the analysis of these two types of data is a model that accounts for correlation between DNA methylation at nearby sites, covariates, and between-sample variability. The BCurve R software package also provides tools for simulating both microarray and BS-seq data, which can facilitate comparisons of methods given the known "gold standard" in the simulated data. We provide a detailed description of the main functions in BCurve and demonstrate the utility of the package for analyzing data from both platforms using data simulated with the functions provided in the package. Analyses of two real datasets, one from BS-seq and one from microarray, are also furnished to further illustrate the capability of BCurve.
Affiliation(s)
- Chenggong Han
  - Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH, USA
- Jincheol Park
  - Department of Statistics, Keimyung University, South Korea
- Shili Lin
  - Department of Statistics, The Ohio State University, Columbus, OH, USA
2
Xu T, Zheng X, Li B, Jin P, Qin Z, Wu H. A comprehensive review of computational prediction of genome-wide features. Brief Bioinform 2020; 21:120-134. [PMID: 30462144 PMCID: PMC10233247 DOI: 10.1093/bib/bby110] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/10/2018] [Revised: 10/15/2018] [Accepted: 10/16/2018] [Indexed: 12/15/2022] Open
Abstract
There are significant correlations among different types of genetic, genomic and epigenomic features within the genome. These correlations make in silico feature prediction possible through statistical or machine learning models. With the accumulation of a vast amount of high-throughput data, feature prediction has gained significant interest lately, and a plethora of papers have been published in the past few years. Here we provide a comprehensive review of these published works, categorized by prediction target, including protein binding sites, enhancers, DNA methylation, chromatin structure and gene expression. We also discuss some important points and possible future directions.
Affiliation(s)
- Tianlei Xu
  - Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
- Xiaoqi Zheng
  - Department of Mathematics, Shanghai Normal University, Shanghai, China
- Ben Li
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Peng Jin
  - Department of Human Genetics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Zhaohui Qin
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Hao Wu
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
3
Chen L, Qin ZS. Using DIVAN to assess disease/trait-associated single nucleotide variants in genome-wide scale. BMC Res Notes 2017; 10:530. [PMID: 29084591 PMCID: PMC5663107 DOI: 10.1186/s13104-017-2851-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/02/2017] [Accepted: 10/23/2017] [Indexed: 01/01/2023] Open
Abstract
OBJECTIVE: The majority of sequence variants identified by genome-wide association studies (GWASs) fall outside of the protein-coding regions. Unlike coding variants, these noncoding variants are challenging to connect to the pathophysiology of complex diseases/traits because functional annotations are lacking in the noncoding regions. To overcome this, by leveraging the rich collection of genomic and epigenomic profiles, we have developed DIVAN, or Disease/trait-specific Variant ANnotation, which enables the assignment of a measurement (D-score) to each base of the human genome in a disease/trait-specific manner. To facilitate the utilization of DIVAN, we pre-computed D-scores for every base of the human genome (hg19) for 45 different diseases/traits. RESULTS: In this work, we present a detailed protocol on how to use the DIVAN software toolkit to retrieve D-scores, either by variant identifiers or by genomic regions, for a disease/trait of interest. We also demonstrate the utility of the D-scores using real data examples. We believe that the pre-computed D-scores for 45 diseases/traits are a useful resource for following up on the discoveries made by GWASs, and that the DIVAN software toolkit provides a convenient way to access this resource. DIVAN is freely available at https://sites.google.com/site/emorydivan/software.
Affiliation(s)
- Li Chen
  - Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL 36849, USA
- Zhaohui S Qin
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
  - Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA
4
Li B, Li Y, Qin ZS. Improving Hierarchical Models Using Historical Data with Applications in High-Throughput Genomics Data Analysis. Statistics in Biosciences 2017; 9:73-90. [PMID: 28919931 DOI: 10.1007/s12561-016-9156-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Abstract
Modern high-throughput biotechnologies such as microarray and next-generation sequencing produce a massive amount of information for each sample assayed. However, in a typical high-throughput experiment, only a limited amount of data is observed for each individual feature, hence the classical 'large p, small n' problem. The Bayesian hierarchical model, which can borrow strength across features within the same dataset, has been recognized as an effective tool for analyzing such data. However, the shrinkage effect, the most prominent feature of hierarchical models, can lead to undesirable over-correction for some features. In this work, we discuss possible causes of the over-correction problem and propose several alternative solutions. Our strategy is rooted in the fact that, in the Big Data era, large amounts of historical data are available and should be taken advantage of. It presents a new framework for enhancing the Bayesian hierarchical model. Through simulation and real data analysis, we demonstrate the superior performance of the proposed strategy. The strategy also enables borrowing information across different platforms, which could be extremely useful given the emergence of new technologies and the accumulation of data from different platforms in the Big Data era. Our method has been implemented in the R package "adaptiveHM", which is freely available from https://github.com/benliemory/adaptiveHM.
Affiliation(s)
- Ben Li
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Yunxiao Li
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Zhaohui S Qin
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
  - Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA
5
Tang B, Cheng X, Xi Y, Chen Z, Zhou Y, Jin VX. Advances in Genomic Profiling and Analysis of 3D Chromatin Structure and Interaction. Genes (Basel) 2017; 8:E223. [PMID: 28885554 PMCID: PMC5615356 DOI: 10.3390/genes8090223] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/28/2017] [Revised: 08/25/2017] [Accepted: 09/04/2017] [Indexed: 02/08/2023] Open
Abstract
Recent sequence-based profiling technologies such as high-throughput sequencing to detect fragment nucleotide sequence (Hi-C) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) have revolutionized the field of three-dimensional (3D) chromatin architecture. It is now recognized that the human genome functions as folded 3D chromatin units and that the looping paradigm is the basic principle of gene regulation. To better interpret the 3D data that have accumulated dramatically in the past five years and to gain deep biological insights, huge efforts have been made in developing novel quantitative analysis methods. However, a full understanding of genome regulation requires thorough knowledge of both genomic technologies and their related data analyses. We summarize recent advances in genomic technologies for identifying 3D chromatin structure and interaction, illustrate the quantitative analysis methods used to infer functional domains and chromatin interactions, elucidate the emerging single-cell Hi-C technique and its computational analysis, and finally discuss future directions such as advances of 3D chromatin techniques in diseases.
Affiliation(s)
- Binhua Tang
  - Epigenetics & Function Group, School of the Internet of Things, Hohai University, Changzhou Campus, Changzhou 213022, Jiangsu, China
  - School of Public Health, Shanghai Jiao Tong University, Shanghai 200025, China
- Xiaolong Cheng
  - Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, TX 78229, USA
- Yunlong Xi
  - Epigenetics & Function Group, School of the Internet of Things, Hohai University, Changzhou Campus, Changzhou 213022, Jiangsu, China
- Zixin Chen
  - Epigenetics & Function Group, School of the Internet of Things, Hohai University, Changzhou Campus, Changzhou 213022, Jiangsu, China
- Yufan Zhou
  - Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, TX 78229, USA
- Victor X Jin
  - Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, TX 78229, USA
6
Li Z, Safo SE, Long Q. Incorporating biological information in sparse principal component analysis with application to genomic data. BMC Bioinformatics 2017; 18:332. [PMID: 28697740 PMCID: PMC5504598 DOI: 10.1186/s12859-017-1740-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 12/06/2016] [Accepted: 06/22/2017] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: Sparse principal component analysis (PCA) is a popular tool for dimensionality reduction, pattern recognition, and visualization of high-dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often represented by graphs. Recent work has shown that incorporating such biological information improves feature selection and prediction performance in regression analysis, but there has been limited work on extending this approach to PCA. In this article, we propose two new sparse PCA methods, called Fused and Grouped sparse PCA, that enable incorporation of prior biological information in variable selection. RESULTS: Our simulation studies suggest that, compared to existing sparse PCA methods, the proposed methods achieve higher sensitivity and specificity when the graph structure is correctly specified, and are fairly robust to misspecified graph structures. Application to a glioblastoma gene expression dataset identified pathways that are suggested in the literature to be related to glioblastoma. CONCLUSIONS: The proposed methods, Fused and Grouped sparse PCA, can effectively incorporate prior biological information in variable selection, leading to improved feature selection and more interpretable principal component loadings, and potentially providing insights into the molecular underpinnings of complex diseases.
Affiliation(s)
- Ziyi Li
  - Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, Atlanta, GA 30322, USA
- Sandra E. Safo
  - Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, Atlanta, GA 30322, USA
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA