1
Madrid Padilla OH, Chatterjee S. Risk bounds for quantile trend filtering. Biometrika 2021. [DOI: 10.1093/biomet/asab045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Summary
We study quantile trend filtering, a recently proposed method for nonparametric quantile regression, with the goal of generalizing existing risk bounds for the usual trend-filtering estimators that perform mean regression. We study both the penalized and the constrained versions, of order $r \geqslant 1$, of univariate quantile trend filtering. Our results show that both the constrained and the penalized versions of order $r \geqslant 1$ attain the minimax rate up to logarithmic factors, when the $(r-1)$th discrete derivative of the true vector of quantiles belongs to the class of bounded-variation signals. Moreover, we show that if the true vector of quantiles is a discrete spline with a few polynomial pieces, then both versions attain a near-parametric rate of convergence. Corresponding results for the usual trend-filtering estimators are known to hold only when the errors are sub-Gaussian. In contrast, our risk bounds are shown to hold under minimal assumptions on the error variables. In particular, no moment assumptions are needed and our results hold under heavy-tailed errors. Our proof techniques are general, and thus can potentially be used to study other nonparametric quantile regression methods. To illustrate this generality, we employ our proof techniques to obtain new results for multivariate quantile total-variation denoising and high-dimensional quantile linear regression.
Affiliation(s)
- Oscar Hernan Madrid Padilla
- Department of Statistics, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, California 90095, U.S.A
- Sabyasachi Chatterjee
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright St. M/C 374, Champaign, Illinois 61820, U.S.A
2
Jula Vanegas L, Behr M, Munk A. Multiscale Quantile Segmentation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2020.1859380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Laura Jula Vanegas
- Institute for Mathematical Stochastics, University of Göttingen, Göttingen, Germany
- Merle Behr
- Department of Statistics, University of California at Berkeley, Berkeley, CA
- Axel Munk
- Institute for Mathematical Stochastics, University of Göttingen, Göttingen, Germany
- Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
3
Lee J, Chen J. A modified information criterion for tuning parameter selection in 1d fused LASSO for inference on multiple change points. J Stat Comput Sim 2020. [DOI: 10.1080/00949655.2020.1732379] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
Affiliation(s)
- J. Lee
- Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, USA
- J. Chen
- Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, USA
4
Lee J, Chen J. A penalized regression approach for DNA copy number study using the sequencing data. Stat Appl Genet Mol Biol 2019; 18:sagmb-2018-0001. [PMID: 31145697 DOI: 10.1515/sagmb-2018-0001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022]
Abstract
Modeling the high-throughput next generation sequencing (NGS) data, resulting from experiments with the goal of profiling tumor and control samples for the study of DNA copy number variants (CNVs), remains a challenge in various ways. In this application work, we provide an efficient method for detecting multiple CNVs using NGS reads ratio data. This method is based on a multiple statistical change-points model with the penalized regression approach, 1d fused LASSO, that is designed for ordered data in a one-dimensional structure. In addition, since the path algorithm traces the solution as a function of a tuning parameter, the number and locations of potential CNV region boundaries can be estimated simultaneously in an efficient way. For tuning parameter selection, we then propose a new modified Bayesian information criterion, called JMIC, and compare the proposed JMIC with three different Bayes information criteria used in the literature. Simulation results have shown the better performance of JMIC for tuning parameter selection, in comparison with the other three criteria. We applied our approach to the sequencing data of reads ratio between the breast tumor cell line HCC1954 and its matched normal cell line BL 1954 and the results are in line with those discovered in the literature.
Affiliation(s)
- Jaeeun Lee
- Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA 30912, USA
- Jie Chen
- Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA 30912, USA
5
Affiliation(s)
- Hosik Choi
- Department of Applied Statistics, Kyonggi University, Suwon, Korea
- J. C. Poythress
- Department of Statistics, University of Georgia, Athens, GA, Georgia
- Cheolwoo Park
- Department of Statistics, University of Georgia, Athens, GA, Georgia
- Jong-June Jeon
- Department of Statistics, University of Seoul, Dongdaemun-gu, Seoul Korea
- Natural Science Research Institute, University of Seoul, Dongdaemun-gu, Seoul, Korea
- Changyi Park
- Department of Statistics, University of Seoul, Dongdaemun-gu, Seoul Korea
6
Song L, Bhuvaneshwar K, Wang Y, Feng Y, Shih IM, Madhavan S, Gusev Y. CINdex: A Bioconductor Package for Analysis of Chromosome Instability in DNA Copy Number Data. Cancer Inform 2017; 16:1176935117746637. [PMID: 29343938 PMCID: PMC5761903 DOI: 10.1177/1176935117746637] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 07/10/2017] [Accepted: 10/26/2017] [Indexed: 01/10/2023]
Abstract
The CINdex Bioconductor package addresses an important area of high-throughput genomic analysis. It calculates the chromosome instability (CIN) index, a novel measurement that quantitatively characterizes genome-wide copy number alterations (CNAs) as a measure of CIN. The advantage of this package is an ability to compare CIN index values between several groups for patients (case and control groups), which is a typical use case in translational research. The differentially changed cytobands or chromosomes can then be linked to genes located in the affected genomic regions, as well as pathways. This enables in-depth systems biology-based network analysis and assessment of the impact of CNA on various biological processes or clinical outcomes. This package was successfully applied to analysis of DNA copy number data in colorectal cancer as a part of multi-omics integrative study as well as for analysis of several other cancer types. The source code, along with an end-to-end tutorial, and example data are freely available in Bioconductor at http://bioconductor.org/packages/CINdex/.
Affiliation(s)
- Lei Song
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
- Krithika Bhuvaneshwar
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
- Yue Wang
- The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
- Yuanjian Feng
- The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
- Ie-Ming Shih
- Department of Gynecology and Obstetrics, The Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Subha Madhavan
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
- Yuriy Gusev
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
7
Affiliation(s)
- Gabriela Ciuperca
- Université Claude Bernard Lyon, Institut Camille Jordan, Villeurbanne, France
8
Sun Y, Wang HJ, Fuentes M. Fused Adaptive Lasso for Spatial and Temporal Quantile Function Estimation. Technometrics 2016. [DOI: 10.1080/00401706.2015.1017115] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 10/23/2022]
Affiliation(s)
- Ying Sun
- CEMSE Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Huixia J. Wang
- Department of Statistics, George Washington University, Washington, DC 20052
- Montserrat Fuentes
- Department of Statistics, North Carolina State University, Raleigh, NC 27695
9
Briollais L, Durrieu G. Application of quantile regression to recent genetic and -omic studies. Hum Genet 2014; 133:951-66. [DOI: 10.1007/s00439-014-1440-6] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 09/16/2013] [Accepted: 03/10/2014] [Indexed: 12/01/2022]
10
Jiang L, Bondell HD, Wang HJ. Interquantile Shrinkage and Variable Selection in Quantile Regression. Comput Stat Data Anal 2014; 69:208-219. [PMID: 24653545 DOI: 10.1016/j.csda.2013.08.006] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/24/2022]
Abstract
Examination of multiple conditional quantile functions provides a comprehensive view of the relationship between the response and covariates. In situations where quantile slope coefficients share some common features, estimation efficiency and model interpretability can be improved by utilizing such commonality across quantiles. Furthermore, elimination of irrelevant predictors will also aid in estimation and interpretation. These motivations lead to the development of two penalization methods, which can identify the interquantile commonality and nonzero quantile coefficients simultaneously. The developed methods are based on a fused penalty that encourages sparsity of both quantile coefficients and interquantile slope differences. The oracle properties of the proposed penalization methods are established. Through numerical investigations, it is demonstrated that the proposed methods lead to simpler model structure and higher estimation efficiency than the traditional quantile regression estimation.
Affiliation(s)
- Liewen Jiang
- Department of Statistics, North Carolina State University, Raleigh, NC 27606, U.S.A
- Howard D Bondell
- Department of Statistics, North Carolina State University, Raleigh, NC 27606, U.S.A
- Huixia Judy Wang
- Department of Statistics, North Carolina State University, Raleigh, NC 27606, U.S.A
11
Abstract
Conventional analysis using quantile regression typically focuses on fitting the regression model at different quantiles separately. However, in situations where the quantile coefficients share some common feature, joint modeling of multiple quantiles to accommodate the commonality often leads to more efficient estimation. One example of common features is that a predictor may have a constant effect over one region of quantile levels but varying effects in other regions. To automatically perform estimation and detection of the interquantile commonality, we develop two penalization methods. When the quantile slope coefficients indeed do not change across quantile levels, the proposed methods will shrink the slopes towards constant and thus improve the estimation efficiency. We establish the oracle properties of the two proposed penalization methods. Through numerical investigations, we demonstrate that the proposed methods lead to estimations with competitive or higher efficiency than the standard quantile regression estimation in finite samples. Supplemental materials for the article are available online.
12
Plummer PJ, Chen J. A Bayesian approach for locating change points in a compound Poisson process with application to detecting DNA copy number variations. J Appl Stat 2013. [DOI: 10.1080/02664763.2013.840272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
13
Nowak G, Hastie T, Pollack JR, Tibshirani R. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 2011; 12:776-91. [PMID: 21642389 DOI: 10.1093/biostatistics/kxr012] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Indexed: 11/13/2022]
Abstract
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
Affiliation(s)
- Gen Nowak
- Department of Biostatistics, Harvard University, Boston, MA 02115, USA.
14
Zhang Z, Lange K, Ophoff R, Sabatti C. Reconstructing DNA copy number by penalized estimation and imputation. Ann Appl Stat 2010; 4:1749-1773. [PMID: 21572975 DOI: 10.1214/10-aoas357] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 01/16/2023]
Abstract
Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang [Biostatistics 9 (2008) 18-29]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization-minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
Affiliation(s)
- Zhongyang Zhang
- Department of Statistics University of California, Los Angeles Los Angeles, California 90095 USA
15
Gao X, Huang J. A robust penalized method for the analysis of noisy DNA copy number data. BMC Genomics 2010; 11:517. [PMID: 20868505 PMCID: PMC3247090 DOI: 10.1186/1471-2164-11-517] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/17/2009] [Accepted: 09/25/2010] [Indexed: 11/20/2022]
Abstract
Background Deletions and amplifications of the human genomic DNA copy number are the causes of numerous diseases, such as, various forms of cancer. Therefore, the detection of DNA copy number variations (CNV) is important in understanding the genetic basis of many diseases. Various techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as, array-based comparative genomic hybridization (aCGH) and high-resolution mapping with high-density tiling oligonucleotide arrays. Since complicated biological and experimental processes are often associated with these platforms, data can be potentially contaminated by outliers. Results We propose a penalized LAD regression model with the adaptive fused lasso penalty for detecting CNV. This method contains robust properties and incorporates both the spatial dependence and sparsity of CNV into the analysis. Our simulation studies and real data analysis indicate that the proposed method can correctly detect the numbers and locations of the true breakpoints while appropriately controlling the false positives. Conclusions The proposed method has three advantages for detecting CNV change points: it contains robustness properties; incorporates both spatial dependence and sparsity; and estimates the true values at each marker accurately.
Affiliation(s)
- Xiaoli Gao
- Department of Mathematics and Statistics, Oakland University, Rochester, MI 48309, USA.
16
Abstract
Most existing methods for identifying aberrant regions with array CGH data are confined to a single target sample. Focusing on the comparison of multiple samples from two different groups, we develop a new penalized regression approach with a fused adaptive lasso penalty to accommodate the spatial dependence of the clones. The nonrandom aberrant genomic segments are determined by assessing the significance of the differences between neighboring clones and neighboring segments. The algorithm proposed in this article is a first attempt to simultaneously detect the common aberrant regions within each group, and the regions where the two groups differ in copy number changes. The simulation study suggests that the proposed procedure outperforms the commonly used single-sample aberration detection methods for segmentation in terms of both false positives and false negatives. To further assess the value of the proposed method, we analyze a data set from a study that identified the aberrant genomic regions associated with grade subgroups of breast cancer tumors.
Affiliation(s)
- Huixia Judy Wang
- Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA.
17
Nguyen N, Huang H, Oraintara S, Vo A. Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. J Comput Biol 2010; 17:139-52. [PMID: 20078226 DOI: 10.1089/cmb.2009.0013] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Array-based comparative genomic hybridization (aCGH) has emerged as a highly efficient technique for the detection of chromosomal imbalances. Characteristics of these DNA copy number aberrations provide insights into cancer, and they are useful for diagnostic and therapy strategies. In this article, we propose a statistical bivariate model for aCGH data in the stationary wavelet packet transform (SWPT) and apply this bivariate shrinkage estimator to the aCGH smoothing study. Because our new dependent Laplacian bivariate shrinkage estimator covers the dependency between wavelet coefficients and the shift invariant SWPT results include both low- and high-frequency information, our dependent Laplacian bivariate shrinkage estimator based SWPT method (named SWPT-LaBi) has fundamental advantages in solving the aCGH data smoothing problem compared to other methods. In our experiments, two standard evaluation methods, the Root Mean Squared Error (RMSE) and the Receiver Operating Characteristic (ROC) curve, are calculated to demonstrate the performance of our method. In all experimental results, our SWPT-LaBi method outperforms the previous most commonly used aCGH smoothing algorithms on both synthetic data and real data. We also propose a new synthetic data generation method for evaluating aCGH smoothing algorithms. In our new data model, the noise from real aCGH data is extracted and used to improve synthetic data generation.
Affiliation(s)
- Nha Nguyen
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas 76019, USA
18
Kim KY, Kim J, Kim HJ, Nam W, Cha IH. A method for detecting significant genomic regions associated with oral squamous cell carcinoma using aCGH. Med Biol Eng Comput 2010; 48:459-68. [DOI: 10.1007/s11517-010-0595-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/17/2009] [Accepted: 02/26/2010] [Indexed: 12/14/2022]
19
Ho JWK, Stefani M, dos Remedios CG, Charleston MA. A model selection approach to discover age-dependent gene expression patterns using quantile regression models. BMC Genomics 2009; 10 Suppl 3:S16. [PMID: 19958479 PMCID: PMC2788368 DOI: 10.1186/1471-2164-10-s3-s16] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Background It has been a long-standing biological challenge to understand the molecular regulatory mechanisms behind mammalian ageing. Harnessing the availability of many ageing microarray datasets, a number of studies have shown that it is possible to identify genes that have age-dependent differential expression (DE) or differential variability (DV) patterns. The majority of the studies identify "interesting" genes using a linear regression approach, which is known to perform poorly in the presence of outliers or if the underlying age-dependent pattern is non-linear. Clearly a more robust and flexible approach is needed to identify genes with various age-dependent gene expression patterns. Results Here we present a novel model selection approach to discover genes with linear or non-linear age-dependent gene expression patterns from microarray data. To identify DE genes, our method fits three quantile regression models (constant, linear and piecewise linear models) to the expression profile of each gene, and selects the least complex model that best fits the available data. Similarly, DV genes are identified by fitting and comparing two quantile regression models (non-DV and the DV models) to the expression profile of each gene. We show that our approach is much more robust than the standard linear regression approach in discovering age-dependent patterns. We also applied our approach to analyze two human brain ageing datasets and found many biologically interesting gene expression patterns, including some very interesting DV patterns, that have been overlooked in the original studies. Furthermore, we propose that our model selection approach can be extended to discover DE and DV genes from microarray datasets with discrete class labels, by considering different quantile regression models. 
Conclusion In this paper, we present a novel application of quantile regression models to identify genes that have interesting linear or non-linear age-dependent expression patterns. One important contribution of this paper is to introduce a model selection approach to DE and DV gene identification, which is most commonly tackled by null hypothesis testing approaches. We show that our approach is robust in analyzing real and simulated datasets. We believe that our approach is applicable in many ageing or time-series data analysis tasks.
Affiliation(s)
- Joshua W K Ho
- School of Information Technologies, The University of Sydney, NSW 2006, Australia.
20
Kim KY, Lee GY, Kim J, Jeung HC, Chung HC, Rha SY. Identification of significant regional genetic variations using continuous CNV values in aCGH data. Genomics 2009; 94:317-23. [DOI: 10.1016/j.ygeno.2009.08.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/26/2009] [Revised: 07/20/2009] [Accepted: 08/11/2009] [Indexed: 11/26/2022]
21
Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S, Futreal PA, Stratton MR. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics 2009; 11:164-75. [PMID: 19837654 PMCID: PMC2800165 DOI: 10.1093/biostatistics/kxp045] [Citation(s) in RCA: 168] [Impact Index Per Article: 11.2] [Indexed: 02/03/2023]
Abstract
High-throughput oligonucleotide microarrays are commonly employed to investigate genetic disease, including cancer. The algorithms employed to extract genotypes and copy number variation function optimally for diploid genomes usually associated with inherited disease. However, cancer genomes are aneuploid in nature leading to systematic errors when using these techniques. We introduce a preprocessing transformation and hidden Markov model algorithm bespoke to cancer. This produces genotype classification, specification of regions of loss of heterozygosity, and absolute allelic copy number segmentation. Accurate prediction is demonstrated with a combination of independent experimental techniques. These methods are exemplified with affymetrix genome-wide SNP6.0 data from 755 cancer cell lines, enabling inference upon a number of features of biological interest. These data and the coded algorithm are freely available for download.
Affiliation(s)
- Chris D Greenman
- Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
22
Abstract
In this chapter, we introduce a few statistical algorithms for calling gains and losses in array-based comparative genomic hybridization (array CGH) data, including CBS, CLAC, CGHseg, and Fused Lasso. We illustrate the performance of the methods through simulated and real data examples. We also provide brief guidance on how to use the corresponding software at the end of this chapter.
Affiliation(s)
- Pei Wang
- Cancer Prevention Program, Division of Public Health Science, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
23
Han ST, Kang HC, Choi HS, Jang MS. A Study on Development of Scoring Campaign System. Korean Journal of Applied Statistics 2009. [DOI: 10.5351/kjas.2009.22.1.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
24
Kim BS, Kim SC. A Penalized Spline Based Method for Detecting the DNA Copy Number Alteration in an Array-CGH Experiment. Korean Journal of Applied Statistics 2009. [DOI: 10.5351/kjas.2009.22.1.115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
25
Budinska E, Gelnarova E, Schimek MG. MSMAD: a computationally efficient method for the analysis of noisy array CGH data. Bioinformatics 2009; 25:703-13. [PMID: 19147666 DOI: 10.1093/bioinformatics/btp022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION Genome analysis has become one of the most important tools for understanding the complex process of cancerogenesis. With increasing resolution of CGH arrays, the demand for computationally efficient algorithms arises, which are effective in the detection of aberrations even in very noisy data. RESULTS We developed a rather simple, non-parametric technique of high computational efficiency for CGH array analysis that adopts a median absolute deviation concept for breakpoint detection, comprising median smoothing for pre-processing. The resulting algorithm has the potential to outperform any single smoothing approach as well as several recently proposed segmentation techniques. We show its performance through the application of simulated and real datasets in comparison to three other methods for array CGH analysis. IMPLEMENTATION Our approach is implemented in the R-language and environment for statistical computing (version 2.6.1 for Windows, R-project, 2007). The code is available at: http://www.iba.muni.cz/~budinska/msmad.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eva Budinska
- Institute of Biostatistics and Analyses, Masaryk University, Kamenice 126/3, 625 00 Brno, Czech Republic.
26
Huang H, Nguyen N, Oraintara S, Vo A. Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 2008; 9 Suppl 2:S17. [PMID: 18831782 PMCID: PMC2559881 DOI: 10.1186/1471-2164-9-s2-s17] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022]
Abstract
Background Array-based comparative genomic hybridization (array CGH) is a highly efficient technique, allowing the simultaneous measurement of genomic DNA copy number at hundreds or thousands of loci and the reliable detection of local one-copy-level variations. Characterization of these DNA copy number changes is important for both the basic understanding of cancer and its diagnosis. In order to develop effective methods to identify aberration regions from array CGH data, many recent research work focus on both smoothing-based and segmentation-based data processing. In this paper, we propose stationary packet wavelet transform based approach to smooth array CGH data. Our purpose is to remove CGH noise in whole frequency while keeping true signal by using bivariate model. Results In both synthetic and real CGH data, Stationary Wavelet Packet Transform (SWPT) is the best wavelet transform to analyze CGH signal in whole frequency. We also introduce a new bivariate shrinkage model which shows the relationship of CGH noisy coefficients of two scales in SWPT. Before smoothing, the symmetric extension is considered as a preprocessing step to save information at the border. Conclusion We have designed the SWTP and the SWPT-Bi which are using the stationary wavelet packet transform with the hard thresholding and the new bivariate shrinkage estimator respectively to smooth the array CGH data. We demonstrate the effectiveness of our approach through theoretical and experimental exploration of a set of array CGH data, including both synthetic data and real data. The comparison results show that our method outperforms the previous approaches.
Affiliation(s)
- Heng Huang
- Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA.