1
Wang J, Li N, Meng Z, Li Q. Change point detection for high dimensional data via kernel measure with application to human aging brain data. Stat Med 2023; 42:4644-4663. [PMID: 37649243 DOI: 10.1002/sim.9881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 08/07/2023] [Accepted: 08/14/2023] [Indexed: 09/01/2023]
Abstract
Identifying the existence and locations of change points is a broadly encountered task in many statistical application areas. Existing change point detection methods may produce unsatisfactory results for high-dimensional data because they make distributional assumptions that are hard to verify in practice. Moreover, some methods require certain parameters (such as the number of change points) to be estimated beforehand, making their power sensitive to these values. Here, we propose a kernel-based $U$-statistic to identify change points (KUCP) for high dimensional data, which is free of distributional assumptions and sup-parameter estimations. Specifically, we employ a kernel function to describe similarities among the subjects and construct a $U$-statistic to test the existence of a change point at a given location. The asymptotic properties of the $U$-statistic are deduced. We also develop a procedure to locate the change points sequentially via a dichotomy algorithm. Extensive simulations demonstrate that KUCP has higher sensitivity in identifying the existence of change points and higher accuracy in locating these change points than its counterparts. We further illustrate its practical utility by analyzing gene expression data from the human brain to detect the time point when gene expression profiles begin to change, which has been reported to be closely related to brain aging.
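The abstract above describes a kernel-based statistic evaluated at each candidate location and a dichotomy (bisection) procedure for locating several change points. The listing does not give the KUCP statistic itself, so the following is only a minimal sketch under stated assumptions: it substitutes the standard unbiased MMD U-statistic with a Gaussian kernel, and it calibrates the test by permutation rather than by the asymptotic distribution derived in the paper. All function names (`gaussian_gram`, `split_statistic`, `best_split`, `detect`) and parameters are hypothetical.

```python
import numpy as np

def gaussian_gram(X):
    """Gaussian kernel Gram matrix; bandwidth set by the median heuristic (assumption)."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    scale = np.median(d2[np.triu_indices(len(X), k=1)]) or 1.0
    return np.exp(-d2 / scale)

def split_statistic(K, t):
    """Unbiased MMD^2 U-statistic comparing rows [0, t) with rows [t, n).

    This is a stand-in for the paper's statistic, not the KUCP statistic itself.
    """
    n = K.shape[0]
    m = n - t
    within_left = (K[:t, :t].sum() - np.trace(K[:t, :t])) / (t * (t - 1))
    within_right = (K[t:, t:].sum() - np.trace(K[t:, t:])) / (m * (m - 1))
    between = K[:t, t:].mean()
    return within_left + within_right - 2.0 * between

def best_split(K, min_seg):
    """Return the candidate location with the largest statistic, and that value."""
    n = K.shape[0]
    candidates = range(min_seg, n - min_seg + 1)
    scores = [split_statistic(K, t) for t in candidates]
    i = int(np.argmax(scores))
    return candidates[i], scores[i]

def detect(X, min_seg=5, n_perm=200, alpha=0.05, seed=0, offset=0):
    """Bisection: test the best split by permutation, then recurse on both halves."""
    n = len(X)
    if n < 2 * min_seg:
        return []
    rng = np.random.default_rng(seed)
    K = gaussian_gram(X)
    t, stat = best_split(K, min_seg)
    # Permuting the observation order breaks any change-point structure.
    null = []
    for _ in range(n_perm):
        p = rng.permutation(n)
        null.append(best_split(K[np.ix_(p, p)], min_seg)[1])
    p_value = (1 + sum(s >= stat for s in null)) / (1 + n_perm)
    if p_value > alpha:
        return []
    left = detect(X[:t], min_seg, n_perm, alpha, seed, offset)
    right = detect(X[t:], min_seg, n_perm, alpha, seed, offset + t)
    return left + [offset + t] + right
```

A reported location `t` is the index of the first observation of the new segment. The permutation step costs many Gram-matrix scans, which is one reason an asymptotic null distribution, as derived in the paper, is attractive for long sequences.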
Affiliation(s)
- Jinjuan Wang
  - School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China
- Na Li
  - School of Applied Science, Beijing Information Science and Technology University, Beijing, China
- Zhen Meng
  - School of Statistics, Capital University of Economics and Business, Beijing, China
- Qizhai Li
  - LSC Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
2
Juodakis J, Marsland S. Epidemic changepoint detection in the presence of nuisance changes. Stat Pap (Berl) 2023; 64:17-39. [PMID: 35400849 PMCID: PMC8977442 DOI: 10.1007/s00362-022-01307-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/07/2021] [Revised: 01/09/2022] [Accepted: 03/14/2022] [Indexed: 01/25/2023]
Abstract
Many time series problems feature epidemic changes: segments where a parameter deviates from a background baseline. Detection of such changepoints can be improved by accounting for the epidemic structure, but this is currently difficult if the background level is unknown. Furthermore, in practical data the background often undergoes nuisance changes, which interfere with standard estimation techniques and appear as false alarms. To solve these issues, we develop a new, efficient approach to simultaneously detect epidemic changes and estimate an unknown, but fixed, background level, based on a penalised cost. Using it, we build a two-level detector that models and separates nuisance and signal changes. The analytic and computational properties of the proposed methods are established, including consistency and convergence. We demonstrate via simulations that our two-level detector provides accurate estimation of changepoints under a nuisance process, while other state-of-the-art detectors fail. In real-world genomic and demographic datasets, the proposed method identified and localised target events while separating out seasonal variations and experimental artefacts. Supplementary information: the online version contains supplementary material available at 10.1007/s00362-022-01307-x.
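To make the idea of detecting epidemic segments against an unknown but fixed background via a penalised cost more concrete, here is a rough sketch. It is not the authors' algorithm: it assumes a change in mean with squared-error cost, uses a plain O(n²) dynamic programme, and alternates between segmenting and re-estimating the background level. The names `epidemic_segments` and `fit_epidemic` and the `penalty` and `min_len` parameters are hypothetical, and the nuisance-change level of the two-level detector is not modelled.

```python
import numpy as np

def epidemic_segments(x, mu, penalty, min_len=2):
    """Minimise a penalised squared-error cost for a given background level mu.

    Each point is either background (cost (x - mu)^2) or belongs to an epidemic
    segment fitted by its own mean, and each segment pays `penalty`.
    """
    n = len(x)
    c1 = np.concatenate([[0.0], np.cumsum(x)])
    c2 = np.concatenate([[0.0], np.cumsum(x * x)])

    def seg_cost(s, e):
        # Residual sum of squares of x[s:e] around its own mean.
        total = c1[e] - c1[s]
        return (c2[e] - c2[s]) - total * total / (e - s)

    best = np.zeros(n + 1)      # best[t]: optimal cost of x[:t]
    back = [None] * (n + 1)     # back[t]: start of a segment ending at t, if any
    for t in range(1, n + 1):
        best[t] = best[t - 1] + (x[t - 1] - mu) ** 2
        for s in range(0, t - min_len + 1):
            cand = best[s] + seg_cost(s, t) + penalty
            if cand < best[t]:
                best[t], back[t] = cand, s
    # Backtrack to recover the segments as half-open index pairs (start, end).
    segments, t = [], n
    while t > 0:
        if back[t] is None:
            t -= 1
        else:
            segments.append((back[t], t))
            t = back[t]
    return segments[::-1]

def fit_epidemic(x, penalty, max_iter=20):
    """Alternate between segmentation and re-estimating the background mean."""
    x = np.asarray(x, dtype=float)
    mu = float(np.median(x))    # robust starting value (assumption)
    segments = []
    for _ in range(max_iter):
        segments = epidemic_segments(x, mu, penalty)
        background = np.ones(len(x), dtype=bool)
        for s, e in segments:
            background[s:e] = False
        if not background.any():
            break
        new_mu = float(x[background].mean())
        converged = abs(new_mu - mu) < 1e-8
        mu = new_mu
        if converged:
            break
    return mu, segments
```

The penalty trades off missed segments against false alarms; in this sketch it has to be chosen relative to the noise variance, for example a multiple of the variance times the log of the series length.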
Affiliation(s)
- Julius Juodakis
  - School of Mathematics and Statistics, Victoria University of Wellington, PO Box 600, Wellington, 6140 New Zealand
- Stephen Marsland
  - School of Mathematics and Statistics, Victoria University of Wellington, PO Box 600, Wellington, 6140 New Zealand
3
Luo X, Cai G, Mclain AC, Amos CI, Cai B, Xiao F. BMI-CNV: a Bayesian framework for multiple genotyping platforms detection of copy number variants. Genetics 2022; 222:iyac147. [PMID: 36171678 PMCID: PMC9713397 DOI: 10.1093/genetics/iyac147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Accepted: 09/08/2022] [Indexed: 12/13/2022] Open
Abstract
Whole-exome sequencing (WES) enables the detection of copy number variants (CNVs) with high resolution in protein-coding regions. However, variants in the intergenic or intragenic regions are excluded from studies. Fortunately, many of these samples have been previously sequenced by other genotyping platforms, such as SNP arrays, which are sparse but cover a wide range of genomic regions. Moreover, conventional single sample-based methods suffer from a high false discovery rate due to prominent data noise. Therefore, methods for integrating multiple genotyping platforms and multiple samples are in high demand for improved copy number variant detection. We developed BMI-CNV, a Bayesian Multisample and Integrative CNV profiling method, with data sequenced by both whole-exome sequencing and microarray. For the multisample integration, we identify the shared copy number variant regions across samples using a Bayesian probit stick-breaking process model coupled with a Gaussian mixture model estimation. In extensive simulations, BMI-CNV outperformed existing methods with improved accuracy. In the matched data from the 1000 Genomes Project and HapMap project, BMI-CNV also accurately detected common variants and significantly enlarged the detection spectrum of whole-exome sequencing. Further application to the data from The Research of International Cancer of Lung consortium (TRICL) identified lung cancer risk variant candidates in the 17q11.2, 1p36.12, 8q23.1, and 5q22.2 regions.
Affiliation(s)
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Alexander C Mclain
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Christopher I Amos
  - Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX 77030, USA
- Bo Cai
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
4
Follain B, Wang T, Samworth RJ. High‐dimensional changepoint estimation with heterogeneous missingness. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12540] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Affiliation(s)
- Bertille Follain
  - Statistical Laboratory, University of Cambridge, Cambridge, Cambridgeshire, UK
  - Ecole Normale Supérieure, PSL Research University, INRIA, Paris, France
- Tengyao Wang
  - Department of Statistics, London School of Economics and Political Science, London, UK
  - Department of Statistical Science, University College London, London, UK
5
Fisch ATM, Eckley IA, Fearnhead P. A linear time method for the detection of collective and point anomalies. Stat Anal Data Min 2022. [DOI: 10.1002/sam.11586] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/07/2022]
Affiliation(s)
- Idris A. Eckley
  - Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
- Paul Fearnhead
  - Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
6
De SK, Mukherjee SS. Exact tests for offline changepoint detection in multichannel binary and count data with application to networks. J STAT COMPUT SIM 2022. [DOI: 10.1080/00949655.2022.2081689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Shyamal K. De
  - Applied Statistics Unit, Indian Statistical Institute, Kolkata, India
- Soumendu Sundar Mukherjee
  - Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, India
  - Department of Mathematics, National University of Singapore, Singapore, Singapore
7
Liu B, Zhang X, Liu Y. High Dimensional Change Point Inference: Recent Developments and Extensions. J MULTIVARIATE ANAL 2022; 188:104833. [PMID: 35177873 PMCID: PMC8846568 DOI: 10.1016/j.jmva.2021.104833] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/20/2022]
Abstract
Change point analysis aims to detect structural changes in a data sequence. It has been an active research area since it was introduced in the 1950s. In modern statistical applications, however, high-throughput data with increasing dimensions are ubiquitous in fields ranging from economics and finance to genetics and engineering. For those problems, the earlier works are typically no longer applicable. As a result, the problem of testing a change point for high dimensional data sequences has been an important yet challenging task. In this paper, we first focus on models for at most one change point, review recent state-of-the-art techniques for change point testing of high dimensional mean vectors, and compare their theoretical properties. Based on that, we provide a survey of some extensions to general high dimensional parameters beyond mean vectors as well as strategies for testing multiple change points in high dimensions. Finally, we discuss some open problems for possible future research directions.
Affiliation(s)
- Bin Liu
  - School of Management, Fudan University, Shanghai, 200433, China
- Xinsheng Zhang
  - School of Management, Fudan University, Shanghai, 200433, China
- Yufeng Liu
  - Department of Statistics and Operations Research, Department of Genetics, and Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, U.S.A. (corresponding author)
8
Hahn G. Online multivariate changepoint detection with type I error control and constant time/memory updates per series. Stat Probab Lett 2022. [DOI: 10.1016/j.spl.2021.109258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
9
Fisch ATM, Eckley IA, Fearnhead P. Subset Multivariate Collective and Point Anomaly Detection. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1987257] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/20/2022]
Affiliation(s)
- Idris A. Eckley
  - Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
- Paul Fearnhead
  - Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
10
Song H, Chen H. Asymptotic distribution-free change-point detection for data with repeated observations. Biometrika 2021. [DOI: 10.1093/biomet/asab048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
In the regime of change-point detection, a nonparametric framework based on scan statistics utilizing graphs representing similarities among observations is gaining attention due to its flexibility and good performance for high-dimensional and non-Euclidean data sequences, which are ubiquitous in this big data era. However, this graph-based framework encounters problems when there are repeated observations in the sequence, which often happens for discrete data, such as network data. In this work, we extend the graph-based framework to solve this problem by averaging or taking the union of all possible optimal graphs resulting from repeated observations. We consider both the single change-point alternative and the changed-interval alternative, and derive analytic formulas to control the Type I error for the new methods, making them quickly applicable to large datasets. The extended methods are illustrated on an application in detecting changes in a sequence of dynamic networks over time. All proposed methods are implemented in an R package gSeg available on CRAN.
Affiliation(s)
- Hoseung Song
  - Department of Statistics, University of California, Davis, Davis, California 95616, U.S.A
- Hao Chen
  - Department of Statistics, University of California, Davis, Davis, California 95616, U.S.A
11
Wang R, Lin DY, Jiang Y. SCOPE: A Normalization and Copy-Number Estimation Method for Single-Cell DNA Sequencing. Cell Syst 2021; 10:445-452.e6. [PMID: 32437686 DOI: 10.1016/j.cels.2020.03.005] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 10/17/2019] [Revised: 02/11/2020] [Accepted: 03/26/2020] [Indexed: 01/01/2023]
Abstract
Whole-genome single-cell DNA sequencing (scDNA-seq) enables characterization of copy-number profiles at the cellular level. We propose SCOPE, a normalization and copy-number estimation method for the noisy scDNA-seq data. SCOPE's main features include the following: (1) a Poisson latent factor model for normalization, which borrows information across cells and regions to estimate bias, using in silico identified negative control cells; (2) an expectation-maximization algorithm embedded in the normalization step, which accounts for the aberrant copy-number changes and allows direct ploidy estimation without the need for post hoc adjustment; and (3) a cross-sample segmentation procedure to identify breakpoints that are shared across cells with the same genetic background. We evaluate SCOPE on a diverse set of scDNA-seq data in cancer genomics and show that SCOPE offers accurate copy-number estimates and successfully reconstructs subclonal structure. A record of this paper's transparent peer review process is included in the Supplemental Information.
Affiliation(s)
- Rujin Wang
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC 27599, USA
- Dan-Yu Lin
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA
- Yuchao Jiang
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA
  - Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC 27599, USA
12
Affiliation(s)
- Likai Chen
  - Department of Mathematics and Statistics, Washington University in St. Louis, MO
- Weining Wang
  - Department of Economics and Related Studies, University of York, New York
- Wei Biao Wu
  - Department of Statistics, University of Chicago, Chicago, IL
13
Taylor SAC, Killick R, Burr J, Rogerson L. Assessing daily patterns using home activity sensors and within period changepoint detection. J R Stat Soc Ser C Appl Stat 2021. [DOI: 10.1111/rssc.12472] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
Affiliation(s)
- Rebecca Killick
  - Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
14
Zhang Y, Wang R, Shao X. Adaptive Inference for Change Points in High-Dimensional Data. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1884562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Yangfan Zhang
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL
- Runmin Wang
  - Department of Statistical Science, Southern Methodist University, Dallas, TX
- Xiaofeng Shao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL
15
Liu H, Gao C, Samworth RJ. Minimax rates in sparse, high-dimensional change point detection. Ann Stat 2021. [DOI: 10.1214/20-aos1994] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
Affiliation(s)
- Haoyang Liu
  - Department of Statistics, University of Chicago
- Chao Gao
  - Department of Statistics, University of Chicago
- Richard J. Samworth
  - Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge
16
Liang W, Guo Y, Wu Y. Joint estimation of gradual variance changepoint for panel data with common structures. Stat (Int Stat Inst) 2021. [DOI: 10.1002/sta4.359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Affiliation(s)
- Wanfeng Liang
  - School of Statistics and Data Science, Nankai University, Tianjin, China
- Yunfei Guo
  - School of Statistics and Data Science, Nankai University, Tianjin, China
  - Mathematics Department, Yanbian University, Jilin, China
- Yue Wu
  - School of Statistics and Data Science, Nankai University, Tianjin, China
17
Hutch MR, Liu M, Avillach P, Luo Y, Bourgeois FT. National Trends in Disease Activity for COVID-19 Among Children in the US. Front Pediatr 2021; 9:700656. [PMID: 34307261 PMCID: PMC8295521 DOI: 10.3389/fped.2021.700656] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/26/2021] [Accepted: 06/04/2021] [Indexed: 11/13/2022] Open
Abstract
Ongoing monitoring of COVID-19 disease burden in children will help inform mitigation strategies and guide pediatric vaccination programs. Leveraging a national, comprehensive dataset, we sought to quantify and compare disease burden and trends in hospitalizations for children and adults in the US.
Affiliation(s)
- Meghan R Hutch
  - Department of Preventive Medicine, Northwestern University, Chicago, IL, United States
- Molei Liu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Paul Avillach
  - Department of Pediatrics, Harvard Medical School, Boston, MA, United States
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
  - Computational Health Informatics Program (CHIP), Boston Children's Hospital, Boston, MA, United States
- Yuan Luo
  - Department of Preventive Medicine, Northwestern University, Chicago, IL, United States
- Florence T Bourgeois
  - Department of Pediatrics, Harvard Medical School, Boston, MA, United States
  - Computational Health Informatics Program (CHIP), Boston Children's Hospital, Boston, MA, United States
18
Yu M, Chen X. Finite sample change point inference and identification for high‐dimensional mean vectors. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12406] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/27/2022]
Affiliation(s)
- Mengjia Yu
  - Department of Statistics, University of Illinois at Urbana‐Champaign, Champaign, IL, USA
- Xiaohui Chen
  - Department of Statistics, University of Illinois at Urbana‐Champaign, Champaign, IL, USA
19
Li Z, Liu Y, Lin X. Simultaneous Detection of Signal Regions Using Quadratic Scan Statistics With Applications to Whole Genome Association Studies. J Am Stat Assoc 2020; 117:823-834. [PMID: 35845434 PMCID: PMC9285665 DOI: 10.1080/01621459.2020.1822849] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/25/2019] [Revised: 06/18/2020] [Accepted: 08/25/2020] [Indexed: 01/03/2023]
Abstract
We consider in this paper detection of signal regions associated with disease outcomes in whole genome association studies. Gene- or region-based methods have become increasingly popular in whole genome association analysis as a complementary approach to traditional individual variant analysis. However, these methods test for the association between an outcome and the genetic variants in a pre-specified region, e.g., a gene. In view of massive intergenic regions in whole genome sequencing (WGS) studies, we propose a computationally efficient quadratic scan (Q-SCAN) statistic based method to detect the existence and the locations of signal regions by scanning the genome continuously. The proposed method accounts for the correlation (linkage disequilibrium) among genetic variants, and allows for signal regions to have both causal and neutral variants, and the effects of signal variants to be in different directions. We study the asymptotic properties of the proposed Q-SCAN statistics. We derive an empirical threshold that controls for the family-wise error rate, and show that under regularity conditions the proposed method consistently selects the true signal regions. We perform simulation studies to evaluate the finite sample performance of the proposed method. Our simulation results show that the proposed procedure outperforms the existing methods, especially when signal regions have causal variants whose effects are in different directions, or are contaminated with neutral variants. We illustrate Q-SCAN by analyzing the WGS data from the Atherosclerosis Risk in Communities study.
Affiliation(s)
- Zilin Li
  - Harvard University T H Chan School of Public Health, Biostatistics, 655 Huntington Avenue, Boston, 02115 United States
- Yaowu Liu
  - Southwestern University of Finance and Economics School of Statistics, Chengdu, 610074 China
- Xihong Lin
  - Harvard University T H Chan School of Public Health, Biostatistics, 655 Huntington Avenue, Boston, 02115 United States
20
Eckley I, Kirch C, Weber S. A novel change-point approach for the detection of gas emission sources using remotely contained concentration data. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
21
Tang ZZ, Sliwoski GR, Chen G, Jin B, Bush WS, Li B, Capra JA. PSCAN: Spatial scan tests guided by protein structures improve complex disease gene discovery and signal variant detection. Genome Biol 2020; 21:217. [PMID: 32847609 PMCID: PMC7448521 DOI: 10.1186/s13059-020-02121-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/09/2019] [Accepted: 07/27/2020] [Indexed: 12/25/2022] Open
Abstract
Germline disease-causing variants are generally more spatially clustered in protein 3-dimensional structures than benign variants. Motivated by this tendency, we develop a fast and powerful protein-structure-based scan (PSCAN) approach for evaluating gene-level associations with complex disease and detecting signal variants. We validate PSCAN's performance on synthetic data and two real data sets for lipid traits and Alzheimer's disease. Our results demonstrate that PSCAN performs competitively with existing gene-level tests while increasing power and identifying more specific signal variant sets. Furthermore, PSCAN enables generation of hypotheses about the molecular basis for the associations in the context of protein structures and functional domains.
Affiliation(s)
- Zheng-Zheng Tang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, 53715 WI USA
  - Wisconsin Institute for Discovery, Madison, 53715 WI USA
- Gregory R. Sliwoski
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, 37232 TN USA
- Guanhua Chen
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, 53715 WI USA
- Bowen Jin
  - Department for Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, 44106 OH USA
- William S. Bush
  - Department for Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, 44106 OH USA
  - Institute for Computational Biology, Case Western Reserve University, Cleveland, 44106 OH USA
- Bingshan Li
  - Department of Molecular Physiology & Biophysics, Vanderbilt University Medical Center, Nashville, 37232 TN USA
- John A. Capra
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, 37232 TN USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, 37232 TN USA
  - Departments of Biological Sciences and Computer Science, Vanderbilt University, Nashville, 37232 TN USA
  - Center for Structural Biology, Vanderbilt University, Nashville, 37232 TN USA
22
Mallory XF, Edrisi M, Navin N, Nakhleh L. Methods for copy number aberration detection from single-cell DNA-sequencing data. Genome Biol 2020; 21:208. [PMID: 32807205 PMCID: PMC7433197 DOI: 10.1186/s13059-020-02119-8] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 03/25/2020] [Accepted: 07/23/2020] [Indexed: 02/06/2023] Open
Abstract
Copy number aberrations (CNAs), which are pathogenic copy number variations (CNVs), play an important role in the initiation and progression of cancer. Single-cell DNA-sequencing (scDNAseq) technologies produce data that is ideal for inferring CNAs. In this review, we review eight methods that have been developed for detecting CNAs in scDNAseq data, and categorize them according to the steps of a seven-step pipeline that they employ. Furthermore, we review models and methods for evolutionary analyses of CNAs from scDNAseq data and highlight advances and future research directions for computational methods for CNA detection from scDNAseq data.
Affiliation(s)
- Xian F. Mallory
  - Department of Computer Science, Rice University, Houston, TX USA
  - Department of Computer Science, Florida State University, Tallahassee, FL USA
- Nicholas Navin
  - Department of Genetics, the University of Texas M.D. Anderson Cancer Center, Houston, TX USA
- Luay Nakhleh
  - Department of Computer Science, Rice University, Houston, TX USA
23
Liu B, Zhou C, Zhang X, Liu Y. A unified data‐adaptive framework for high dimensional change point detection. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12375] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
Affiliation(s)
- Bin Liu
  - Fudan University, Shanghai, People's Republic of China
- Cheng Zhou
  - Robotics X Lab, Tencent, People's Republic of China
- Yufeng Liu
  - University of North Carolina at Chapel Hill, USA
24
Fang X, Li J, Siegmund D. Segmentation and estimation of change-point models: False positive control and confidence regions. Ann Stat 2020. [DOI: 10.1214/19-aos1861] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
25
Atashgar K, Rafiee N, Karbasian M. A new hybrid approach to panel data change point detection. COMMUN STAT-THEOR M 2020. [DOI: 10.1080/03610926.2020.1760298] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
Affiliation(s)
- Karim Atashgar
  - Department of Industrial Engineering, Malek Ashtar University of Technology, Tehran, Iran
- Naser Rafiee
  - Department of Industrial Engineering, Malek Ashtar University of Technology, Tehran, Iran
- Mahdi Karbasian
  - Department of Industrial Engineering, Malek Ashtar University of Technology, Tehran, Iran
26
Alshawaqfeh M, Al Kawam A, Serpedin E, Datta A. Robust Recurrent CNV Detection in the Presence of Inter-Subject Variability. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1056-1067. [PMID: 30387737 DOI: 10.1109/tcbb.2018.2878560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The study of recurrent copy number variations (CNVs) plays an important role in understanding the onset and evolution of complex diseases such as cancer. Array-based comparative genomic hybridization (aCGH) is a widely used microarray based technology for identifying CNVs. However, due to high noise levels and inter-sample variability, detecting recurrent CNVs from aCGH data remains a challenging topic. This paper proposes a novel method for identification of the recurrent CNVs. In the proposed method, the noisy aCGH data is modeled as the superposition of three matrices: a full-rank matrix of weighted piece-wise generating signals accounting for the clean aCGH data, a Gaussian noise matrix to model the inherent experimentation errors and other sources of error, and a sparse matrix to capture the sparse inter-sample (sample-specific) variations. We demonstrated the ability of our method to separate accurately recurrent CNVs from sample-specific variations and noise in both simulated (artificial) data and real data. The proposed method produced more accurate results than current state-of-the-art methods used in recurrent CNV detection and exhibited robustness to noise and sample-specific variations.
|
27
|
Li J. Asymptotic distribution-free change-point detection based on interpoint distances for high-dimensional data. J Nonparametr Stat 2020. [DOI: 10.1080/10485252.2019.1710505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Affiliation(s)
- Jun Li
- Department of Statistics, University of California, Riverside, CA, USA
| |
|
28
|
Fischer A, Picard D. On change-point estimation under Sobolev sparsity. Electron J Stat 2020. [DOI: 10.1214/20-ejs1692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
29
|
Enikeeva F, Harchaoui Z. High-dimensional change-point detection under sparse alternatives. Ann Stat 2019. [DOI: 10.1214/18-aos1740] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
30
|
Li W, Xiang D, Tsung F, Pu X. A Diagnostic Procedure for High-Dimensional Data Streams via Missed Discovery Rate Control. Technometrics 2019. [DOI: 10.1080/00401706.2019.1575284] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Affiliation(s)
- Wendong Li
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, China
| | - Dongdong Xiang
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, China
| | - Fugee Tsung
- Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Kowloon, Hong Kong
| | - Xiaolong Pu
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, China
| |
|
31
|
Zhang Z, Cheng H, Hong X, Di Narzo AF, Franzen O, Peng S, Ruusalepp A, Kovacic JC, Bjorkegren JLM, Wang X, Hao K. EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data. Nucleic Acids Res 2019; 47:e39. [PMID: 30722045 PMCID: PMC6468244 DOI: 10.1093/nar/gkz068] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2018] [Revised: 12/17/2018] [Accepted: 01/25/2019] [Indexed: 12/30/2022] Open
Abstract
The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at raw data level; (b) assembles individual CNV calls into CNV regions (CNVRs) from multiple existing callers with complementary strengths by a heuristic algorithm; (c) re-genotypes each CNVR with local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries by local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
Affiliation(s)
- Zhongyang Zhang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Haoxiang Cheng
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Xiumei Hong
- Center on the Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | - Antonio F Di Narzo
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Oscar Franzen
- Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Karolinska Universitetssjukhuset, Huddinge, Sweden
| | - Shouneng Peng
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Arno Ruusalepp
- Department of Cardiac Surgery, Tartu University Hospital, Tartu, Estonia
| | - Jason C Kovacic
- Cardiovascular Research Center, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Johan L M Bjorkegren
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Karolinska Universitetssjukhuset, Huddinge, Sweden
| | - Xiaobin Wang
- Center on the Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Division of General Pediatrics & Adolescent Medicine, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Ke Hao
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- The Tenth People's Hospital, Tongji University, Shanghai 200072, China
- College of Environmental Science and Engineering, Tongji University, Shanghai 200092, China
| |
|
32
|
Chu L, Chen H. Asymptotic distribution-free change-point detection for multivariate and non-Euclidean data. Ann Stat 2019. [DOI: 10.1214/18-aos1691] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
33
|
Collilieux X, Lebarbier E, Robin S. A factor model approach for the joint segmentation with between‐series correlation. Scand Stat Theory Appl 2018. [DOI: 10.1111/sjos.12368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
Affiliation(s)
- Xavier Collilieux
- Laboratoire de Recherche en Géodésie (LAREG), l'Institut National de l'information Géographique et forestière (IGN), Université Paris Diderot, Paris, France
| | - Emilie Lebarbier
- UMR MIA‐Paris, AgroParisTech, INRA, Université Paris‐Saclay, Paris, France
| | - Stéphane Robin
- UMR MIA‐Paris, AgroParisTech, INRA, Université Paris‐Saclay, Paris, France
| |
|
34
|
Affiliation(s)
- Yanhong Wu
- Department of Mathematics, California State University Stanislaus, Turlock, California, USA
| |
|
35
|
DMD genomic deletions characterize a subset of progressive/higher-grade meningiomas with poor outcome. Acta Neuropathol 2018; 136:779-792. [PMID: 30123936 DOI: 10.1007/s00401-018-1899-7] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2018] [Revised: 08/10/2018] [Accepted: 08/10/2018] [Indexed: 01/22/2023]
Abstract
Progressive meningiomas that have failed surgery and radiation have a poor prognosis and no standard therapy. While meningiomas are more common in females overall, progressive meningiomas are enriched in males. We performed a comprehensive molecular characterization of 169 meningiomas from 53 patients with progressive/high-grade tumors, including matched primary and recurrent samples. Exome sequencing in an initial cohort (n = 24) detected frequent alterations in genes residing on the X chromosome, with somatic intragenic deletions of the dystrophin-encoding and muscular dystrophy-associated DMD gene as the most common alteration (n = 5, 20.8%), along with alterations of other known X-linked cancer-related genes KDM6A (n = 2, 8.3%), DDX3X, RBM10 and STAG2 (n = 1, 4.1% each). DMD inactivation (by genomic deletion or loss of protein expression) was ultimately detected in 17/53 progressive meningioma patients (32%). Importantly, patients with tumors harboring DMD inactivation had a shorter overall survival (OS) than their wild-type counterparts [5.1 years (95% CI 1.3-9.0) vs. median not reached (95% CI 2.9-not reached), p = 0.006]. Given the known poor prognostic association of TERT alterations in these tumors, we also assessed for these events, and found seven patients with TERT promoter mutations and three with TERT rearrangements in this cohort (n = 10, 18.8%), including a recurrent novel RETREG1-TERT rearrangement that was present in two patients. In a multivariate model, DMD inactivation (p = 0.033, HR = 2.6, 95% CI 1.0-6.6) and TERT alterations (p = 0.005, HR = 3.8, 95% CI 1.5-9.9) were mutually independent in predicting unfavorable outcomes. Thus, DMD alterations identify a subset of progressive/high-grade meningiomas with worse outcomes.
|
36
|
Cai Q. A scoring criterion for rejection of clustered p-values. Comput Stat Data Anal 2018. [DOI: 10.1016/j.csda.2016.02.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
37
|
|
38
|
Zhang Z, Hao K. Using SAAS-CNV to Detect and Characterize Somatic Copy Number Alterations in Cancer Genomes from Next Generation Sequencing and SNP Array Data. Methods Mol Biol 2018; 1833:29-47. [PMID: 30039361 DOI: 10.1007/978-1-4939-8666-8_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
Somatic copy number alterations (SCNAs) are profound in cancer genomes at different stages: oncogenesis, progression, and metastasis. Accurate detection and characterization of the SCNA landscape at genome-wide scale are of great importance. Next-generation sequencing and SNP arrays are the current technologies of choice for SCNA analysis. They are able to quantify SCNAs with high resolution, but they also raise great challenges in data analysis. To this end, we have developed an R package, saasCNV, for SCNA analysis using (1) whole-genome sequencing (WGS), (2) whole-exome sequencing (WES) or (3) whole-genome SNP array data. In this chapter, we describe the features of the package and provide detailed step-by-step instructions.
Affiliation(s)
- Zhongyang Zhang
- Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ke Hao
- Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
|
39
|
Fan Z, Mackey L. Empirical Bayesian analysis of simultaneous changepoints in multiple data sequences. Ann Appl Stat 2017. [DOI: 10.1214/17-aoas1075] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
40
|
Wang T, Samworth RJ. High dimensional change point estimation via sparse projection. J R Stat Soc Series B Stat Methodol 2017. [DOI: 10.1111/rssb.12243] [Citation(s) in RCA: 81] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
41
|
Song C, Min X, Zhang H. THE SCREENING AND RANKING ALGORITHM FOR CHANGE-POINTS DETECTION IN MULTIPLE SAMPLES. Ann Appl Stat 2017; 10:2102-2129. [PMID: 28090239 DOI: 10.1214/16-aoas966] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
The chromosome copy number variation (CNV) is the deviation of genomic regions from their normal copy number states, which may associate with many human diseases. Current genetic studies usually collect hundreds to thousands of samples to study the association between CNV and diseases. CNVs can be called by detecting the change-points in mean for sequences of array-based intensity measurements. Although multiple samples are of interest, the majority of the available CNV calling methods are single sample based. Only a few multiple sample methods have been proposed using scan statistics that are computationally intensive and designed toward either common or rare change-points detection. In this paper, we propose a novel multiple sample method by adaptively combining the scan statistic of the screening and ranking algorithm (SaRa), which is computationally efficient and is able to detect both common and rare change-points. We prove that asymptotically this method can find the true change-points with almost certainty and show in theory that multiple sample methods are superior to single sample methods when shared change-points are of interest. Additionally, we report extensive simulation studies to examine the performance of our proposed method. Finally, using our proposed method as well as two competing approaches, we attempt to detect CNVs in the data from the Primary Open-Angle Glaucoma Genes and Environment study, and conclude that our method is faster and requires less information while our ability to detect the CNVs is comparable or better.
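The abstract above refers to the scan statistic of the screening and ranking algorithm (SaRa) without spelling it out. A minimal single-sequence sketch is given below, assuming the commonly described form in which the statistic at a position is the absolute difference between the mean of the `h` points to its right and the mean of the `h` points to its left, and candidates are positions where that statistic is a local maximum above a threshold. The function names, the bandwidth `h`, and the `threshold` argument are hypothetical, and the adaptive multi-sample combination proposed in the paper is not reproduced here.

```python
import numpy as np

def local_diagnostic(y, h):
    """Absolute difference between right-window and left-window means.

    d[t] compares y[t:t+h] (right) with y[t-h:t] (left); positions too
    close to either end to fit both windows are left at 0.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))  # csum[k] = sum of y[:k]
    d = np.zeros(n)
    for t in range(h, n - h + 1):
        left = (csum[t] - csum[t - h]) / h
        right = (csum[t + h] - csum[t]) / h
        d[t] = abs(right - left)
    return d

def screen_change_points(y, h, threshold):
    """Return positions where d is a local maximum within +/- h and >= threshold."""
    d = local_diagnostic(y, h)
    n = len(d)
    candidates = []
    for t in range(h, n - h + 1):
        lo, hi = max(0, t - h), min(n, t + h + 1)
        if d[t] >= threshold and d[t] == d[lo:hi].max():
            candidates.append(t)
    return candidates

if __name__ == "__main__":
    # Noise-free toy sequence whose mean jumps at index 20.
    toy = [0.0] * 20 + [5.0] * 20
    print(screen_change_points(toy, h=5, threshold=2.0))
```

Each evaluation uses only a window of `2h` points, which is why this kind of screening is cheap compared with a full scan over all candidate segments.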
|
42
|
Suvorikova A, Spokoiny V. Multiscale Change Point Detection. THEORY OF PROBABILITY AND ITS APPLICATIONS 2017. [DOI: 10.1137/s0040585x97t988411] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
43
|
Sharifi Noghabi H, Mohammadi M, Tan Y. Robust group fused lasso for multisample copy number variation detection under uncertainty. IET Syst Biol 2016; 10:229-236. [DOI: 10.1049/iet-syb.2015.0081] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Affiliation(s)
- Hossein Sharifi Noghabi
- Department of Computer Engineering, Ferdowsi University of Mashhad, Iran
- The Center of Excellence of Soft Computing and Intelligent Information Processing (SCIIP), Ferdowsi University of Mashhad, Iran
| | - Majid Mohammadi
- Department of Technology, Policy and Management, Delft University of Technology, Netherlands
| | - Yao‐Hua Tan
- Department of Technology, Policy and Management, Delft University of Technology, Netherlands
| |
|
44
|
Ma TF, Yau CY. A pairwise likelihood-based approach for changepoint detection in multivariate time series models. Biometrika 2016; 103:409-421. [DOI: 10.1093/biomet/asw002] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
45
|
Hu J, Zhang L, Wang HJ. Sequential model selection-based segmentation to detect DNA copy number variation. Biometrics 2016; 72:815-26. [PMID: 26954760 DOI: 10.1111/biom.12478] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2014] [Revised: 08/01/2015] [Accepted: 09/01/2015] [Indexed: 12/16/2022]
Abstract
Array-based CGH experiments are designed to detect genomic aberrations or regions of DNA copy-number variation that are associated with an outcome, typically a state of disease. Most of the existing statistical methods target the detection of DNA copy number variations in a single sample or array. We focus on the detection of group effect variation, through simultaneous study of multiple samples from multiple groups. Rather than using direct segmentation or smoothing techniques, as commonly seen in existing detection methods, we develop a sequential model selection procedure that is guided by a modified Bayesian information criterion. This approach improves detection accuracy by cumulatively utilizing information across contiguous clones, and has a computational advantage over the existing popular detection methods. Our empirical investigation suggests that the performance of the proposed method is superior to that of the existing detection methods, in particular in detecting small segments or separating neighboring segments with differing degrees of copy-number variation.
Affiliation(s)
- Jianhua Hu
- Department of Biostatistics, UT M. D. Anderson Cancer Center, Houston, Texas 77030, U.S.A.
| | - Liwen Zhang
- School of Economics, Shanghai University, Shanghai 200444, China.
| | - Huixia Judy Wang
- Department of Statistics, George Washington University, Washington D.C. 20052, U.S.A.
| |
|
46
|
Maidstone R, Hocking T, Rigaill G, Fearnhead P. On optimal multiple changepoint algorithms for large data. STATISTICS AND COMPUTING 2016; 27:519-533. [PMID: 32355427 PMCID: PMC7175693 DOI: 10.1007/s11222-016-9636-3] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/06/2015] [Accepted: 02/01/2016] [Indexed: 06/11/2023]
Abstract
Many common approaches to detecting changepoints, for example based on statistical criteria such as penalised likelihood or minimum description length, can be formulated in terms of minimising a cost over segmentations. We focus on a class of dynamic programming algorithms that can solve the resulting minimisation problem exactly, and thus find the optimal segmentation under the given statistical criteria. The standard implementations of these dynamic programming methods have a computational cost that scales at least quadratically in the length of the time-series. Recently, pruning ideas have been suggested that can speed up the dynamic programming algorithms whilst still being guaranteed to be optimal, in that they find the true minimum of the cost function. Here we extend these pruning methods, and introduce two new algorithms for segmenting data: FPOP and SNIP. Empirical results show that FPOP is substantially faster than existing dynamic programming methods, and unlike the existing methods its computational efficiency is robust to the number of changepoints in the data. We evaluate the method for detecting copy number variations and observe that FPOP has a computational cost that is even competitive with that of binary segmentation, but can give much more accurate segmentations.
Affiliation(s)
- Robert Maidstone
- STOR-i Centre for Doctoral Training, Lancaster University, Lancaster, UK
| | - Toby Hocking
- McGill University and Genome Quebec Innovation Center, Quebec, Canada
| | - Guillem Rigaill
- Institute of Plant Sciences Paris-Saclay, UMR 9213/UMR1403, CNRS, INRA, Université Paris-Sud, Université d’Evry, Université Paris-Diderot, Sorbonne Paris-Cité, Paris, France
| | - Paul Fearnhead
- Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
| |
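The Maidstone et al. abstract above describes changepoint detection as minimising a cost over segmentations by dynamic programming, with a standard implementation that scales quadratically. The sketch below illustrates that unpruned baseline only; it is not FPOP or SNIP. It assumes a squared-error segment cost around the segment mean and a constant `penalty` per changepoint, and the function name and arguments are hypothetical.

```python
import numpy as np

def optimal_partitioning(y, penalty):
    """Exact penalised-cost segmentation by O(n^2) dynamic programming.

    Returns the sorted list of changepoint indices t, meaning a new
    segment starts at y[t].
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))       # prefix sums of y
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))  # prefix sums of y^2

    def seg_cost(s, t):
        # Residual sum of squares of y[s:t] around its own mean.
        total = s1[t] - s1[s]
        return (s2[t] - s2[s]) - total * total / (t - s)

    best = np.full(n + 1, np.inf)    # best[t]: minimal cost of segmenting y[:t]
    best[0] = -penalty               # so the first segment carries no penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):           # s is the start of the final segment
            c = best[s] + seg_cost(s, t) + penalty
            if c < best[t]:
                best[t], last[t] = c, s

    # Backtrack through the stored segment starts.
    cps, t = [], n
    while t > 0:
        s = int(last[t])
        if s > 0:
            cps.append(s)
        t = s
    return sorted(cps)

if __name__ == "__main__":
    print(optimal_partitioning([0.0] * 10 + [4.0] * 10, penalty=1.0))
```

The inner loop over every possible start `s` of the final segment is the source of the quadratic cost; pruning methods of the kind the abstract mentions reduce the set of `s` values that need to be examined without changing the minimum found.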
|
47
|
|
48
|
Gao X. Penalized weighted low-rank approximation for robust recovery of recurrent copy number variations. BMC Bioinformatics 2015; 16:407. [PMID: 26652207 PMCID: PMC4676147 DOI: 10.1186/s12859-015-0835-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2015] [Accepted: 11/23/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Copy number variation (CNV) analysis has become one of the most important research areas for understanding complex disease. With increasing resolution of array-based comparative genomic hybridization (aCGH) arrays, more and more raw copy number data are collected for multiple arrays. It is natural to realize the co-existence of both recurrent and individual-specific CNVs, together with the possible data contamination during the data generation process. Therefore, there is a great need for an efficient and robust statistical model for simultaneous recovery of both recurrent and individual-specific CNVs. RESULT: We develop a penalized weighted low-rank approximation method (WPLA) for robust recovery of recurrent CNVs. In particular, we formulate multiple aCGH arrays into a realization of a hidden low-rank matrix with some random noises and let an additional weight matrix account for those individual-specific effects. Thus, we do not restrict the random noise to be normally distributed, or even homogeneous. We show its performance through three real datasets and twelve synthetic datasets from different types of recurrent CNV regions associated with either normal random errors or heavily contaminated errors. CONCLUSION: Our numerical experiments have demonstrated that the WPLA can successfully recover the recurrent CNV patterns from raw data under different scenarios. Compared with two other recent methods, it performs the best regarding its ability to simultaneously detect both recurrent and individual-specific CNVs under normal random errors. More importantly, the WPLA is the only method which can effectively recover the recurrent CNV regions when the data are heavily contaminated.
Affiliation(s)
- Xiaoli Gao
- Department of Mathematics and Statistics, University of North Carolina at Greensboro, 1400 Spring Garden St, Greensboro, NC, USA.
| |
|
49
|
Walter V, Wright FA, Nobel AB. Consistent testing for recurrent genomic aberrations. Biometrika 2015; 102:783-796. [PMID: 30799871 DOI: 10.1093/biomet/asv046] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We consider the detection and identification of recurrent departures from stationary behaviour in genomic or similarly arranged data containing measurements at an ordered set of variables. Our primary focus is on departures that occur only at a single variable, or within a small window of contiguous variables, but involve more than one sample. This encompasses the identification of aberrant markers in genome-wide measurements of DNA copy number and DNA methylation, as well as meta-analyses of genome-wide association studies. We propose and analyse a cyclic shift-based procedure for testing recurrent departures from stationarity. Our analysis establishes the consistency of cyclic shift p-values for datasets with a fixed set of samples as the number of observed variables tends to infinity, under the assumption that each sample is an independent realization of a stationary Markov chain. Our results apply to any test statistic satisfying a simple invariance condition.
Affiliation(s)
- V Walter
- Department of Biochemistry and Molecular Biology, Pennsylvania State University College of Medicine, Milton S. Hershey Medical Center, 500 University Drive, P.O. Box 850, Hershey, Pennsylvania 17033 U.S.A.
| | - F A Wright
- Department of Statistics, North Carolina State University Bioinformatics Research Center, Campus Box 7566, 2601 Stinson Drive, Raleigh, North Carolina 27695 U.S.A.
| | - A B Nobel
- Department of Statistics and Operations Research, CB 3260, University of North Carolina, Chapel Hill, North Carolina, 27599 U.S.A.
| |
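The Walter et al. abstract above names a cyclic shift-based testing procedure but gives no details, so the following is only one plausible reading, offered as a sketch. It assumes a samples-by-variables matrix, a column-sum test statistic, and Monte Carlo sampling of random shifts in place of enumerating them; each of these choices, together with the function name and its parameters, is an assumption made here and not the authors' specification.

```python
import numpy as np

def cyclic_shift_pvalues(x, n_shifts=500, seed=0):
    """Monte Carlo cyclic-shift p-value for each variable (column).

    Every sample (row) is rotated by its own random offset, which keeps
    the within-sample ordering but breaks alignment across samples.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n_samples, n_vars = x.shape
    observed = x.sum(axis=0)             # assumed statistic: column sums
    rows = np.arange(n_samples)[:, None]
    cols = np.arange(n_vars)[None, :]
    exceed = np.zeros(n_vars)
    for _ in range(n_shifts):
        offsets = rng.integers(0, n_vars, size=n_samples)[:, None]
        # shifted[i, j] = x[i, (j - offsets[i]) mod n_vars]
        shifted = x[rows, (cols - offsets) % n_vars]
        exceed += shifted.sum(axis=0) >= observed
    # Add-one correction so that no p-value is exactly zero.
    return (exceed + 1) / (n_shifts + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(10, 200))
    data[:, 50] += 5.0                   # recurrent departure at variable 50
    p = cyclic_shift_pvalues(data)
    print(p[50], p.min())
```

Because rotation preserves each row's internal dependence structure, the shifted matrices mimic data in which nothing is aligned across samples, which is the sense in which a departure shared by several samples at the same variable stands out.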
|
50
|
Zhang Z, Hao K. SAAS-CNV: A Joint Segmentation Approach on Aggregated and Allele Specific Signals for the Identification of Somatic Copy Number Alterations with Next-Generation Sequencing Data. PLoS Comput Biol 2015; 11:e1004618. [PMID: 26583378 DOI: 10.1371/journal.pcbi.1004618] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2015] [Accepted: 10/20/2015] [Indexed: 11/18/2022] Open
Abstract
Cancer genomes exhibit profound somatic copy number alterations (SCNAs). Studying tumor SCNAs using massively parallel sequencing provides unprecedented resolution and meanwhile gives rise to new challenges in data analysis, complicated by tumor aneuploidy and heterogeneity as well as normal cell contamination. While the majority of read depth based methods utilize total sequencing depth alone for SCNA inference, the allele specific signals are undervalued. We proposed a joint segmentation and inference approach using both signals to meet some of the challenges. Our method consists of four major steps: 1) extracting read depth supporting reference and alternative alleles at each SNP/Indel locus and comparing the total read depth and alternative allele proportion between tumor and matched normal sample; 2) performing joint segmentation on the two signal dimensions; 3) correcting the copy number baseline from which the SCNA state is determined; 4) calling SCNA state for each segment based on both signal dimensions. The method is applicable to whole exome/genome sequencing (WES/WGS) as well as SNP array data in a tumor-control study. We applied the method to a dataset containing no SCNAs to test the specificity, created by pairing sequencing replicates of a single HapMap sample as normal/tumor pairs, as well as a large-scale WGS dataset consisting of 88 liver tumors along with adjacent normal tissues. Compared with representative methods, our method demonstrated improved accuracy, scalability to large cancer studies, capability in handling both sequencing and SNP array data, and the potential to improve the estimation of tumor ploidy and purity.
Affiliation(s)
- Zhongyang Zhang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
| | - Ke Hao
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Respiratory Medicine, Shanghai Tenth People's Hospital, Tongji University, Shanghai, China
| |
|