1
Wang H, Lu W, Chang Z. Simultaneous identification of groundwater contamination source and aquifer parameters with a new weighted-average wavelet variable-threshold denoising method. Environmental Science and Pollution Research International 2021; 28:38292-38307. [PMID: 33733419 DOI: 10.1007/s11356-021-12959-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/06/2020] [Accepted: 02/10/2021] [Indexed: 06/12/2023]
Abstract
This paper proposed a parallel heuristic search strategy for the simultaneous identification of groundwater contamination sources and aquifer parameters. Identification results are influenced by many factors, including noise in the contamination concentration data, so data denoising is necessary. Because the existing wavelet threshold denoising method has inherent shortcomings, the paper proposed a new weighted-average wavelet variable-threshold denoising (WWVD) method to improve the denoising of concentration data, which in turn enhanced the subsequent identification accuracy. Frequent calls to the simulation model during likelihood calculation incur a high computational cost. A single surrogate model of the simulation model can reduce this cost, but it has limitations. The paper therefore developed a differential evolution-tabu search (DE-TS) hybrid algorithm to construct an optimal ensemble surrogate (OES) model that combined Gaussian process, kernel extreme learning machine, and support vector regression; the DE-TS algorithm also improved the accuracy with which the surrogate model approximated the simulation model. A parallel heuristic search iterative process was implemented for simultaneous identification, and the identification results were obtained when the iteration terminated. The accuracy and efficiency of the proposed approaches were tested on a hypothetical case. The results showed that the WWVD method improved the denoising of concentration data and enhanced the subsequent identification accuracy, that the OES model built with the DE-TS hybrid algorithm approximated the simulation model more accurately, and that the parallel heuristic search strategy is helpful for simultaneous identification of groundwater contamination source and aquifer parameters.
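The abstract refers to the existing wavelet threshold denoising method without giving formulas. As a rough illustration only, the sketch below shows the two conventional thresholding rules (hard and soft) applied to a list of wavelet detail coefficients. It is not the WWVD method, whose weighting scheme is not described here; the function names, the sample coefficients, and the threshold value are all assumptions made for the example.

```python
def hard_threshold(coeffs, t):
    """Keep coefficients whose magnitude exceeds t; set the rest to zero."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Zero small coefficients and shrink the remaining ones toward zero by t."""
    out = []
    for c in coeffs:
        mag = abs(c) - t
        if mag <= 0:
            out.append(0.0)
        else:
            out.append(mag if c > 0 else -mag)
    return out


# Hypothetical detail coefficients: small values stand in for noise,
# large values for genuine signal structure.
detail = [0.05, -0.1, 2.0, -1.5, 0.08]
hard = hard_threshold(detail, 0.5)
soft = soft_threshold(detail, 0.5)
```

Hard thresholding leaves the surviving coefficients untouched but is discontinuous at the threshold, while soft thresholding is continuous but biases large coefficients downward. Trade-offs of this kind are the usual motivation for modified threshold functions.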
Affiliation(s)
- Han Wang
  - Key Laboratory of Groundwater Resources and Environment, Ministry of Education, Jilin Univ., Changchun, 130021, China
  - Jilin Provincial Key Laboratory of Water Resources and Environment, Jilin Univ., Changchun, 130021, China
  - College of New Energy and Environment, Jilin Univ., Changchun, 130021, China
- Wenxi Lu
  - Key Laboratory of Groundwater Resources and Environment, Ministry of Education, Jilin Univ., Changchun, 130021, China.
  - Jilin Provincial Key Laboratory of Water Resources and Environment, Jilin Univ., Changchun, 130021, China.
  - College of New Energy and Environment, Jilin Univ., Changchun, 130021, China.
- Zhenbo Chang
  - Key Laboratory of Groundwater Resources and Environment, Ministry of Education, Jilin Univ., Changchun, 130021, China
  - Jilin Provincial Key Laboratory of Water Resources and Environment, Jilin Univ., Changchun, 130021, China
  - College of New Energy and Environment, Jilin Univ., Changchun, 130021, China
2
Yoo HB, Mohan A, De Ridder D, Vanneste S. Paradoxical relationship between distress and functional network topology in phantom sound perception. Progress in Brain Research 2020; 260:367-395. [PMID: 33637228 DOI: 10.1016/bs.pbr.2020.08.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/26/2022]
Abstract
Distress is a domain-general symptom that accompanies several disorders, including tinnitus. Previous studies have shown that distress is encoded by changes in functional connectivity between cortical and subcortical regions. However, how distress relates to large-scale brain networks is not yet clear. In the current study, we investigate the relationship between distress and the efficiency of a network by examining its topological properties using resting state fMRI collected from 90 chronic tinnitus patients. The present results indicate that distress negatively correlates with path length and positively correlates with clustering coefficient, small-worldness, and efficiency of information transfer. Specifically, path analysis showed that the relationship between distress and efficiency is significantly mediated by the resilience of the feeder connections and the centrality of the rich-club connections. In other words, the higher the network efficiency, the lower the resilience of the feeder connections and the centrality of the rich-club connections, which in turn is reflected in higher distress in tinnitus patients. This indicates a reorganization of the network towards a paradoxically more efficient topology in patients with high distress, potentially explaining their increased rumination on the tinnitus percept itself.
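The topological measures named in the abstract (path length, clustering coefficient, efficiency) have standard graph-theoretic definitions. The sketch below computes them for a small unweighted, undirected toy graph. It assumes a connected graph stored as a dictionary of neighbour sets; the toy graph and function names are illustrative and unrelated to the study's fMRI networks or processing pipeline.

```python
from collections import deque
from itertools import combinations


def shortest_paths(adj, src):
    """Breadth-first search distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_and_efficiency(adj):
    """Mean shortest-path length and mean inverse distance over node pairs."""
    lengths = []
    for src in adj:
        dist = shortest_paths(adj, src)
        lengths.extend(d for node, d in dist.items() if node != src)
    mean_length = sum(lengths) / len(lengths)
    efficiency = sum(1.0 / d for d in lengths) / len(lengths)
    return mean_length, efficiency


def clustering_coefficient(adj):
    """Average fraction of each node's neighbour pairs that are connected."""
    values = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            values.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        values.append(2.0 * links / (k * (k - 1)))
    return sum(values) / len(values)


# Toy graph: a triangle (0, 1, 2) with one extra node attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
mean_length, efficiency = path_length_and_efficiency(toy)
clustering = clustering_coefficient(toy)
```

A shorter mean path length and a higher mean inverse distance both indicate more efficient information transfer, which is why the abstract reports correlations with distress in opposite directions for these two quantities.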
Affiliation(s)
- Hye Bin Yoo
  - Department of Neurological Surgery, University of Texas Southwestern, United States
- Anusha Mohan
  - Lab for Clinical and Integrative Neuroscience, Global Brain Health Institute, Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland
- Dirk De Ridder
  - Department of Surgical Sciences, Section of Neurosurgery, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
- Sven Vanneste
  - Lab for Clinical and Integrative Neuroscience, Global Brain Health Institute, Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland; Lab for Clinical and Integrative Neuroscience, School of Behavioral and Brain Sciences, University of Texas at Dallas, Richardson, TX, United States.
3
The interaction and mechanism of monoterpenes with tyramine receptor (SoTyrR) of rice weevil (Sitophilus oryzae). SN Applied Sciences 2020. [DOI: 10.1007/s42452-020-03395-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/28/2022] Open
4
Zare F, Ansari S, Najarian K, Nabavi S. Preprocessing Sequence Coverage Data for More Precise Detection of Copy Number Variations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:868-876. [PMID: 30222580 PMCID: PMC7278033 DOI: 10.1109/tcbb.2018.2869738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/10/2023]
Abstract
Copy number variation (CNV) is a type of genomic/genetic variation that plays an important role in phenotypic diversity, evolution, and disease susceptibility. Next generation sequencing (NGS) technologies have created an opportunity for more accurate detection of CNVs with higher resolution. However, efficient and precise detection of CNVs remains challenging due to high levels of noise and biases, data heterogeneity, and the "big data" nature of NGS data. Sequence coverage (readcount) data are mostly used for detecting CNVs, especially for whole exome sequencing data. Readcount data are contaminated with several types of biases and noise that hinder accurate detection of CNVs. In this work, we introduce a novel preprocessing pipeline for reducing noise and biases to improve the detection accuracy of CNVs in heterogeneous NGS data, such as cancer whole exome sequencing data. We have employed several normalization methods to reduce readcount biases that are due to GC content of reads, read alignment problems, and sample impurity. We have also developed a novel, efficient, and effective smoothing approach based on Taut String to reduce noise and increase CNV detection power. Using simulated and real data, we showed that employing the proposed preprocessing pipeline significantly improves the accuracy of CNV detection.
5
Yuan X, Gao M, Bai J, Duan J. SVSR: A Program to Simulate Structural Variations and Generate Sequencing Reads for Multiple Platforms. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1082-1091. [PMID: 30334804 DOI: 10.1109/tcbb.2018.2876527] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/08/2023]
Abstract
Structural variation accounts for a major fraction of mutations in the human genome and confers susceptibility to complex diseases. Next generation sequencing along with the rapid development of computational methods provides a cost-effective procedure to detect such variations. Simulation of structural variations and sequencing reads with real characteristics is essential for benchmarking the computational methods. Here, we develop a new program, SVSR, to simulate five types of structural variations (indels, tandem duplication, CNVs, inversions, and translocations) and SNPs for the human genome and to generate sequencing reads with features from popular platforms (Illumina, SOLiD, 454, and Ion Torrent). We adopt a selection model trained from real data to predict copy number states, starting from the first site of a particular genome to the end. Furthermore, we utilize references of microbial genomes to produce insertion fragments and design probabilistic models to imitate inversions and translocations. Moreover, we create platform-specific errors and base quality profiles to generate normal, tumor, or normal-tumor mixture reads. Experimental results show that SVSR could capture more features that are realistic and generate datasets with satisfactory quality scores. SVSR is able to evaluate the performance of structural variation detection methods and guide the development of new computational methods.
6
Agarwal D, Wang J, Zhang NR. Data Denoising and Post-Denoising Corrections in Single Cell RNA Sequencing. Stat Sci 2020. [DOI: 10.1214/19-sts7560] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
7
Kachouie NN, Shutaywi M, Christiani DC. Discriminant Analysis of Lung Cancer Using Nonlinear Clustering of Copy Numbers. Cancer Invest 2020; 38:102-112. [PMID: 31977287 PMCID: PMC10283398 DOI: 10.1080/07357907.2020.1719501] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/17/2019] [Accepted: 01/18/2020] [Indexed: 01/14/2023]
Abstract
Background: Patient survival is not optimal for non-small cell lung cancer (NSCLC) patients and the recurrence rate is high; hence, early detection is crucial to increase patient survival. Gene-cancer mapping aims to discover genes associated with cancers, and due to advances in high-throughput genotyping, screening for disease loci on a genome-wide scale is now possible. DNA copy numbers can potentially be used to distinguish cancer from normal cells in early detection of cancer. Methods: We use a nonlinear clustering method, kernel K-means, to separate cancer from normal samples. Kernel K-means is applied to the copy numbers obtained for each chromosome to cluster 63 paired cancer-blood samples (126 samples in total) into two groups. Clustering performance is evaluated using true and false-positive rates, true and false-negative rates, and a nonlinear criterion, normalized mutual information (NMI). Results: Copy numbers of paired cancer-blood samples for 63 NSCLC patients are used in this study. Kernel K-means was applied to cluster 126 samples into two groups using copy numbers on each chromosome separately. The clustering results for 22 chromosomes are evaluated and their discriminant power in identifying cancer is computed. We identified the top five and bottom five chromosomes based on their discriminant power. Conclusions: The results reveal high discriminant power of chromosomes 8, 5, 1, 3, and 19 for identifying cancer, with the highest sensitivity of 75% yielded by chromosome 5. The bottom five chromosomes, 9, 6, 4, 13, and 21, show low discriminant power with accuracy below 54%, where true cancer and normal samples are grouped into substantially overlapping groups using copy numbers. This indicates the similarity of copy numbers obtained for cancer and normal samples on these chromosomes.
Affiliation(s)
- Meshal Shutaywi
  - Department of Mathematical Sciences, Florida Institute of Technology
- David C. Christiani
  - Department of Environmental Health, Harvard School of Public Health
  - Department of Epidemiology, Harvard School of Public Health
8
Cheng D, He Z, Schwartzman A. Multiple testing of local extrema for detection of change points. Electron J Stat 2020. [DOI: 10.1214/20-ejs1751] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
9
Png G, Suveges D, Park YC, Walter K, Kundu K, Ntalla I, Tsafantakis E, Karaleftheri M, Dedoussis G, Zeggini E, Gilly A. Population-wide copy number variation calling using variant call format files from 6,898 individuals. Genet Epidemiol 2019; 44:79-89. [PMID: 31520489 PMCID: PMC8653900 DOI: 10.1002/gepi.22260] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/10/2019] [Revised: 07/31/2019] [Accepted: 08/28/2019] [Indexed: 11/10/2022]
Abstract
Copy number variants (CNVs) play an important role in a number of human diseases, but the accurate calling of CNVs remains challenging. Most current approaches to CNV detection use raw read alignments, which are computationally intensive to process. We use a regression tree-based approach to call germline CNVs from whole-genome sequencing (WGS, >18x) variant call sets in 6,898 samples across four European cohorts, and describe a rich large variation landscape comprising 1,320 CNVs. Eighty-one percent of detected events have been previously reported in the Database of Genomic Variants. Twenty-three percent of high-quality deletions affect entire genes, and we recapitulate known events such as the GSTM1 and RHD gene deletions. We test for association between the detected deletions and 275 protein levels in 1,457 individuals to assess the potential clinical impact of the detected CNVs. We describe complex CNV patterns underlying an association with levels of the CCL3 protein (MAF = 0.15, p = 3.6 × 10⁻¹²) at the CCL3L3 locus, and a novel cis-association between a low-frequency NOMO1 deletion and NOMO1 protein levels (MAF = 0.02, p = 2.2 × 10⁻⁷). This study demonstrates that existing population-wide WGS call sets can be mined for germline CNVs with minimal computational overhead, delivering insight into a less well-studied, yet potentially impactful class of genetic variant.
Affiliation(s)
- Grace Png
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Department of Medical Genetics, University of Cambridge, Cambridge, United Kingdom
  - Institute of Translational Genomics, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, Germany
- Daniel Suveges
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Young-Chan Park
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Department of Medical Genetics, University of Cambridge, Cambridge, United Kingdom
- Klaudia Walter
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Kousik Kundu
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Ioanna Ntalla
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- George Dedoussis
  - Department of Nutrition and Dietetics, School of Health Science and Education, Harokopio University of Athens, Athens, Greece
- Eleftheria Zeggini
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Institute of Translational Genomics, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, Germany
- Arthur Gilly
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Institute of Translational Genomics, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, Germany
  - Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
10
Zare F, Hosny A, Nabavi S. Noise cancellation using total variation for copy number variation detection. BMC Bioinformatics 2018; 19:361. [PMID: 30343665 PMCID: PMC6196408 DOI: 10.1186/s12859-018-2332-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Due to recent advances in sequencing technologies, sequence-based analysis has been widely applied to detecting copy number variations (CNVs). There are several techniques for identifying CNVs using next generation sequencing (NGS) data; however, methods employing depth of coverage or read depth (RD) have recently become a main technique to identify CNVs. The main assumption of the RD-based CNV detection methods is that the readcount value at a specific genomic location is correlated with the copy number at that location. However, noise and biases in readcount data distort the association between the readcounts and copy numbers. For more accurate CNV identification, these biases and noise need to be mitigated. In this work, to detect CNVs more precisely and efficiently, we propose a novel denoising method based on the total variation approach and the Taut String algorithm. RESULTS To investigate the performance of the proposed denoising method, we computed sensitivities, false discovery rates and specificities of CNV detection when employing denoising, using both simulated and real data. We also compared the performance of the proposed denoising method, Taut String, with that of commonly used approaches such as moving average (MA) and discrete wavelet transforms (DWT) in terms of sensitivity of detecting true CNVs and time complexity. The results show that Taut String works better than DWT and MA and has better power to identify very narrow CNVs. The ability of Taut String denoising to preserve the breakpoints of CNV segments and narrow CNVs increases the detection accuracy of segmentation algorithms, resulting in higher sensitivities and lower false discovery rates. CONCLUSIONS In this study, we proposed a new denoising method for sequence-based CNV detection based on a signal processing technique. Existing CNV detection algorithms identify many false CNV segments and fail to detect short CNV segments due to noise and biases. Employing an effective and efficient denoising method can significantly enhance the detection accuracy of the CNV segmentation algorithms. Advanced denoising methods from the signal processing field can be employed to implement such algorithms. We showed that non-linear denoising methods that consider the sparsity and piecewise constant characteristics of CNV data result in better performance in CNV detection.
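The abstract's point that linear smoothers handle narrow CNVs poorly can be seen in a small experiment. The sketch below applies a centered moving average, one of the baseline methods named above, to a noise-free piecewise-constant signal containing a two-sample segment. It does not implement Taut String or total variation denoising; the signal, the window size, and the function name are assumptions chosen for illustration.

```python
def moving_average(signal, window):
    """Centered moving average; near the edges only the part of the
    window that fits inside the signal is averaged."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


# Hypothetical read-depth profile: baseline 0 with a narrow gain of
# height 1 that spans only two positions.
signal = [0.0] * 10 + [1.0] * 2 + [0.0] * 10
smoothed = moving_average(signal, 5)

peak_height = max(smoothed)                     # attenuated amplitude
spread = sum(1 for v in smoothed if v > 0.0)    # positions touched by the gain
```

With a window of five, the two-sample segment is reduced to 40% of its height and smeared over six positions, so both its amplitude and its breakpoints are degraded. This is the behaviour that piecewise-constant-aware denoisers aim to avoid.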
Affiliation(s)
- Fatima Zare
  - Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA.
- Abdelrahman Hosny
  - Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA
- Sheida Nabavi
  - Computer Science and Engineering Department and Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
11
Nguyen N, Vo A, Sun H, Huang H. Heavy-Tailed Noise Suppression and Derivative Wavelet Scalogram for Detecting DNA Copy Number Aberrations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1625-1635. [PMID: 28692986 DOI: 10.1109/tcbb.2017.2723884] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/07/2023]
Abstract
Most existing array comparative genomic hybridization (array CGH) data processing methods and evaluation models assume that the probability density function (pdf) of noise in array CGH data is Gaussian. In practice, however, the noise distribution is peaky and heavy-tailed. A Gaussian pdf is therefore not adequate to approximate the noise in array CGH data; it introduces incorrect detections of chromosomal aberrations and leads to misunderstanding of disease pathogenesis. A more accurate model of noise in array CGH data is necessary and beneficial to the detection of DNA copy number variations. We analyze real array CGH data from different platforms and show that the distribution of noise in array CGH data is fitted very well by the generalized Gaussian distribution (GGD). Based on our new noise model, we propose a novel array CGH processing method combining the advantages of both the smoothing and segmentation approaches. The new method uses a generalized Gaussian bivariate shrinkage function and a one-directional derivative wavelet scalogram in generalized Gaussian noise. In the smoothing step, with the new generalized Gaussian noise model, we derive the heavy-tailed noise suppression algorithm in the stationary wavelet domain. In the segmentation step, the 1D Gaussian derivative wavelet scalogram is employed to detect break points. Both real and simulated array CGH data with different noises (such as Gaussian noise, GGD noise, and real noise) are used in our experiments. We demonstrate that our new method outperforms other state-of-the-art methods in terms of both root mean squared errors and receiver operating characteristic curves.
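To make the "peaky and heavy-tailed" description concrete, the sketch below evaluates the generalized Gaussian density in its common parametrization, f(x) = β / (2αΓ(1/β)) · exp(-(|x - μ| / α)^β), where β = 2 recovers the Gaussian and β = 1 the Laplace distribution. The abstract does not state which parametrization the paper uses, so this form, the function names, and the chosen shape values are assumptions.

```python
import math


def ggd_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """Generalized Gaussian density with location mu, scale alpha, shape beta."""
    coef = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coef * math.exp(-((abs(x - mu) / alpha) ** beta))


def unit_variance_alpha(beta):
    """Scale alpha that gives the generalized Gaussian a variance of one,
    using Var = alpha^2 * Gamma(3/beta) / Gamma(1/beta)."""
    return math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Ordinary Gaussian density, for comparison."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Compare a Gaussian (beta = 2) with a Laplace-like shape (beta = 1),
# both scaled to unit variance.
gauss_alpha = unit_variance_alpha(2.0)
heavy_alpha = unit_variance_alpha(1.0)

peak_gauss = ggd_pdf(0.0, alpha=gauss_alpha, beta=2.0)
peak_heavy = ggd_pdf(0.0, alpha=heavy_alpha, beta=1.0)
tail_gauss = ggd_pdf(4.0, alpha=gauss_alpha, beta=2.0)
tail_heavy = ggd_pdf(4.0, alpha=heavy_alpha, beta=1.0)
```

At equal variance, the β = 1 density is both higher at the centre and larger four standard deviations out, which is the qualitative pattern a Gaussian model cannot reproduce.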
12
Chari R, Lockwood WW, Lam WL. Computational Methods for the Analysis of Array Comparative Genomic Hybridization. Cancer Inform 2017. [DOI: 10.1177/117693510600200007] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 12/23/2022] Open
Abstract
Array comparative genomic hybridization (array CGH) is a technique for assaying the copy number status of cancer genomes. The widespread use of this technology has led to a rapid accumulation of high-throughput data, which in turn has prompted the development of computational strategies for the analysis of array CGH data. Here we explain the principles behind array image processing, data visualization and genomic profile analysis, review currently available software packages, and raise considerations for future software development.
Affiliation(s)
- Raj Chari
  - Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, Vancouver BC, Canada V5Z 1L3
  - These authors contributed equally to this work
- William W. Lockwood
  - Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, Vancouver BC, Canada V5Z 1L3
  - These authors contributed equally to this work
- Wan L. Lam
  - Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, Vancouver BC, Canada V5Z 1L3
13
Fasola S, Muggeo VMR, Küchenhoff H. A heuristic, iterative algorithm for change-point detection in abrupt change models. Comput Stat 2017. [DOI: 10.1007/s00180-017-0740-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/19/2022]
14
Zhang L, Baladandayuthapani V, Zhu H, Baggerly KA, Majewski T, Czerniak BA, Morris JS. Functional CAR models for large spatially correlated functional datasets. J Am Stat Assoc 2016; 111:772-786. [PMID: 28018013 DOI: 10.1080/01621459.2015.1042581] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Indexed: 10/23/2022]
Abstract
We develop a functional conditional autoregressive (CAR) model for spatially correlated data for which functions are collected on areal units of a lattice. Our model performs functional response regression while accounting for spatial correlations with potentially nonseparable and nonstationary covariance structure, in both the space and functional domains. We show theoretically that our construction leads to a CAR model at each functional location, with spatial covariance parameters varying and borrowing strength across the functional domain. Using basis transformation strategies, the nonseparable spatial-functional model is computationally scalable to enormous functional datasets, generalizable to different basis functions, and can be used on functions defined on higher dimensional domains such as images. Through simulation studies, we demonstrate that accounting for the spatial correlation in our modeling leads to improved functional regression performance. Applied to a high-throughput spatially correlated copy number dataset, the model identifies genetic markers not identified by comparable methods that ignore spatial correlations.
Affiliation(s)
- Lin Zhang
  - The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
- Keith A Baggerly
  - The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
- Tadeusz Majewski
  - The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
- Bogdan A Czerniak
  - The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
- Jeffrey S Morris
  - The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
15
Kachouie NN, Lin X, Christiani DC, Schwartzman A. Detection of Local DNA Copy Number Changes in Lung Cancer Population Analyses Using A Multi-Scale Approach. Communications in Statistics. Case Studies, Data Analysis and Applications 2016; 1:206-216. [PMID: 31489360 PMCID: PMC6727850 DOI: 10.1080/23737484.2016.1197079] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Emerging advances in genomic sequencing have prompted the development of new computational methods for studying the genomic sources of human diseases. This paper presents a recent statistical approach for detecting local regions with significant copy number alterations (CNAs) in a lung cancer population. Mapping such regions is of interest as they are potentially associated with lung cancer. Conventional application of multiple testing methods corresponds to testing for CNAs at each probe separately and thresholding the t-statistics as test statistics. Due to the large number of probes, this approach often fails to detect CNA regions. In contrast, the proposed method uses the heights of located peaks and improves the detection power. This is achieved by taking advantage of the spatial structure in the data as well as reducing the number of tests in the multiple comparisons problem. In copy number analysis, it is common to apply segmentation or change detection tools to each individual genomic sample. However, since segmentation results vary among subjects, it becomes difficult to find the common genomic regions in population analyses. Our approach solves this problem by performing the analysis on summary statistics, working directly at the population level. Hence, the region detection is performed on the summary t-statistic map. The proposed method is applied to lung cancer data and shows promise for detection of local regions with significant CNAs.
Affiliation(s)
- Xihong Lin
  - Department of Statistics, Harvard School of Public Health
- David C Christiani
  - Department of Environmental Health, Harvard School of Public Health
  - Department of Epidemiology, Harvard School of Public Health
16
Zhang L, Yuan Y, Lu KH, Zhang L. Identification of recurrent focal copy number variations and their putative targeted driver genes in ovarian cancer. BMC Bioinformatics 2016; 17:222. [PMID: 27230211 PMCID: PMC4881176 DOI: 10.1186/s12859-016-1085-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 12/04/2015] [Accepted: 05/14/2016] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Genomic regions with recurrent DNA copy number variations (CNVs) are generally believed to encode oncogenes and tumor suppressor genes (TSGs) that drive cancer growth. However, it remains a challenge to delineate the key cancer driver genes from the regions encoding a large number of genes. RESULTS In this study, we developed a new approach to CNV analysis based on spectral decomposition of CNV profiles into focal CNVs and broad CNVs. We performed an analysis of CNV data of 587 serous ovarian cancer samples on multiple platforms. We identified a number of novel focal regions, such as focal gain of ESR1, focal loss of LSAMP, prognostic site at 3q26.2 and losses of sub-telomere regions in multiple chromosomes. Furthermore, we performed network modularity analysis to examine the relationships among genes encoded in the focal CNV regions. Our results also showed that the recurrent focal gains were significantly associated with the known oncogenes and recurrent losses associated with TSGs and the CNVs had a greater effect on the mRNA expression of the driver genes than that of the non-driver genes. CONCLUSIONS Our results demonstrate that spectral decomposition of CNV profiles offers a new way of understanding the role of CNVs in cancer.
Affiliation(s)
- Liangcai Zhang
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler St, Unit 1410, Houston, TX, 77401, USA
  - Department of Statistics, Rice University, Houston, TX, USA
  - Department of Biophysics, College of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin, China
- Ying Yuan
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler St, Unit 1410, Houston, TX, 77401, USA
  - Department of Statistics, Rice University, Houston, TX, USA
- Karen H Lu
  - Department of Gynecologic Oncology and Reproductive Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Li Zhang
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler St, Unit 1410, Houston, TX, 77401, USA.
17
Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression. PLoS Comput Biol 2016; 12:e1004871. [PMID: 27177143 PMCID: PMC4866742 DOI: 10.1371/journal.pcbi.1004871] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 11/26/2015] [Accepted: 03/14/2016] [Indexed: 11/22/2022] Open
Abstract
By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.
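The compression idea can be illustrated with a minimal sketch of the Haar wavelet transform. This is not the HaMMLET implementation; the function names and the toy log-ratio profile are assumptions made for illustration. It shows that a piecewise-constant copy number profile yields very few non-zero detail coefficients, which is why consecutive probes can be grouped into blocks.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_transform(signal):
    """Full Haar decomposition; len(signal) must be a power of two.

    Returns the final approximation coefficient list and the detail
    coefficients per level, finest level first.
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Hypothetical profile: four probes at baseline, then four probes with a gain.
profile = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
approx, details = haar_transform(profile)
nonzero = sum(1 for level in details for d in level if abs(d) > 1e-12)
print("non-zero detail coefficients:", nonzero)
```

Only the coefficient that straddles the breakpoint is non-zero here; a noisy profile would instead produce many small coefficients that a threshold can suppress.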
|
18
|
Stamoulis C, Betensky RA. Optimization of Signal Decomposition Matched Filtering (SDMF) for Improved Detection of Copy-Number Variations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:584-591. [PMID: 27295643 PMCID: PMC4905595 DOI: 10.1109/tcbb.2015.2448077] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
We aim to improve the performance of the previously proposed signal decomposition matched filtering (SDMF) method [26] for the detection of copy-number variations (CNV) in the human genome. Through simulations, we show that the modified SDMF is robust even at high noise levels and outperforms the original SDMF method, which indirectly depends on CNV frequency. Simulations are also used to develop a systematic approach for selecting relevant parameter thresholds in order to optimize sensitivity, specificity and computational efficiency. We apply the modified method to array CGH data from normal samples in The Cancer Genome Atlas (TCGA) and compare detected CNVs to those estimated using circular binary segmentation (CBS) [19], a hidden Markov model (HMM)-based approach [11] and a subset of CNVs in the Database of Genomic Variants. We show that a substantial number of previously identified CNVs are detected by the optimized SDMF, which also outperforms the other two methods.
Affiliation(s)
- Catherine Stamoulis
- Department of Radiology, Harvard Medical School and Boston Children’s Hospital, Boston, MA 02115
| | - Rebecca A. Betensky
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115
|
19
|
Gao X. Penalized weighted low-rank approximation for robust recovery of recurrent copy number variations. BMC Bioinformatics 2015; 16:407. [PMID: 26652207 PMCID: PMC4676147 DOI: 10.1186/s12859-015-0835-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2015] [Accepted: 11/23/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Copy number variation (CNV) analysis has become one of the most important research areas for understanding complex disease. With increasing resolution of array-based comparative genomic hybridization (aCGH) arrays, more and more raw copy number data are collected for multiple arrays. Recurrent and individual-specific CNVs naturally co-exist in such data, together with possible contamination introduced during the data generation process. Therefore, there is a great need for an efficient and robust statistical model for simultaneous recovery of both recurrent and individual-specific CNVs. RESULT We develop a penalized weighted low-rank approximation method (WPLA) for robust recovery of recurrent CNVs. In particular, we formulate multiple aCGH arrays as a realization of a hidden low-rank matrix with random noise and let an additional weight matrix account for the individual-specific effects. Thus, we do not restrict the random noise to be normally distributed, or even homogeneous. We show its performance through three real datasets and twelve synthetic datasets from different types of recurrent CNV regions associated with either normal random errors or heavily contaminated errors. CONCLUSION Our numerical experiments have demonstrated that the WPLA can successfully recover the recurrent CNV patterns from raw data under different scenarios. Compared with two other recent methods, it performs the best regarding its ability to simultaneously detect both recurrent and individual-specific CNVs under normal random errors. More importantly, the WPLA is the only method of the three which can effectively recover the recurrent CNV regions when the data are heavily contaminated.
Affiliation(s)
- Xiaoli Gao
- Department of Mathematics and Statistics, University of North Carolina at Greensboro, 1400 Spring Garden St, Greensboro, NC, USA.
|
20
|
Lee W, Morris JS. Identification of differentially methylated loci using wavelet-based functional mixed models. Bioinformatics 2015; 32:664-72. [PMID: 26559505 DOI: 10.1093/bioinformatics/btv659] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2014] [Accepted: 11/05/2015] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION DNA methylation is a key epigenetic modification that can modulate gene expression. Over the past decade, many studies have focused on profiling DNA methylation and investigating its alterations in complex diseases such as cancer. While early studies were mostly restricted to CpG islands or promoter regions, recent findings indicate that many important DNA methylation changes can occur in other regions and DNA methylation needs to be examined on a genome-wide scale. In this article, we apply the wavelet-based functional mixed model methodology to analyze high-throughput methylation data for identifying differentially methylated loci across the genome. Contrary to many commonly-used methods that model probes independently, this framework accommodates spatial correlations across the genome through basis function modeling as well as correlations between samples through functional random effects, which allows it to be applied to many different settings and potentially leads to more power in detection of differential methylation. RESULTS We applied this framework to three different high-dimensional methylation data sets (CpG Shore data, THREE data and NIH Roadmap Epigenomics data), studied previously in other works. A simulation study based on CpG Shore data suggested that in terms of detection of differentially methylated loci, this modeling approach using wavelets outperforms analogous approaches modeling the loci as independent. For the THREE data, the method suggests newly detected regions of differential methylation, which were not reported in the original study. AVAILABILITY AND IMPLEMENTATION Automated software called WFMM is available at https://biostatistics.mdanderson.org/SoftwareDownload. CpG Shore data are available at http://rafalab.dfci.harvard.edu. NIH Roadmap Epigenomics data are available at http://compbio.mit.edu/roadmap. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
CONTACT jefmorris@mdanderson.org.
Affiliation(s)
- Wonyul Lee
- Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, USA
| | - Jeffrey S Morris
- Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, USA
|
21
|
Arsuaga J, Borrman T, Cavalcante R, Gonzalez G, Park C. Identification of Copy Number Aberrations in Breast Cancer Subtypes Using Persistence Topology. MICROARRAYS (BASEL, SWITZERLAND) 2015; 4:339-69. [PMID: 27600228 PMCID: PMC4996377 DOI: 10.3390/microarrays4030339] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/09/2015] [Accepted: 08/03/2015] [Indexed: 01/01/2023]
Abstract
DNA copy number aberrations (CNAs) are of biological and medical interest because they help identify regulatory mechanisms underlying tumor initiation and evolution. Identification of tumor-driving CNAs (driver CNAs) however remains a challenging task, because they are frequently hidden by CNAs that are the product of random events that take place during tumor evolution. Experimental detection of CNAs is commonly accomplished through array comparative genomic hybridization (aCGH) assays followed by supervised and/or unsupervised statistical methods that combine the segmented profiles of all patients to identify driver CNAs. Here, we extend a previously-presented supervised algorithm for the identification of CNAs that is based on a topological representation of the data. Our method associates a two-dimensional (2D) point cloud with each aCGH profile and generates a sequence of simplicial complexes, mathematical objects that generalize the concept of a graph. This representation of the data permits segmenting the data at different resolutions and identifying CNAs by interrogating the topological properties of these simplicial complexes. We tested our approach on a published dataset with the goal of identifying specific breast cancer CNAs associated with specific molecular subtypes. Identification of CNAs associated with each subtype was performed by analyzing each subtype separately from the others and by taking the rest of the subtypes as the control. Our results found a new amplification in 11q at the location of the progesterone receptor in the Luminal A subtype. Aberrations in the Luminal B subtype were found only upon removal of the basal-like subtype from the control set. Under those conditions, all regions found in the original publication, except for 17q, were confirmed; all aberrations, except those in chromosome arms 8q and 12q were confirmed in the basal-like subtype. 
These two chromosome arms, however, were detected only upon removal of three patients with exceedingly large copy number values. More importantly, we detected 10 and 21 additional regions in the Luminal B and basal-like subtypes, respectively. Most of the additional regions were validated on an independent dataset, using GISTIC, or both. Furthermore, we found three new CNAs in the basal-like subtype: a combination of gains and losses in 1p, a gain in 2p and a loss in 14q. Based on these results, we suggest that topological approaches that incorporate multiresolution analyses and that interrogate topological properties of the data can help in the identification of copy number changes in cancer.
Affiliation(s)
- Javier Arsuaga
- Department of Mathematics, University of California Davis, 1 Shields Avenue, Davis, CA 95616, USA.
- Department of Molecular and Cellular Biology, University of California Davis, 1 Shields Avenue, Davis, CA 95616, USA.
| | - Tyler Borrman
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA.
| | - Raymond Cavalcante
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
| | - Georgina Gonzalez
- Department of Mathematics, San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 96132, USA.
| | - Catherine Park
- Helen Diller Comprehensive Cancer Center, University of California San Francisco, 1600 Divisadero Street, San Francisco, CA 94143, USA.
|
22
|
Liu Y, Li A, Feng H, Wang M. TAFFYS: An Integrated Tool for Comprehensive Analysis of Genomic Aberrations in Tumor Samples. PLoS One 2015; 10:e0129835. [PMID: 26111017 PMCID: PMC4482394 DOI: 10.1371/journal.pone.0129835] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2014] [Accepted: 05/13/2015] [Indexed: 01/13/2023] Open
Abstract
Background Tumor single nucleotide polymorphism (SNP) array is a common platform for investigating cancer genomic aberrations and functionally important altered genes. Original SNP array signals are usually corrupted by noise, and need to be de-convoluted into an absolute copy number profile by analytical methods. Unfortunately, in contrast with the popularity of the tumor Affymetrix SNP array, methods specifically designed for this platform are still limited. The complicated characteristics of noise in the signals are one of the difficulties in dissecting tumor Affymetrix SNP array data, as they inevitably blur the distinction between aberrations and create an obstacle for copy number aberration (CNA) identification. Results We propose a tool named TAFFYS for comprehensive analysis of tumor Affymetrix SNP array data. TAFFYS introduces a wavelet-based de-noising approach and a copy number-specific signal variance model for suppressing and modelling the noise in signals. Then a hidden Markov model is employed for copy number inference. Finally, by using the absolute copy number profile, the statistical significance of each aberration region is calculated in terms of different aberration types, including amplification, deletion and loss of heterozygosity (LOH). The results show that the copy number-specific variance model and wavelet de-noising algorithm fit well with the Affymetrix SNP array signals, leading to more accurate estimation for diluted tumor samples (even with only 30% of cancer cells) than other existing methods. Results of examinations also demonstrate good compatibility and extensibility for different Affymetrix SNP array platforms. Application to 35 breast tumor samples shows that TAFFYS can automatically dissect the tumor samples and reveal statistically significant aberration regions where cancer-related genes are located.
Conclusions TAFFYS provides an efficient and convenient tool for identifying copy number alterations and allelic imbalance and assessing recurrent aberrations in tumor Affymetrix SNP array data.
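Wavelet-based de-noising of an array signal can be sketched as follows. This is a generic Haar soft-thresholding example, not the TAFFYS algorithm: the abstract does not state which wavelet or threshold rule is used, so the Haar basis, the fixed threshold, and all function names are assumptions made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)


def soft(value, threshold):
    """Soft-thresholding: shrink a coefficient toward zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def haar_denoise(signal, threshold):
    """Decompose with Haar wavelets, shrink detail coefficients, reconstruct.

    len(signal) must be a power of two.
    """
    approx = list(signal)
    details = []
    # Forward transform: store thresholded details, keep the running average.
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([soft((a - b) / SQRT2, threshold) for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    # Inverse transform, coarsest level first.
    for level in reversed(details):
        rebuilt = []
        for a, d in zip(approx, level):
            rebuilt.append((a + d) / SQRT2)
            rebuilt.append((a - d) / SQRT2)
        approx = rebuilt
    return approx


# Hypothetical noisy signal: a baseline segment followed by a gained segment.
noisy = [0.1, -0.1, 0.1, -0.1, 1.1, 0.9, 1.1, 0.9]
print(haar_denoise(noisy, threshold=0.2))
```

Soft-thresholding removes the small probe-to-probe fluctuations but also shrinks the genuine step between segments, so in practice the threshold is usually tied to an estimate of the noise level rather than fixed by hand.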
Affiliation(s)
- Yuanning Liu
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH230027, China
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH230027, China
- Research centres for Biomedical Engineering, University of Science and Technology of China, Hefei, AH230027, China
| | - Huanqing Feng
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH230027, China
| | - Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH230027, China
- Research centres for Biomedical Engineering, University of Science and Technology of China, Hefei, AH230027, China
|
23
|
Kachouie NN, Lin X, Schwartzman A. FDR control of detected regions by multiscale matched filtering. COMMUN STAT-SIMUL C 2014; 46:127-144. [PMID: 31501637 PMCID: PMC6733272 DOI: 10.1080/03610918.2014.957842] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2014] [Accepted: 08/15/2014] [Indexed: 10/24/2022]
Abstract
Feature extraction from observed noisy samples is a common and important problem in statistics and engineering. This paper presents a novel general statistical approach to the region detection problem in long data sequences. The proposed technique is a multi-scale kernel regression in conjunction with statistical multiple testing for region detection while controlling the false discovery rate (FDR) and maximizing the signal to noise ratio (SNR) via matched filtering. This is achieved by considering a one-dimensional (1D) region detection problem as its equivalent 0D (zero dimensional) peak detection problem. The detection method does not require a priori knowledge of the shape of the non-zero regions. However, if the shape of the non-zero regions is known a priori, e.g. rectangular pulse, the signal regions can also be reconstructed from the detected peaks, seen as their topological point representatives. Simulations show that the method can effectively perform signal detection and reconstruction in the simulated data under high noise conditions, while controlling the FDR of detected regions and their reconstructed length.
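FDR control over a set of candidate peaks can be sketched with the Benjamini-Hochberg step-up procedure. The abstract does not name the multiple-testing procedure, so using Benjamini-Hochberg here is an assumption, and the p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values, one per candidate peak.
peak_p_values = [0.001, 0.035, 0.02, 0.8, 0.012]
print(benjamini_hochberg(peak_p_values, alpha=0.05))
```

Because the rule is step-up, a peak whose p-value exceeds its own rank threshold can still be declared significant when a larger p-value further down the ordering passes its threshold.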
Affiliation(s)
- Nezamoddin N Kachouie
- Department of Mathematical Sciences, Florida Institute of Technology, Melbourne, FL, USA
| | - Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
| | - Armin Schwartzman
- Department of Statistics, North Carolina State University, Raleigh, NC, USA
|
24
|
Seifert M, Abou-El-Ardat K, Friedrich B, Klink B, Deutsch A. Autoregressive higher-order hidden Markov models: exploiting local chromosomal dependencies in the analysis of tumor expression profiles. PLoS One 2014; 9:e100295. [PMID: 24955771 PMCID: PMC4067306 DOI: 10.1371/journal.pone.0100295] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2014] [Accepted: 05/22/2014] [Indexed: 12/21/2022] Open
Abstract
Changes in gene expression programs play a central role in cancer. Chromosomal aberrations such as deletions, duplications and translocations of DNA segments can lead to highly significant positive correlations of gene expression levels of neighboring genes. This should be utilized to improve the analysis of tumor expression profiles. Here, we develop a novel model class of autoregressive higher-order Hidden Markov Models (HMMs) that carefully exploit local data-dependent chromosomal dependencies to improve the identification of differentially expressed genes in tumor. Autoregressive higher-order HMMs overcome generally existing limitations of standard first-order HMMs in the modeling of dependencies between genes in close chromosomal proximity by the simultaneous usage of higher-order state-transitions and autoregressive emissions as novel model features. We apply autoregressive higher-order HMMs to the analysis of breast cancer and glioma gene expression data and perform in-depth model evaluation studies. We find that autoregressive higher-order HMMs clearly improve the identification of overexpressed genes with underlying gene copy number duplications in breast cancer in comparison to mixture models, standard first- and higher-order HMMs, and other related methods. The performance benefit is attributed to the simultaneous usage of higher-order state-transitions in combination with autoregressive emissions. This benefit could not be reached by using each of these two features independently. We also find that autoregressive higher-order HMMs are better able to identify differentially expressed genes in tumors independent of the underlying gene copy number status in comparison to the majority of related methods. 
This is further supported by the identification of well-known and of previously unreported hotspots of differential expression in glioblastomas demonstrating the efficacy of autoregressive higher-order HMMs for the analysis of individual tumor expression profiles. Moreover, we reveal interesting novel details of systematic alterations of gene expression levels in known cancer signaling pathways distinguishing oligodendrogliomas, astrocytomas and glioblastomas. An implementation is available under www.jstacs.de/index.php/ARHMM.
Affiliation(s)
- Michael Seifert
- Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
| | - Khalil Abou-El-Ardat
- Institute for Clinical Genetics, Faculty of Medicine Carl Gustav Carus, Dresden University of Technology, Dresden, Germany
| | - Betty Friedrich
- Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
| | - Barbara Klink
- Institute for Clinical Genetics, Faculty of Medicine Carl Gustav Carus, Dresden University of Technology, Dresden, Germany
| | - Andreas Deutsch
- Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
|
25
|
Abstract
MOTIVATION Studies of genomic DNA copy number alteration can deal with datasets with several million probes and thousands of subjects. Analyzing these data with currently available software (e.g. as available from BioConductor) can be extremely slow and may not be feasible because of memory requirements. RESULTS We have developed a BioConductor package, ADaCGH2, that parallelizes the main segmentation algorithms (using forking on multicore computers or parallelization via message passing interface, etc., in clusters of computers) and uses ff objects for reading and data storage. We show examples of data with 6 million probes per array; we can analyze data that would otherwise not fit in memory, and compared with the non-parallelized versions we can achieve speedups of 25-40 times on a 64-core machine. AVAILABILITY AND IMPLEMENTATION ADaCGH2 is an R package available from BioConductor. Version 2.3.11 or higher is available from the development branch: http://www.bioconductor.org/packages/devel/bioc/html/ADaCGH2.html.
Affiliation(s)
- Ramon Diaz-Uriarte
- Department of Biochemistry, Universidad Autónoma de Madrid, Instituto de Investigaciones Biomédicas 'Alberto Sols' (UAM-CSIC), 28029 Madrid, Spain
|
26
|
Confidence limits for genome DNA copy number variations in HR-CGH array measurements. Biomed Signal Process Control 2014. [DOI: 10.1016/j.bspc.2013.11.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
27
|
Satten GA, Allen AS, Ikeda M, Mulle JG, Warren ST. Robust regression analysis of copy number variation data based on a univariate score. PLoS One 2014; 9:e86272. [PMID: 24516529 PMCID: PMC3917847 DOI: 10.1371/journal.pone.0086272] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2013] [Accepted: 12/12/2013] [Indexed: 11/18/2022] Open
Abstract
Motivation The discovery that copy number variants (CNVs) are widespread in the human genome has motivated development of numerous algorithms that attempt to detect CNVs from intensity data. However, all approaches are plagued by high false discovery rates. Further, because CNVs are characterized by two dimensions (length and intensity) it is unclear how to order called CNVs to prioritize experimental validation. Results We developed a univariate score that correlates with the likelihood that a CNV is true. This score can be used to order CNV calls in such a way that calls having larger scores are more likely to overlap a true CNV. We developed cnv.beast, a computationally efficient algorithm for calling CNVs that uses robust backward elimination regression to keep CNV calls with scores that exceed a user-defined threshold. Using an independent dataset that was measured using a different platform, we validated our score and showed that our approach performed better than six other currently-available methods. Availability cnv.beast is available at http://www.duke.edu/~asallen/Software.html.
Affiliation(s)
- Glen A. Satten
- Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
| | - Andrew S. Allen
- Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, North Carolina, United States of America
| | - Morna Ikeda
- Department of Human Genetics, Emory University, Atlanta, Georgia, United States of America
| | - Jennifer G. Mulle
- Department of Human Genetics, Emory University, Atlanta, Georgia, United States of America
- Department of Epidemiology, Emory University, Atlanta, Georgia, United States of America
| | - Stephen T. Warren
- Department of Epidemiology, Emory University, Atlanta, Georgia, United States of America
|
28
|
Vandeweyer G, Kooy RF. Detection and interpretation of genomic structural variation in health and disease. Expert Rev Mol Diagn 2014; 13:61-82. [DOI: 10.1586/erm.12.119] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
29
|
Luong TM, Rozenholc Y, Nuel G. Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model. Comput Stat Data Anal 2013. [DOI: 10.1016/j.csda.2013.06.020] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
30
|
Subramanian A, Shackney S, Schwartz R. Novel multisample scheme for inferring phylogenetic markers from whole genome tumor profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1422-1431. [PMID: 24407301 PMCID: PMC3830698 DOI: 10.1109/tcbb.2013.33] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Computational cancer phylogenetics seeks to enumerate the temporal sequences of aberrations in tumor evolution, thereby delineating the evolution of possible tumor progression pathways, molecular subtypes, and mechanisms of action. We previously developed a pipeline for constructing phylogenies describing evolution between major recurring cell types computationally inferred from whole-genome tumor profiles. The accuracy and detail of the phylogenies, however, depend on the identification of accurate, high-resolution molecular markers of progression, i.e., reproducible regions of aberration that robustly differentiate different subtypes and stages of progression. Here, we present a novel hidden Markov model (HMM) scheme for the problem of inferring such phylogenetically significant markers through joint segmentation and calling of multisample tumor data. Our method classifies sets of genome-wide DNA copy number measurements into a partitioning of samples into normal (diploid) or amplified at each probe. It differs from other similar HMM methods in its design specifically for the needs of tumor phylogenetics, by seeking to identify robust markers of progression conserved across a set of copy number profiles. We show an analysis of our method in comparison to other methods on both synthetic and real tumor data, which confirms its effectiveness for tumor phylogeny inference and suggests avenues for future advances.
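Calling each probe as normal or amplified with an HMM can be sketched with a two-state Viterbi decoder. This is a generic single-sample illustration, not the multisample scheme the abstract describes; the Gaussian emissions, the state means, the standard deviation, and the transition probability are all assumed values chosen for the example.

```python
import math


def viterbi_two_state(log_ratios, means=(0.0, 0.58), sd=0.2, stay=0.99):
    """Most likely state path (0 = normal, 1 = amplified) for one profile.

    Assumes Gaussian emissions with a shared standard deviation and a
    symmetric transition matrix; all parameter values are illustrative.
    """
    states = range(len(means))

    def log_emit(x, mu):
        # Gaussian log-density up to a constant shared by all states.
        return -0.5 * ((x - mu) / sd) ** 2

    log_stay = math.log(stay)
    log_switch = math.log((1.0 - stay) / (len(means) - 1))

    # Uniform prior over the starting state.
    score = [log_emit(log_ratios[0], mu) - math.log(len(means)) for mu in means]
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for j in states:
            best_i = max(
                states,
                key=lambda i: score[i] + (log_stay if i == j else log_switch),
            )
            trans = log_stay if best_i == j else log_switch
            new_score.append(score[best_i] + trans + log_emit(x, means[j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical log2 ratios with a three-probe gain in the middle.
probes = [0.0, 0.05, -0.1, 0.6, 0.55, 0.62, 0.0, 0.1]
print(viterbi_two_state(probes))
```

The high self-transition probability penalises state changes, so an isolated noisy probe tends to be absorbed into its neighbours while a run of elevated probes is called as a single amplified segment.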
Affiliation(s)
- Ayshwarya Subramanian
- Graduate student at the Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213.
|
31
|
Comparing Segmentation Methods for Genome Annotation Based on RNA-Seq Data. JOURNAL OF AGRICULTURAL BIOLOGICAL AND ENVIRONMENTAL STATISTICS 2013. [DOI: 10.1007/s13253-013-0159-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
32
|
Plummer PJ, Chen J. A Bayesian approach for locating change points in a compound Poisson process with application to detecting DNA copy number variations. J Appl Stat 2013. [DOI: 10.1080/02664763.2013.840272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
33
|
Amarasinghe KC, Li J, Halgamuge SK. CoNVEX: copy number variation estimation in exome sequencing data using HMM. BMC Bioinformatics 2013; 14 Suppl 2:S2. [PMID: 23368785 PMCID: PMC3549847 DOI: 10.1186/1471-2105-14-s2-s2] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Background Copy number variation (CNV) is one of the main types of genetic variation in cancer. Whole exome sequencing (WES) is a popular alternative to whole genome sequencing (WGS) for studying disease-specific genomic variations. However, finding CNVs in cancer samples using WES data has not been fully explored. Results We present a new method, called CoNVEX, to estimate copy number variation in whole exome sequencing data. It uses the ratio of tumour and matched normal average read depths at each exonic region to predict copy gain or loss. The useful signal produced by WES data is obscured by the intrinsic noise present in the data itself. This limits its capacity to be used as a highly reliable CNV detection source. Here, we propose a method that uses a discrete wavelet transform (DWT) to reduce noise. The identification of copy number gains/losses of each targeted region is performed by a Hidden Markov Model (HMM). Conclusion HMMs are frequently used to identify CNVs in data produced by various technologies including Array Comparative Genomic Hybridization (aCGH) and WGS. Here, we propose an HMM to detect CNVs in cancer exome data. We used modified data from the 1000 Genomes project to evaluate the performance of the proposed method. Using these data we have shown that CoNVEX outperforms the existing methods significantly in terms of precision. Overall, CoNVEX achieved a sensitivity of more than 92% and a precision of more than 50%.
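The tumour-to-normal depth ratio per exonic region can be sketched as below. This is not the CoNVEX code: the library-size normalisation, the pseudocount, the log2 scale, and the function name are assumptions made for illustration, since the abstract only states that a ratio of average read depths is used.

```python
import math


def log2_depth_ratio(tumour_depths, normal_depths, pseudocount=0.5):
    """Per-exon log2 ratio of tumour to matched-normal average read depth.

    Depths are scaled by each sample's total so that differences in
    sequencing yield do not look like copy number changes (an assumed
    normalisation). The pseudocount avoids division by zero.
    """
    tumour_total = sum(tumour_depths)
    normal_total = sum(normal_depths)
    ratios = []
    for t, n in zip(tumour_depths, normal_depths):
        t_norm = (t + pseudocount) / tumour_total
        n_norm = (n + pseudocount) / normal_total
        ratios.append(math.log2(t_norm / n_norm))
    return ratios


# Hypothetical average depths for four exons; the third is gained in the tumour.
tumour = [100, 100, 200, 100]
normal = [100, 100, 100, 100]
print(log2_depth_ratio(tumour, normal))
```

Positive values suggest a gain and negative values a loss relative to the sample average; a profile like this would then be de-noised and segmented, for example by an HMM.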
Affiliation(s)
- Kaushalya C Amarasinghe
- Department of Mechanical Engineering, University of Melbourne, Parkville, VIC 3010, Australia.
|
34
|
Comparative analysis of methods for identifying recurrent copy number alterations in cancer. PLoS One 2012; 7:e52516. [PMID: 23285074 PMCID: PMC3527554 DOI: 10.1371/journal.pone.0052516] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2012] [Accepted: 11/14/2012] [Indexed: 11/19/2022] Open
Abstract
Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice since very few efforts have been focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genomes, it is imperative to conduct a comprehensive comparison of performance and limitations among existing methods. In this paper, six representative methods proposed in the last six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We explore multiple criteria including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, to evaluate performance of the methods under multiple simulation scenarios. We also characterize their abilities on applications to two real datasets obtained from lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs, and further provides new insights into their strengths and weaknesses. We believe it will help accelerate the development of novel and improved methods.
|
35
|
Lai LA, Risques RA, Bronner MP, Rabinovitch PS, Crispin D, Chen R, Brentnall TA. Pan-colonic field defects are detected by CGH in the colons of UC patients with dysplasia/cancer. Cancer Lett 2012; 320:180-8. [PMID: 22387989 PMCID: PMC3406733 DOI: 10.1016/j.canlet.2012.02.031] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 11/09/2011] [Revised: 02/21/2012] [Accepted: 02/23/2012] [Indexed: 02/08/2023]
Abstract
BAC arrays were used to evaluate genomic instability along the colon of patients with ulcerative colitis (UC). Genomic instability increases with disease progression and biopsies more proximal to dysplasia showed increased instability. Pan-colonic field copy number gain or loss involving small (<1Mb) regions were detected in most patients and were particularly apparent in the UC progressor patients who had dysplasia or cancer. Chromosomal copy gains or losses affecting large regions were mainly restricted to dysplastic biopsies. Areas of significant chromosomal losses were detected in the UC progressors on chromosomes 2q36, 3q25, 3p21, 4q34, 4p16.2, 15q22, and 16p13 (p-value⩽0.04). These results extend our understanding of the dynamic nature of pan-colonic genomic instability in this disease.
Affiliation(s)
- Lisa A Lai
- Department of Medicine, University of Washington, Seattle, WA, United States
|
36
|
Cutts RJ, Dayem Ullah AZ, Sangaralingam A, Gadaleta E, Lemoine NR, Chelala C. O-miner: an integrative platform for automated analysis and mining of -omics data. Nucleic Acids Res 2012; 40:W560-8. [PMID: 22600742 PMCID: PMC3394300 DOI: 10.1093/nar/gks432] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
High-throughput profiling has generated massive amounts of data across basic, clinical and translational research fields. However, open source comprehensive web tools for analysing data obtained from different platforms and technologies are still lacking. To fill this gap and the unmet computational needs of ongoing research projects, we developed O-miner, a rapid, comprehensive, efficient web tool that covers all the steps required for the analysis of both transcriptomic and genomic data starting from raw image files through in-depth bioinformatics analysis and annotation to biological knowledge extraction. O-miner was developed from a biologist end-user perspective. Hence, it is as simple to use as possible within the confines of the complexity of the data being analysed. It provides a strong analytical suite able to overlay and harness large, complicated, raw and heterogeneous sets of profiles with biological/clinical data. Biologists can use O-miner to analyse and integrate different types of data and annotations to build knowledge of relevant altered mechanisms and pathways in order to identify and prioritize novel targets for further biological validation. Here we describe the analytical workflows currently available using O-miner and present examples of use. O-miner is freely available at www.o-miner.org.
Affiliation(s)
- Rosalind J Cutts
- Centre for Molecular Oncology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
|
37
|
Seifert M, Gohr A, Strickert M, Grosse I. Parsimonious higher-order hidden Markov models for improved array-CGH analysis with applications to Arabidopsis thaliana. PLoS Comput Biol 2012; 8:e1002286. [PMID: 22253580 PMCID: PMC3257270 DOI: 10.1371/journal.pcbi.1002286] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 05/05/2011] [Accepted: 10/11/2011] [Indexed: 12/19/2022]
Abstract
Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are only based on first-order HMMs having constrained abilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines indicating that parsimonious HMMs are also well-suited for the analysis of non-plant specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open source Java library Jstacs (www.jstacs.de/index.php/PHHMM). Array-based comparative genomics is a standard approach for the identification of DNA copy number polymorphisms between closely related genomes. The huge amounts of data produced by these experiments require efficient and accurate bioinformatics tools for the identification of copy number polymorphisms. 
Hidden Markov Models (HMMs) are frequently used for analyzing such data sets, but current models are based on first-order HMMs only having limited capabilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. We develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling these dependencies to overcome this limitation. In an in-depth case study with Arabidopsis thaliana, we find that parsimonious higher-order HMMs clearly improve the identification of copy number polymorphisms in comparison to standard first-order HMMs and other frequently used methods. Functional analysis of identified polymorphisms revealed details of genomic differences between the accessions C24 and Col-0 of Arabidopsis thaliana. An additional study on human cell lines further indicates that parsimonious HMMs are well-suited for the analysis of Array-CGH data.
Affiliation(s)
- Michael Seifert
- Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
|
38
|
Baladandayuthapani V, Ji Y, Talluri R, Nieto-Barajas LE, Morris JS. Bayesian Random Segmentation Models to Identify Shared Copy Number Aberrations for Array CGH Data. J Am Stat Assoc 2012; 105:1358-1375. [PMID: 21512611 DOI: 10.1198/jasa.2010.ap09250] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 11/21/2022]
Abstract
Array-based comparative genomic hybridization (aCGH) is a high-resolution high-throughput technique for studying the genetic basis of cancer. The resulting data consists of log fluorescence ratios as a function of the genomic DNA location and provides a cytogenetic representation of the relative DNA copy number variation. Analysis of such data typically involves estimation of the underlying copy number state at each location and segmenting regions of DNA with similar copy number states. Most current methods proceed by modeling a single sample/array at a time, and thus fail to borrow strength across multiple samples to infer shared regions of copy number aberrations. We propose a hierarchical Bayesian random segmentation approach for modeling aCGH data that utilizes information across arrays from a common population to yield segments of shared copy number changes. These changes characterize the underlying population and allow us to compare different population aCGH profiles to assess which regions of the genome have differential alterations. Our method, referred to as BDSAcgh (Bayesian Detection of Shared Aberrations in aCGH), is based on a unified Bayesian hierarchical model that allows us to obtain probabilities of alteration states as well as probabilities of differential alteration that correspond to local false discovery rates. We evaluate the operating characteristics of our method via simulations and an application using a lung cancer aCGH data set.
|
39
|
Abstract
Genomic alterations have been linked to the development and progression of cancer. The technique of comparative genomic hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about the number of copies in DNA. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about the genomic alterations from array CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify gains and losses in the number of copies based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Because the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme, and breast cancer are analyzed, and comparisons are made with some widely used algorithms to illustrate the reliability and success of the technique.
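To make the hidden Markov idea concrete, here is a minimal sketch of a three-state HMM over log intensity ratios. It deliberately swaps in Viterbi decoding with fixed parameters in place of the Bayesian Metropolis-within-Gibbs inference the abstract describes, so it illustrates only the dependence structure. The state means, noise level, transition probabilities and data are all assumed values.

```python
import numpy as np

def viterbi_gaussian(log_ratios, means, sigma, stay_prob):
    """Most likely state path for a Gaussian-emission HMM (log domain)."""
    x = np.asarray(log_ratios, dtype=float)
    means = np.asarray(means, dtype=float)
    k = len(means)
    # Transition matrix: stay with probability stay_prob, else move uniformly.
    trans = np.full((k, k), (1.0 - stay_prob) / (k - 1))
    np.fill_diagonal(trans, stay_prob)
    log_trans = np.log(trans)
    # Gaussian log-likelihood of each observation per state (constants dropped).
    log_emit = -0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2
    score = np.log(np.full(k, 1.0 / k)) + log_emit[0]
    back = np.zeros((len(x), k), dtype=int)
    for t in range(1, len(x)):
        cand = score[:, None] + log_trans   # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(len(x) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

STATES = ["loss", "neutral", "gain"]            # hypothetical state labels
ratios = [0.02, -0.05, 0.61, 0.55, 0.58, 0.03, -0.52, -0.49]
path = viterbi_gaussian(ratios, means=[-0.5, 0.0, 0.5], sigma=0.1, stay_prob=0.9)
calls = [STATES[s] for s in path]
```

A fully Bayesian treatment would instead place priors on the means, variances and transition probabilities and report posterior probabilities of gain and loss at each clone.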
Affiliation(s)
- Subharup Guha
- Department of Statistics, University of Missouri-Columbia, Columbia, MO 65211
|
40
|
Mader M, Simon R, Steinbiss S, Kurtz S. FISH Oracle: a web server for flexible visualization of DNA copy number data in a genomic context. J Clin Bioinforma 2011; 1:20. [PMID: 21884636 PMCID: PMC3164613 DOI: 10.1186/2043-9113-1-20] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 04/05/2011] [Accepted: 07/28/2011] [Indexed: 12/17/2022]
Abstract
Background The rapidly growing amount of array CGH data requires improved visualization software supporting the process of identifying candidate cancer genes. Optimally, such software should work across multiple microarray platforms, should be able to cope with data from different sources and should be easy to operate. Results We have developed a web-based software FISH Oracle to visualize data from multiple array CGH experiments in a genomic context. Its fast visualization engine and advanced web and database technology supports highly interactive use. FISH Oracle comes with a convenient data import mechanism, powerful search options for genomic elements (e.g. gene names or karyobands), quick navigation and zooming into interesting regions, and mechanisms to export the visualization into different high quality formats. These features make the software especially suitable for the needs of life scientists. Conclusions FISH Oracle offers a fast and easy to use visualization tool for array CGH and SNP array data. It allows for the identification of genomic regions representing minimal common changes based on data from one or more experiments. FISH Oracle will be instrumental to identify candidate onco and tumor suppressor genes based on the frequency and genomic position of DNA copy number changes. The FISH Oracle application and an installed demo web server are available at http://www.zbh.uni-hamburg.de/fishoracle.
Affiliation(s)
- Malte Mader
- Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.
|
41
|
Stamoulis C, Betensky RA. A novel signal processing approach for the detection of copy number variations in the human genome. Bioinformatics 2011; 27:2338-45. [PMID: 21752800 DOI: 10.1093/bioinformatics/btr402] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
MOTIVATION Human genomic variability occurs at different scales, from single nucleotide polymorphisms (SNPs) to large DNA segments. Copy number variations (CNVs) represent a significant part of our genetic heterogeneity and have also been associated with many diseases and disorders. Short, localized CNVs, which may play an important role in human disease, may be undetectable in noisy genomic data. Therefore, robust methodologies are needed for their detection. Furthermore, for meaningful identification of pathological CNVs, estimation of normal allelic aberrations is necessary. RESULTS We developed a signal processing-based methodology for sequence denoising followed by pattern matching, to increase SNR in genomic data and improve CNV detection. We applied this signal-decomposition-matched filtering (SDMF) methodology to 429 normal genomic sequences, and compared detected CNVs to those in the Database of Genomic Variants. SDMF successfully detected a significant number of previously identified CNVs with frequencies of occurrence ≥10%, as well as unreported short CNVs. Its performance was also compared to circular binary segmentation (CBS) through simulations. SDMF had a significantly lower false detection rate and was significantly faster than CBS, an important advantage for handling large datasets generated with high-resolution arrays. By focusing on improving SNR (instead of the robustness of the detection algorithm), SDMF is a very promising methodology for identifying CNVs at all genomic spatial scales. AVAILABILITY The data are available at http://tcga-data.nci.nih.gov/tcga/ The software and list of analyzed sequence IDs are available at http://www.hsph.harvard.edu/~betensky/ A Matlab code for Empirical Mode Decomposition may be found at: http://www.clear.rice.edu/elec301/Projects02/empiricalMode/code.html CONTACT caterina@mit.edu.
Affiliation(s)
- Catherine Stamoulis
- Department of Radiology, Harvard School of Public Health, Boston, MA 02115, USA.
|
42
|
Dalmasso C, Broët P. Detection of chromosomal abnormalities using high resolution arrays in clinical cancer research. J Biomed Inform 2011; 44:936-42. [PMID: 21703362 DOI: 10.1016/j.jbi.2011.06.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/24/2010] [Revised: 05/11/2011] [Accepted: 06/06/2011] [Indexed: 01/15/2023]
Abstract
In clinical cancer research, high throughput genomic technologies are increasingly used to identify copy number aberrations. However, the admixture of tumor and stromal cells and the inherent karyotypic heterogeneity of most of the solid tumor samples make this task highly challenging. Here, we propose a robust two-step strategy to detect copy number aberrations in such a context. A spatial mixture model is first used to fit the preprocessed data. Then, a calling algorithm is applied to classify the genomic segments in three biologically meaningful states (copy loss, copy gain and modal copy). The results of a simulation study show the good properties of the proposed procedure with complex patterns of genomic aberrations. The interest of the proposed procedure in clinical cancer research is then illustrated by the analysis of real lung adenocarcinoma samples.
Affiliation(s)
- Cyril Dalmasso
- Genome Institute of Singapore, 60 Biopolis Street, 02-01 Genome, Singapore.
|
43
|
Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen RA, Seshan VE. Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics 2011; 27:2038-46. [PMID: 21666266 DOI: 10.1093/bioinformatics/btr329] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Indexed: 01/23/2023]
Abstract
MOTIVATION High-throughput techniques facilitate the simultaneous measurement of DNA copy number at hundreds of thousands of sites on a genome. Older techniques allow measurement only of total copy number, the sum of the copy number contributions from the two parental chromosomes. Newer single nucleotide polymorphism (SNP) techniques can in addition enable quantifying parent-specific copy number (PSCN). The raw data from such experiments are two-dimensional, but are unphased. Consequently, inference based on them necessitates development of new analytic methods. METHODS We have adapted and enhanced the circular binary segmentation (CBS) algorithm for this purpose with focus on paired test and reference samples. The essence of paired parent-specific CBS (Paired PSCBS) is to utilize the original CBS algorithm to identify regions of equal total copy number and then to further segment these regions where there have been changes in PSCN. For the final set of regions, calls are made of equal parental copy number and loss of heterozygosity (LOH). PSCN estimates are computed both before and after calling. RESULTS The methodology is evaluated by simulation and on glioblastoma data. In the simulation, PSCBS compares favorably to established methods. On the glioblastoma data, PSCBS identifies interesting genomic regions, such as copy-neutral LOH. AVAILABILITY The Paired PSCBS method is implemented in an open-source R package named PSCBS, available on CRAN (http://cran.r-project.org/).
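The core search step of circular binary segmentation can be sketched as follows. This is a simplified illustration, not the PSCBS package: it finds the single arc whose mean differs most from the rest of the sequence and omits the permutation-based significance test and the recursion. The function name and toy data are assumptions.

```python
import numpy as np

def best_circular_segment(x):
    """Return (i, j, t): the arc x[i:j] whose mean differs most from the rest."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums for O(1) arc sums
    total = csum[-1]
    sd = x.std(ddof=1)
    best = (0, n, 0.0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            m = j - i                     # length of the inner arc
            if m == n:
                continue                  # the arc must leave a complement
            inner = (csum[j] - csum[i]) / m
            outer = (total - (csum[j] - csum[i])) / (n - m)
            t = abs(inner - outer) / (sd * np.sqrt(1.0 / m + 1.0 / (n - m)))
            if t > best[2]:
                best = (i, j, t)
    return best

# Toy total copy number signal with a gained block at positions 4..6.
signal = np.array([0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 0.0, 0.1, -0.1])
i, j, t = best_circular_segment(signal)
```

In the paired parent-specific setting described above, a search of this kind would first be run on total copy number and then within each resulting region on a signal that reflects parental imbalance.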
Affiliation(s)
- Adam B Olshen
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA.
|
44
|
Nowak G, Hastie T, Pollack JR, Tibshirani R. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 2011; 12:776-91. [PMID: 21642389 DOI: 10.1093/biostatistics/kxr012] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Indexed: 11/13/2022]
Abstract
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
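The structure described here, each sample as a weighted sum of a few features with a fused lasso penalty on each feature, can be written down as an objective function. The sketch below only evaluates such an objective; it does not fit the model, and the matrix names `Y`, `B`, `Theta` and the tuning parameters `lam1`, `lam2` are notation assumed for illustration.

```python
import numpy as np

def fused_lasso_penalty(feature, lam1, lam2):
    """Sparsity term plus a term penalising jumps between adjacent probes."""
    feature = np.asarray(feature, dtype=float)
    return lam1 * np.abs(feature).sum() + lam2 * np.abs(np.diff(feature)).sum()

def objective(Y, B, Theta, lam1, lam2):
    """Squared reconstruction error plus the penalty applied to each feature."""
    residual = Y - B @ Theta
    penalty = sum(fused_lasso_penalty(B[:, j], lam1, lam2)
                  for j in range(B.shape[1]))
    return float((residual ** 2).sum() + penalty)

# Two hypothetical features over six probes: a gain block and a loss block.
B = np.array([[0, 0], [1, 0], [1, 0], [0, 0], [0, -1], [0, -1]], dtype=float)
# Weight of each feature (rows) in each of three samples (columns).
Theta = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])
Y = B @ Theta                      # noiseless data, so the residual is zero
value = objective(Y, B, Theta, lam1=0.1, lam2=0.2)
```

The jump term is what favours piecewise-constant features, so nonzero stretches of a fitted feature can be read as candidate CNV regions, and the columns of the weight matrix show which samples share them.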
Affiliation(s)
- Gen Nowak
- Department of Biostatistics, Harvard University, Boston, MA 02115, USA.
|
45
|
Darby BJ, Jones KL, Wheeler D, Herman MA. Normalization and centering of array-based heterologous genome hybridization based on divergent control probes. BMC Bioinformatics 2011; 12:183. [PMID: 21600029 PMCID: PMC3125262 DOI: 10.1186/1471-2105-12-183] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/16/2010] [Accepted: 05/21/2011] [Indexed: 11/21/2022]
Abstract
Background Hybridization of heterologous (non-specific) nucleic acids onto arrays designed for model-organisms has been proposed as a viable genomic resource for estimating sequence variation and gene expression in non-model organisms. However, conventional methods of normalization that assume equivalent distributions (such as quantile normalization) are inappropriate when applied to non-specific (heterologous) hybridization. We propose an algorithm for normalizing and centering intensity data from heterologous hybridization that makes no prior assumptions of distribution, reduces the false appearance of homology, and provides a way for researchers to confirm whether heterologous hybridization is suitable. Results Data are normalized by adjusting for Gibbs free energy binding, and centered by adjusting for the median of a common set of control probes assumed to be equivalently dissimilar for all species. This procedure was compared to existing approaches and found to be as successful as Loess normalization at detecting sequence variations (deletions) and even more successful than quantile normalization at reducing the accumulation of false positive probe matches between two related nematode species, Caenorhabditis elegans and C. briggsae. Despite the improvements, we still found that probe fluorescence intensity was too poorly correlated with sequence similarity to result in reliable detection of matching probe sequence. Conclusions Cross-species hybridizations can be a way to adapt genome-enabled tools for closely related non-model organisms, but data must be appropriately normalized and centered in a way that accommodates hybridization of nucleic acids with diverged sequence. For short, 25-mer probes, hybridization intensity alone may be insufficiently correlated with sequence similarity to allow reliable inference of homology at the probe level.
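The centering step mentioned in the Results can be sketched in a few lines. The abstract does not give the form of the Gibbs free energy adjustment, so that step is left out here; the function name, the log-intensity scale and the toy values are assumptions.

```python
import numpy as np

def center_on_controls(log_intensity, is_control):
    """Subtract the median of the control probes from every probe."""
    log_intensity = np.asarray(log_intensity, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)
    return log_intensity - np.median(log_intensity[is_control])

# Hypothetical log intensities for two species on the same array design.
# The last three probes are controls assumed equally dissimilar to both genomes.
is_control = np.array([False, False, False, False, True, True, True])
species_a = np.array([9.1, 8.7, 9.4, 6.2, 5.0, 5.2, 4.9])
species_b = np.array([8.0, 7.9, 5.1, 5.3, 4.1, 4.3, 4.0])

centered_a = center_on_controls(species_a, is_control)
centered_b = center_on_controls(species_b, is_control)
```

Unlike quantile normalization, this shifts each array by a single constant and so does not force the two intensity distributions to match.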
Affiliation(s)
- Brian J Darby
- Ecological Genomics Institute, Division of Biology, Kansas State University, Manhattan, KS 66506, USA
|
46
|
Chen CH, Lee HC, Ling Q, Chen HR, Ko YA, Tsou TS, Wang SC, Wu LC, Lee HC. An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Res 2011; 39:e89. [PMID: 21576227 PMCID: PMC3141250 DOI: 10.1093/nar/gkr137] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
Detection of copy number variation (CNV) in DNA has recently become an important method for understanding the pathogenesis of cancer. While existing algorithms for extracting CNV from microarray data have worked reasonably well, the trend towards ever larger sample sizes and higher resolution microarrays has vastly increased the challenges they face. Here, we present Segmentation analysis of DNA (SAD), a clustering algorithm constructed with a strategy in which all operational decisions are based on simple and rigorous applications of statistical principles, measurement theory and precise mathematical relations. Compared with existing packages, SAD is simpler in formulation, more user friendly, much faster and less thirsty for memory, offers higher accuracy and supplies quantitative statistics for its predictions. Unique among such algorithms, SAD's running time scales linearly with array size; on a typical modern notebook, it completes high-quality CNV analyses for a 250 thousand-probe array in ∼1 s and a 1.8 million-probe array in ∼8 s.
Affiliation(s)
- Chih-Hao Chen
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan 32001
|
47
|
Hur Y, Lee H. Wavelet-based identification of DNA focal genomic aberrations from single nucleotide polymorphism arrays. BMC Bioinformatics 2011; 12:146. [PMID: 21569311 PMCID: PMC3114745 DOI: 10.1186/1471-2105-12-146] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 10/13/2010] [Accepted: 05/11/2011] [Indexed: 11/10/2022]
Abstract
Background Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs. Results In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). The use of the wavelet analysis, because it is a multi-resolution approach, makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it enables the detection of the consistent aberrations across multiple samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, though these regions are not reported using a recent CNAs detecting algorithm GISTIC: SMAD7 (chr18q21.1) and FGF10 (chr5p12). Conclusions Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets.
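A small sketch shows why a multi-resolution view helps separate focal from broad aberrations. The abstract does not say which wavelet or statistics WIFA uses, so the Haar averaging below is an assumption chosen for simplicity, and the signal is a toy example.

```python
import numpy as np

def haar_step(signal):
    """One Haar level: pairwise averages (smooth) and half-differences (detail)."""
    s = np.asarray(signal, dtype=float)
    return (s[0::2] + s[1::2]) / 2.0, (s[0::2] - s[1::2]) / 2.0

def haar_multires(signal, levels):
    """Return the smooth approximation at each successively coarser level."""
    approximations = []
    current = np.asarray(signal, dtype=float)
    for _ in range(levels):
        current, _detail = haar_step(current)
        approximations.append(current)
    return approximations

# Toy profile: a broad gain of 0.5 across 16 probes, with a focal peak of 2.0
# at probes 4 and 5. The length must be divisible by 2**levels.
profile = np.full(16, 0.5)
profile[4:6] = 2.0
approximations = haar_multires(profile, levels=3)
```

The focal peak shrinks from 2.0 to 1.25 to 0.875 as the scale coarsens, while the broad gain stays at 0.5. Comparing scales in this way is one route to flagging focal events inside broadly aberrant regions.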
Affiliation(s)
- Youngmi Hur
- Dept. of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
|
48
|
Koike A, Nishida N, Yamashita D, Tokunaga K. Comparative analysis of copy number variation detection methods and database construction. BMC Genet 2011; 12:29. [PMID: 21385384 PMCID: PMC3058066 DOI: 10.1186/1471-2156-12-29] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 08/29/2010] [Accepted: 03/07/2011] [Indexed: 12/13/2022]
Abstract
Background Array-based detection of copy number variations (CNVs) is widely used for identifying disease-specific genetic variations. However, the accuracy of CNV detection is not sufficient and results differ depending on the detection programs used and their parameters. In this study, we evaluated five widely used CNV detection programs, Birdsuite (mainly consisting of the Birdseye and Canary modules), Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy from the viewpoint of performance on the Affymetrix platform using HapMap data and other experimental data. Furthermore, we identified CNVs of 180 healthy Japanese individuals using parameters that showed the best performance in the HapMap data and investigated their characteristics. Results The results indicate that Hidden Markov model-based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite show better detection performance than other programs when the high reproducibility rates of the same individuals and the low Mendelian inconsistencies are considered. Furthermore, when rates of overlap with other experimental results were taken into account, Birdsuite showed the best performance from the view point of sensitivity but was expected to include many false negatives and some false positives. The results of 180 healthy Japanese demonstrate that the ratio containing repeat sequences, not only segmental repeats but also long interspersed nuclear element (LINE) sequences both in the start and end regions of the CNVs, is higher in CNVs that are commonly detected among multiple individuals than that in randomly selected regions, and the conservation score based on primates is lower in these regions than in randomly selected regions. Similar tendencies were observed in HapMap data and other experimental data. Conclusions Our results suggest that not only segmental repeats but also interspersed repeats, especially LINE sequences, are deeply involved in CNVs, particularly in common CNV formations. 
The detected CNVs are stored in the CNV repository database newly constructed by the "Japanese integrated database project" for sharing data among researchers. http://gwas.lifesciencedb.jp/cgi-bin/cnvdb/cnv_top.cgi
Affiliation(s)
- Asako Koike
- Central Research Laboratory, Hitachi Ltd., Tokyo, Japan.
|
49
|
Chen H, Xing H, Zhang NR. Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput Biol 2011; 7:e1001060. [PMID: 21298078 PMCID: PMC3029233 DOI: 10.1371/journal.pcbi.1001060] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 01/07/2010] [Accepted: 12/17/2010] [Indexed: 01/01/2023]
Abstract
Chromosomal gains and losses comprise an important type of genetic change in tumors, and can now be assayed using microarray hybridization-based experiments. Most current statistical models for DNA copy number estimate only the total copy number and do not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, sometimes called parent-specific copy number, is important for identifying allele-specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor. We propose a stochastic segmentation model for parent-specific DNA copy number in tumor samples, and give an estimation procedure that is computationally efficient and can be applied to data from the current high-density genotyping platforms. The proposed method does not require matched normal samples, and can estimate the unknown genotypes simultaneously with the parent-specific copy number. The new method is used to analyze 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the copy number events in these samples. Detailed case studies on these samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as the quantification of fractional gains and losses, the identification of copy-neutral loss of heterozygosity, and the characterization of regions of simultaneous changes of both inherited chromosomes.
Many genetic diseases are related to copy number aberrations in some regions of the genome. Each chromosome normally has two copies. Under some circumstances, however, one or both of the copies change in some regions. Genotyping microarray data provide the copy number of the two alleles at polymorphic sites along the chromosomes, which makes inference of chromosomal copy number aberrations feasible. One difficulty is that genotyping microarray data cannot provide the haplotype of the two copies of a chromosome. In this paper, we model the copy number along the chromosome as a two-dimensional Markov chain. Using the observed copy number of both alleles at all sites, we can determine the parent-specific copy number along the chromosome and infer the haplotypes of the two inherited chromosomes in regions where there is allelic imbalance. Simulation results show high sensitivity and specificity of the method. Applying this method to glioblastoma samples from the Cancer Genome Atlas data illustrates the insights gained from allele-specific copy number analysis.
Affiliation(s)
- Hao Chen
- Department of Statistics, Stanford University, Stanford, California, United States of America
- Haipeng Xing
- Department of Applied Mathematics and Statistics, SUNY at Stony Brook, Stony Brook, New York, United States of America
- Nancy R. Zhang
- Department of Statistics, Stanford University, Stanford, California, United States of America
- * E-mail:
|
50
|
Genome-wide copy number alterations in subtypes of invasive breast cancers in young white and African American women. Breast Cancer Res Treat 2011; 127:297-308. [PMID: 21264507 DOI: 10.1007/s10549-010-1297-x] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 08/30/2010] [Accepted: 12/05/2010] [Indexed: 12/28/2022]
Abstract
Genomic copy number alterations (CNA) are common in breast cancer. Identifying characteristic CNAs associated with specific breast cancer subtypes is a critical step in defining potential mechanisms of disease initiation and progression. We used genome-wide array comparative genomic hybridization to identify distinctive CNAs in breast cancer subtypes from 259 young (diagnosed with breast cancer at <55 years) African American (AA) and Caucasian American (CA) women originally enrolled in a larger population-based study. We compared the average frequency of CNAs across the whole genome for each breast tumor subtype and found that estrogen receptor (ER)-negative tumors had a higher average frequency of genome-wide gain (P < 0.0001) and loss (P = 0.02) compared to ER-positive tumors. Triple-negative (TN) tumors had a higher average frequency of genome-wide gain (P < 0.0001) and loss (P = 0.003) than non-TN tumors. No significant difference in CNA frequency was observed between HER2-positive and -negative tumors. We also identified previously unreported recurrent CNAs (frequency >40%) for TN breast tumors at 10q, 11p, 11q, 16q, 20p, and 20q. In addition, we report CNAs that differ in frequency between TN breast tumors of AA and CA women. This is of particular relevance because TN breast cancer is associated with higher mortality and young AA women have higher rates of TN breast tumors compared to CA women. These data support the possibility that higher overall frequency of genomic alteration events as well as specific focal CNAs in TN breast tumors might contribute in part to the poor breast cancer prognosis for young AA women.
|