1
Iddawela M, Rueda O, Eremin J, Eremin O, Cowley J, Earl HM, Caldas C. Integrative analysis of copy number and gene expression in breast cancer using formalin-fixed paraffin-embedded core biopsy tissue: a feasibility study. BMC Genomics 2017; 18:526. [PMID: 28697743 PMCID: PMC5506605 DOI: 10.1186/s12864-017-3867-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/16/2016] [Accepted: 06/16/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND An absence of reliable molecular markers has hampered individualised breast cancer treatments, and a major limitation for translational research is the lack of fresh tissue. There are, however, abundant banks of formalin-fixed paraffin-embedded (FFPE) tissue. This study evaluated two platforms available for the analysis of DNA copy number and gene expression using FFPE samples. METHODS The cDNA-mediated annealing, selection, extension, and ligation assay (DASL™) was used for gene expression analysis and the Molecular Inversion Probes assay (Oncoscan™) was used for copy number analysis of FFPE tissues. Gene expression and copy number were evaluated in core-biopsy samples from patients with breast cancer undergoing neoadjuvant chemotherapy (NAC). RESULTS Forty-three core biopsies were evaluated, and the copy number changes characteristic of breast cancers (gains in 1q, 8q, 11q, 17q and 20q and losses in 6q, 8p, 13q and 16q) were confirmed. Regions that frequently exhibited gains in tumours showing a pathological complete response (pCR) to NAC were 1q (55%), 8q (40%) and 17q (40%), whereas 11q11 gain (37%) was the most frequent change in non-pCR tumours. Gains associated with poor survival were 11q13 (62%), 8q24 (54%) and 20q (47%). Gene expression assessed by DASL correlated with immunohistochemistry (IHC) analysis for oestrogen receptor (ER) [area under the curve (AUC) = 0.95], progesterone receptor (PR) (AUC = 0.90) and human epidermal growth factor type-2 receptor (HER-2) (AUC = 0.96). Differential expression analysis between ER+ and ER- cancers identified over-expression of TTF1, LAF-4 and C-MYB (p ≤ 0.05), and analysis between pCR and non-pCR tumours identified over-expression of CXCL9, AREG and B-MYB and under-expression of ABCG2. CONCLUSION This study was an integrative analysis of copy number and gene expression using FFPE core biopsies and showed that molecular marker data from FFPE tissues were consistent with those in previous studies using fresh-frozen samples. FFPE tissue can provide reliable information and will be a useful tool in molecular marker studies. TRIAL REGISTRATION Trial registration number ISRCTN09184069, registered retrospectively on 02/06/2010.
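The AUC values quoted above compare a continuous expression measurement with a binary IHC status. As a minimal sketch of what such a figure measures, the AUC can be computed as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (the Mann-Whitney formulation). The function name and the expression values below are hypothetical and are not taken from the study.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive sample outranks a negative one."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical expression values for illustration only.
er_positive = [9.1, 8.7, 7.9, 6.2]
er_negative = [5.0, 6.5, 4.8]
print(auc(er_positive, er_negative))
```

An AUC of 1.0 would mean every IHC-positive sample has higher expression than every IHC-negative sample; 0.5 corresponds to no discrimination.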
Affiliation(s)
- Mahesh Iddawela
  - Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE UK
  - Department of Oncology, University of Cambridge, Addenbrooke’s Hospital, Hills Road, Cambridge, UK
  - Cambridge Breast Unit, Addenbrooke’s Hospital, Cambridge University Hospitals NHS Foundation Trust, NIHR Cambridge Biomedical Research Centre and Cambridge Experimental Cancer Medicine Centre, Cambridge, UK
  - Department of Anatomy & Developmental Biology, Monash University, Clayton, VIC 3800 Australia
  - School of Clinical Sciences, Monash University, Clayton, Australia
- Oscar Rueda
  - Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE UK
- Jenny Eremin
  - Research and Development, Lincoln Breast Unit, Lincoln County Hospital, Lincoln, UK
  - Nottingham Digestive Disease Centre, Faculty of Medicine and Health Sciences, University of Nottingham, Queen’s Medical Centre, Nottingham, UK
- Oleg Eremin
  - Research and Development, Lincoln Breast Unit, Lincoln County Hospital, Lincoln, UK
  - Nottingham Digestive Disease Centre, Faculty of Medicine and Health Sciences, University of Nottingham, Queen’s Medical Centre, Nottingham, UK
- Jed Cowley
  - PathLinks, Lincoln County Hospital, Lincoln, UK
- Helena M. Earl
  - Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE UK
  - Department of Oncology, University of Cambridge, Addenbrooke’s Hospital, Hills Road, Cambridge, UK
  - Cambridge Breast Unit, Addenbrooke’s Hospital, Cambridge University Hospitals NHS Foundation Trust, NIHR Cambridge Biomedical Research Centre and Cambridge Experimental Cancer Medicine Centre, Cambridge, UK
- Carlos Caldas
  - Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE UK
  - Department of Oncology, University of Cambridge, Addenbrooke’s Hospital, Hills Road, Cambridge, UK
  - Cambridge Breast Unit, Addenbrooke’s Hospital, Cambridge University Hospitals NHS Foundation Trust, NIHR Cambridge Biomedical Research Centre and Cambridge Experimental Cancer Medicine Centre, Cambridge, UK
2
Malekpour SA, Pezeshk H, Sadeghi M. PSE-HMM: genome-wide CNV detection from NGS data using an HMM with Position-Specific Emission probabilities. BMC Bioinformatics 2016; 18:30. [PMID: 27809781 PMCID: PMC5445519 DOI: 10.1186/s12859-016-1296-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/22/2016] [Accepted: 10/20/2016] [Indexed: 11/23/2022] Open
Abstract
Background Copy Number Variation (CNV) is envisaged to be a major source of large structural variations in the human genome. In recent years, many studies have applied Next Generation Sequencing (NGS) data to CNV detection. However, more accurate computational tools are still needed. Results In this study, mate pair NGS data are used for CNV detection in a Hidden Markov Model (HMM). The proposed HMM has position-specific emission probabilities, modelled as a Gaussian mixture distribution. Each component in the Gaussian mixture distribution captures a different type of aberration that is observed in the mate pairs after they are mapped to the reference genome. These aberrations may include an increase or decrease in the insertion size, or a change in the direction of the mapped mate pairs. This HMM with Position-Specific Emission probabilities (PSE-HMM) is used for the genome-wide detection of deletions and tandem duplications. The performance of PSE-HMM is evaluated on a simulated dataset and on real data from a Yoruban HapMap individual, NA18507. Conclusions PSE-HMM is effective in taking observation dependencies into account and reaches a high accuracy in detecting genome-wide CNVs. MATLAB programs are available at http://bs.ipm.ir/softwares/PSE-HMM/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1296-y) contains supplementary material, which is available to authorized users.
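To make the idea of decoding copy-number states from mate-pair insertion sizes concrete, here is a minimal sketch of Viterbi decoding in an HMM with Gaussian emissions. It is a deliberate simplification and not the PSE-HMM implementation: it uses one fixed Gaussian per state rather than a position-specific Gaussian mixture, and the state names, means, standard deviations and transition probabilities are all assumed values chosen for illustration.

```python
import math


def log_gauss(x, mean, sd):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, states, log_start, log_trans, emit):
    """Most probable state path; emit[s] is the (mean, sd) of state s."""
    score = {s: log_start[s] + log_gauss(obs[0], *emit[s]) for s in states}
    back = []
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            score[s] = prev[best] + log_trans[best][s] + log_gauss(x, *emit[s])
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Assumed toy model: a deletion in the sample makes mate pairs map farther apart.
states = ["normal", "deletion"]
emit = {"normal": (3000.0, 300.0), "deletion": (5000.0, 300.0)}
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "normal": {"normal": math.log(0.95), "deletion": math.log(0.05)},
    "deletion": {"normal": math.log(0.05), "deletion": math.log(0.95)},
}
insert_sizes = [3010, 2950, 5100, 4900, 5050, 3020]
print(viterbi(insert_sizes, states, log_start, log_trans, emit))
```

In this toy input the three enlarged insertion sizes are decoded as a contiguous deletion segment, which is the kind of segmentation an HMM-based caller produces.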
Affiliation(s)
- Seyed Amir Malekpour
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, 14155-6455, Iran
- Hamid Pezeshk
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, 14155-6455, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Mehdi Sadeghi
  - National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
3
Malekpour SA, Pezeshk H, Sadeghi M. MGP-HMM: Detecting genome-wide CNVs using an HMM for modeling mate pair insertion sizes and read counts. Math Biosci 2016; 279:53-62. [PMID: 27424951 DOI: 10.1016/j.mbs.2016.07.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/26/2016] [Revised: 06/12/2016] [Accepted: 07/10/2016] [Indexed: 01/02/2023]
Abstract
MOTIVATION The association of Copy Number Variation (CNV) with schizophrenia, autism, developmental disabilities and fatal diseases such as cancer has been established. Recent developments in Next Generation Sequencing (NGS) have facilitated CNV studies. However, many current CNV detection tools cannot discriminate tandem duplications from non-tandem duplications. RESULTS In this study, we propose MGP-HMM, a tool which, besides detecting genome-wide deletions, discriminates tandem duplications from non-tandem duplications. MGP-HMM takes mate pair abnormalities into account and predicts the digitized number of tandem or non-tandem copies. Abnormalities in the mate pair directions and insertion sizes, after mapping to the reference genome, are modelled using a Hidden Markov Model (HMM). For this purpose, a Mixture Gaussian density with time-dependent parameters is used for emitting mate pair insertion sizes from HMM states: depending on the observed abnormalities in mate pair insertion size or orientation, each component in the mixture density has different parameters. MGP-HMM also applies a Poisson distribution for modeling read depth data. This parametric modeling of the mate pair reads enables the length of CNVs to be estimated precisely, which is an advantage over methods that rely only on read depth for CNV detection. The hidden state of the proposed HMM is the digitized copy number of a genomic segment, and states correspond to the multipliers of the mixture Gaussian components. The accuracy of the model is validated on real and simulated next generation sequencing data and compared with that of other tools.
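The read-depth component described above can be illustrated with a small sketch: if the read count in a window is Poisson distributed with a rate proportional to copy number, the most likely copy number is the one whose rate best explains the observed count. This is an assumed, simplified formulation for illustration (the function names, the window-level rate scaling and the small floor used for copy number zero are hypothetical), not the MGP-HMM model itself.

```python
import math


def poisson_log_pmf(k, rate):
    """Log probability of observing k events under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def most_likely_copy_number(read_count, diploid_rate, max_cn=4, floor=0.01):
    """Pick the copy number whose Poisson rate best explains a window's read count.

    diploid_rate is the expected count at copy number 2; the floor keeps the
    rate positive at copy number 0 (stray or mismapped reads).
    """
    best_cn, best_ll = None, -math.inf
    for cn in range(max_cn + 1):
        rate = max(diploid_rate * cn / 2.0, floor * diploid_rate)
        ll = poisson_log_pmf(read_count, rate)
        if ll > best_ll:
            best_cn, best_ll = cn, ll
    return best_cn


# Hypothetical windows with an expected diploid count of 100 reads.
for count in (0, 48, 100, 160):
    print(count, most_likely_copy_number(count, diploid_rate=100.0))
```

Read depth alone assigns a copy number per window but says little about exact breakpoints, which is why the abstract stresses the added value of mate pair information for estimating CNV length.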
Affiliation(s)
- Seyed Amir Malekpour
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran.
- Hamid Pezeshk
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran.
- Mehdi Sadeghi
  - National Institute of Genetic Engineering and Biotechnology, Tehran, Iran.
4
Limiting replication stress during somatic cell reprogramming reduces genomic instability in induced pluripotent stem cells. Nat Commun 2015; 6:8036. [PMID: 26292731 PMCID: PMC4560784 DOI: 10.1038/ncomms9036] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Received: 03/25/2015] [Accepted: 07/06/2015] [Indexed: 02/07/2023] Open
Abstract
The generation of induced pluripotent stem cells (iPSC) from adult somatic cells is one of the most remarkable discoveries in recent decades. However, several studies have reported evidence of genomic instability in iPSC, raising concerns about their biomedical use. The reasons behind the genomic instability observed in iPSC remain mostly unknown. Here we show that, similar to the phenomenon of oncogene-induced replication stress, the expression of reprogramming factors induces replication stress. Increasing the levels of the checkpoint kinase 1 (CHK1) reduces reprogramming-induced replication stress and increases the efficiency of iPSC generation. Similarly, nucleoside supplementation during reprogramming reduces the load of DNA damage and genomic rearrangements in iPSC. Our data reveal that lowering replication stress during reprogramming, genetically or chemically, provides a simple strategy to reduce genomic instability in mouse and human iPSC.
The expression of reprogramming factors can induce replication stress in induced pluripotent stem cells. In this study, to reduce such genomic instability, Ruiz et al. increase CHK1 kinase levels and nucleoside supplementation during reprogramming.
5
Mayrink VD, Lucas JE. Bayesian factor models for the detection of coherent patterns in gene expression data. Braz J Probab Stat 2015. [DOI: 10.1214/13-bjps226] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
6
Analysis of structural diversity in wolf-like canids reveals post-domestication variants. BMC Genomics 2014; 15:465. [PMID: 24923435 PMCID: PMC4070573 DOI: 10.1186/1471-2164-15-465] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 05/14/2014] [Accepted: 06/06/2014] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Although a variety of genetic changes have been implicated in causing phenotypic differences among dogs, the role of copy number variants (CNVs) and their impact on phenotypic variation is still poorly understood. Further, very limited knowledge exists on structural variation in the gray wolf, the ancestor of the dog, or in other closely related wild canids. Documenting CNV variation in wild canids is essential to identify ancestral states and variation that may have appeared after domestication. RESULTS In this work, we genotyped 1,611 dog CNVs in 23 wolf-like canids (4 purebred dogs, one dingo, 15 gray wolves, one red wolf, one coyote and one golden jackal) to identify CNVs that may have arisen after domestication. We found an increase in GC-rich regions close to the breakpoints and around 1 kb away from them, suggesting that some common motifs might be associated with the formation of CNVs. Among the CNV regions that showed the largest differentiation between dogs and wild canids we found 12 genes, nine of which are related to two known functions associated with dog domestication: growth (PDE4D, CRTC3 and NEB) and neurological function (PDE4D, EML5, ZNF500, SLC6A11, ELAVL2, RGS7 and CTSB). CONCLUSIONS Our results provide insight into the evolution of structural variation in canines, where recombination is not regulated by PRDM9 due to the inactivation of this gene. We also identified genes within the most differentiated CNV regions between dogs and wolves, which could reflect selection during the domestication process.
7
Vandeweyer G, Kooy RF. Detection and interpretation of genomic structural variation in health and disease. Expert Rev Mol Diagn 2014; 13:61-82. [DOI: 10.1586/erm.12.119] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 01/04/2023]
8
Bao L, Pu M, Messer K. AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data. Bioinformatics 2014; 30:1056-1063. [PMID: 24389661 DOI: 10.1093/bioinformatics/btt759] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 05/29/2013] [Accepted: 12/23/2013] [Indexed: 12/30/2022]
Abstract
MOTIVATION Detection and quantification of the absolute DNA copy number alterations in tumor cells is challenging because the DNA specimen is extracted from a mixture of tumor and normal stromal cells. Estimates of tumor purity and ploidy are necessary to correctly infer copy number, and ploidy may itself be a prognostic factor in cancer progression. As deep sequencing of the exome or genome has become routine for characterization of tumor samples, in this work, we aim to develop a simple and robust algorithm to infer purity, ploidy and absolute copy numbers in whole numbers for tumor cells from sequencing data. RESULTS A simulation study shows that estimates have reasonable accuracy, and that the algorithm is robust against the presence of segmentation errors and subclonal populations. We validated our algorithm against a panel of cell lines with experimentally determined ploidy. We also compared our algorithm with the well-established single-nucleotide polymorphism array-based method called ABSOLUTE on three sets of tumors of different types. Our method had good performance on these four benchmark datasets for both purity and ploidy estimates, and may offer a simple solution to copy number alteration quantification for cancer sequencing projects. AVAILABILITY AND IMPLEMENTATION The R package absCNseq is available from http://biostats.mcc.ucsd.edu/files/absCNseq_1.0.tar.gz CONTACT: kmesser@ucsd.edu Supplementary information: Supplementary data are available at Bioinformatics online.
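Why purity and ploidy are needed to infer absolute copy number can be seen from the usual mixture relationship between a tumor/normal depth ratio and the underlying tumor copy number. The sketch below assumes a diploid normal contaminant and a ratio normalised to the genome-wide average; it illustrates that general relationship and is not claimed to be the exact model used in the absCNseq package. Function and variable names are hypothetical.

```python
def expected_ratio(cn, purity, ploidy):
    """Expected tumor/normal depth ratio for a segment with tumor copy number cn.

    Assumes the sample is a mix of tumor cells (fraction `purity`, average copy
    number `ploidy`) and diploid normal cells.
    """
    normal = 2.0 * (1.0 - purity)
    return (purity * cn + normal) / (purity * ploidy + normal)


def absolute_copy_number(ratio, purity, ploidy):
    """Invert expected_ratio to recover the (possibly non-integer) tumor copy number."""
    normal = 2.0 * (1.0 - purity)
    return (ratio * (purity * ploidy + normal) - normal) / purity


# Hypothetical sample: 60% tumor cells, diploid tumor genome, one segment at 3 copies.
ratio = expected_ratio(3, purity=0.6, ploidy=2.0)
print(ratio, absolute_copy_number(ratio, purity=0.6, ploidy=2.0))
```

The same observed ratio maps to different absolute copy numbers under different purity and ploidy values, which is why these quantities have to be estimated jointly; rounding the inverted value gives the whole-number copy number mentioned in the abstract.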
Affiliation(s)
- Lei Bao
  - Division of Biostatistics, Moores Cancer Center, University of California-San Diego, La Jolla, CA 92093, USA
- Minya Pu
  - Division of Biostatistics, Moores Cancer Center, University of California-San Diego, La Jolla, CA 92093, USA
- Karen Messer
  - Division of Biostatistics, Moores Cancer Center, University of California-San Diego, La Jolla, CA 92093, USA
9
Copy number variation genotyping using family information. BMC Bioinformatics 2013; 14:157. [PMID: 23656838 PMCID: PMC3668900 DOI: 10.1186/1471-2105-14-157] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/14/2012] [Accepted: 04/30/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In recent years there has been a growing interest in the role of copy number variations (CNVs) in genetic diseases. Although technologies and statistical methods for detecting CNVs from array data have developed rapidly, the data quality problems inherent in most hybridization techniques remain a challenge in CNV association studies. RESULTS To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for intensity-based array data that takes family information into account for copy-number assignment. The method is an adaptation of traditional methods for modeling SNP genotype data that assume a Gaussian mixture model, whereby CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves CNV call accuracy and reduces Mendelian inconsistency rates and false positive genotype calls. The results were validated using qPCR experiments. CONCLUSIONS We have demonstrated that the use of family information can improve the quality of CNV calling and may yield more powerful association tests of CNVs.
10
Rueda OM, Rueda C, Diaz-Uriarte R. A Bayesian HMM with random effects and an unknown number of states for DNA copy number analysis. J Stat Comput Simul 2013. [DOI: 10.1080/00949655.2011.609818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
11
Rueda OM, Diaz-Uriarte R, Caldas C. Finding common regions of alteration in copy number data. Methods Mol Biol 2013; 973:339-53. [PMID: 23412800 DOI: 10.1007/978-1-62703-281-0_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2023]
Abstract
In this chapter, we review some recent methods designed for detecting recurrent copy number regions, that is, genomic regions that show evidence of being altered in a set of samples. We analyze Affymetrix SNP6 data from 87 Her2-type breast tumors from a recent study using three different methods, showing different definitions and features of common regions: studying heterogeneity in copy number profiles, refining candidates for driver oncogenes, and consolidating broad amplifications.
Affiliation(s)
- Oscar M Rueda
  - Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, Cambridge, UK.
12
Scharpf RB, Beaty TH, Schwender H, Younkin SG, Scott AF, Ruczinski I. Fast detection of de novo copy number variants from SNP arrays for case-parent trios. BMC Bioinformatics 2012; 13:330. [PMID: 23234608 PMCID: PMC3576329 DOI: 10.1186/1471-2105-13-330] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 12/15/2011] [Accepted: 12/07/2012] [Indexed: 11/10/2022] Open
Abstract
Background In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios. Results Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance that can reduce technical variation from probe effects and genomic waves. We use circular binary segmentation to segment the minimum distance and maximum a posteriori estimation to infer de novo CNVs from the segmented genome. Compared to PennCNV on simulated data, MinimumDistance identifies fewer false positives on average and is comparable to PennCNV with respect to false negatives. Genomic waves contribute to discordance of PennCNV and MinimumDistance for high coverage de novo calls, while highly concordant calls on chromosome 22 were validated by quantitative PCR. Computationally, MinimumDistance provides a nearly 8-fold increase in speed relative to the joint HMM in a study of oral cleft trios. Conclusions Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM with MinimumDistance being much faster.
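A rough sketch of a trio-based statistic in the spirit of the minimum distance is given below. It assumes the statistic is, at each marker, whichever of the two offspring-minus-parent log R ratio differences has the smaller absolute value; this reading of the abstract is an assumption, and the function name and toy values are hypothetical rather than taken from the MinimumDistance software.

```python
def minimum_distance(offspring, father, mother):
    """Per-marker signed difference between offspring and the closer parent.

    Inputs are equal-length lists of log R ratios. A value near zero means the
    offspring resembles at least one parent at that marker; a value far from
    zero means it differs from both, as expected for a de novo event.
    """
    out = []
    for o, f, m in zip(offspring, father, mother):
        d_father, d_mother = o - f, o - m
        out.append(d_father if abs(d_father) <= abs(d_mother) else d_mother)
    return out


# Hypothetical log R ratios at three markers.
child = [-0.50, 0.00, -0.60]
dad = [0.00, 0.05, -0.55]
mom = [0.10, -0.10, 0.00]
print(minimum_distance(child, dad, mom))
```

In the toy data the first marker differs from both parents (a candidate de novo loss), whereas the third marker is low in both child and father and therefore yields a value near zero. Artefacts shared by all three samples, such as probe effects, largely cancel in the difference.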
Affiliation(s)
- Robert B Scharpf
  - Department of Oncology, Johns Hopkins University, Baltimore, MD, USA.
13
Baer C, Claus R, Frenzel LP, Zucknick M, Park YJ, Gu L, Weichenhan D, Fischer M, Pallasch CP, Herpel E, Rehli M, Byrd JC, Wendtner CM, Plass C. Extensive promoter DNA hypermethylation and hypomethylation is associated with aberrant microRNA expression in chronic lymphocytic leukemia. Cancer Res 2012; 72:3775-85. [PMID: 22710432 DOI: 10.1158/0008-5472.can-12-0803] [Citation(s) in RCA: 109] [Impact Index Per Article: 8.4] [Indexed: 11/16/2022]
Abstract
Dysregulated microRNA (miRNA) expression contributes to the pathogenesis of hematopoietic malignancies, including chronic lymphocytic leukemia (CLL). However, an understanding of the mechanisms that cause aberrant miRNA transcriptional control is lacking. In this study, we comprehensively investigated the role and extent of miRNA epigenetic regulation in CLL. Genome-wide profiling conducted on 24 CLL and 10 healthy B cell samples revealed global DNA methylation patterns upstream of miRNA sequences that distinguished malignant from healthy cells and identified putative miRNA promoters. Integration of DNA methylation and miRNA promoter data led to the identification of 128 recurrent miRNA targets for aberrant promoter DNA methylation. DNA hypomethylation accounted for more than 60% of all aberrant promoter-associated DNA methylation in CLL, and promoter DNA hypomethylation was restricted to well-defined regions. Individual hyper- and hypomethylated promoters allowed discrimination of CLL samples from healthy controls. Promoter DNA methylation patterns were confirmed in an independent patient cohort, with 11 miRNAs consistently showing an inverse correlation between DNA methylation status and expression level. Together, our findings characterize the role of epigenetic changes in the regulation of miRNA transcription and create a repository of disease-specific promoter regions that may provide additional insights into the pathogenesis of CLL.
Affiliation(s)
- Constance Baer
  - Department of Epigenomics and Cancer Risk Factors, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, Heidelberg, Germany
14
Seifert M, Gohr A, Strickert M, Grosse I. Parsimonious higher-order hidden Markov models for improved array-CGH analysis with applications to Arabidopsis thaliana. PLoS Comput Biol 2012; 8:e1002286. [PMID: 22253580 PMCID: PMC3257270 DOI: 10.1371/journal.pcbi.1002286] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 05/05/2011] [Accepted: 10/11/2011] [Indexed: 12/19/2022] Open
Abstract
Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are only based on first-order HMMs having constrained abilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines indicating that parsimonious HMMs are also well-suited for the analysis of non-plant specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open source Java library Jstacs (www.jstacs.de/index.php/PHHMM).
Array-based comparative genomics is a standard approach for the identification of DNA copy number polymorphisms between closely related genomes. The huge amounts of data produced by these experiments require efficient and accurate bioinformatics tools for the identification of copy number polymorphisms. Hidden Markov Models (HMMs) are frequently used for analyzing such data sets, but current models are based on first-order HMMs only having limited capabilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. We develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling these dependencies to overcome this limitation. In an in-depth case study with Arabidopsis thaliana, we find that parsimonious higher-order HMMs clearly improve the identification of copy number polymorphisms in comparison to standard first-order HMMs and other frequently used methods. Functional analysis of identified polymorphisms revealed details of genomic differences between the accessions C24 and Col-0 of Arabidopsis thaliana. An additional study on human cell lines further indicates that parsimonious HMMs are well-suited for the analysis of Array-CGH data.
Affiliation(s)
- Michael Seifert
  - Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
15
Nicholas TJ, Baker C, Eichler EE, Akey JM. A high-resolution integrated map of copy number polymorphisms within and between breeds of the modern domesticated dog. BMC Genomics 2011; 12:414. [PMID: 21846351 PMCID: PMC3166287 DOI: 10.1186/1471-2164-12-414] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Received: 01/10/2011] [Accepted: 08/16/2011] [Indexed: 01/22/2023] Open
Abstract
Background Structural variation contributes to the rich genetic and phenotypic diversity of the modern domestic dog, Canis lupus familiaris, although compared to other organisms, catalogs of canine copy number variants (CNVs) are poorly defined. To this end, we developed a customized high-density tiling array across the canine genome and used it to discover CNVs in nine genetically diverse dogs and a gray wolf. Results In total, we identified 403 CNVs that overlap 401 genes, which are enriched for defense/immunity, oxidoreductase, protease, receptor, signaling molecule and transporter genes. Furthermore, we performed detailed comparisons between CNVs located within versus outside of segmental duplications (SDs) and find that CNVs in SDs are enriched for gene content and complexity. Finally, we compiled all known dog CNV regions and genotyped them with a custom aCGH chip in 61 dogs from 12 diverse breeds. These data allowed us to perform the first population genetics analysis of canine structural variation and identify CNVs that potentially contribute to breed specific traits. Conclusions Our comprehensive analysis of canine CNVs will be an important resource in genetically dissecting canine phenotypic and behavioral variation.
Affiliation(s)
- Thomas J Nicholas
  - Department of Genome Sciences, University of Washington, 1705 NE Pacific, Seattle, WA 98195, USA
16
Stjernqvist S, Rydén T, Greenman CD. Model-integrated estimation of normal tissue contamination for cancer SNP allelic copy number data. Cancer Inform 2011; 10:159-73. [PMID: 21695067 PMCID: PMC3118450 DOI: 10.4137/cin.s6873] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/12/2023] Open
Abstract
SNP allelic copy number data provide intensity measurements for the two alleles separately. We present a method that estimates the number of copies of each allele at each SNP position, using a continuous-index hidden Markov model. The method is especially suited to cancer data, since the model includes the fraction of normal tissue contamination that is often present in tumor samples. The continuous-index structure takes into account the distances between the SNPs, and is therefore appropriate even when SNPs are unequally spaced. In a simulation study we show that the method performs favorably compared to previous methods even with as much as 70% normal contamination. We also provide results from applications to clinical data produced using the Affymetrix Genome-Wide SNP 6.0 platform.
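How a continuous-index structure accounts for unequal SNP spacing can be sketched with the simplest case, a two-state continuous-time Markov chain, whose transition probabilities over a genomic distance have a closed form. The model in the paper has more states and its own parameterisation; the two-state reduction, the function name and the rates below are assumptions made purely for illustration.

```python
import math


def transition_matrix(distance, rate_01, rate_10):
    """Transition probabilities of a two-state continuous-time Markov chain.

    rate_01 and rate_10 are the switching rates per unit distance; the result
    is the 2x2 matrix of probabilities of moving between states over `distance`.
    """
    total = rate_01 + rate_10
    decay = math.exp(-total * distance)
    pi0, pi1 = rate_10 / total, rate_01 / total  # stationary probabilities
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]


# Hypothetical rates per base pair: nearby SNPs almost always share a state,
# while distant SNPs are close to independent.
for d in (100, 100_000, 10_000_000):
    print(d, transition_matrix(d, rate_01=1e-6, rate_10=4e-6))
```

With a discrete-index HMM the same transition matrix would be applied between every pair of adjacent SNPs regardless of how far apart they are; here the matrix tends to the identity as the distance shrinks and to the stationary distribution as it grows.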
Affiliation(s)
- Susann Stjernqvist
- Centre for Mathematical Sciences, Lund University, Box 118, 221 00 Lund, Sweden, Department of Mathematics, Royal Institute of Technology, 100 44 Stockholm, Sweden
17
Seifert M, Strickert M, Schliep A, Grosse I. Exploiting prior knowledge and gene distances in the analysis of tumor expression profiles with extended Hidden Markov Models. Bioinformatics 2011; 27:1645-52. [PMID: 21511716 DOI: 10.1093/bioinformatics/btr199] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION Changes in gene expression levels play a central role in tumors. Additional information about the distribution of gene expression levels and distances between adjacent genes on chromosomes should be integrated into the analysis of tumor expression profiles. RESULTS We use a Hidden Markov Model with distance-scaled transition matrices (DSHMM) to incorporate chromosomal distances of adjacent genes on chromosomes into the identification of differentially expressed genes in breast cancer. We train the DSHMM by integrating prior knowledge about potential distributions of expression levels of differentially expressed and unchanged genes in tumor. We find that especially the combination of these data and to a lesser extent the modeling of distances between adjacent genes contribute to a substantial improvement of the identification of differentially expressed genes in comparison to other existing methods. This performance benefit is also supported by the identification of genes well known to be associated with breast cancer. That suggests applications of DSHMMs for screening of other tumor expression profiles. AVAILABILITY The DSHMM is available as part of the open-source Java library Jstacs (www.jstacs.de/index.php/DSHMM).
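The idea of letting transition probabilities depend on the distance between adjacent genes can be sketched as follows. This is a generic two-state continuous-time Markov chain, not the DSHMM scaling implemented in Jstacs; the function name and the rate parameters are hypothetical and chosen only for illustration.

```python
import math


def two_state_transition(distance, rate_01, rate_10):
    """Transition matrix between two hidden states over a given distance.

    Assumption: states switch according to a continuous-time Markov chain
    with rate_01 (state 0 -> 1) and rate_10 (state 1 -> 0) per unit distance.
    """
    total = rate_01 + rate_10
    if total <= 0 or distance < 0:
        raise ValueError("rates must sum to a positive value; distance >= 0")
    pi0 = rate_10 / total  # stationary probability of state 0
    pi1 = rate_01 / total  # stationary probability of state 1
    decay = math.exp(-total * distance)
    # Close neighbours (decay near 1) tend to keep their state; distant
    # neighbours (decay near 0) revert to the stationary distribution.
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


near = two_state_transition(distance=0.01, rate_01=1.0, rate_10=3.0)
far = two_state_transition(distance=10.0, rate_01=1.0, rate_10=3.0)
print(near[0][0] > far[0][0])  # staying in state 0 is more likely when close
```

With this parameterization, genes that lie close together share their state with high probability, while widely separated genes are treated as nearly independent.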
Affiliation(s)
- Michael Seifert
- Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
18
Yuan A, Chen G, Xiong J, He W, Rotimi C. Bayesian Frequentist hybrid Model with Application to the Analysis of Gene Copy Number Changes. J Appl Stat 2011; 38:987-1005. [PMID: 24014930 PMCID: PMC3762327 DOI: 10.1080/02664761003692449] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
Abstract
Gene copy number (GCN) changes are common characteristics of many genetic diseases. Comparative genomic hybridization (CGH) is a technology widely used today to screen for GCN changes in mutant cells genome-wide at high resolution. Statistical methods for analyzing such CGH data have been evolving. Existing methods are either frequentist or fully Bayesian. The former often have a computational advantage, while the latter can incorporate prior information into the model but can be misleading when sound prior information is lacking. In an attempt to take full advantage of both approaches, we develop a Bayesian-frequentist hybrid approach, in which a subset of the model parameters is inferred by the Bayesian method and the remaining parameters by the frequentist method. This new hybrid approach provides advantages over the Bayesian or frequentist method used alone. This is especially the case when sound prior information is available on part of the parameters and the sample size is relatively small. Spatial dependence and false discovery rate are also discussed, and the parameter estimation is efficient. As an illustration, we used the proposed hybrid approach to analyze a real CGH data set.
Affiliation(s)
- Ao Yuan
- National Human Genome Center, Howard University, Washington D.C. USA
19
Guo B, Villagran A, Vannucci M, Wang J, Davis C, Man TK, Lau C, Guerra R. Bayesian estimation of genomic copy number with single nucleotide polymorphism genotyping arrays. BMC Res Notes 2010; 3:350. [PMID: 21192799 PMCID: PMC3023756 DOI: 10.1186/1756-0500-3-350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2010] [Accepted: 12/30/2010] [Indexed: 11/19/2022]
Abstract
Background The identification of copy number aberration in the human genome is an important area in cancer research. We develop a model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. Results The performance of the algorithm is examined on both simulated and real cancer data, and it is compared with the popular CNAG algorithm for copy number detection. Conclusions We demonstrate that our Bayesian mixture model performs at least as well as the hidden Markov model based CNAG algorithm and in certain cases does better. One of the added advantages of our method is the flexibility of modeling normal cell contamination in tumor samples.
Affiliation(s)
- Beibei Guo
- Department of Statistics, Rice University, 6100 Main, Houston, TX 77005-1827, USA.
20
Zhang ZD, Gerstein MB. Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model. BMC Bioinformatics 2010; 11:539. [PMID: 21034510 PMCID: PMC2992546 DOI: 10.1186/1471-2105-11-539] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/17/2010] [Accepted: 10/31/2010] [Indexed: 11/17/2022]
Abstract
Background Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale. Results We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms. Conclusions In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.
Affiliation(s)
- Zhengdong D Zhang
- Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
21
A Bayesian analysis for identifying DNA copy number variations using a compound Poisson process. EURASIP Journal on Bioinformatics & Systems Biology 2010; 2010:268513. [PMID: 20976296 DOI: 10.1155/2010/268513] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/03/2010] [Revised: 07/29/2010] [Accepted: 08/06/2010] [Indexed: 11/17/2022]
Abstract
To study chromosomal aberrations that may lead to cancer formation or genetic diseases, the array-based Comparative Genomic Hybridization (aCGH) technique is often used for detecting DNA copy number variants (CNVs). Various methods have been developed for gaining CNVs information based on aCGH data. However, most of these methods make use of the log-intensity ratios in aCGH data without taking advantage of other information such as the DNA probe (e.g., biomarker) positions/distances contained in the data. Motivated by the specific features of aCGH data, we developed a novel method that takes into account the estimation of a change point or locus of the CNV in aCGH data with its associated biomarker position on the chromosome using a compound Poisson process. We used a Bayesian approach to derive the posterior probability for the estimation of the CNV locus. To detect loci of multiple CNVs in the data, a sliding window process combined with our derived Bayesian posterior probability was proposed. To evaluate the performance of the method in the estimation of the CNV locus, we first performed simulation studies. Finally, we applied our approach to real data from aCGH experiments, demonstrating its applicability.
22
Wagner JR, Ge B, Pokholok D, Gunderson KL, Pastinen T, Blanchette M. Computational analysis of whole-genome differential allelic expression data in human. PLoS Comput Biol 2010; 6:e1000849. [PMID: 20628616 PMCID: PMC2900287 DOI: 10.1371/journal.pcbi.1000849] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 12/15/2009] [Accepted: 06/02/2010] [Indexed: 12/16/2022]
Abstract
Allelic imbalance (AI) is a phenomenon where the two alleles of a given gene are expressed at different levels in a given cell, either because of epigenetic inactivation of one of the two alleles, or because of genetic variation in regulatory regions. Recently, Bing et al. have described the use of genotyping arrays to assay AI at a high resolution (∼750,000 SNPs across the autosomes). In this paper, we investigate computational approaches to analyze this data and identify genomic regions with AI in an unbiased and robust statistical manner. We propose two families of approaches: (i) a statistical approach based on z-score computations, and (ii) a family of machine learning approaches based on Hidden Markov Models. Each method is evaluated using previously published experimental data sets as well as with permutation testing. When applied to whole genome data from 53 HapMap samples, our approaches reveal that allelic imbalance is widespread (most expressed genes show evidence of AI in at least one of our 53 samples) and that most AI regions in a given individual are also found in at least a few other individuals. While many AI regions identified in the genome correspond to known protein-coding transcripts, others overlap with recently discovered long non-coding RNAs. We also observe that genomic regions with AI not only include complete transcripts with consistent differential expression levels, but also more complex patterns of allelic expression such as alternative promoters and alternative 3′ end. The approaches developed not only shed light on the incidence and mechanisms of allelic expression, but will also help towards mapping the genetic causes of allelic expression and identify cases where this variation may be linked to diseases. 
Measures of gene expression, and the search for regulatory regions in the genome responsible for differences in levels of gene expression, is one of the key paths of research used to identify disease causing genes, as well as explain differences between healthy individuals. Typically, experiments have measured and compared gene expression in multiple individuals, and used this information to attempt to map regulatory regions responsible. Differences in environment between individuals can, however, cause differences in gene expression unrelated to the underlying regulatory sequence. New genotyping technologies enable the measurement of expression of both copies of a particular gene, at loci that are heterozygous within a particular individual. This will therefore act as an internal control, as environmental factors will continue to affect the expression of both copies of a gene at presumably equal levels, and differences in expression are more likely to be explicable by differences in regulatory regions specific to the two copies of the gene itself. Differences between regulatory regions are expected to lead to differences in expression of the two copies (or the two alleles) of a particular gene, also known as allelic imbalance. We describe a set of signal processing methods for the reliable detection of allelic expression within the genome.
Affiliation(s)
- James R. Wagner
- School of Computer Science, McGill University, Montreal, Quebec, Canada
- Bing Ge
- McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada
- Tomi Pastinen
- McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada
- Department of Human and Medical Genetics, McGill University Health Centre, McGill University, Montreal, Quebec, Canada
- Mathieu Blanchette
- School of Computer Science, McGill University, Montreal, Quebec, Canada
23
Choi H, Qin ZS, Ghosh D. A double-layered mixture model for the joint analysis of DNA copy number and gene expression data. J Comput Biol 2010; 17:121-37. [PMID: 20170400 DOI: 10.1089/cmb.2009.0019] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022]
Abstract
Copy number aberration is a common form of genomic instability in cancer. Gene expression is closely tied to cytogenetic events by the central dogma of molecular biology, and serves as a mediator of copy number changes in disease phenotypes. Accordingly, it is of interest to develop proper statistical methods for jointly analyzing copy number and gene expression data. This work describes a novel Bayesian inferential approach for a double-layered mixture model (DLMM) which directly models the stochastic nature of copy number data and identifies genes that are abnormally expressed due to aberrant copy number. Simulation studies were conducted to illustrate the robustness of DLMM under various settings of copy number aberration frequency, confounding effects, and signal-to-noise ratio in gene expression data. Analysis of a real breast cancer data set shows that DLMM is able to identify expression changes specifically attributable to copy number aberration in tumors and that a sample-specific index built from the selected genes is correlated with relevant clinical information.
Affiliation(s)
- Hyungwon Choi
- Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
24
Mei TS, Salim A, Calza S, Seng KC, Seng CK, Pawitan Y. Identification of recurrent regions of Copy-Number Variants across multiple individuals. BMC Bioinformatics 2010; 11:147. [PMID: 20307285 PMCID: PMC2851607 DOI: 10.1186/1471-2105-11-147] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 10/15/2009] [Accepted: 03/22/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Algorithms and software for CNV detection have been developed, but they detect the CNV regions sample-by-sample with individual-specific breakpoints, while common CNV regions are likely to occur at the same genomic locations across different individuals in a homogenous population. Current algorithms to detect common CNV regions do not account for the varying reliability of the individual CNVs, typically reported as confidence scores by SNP-based CNV detection algorithms. General methodologies for identifying these recurrent regions, especially those directed at SNP arrays, are still needed. RESULTS In this paper, we describe two new approaches for identifying common CNV regions based on (i) the frequency of occurrence of reliable CNVs, where reliability is determined by high confidence scores, and (ii) a weighted frequency of occurrence of CNVs, where the weights are determined by the confidence scores. In addition, motivated by the fact that we often observe partially overlapping CNV regions as a mixture of two or more distinct subregions, regions identified using the two approaches can be fine-tuned to smaller sub-regions using a clustering algorithm. We compared the performance of the methods with sequencing-based results in terms of discordance rates, rates of departure from Hardy-Weinberg equilibrium (HWE) and average frequency and size of the identified regions. The discordance rates as well as the rates of departure from HWE decrease when we select CNVs with higher confidence scores. We also performed comparisons with two previously published methods, STAC and GISTIC, and showed that the methods we consider are better at identifying low-frequency but high-confidence CNV regions. CONCLUSIONS The proposed methods for identifying common CNV regions in multiple individuals perform well compared to existing methods. The identified common regions can be used for downstream analyses such as group comparisons in association studies.
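The second approach in this abstract, a frequency of occurrence weighted by confidence scores, can be sketched in a few lines. This is an illustrative reading of the idea, not the authors' implementation: the input layout (per-individual lists of `(start, end, confidence)` calls on a shared probe index, with `end` exclusive and confidence scaled to [0, 1]) and the function name are assumptions.

```python
def weighted_cnv_frequency(calls_per_individual, n_positions):
    """Confidence-weighted frequency of CNV calls at each probe position.

    calls_per_individual: one list per individual of (start, end, confidence)
    tuples; `end` is exclusive and confidence is assumed to lie in [0, 1].
    """
    totals = [0.0] * n_positions
    for calls in calls_per_individual:
        # Within one individual, overlapping calls count once per position,
        # using the highest confidence that covers it.
        best = [0.0] * n_positions
        for start, end, confidence in calls:
            for pos in range(max(start, 0), min(end, n_positions)):
                best[pos] = max(best[pos], confidence)
        for pos in range(n_positions):
            totals[pos] += best[pos]
    n_individuals = len(calls_per_individual)
    if n_individuals == 0:
        return totals
    return [total / n_individuals for total in totals]


calls = [
    [(0, 3, 1.0)],   # individual 1: high-confidence call on probes 0-2
    [(2, 5, 0.5)],   # individual 2: lower-confidence call on probes 2-4
]
print(weighted_cnv_frequency(calls, n_positions=6))
# prints [0.5, 0.5, 0.75, 0.25, 0.25, 0.0]
```

Thresholding the resulting profile would give candidate common regions. Setting every confidence to 1 for calls above a cutoff and dropping the rest corresponds to the first, unweighted approach.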
Affiliation(s)
- Teo Shu Mei
- Department of Epidemiology and Public Health, National University of Singapore, 16 Medical Drive, Singapore
25
Shearin AL, Ostrander EA. Leading the way: canine models of genomics and disease. Dis Model Mech 2010; 3:27-34. [PMID: 20075379 DOI: 10.1242/dmm.004358] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.0] [Indexed: 12/13/2022]
Abstract
In recent years Canis familiaris, the domestic dog, has drawn considerable attention as a system in which to investigate the genetics of disease susceptibility, morphology and behavior. Because dogs show remarkable intrabreed homogeneity, coupled with striking interbreed heterogeneity, the dog offers unique opportunities to understand the genetic underpinnings of natural variation in mammals, a portion of which is disease susceptibility. In this review, we highlight the unique features of the dog, such as population diversity and breed structure, that make it particularly amenable to genetic studies. We highlight recent advances in understanding the architecture of the dog genome, which propel the system to the forefront of consideration when selecting a system for disease gene studies. The most notable benefit of using the dog for genetic studies is that dogs get many of the same diseases as humans, with a similar frequency, and the same genetic factors are often involved. We discuss two approaches for localizing disease genes in the dog and provide examples of ongoing studies.
Affiliation(s)
- Abigail L Shearin
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
26
Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S, Futreal PA, Stratton MR. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics 2009; 11:164-75. [PMID: 19837654 PMCID: PMC2800165 DOI: 10.1093/biostatistics/kxp045] [Citation(s) in RCA: 168] [Impact Index Per Article: 10.5] [Indexed: 02/03/2023]
Abstract
High-throughput oligonucleotide microarrays are commonly employed to investigate genetic disease, including cancer. The algorithms employed to extract genotypes and copy number variation function optimally for diploid genomes, which are usually associated with inherited disease. However, cancer genomes are aneuploid in nature, leading to systematic errors when using these techniques. We introduce a preprocessing transformation and hidden Markov model algorithm bespoke to cancer. This produces genotype classification, specification of regions of loss of heterozygosity, and absolute allelic copy number segmentation. Accurate prediction is demonstrated with a combination of independent experimental techniques. These methods are exemplified with Affymetrix genome-wide SNP 6.0 data from 755 cancer cell lines, enabling inference upon a number of features of biological interest. These data and the coded algorithm are freely available for download.
Affiliation(s)
- Chris D Greenman
- Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
27
Rueda OM, Diaz-Uriarte R. Detection of recurrent copy number alterations in the genome: taking among-subject heterogeneity seriously. BMC Bioinformatics 2009; 10:308. [PMID: 19775444 PMCID: PMC2760535 DOI: 10.1186/1471-2105-10-308] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 04/01/2009] [Accepted: 09/23/2009] [Indexed: 11/24/2022]
Abstract
BACKGROUND Alterations in the number of copies of genomic DNA that are common or recurrent among diseased individuals are likely to contain disease-critical genes. Unfortunately, defining common or recurrent copy number alteration (CNA) regions remains a challenge. Moreover, the heterogeneous nature of many diseases requires that we search for common or recurrent CNA regions that affect only some subsets of the samples (without knowledge of the regions and subsets affected), but this is neglected by most methods. RESULTS We have developed two methods to define recurrent CNA regions from aCGH data. Our methods are unique and qualitatively different from existing approaches: they detect regions over both the complete set of arrays and alterations that are common only to some subsets of the samples (i.e., alterations that might characterize previously unknown groups); they use probabilities of alteration as input and return probabilities of being a common region, thus allowing researchers to modify thresholds as needed; the two parameters of the methods have an immediate, straightforward, biological interpretation. Using data from previous studies, we show that we can detect patterns that other methods miss and that researchers can modify, as needed, thresholds of immediate interpretability and develop custom statistics to answer specific research questions. CONCLUSION These methods represent a qualitative advance in the location of recurrent CNA regions, highlight the relevance of population heterogeneity for definitions of recurrence, and can facilitate the clustering of samples with respect to patterns of CNA. Ultimately, the methods developed can become important tools in the search for genomic regions harboring disease-critical genes.
Affiliation(s)
- Oscar M Rueda
- Structural and Computational Biology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Breast Cancer Functional Genomics, Cancer Research UK, Cambridge, UK
- Ramon Diaz-Uriarte
- Structural and Computational Biology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
28
Tai YC, Kvale MN, Witte JS. Segmentation and estimation for SNP microarrays: a Bayesian multiple change-point approach. Biometrics 2009; 66:675-83. [PMID: 19764955 DOI: 10.1111/j.1541-0420.2009.01328.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
High-density single-nucleotide polymorphism (SNP) microarrays provide a useful tool for the detection of copy number variants (CNVs). The analysis of such large amounts of data is complicated, especially with regard to determining where copy numbers change and their corresponding values. In this article, we propose a Bayesian multiple change-point model (BMCP) for segmentation and estimation of SNP microarray data. Segmentation concerns separating a chromosome into regions of equal copy number differences between the sample of interest and some reference, and involves the detection of locations of copy number difference changes. Estimation concerns determining true copy number for each segment. Our approach not only gives posterior estimates for the parameters of interest, namely locations for copy number difference changes and true copy number estimates, but also useful confidence measures. In addition, our algorithm can segment multiple samples simultaneously, and infer both common and rare CNVs across individuals. Finally, for studies of CNVs in tumors, we incorporate an adjustment factor for signal attenuation due to tumor heterogeneity or normal contamination that can improve copy number estimates.
Affiliation(s)
- Yu Chuan Tai
- Institute for Human Genetics, Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94143-0794, USA.
29
Rueda OM, Diaz-Uriarte R. RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 2009; 25:1959-60. [PMID: 19420051 PMCID: PMC2712338 DOI: 10.1093/bioinformatics/btp307] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 03/04/2009] [Revised: 04/20/2009] [Accepted: 04/30/2009] [Indexed: 11/29/2022]
Abstract
SUMMARY Several methods have been proposed to detect copy number changes and recurrent regions of copy number variation from aCGH, but few methods return probabilities of alteration explicitly, which are the direct answer to the question 'is this probe/region altered?' RJaCGH fits a Non-Homogeneous Hidden Markov model to the aCGH data using Markov Chain Monte Carlo with Reversible Jump, and returns the probability that each probe is gained or lost. Using these probabilities, recurrent regions (over sets of individuals) of copy number alteration can be found. AVAILABILITY RJaCGH is available as an R package from CRAN repositories (e.g. http://cran.r-project.org/web/packages).
Affiliation(s)
- Oscar M Rueda
- Structural Biology and Biocomputing Programme, Spanish National Cancer Center (CNIO), Madrid 28029, Spain.
30
Wu LY, Chipman HA, Bull SB, Briollais L, Wang K. A Bayesian segmentation approach to ascertain copy number variations at the population level. Bioinformatics 2009; 25:1669-79. [PMID: 19389735 DOI: 10.1093/bioinformatics/btp270] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
MOTIVATION Efficient and accurate ascertainment of copy number variations (CNVs) at the population level is essential to understand the evolutionary process and population genetics, and to apply CNVs in population-based genome-wide association studies for complex human diseases. We propose a novel Bayesian segmentation approach to identify CNVs in a defined population of any size. It is computationally efficient and provides statistical evidence for the detected CNVs through the Bayes factor. This approach has the unique feature of carrying out segmentation and assigning copy number status simultaneously-a desirable property that current segmentation methods do not share. RESULTS In comparisons with popular two-step segmentation methods for a single individual using benchmark simulation studies, we find the new approach to perform competitively with respect to false discovery rate and sensitivity in breakpoint detection. In a simulation study of multiple samples with recurrent copy numbers, the new approach outperforms two leading single sample methods. We further demonstrate the effectiveness of our approach in population-level analysis of previously published HapMap data. We also apply our approach in studying population genetics of CNVs. AVAILABILITY R programs are available at http://www.mshri.on.ca/mitacs/software/SOFTWARE.HTML
Affiliation(s)
- Long Yang Wu
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada.
31
Cancer gene discovery in mouse and man. Biochim Biophys Acta Rev Cancer 2009; 1796:140-61. [PMID: 19285540 PMCID: PMC2756404 DOI: 10.1016/j.bbcan.2009.03.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 01/02/2009] [Revised: 03/03/2009] [Accepted: 03/05/2009] [Indexed: 12/31/2022]
Abstract
The elucidation of the human and mouse genome sequence and developments in high-throughput genome analysis, and in computational tools, have made it possible to profile entire cancer genomes. In parallel with these advances mouse models of cancer have evolved into a powerful tool for cancer gene discovery. Here we discuss the approaches that may be used for cancer gene identification in both human and mouse and discuss how a cross-species 'oncogenomics' approach to cancer gene discovery represents a powerful strategy for finding genes that drive tumourigenesis.
32
Nicholas TJ, Cheng Z, Ventura M, Mealey K, Eichler EE, Akey JM. The genomic architecture of segmental duplications and associated copy number variants in dogs. Genome Res 2009; 19:491-9. [PMID: 19129542 DOI: 10.1101/gr.084715.108] [Citation(s) in RCA: 123] [Impact Index Per Article: 7.7] [Indexed: 01/19/2023]
Abstract
Structural variation is an important and abundant source of genetic and phenotypic variation. Here we describe the first systematic and genome-wide analysis of segmental duplications and associated copy number variants (CNVs) in the modern domesticated dog, Canis familiaris, which exhibits considerable morphological, physiological, and behavioral variation. Through computational analyses of the publicly available canine reference sequence, we estimate that segmental duplications comprise approximately 4.21% of the canine genome. Segmental duplications overlap 841 genes and are significantly enriched for specific biological functions such as immunity and defense and KRAB box transcription factors. We designed high-density tiling arrays spanning all predicted segmental duplications and performed aCGH in a panel of 17 breeds and a gray wolf. In total, we identified 3583 CNVs, approximately 68% of which were found in two or more samples that map to 678 unique regions. CNVs span 429 genes that are involved in a wide variety of biological processes such as olfaction, immunity, and gene regulation. Our results provide insight into mechanisms of canine genome evolution and generate a valuable resource for future evolutionary and phenotypic studies.
Affiliation(s)
- Thomas J Nicholas
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
|
33
|
Barutcuoglu Z, Airoldi EM, Dumeaux V, Schapire RE, Troyanskaya OG. Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields. Bioinformatics 2008; 25:1307-13. [PMID: 19052061 PMCID: PMC2677736 DOI: 10.1093/bioinformatics/btn585] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome. RESULTS Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.
Affiliation(s)
- Zafer Barutcuoglu
- Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540, USA
|
34
|
Andersson R, Bruder CEG, Piotrowski A, Menzel U, Nord H, Sandgren J, Hvidsten TR, Diaz de Ståhl T, Dumanski JP, Komorowski J. A segmental maximum a posteriori approach to genome-wide copy number profiling. Bioinformatics 2008; 24:751-8. [PMID: 18204059 DOI: 10.1093/bioinformatics/btn003] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Indexed: 11/15/2022]
Abstract
MOTIVATION Copy number profiling methods aim at assigning DNA copy numbers to chromosomal regions using measurements from microarray-based comparative genomic hybridizations. Among the proposed methods to this end, Hidden Markov Model (HMM)-based approaches seem promising since DNA copy number transitions are naturally captured in the model. Current discrete-index HMM-based approaches do not, however, take into account heterogeneous information regarding the genomic overlap between clones. Moreover, the majority of existing methods are restricted to chromosome-wise analysis. RESULTS We introduce a novel Segmental Maximum A Posteriori approach, SMAP, for DNA copy number profiling. Our method is based on discrete-index Hidden Markov Modeling and incorporates genomic distance and overlap between clones. We exploit a priori information through user-controllable parameterization that enables the identification of copy number deviations of various lengths and amplitudes. The model parameters may be inferred at a genome-wide scale to avoid overfitting of model parameters often resulting from chromosome-wise model inference. We report superior performances of SMAP on synthetic data when compared with two recent methods. When applied on our new experimental data, SMAP readily recognizes already known genetic aberrations including both large-scale regions with aberrant DNA copy number and changes affecting only single features on the array. We highlight the differences between the prediction of SMAP and the compared methods and show that SMAP accurately determines copy number changes and benefits from overlap consideration.
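The general idea of HMM-based copy number calling mentioned in this abstract can be sketched with a plain Viterbi decoder. This is a minimal illustration only, not SMAP itself: it ignores genomic distance and clone overlap, and the three states, their mean log2 ratios, the noise level and the self-transition probability are all assumed values chosen for the example.

```python
import math

# Hypothetical model parameters (assumptions for illustration only).
STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # assumed log2-ratio levels
SIGMA = 0.2   # assumed measurement noise (standard deviation)
STAY = 0.9    # assumed probability of remaining in the same state


def log_emission(state, x):
    """Log density of observing log2 ratio x under a Gaussian for this state."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log probability of moving from state prev to state cur."""
    return math.log(STAY if prev == cur else (1 - STAY) / (len(STATES) - 1))


def viterbi(ratios):
    """Return the most probable state sequence for a list of log2 ratios."""
    # Uniform prior over states for the first probe.
    score = {s: math.log(1 / len(STATES)) + log_emission(s, ratios[0]) for s in STATES}
    back = []
    for x in ratios[1:]:
        pointers, new_score = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best
            new_score[cur] = score[best] + log_transition(best, cur) + log_emission(cur, x)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


ratios = [0.02, -0.05, 0.01, 0.48, 0.55, 0.51, 0.03, -0.02, -0.47, -0.52]
calls = viterbi(ratios)
print(list(zip(ratios, calls)))
```

Because state changes are penalised through the transition probabilities, neighbouring probes tend to share a call, which is why copy number transitions are said to be naturally captured by such models.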
Affiliation(s)
- Robin Andersson
- The Linnaeus Centre for Bioinformatics, Uppsala University, 751 24 Uppsala, Sweden
|
35
|
Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008; 24:309-18. [PMID: 18203770 DOI: 10.1093/bioinformatics/btm601] [Citation(s) in RCA: 97] [Impact Index Per Article: 5.7] [Indexed: 01/10/2023]
Abstract
MOTIVATION Genomic instability in cancer leads to abnormal genome copy number alterations (CNA) that are associated with the development and behavior of tumors. Advances in microarray technology have allowed for greater resolution in detection of DNA copy number changes (amplifications or deletions) across the genome. However, the increase in the number of measured signals and the accompanying noise from the array probes present a challenge for accurate and fast identification of the breakpoints that define CNA. This article proposes a novel detection technique that uses piecewise constant (PWC) vectors to represent genome copy number and sparse Bayesian learning (SBL) to detect CNA breakpoints. METHODS First, a compact linear algebra representation for the genome copy number is developed from normalized probe intensities. Second, SBL is applied and optimized to infer locations where copy number changes occur. Third, a backward elimination (BE) procedure is used to rank the inferred breakpoints; and a cut-off point can be efficiently adjusted in this procedure to control for the false discovery rate (FDR). RESULTS The performance of our algorithm is evaluated using simulated and real genome datasets and compared to other existing techniques. Our approach achieves the highest accuracy and lowest FDR while improving computational speed by several orders of magnitude. The proposed algorithm has been developed into a free-standing software application (GADA, Genome Alteration Detection Algorithm). AVAILABILITY http://biron.usc.edu/~piquereg/GADA
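To illustrate why a piecewise constant profile admits a sparse representation, the sketch below encodes a noise-free profile as its first value plus the jumps at its breakpoints, then rebuilds it by cumulative summation. It is only an illustration of the representation; it does not implement SBL, backward elimination or GADA, and the function names and example values are hypothetical.

```python
from itertools import accumulate


def breakpoints(profile, tol=1e-9):
    """Return (index, jump) pairs where a piecewise constant profile changes level."""
    return [
        (i, profile[i] - profile[i - 1])
        for i in range(1, len(profile))
        if abs(profile[i] - profile[i - 1]) > tol
    ]


def reconstruct(first_value, jumps, length):
    """Rebuild the profile from its first value and the sparse list of jumps."""
    steps = [0.0] * length
    steps[0] = first_value
    for index, jump in jumps:
        steps[index] = jump
    # Cumulative sum of the sparse step vector gives the piecewise constant profile.
    return list(accumulate(steps))


# Hypothetical noise-free log2-ratio profile over ten probes.
profile = [0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, -1.0, -1.0, 0.0]
jumps = breakpoints(profile)
rebuilt = reconstruct(profile[0], jumps, len(profile))
print(jumps)
```

Ten probe values are described here by three nonzero jumps, so detecting breakpoints amounts to finding the few nonzero entries of the step vector; with noisy data that selection is the hard part, which is where a sparsity-promoting method comes in.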
Affiliation(s)
- Roger Pique-Regi
- Signal and Image Processing Institute, Ming Hsieh Department of Electrical Engineering, Viterbi School of Engineering, University of Southern California, EEB 400, 3740 McClintock Ave, Los Angeles, CA 90089-2564, USA.
|
36
|
A response to Yu et al. "A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array", BMC Bioinformatics 2007, 8: 145. BMC Bioinformatics 2007; 8:394. [PMID: 17939873 PMCID: PMC2222656 DOI: 10.1186/1471-2105-8-394] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/23/2007] [Accepted: 10/16/2007] [Indexed: 12/16/2022] Open
Abstract
Background Yu et al. (BMC Bioinformatics 2007, 8:145) have recently compared the performance of several methods for the detection of genomic amplification and deletion breakpoints using data from high-density single nucleotide polymorphism arrays. One of the methods compared is our non-homogeneous Hidden Markov Model approach. Our approach uses Markov Chain Monte Carlo for inference, but Yu et al. ran the sampler for a severely insufficient number of iterations for a Markov Chain Monte Carlo-based method. Moreover, they did not use the appropriate reference level for the non-altered state. Methods We reran the analysis in Yu et al. using appropriate settings for both the Markov Chain Monte Carlo iterations and the reference level. Additionally, to show how easy it is to obtain answers to additional specific questions, we have added a new analysis targeted specifically to the detection of breakpoints. Results The reanalysis shows that the performance of our method is comparable to that of the other methods analyzed. In addition, we can provide probabilities of a given spot being a breakpoint, something unique among the methods examined. Conclusion Markov Chain Monte Carlo methods require using a sufficient number of iterations before they can be assumed to yield samples from the distribution of interest. Running our method with too small a number of iterations cannot be representative of its performance. Moreover, our analysis shows how our original approach can be easily adapted to answer specific additional questions (e.g., identify edges).
|
37
|
Díaz-Uriarte R, Rueda OM. ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data. PLoS One 2007; 2:e737. [PMID: 17710137 PMCID: PMC1940324 DOI: 10.1371/journal.pone.0000737] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 05/07/2007] [Accepted: 07/09/2007] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Copy number alterations (CNAs) in genomic DNA have been associated with complex human diseases, including cancer. One of the most common techniques to detect CNAs is array-based comparative genomic hybridization (aCGH). The availability of aCGH platforms and the need for identification of CNAs have resulted in a wealth of methodological studies. METHODOLOGY/PRINCIPAL FINDINGS ADaCGH is an R package and a web-based application for the analysis of aCGH data. It implements eight methods for detection of CNAs, gains and losses of genomic DNA, including all of the best performing ones from two recent reviews (CBS, GLAD, CGHseg, HMM). For improved speed, we use parallel computing (via MPI). Additional information (GO terms, PubMed citations, KEGG and Reactome pathways) is available for individual genes, and for sets of genes with altered copy numbers. CONCLUSIONS/SIGNIFICANCE ADaCGH represents a qualitative increase in the standards of these types of applications: a) all of the best performing algorithms are included, not just one or two; b) we do not limit ourselves to providing a thin layer of CGI on top of existing BioConductor packages, but instead carefully use parallelization, examining different schemes, and are able to achieve significant decreases in user waiting time (factors up to 45x); c) we have added functionality not currently available in some methods, to adapt to recent recommendations (e.g., merging of segmentation results in wavelet-based and CGHseg algorithms); d) we incorporate redundancy, fault-tolerance and checkpointing, which are unique among web-based, parallelized applications; e) all of the code is available under open source licenses, allowing others to build upon, copy, and adapt our code for other software projects.
Affiliation(s)
- Ramón Díaz-Uriarte
- Structural Biology and Biocomputing Programme, Spanish National Cancer Center, Madrid, Spain.
|