1
Sinha R, Pal RK, De RK. A novel method addressing NGS-based mappability bias for sensitive detection of DNA alterations. J Bioinform Comput Biol 2024; 22:2450009. [PMID: 39030667 DOI: 10.1142/s0219720024500094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/21/2024]
Abstract
A turning point in cancer research was the introduction of massively parallel sequencing technology, which greatly reduced the cost and time of genome sequencing. This enhanced the scope for detecting and analyzing the role of structural alterations in cancer. However, certain biases exist in NGS-based approaches, which badly affect the CNV identification process. Moreover, DNA repeats in CNV regions need special attention, as they degrade the performance of the majority of existing CNV detection tools, even after a generalized bias correction method is applied. This motivated the present work, in which a novel method has been designed to address the issue of DNA repeats, and thereby the mappability bias, in regions of CNV. The method consists of three phases. The first phase computes the alignment information of uniquely mapped DNA reads, considering the base quality and base mismatch parameters at nucleotide-level precision. The second and third phases use a novel approach to allocate the non-uniquely mapped reads to an optimal region of the DNA repeats based on a probabilistic membership model. The proposed method is capable of identifying CNVs present in coding as well as non-coding regions of the DNA, and is also capable of detecting CNVs in DNA repeat regions. The methodology achieves a sensitivity greater than [Formula: see text] in the simulations performed. On real data, the detected variants are validated against the Database of Genomic Variants, where the percentage overlap is also greater than 95%, and the method achieves much better breakpoint prediction than other popular bias-correction CNV detection methods.
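The idea of assigning non-uniquely mapped reads by a membership probability can be sketched as follows. This is a minimal illustration only, not the authors' model: it assumes that the membership of an ambiguous read in each candidate region is proportional to that region's uniquely mapped read count plus a pseudocount. The function name `allocate_multireads`, the region labels, and the pseudocount of 1 are all hypothetical.

```python
def allocate_multireads(unique_depth, multireads):
    """Distribute ambiguous reads across their candidate regions.

    unique_depth: dict mapping region name -> uniquely mapped read count.
    multireads:   list where each item is the list of candidate regions
                  for one non-uniquely mapped read.
    Returns a dict mapping region name -> expected (fractional) read count.
    """
    expected = {region: float(count) for region, count in unique_depth.items()}
    for candidates in multireads:
        # Assumed membership weight: unique depth plus a pseudocount of 1,
        # so regions without unique reads can still receive a share.
        weights = [unique_depth.get(region, 0) + 1.0 for region in candidates]
        total = sum(weights)
        for region, weight in zip(candidates, weights):
            expected[region] = expected.get(region, 0.0) + weight / total
    return expected


if __name__ == "__main__":
    depth = {"repeat_copy_A": 9, "repeat_copy_B": 0}
    reads = [["repeat_copy_A", "repeat_copy_B"]]
    print(allocate_multireads(depth, reads))
```

Each ambiguous read contributes exactly one read in total, split across its candidate regions, so the overall read count is preserved.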
Affiliation(s)
- Rituparna Sinha
  - Information Technology, Heritage Institute of Technology, Anandapur, Kolkata, West Bengal, India
- Rajat Kumar Pal
  - Computer Science and Engineering Department, University of Calcutta, Kolkata, India
- Rajat Kumar De
  - Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
2
Sinha R, Pal RK, De RK. ENLIGHTENMENT: A Scalable Annotated Database of Genomics and NGS-Based Nucleotide Level Profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:155-168. [PMID: 38055361 DOI: 10.1109/tcbb.2023.3340067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/08/2023]
Abstract
The revolution in sequencing technologies has enabled human genomes to be sequenced at very low cost and in little time, leading to exponential growth in the availability of whole-genome sequences. However, a complete understanding of our genome and its association with cancer is still a long way off. Researchers are striving to detect new variants and find their association with diseases, which gives rise to the need to aggregate this Big Data into a common, standard, scalable platform. In this work, a database named Enlightenment has been implemented, which makes genomic data integrated from eight public databases, and DNA sequencing profiles of H. sapiens, available on a single platform. Annotated results with respect to cancer-specific biomarkers, pharmacogenetic biomarkers and their association with variability in drug response, and DNA profiles along with novel copy number variants are computed and stored, and are accessible through a web interface. To overcome the challenge of storing and processing NGS-based whole-genome DNA sequences, Enlightenment has been extended and deployed to HBase, a flexible and horizontally scalable database distributed over a Hadoop cluster. This would enable the integration of other omics data into the database, enlightening the path towards the eradication of cancer.
3
Sinha R, Pal RK, De RK. GenSeg and MR-GenSeg: A Novel Segmentation Algorithm and its Parallel MapReduce Based Approach for Identifying Genomic Regions With Copy Number Variations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:443-454. [PMID: 32750860 DOI: 10.1109/tcbb.2020.3000661] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Identifying intragenic as well as intergenic DNA sequences that have structural alterations is an important research area, since such alterations may be the root cause of many neurological and autoimmune diseases, including cancer. Working with whole-genome NGS data has provided new insight in this regard, but has led to an explosion of data that is growing exponentially. Hence, the challenge lies in finding efficient means of storing and processing this big data. In this study, we have developed a novel segmentation algorithm, called GenSeg, and its parallel MapReduce-based algorithm, called MR-GenSeg, for detecting copy number variations. In order to annotate CNVs (variants), segments formed by GenSeg/MR-GenSeg have been represented in a novel way using a binary tree, where each node is a CNV event. GenSeg considers the position-specific data of the whole-genome DNA sequence, so that precise identification of breakpoints is possible. GenSeg/MR-GenSeg has been compared with twelve popular CNV detection algorithms, where it has outperformed the others in terms of sensitivity and has achieved a good F-score value. MR-GenSeg has excelled in terms of speedup when compared with these algorithms. The effect of CNVs on immunoglobulin (IG) genes has also been analysed in this study. Availability: The source code is available at https://github.com/rituparna-sinha/MapReduce-GENSEG.
4
Yuan X, Xu X, Zhao H, Duan J. ERINS: Novel Sequence Insertion Detection by Constructing an Extended Reference. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1893-1901. [PMID: 31751246 DOI: 10.1109/tcbb.2019.2954315] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Next generation sequencing technology has led to the development of methods for the detection of novel sequence insertions (nsINS). Multiple signatures from short reads are usually extracted to improve nsINS detection performance. However, characterization of nsINSs larger than the mean insert size is still challenging. This article presents a new method, ERINS, to detect nsINS contents and genotypes across the full size range. It integrates the features of structural variations and mapping states of split reads to find nsINS breakpoints, and then adopts a left-most mapping strategy to infer nsINS content by iteratively extending the standard reference at each breakpoint. Finally, it realigns all reads to the extended reference and infers nsINS genotypes through statistical testing on read counts. We test and validate the performance of ERINS on simulated and real sequencing datasets. The simulation results demonstrate that it outperforms several peer methods with respect to sensitivity and precision. The real data application indicates that ERINS obtains results highly consistent with those previously reported and detects nsINSs over 200 base pairs that many other methods fail to detect. In conclusion, ERINS can be used as a supplement to existing tools and will become a routine approach for characterizing nsINSs.
5
Qin F, Luo X, Cai G, Xiao F. Shall genomic correlation structure be considered in copy number variants detection? Brief Bioinform 2021; 22:6295811. [PMID: 34114005 DOI: 10.1093/bib/bbab215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Revised: 04/16/2021] [Accepted: 05/17/2021] [Indexed: 11/14/2022]
Abstract
Copy number variation has been identified as a major source of genomic variation associated with disease susceptibility. With the advent of whole-exome sequencing (WES) technology, massive WES data have been generated, allowing for the identification of copy number variants (CNVs) in the protein-coding regions with direct functional interpretation. We have previously shown evidence of the genomic correlation structure in array data and developed a novel chromosomal breakpoint detection algorithm, LDcnv, which showed significantly improved detection power through integrating the correlation structure in a systematic modeling manner. However, it remains unexplored whether the genomic correlation exists in WES data and how such correlation structure integration can improve the CNV detection accuracy. In this study, we first explored the correlation structure of the WES data using the 1000 Genomes Project data. Both real raw read depth and median-normalized data showed strong evidence of the correlation structure. Motivated by this fact, we proposed a correlation-based method, CORRseq, as a novel release of the LDcnv algorithm in profiling WES data. The performance of CORRseq was evaluated in extensive simulation studies and real data analysis from the 1000 Genomes Project. CORRseq outperformed the existing methods in detecting medium and large CNVs. In conclusion, it would be more advantageous to model genomic correlation structure in detecting relatively long CNVs. This study provides great insights for methodology development of CNV detection with NGS data.
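One simple way to probe for correlation structure in a read-depth profile is the lag-k autocorrelation along the genome. The sketch below is an illustration under that assumption and is not the CORRseq or LDcnv procedure. The function name `lag_autocorrelation` and the toy profiles are hypothetical.

```python
def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of a read-depth profile at the given lag.

    values: per-target (e.g. per-exon) normalized read depths in genomic order.
    Returns a value in [-1, 1]; 0.0 is returned for a constant profile.
    """
    n = len(values)
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    if denominator == 0:
        return 0.0
    numerator = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return numerator / denominator


if __name__ == "__main__":
    smooth = [1, 2, 3, 4, 5, 6, 7, 8]   # neighbouring targets resemble each other
    jagged = [1, -1, 1, -1, 1, -1]      # neighbouring targets anti-correlated
    print(lag_autocorrelation(smooth), lag_autocorrelation(jagged))
```

A clearly positive value at small lags suggests that neighbouring targets carry shared signal, which is the kind of structure a correlation-aware caller could exploit.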
Affiliation(s)
- Fei Qin
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina (USC), Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Science, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
6
Yan C, He J, Luo J, Wang J, Zhang G, Luo H. SIns: A Novel Insertion Detection Approach Based on Soft-Clipped Reads. Front Genet 2021; 12:665812. [PMID: 33995493 PMCID: PMC8120196 DOI: 10.3389/fgene.2021.665812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2021] [Accepted: 04/06/2021] [Indexed: 11/13/2022]
Abstract
As a common type of structural variation, an insertion refers to the addition of a DNA sequence into an individual genome and is usually associated with some inherited diseases. In recent years, many methods have been proposed for detecting insertions. However, the accurate calling of insertions remains a challenging task. In this study, we propose a novel insertion detection approach based on soft-clipped reads, which is called SIns. First, based on the alignments between paired reads and the reference genome, SIns extracts breakpoints from soft-clipped reads and determines insertion locations. The insert size information about paired reads is then further clustered to determine the genotype, and SIns subsequently adopts Minia to assemble the insertion sequences. Experimental results show that SIns can achieve better performance than other methods in terms of the F-score value for simulated and true datasets.
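The first step described above, reading candidate breakpoints off soft-clipped alignments, can be sketched from SAM fields alone. This is a minimal illustration and not the SIns implementation: it assumes each read is given as a (POS, CIGAR) pair with a 1-based leftmost position, and it keeps a site only when a hypothetical minimum number of reads support it. The names `softclip_breakpoints`, `candidate_sites`, and `min_support` are illustrative.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def softclip_breakpoints(pos, cigar):
    """Return reference coordinates where a soft clip meets aligned bases.

    pos:   1-based leftmost aligned reference position (SAM POS field).
    cigar: SAM CIGAR string, e.g. "80M20S".
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    core = [item for item in ops if item[1] != "H"]  # ignore hard clips
    breakpoints = []
    if core and core[0][1] == "S":
        breakpoints.append(pos)  # left clip: first aligned base
    ref_length = sum(length for length, op in core if op in REF_CONSUMING)
    if len(core) > 1 and core[-1][1] == "S":
        breakpoints.append(pos + ref_length - 1)  # right clip: last aligned base
    return breakpoints


def candidate_sites(reads, min_support=2):
    """Keep breakpoint coordinates supported by at least min_support reads."""
    counts = Counter(
        site for pos, cigar in reads for site in softclip_breakpoints(pos, cigar)
    )
    return sorted(site for site, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    reads = [(100, "80M20S"), (120, "60M40S"), (179, "30S70M"), (300, "100M")]
    print(candidate_sites(reads))
```

In practice, nearby coordinates would usually be clustered within a tolerance rather than matched exactly; exact matching keeps the sketch short.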
Affiliation(s)
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Junyi He
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Jianlin Wang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Ge Zhang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
7
Statistical Considerations on NGS Data for Inferring Copy Number Variations. Methods Mol Biol 2021; 2243:27-58. [PMID: 33606251 DOI: 10.1007/978-1-0716-1103-6_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/27/2022]
Abstract
Next-generation sequencing (NGS) technology has revolutionized research in genetics and genomics, resulting in massive NGS data and opening more fronts to answer unresolved issues in genetics. NGS data are usually stored at three levels: image files, sequence tags, and alignment reads. The sizes of these types of data usually range from several hundred gigabytes to several terabytes. Biostatisticians and bioinformaticians typically work with the aligned NGS read count data (the last level of NGS data) for data modeling and interpretation. To home in on the use of NGS technology, researchers utilize it to profile the whole genome to study DNA copy number variations (CNVs) for an individual subject (or patient) as well as groups of subjects (or patients). The resulting aligned NGS read count data are then modeled by proper mathematical and statistical approaches so that the loci of CNVs can be accurately detected. In this book chapter, a summary of the most widely used statistical methods for detecting CNVs using NGS data is given. The goal is to provide readers with a comprehensive resource of available statistical approaches for inferring DNA copy number variations using NGS data.
8
Alshawaqfeh M, Al Kawam A, Serpedin E, Datta A. Robust Recurrent CNV Detection in the Presence of Inter-Subject Variability. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1056-1067. [PMID: 30387737 DOI: 10.1109/tcbb.2018.2878560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The study of recurrent copy number variations (CNVs) plays an important role in understanding the onset and evolution of complex diseases such as cancer. Array-based comparative genomic hybridization (aCGH) is a widely used microarray based technology for identifying CNVs. However, due to high noise levels and inter-sample variability, detecting recurrent CNVs from aCGH data remains a challenging topic. This paper proposes a novel method for identification of the recurrent CNVs. In the proposed method, the noisy aCGH data is modeled as the superposition of three matrices: a full-rank matrix of weighted piece-wise generating signals accounting for the clean aCGH data, a Gaussian noise matrix to model the inherent experimentation errors and other sources of error, and a sparse matrix to capture the sparse inter-sample (sample-specific) variations. We demonstrated the ability of our method to separate accurately recurrent CNVs from sample-specific variations and noise in both simulated (artificial) data and real data. The proposed method produced more accurate results than current state-of-the-art methods used in recurrent CNV detection and exhibited robustness to noise and sample-specific variations.
9
Xi J, Li A, Wang M. HetRCNA: A Novel Method to Identify Recurrent Copy Number Alternations from Heterogeneous Tumor Samples Based on Matrix Decomposition Framework. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:422-434. [PMID: 29994262 DOI: 10.1109/tcbb.2018.2846599] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 06/08/2023]
Abstract
A common strategy for discovering cancer-associated copy number aberrations (CNAs) from a cohort of cancer samples is to detect recurrent CNAs (RCNAs). Although previous methods can successfully identify communal RCNAs shared by nearly all tumor samples, detecting subgroup-specific RCNAs and their related subgroup samples from heterogeneous cancer samples is still beyond these existing approaches. In this paper, we introduce a novel integrated method called HetRCNA, which can identify statistically significant subgroup-specific RCNAs and their related subgroup samples. Based on a matrix decomposition framework with a weight constraint, HetRCNA can successfully measure the subgroup samples by coefficients of the left vectors with the weight constraint, and subgroup-specific RCNAs by coefficients of the right vectors and a significance test. When we evaluate HetRCNA on a simulated dataset, the results show that HetRCNA gives the best performance among the competing methods and is robust to the noise factors of the simulated data. When HetRCNA is applied to a real breast cancer dataset, our approach successfully identifies a number of RCNA regions, and the result is highly correlated with the results of the other two investigated approaches. Notably, the genomic regions identified by HetRCNA harbor many breast cancer related genes reported by previous research.
10
Yang X, Han G, Cai H, Song Y. Recovering Hidden Diagonal Structures via Non-Negative Matrix Factorization with Multiple Constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1760-1772. [PMID: 28371782 DOI: 10.1109/tcbb.2017.2690282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/07/2023]
Abstract
Revealing data with intrinsically diagonal block structures is particularly useful for analyzing groups of highly correlated variables. Earlier research based on non-negative matrix factorization (NMF) has been shown to be effective in representing such data by decomposing the observed data into two factors, where one factor is considered to be the feature and the other the expansion loading from a linear algebra perspective. If the data are sampled from multiple independent subspaces, the loading factor would possess a diagonal structure under an ideal matrix decomposition. However, the standard NMF method and its variants have not been reported to exploit this type of data via direct estimation. To address this issue, a non-negative matrix factorization with multiple constraints model is proposed in this paper. The constraints include a sparsity norm on the feature matrix and a total variational norm on each column of the loading matrix. The proposed model is shown to be capable of efficiently recovering diagonal block structures hidden in observed samples. An efficient numerical algorithm using the alternating direction method of multipliers is proposed for optimizing the new model. Compared with several benchmark models, the proposed method performs robustly and effectively for simulated and real biological data.
11
Zhang Z, Cheng H, Hong X, Di Narzo AF, Franzen O, Peng S, Ruusalepp A, Kovacic JC, Bjorkegren JLM, Wang X, Hao K. EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data. Nucleic Acids Res 2019; 47:e39. [PMID: 30722045 PMCID: PMC6468244 DOI: 10.1093/nar/gkz068] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/04/2018] [Revised: 12/17/2018] [Accepted: 01/25/2019] [Indexed: 12/30/2022]
Abstract
The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at raw data level; (b) assembles individual CNV calls into CNV regions (CNVRs) from multiple existing callers with complementary strengths by a heuristic algorithm; (c) re-genotypes each CNVR with local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries by local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
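Step (b) above, assembling individual CNV calls from several callers into CNV regions, can be illustrated with a plain interval merge. This is only a sketch under a simplifying assumption: any two calls on the same chromosome that overlap are placed in the same region, which is not the heuristic used by ensembleCNV. The function name `merge_calls_into_regions` and the caller labels are hypothetical.

```python
def merge_calls_into_regions(calls):
    """Merge overlapping CNV calls into regions.

    calls: iterable of (chrom, start, end, caller) tuples with start <= end.
    Returns a list of dicts with the merged span and the set of callers
    that contributed at least one call to the region.
    """
    regions = []
    for chrom, start, end, caller in sorted(calls, key=lambda c: (c[0], c[1], c[2])):
        last = regions[-1] if regions else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the current region: extend it and record the caller.
            last["end"] = max(last["end"], end)
            last["callers"].add(caller)
        else:
            regions.append(
                {"chrom": chrom, "start": start, "end": end, "callers": {caller}}
            )
    return regions


if __name__ == "__main__":
    calls = [
        ("chr1", 100, 200, "caller_a"),
        ("chr1", 150, 300, "caller_b"),
        ("chr1", 500, 600, "caller_a"),
        ("chr2", 100, 200, "caller_c"),
    ]
    for region in merge_calls_into_regions(calls):
        print(region)
```

The set of contributing callers per region gives a crude measure of agreement that later steps, such as re-genotyping or boundary refinement, could build on.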
Affiliation(s)
- Zhongyang Zhang
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Haoxiang Cheng
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Xiumei Hong
  - Center on the Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Antonio F Di Narzo
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Oscar Franzen
  - Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Karolinska Universitetssjukhuset, Huddinge, Sweden
- Shouneng Peng
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Arno Ruusalepp
  - Department of Cardiac Surgery, Tartu University Hospital, Tartu, Estonia
- Jason C Kovacic
  - Cardiovascular Research Center, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Johan L M Bjorkegren
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Karolinska Universitetssjukhuset, Huddinge, Sweden
- Xiaobin Wang
  - Center on the Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
  - Division of General Pediatrics & Adolescent Medicine, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Ke Hao
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - The Tenth People's Hospital, Tongji University, Shanghai 200072, China
  - College of Environmental Science and Engineering, Tongji University, Shanghai 200092, China
12
Tran HV, Kiemer AK, Helms V. Copy Number Alterations in Tumor Genomes Deleting Antineoplastic Drug Targets Partially Compensated by Complementary Amplifications. Cancer Genomics Proteomics 2018; 15:365-378. [PMID: 30194077 PMCID: PMC6199575 DOI: 10.21873/cgp.20095] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/29/2018] [Revised: 07/14/2018] [Accepted: 07/17/2018] [Indexed: 01/06/2023]
Abstract
BACKGROUND/AIM: Genomic DNA copy number alterations (CNAs) are frequent in tumors and have been catalogued by The Cancer Genome Atlas project. Emergence of chemoresistance frequently renders drug therapies ineffective. MATERIALS AND METHODS: We analyzed how CNAs recurrently found in the genomes of TCGA patients of thirty-one tumor types affect protein targets of antineoplastic (AN) agents. RESULTS: CNA deletions more frequently affected the targets of AN agents than CNA amplifications. Interestingly, in seven tumors we observed signs of compensatory CNAs. For example, in glioblastoma multiforme, two target genes (FLT1, FLT3) of the experimental drug sorafenib were recurrently deleted, whereas another target (KDR) of sorafenib was recurrently amplified. In renal clear cell carcinoma, the target FLT1 of pazopanib, sunitinib, sorafenib, and axitinib was recurrently deleted, whereas FLT4, bound by the same drugs, was recurrently amplified. CONCLUSION: Deletions of AN target proteins can be compensated by amplification of alternative targets.
Affiliation(s)
- Ha Vu Tran
  - Saarland University, Center for Bioinformatics, Saarbruecken, Germany
  - Department of Computer Science, Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Vietnam
- Alexandra K Kiemer
  - Saarland University, Department of Pharmacy, Pharmaceutical Biology, Saarbruecken, Germany
- Volkhard Helms
  - Saarland University, Center for Bioinformatics, Saarbruecken, Germany
13
Yuan X, Zhang J, Yang L, Bai J, Fan P. Detection of Significant Copy Number Variations From Multiple Samples in Next-Generation Sequencing Data. IEEE Trans Nanobioscience 2018; 17:12-20. [PMID: 29570071 DOI: 10.1109/tnb.2017.2783910] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 11/10/2022]
Abstract
Analyzing copy number variations (CNVs) from next-generation sequencing (NGS) data has become a common approach to detect disease susceptibility genes. The main challenge is how to utilize NGS data with limited coverage depth to detect significant CNVs. Here, we introduce a new statistical method, the derivative of correlation coefficient (DCC), to detect significant CNVs that recurrently occur in multiple samples using read depth signals. We use a sliding window to calculate a correlation coefficient for each genome bin, and compute the corresponding derivatives by fitting curves to the correlation coefficient. The detection of significant CNVs is then transformed into a problem of detecting significant derivatives that reflect genome breakpoints, which can be solved using statistical hypothesis testing. We tested and compared the performance of DCC against several peer methods using a large number of simulated data sets, and validated DCC using several real sequencing data sets derived from the European Genome-Phenome Archive, the DNA Data Bank of Japan, and the 1000 Genomes Project. Experimental results suggest that DCC is an effective approach for identifying CNVs, outperforming peer methods in terms of detection power and accuracy. DCC can be used to detect significant or recurrent CNVs in various NGS data sets, thus providing useful information to study genomic mutations and find disease susceptibility genes.
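The sliding-window idea can be sketched as follows. The abstract does not say which signals are correlated or how curves are fitted, so this illustration makes two explicit assumptions that are not taken from the paper: the per-bin coefficient is the mean pairwise Pearson correlation between samples inside the window, and the derivative is approximated by finite differences instead of curve fitting. All function names and the toy depth values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def windowed_correlation(depth, half_width):
    """Mean pairwise between-sample correlation in a window around each bin.

    depth: list of per-sample read-depth lists, all of the same length.
    """
    if len(depth) < 2:
        raise ValueError("at least two samples are required")
    n_bins = len(depth[0])
    profile = []
    for i in range(n_bins):
        lo, hi = max(0, i - half_width), min(n_bins, i + half_width + 1)
        pairs = [pearson(a[lo:hi], b[lo:hi]) for a, b in combinations(depth, 2)]
        profile.append(sum(pairs) / len(pairs))
    return profile


def central_difference(values):
    """Finite-difference approximation of the derivative along the bins."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    slope = [values[1] - values[0]]
    slope += [(values[i + 1] - values[i - 1]) / 2 for i in range(1, n - 1)]
    slope.append(values[-1] - values[-2])
    return slope


if __name__ == "__main__":
    # Two samples sharing a copy-number gain that starts at bin 5.
    depth = [[10] * 5 + [20] * 5, [12] * 5 + [24] * 5]
    profile = windowed_correlation(depth, half_width=2)
    print(profile)
    print(central_difference(profile))
```

In this toy input, the correlation is zero inside flat segments and rises where the window spans the shared breakpoint, so large derivative magnitudes flag the bins around it.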
14
Abstract
High-throughput biological technologies are routinely used to generate gene expression profiling or cytogenetics data. To achieve high performance, methods available in the literature have become more specialized and often require substantial computational resources. Here, we propose a new versatile method based on the data-ordering rank values. We use linear algebra and the Perron-Frobenius theorem, and also extend a method presented earlier for searching for differentially expressed genes to the detection of recurrent copy number aberrations. A result derived from the proposed method is a one-sample Student's t-test based on rank values. The proposed method is, to our knowledge, the only one that applies to both gene expression profiling and cytogenetics data sets. This new method is fast, deterministic, and requires a low computational load. Probabilities are associated with genes to allow a statistically significant subset selection in the data set. Stability scores are also introduced as quality parameters. The performance and comparative analyses were carried out using real data sets. The proposed method can be accessed through an R package available from the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/web/packages/fcros.
Affiliation(s)
- Doulaye Dembélé
  - Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), CNRS UMR 7104, INSERM U 1258, Université de Strasbourg, Illkirch-Graffenstaden, France
15
Malekpour SA, Pezeshk H, Sadeghi M. MSeq-CNV: accurate detection of Copy Number Variation from Sequencing of Multiple samples. Sci Rep 2018; 8:4009. [PMID: 29507384 PMCID: PMC5838159 DOI: 10.1038/s41598-018-22323-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/12/2017] [Accepted: 02/16/2018] [Indexed: 01/23/2023]
Abstract
Currently, only a few tools are capable of detecting genome-wide Copy Number Variations (CNVs) based on sequencing of multiple samples. Although aberrations in mate pair insertion sizes provide additional hints for CNV detection based on multiple samples, the majority of the current tools rely only on the depth of coverage. Here, we propose a new algorithm (MSeq-CNV) which allows detecting common CNVs across multiple samples. MSeq-CNV applies a mixture density for modeling aberrations in depth of coverage and abnormalities in the mate pair insertion sizes. Each component in this mixture density applies a Binomial distribution for modeling the number of mate pairs with aberration in the insertion size and a Poisson distribution for emitting the read counts, in each genomic position. MSeq-CNV is applied to simulated data and also to real data from six HapMap individuals with high-coverage sequencing in the 1000 Genomes Project. These individuals include a CEU trio of European ancestry and a YRI trio of Nigerian ethnicity. The ancestry of these individuals is studied by clustering the identified CNVs. MSeq-CNV is also applied to detect CNVs in two samples with low-coverage sequencing in the 1000 Genomes Project and six samples from the Simons Genome Diversity Project.
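The mixture described above, with a Binomial term for aberrant mate pairs and a Poisson term for read counts in each component, can be sketched as a posterior computation at a single genomic position. This is an illustration under stated assumptions, not the MSeq-CNV implementation: the two terms are treated as independent within a component, and the three components with their weights, Poisson means, and aberrant-pair probabilities are hypothetical values chosen for the example.

```python
from math import exp, lgamma, log


def log_poisson(k, lam):
    """Log probability of observing k events under Poisson(lam), lam > 0."""
    return k * log(lam) - lam - lgamma(k + 1)


def log_binomial(k, n, p):
    """Log probability of k successes in n trials, with 0 < p < 1."""
    return (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )


def posterior(read_count, n_pairs, n_aberrant, components):
    """Posterior probability of each mixture component at one position.

    components: list of (weight, poisson_mean, aberrant_pair_probability).
    """
    logs = [
        log(weight)
        + log_poisson(read_count, mean)
        + log_binomial(n_aberrant, n_pairs, prob)
        for weight, mean, prob in components
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    unnormalized = [exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


if __name__ == "__main__":
    # Hypothetical components: deletion, normal copy number, duplication.
    components = [(0.05, 15.0, 0.30), (0.90, 30.0, 0.02), (0.05, 45.0, 0.30)]
    print(posterior(30, 30, 0, components))   # typical depth, no aberrant pairs
    print(posterior(14, 14, 5, components))   # low depth, several aberrant pairs
```

Combining both signals lets a position with low depth and many aberrant mate pairs favour the deletion-like component more strongly than depth alone would.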
Affiliation(s)
- Seyed Amir Malekpour
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
| | - Hamid Pezeshk
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran.
- School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran.
- Department of Mathematics and Statistics, Concordia University, Montreal, Canada.
| | - Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| |
Collapse
|
16
|
A Total-variation Constrained Permutation Model for Revealing Common Copy Number Patterns. Sci Rep 2017; 7:9666. [PMID: 28851906 PMCID: PMC5575355 DOI: 10.1038/s41598-017-09139-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/14/2016] [Accepted: 07/24/2017] [Indexed: 01/20/2023]
Abstract
Variations in DNA copy number carry important information on genome evolution and regulation of DNA replication in cancer cells. The rapid development of single-cell sequencing technology enables exploration of gene-expression heterogeneity among single cells, providing important information on cell evolution. Evolutionary relationships in accumulated sequence data can be visualized by adjacent positioning of similar cells so that similar copy-number profiles are shown by block patterns. However, single-cell DNA sequencing data usually start from a low amount of genomic material, which requires an extra amplification step to accumulate sufficient sample, introducing noise and making regular pattern-finding challenging. In this paper, we propose to tackle the problem of recovering the hidden blocks within single-cell DNA-sequencing data through continuous sample permutations such that similar samples are positioned adjacently. The permutation is guided by the total variational norm of the recovered copy number profiles, and is continued until the total variational norm is minimized, at which point similar samples are stacked together to reveal block patterns. An efficient numerical scheme for finding this permutation is designed, tailored from the alternating direction method of multipliers. Application of this method to both simulated and real data demonstrates its ability to recover the hidden structures of single-cell DNA sequences.
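The idea of ordering samples so that a total-variation measure is minimized can be sketched on a toy example. This is only an illustration under stated assumptions: the total variation here is taken as the sum of absolute differences between adjacent samples, and the search is a brute-force enumeration of permutations, which stands in for (and is not) the ADMM-based numerical scheme mentioned in the abstract. The function names are hypothetical.

```python
from itertools import permutations


def total_variation(profiles, order):
    """Sum of absolute differences between copy-number profiles of adjacent samples."""
    tv = 0.0
    for a, b in zip(order, order[1:]):
        tv += sum(abs(x - y) for x, y in zip(profiles[a], profiles[b]))
    return tv


def best_order_bruteforce(profiles):
    """Exhaustively search sample orderings; feasible only for a handful of samples."""
    indices = range(len(profiles))
    return min(permutations(indices), key=lambda order: total_variation(profiles, order))


# Toy copy-number profiles (rows = cells, columns = genomic bins); two cell types interleaved.
PROFILES = [
    [2, 2, 3, 3],
    [2, 2, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 1, 1],
]
```

In this toy input, any ordering that places the two gain-carrying cells next to each other and the two loss-carrying cells next to each other has a lower total variation than the interleaved input order, which is the block pattern the abstract refers to.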
17
Delatola EI, Lebarbier E, Mary-Huard T, Radvanyi F, Robin S, Wong J. SegCorr a statistical procedure for the detection of genomic regions of correlated expression. BMC Bioinformatics 2017; 18:333. [PMID: 28697800 PMCID: PMC5504623 DOI: 10.1186/s12859-017-1742-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/07/2016] [Accepted: 06/26/2017] [Indexed: 01/27/2023]
Abstract
BACKGROUND Detecting local correlations in expression between neighboring genes along the genome has proved to be an effective strategy to identify possible causes of transcriptional deregulation in cancer. It has been successfully used to illustrate the role of mechanisms such as copy number variation (CNV) or epigenetic alterations as factors that may significantly alter expression in large chromosomal regions (gene silencing or gene activation). RESULTS The identification of correlated regions requires segmenting the gene expression correlation matrix into regions of homogeneously correlated genes and assessing whether the observed local correlation is significantly higher than the background chromosomal correlation. A unified statistical framework is proposed to achieve these two tasks, where optimal segmentation is efficiently performed using a dynamic programming algorithm, and detection of highly correlated regions is then achieved using an exact test procedure. We also propose a simple and efficient procedure to correct the expression signal for mechanisms already known to impact expression correlation. The performance and robustness of the proposed procedure, called SegCorr, are evaluated on simulated data. The procedure is illustrated on cancer data, where the signal is corrected for correlations caused by copy number variation. It permitted the detection of regions with high correlations linked to epigenetic marks like DNA methylation. CONCLUSIONS SegCorr is a novel method that performs correlation matrix segmentation and applies a test procedure in order to detect highly correlated regions in gene expression.
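Optimal segmentation by dynamic programming, as mentioned in this abstract, can be sketched in a simplified setting. The example below is an assumption-laden stand-in: it segments a one-dimensional signal into a fixed number of segments by minimizing the within-segment sum of squared deviations, whereas SegCorr segments a gene expression correlation matrix with its own criterion. The function names are hypothetical.

```python
def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for signal[i:j], via prefix sums."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def optimal_segmentation(signal, k):
    """Split signal into k contiguous segments with minimal total cost.

    Returns (list of (start, end) half-open bounds, total cost).
    """
    n = len(signal)
    prefix, prefix_sq = [0.0], [0.0]
    for x in signal:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    inf = float("inf")
    # best[s][j]: minimal cost of splitting signal[:j] into s segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                cost = best[s - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if cost < best[s][j]:
                    best[s][j] = cost
                    back[s][j] = i

    # Trace the optimal change points back from the end of the signal.
    bounds, j = [], n
    for s in range(k, 0, -1):
        i = back[s][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1], best[k][n]
```

The triple loop costs O(k n^2) operations, which is acceptable for a chromosome-sized list of genes but would need pruning for much longer signals.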
Affiliation(s)
- Eleni Ioanna Delatola
- AgroParisTech UMR518, Paris, 75005, France.
- INRA UMR518, Paris, 75005, France.
- Institut Curie, PSL Research University, Cedex 05, Paris, 75248, France.
- CNRS UMR144, Equipe Labellisee par La Ligue Nationale contre le Cancer, Cedex 05, Paris, 75248, France.
- Emilie Lebarbier
- AgroParisTech UMR518, Paris, 75005, France
- INRA UMR518, Paris, 75005, France
- Tristan Mary-Huard
- AgroParisTech UMR518, Paris, 75005, France
- INRA UMR518, Paris, 75005, France
- INRA, UMR 0320 - UMR 8120 Genetique Quantitative et Evolution-Le Moulon, Gif-sur-Yvette, F-91190, France
- François Radvanyi
- Institut Curie, PSL Research University, Cedex 05, Paris, 75248, France
- CNRS UMR144, Equipe Labellisee par La Ligue Nationale contre le Cancer, Cedex 05, Paris, 75248, France
- Stéphane Robin
- AgroParisTech UMR518, Paris, 75005, France
- INRA UMR518, Paris, 75005, France
- Jennifer Wong
- Institut Curie, PSL Research University, Cedex 05, Paris, 75248, France
- CNRS UMR144, Equipe Labellisee par La Ligue Nationale contre le Cancer, Cedex 05, Paris, 75248, France
- Molecular Oncology Unit, Department of Biochemistry, Hospital Saint Louis, AP-HP, Cedex 10, Paris, 75475, France
- Université Paris Diderot, Sorbonne Paris Cité, CNRS UMR7212/INSERM U944, Cedex 10, Paris, 75475, France
18
Ji T, Chen J. Statistical models for DNA copy number variation detection using read-depth data from next generation sequencing experiments. Aust N Z J Stat 2016. [DOI: 10.1111/anzs.12175] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/24/2023]
Affiliation(s)
- Tieming Ji
- Department of Statistics, University of Missouri at Columbia, Columbia MO 65211 USA
- Jie Chen
- Department of Biostatistics and Epidemiology, Medical College of Georgia, Augusta University, Augusta GA 30912 USA
19
Malekpour SA, Pezeshk H, Sadeghi M. PSE-HMM: genome-wide CNV detection from NGS data using an HMM with Position-Specific Emission probabilities. BMC Bioinformatics 2016; 18:30. [PMID: 27809781 PMCID: PMC5445519 DOI: 10.1186/s12859-016-1296-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/22/2016] [Accepted: 10/20/2016] [Indexed: 11/23/2022]
Abstract
Background Copy Number Variation (CNV) is envisaged to be a major source of large structural variations in the human genome. In recent years, many studies have applied Next Generation Sequencing (NGS) data to CNV detection. However, there is still a need for more accurate computational tools. Results In this study, mate pair NGS data are used for CNV detection in a Hidden Markov Model (HMM). The proposed HMM has position specific emission probabilities, i.e. a Gaussian mixture distribution. Each component in the Gaussian mixture distribution captures a different type of aberration that is observed in the mate pairs, after being mapped to the reference genome. These aberrations may include any increase (decrease) in the insertion size or change in the direction of mate pairs that are mapped to the reference genome. This HMM with Position-Specific Emission probabilities (PSE-HMM) is utilized for the genome-wide detection of deletions and tandem duplications. The performance of PSE-HMM is evaluated on a simulated dataset and also on real data of a Yoruban HapMap individual, NA18507. Conclusions PSE-HMM is effective in taking observation dependencies into account and reaches a high accuracy in detecting genome-wide CNVs. MATLAB programs are available at http://bs.ipm.ir/softwares/PSE-HMM/.
Affiliation(s)
- Seyed Amir Malekpour
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, 14155-6455, Iran
- Hamid Pezeshk
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, 14155-6455, Iran.
- School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
- Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
20
Malekpour SA, Pezeshk H, Sadeghi M. MGP-HMM: Detecting genome-wide CNVs using an HMM for modeling mate pair insertion sizes and read counts. Math Biosci 2016; 279:53-62. [PMID: 27424951 DOI: 10.1016/j.mbs.2016.07.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/26/2016] [Revised: 06/12/2016] [Accepted: 07/10/2016] [Indexed: 01/02/2023]
Abstract
MOTIVATION Association of Copy Number Variation (CNV) with schizophrenia, autism, developmental disabilities and fatal diseases such as cancer is verified. Recent developments in Next Generation Sequencing (NGS) have facilitated the CNV studies. However, many of the current CNV detection tools are not capable of discriminating tandem duplication from non-tandem duplications. RESULTS In this study, we propose MGP-HMM as a tool which besides detecting genome-wide deletions discriminates tandem duplications from non-tandem duplications. MGP-HMM takes mate pair abnormalities into account and predicts the digitized number of tandem or non-tandem copies. Abnormalities in the mate pair directions and insertion sizes, after being mapped to the reference genome, are elucidated using a Hidden Markov Model (HMM). For this purpose, a Mixture Gaussian density with time-dependent parameters is applied for emitting mate pair insertion sizes from HMM states. Indeed, depending on observed abnormalities in mate pair insertion size or its orientation, each component in the mixture density will have different parameters. MGP-HMM also applies a Poisson distribution for modeling read depth data. This parametric modeling of the mate pair reads enables us to estimate the length of CNVs precisely, which is an advantage over methods which rely only on read depth approach for the CNV detection. Hidden state of the proposed HMM is the digitized copy number of a genomic segment and states correspond to the multipliers of the mixture Gaussian components. The accuracy of our model is validated on a set of next generation sequencing real and simulated data and is compared to other tools.
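The read-depth part of such a model, with the digitized copy number as hidden state and Poisson-distributed read counts, can be sketched with a plain Viterbi decoder. This is a deliberately reduced illustration, not MGP-HMM itself: it ignores mate pair insertion sizes and orientations, and the transition probabilities, the per-copy read rate, the small rate floor for copy number zero and all function names are assumptions made for the example.

```python
import math


def log_poisson(x, lam):
    """Log-probability of count x under Poisson(lam)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def viterbi_copy_number(read_counts, states, rate_per_copy, stay_prob):
    """Most likely copy-number path for per-window read counts.

    states: candidate copy numbers, e.g. [0, 1, 2, 3]
    rate_per_copy: expected reads per window contributed by one copy (assumed known)
    stay_prob: probability of keeping the same copy number in the next window
    """
    n_states = len(states)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))

    def emit(copies, x):
        # Floor the rate so that copy number 0 still tolerates a few stray reads.
        return log_poisson(x, max(copies, 0.05) * rate_per_copy)

    def trans(i, j):
        return log_stay if i == j else log_move

    score = [math.log(1.0 / n_states) + emit(c, read_counts[0]) for c in states]
    back = []
    for x in read_counts[1:]:
        new_score, pointers = [], []
        for j, c in enumerate(states):
            best_i = max(range(n_states), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j) + emit(c, x))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Backtrack from the best final state.
    j = max(range(n_states), key=lambda i: score[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [states[i] for i in reversed(path)]
```

A call such as `viterbi_copy_number(counts, [0, 1, 2, 3], rate_per_copy=15.0, stay_prob=0.9)` returns one copy-number estimate per window; a read-depth-only decoder like this cannot distinguish tandem from non-tandem duplications, which is the gap the mate pair terms in the abstract are meant to fill.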
Affiliation(s)
- Seyed Amir Malekpour
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran.
- Hamid Pezeshk
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran.
- School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran.
- Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran.
21
Xi J, Li A. Discovering Recurrent Copy Number Aberrations in Complex Patterns via Non-Negative Sparse Singular Value Decomposition. IEEE/ACM Trans Comput Biol Bioinform 2016; 13:656-668. [PMID: 26372614 DOI: 10.1109/tcbb.2015.2474404] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 06/05/2023]
Abstract
Recurrent copy number aberrations (RCNAs) in multiple cancer samples are strongly associated with tumorigenesis, and RCNA discovery is helpful to cancer research and treatment. Despite the emergence of numerous RCNA discovery methods, most of them are unable to detect RCNAs in complex patterns that are influenced by complicating factors including aberrations in only a subset of samples, co-existing gains and losses, and normal-like tumor samples. Here, we propose a novel computational method, called non-negative sparse singular value decomposition (NN-SSVD), to address the RCNA discovery problem in complex patterns. In NN-SSVD, the measurement of RCNA is based on the aberration frequency in a part of the samples rather than all samples, which can circumvent the complexity of different RCNA patterns. We evaluate NN-SSVD on a synthetic dataset by comparing detection scores and Receiver Operating Characteristic curves, and the results show that NN-SSVD outperforms existing methods in RCNA discovery and demonstrates greater robustness to RCNA complicating factors. Applying our approach to a breast cancer dataset, we successfully identify a number of genomic regions that are strongly correlated with previous studies and that harbor a number of known breast cancer-associated genes.
22
Thangam M, Gopal RK. CRCDA--Comprehensive resources for cancer NGS data analysis. Database (Oxford) 2015; 2015:bav092. [PMID: 26450948 PMCID: PMC4597977 DOI: 10.1093/database/bav092] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/27/2015] [Accepted: 08/31/2015] [Indexed: 12/24/2022]
Abstract
Next generation sequencing (NGS) innovations have set a compelling landmark in life science and changed the direction of research in clinical oncology through their capacity to help diagnose and treat cancer. The aim of our portal, comprehensive resources for cancer NGS data analysis (CRCDA), is to provide a collection of different NGS tools and pipelines under diverse classes, together with cancer pathways and databases and, furthermore, literature information from PubMed. The literature data was constrained to the 18 most common cancer types, such as breast cancer, colon cancer and other cancers that occur in the worldwide population. For convenience, NGS-cancer tools have been categorized into cancer genomics, cancer transcriptomics, cancer epigenomics, quality control and visualization. Pipelines for variant detection, quality control and data analysis were listed to provide an out-of-the-box solution for NGS data analysis, which may help researchers to overcome challenges in selecting and configuring individual tools for analysing exome, whole genome and transcriptome data. An extensive search page was developed that can be queried by using (i) type of data [literature, gene data and sequence read archive (SRA) data] and (ii) type of cancer (selected based on global incidence and accessibility of data). For each category of analysis, a variety of tools is available, and the biggest challenge is in finding and using the right tool for the right application. The objective of the work is to collect the tools in each category that are available at various places and to arrange the tools and other data in a simple and user-friendly manner so that biologists and oncologists can find information more easily. To the best of our knowledge, we have collected and presented a comprehensive package of most of the resources available in cancer for NGS data analysis. Given these factors, we believe that this website will be a useful resource to the NGS research community working on cancer.
Database URL: http://bioinfo.au-kbc.org.in/ngs/ngshome.html.
Affiliation(s)
- Manonanthini Thangam
- AU-KBC Research Centre, MIT Campus of Anna University, Chromepet, Chennai, India
- Ramesh Kumar Gopal
- AU-KBC Research Centre, MIT Campus of Anna University, Chromepet, Chennai, India
23
Wang X, Li X, Cheng Y, Sun X, Sun X, Self S, Kooperberg C, Dai JY. Copy number alterations detected by whole-exome and whole-genome sequencing of esophageal adenocarcinoma. Hum Genomics 2015; 9:22. [PMID: 26374103 PMCID: PMC4570720 DOI: 10.1186/s40246-015-0044-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 06/18/2015] [Accepted: 08/25/2015] [Indexed: 02/08/2023]
Abstract
Background Esophageal adenocarcinoma (EA) is among the leading causes of cancer mortality, especially in developed countries. A high level of somatic copy number alterations (CNAs) accumulates over the decades in the progression from Barrett’s esophagus, the precursor lesion, to EA. Accurate identification of somatic CNAs is essential to understand cancer development. Many studies have been conducted for the detection of CNA in EA using microarrays. Next-generation sequencing (NGS) technologies are believed to have advantages in sensitivity and accuracy to detect CNA, yet no NGS-based CNA detection in EA has been reported. Results In this study, we analyzed whole-exome (WES) and whole-genome sequencing (WGS) data for detecting CNA from a published large-scale genomic study of EA. Two specific comparisons were conducted. First, the recurrent CNAs based on WGS and WES data from 145 EA samples were compared to those found in five previous microarray-based studies. We found that the majority of the previously identified regions were also detected in this study. Interestingly, some novel amplifications and deletions were discovered using the NGS data. In particular, SKI and PRKCZ detected in a deletion region are involved in transforming growth factor-β pathway, suggesting the potential utility of novel biomarkers for EA. Second, we compared CNAs detected in WGS and WES data from the same 15 EA samples. No large-scale CNA was identified statistically more frequently by WES or WGS, while more focal-scale CNAs were detected by WGS than by WES. Conclusions Our results suggest that NGS can replace microarrays to detect CNA in EA. WGS is superior to WES in that it can offer finer resolution for the detection, though if the interest is on recurrent CNAs, WES can be preferable to WGS for its cost-effectiveness. 
Affiliation(s)
- Xiaoyu Wang
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- Xiaohong Li
- Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- Public Health Science Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- Yichen Cheng
- Public Health Science Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- Xin Sun
- Institute of Occupational Health and Poison Control, Chinese Center for Disease Control and Prevention, Beijing, China.
- Xibin Sun
- Henan Office for Cancer Research and Control, Henan Cancer Hospital, Zhengzhou, Henan, China.
- Steve Self
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- Charles Kooperberg
- Public Health Science Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- James Y Dai
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
- Public Health Science Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
24
Masecchia S, Coco S, Barla A, Verri A, Tonini GP. Genome instability model of metastatic neuroblastoma tumorigenesis by a dictionary learning algorithm. BMC Med Genomics 2015; 8:57. [PMID: 26358114 PMCID: PMC4566396 DOI: 10.1186/s12920-015-0132-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 01/09/2015] [Accepted: 08/28/2015] [Indexed: 12/21/2022]
Abstract
Background Metastatic neuroblastoma (NB) occurs in pediatric patients as stage 4S or stage 4 and it is characterized by heterogeneous clinical behavior associated with diverse genotypes. Tumors of stage 4 contain several structural copy number aberrations (CNAs) rarely found in stage 4S. To date, NB tumorigenesis has not yet been elucidated, although it is evident that genomic instability plays a critical role in the genesis of the tumor. Here we propose a mathematical approach to decipher genomic data and we provide a new model of NB metastatic tumorigenesis. Method We elucidate NB tumorigenesis using the Enhanced Fused Lasso Latent Feature Model (E-FLLat), modeling the array comparative chromosome hybridization (aCGH) data of 190 metastatic NBs (63 stage 4S and 127 stage 4). This model for aCGH segmentation, based on the minimization of functional dictionary learning (DL), combines several penalties tailored to the specificities of aCGH data. In DL, the original signal is approximated by a linear weighted combination of atoms: the elements of the learned dictionary. Results The hierarchical structure for stage 4S shows, at the first level of the oncogenetic tree, several whole chromosome gains, apart from the unbalanced gains of 17q, 2p and 2q. Conversely, the high CNA complexity found in stage 4 tumors requires two different trees. Both stage 4 oncogenetic trees are markedly diverged, with up to five sublevels, and the 17q gain is the most common event at the first level (2/3 nodes). Moreover, the 11q deletion, one of the major unfavorable markers of disease progression, occurs before 3p loss, indicating that critical chromosome aberrations appear at early stages of tumorigenesis. Finally, we also observed a significant (p = 0.025) association between patient age and chromosome loss in stage 4 cases.
Conclusion These results led us to propose a progressive genome instability model in which NB cells initiate with a DNA synthesis uncoupled from cell division, which leads to stage 4S tumors, primarily characterized by numerical aberrations, or stage 4 tumors with high levels of genome instability resulting in complex chromosome rearrangements associated with high tumor aggressiveness and rapid disease progression.
Affiliation(s)
- Simona Coco
- Lung Cancer Unit; IRCCS A.O.U. San Martino - IST, Genova, Italy.
- Annalisa Barla
- DIBRIS, Università degli Studi di Genova, Genova, Italy.
- Gian Paolo Tonini
- Neuroblastoma Laboratory, Onco/Hematology Laboratory, Department of Woman and Child Health, University of Padua, Pediatric Research Institute, Fondazione Città della Speranza, Padua, Corso Stati Uniti, 4, 35127, Padua, Italy.
25
CNV-CH: A Convex Hull Based Segmentation Approach to Detect Copy Number Variations (CNV) Using Next-Generation Sequencing Data. PLoS One 2015; 10:e0135895. [PMID: 26291322 PMCID: PMC4546278 DOI: 10.1371/journal.pone.0135895] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 07/04/2014] [Accepted: 07/28/2015] [Indexed: 11/19/2022]
Abstract
Copy number variation (CNV) is a form of structural alteration in the mammalian DNA sequence, which is associated with many complex neurological diseases as well as cancer. The development of next generation sequencing (NGS) technology provides us with a new dimension towards detection of genomic locations with copy number variations. Here we develop an algorithm for detecting CNVs, which is based on depth of coverage data generated by NGS technology. In this work, we have used a novel way to represent the read count data as a two dimensional geometrical point. A key aspect of detecting the regions with CNVs is to devise a proper segmentation algorithm that will distinguish the genomic locations having a significant difference in read count data. We have designed a new segmentation approach in this context, using a convex hull algorithm on the geometrical representation of read count data. To our knowledge, most algorithms have used a single distribution model of read count data, but here in our approach, we have considered the read count data to follow two different distribution models independently, which adds to the robustness of detection of CNVs. In addition, our algorithm calls CNVs based on the multiple sample analysis approach, resulting in a low false discovery rate with high precision.
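The geometric building block named in this abstract, a convex hull over two-dimensional points, can be sketched as follows. The example uses Andrew's monotone chain algorithm, a standard hull construction chosen here for illustration; the abstract does not say which hull algorithm CNV-CH uses, how read counts are mapped to point coordinates, or how hull vertices are turned into segment boundaries, so those parts are deliberately left out.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull in counter-clockwise order, starting from the lowest-leftmost point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    lower = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]
```

Under the assumption that each point is a (window index, read count) pair, windows whose read counts deviate sharply from their neighbors tend to become hull vertices, which gives an intuition for why a hull can expose candidate breakpoints.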
26
Glusman G, Severson A, Dhankani V, Robinson M, Farrah T, Mauldin DE, Stittrich AB, Ament SA, Roach JC, Brunkow ME, Bodian DL, Vockley JG, Shmulevich I, Niederhuber JE, Hood L. Identification of copy number variants in whole-genome data using Reference Coverage Profiles. Front Genet 2015; 6:45. [PMID: 25741365 PMCID: PMC4330915 DOI: 10.3389/fgene.2015.00045] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 11/30/2014] [Accepted: 01/30/2015] [Indexed: 12/20/2022]
Abstract
The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation.
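Normalizing a sample's coverage against a reference profile can be sketched in a few lines. This is a simplified illustration under explicit assumptions: the reference profile is taken to be the expected diploid coverage per bin, the sample is scaled by the median coverage ratio (which assumes most of the genome is diploid), and copy number is obtained by rounding rather than by the HMM segmentation described in the abstract. The function name and the toy numbers are hypothetical.

```python
from statistics import median


def copy_number_from_profile(sample_cov, reference_profile, ploidy=2):
    """Estimate per-bin copy number by normalizing coverage to a reference profile.

    Bins with a non-positive reference value are reported as None.
    """
    ratios = [s / r for s, r in zip(sample_cov, reference_profile) if r > 0]
    # Median ratio corrects for the sample's overall sequencing depth.
    scale = median(ratios)

    estimates = []
    for s, r in zip(sample_cov, reference_profile):
        if r <= 0:
            estimates.append(None)
        else:
            estimates.append(round(ploidy * (s / r) / scale))
    return estimates


# Toy data: the reference profile captures bin-specific coverage fluctuations;
# the sample is sequenced at lower depth and has a one-copy loss in bins 2 and 3.
REFERENCE = [30, 60, 30, 30, 45, 30]
SAMPLE = [20, 40, 10, 10, 30, 20]
```

Dividing by the profile removes the bin-specific fluctuation (the second and fifth bins are not mistaken for gains), which is the effect the abstract attributes to the Reference Coverage Profiles.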
Affiliation(s)
- Dale L Bodian
- Inova Translational Medicine Institute, Inova Health System Falls Church, VA, USA
- Joseph G Vockley
- Inova Translational Medicine Institute, Inova Health System Falls Church, VA, USA
- John E Niederhuber
- Inova Translational Medicine Institute, Inova Health System Falls Church, VA, USA
- Leroy Hood
- Institute for Systems Biology Seattle, WA, USA
27
Oleksiewicz U, Tomczak K, Woropaj J, Markowska M, Stępniak P, Shah PK. Computational characterisation of cancer molecular profiles derived using next generation sequencing. Contemp Oncol (Pozn) 2015; 19:A78-91. [PMID: 25691827 PMCID: PMC4322529 DOI: 10.5114/wo.2014.47137] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/30/2023]
Abstract
Our current understanding of cancer genetics is grounded on the principle that cancer arises from a clone that has accumulated the requisite somatically acquired genetic aberrations, leading to the malignant transformation. It also results in aberrant gene and protein expression. Next generation sequencing (NGS) or deep sequencing platforms are being used to create large catalogues of changes in copy numbers, mutations, structural variations, gene fusions, gene expression, and other types of information for cancer patients. However, inferring different types of biological changes from raw reads generated using the sequencing experiments is algorithmically and computationally challenging. In this article, we outline common steps for the quality control and processing of NGS data. We highlight the importance of accurate and application-specific alignment of these reads and the methodological steps and challenges in obtaining different types of information. We comment on the importance of integrating these data and building infrastructure to analyse them. We also provide exhaustive lists of available software to obtain information and point the readers to articles comparing software for deeper insight in specialised areas. We hope that the article will guide readers in choosing the right tools for analysing oncogenomic datasets.
Affiliation(s)
- Urszula Oleksiewicz
- Laboratory of Gene Therapy, Department of Cancer Immunology, The Greater Poland Cancer Centre, Poznan, Poland
- Department of Cancer Immunology and Diagnostics, Chair of Medical Biotechnology, Poznan University of Medical Sciences, Poznan, Poland
- These authors contributed equally to this paper
- Katarzyna Tomczak
- Laboratory of Gene Therapy, Department of Cancer Immunology, The Greater Poland Cancer Centre, Poznan, Poland
- Department of Cancer Immunology and Diagnostics, Chair of Medical Biotechnology, Poznan University of Medical Sciences, Poznan, Poland
- Postgraduate School of Molecular Medicine, Medical University of Warsaw, Warsaw
- These authors contributed equally to this paper
- Jakub Woropaj
- Poznan University of Economics, Poznań, Poland
- These authors contributed equally to this paper
- Parantu K Shah
- Institute for Applied Cancer Science, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
28
Abstract
High-throughput DNA sequencing has revolutionized the study of cancer genomics with numerous discoveries that are relevant to cancer diagnosis and treatment. The latest sequencing and analysis methods have successfully identified somatic alterations, including single-nucleotide variants, insertions and deletions, copy-number aberrations, structural variants and gene fusions. Additional computational techniques have proved useful for defining the mutations, genes and molecular networks that drive diverse cancer phenotypes and that determine clonal architectures in tumour samples. Collectively, these tools have advanced the study of genomic, transcriptomic and epigenomic alterations in cancer, and their association to clinical properties. Here, we review cancer genomics software and the insights that have been gained from their application.
29
Zhou X, Liu J, Wan X, Yu W. Piecewise-constant and low-rank approximation for identification of recurrent copy number variations. Bioinformatics 2014; 30:1943-9. [PMID: 24642062 DOI: 10.1093/bioinformatics/btu131] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION The post-genome era sees urgent need for more novel approaches to extracting useful information from the huge amount of genetic data. The identification of recurrent copy number variations (CNVs) from array-based comparative genomic hybridization (aCGH) data can help understand complex diseases, such as cancer. Most of the previous computational methods focused on single-sample analysis or statistical testing based on the results of single-sample analysis. Finding recurrent CNVs from multi-sample data remains a challenging topic worth further study. RESULTS We present a general and robust method to identify recurrent CNVs from multi-sample aCGH profiles. We express the raw dataset as a matrix and demonstrate that recurrent CNVs will form a low-rank matrix. Hence, we formulate the problem as a matrix recovering problem, where we aim to find a piecewise-constant and low-rank approximation (PLA) to the input matrix. We propose a convex formulation for matrix recovery and an efficient algorithm to globally solve the problem. We demonstrate the advantages of PLA compared with alternative methods using synthesized datasets and two breast cancer datasets. The experimental results show that PLA can successfully reconstruct the recurrent CNV patterns from raw data and achieve better performance compared with alternative methods under a wide range of scenarios. AVAILABILITY AND IMPLEMENTATION The MATLAB code is available at http://bioinformatics.ust.hk/pla.zip.
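As a rough illustration of the low-rank claim in this abstract (not the authors' PLA algorithm), the following Python sketch builds a hypothetical multi-sample matrix in which every sample carries the same gained segment at a different amplitude. The function name, matrix sizes, and noise level are assumptions chosen for the example. The noise-free matrix has rank 1, and after noise is added one singular value still dominates the spectrum.

```python
import numpy as np


def recurrent_cnv_matrix(n_probes, start, end, amplitudes):
    """Build a noise-free samples x probes matrix with one shared CNV segment.

    Hypothetical helper for illustration only; it is not part of PLA.
    """
    profile = np.zeros(n_probes)
    profile[start:end] = 1.0  # piecewise-constant indicator of the shared segment
    # Each sample scales the same profile, so the rows are linearly dependent.
    return np.outer(amplitudes, profile)


rng = np.random.default_rng(0)
amplitudes = rng.uniform(0.5, 1.5, size=20)  # assumed per-sample gain levels
clean = recurrent_cnv_matrix(200, 80, 120, amplitudes)
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

clean_rank = np.linalg.matrix_rank(clean)
singular_values = np.linalg.svd(noisy, compute_uv=False)

print("rank of noise-free matrix:", clean_rank)
print("two largest singular values of noisy matrix:", singular_values[:2])
```

A recovery method in the spirit of the abstract would look for a matrix that is both piecewise constant along each row and low rank; the sketch only shows why the low-rank assumption is plausible for recurrent events.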
Affiliation(s)
- Xiaowei Zhou
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Jiming Liu
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Xiang Wan
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Weichuan Yu
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
30
Raphael BJ, Dobson JR, Oesper L, Vandin F. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome Med 2014; 6:5. [PMID: 24479672 PMCID: PMC3978567 DOI: 10.1186/gm524] [Citation(s) in RCA: 131] [Impact Index Per Article: 13.1] [Indexed: 02/07/2023] Open
Abstract
High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random, passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer.
Affiliation(s)
- Benjamin J Raphael
- Department of Computer Science, Brown University, 115 Waterman Street, Providence, RI 02912, USA
- Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, RI 02912, USA
- Jason R Dobson
- Department of Computer Science, Brown University, 115 Waterman Street, Providence, RI 02912, USA
- Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, RI 02912, USA
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, 185 Meeting Street, Providence, RI 02912, USA
- Layla Oesper
- Department of Computer Science, Brown University, 115 Waterman Street, Providence, RI 02912, USA
- Fabio Vandin
- Department of Computer Science, Brown University, 115 Waterman Street, Providence, RI 02912, USA
- Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, RI 02912, USA
31
Abstract
Exome sequencing in cancer is a powerful tool for identifying mutational events across the coding region of human genes. Here, we describe computational methods that use exome sequencing reads from cancer samples to identify somatic single nucleotide variants (SNVs), copy number alterations, and short insertions and deletions (InDels). We further describe analytical methods to generate lists of driver genes with more mutational events than expected by chance.
Affiliation(s)
- Perry Evans
- Department of Pathology, Yale University School of Medicine, New Haven, CT, USA
32
Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 2013; 14 Suppl 11:S1. [PMID: 24564169 PMCID: PMC3846878 DOI: 10.1186/1471-2105-14-s11-s1] [Citation(s) in RCA: 333] [Impact Index Per Article: 30.3] [Indexed: 01/11/2023] Open
Abstract
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) or genotyping arrays have been standard technologies to detect large regions subject to copy number changes in genomes until most recently high-resolution sequence data can be analyzed by next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
33
Zhou X, Yang C, Wan X, Zhao H, Yu W. Multisample aCGH data analysis via total variation and spectral regularization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:230-235. [PMID: 23702561 PMCID: PMC3715577 DOI: 10.1109/tcbb.2012.166] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 06/02/2023]
Abstract
DNA copy number variation (CNV) accounts for a large proportion of genetic variation. One commonly used approach to detecting CNVs is array-based comparative genomic hybridization (aCGH). Although many methods have been proposed to analyze aCGH data, it is not clear how to combine information from multiple samples to improve CNV detection. In this paper, we propose to use a matrix to approximate the multisample aCGH data and minimize the total variation of each sample as well as the nuclear norm of the whole matrix. In this way, we can make use of the smoothness property of each sample and the correlation among multiple samples simultaneously in a convex optimization framework. We also developed an efficient and scalable algorithm to handle large-scale data. Experiments demonstrate that the proposed method outperforms the state-of-the-art techniques under a wide range of scenarios and it is capable of processing large data sets with millions of probes.
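One way to write down the optimization described in this abstract is sketched below. The abstract names only the two regularizers (total variation of each sample and the nuclear norm of the whole matrix); the squared-error data term, the weights \lambda_1 and \lambda_2, and the symbols D (observed n x p samples-by-probes matrix) and X (its approximation) are assumptions introduced here for illustration, so the paper's exact formulation may differ.

```latex
\min_{X \in \mathbb{R}^{n \times p}} \;
\frac{1}{2}\,\lVert D - X \rVert_F^2
\;+\; \lambda_1 \sum_{i=1}^{n} \sum_{j=1}^{p-1} \bigl| X_{i,j+1} - X_{i,j} \bigr|
\;+\; \lambda_2 \,\lVert X \rVert_*
```

The middle term penalizes jumps between neighbouring probes within each sample (smoothness), while the nuclear norm \lVert X \rVert_*, the sum of the singular values of X, favours correlation among samples. All three terms are convex, which is consistent with the convex optimization framework mentioned in the abstract.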
Affiliation(s)
- Xiaowei Zhou
- Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China.
34
Abstract
Automated DNA sequencing instruments embody an elegant interplay among chemistry, engineering, software, and molecular biology and have built upon Sanger's founding discovery of dideoxynucleotide sequencing to perform once-unfathomable tasks. Combined with innovative physical mapping approaches that helped to establish long-range relationships between cloned stretches of genomic DNA, fluorescent DNA sequencers produced reference genome sequences for model organisms and for the reference human genome. New types of sequencing instruments that permit amazing acceleration of data-collection rates for DNA sequencing have been developed. The ability to generate genome-scale data sets is now transforming the nature of biological inquiry. Here, I provide an historical perspective of the field, focusing on the fundamental developments that predated the advent of next-generation sequencing instruments and providing information about how these instruments work, their application to biological research, and the newest types of sequencers that can extract data from single DNA molecules.
Affiliation(s)
- Elaine R Mardis
- The Genome Institute at Washington University School of Medicine, St. Louis, Missouri 63108, USA.
35
Rueda OM, Diaz-Uriarte R, Caldas C. Finding common regions of alteration in copy number data. Methods Mol Biol 2013; 973:339-53. [PMID: 23412800 DOI: 10.1007/978-1-62703-281-0_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2023]
Abstract
In this chapter, we review some recent methods designed for detecting recurrent copy number regions, that is, genomic regions that show evidence of being altered in a set of samples. We analyze Affymetrix SNP6 data from 87 Her2-type breast tumors from a recent study using three different methods, showing different definitions and features of common regions: studying heterogeneity in copy number profiles, refining candidates for driver oncogenes, and consolidating broad amplifications.
Affiliation(s)
- Oscar M Rueda
- Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, Cambridge, UK.
36
Comparative analysis of methods for identifying recurrent copy number alterations in cancer. PLoS One 2012; 7:e52516. [PMID: 23285074 PMCID: PMC3527554 DOI: 10.1371/journal.pone.0052516] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 08/07/2012] [Accepted: 11/14/2012] [Indexed: 11/19/2022] Open
Abstract
Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice since very few efforts have been focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genome, it is imperative to conduct a comprehensive comparison of performance and limitations among existing methods. In this paper, six representative methods proposed in the latest six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We explore multiple criteria including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, to evaluate performance of the methods under multiple simulation scenarios. We also characterize their abilities on applications to two real datasets obtained from cancers with lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs, and further provides new insights into their strengths and weaknesses. It is believed helpful to accelerate the development of novel and improved methods.
37
Xuan J, Yu Y, Qing T, Guo L, Shi L. Next-generation sequencing in the clinic: promises and challenges. Cancer Lett 2012; 340:284-95. [PMID: 23174106 DOI: 10.1016/j.canlet.2012.11.025] [Citation(s) in RCA: 198] [Impact Index Per Article: 16.5] [Received: 08/31/2012] [Revised: 11/13/2012] [Accepted: 11/13/2012] [Indexed: 02/06/2023]
Abstract
The advent of next generation sequencing (NGS) technologies has revolutionized the field of genomics, enabling fast and cost-effective generation of genome-scale sequence data with exquisite resolution and accuracy. Over the past years, rapid technological advances led by academic institutions and companies have continued to broaden NGS applications from research to the clinic. A recent crop of discoveries have highlighted the medical impact of NGS technologies on Mendelian and complex diseases, particularly cancer. However, the ever-increasing pace of NGS adoption presents enormous challenges in terms of data processing, storage, management and interpretation as well as sequencing quality control, which hinder the translation from sequence data into clinical practice. In this review, we first summarize the technical characteristics and performance of current NGS platforms. We further highlight advances in the applications of NGS technologies towards the development of clinical diagnostics and therapeutics. Common issues in NGS workflows are also discussed to guide the selection of NGS platforms and pipelines for specific research purposes.
Affiliation(s)
- Jiekun Xuan
- School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, China; National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
38
Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Nat Genet 2012; 44:1006-14. [PMID: 22842228 PMCID: PMC3432702 DOI: 10.1038/ng.2359] [Citation(s) in RCA: 893] [Impact Index Per Article: 74.4] [Received: 03/13/2012] [Accepted: 06/28/2012] [Indexed: 02/06/2023]
Abstract
We characterized the mutational landscape of melanoma, the form of skin cancer with the highest mortality rate, by sequencing the exomes of 147 melanomas. Sun-exposed melanomas had markedly more ultraviolet (UV)-like C>T somatic mutations compared to sun-shielded acral, mucosal and uveal melanomas. Among the newly identified cancer genes was PPP6C, encoding a serine/threonine phosphatase, which harbored mutations that clustered in the active site in 12% of sun-exposed melanomas, exclusively in tumors with mutations in BRAF or NRAS. Notably, we identified a recurrent UV-signature, an activating mutation in RAC1 in 9.2% of sun-exposed melanomas. This activating mutation, the third most frequent in our cohort of sun-exposed melanoma after those of BRAF and NRAS, changes Pro29 to serine (RAC1P29S) in the highly conserved switch I domain. Crystal structures, and biochemical and functional studies of RAC1P29S showed that the alteration releases the conformational restraint conferred by the conserved proline, causes an increased binding of the protein to downstream effectors, and promotes melanocyte proliferation and migration. These findings raise the possibility that pharmacological inhibition of downstream effectors of RAC1 signaling could be of therapeutic benefit.
39
Yuan X, Yu G, Hou X, Shih IM, Clarke R, Zhang J, Hoffman EP, Wang RR, Zhang Z, Wang Y. Genome-wide identification of significant aberrations in cancer genome. BMC Genomics 2012; 13:342. [PMID: 22839576 PMCID: PMC3428679 DOI: 10.1186/1471-2164-13-342] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 02/01/2012] [Accepted: 07/27/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Somatic Copy Number Alterations (CNAs) in human genomes are present in almost all human cancers. Systematic efforts to characterize such structural variants must effectively distinguish significant consensus events from random background aberrations. Here we introduce Significant Aberration in Cancer (SAIC), a new method for characterizing and assessing the statistical significance of recurrent CNA units. Three main features of SAIC include: (1) exploiting the intrinsic correlation among consecutive probes to assign a score to each CNA unit instead of single probes; (2) performing permutations on CNA units that preserve correlations inherent in the copy number data; and (3) iteratively detecting Significant Copy Number Aberrations (SCAs) and estimating an unbiased null distribution by applying an SCA-exclusive permutation scheme. RESULTS We test and compare the performance of SAIC against four peer methods (GISTIC, STAC, KC-SMART, CMDS) on a large number of simulation datasets. Experimental results show that SAIC outperforms peer methods in terms of larger area under the Receiver Operating Characteristics curve and increased detection power. We then apply SAIC to analyze structural genomic aberrations acquired in four real cancer genome-wide copy number data sets (ovarian cancer, metastatic prostate cancer, lung adenocarcinoma, glioblastoma). When compared with previously reported results, SAIC successfully identifies most SCAs known to be of biological significance and associated with oncogenes (e.g., KRAS, CCNE1, and MYC) or tumor suppressor genes (e.g., CDKN2A/B). Furthermore, SAIC identifies a number of novel SCAs in these copy number data that encompass tumor related genes and may warrant further studies. CONCLUSIONS Supported by a well-grounded theoretical framework, SAIC has been developed and used to identify SCAs in various cancer copy number data sets, providing useful information to study the landscape of cancer genomes. 
Open-source and platform-independent SAIC software is implemented using C++, together with R scripts for data formatting and Perl scripts for user interfacing, and it is easy to install and efficient to use. The source code and documentation are freely available at http://www.cbil.ece.vt.edu/software.htm.
Affiliation(s)
- Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, P R China
40
Yuan X, Zhang J, Yang L, Zhang S, Chen B, Geng Y, Wang Y. TAGCNA: a method to identify significant consensus events of copy number alterations in cancer. PLoS One 2012; 7:e41082. [PMID: 22815924 PMCID: PMC3399811 DOI: 10.1371/journal.pone.0041082] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/03/2012] [Accepted: 06/17/2012] [Indexed: 01/20/2023] Open
Abstract
Somatic copy number alteration (CNA) is a common phenomenon in cancer genome. Distinguishing significant consensus events (SCEs) from random background CNAs in a set of subjects has been proven to be a valuable tool to study cancer. In order to identify SCEs with an acceptable type I error rate, better computational approaches should be developed based on reasonable statistics and null distributions. In this article, we propose a new approach named TAGCNA for identifying SCEs in somatic CNAs that may encompass cancer driver genes. TAGCNA employs a peel-off permutation scheme to generate a reasonable null distribution based on a prior step of selecting tag CNA markers from the genome being considered. We demonstrate the statistical power of TAGCNA on simulated ground truth data, and validate its applicability using two publicly available cancer datasets: lung and prostate adenocarcinoma. TAGCNA identifies SCEs that are known to be involved with proto-oncogenes (e.g. EGFR, CDK4) and tumor suppressor genes (e.g. CDKN2A, CDKN2B), and provides many additional SCEs with potential biological relevance in these data. TAGCNA can be used to analyze the significance of CNAs in various cancers. It is implemented in R and is freely available at http://tagcna.sourceforge.net/.
Affiliation(s)
- Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, People’s Republic of China
- Junying Zhang
- School of Computer Science and Technology, Xidian University, Xi'an, People’s Republic of China
- * E-mail: (JZ); (YW)
- Liying Yang
- School of Computer Science and Technology, Xidian University, Xi'an, People’s Republic of China
- Shengli Zhang
- Department of Mathematics, Xidian University, Xi'an, People’s Republic of China
- Baodi Chen
- School of Computer Science and Technology, Xidian University, Xi'an, People’s Republic of China
- Yaojun Geng
- School of Computer Science and Technology, Xidian University, Xi'an, People’s Republic of China
- Yue Wang
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, United States of America
- * E-mail: (JZ); (YW)
41
Yeo CWS, Ng FSL, Chai C, Tan JMM, Koh GRH, Chong YK, Koh LWH, Foong CSF, Sandanaraj E, Holbrook JD, Ang BT, Takahashi R, Tang C, Lim KL. Parkin pathway activation mitigates glioma cell proliferation and predicts patient survival. Cancer Res 2012; 72:2543-53. [PMID: 22431710 DOI: 10.1158/0008-5472.can-11-3060] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.4] [Indexed: 12/19/2022]
Abstract
Mutations in the parkin gene, which encodes a ubiquitin ligase, are a major genetic cause of parkinsonism. Interestingly, parkin also plays a role in cancer as a putative tumor suppressor, and the gene is frequently targeted by deletion and inactivation in human malignant tumors. Here, we investigated a potential tumor suppressor role for parkin in gliomas. We found that parkin expression was dramatically reduced in glioma cells. Restoration of parkin expression promoted G(1) phase cell-cycle arrest and mitigated the proliferation rate of glioma cells in vitro and in vivo. Notably, parkin-expressing glioma cells showed a reduction in levels of cyclin D1, but not cyclin E, and a selective downregulation of Akt serine-473 phosphorylation and VEGF receptor levels. In accordance, cells derived from a parkin-null mouse model exhibited increased levels of cyclin D1, VEGF receptor, and Akt phosphorylation, and divided significantly faster when compared with wild-type cells, with suppression of these changes following parkin reintroduction. Clinically, analysis of parkin pathway activation was predictive for the survival outcome of patients with glioma. Taken together, our study provides mechanistic insight into the tumor suppressor function of parkin in brain tumors and suggests that measurement of parkin pathway activation may be used clinically as a prognostic tool in patients with brain tumor.
Affiliation(s)
- Calvin W S Yeo
- Department of Physiology, National University of Singapore, Singapore
42
VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012; 22:568-76. [PMID: 22300766 DOI: 10.1101/gr.129684.111] [Citation(s) in RCA: 3350] [Impact Index Per Article: 279.2] [Indexed: 11/25/2022]
Abstract
Cancer is a disease driven by genetic variation and mutation. Exome sequencing can be utilized for discovering these variants and mutations across hundreds of tumors. Here we present an analysis tool, VarScan 2, for the detection of somatic mutations and copy number alterations (CNAs) in exome data from tumor-normal pairs. Unlike most current approaches, our algorithm reads data from both samples simultaneously; a heuristic and statistical algorithm detects sequence variants and classifies them by somatic status (germline, somatic, or LOH); while a comparison of normalized read depth delineates relative copy number changes. We apply these methods to the analysis of exome sequence data from 151 high-grade ovarian tumors characterized as part of the Cancer Genome Atlas (TCGA). We validated some 7790 somatic coding mutations, achieving 93% sensitivity and 85% precision for single nucleotide variant (SNV) detection. Exome-based CNA analysis identified 29 large-scale alterations and 619 focal events per tumor on average. As in our previous analysis of these data, we observed frequent amplification of oncogenes (e.g., CCNE1, MYC) and deletion of tumor suppressors (NF1, PTEN, and CDKN2A). We searched for additional recurrent focal CNAs using the correlation matrix diagonal segmentation (CMDS) algorithm, which identified 424 significant events affecting 582 genes. Taken together, our results demonstrate the robust performance of VarScan 2 for somatic mutation and CNA detection and shed new light on the landscape of genetic alterations in ovarian cancer.
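The read-depth comparison mentioned in this abstract can be illustrated with a generic log2 ratio of library-size-normalized depths. The Python sketch below is a minimal example under stated assumptions, not VarScan 2's actual implementation: the function name, the pseudocount, and the use of total mapped reads as the normalizer are all hypothetical choices made for the example.

```python
import math


def log2_depth_ratio(tumor_depth, normal_depth, tumor_total, normal_total,
                     pseudocount=0.5):
    """Log2 ratio of tumor to normal depth after scaling by library size.

    Hypothetical helper for illustration; the pseudocount (an assumption)
    avoids taking the logarithm of zero in regions with no reads.
    """
    if tumor_total <= 0 or normal_total <= 0:
        raise ValueError("library sizes must be positive")
    tumor_norm = (tumor_depth + pseudocount) / tumor_total
    normal_norm = (normal_depth + pseudocount) / normal_total
    return math.log2(tumor_norm / normal_norm)


# Hypothetical region where the tumor library is twice as large as the normal.
balanced = log2_depth_ratio(200, 100, 2_000_000, 1_000_000, pseudocount=0.0)
gain = log2_depth_ratio(400, 100, 2_000_000, 1_000_000, pseudocount=0.0)

print("balanced region log2 ratio:", balanced)
print("gained region log2 ratio:", gain)
```

Values near zero suggest no relative copy number change, positive values suggest gain, and negative values suggest loss. A real pipeline would typically also correct for factors such as GC content and tumor purity before segmenting the ratios.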
43
Wong G, Leckie C, Kowalczyk A. FSR: feature set reduction for scalable and accurate multi-class cancer subtype classification based on copy number. Bioinformatics 2011; 28:151-9. [PMID: 22110244 DOI: 10.1093/bioinformatics/btr644] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 01/08/2023]
Abstract
MOTIVATION Feature selection is a key concept in machine learning for microarray datasets, where features represented by probesets are typically several orders of magnitude larger than the available sample size. Computational tractability is a key challenge for feature selection algorithms in handling very high-dimensional datasets beyond a hundred thousand features, such as in datasets produced on single nucleotide polymorphism microarrays. In this article, we present a novel feature set reduction approach that enables scalable feature selection on datasets with hundreds of thousands of features and beyond. Our approach enables more efficient handling of higher resolution datasets to achieve better disease subtype classification of samples for potentially more accurate diagnosis and prognosis, which allows clinicians to make more informed decisions in regards to patient treatment options. RESULTS We applied our feature set reduction approach to several publicly available cancer single nucleotide polymorphism (SNP) array datasets and evaluated its performance in terms of its multiclass predictive classification accuracy over different cancer subtypes, its speedup in execution as well as its scalability with respect to sample size and array resolution. Feature Set Reduction (FSR) was able to reduce the dimensions of an SNP array dataset by more than two orders of magnitude while achieving at least equal, and in most cases superior predictive classification performance over that achieved on features selected by existing feature selection methods alone. An examination of the biological relevance of frequently selected features from FSR-reduced feature sets revealed strong enrichment in association with cancer. AVAILABILITY FSR was implemented in MATLAB R2010b and is available at http://ww2.cs.mu.oz.au/~gwong/FSR.
Affiliation(s)
- Gerard Wong
- National ICT Australia, Victoria Research Laboratory, Parkville, Australia.
44
Park C, Ahn J, Yoon Y, Park S. A multi-sample based method for identifying common CNVs in normal human genomic structure using high-resolution aCGH data. PLoS One 2011; 6:e26975. [PMID: 22073121 PMCID: PMC3205051 DOI: 10.1371/journal.pone.0026975] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/21/2011] [Accepted: 10/07/2011] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND It is difficult to identify copy number variations (CNV) in normal human genomic data due to noise and non-linear relationships between different genomic regions and signal intensity. A high-resolution array comparative genomic hybridization (aCGH) containing 42 million probes, which is very large compared to previous arrays, was recently published. Most existing CNV detection algorithms do not work well because of noise associated with the large amount of input data and because most of the current methods were not designed to analyze normal human samples. Normal human genome analysis often requires a joint approach across multiple samples. However, the majority of existing methods can only identify CNVs from a single sample. METHODOLOGY AND PRINCIPAL FINDINGS We developed a multi-sample-based genomic variations detector (MGVD) that uses segmentation to identify common breakpoints across multiple samples and a k-means-based clustering strategy. Unlike previous methods, MGVD simultaneously considers multiple samples with different genomic intensities and identifies CNVs and CNV zones (CNVZs); CNVZ is a more precise measure of the location of a genomic variant than the CNV region (CNVR). CONCLUSIONS AND SIGNIFICANCE We designed a specialized algorithm to detect common CNVs from extremely high-resolution multi-sample aCGH data. MGVD showed high sensitivity and a low false discovery rate for a simulated data set, and outperformed most current methods when real, high-resolution HapMap datasets were analyzed. MGVD also had the fastest runtime compared to the other algorithms evaluated when actual, high-resolution aCGH data were analyzed. The CNVZs identified by MGVD can be used in association studies for revealing relationships between phenotypes and genomic aberrations. Our algorithm was developed with standard C++ and is available in Linux and MS Windows format in the STL library. It is freely available at: http://embio.yonsei.ac.kr/~Park/mgvd.php.
Affiliation(s)
- Chihyun Park
- Department of Computer Science, Yonsei University, Seoul, South Korea
- Jaegyoon Ahn
- Department of Computer Science, Yonsei University, Seoul, South Korea
- Youngmi Yoon
- Division of Information Engineering, Gachon University of Medicine and Science, Incheon, South Korea
- Sanghyun Park
- Department of Computer Science, Yonsei University, Seoul, South Korea
|
45
|
Scharpf RB, Irizarry RA, Ritchie ME, Carvalho B, Ruczinski I. Using the R Package crlmm for Genotyping and Copy Number Estimation. J Stat Softw 2011; 40:1-32. [PMID: 22523482 PMCID: PMC3329223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023] Open
Abstract
Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number and integration of the marker-level estimates with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R.
Affiliation(s)
- Robert B. Scharpf
- Department of Oncology, Johns Hopkins University School of Medicine, 550 N. Broadway, Suite 1103, Baltimore, MD 21218, United States of America
- Rafael A. Irizarry
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore MD 21218, United States of America
- Matthew E. Ritchie
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- Benilton Carvalho
- Department of Oncology, University of Cambridge, CRUK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE, United Kingdom
- Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore MD 21218, United States of America
|
46
|
Ritz A, Paris PL, Ittmann MM, Collins C, Raphael BJ. Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics 2011; 12:114. [PMID: 21510904 PMCID: PMC3112242 DOI: 10.1186/1471-2105-12-114] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 08/30/2010] [Accepted: 04/21/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Copy number variants (CNVs), including deletions, amplifications, and other rearrangements, are common in human and cancer genomes. Copy number data from array comparative genome hybridization (aCGH) and next-generation DNA sequencing is widely used to measure copy number variants. Comparison of copy number data from multiple individuals reveals recurrent variants. Typically, the interior of a recurrent CNV is examined for genes or other loci associated with a phenotype. However, in some cases, such as gene truncations and fusion genes, the target of the variant lies at the boundary of the variant.
RESULTS: We introduce Neighborhood Breakpoint Conservation (NBC), an algorithm for identifying rearrangement breakpoints that are highly conserved at the same locus in multiple individuals. NBC detects recurrent breakpoints at varying levels of resolution, including breakpoints whose location is exactly conserved and breakpoints whose location varies within a gene. NBC also identifies pairs of recurrent breakpoints such as those that result from fusion genes. We apply NBC to aCGH data from 36 primary prostate tumors and identify 12 novel rearrangements, one of which is the well-known TMPRSS2-ERG fusion gene. We also apply NBC to 227 glioblastoma tumors and predict 93 novel rearrangements which we further classify as gene truncations, germline structural variants, and fusion genes. A number of these variants involve the protein phosphatase PTPN12, suggesting that deregulation of PTPN12, via a variety of rearrangements, is common in glioblastoma.
CONCLUSIONS: We demonstrate that NBC is useful for detection of recurrent breakpoints resulting from copy number variants or other structural variants, and in particular identifies recurrent breakpoints that result in gene truncations or fusion genes. Software is available at http://cs.brown.edu/people/braphael/software.html.
Affiliation(s)
- Anna Ritz
- Department of Computer Science, Brown University, Providence, RI, USA.
|
47
|
Wartman LD, Larson DE, Xiang Z, Ding L, Chen K, Lin L, Cahan P, Klco JM, Welch JS, Li C, Payton JE, Uy GL, Varghese N, Ries RE, Hoock M, Koboldt DC, McLellan MD, Schmidt H, Fulton RS, Abbott RM, Cook L, McGrath SD, Fan X, Dukes AF, Vickery T, Kalicki J, Lamprecht TL, Graubert TA, Tomasson MH, Mardis ER, Wilson RK, Ley TJ. Sequencing a mouse acute promyelocytic leukemia genome reveals genetic events relevant for disease progression. J Clin Invest 2011; 121:1445-55. [PMID: 21436584 DOI: 10.1172/jci45284] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Received: 09/30/2010] [Accepted: 01/19/2011] [Indexed: 01/12/2023] Open
Abstract
Acute promyelocytic leukemia (APL) is a subtype of acute myeloid leukemia (AML). It is characterized by the t(15;17)(q22;q11.2) chromosomal translocation that creates the promyelocytic leukemia-retinoic acid receptor α (PML-RARA) fusion oncogene. Although this fusion oncogene is known to initiate APL in mice, other cooperating mutations, as yet ill defined, are important for disease pathogenesis. To identify these, we used a mouse model of APL, whereby PML-RARA expressed in myeloid cells leads to a myeloproliferative disease that ultimately evolves into APL. Sequencing of a mouse APL genome revealed 3 somatic, nonsynonymous mutations relevant to APL pathogenesis, of which 1 (Jak1 V657F) was found to be recurrent in other affected mice. This mutation was identical to the JAK1 V658F mutation previously found in human APL and acute lymphoblastic leukemia samples. Further analysis showed that JAK1 V658F cooperated in vivo with PML-RARA, causing a rapidly fatal leukemia in mice. We also discovered a somatic 150-kb deletion involving the lysine (K)-specific demethylase 6A (Kdm6a, also known as Utx) gene, in the mouse APL genome. Similar deletions were observed in 3 out of 14 additional mouse APL samples and 1 out of 150 human AML samples. In conclusion, whole genome sequencing of mouse cancer genomes can provide an unbiased and comprehensive approach for discovering functionally relevant mutations that are also present in human leukemias.
Affiliation(s)
- Lukas D Wartman
- Department of Internal Medicine, Division of Oncology, Stem Cell Biology Section, Washington University School of Medicine, Siteman Cancer Center, St. Louis, Missouri, USA
|
48
|
Ding L, Wendl MC, Koboldt DC, Mardis ER. Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet 2010; 19:R188-96. [PMID: 20843826 DOI: 10.1093/hmg/ddq391] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Indexed: 01/05/2023] Open
Abstract
The application of next-generation sequencing technology has produced a transformation in cancer genomics, generating large data sets that can be analyzed in different ways to answer a multitude of questions about the genomic alterations associated with the disease. Analytical approaches can discover focused mutations such as substitutions and small insertion/deletions, large structural alterations and copy number events. As our capacity to produce such data for multiple cancers of the same type improves, so do the demands to analyze multiple tumor genomes simultaneously. For example, pathway-based analyses that provide the full mutational impact on cellular protein networks and correlation analyses aimed at revealing causal relationships between genomic alterations and clinical presentations are both enabled. As the repertoire of data grows to include mRNA-seq, non-coding RNA-seq and methylation for multiple genomes, our challenge will be to intelligently integrate data types and genomes to produce a coherent picture of the genetic basis of cancer.
Affiliation(s)
- Li Ding
- Department of Genetics, The Genome Center at Washington University School of Medicine, 4444 Forest Park Blvd., St Louis, MO 63108, USA
|
49
|
Yuan X, Zhang J, Wang Y. Probability theory-based SNP association study method for identifying susceptibility loci and genetic disease models in human case-control data. IEEE Trans Nanobioscience 2010; 9:232-41. [PMID: 20840904 DOI: 10.1109/tnb.2010.2070805] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
One of the most challenging tasks in studying common complex human diseases is to search for both strong and weak susceptibility single-nucleotide polymorphisms (SNPs) and to identify forms of genetic disease models. A number of methods have been proposed for this purpose, but many of them have not been validated on a variety of genome datasets, so their performance in practice is unclear. In this paper, we present a novel SNP association study method based on probability theory, called ProbSNP. The method first detects SNPs by evaluating their joint probabilities in combination with disease status and selects those with the lowest joint probabilities as susceptibility SNPs; it then identifies forms of genetic disease models by testing multiple-locus interactions among the selected SNPs. The joint probabilities of combined SNPs are estimated by establishing Gaussian probability density functions whose parameters (i.e., mean and standard deviation) are evaluated from allele and haplotype frequencies. Finally, we test and validate the method using various genome datasets. We find that ProbSNP shows remarkable success in applications to both simulated genome data and real genome-wide data.
Affiliation(s)
- Xiguo Yuan
- School of Computer Science & Engineering, Xidian University, Xi'an 710071, China.
|
50
|
Koboldt DC, Ding L, Mardis ER, Wilson RK. Challenges of sequencing human genomes. Brief Bioinform 2010; 11:484-98. [PMID: 20519329 DOI: 10.1093/bib/bbq016] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Indexed: 12/16/2022] Open
Abstract
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35-250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
Affiliation(s)
- Daniel C Koboldt
- The Genome Center at Washington University, St. Louis, Missouri 63108, USA.
|