201
Tattini L, D'Aurizio R, Magi A. Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Front Bioeng Biotechnol 2015; 3:92. [PMID: 26161383 PMCID: PMC4479793 DOI: 10.3389/fbioe.2015.00092] [Citation(s) in RCA: 155] [Impact Index Per Article: 17.2] [Received: 12/10/2014] [Accepted: 06/10/2015] [Indexed: 01/16/2023]
Abstract
Structural variants are genomic rearrangements larger than 50 bp that account for around 1% of the variation among human genomes. They affect phenotypic diversity and play a role in various diseases, including neurological and neurocognitive disorders and cancer development and progression. Dissecting structural variants from next-generation sequencing data presents several challenges, and a number of approaches have been proposed in the literature. In this mini review, we describe and summarize the latest tools, and their underlying algorithms, designed for the analysis of whole-genome sequencing, whole-exome sequencing, custom capture, and amplicon sequencing data, pointing out their major advantages and drawbacks. We also summarize the most recent applications of third-generation sequencing platforms. This assessment provides guidance, with particular emphasis on human genetics and copy number variants, for researchers investigating these genomic events.
Affiliation(s)
- Lorenzo Tattini
  - Department of Neurosciences, Psychology, Pharmacology and Child Health, University of Florence, Florence, Italy
- Romina D'Aurizio
  - Laboratory of Integrative Systems Medicine (LISM), Institute of Informatics and Telematics and Institute of Clinical Physiology, National Research Council, Pisa, Italy
- Alberto Magi
  - Department of Clinical and Experimental Medicine, University of Florence, Florence, Italy
202
Abstract
The Breakage Fusion Bridge (BFB) process is a key marker of genomic instability, producing highly rearranged genomes in a relatively small number of cell cycles. While the process itself was first observed in the late 1930s, little is known about the extent of BFB in tumor genome evolution. Moreover, BFB can dramatically increase the copy numbers of chromosomal segments, which in turn complicates both reference-assisted and ab initio genome assembly. Based on available data such as Next Generation Sequencing (NGS) and Array Comparative Genomic Hybridization (aCGH) data, we show here how BFB evidence may be identified, and how to enumerate all possible evolutions of the process with respect to observed data. Specifically, we describe practical algorithms that, given a chromosomal arm segmentation and noisy segment copy number estimates, produce all segment count vectors supported by the data that can be produced by BFB, and all corresponding BFB architectures. This extends the scope of the analyses described in our previous work, which produced a single count vector and architecture per instance. We apply these analyses to a comprehensive human cancer dataset, demonstrate the effectiveness and efficiency of the computation, and suggest methods for further assertions of candidate BFB samples. Source code of our tool can be found online.
Affiliation(s)
- Shay Zakov
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, California
- Vineet Bafna
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, California
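The notion of a segment count vector produced by BFB can be illustrated with a toy simulation. This is a minimal sketch and not the authors' enumeration algorithm: it assumes the commonly used simplification in which one BFB cycle appends an inverted copy of a suffix of the chromosome arm, and the names `bfb_cycle` and `count_vector` are hypothetical.

```python
from typing import List


def bfb_cycle(arm: List[int], k: int) -> List[int]:
    """Apply one simplified BFB cycle to a chromosome arm.

    `arm` lists signed segment ids from centromere to telomere
    (a negative id means inverted orientation). The last `k`
    segments are appended again as an inverted copy, modelling
    sister-chromatid fusion followed by a break k segments past
    the fusion point. Truncating breaks are ignored (assumption).
    """
    if not 0 <= k <= len(arm):
        raise ValueError("k must be between 0 and len(arm)")
    suffix = arm[len(arm) - k:]
    return arm + [-s for s in reversed(suffix)]


def count_vector(arm: List[int], n_segments: int) -> List[int]:
    """Count copies of segments 1..n_segments, ignoring orientation."""
    return [sum(1 for s in arm if abs(s) == i)
            for i in range(1, n_segments + 1)]


arm = [1, 2, 3]                      # toy arm with three segments
after_one = bfb_cycle(arm, 2)        # [1, 2, 3, -3, -2]
after_two = bfb_cycle(after_one, 3)  # [1, 2, 3, -3, -2, 2, 3, -3]
print(count_vector(after_two, 3))    # [1, 3, 4]
```

Under this toy model, copy numbers tend to grow toward the telomeric end of the arm, which is the kind of signature a BFB detection method would look for in noisy copy number estimates.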
203
Masri L, Branca A, Sheppard AE, Papkou A, Laehnemann D, Guenther PS, Prahl S, Saebelfeld M, Hollensteiner J, Liesegang H, Brzuszkiewicz E, Daniel R, Michiels NK, Schulte RD, Kurtz J, Rosenstiel P, Telschow A, Bornberg-Bauer E, Schulenburg H. Host-Pathogen Coevolution: The Selective Advantage of Bacillus thuringiensis Virulence and Its Cry Toxin Genes. PLoS Biol 2015; 13:e1002169. [PMID: 26042786 PMCID: PMC4456383 DOI: 10.1371/journal.pbio.1002169] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Received: 08/03/2014] [Accepted: 05/07/2015] [Indexed: 01/11/2023]
Abstract
Reciprocal coevolution between host and pathogen is widely seen as a major driver of evolution and biological innovation. Yet, to date, the underlying genetic mechanisms and associated trait functions that are unique to rapid coevolutionary change are generally unknown. We here combined experimental evolution of the bacterial biocontrol agent Bacillus thuringiensis and its nematode host Caenorhabditis elegans with large-scale phenotyping, whole genome analysis, and functional genetics to demonstrate the selective benefit of pathogen virulence and the underlying toxin genes during the adaptation process. We show that: (i) high virulence was specifically favoured during pathogen-host coevolution rather than pathogen one-sided adaptation to a nonchanging host or to an environment without host; (ii) the pathogen genotype BT-679 with known nematocidal toxin genes and high virulence specifically swept to fixation in all of the independent replicate populations under coevolution but only some under one-sided adaptation; (iii) high virulence in the BT-679-dominated populations correlated with elevated copy numbers of the plasmid containing the nematocidal toxin genes; (iv) loss of virulence in a toxin-plasmid lacking BT-679 isolate was reconstituted by genetic reintroduction or external addition of the toxins. We conclude that sustained coevolution is distinct from unidirectional selection in shaping the pathogen's genome and life history characteristics. To our knowledge, this study is the first to characterize the pathogen genes involved in coevolutionary adaptation in an animal host-pathogen interaction system.
Affiliation(s)
- Leila Masri
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
  - Department of Animal Evolutionary Ecology, Institute of Evolution and Ecology, University of Tuebingen, Tuebingen, Germany
- Antoine Branca
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
- Anna E. Sheppard
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
- Andrei Papkou
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
- David Laehnemann
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
  - Department of Animal Evolutionary Ecology, Institute of Evolution and Ecology, University of Tuebingen, Tuebingen, Germany
- Patrick S. Guenther
  - Department of Animal Evolutionary Ecology, Institute of Evolution and Ecology, University of Tuebingen, Tuebingen, Germany
- Swantje Prahl
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
- Manja Saebelfeld
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
- Jacqueline Hollensteiner
  - Goettingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August-University of Goettingen, Goettingen, Germany
- Heiko Liesegang
  - Goettingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August-University of Goettingen, Goettingen, Germany
- Elzbieta Brzuszkiewicz
  - Goettingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August-University of Goettingen, Goettingen, Germany
- Rolf Daniel
  - Goettingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August-University of Goettingen, Goettingen, Germany
- Nicolaas K. Michiels
  - Department of Animal Evolutionary Ecology, Institute of Evolution and Ecology, University of Tuebingen, Tuebingen, Germany
- Rebecca D. Schulte
  - Department of Behavioural Biology, University of Osnabrueck, Osnabrueck, Germany
- Joachim Kurtz
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
- Philip Rosenstiel
  - Institute for Clinical Molecular Biology, Christian-Albrechts-University, Kiel, Germany
- Arndt Telschow
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
- Hinrich Schulenburg
  - Department of Evolutionary Ecology and Genetics, Zoological Institute, Christian-Albrechts-University of Kiel, Kiel, Germany
  - Department of Animal Evolutionary Ecology, Institute of Evolution and Ecology, University of Tuebingen, Tuebingen, Germany
204
Zhang X, Xu Y, Liu D, Geng J, Chen S, Jiang Z, Fu Q, Sun K. A modified multiplex ligation-dependent probe amplification method for the detection of 22q11.2 copy number variations in patients with congenital heart disease. BMC Genomics 2015; 16:364. [PMID: 25952753 PMCID: PMC4424574 DOI: 10.1186/s12864-015-1590-5] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 12/10/2014] [Accepted: 04/27/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND Copy number variations (CNVs) of chromosomal region 22q11.2 are associated with a subset of patients with congenital heart disease (CHD). Accurate and efficient detection of CNV is important for genetic analysis of CHD. The aim of the study was to introduce a novel approach named CNVplex®, a high-throughput analysis technique designed for efficient detection of chromosomal CNVs, and to explore the prevalence of sub-chromosomal imbalances in 22q11.2 loci in patients with CHD from a single institute. RESULTS We developed a novel technique, CNVplex®, for high-throughput detection of sub-chromosomal copy number aberrations. Modified from the multiplex ligation-dependent probe amplification (MLPA) method, it introduced a lengthening ligation system and four universal primer sets, which simplified the synthesis of probes and significantly improved the flexibility of the experiment. We used 110 samples, which were extensively characterized with chromosomal microarray analysis and MLPA, to validate the performance of the newly developed method. Furthermore, CNVplex® was used to screen for sub-chromosomal imbalances in 22q11.2 loci in 818 CHD patients consecutively enrolled from Shanghai Children's Medical Center. In the methodology development phase, CNVplex® detected all copy number aberrations that were previously identified with both chromosomal microarray analysis and MLPA, demonstrating 100% sensitivity and specificity. In the validation phase, 22q11.2 deletion and 22q11.2 duplication were detected in 39 and 1 of 818 patients with CHD by CNVplex®, respectively. Our data demonstrated that the frequency of 22q11.2 deletion varied among sub-groups of CHD patients. Notably, 22q11.2 deletion was more commonly observed in cases with conotruncal defect (CTD) than in cases with non-CTD (P<0.001). 
With higher resolution and more probes against selected chromosomal loci, CNVplex® also identified several individuals with small CNVs and alterations in other chromosomes. CONCLUSIONS CNVplex® is sensitive and specific in its detection of CNVs, and it is an alternative to MLPA for batch screening of pathogenetic CNVs in known genomic loci.
Affiliation(s)
- Xiaoqing Zhang
  - Department of Laboratory Medicine, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, People's Republic of China
- Yuejuan Xu
  - Department of Pediatric Cardiology, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200092, People's Republic of China
- Deyuan Liu
  - Genesky Diagnostics (Suzhou) Inc, Suzhou, People's Republic of China
- Juan Geng
  - Department of Laboratory Medicine, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, People's Republic of China
- Sun Chen
  - Department of Pediatric Cardiology, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200092, People's Republic of China
- Zhengwen Jiang
  - Genesky Diagnostics (Suzhou) Inc, Suzhou, People's Republic of China
- Qihua Fu
  - Department of Laboratory Medicine, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, People's Republic of China
- Kun Sun
  - Department of Pediatric Cardiology, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200092, People's Republic of China
205
Identifying Human Genome-Wide CNV, LOH and UPD by Targeted Sequencing of Selected Regions. PLoS One 2015; 10:e0123081. [PMID: 25919136 PMCID: PMC4412667 DOI: 10.1371/journal.pone.0123081] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 07/21/2014] [Accepted: 02/27/2015] [Indexed: 01/03/2023]
Abstract
Copy-number variations (CNV), loss of heterozygosity (LOH), and uniparental disomy (UPD) are large genomic aberrations leading to many common inherited diseases, cancers, and other complex diseases. An integrated tool to identify these aberrations is essential for understanding diseases and designing clinical interventions. Previous discovery methods based on whole-genome sequencing (WGS) require very high depth of coverage across the whole genome and are therefore not cost-efficient. Another approach, whole exome genome sequencing (WEGS), is limited to discovering variations within exons. Efficient methods to detect genomic aberrations on the whole-genome scale using next-generation sequencing technology are thus lacking. Here we present a method to identify genome-wide CNV, LOH and UPD in the human genome by selectively sequencing a small portion of the genome, termed Selected Target Regions (SeTRs). In our experiments, 99.73% to 99.95% of the SeTRs are covered with sufficient depth. Our bioinformatics pipeline calls genome-wide CNVs with high confidence and reveals 8 credible LOH events and 3 UPD events larger than 5M in 15 individual samples. We demonstrate that genome-wide CNV, LOH and UPD can be detected using a cost-effective SeTRs sequencing approach, and that LOH and UPD can be identified using just a sample grouping technique, without a matched sample or familial information.
206
Magi A, D'Aurizio R, Palombo F, Cifola I, Tattini L, Semeraro R, Pippucci T, Giusti B, Romeo G, Abbate R, Gensini GF. Characterization and identification of hidden rare variants in the human genome. BMC Genomics 2015; 16:340. [PMID: 25903059 PMCID: PMC4416239 DOI: 10.1186/s12864-015-1481-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 02/25/2015] [Accepted: 03/23/2015] [Indexed: 12/11/2022]
Abstract
Background By examining the genotype calls generated by the 1000 Genomes Project we discovered that the human reference genome GRCh37 contains almost 20,000 loci in which the reference allele has never been observed in healthy individuals and around 70,000 loci in which it has been observed only in the heterozygous state. Results We show that a large fraction of these rare reference allele (RRA) loci belong to coding, functional and regulatory elements of the genome and could be linked to rare Mendelian disorders as well as cancer. We also demonstrate that classical germline and somatic variant calling tools are unable to recognize the rare allele when it is present at these loci. To overcome this limitation, we developed a novel tool, named RAREVATOR, that is able to identify and call the rare allele at these genomic positions. Using a small cancer dataset, we compared our tool with two state-of-the-art callers and found that RAREVATOR identified more than 1,500 germline and 22 somatic RRA variants that were missed by the two methods and that belong to significantly mutated pathways. Conclusions These results show that, to date, around 100,000 loci of the human genome have gone uninvestigated in re-sequencing experiments based on the GRCh37 assembly and that our tool can fill the gap left by other methods. Moreover, the investigation of the latest version of the human reference genome, GRCh38, showed that although the GRC corrected almost all insertions and a small part of SNVs and deletions, a large number of functionally relevant RRAs remain unchanged. For this reason, future resequencing experiments based on GRCh38 will also benefit from RAREVATOR analysis results. RAREVATOR is freely available at http://sourceforge.net/projects/rarevator.
Affiliation(s)
- Alberto Magi
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Romina D'Aurizio
  - Laboratory of Integrative Systems Medicine (LISM), Institute of Informatics and Telematics and Institute of Clinical Physiology, National Research Council, Pisa, Italy
- Flavia Palombo
  - Medical Genetics Unit, Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
- Ingrid Cifola
  - Institute for Biomedical Technologies, National Research Council, Milan, Italy
- Lorenzo Tattini
  - Department of Neuroscience, Pharmacology and Child Health, University of Florence, Florence, Italy
- Roberto Semeraro
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Tommaso Pippucci
  - Medical Genetics Unit, Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
- Betti Giusti
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Giovanni Romeo
  - Medical Genetics Unit, Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
- Rosanna Abbate
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Gian Franco Gensini
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
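The two classes of rare reference allele (RRA) loci described in this abstract can be made concrete with a small classification sketch. It is an illustration only and does not reproduce RAREVATOR: the function name `classify_locus`, the category labels, and the toy cohort genotypes are all hypothetical, and genotypes are assumed to be biallelic pairs with 0 for the reference allele and 1 for the alternate allele.

```python
from typing import Dict, List, Tuple

# A diploid genotype: 0 = reference allele, 1 = alternate allele.
Genotype = Tuple[int, int]


def classify_locus(genotypes: List[Genotype]) -> str:
    """Classify a locus by how the reference allele appears in a cohort."""
    has_ref = any(0 in g for g in genotypes)
    has_hom_ref = any(g == (0, 0) for g in genotypes)
    if not has_ref:
        # Reference allele never seen: every sample is homozygous alternate.
        return "ref_never_observed"
    if not has_hom_ref:
        # Reference allele seen, but only next to an alternate allele.
        return "ref_only_heterozygous"
    return "common_reference"


# Made-up cohort genotypes, for illustration only.
cohort: Dict[str, List[Genotype]] = {
    "locusA": [(1, 1), (1, 1), (1, 1)],
    "locusB": [(1, 1), (0, 1), (1, 1)],
    "locusC": [(0, 0), (0, 1), (1, 1)],
}
labels = {locus: classify_locus(gts) for locus, gts in cohort.items()}
print(labels)
```

At loci in the first two categories, a sample carrying the reference allele is the rare event, which is why a caller that only reports non-reference alleles would miss it.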
207
Hayes M, Li J. An integrative framework for the identification of double minute chromosomes using next generation sequencing data. BMC Genet 2015; 16 Suppl 2:S1. [PMID: 25953282 PMCID: PMC4423570 DOI: 10.1186/1471-2156-16-s2-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND Double minute chromosomes are circular fragments of DNA whose presence is associated with the onset of certain cancers. Double minutes are lethal, as they are highly amplified and typically contain oncogenes. Locating double minutes can supplement the process of cancer diagnosis, and it can help to identify therapeutic targets. However, there is currently a dearth of computational methods available to identify double minutes. We propose a computational framework for the identification of double minute chromosomes using next-generation sequencing data. Our framework integrates predictions from algorithms that detect DNA copy number variants with predictions from algorithms that locate genomic structural variants. This information is used by a graph-based algorithm to predict the presence of double minute chromosomes. RESULTS Using a previously published copy number variant algorithm and two structural variation prediction algorithms, we implemented our framework and tested it on a dataset consisting of simulated double minute chromosomes. Our approach uncovered double minutes with high accuracy, demonstrating its plausibility. CONCLUSIONS Although we only tested the framework with three programs (RDXplorer, BreakDancer, Delly), it can be extended to incorporate results from programs that 1) detect amplified copy number and 2) detect genomic structural variants such as deletions, translocations, inversions, and tandem repeats. The software that implements the framework can be accessed here: https://github.com/mhayes20/DMFinder
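The graph-based idea in this abstract, combining amplified segments with structural variant junctions and looking for circular structures, might be sketched as follows. This is a simplified illustration under stated assumptions and not DMFinder's implementation: segments are plain string labels, each junction is an unordered pair of segments, orientation and copy number magnitude are ignored, and `find_cycle` is a hypothetical name.

```python
from typing import Dict, List, Optional, Set, Tuple


def find_cycle(edges: List[Tuple[str, str]],
               amplified: Set[str]) -> Optional[List[str]]:
    """Return one cycle of amplified segments joined by SV junctions, or None."""
    graph: Dict[str, Set[str]] = {seg: set() for seg in amplified}
    for a, b in edges:
        if a not in amplified or b not in amplified:
            continue  # junctions touching non-amplified DNA are ignored
        if a == b:
            return [a]  # a head-to-tail junction closes a one-segment circle
        graph[a].add(b)
        graph[b].add(a)

    visited: Set[str] = set()

    def dfs(node: str, parent: Optional[str],
            path: List[str]) -> Optional[List[str]]:
        visited.add(node)
        path.append(node)
        for nxt in sorted(graph[node]):
            if nxt == parent:
                continue
            if nxt in path:
                # Back edge to an ancestor: the path from it forms a cycle.
                return path[path.index(nxt):]
            if nxt not in visited:
                cycle = dfs(nxt, node, path)
                if cycle:
                    return cycle
        path.pop()
        return None

    for start in sorted(graph):
        if start not in visited:
            cycle = dfs(start, None, [])
            if cycle:
                return cycle
    return None


# Toy input: three amplified segments joined in a ring, one unrelated segment.
amplified = {"s1", "s2", "s3", "s4"}
junctions = [("s1", "s2"), ("s2", "s3"), ("s3", "s1"), ("s4", "s9")]
print(find_cycle(junctions, amplified))
```

Because junctions are stored as a set of neighbours, two segments joined by two distinct junctions collapse into one edge; a fuller model would keep breakpoint coordinates and orientations on each edge.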
208
Wang W, Wang W, Sun W, Crowley JJ, Szatkiewicz JP. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing. Nucleic Acids Res 2015; 43:e90. [PMID: 25883151 PMCID: PMC4538801 DOI: 10.1093/nar/gkv319] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 10/07/2014] [Accepted: 03/27/2015] [Indexed: 11/14/2022]
Abstract
Copy-number variants (CNVs) are a major form of genetic variation and a risk factor for various human diseases, so it is crucial to accurately detect and characterize them. It is conceivable that allele-specific reads from high-throughput sequencing data could be leveraged to both enhance CNV detection and produce allele-specific copy number (ASCN) calls. Although statistical methods have been developed to detect CNVs using whole-genome sequence (WGS) and/or whole-exome sequence (WES) data, information from allele-specific read counts has not yet been adequately exploited. In this paper, we develop an integrated method, called AS-GENSENG, which incorporates allele-specific read counts in CNV detection and estimates ASCN using either WGS or WES data. To evaluate the performance of AS-GENSENG, we conducted extensive simulations, generated empirical data using existing WGS and WES data sets and validated predicted CNVs using an independent methodology. We conclude that AS-GENSENG not only predicts accurate ASCN calls but also improves the accuracy of total copy number calls, owing to its unique ability to exploit information from both total and allele-specific read counts while accounting for various experimental biases in sequence data. Our novel, user-friendly and computationally efficient method and a complete analytic protocol are freely available at https://sourceforge.net/projects/asgenseng/.
Affiliation(s)
- WeiBo Wang
  - Department of Computer Science, University of North Carolina at Chapel Hill, NC 27599-3175, USA
- Wei Wang
  - Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Wei Sun
  - Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7400, USA
- James J Crowley
  - Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA
- Jin P Szatkiewicz
  - Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA
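How allele-specific read counts relate to allele-specific copy number (ASCN) can be shown with a naive proportional split. This is a back-of-the-envelope sketch, not the AS-GENSENG model, whose statistical details the abstract does not give: it assumes a known total copy number, a single heterozygous SNP, and no mapping or allelic bias, and `allele_specific_cn` is a hypothetical name.

```python
from typing import Tuple


def allele_specific_cn(total_cn: int, reads_a: int, reads_b: int) -> Tuple[int, int]:
    """Split a total copy number between two alleles by read proportion.

    `reads_a` and `reads_b` are read counts supporting each allele at a
    heterozygous SNP inside the segment. Returns (copies_a, copies_b).
    """
    depth = reads_a + reads_b
    if depth == 0:
        raise ValueError("no allele-specific reads at this site")
    copies_a = round(total_cn * reads_a / depth)
    return copies_a, total_cn - copies_a


print(allele_specific_cn(3, 62, 31))  # duplication of allele A: (2, 1)
print(allele_specific_cn(2, 48, 52))  # balanced diploid segment: (1, 1)
print(allele_specific_cn(1, 40, 0))   # deletion of allele B: (1, 0)
```

A single site is noisy, so a practical method would pool evidence across many SNPs in a segment and model sequencing biases rather than round one ratio.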
209
Escaramís G, Docampo E, Rabionet R. A decade of structural variants: description, history and methods to detect structural variation. Brief Funct Genomics 2015; 14:305-14. [PMID: 25877305 DOI: 10.1093/bfgp/elv014] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Indexed: 12/25/2022]
Abstract
In the past decade, the view of genomic structural variation (SV) has changed completely. SVs, previously considered rare events, are now recognized as the largest source of interindividual genetic variation, affecting more bases than single nucleotide polymorphisms, variable number of tandem repeats and other small genetic variants. They have also been shown to play a role in phenotypic variation and in disease. In this review, the authors provide an introduction to SV; a short historical perspective on the research into this source of genomic variation; a description of the types of structural variants and of how they may have arisen; and an overview of methods for detecting structural variants, focusing on the analysis of high-throughput sequencing data.
210
Pirooznia M, Goes FS, Zandi PP. Whole-genome CNV analysis: advances in computational approaches. Front Genet 2015; 6:138. [PMID: 25918519 PMCID: PMC4394692 DOI: 10.3389/fgene.2015.00138] [Citation(s) in RCA: 123] [Impact Index Per Article: 13.7] [Received: 01/26/2015] [Accepted: 03/23/2015] [Indexed: 01/04/2023]
Abstract
Accumulating evidence indicates that DNA copy number variation (CNV) is likely to make a significant contribution to human diversity and also play an important role in disease susceptibility. Recent advances in genome sequencing technologies have enabled the characterization of a variety of genomic features, including CNVs. This has led to the development of several bioinformatics approaches to detect CNVs from next-generation sequencing data. Here, we review recent advances in CNV detection from whole genome sequencing. We discuss the informatics approaches and current computational tools that have been developed as well as their strengths and limitations. This review will assist researchers and analysts in choosing the most suitable tools for CNV analysis as well as provide suggestions for new directions in future development.
Affiliation(s)
- Mehdi Pirooznia
  - Mood Disorders Center, Department of Psychiatry and Behavioral Sciences, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Fernando S Goes
  - Mood Disorders Center, Department of Psychiatry and Behavioral Sciences, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Peter P Zandi
  - Mood Disorders Center, Department of Psychiatry and Behavioral Sciences, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
211
Parks MM, Lawrence CE, Raphael BJ. Detecting non-allelic homologous recombination from high-throughput sequencing data. Genome Biol 2015; 16:72. [PMID: 25886137 PMCID: PMC4425883 DOI: 10.1186/s13059-015-0633-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 01/23/2015] [Accepted: 03/16/2015] [Indexed: 12/27/2022]
Abstract
Non-allelic homologous recombination (NAHR) is a common mechanism for generating genome rearrangements and is implicated in numerous genetic disorders, but its detection in high-throughput sequencing data poses a serious challenge. We present a probabilistic model of NAHR and demonstrate its ability to find NAHR in low-coverage sequencing data from 44 individuals. We identify NAHR-mediated deletions or duplications in 109 of 324 potential NAHR loci in at least one of the individuals. These calls segregate by ancestry, are more common in closely spaced repeats, often result in duplicated genes or pseudogenes, and affect highly studied genes such as GBA and CYP2E1.
Affiliation(s)
- Matthew M Parks
  - Division of Applied Mathematics, Brown University, Providence, USA
- Charles E Lawrence
  - Division of Applied Mathematics, Brown University, Providence, USA
  - Center for Computational Molecular Biology, Brown University, Providence, USA
- Benjamin J Raphael
  - Center for Computational Molecular Biology, Brown University, Providence, USA
  - Department of Computer Science, Brown University, Providence, USA
212
Noureen A, Fresser F, Utermann G, Schmidt K. Sequence variation within the KIV-2 copy number polymorphism of the human LPA gene in African, Asian, and European populations. PLoS One 2015; 10:e0121582. [PMID: 25822457 PMCID: PMC4378929 DOI: 10.1371/journal.pone.0121582] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 07/21/2014] [Accepted: 02/13/2015] [Indexed: 11/18/2022]
Abstract
Amazingly little sequence variation is reported for the kringle IV 2 copy number variation (KIV 2 CNV) in the human LPA gene. Apart from whole genome sequencing projects, this region has only been analyzed in some detail in samples of European populations. We have performed a systematic resequencing study of the exonic and flanking intron regions within the KIV 2 CNV in 90 alleles from Asian, European, and four different African populations. Alleles have been separated according to their CNV length by pulsed field gel electrophoresis prior to unbiased specific PCR amplification of the target regions. These amplicons covered all KIV 2 copies of an individual allele simultaneously. In addition, cloned amplicons from genomic DNA of an African individual were sequenced. Our data suggest that sequence variation in this genomic region may be higher than previously appreciated. Detection probability of variants appeared to depend on the KIV 2 copy number of the analyzed DNA and on the proportion of copies carrying the variant. Asians had a high frequency of so-called KIV 2 type B and type C (together 70% of alleles), which differ by three or two synonymous substitutions respectively from the reference type A. This is most likely explained by the strong bottleneck suggested to have occurred when modern humans migrated to East Asia. A higher frequency of variable sites was detected in the Africans. In particular, two previously unreported splice site variants were found. One was associated with non-detectable Lp(a). The other was observed at high population frequencies (10% to 40%). Like the KIV 2 type B and C variants, this latter variant was also found in a high proportion of KIV 2 repeats in the affected alleles and in alleles differing in copy numbers. Our findings may have implications for the interpretation of SNP analyses in other repetitive loci of the human genome.
Collapse
Affiliation(s)
- Asma Noureen
- Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
- Division of Human Genetics, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
| | - Friedrich Fresser
- Division of Human Genetics, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
- Division of Translational Cell Genetics, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
| | - Gerd Utermann
- Division of Human Genetics, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
| | - Konrad Schmidt
- Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
- Division of Human Genetics, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
- Centre de Recherches Médicales de Lambaréné, Albert Schweitzer Hospital, Lambaréné, Gabon
- Department for Tropical Medicine, Eberhard-Karls-University Tübingen, Tübingen, Germany
|
213
|
Comparison of sequencing based CNV discovery methods using monozygotic twin quartets. PLoS One 2015; 10:e0122287. [PMID: 25812131 PMCID: PMC4374778 DOI: 10.1371/journal.pone.0122287] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 09/12/2014] [Accepted: 02/11/2015] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND The advent of high throughput sequencing methods brings a considerable number of technical challenges. Among them is the discovery of copy-number variations (CNVs) using whole-genome sequencing data. CNVs are genomic structural variations defined as a variation in the number of copies of a large genomic fragment, usually more than one kilobase. Here, we aim to compare different CNV calling methods in order to assess their ability to consistently identify CNVs by comparison of the calls in 9 quartets of identical twin pairs. The use of monozygotic twins provides a means of estimating the error rate of each algorithm by observing CNVs that are inconsistently called when considering the rules of Mendelian inheritance and the assumption of an identical genome between twins. The similarity between the calls from the different tools and the advantage of combining call sets were also considered. RESULTS ERDS and CNVnator obtained the best performance when considering the inherited CNV rate with a mean of 0.74 and 0.70, respectively. Venn diagrams were generated to show the agreement between the different algorithms, before and after filtering out familial inconsistencies. This filtering revealed a high number of false positives for CNVer and Breakdancer. A low overall agreement between the methods suggested a high complementarity of the different tools when calling CNVs. The breakpoint sensitivity analysis indicated that CNVnator and ERDS achieved better resolution of CNV borders than the other tools. The highest inherited CNV rate was achieved through the intersection of these two tools (81%). CONCLUSIONS This study showed that ERDS and CNVnator provide good performance on whole genome sequencing data with respect to CNV consistency across families, CNV breakpoint resolution and CNV call specificity. The intersection of the calls from the two tools would be valuable for CNV genotyping pipelines.
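The "inherited CNV rate" in the abstract above can be illustrated with a minimal sketch. It assumes that a call in a twin counts as inherited when it matches a call in either parent, and that two calls match when their reciprocal overlap is at least 50%. The family layout, the 50% threshold, the tuple layout, and the function names are illustrative assumptions, not details taken from the study.

```python
from typing import List, Tuple

# Hypothetical call representation: (chromosome, start, end), half-open.
CNV = Tuple[str, int, int]


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if disjoint)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def inherited_rate(child: List[CNV], parents: List[CNV],
                   min_overlap: float = 0.5) -> float:
    """Fraction of child calls matched by at least one parental call."""
    if not child:
        return 0.0
    inherited = sum(
        1 for c in child
        if any(reciprocal_overlap(c, p) >= min_overlap for p in parents)
    )
    return inherited / len(child)


# Toy data (invented for illustration only).
child_calls = [("chr1", 1000, 5000), ("chr2", 200, 900), ("chr3", 50, 150)]
parent_calls = [("chr1", 1200, 5200), ("chr2", 100, 1000), ("chr4", 0, 100)]
rate = inherited_rate(child_calls, parent_calls)
print(f"inherited CNV rate: {rate:.2f}")
```

In this toy example two of the three child calls have a parental match. The same overlap test could be applied between the two twins of a pair to flag calls that break the identical-genome assumption.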
|
214
|
Wang M, Beck CR, English AC, Meng Q, Buhay C, Han Y, Doddapaneni HV, Yu F, Boerwinkle E, Lupski JR, Muzny DM, Gibbs RA. PacBio-LITS: a large-insert targeted sequencing method for characterization of human disease-associated chromosomal structural variations. BMC Genomics 2015; 16:214. [PMID: 25887218 PMCID: PMC4376517 DOI: 10.1186/s12864-015-1370-2] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 09/23/2014] [Accepted: 02/20/2015] [Indexed: 11/24/2022] Open
Abstract
Background Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high. Results We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki–Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants. Conclusions The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and small insertion and deletion validation could also benefit from this technology. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1370-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Min Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Christine R Beck
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Adam C English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Qingchang Meng
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Christian Buhay
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Yi Han
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Harsha V Doddapaneni
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Fuli Yu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Eric Boerwinkle
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Human Genetics Center, University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
| | - James R Lupski
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
| | - Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
|
215
|
Smith SD, Kawash JK, Grigoriev A. GROM-RD: resolving genomic biases to improve read depth detection of copy number variants. PeerJ 2015; 3:e836. [PMID: 25802807 PMCID: PMC4369336 DOI: 10.7717/peerj.836] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 12/05/2014] [Accepted: 02/23/2015] [Indexed: 12/21/2022] Open
Abstract
Amplifications or deletions of genome segments, known as copy number variants (CNVs), have been associated with many diseases. Read depth analysis of next-generation sequencing (NGS) is an essential method of detecting CNVs. However, genome read coverage is frequently distorted by various biases of NGS platforms, which reduce predictive capabilities of existing approaches. Additionally, the use of read depth tools has been somewhat hindered by imprecise breakpoint identification. We developed GROM-RD, an algorithm that analyzes multiple biases in read coverage to detect CNVs in NGS data. We found non-uniform variance across distinct GC regions after using existing GC bias correction methods and developed a novel approach to normalize such variance. Although complex and repetitive genome segments complicate CNV detection, GROM-RD adjusts for repeat bias and uses a two-pipeline masking approach to detect CNVs in complex and repetitive segments while improving sensitivity in less complicated regions. To overcome a typical weakness of RD methods, GROM-RD employs a CNV search using size-varying overlapping windows to improve breakpoint resolution. We compared our method to two widely used programs based on read depth methods, CNVnator and RDXplorer, and observed improved CNV detection and breakpoint accuracy for GROM-RD. GROM-RD is available at http://grigoriev.rutgers.edu/software/.
Affiliation(s)
- Sean D Smith
- Department of Biology, Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
| | - Joseph K Kawash
- Department of Biology, Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
| | - Andrey Grigoriev
- Department of Biology, Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
|
216
|
Manconi A, Manca E, Moscatelli M, Gnocchi M, Orro A, Armano G, Milanesi L. G-CNV: A GPU-Based Tool for Preparing Data to Detect CNVs with Read-Depth Methods. Front Bioeng Biotechnol 2015; 3:28. [PMID: 25806367 PMCID: PMC4354384 DOI: 10.3389/fbioe.2015.00028] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 12/10/2014] [Accepted: 02/19/2015] [Indexed: 11/23/2022] Open
Abstract
Copy number variations (CNVs) are the most prevalent types of structural variations (SVs) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SV and to study how they are implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) have been increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions whose read depth differs significantly from that of other regions. The analysis pipeline of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV region identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments. Therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, to mask low-quality nucleotides, to remove adapter sequences, to remove duplicated read sequences, to map the short-reads, to resolve multiple mapping ambiguities, to build the read-depth signal, and to normalize it. G-CNV can be efficiently used as a third-party tool able to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated in CNV detection tools to generate read-depth signals.
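The first two pipeline stages named in the abstract above (building the read-depth signal and normalizing it) can be sketched on the CPU in a few lines. This is not G-CNV's GPU implementation: the fixed window size, the median-ratio GC correction, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median


def read_depth(read_starts, genome_length, window):
    """Count mapped read start positions in fixed, non-overlapping windows."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // window] += 1
    return counts


def gc_normalize(counts, gc_percent):
    """Scale each window by (global median) / (median of windows sharing its GC bin)."""
    overall = median(counts)
    by_gc = defaultdict(list)
    for count, gc in zip(counts, gc_percent):
        by_gc[gc].append(count)
    gc_median = {gc: median(values) for gc, values in by_gc.items()}
    return [
        count * overall / gc_median[gc] if gc_median[gc] > 0 else 0.0
        for count, gc in zip(counts, gc_percent)
    ]


# Toy data (invented): eight read starts on a 50 bp "genome", 10 bp windows.
read_starts = [5, 12, 18, 25, 31, 33, 38, 45]
counts = read_depth(read_starts, genome_length=50, window=10)
gc_bins = [40, 60, 40, 60, 40]  # hypothetical GC percentage per window
normalized = gc_normalize(counts, gc_bins)
print(counts, normalized)
```

Windows in the GC-rich bin have inflated raw counts in this toy example, and the correction pulls them back toward the global median so that the remaining deviations are more likely to reflect copy number.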
Affiliation(s)
- Andrea Manconi
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Emanuele Manca
- Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy
| | - Marco Moscatelli
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Matteo Gnocchi
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Alessandro Orro
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Giuliano Armano
- Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy
| | - Luciano Milanesi
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
|
217
|
Glusman G, Severson A, Dhankani V, Robinson M, Farrah T, Mauldin DE, Stittrich AB, Ament SA, Roach JC, Brunkow ME, Bodian DL, Vockley JG, Shmulevich I, Niederhuber JE, Hood L. Identification of copy number variants in whole-genome data using Reference Coverage Profiles. Front Genet 2015; 6:45. [PMID: 25741365 PMCID: PMC4330915 DOI: 10.3389/fgene.2015.00045] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 11/30/2014] [Accepted: 01/30/2015] [Indexed: 12/20/2022] Open
Abstract
The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation.
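The two steps named in the abstract above, normalization of coverage to a reference profile followed by HMM segmentation, can be sketched as follows. This is a toy illustration, not the authors' tool: the three states, the Gaussian emission means (0.5, 1.0, 1.5), the shared standard deviation, and the self-transition probability are all assumed values, and the reference profile is assumed to be non-zero at every position.

```python
import math

# Hypothetical model parameters, chosen only for illustration.
STATES = ("loss", "normal", "gain")
MEANS = {"loss": 0.5, "normal": 1.0, "gain": 1.5}  # expected coverage ratio
SD = 0.15                                          # shared emission std dev
STAY = 0.98                                        # self-transition probability


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def log_trans(a, b):
    """Log transition probability between two states."""
    return math.log(STAY) if a == b else math.log((1 - STAY) / (len(STATES) - 1))


def segment(sample_cov, reference_cov):
    """Normalize a non-empty coverage track to the reference, then run Viterbi."""
    ratios = [s / r for s, r in zip(sample_cov, reference_cov)]
    score = {s: math.log(1 / len(STATES)) + log_gauss(ratios[0], MEANS[s], SD)
             for s in STATES}
    back = []
    for x in ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            new_score[s] = score[prev] + log_trans(prev, s) + log_gauss(x, MEANS[s], SD)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back the most likely state path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy data (invented): a four-window drop to about half the reference depth.
reference = [30.0] * 10
sample = [30, 31, 29, 15, 16, 14, 15, 30, 29, 31]
path = segment(sample, reference)
print(path)
```

The high self-transition probability is what keeps single noisy windows from being called as separate events; lowering it would make the segmentation more sensitive and less specific.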
Affiliation(s)
| | - Dale L Bodian
- Inova Translational Medicine Institute, Inova Health System, Falls Church, VA, USA
| | - Joseph G Vockley
- Inova Translational Medicine Institute, Inova Health System, Falls Church, VA, USA
| | - John E Niederhuber
- Inova Translational Medicine Institute, Inova Health System, Falls Church, VA, USA
| | - Leroy Hood
- Institute for Systems Biology, Seattle, WA, USA
|
218
|
Brynildsrud O, Snipen LG, Bohlin J. CNOGpro: detection and quantification of CNVs in prokaryotic whole-genome sequencing data. Bioinformatics 2015; 31:1708-15. [DOI: 10.1093/bioinformatics/btv070] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 04/22/2014] [Accepted: 01/28/2015] [Indexed: 01/22/2023] Open
|
219
|
Puranik R, Quan G, Werner J, Zhou R, Xu Z. A pipeline for completing bacterial genomes using in silico and wet lab approaches. BMC Genomics 2015; 16 Suppl 3:S7. [PMID: 25708162 PMCID: PMC4331810 DOI: 10.1186/1471-2164-16-s3-s7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023] Open
Abstract
Background Despite the large volume of genome sequencing data produced by next-generation sequencing technologies and the highly sophisticated software dedicated to handling these types of data, gaps are commonly found in draft genome assemblies. The existence of gaps compromises our ability to take full advantage of the genome data. This study aims to identify a practical approach for biologists to complete their own genome assemblies using commonly available tools and resources. Results A pipeline was developed to assemble complete genomes primarily from next-generation sequencing (NGS) data. The input of the pipeline is paired-end Illumina sequence reads, and the output is a high quality complete genome sequence. The pipeline alternates between computational and biological methods over seven steps. It combines the strengths of de novo assembly, reference-based assembly, customized programming, use of public databases, and wet lab experimentation. The application of the pipeline is demonstrated by the completion of a bacterial genome, Thermotoga sp. strain RQ7, a hydrogen-producing strain. Conclusions The developed pipeline provides an example of effective integration of computational and biological principles. It highlights the complementary roles that in silico and wet lab methodologies play in bioinformatics studies. The underlying principles and methods are applicable to similar studies on both prokaryotic and eukaryotic genomes.
|
220
|
Reinecke F, Satya RV, DiCarlo J. Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls. BMC Bioinformatics 2015; 16:17. [PMID: 25626454 PMCID: PMC4384318 DOI: 10.1186/s12859-014-0428-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 06/20/2014] [Accepted: 12/11/2014] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Next-generation sequencing (NGS) is rapidly becoming common practice in clinical diagnostics and cancer research. In addition to the detection of single nucleotide variants (SNVs), information on copy number variants (CNVs) is of great interest. Several algorithms exist to detect CNVs by analyzing whole genome sequencing data or data from samples enriched by hybridization-capture. PCR-enriched amplicon-sequencing data have special characteristics that have been taken into account by only one publicly available algorithm so far. RESULTS We describe a new algorithm named quandico to detect copy number differences based on NGS data generated following PCR-enrichment. A weighted t-test statistic was applied to calculate probabilities (p-values) of copy number changes. We assessed the performance of the method using sequencing reads generated from reference DNA with known CNVs, and we were able to detect these variants with 98.6% sensitivity and 98.5% specificity which is significantly better than another recently described method for amplicon sequencing. The source code (R-package) of quandico is licensed under the GPLv3 and it is available at https://github.com/reineckef/quandico . CONCLUSION We demonstrated that our new algorithm is suitable to call copy number changes using data from PCR-enriched samples with high sensitivity and specificity even for single copy differences.
Affiliation(s)
- Frank Reinecke
- Bioinformatics Assay Design & Analysis, QIAGEN GmbH, Max-Volmer-Straße 4, Hilden, 40724, Germany.
| | - Ravi Vijaya Satya
- Bioinformatics Assay Design & Analysis, QIAGEN Sciences Inc., 6951 Executive Way, Frederick MD, 21703, USA.
| | - John DiCarlo
- Bioinformatics Assay Design & Analysis, QIAGEN Sciences Inc., 6951 Executive Way, Frederick MD, 21703, USA.
|
221
|
Large multiallelic copy number variations in humans. Nat Genet 2015; 47:296-303. [PMID: 25621458 PMCID: PMC4405206 DOI: 10.1038/ng.3200] [Citation(s) in RCA: 265] [Impact Index Per Article: 29.4] [Received: 07/28/2014] [Accepted: 12/31/2014] [Indexed: 12/14/2022]
Abstract
Thousands of genomic segments appear to be present in widely varying copy numbers in different human genomes. We developed ways to use increasingly abundant whole-genome sequence data to identify the copy numbers, alleles and haplotypes present at most large multiallelic CNVs (mCNVs). We analyzed 849 genomes sequenced by the 1000 Genomes Project to identify most large (>5-kb) mCNVs, including 3,878 duplications, of which 1,356 appear to have 3 or more segregating alleles. We find that mCNVs give rise to most human variation in gene dosage (seven times the combined contribution of deletions and biallelic duplications) and that this variation in gene dosage generates abundant variation in gene expression. We describe 'runaway duplication haplotypes' in which genes, including HPR and ORM1, have mutated to high copy number on specific haplotypes. We also describe partially successful initial strategies for analyzing mCNVs via imputation and provide an initial data resource to support such analyses.
|
222
|
Jiang Y, Oldridge DA, Diskin SJ, Zhang NR. CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 2015; 43:e39. [PMID: 25618849 PMCID: PMC4381046 DOI: 10.1093/nar/gku1363] [Citation(s) in RCA: 100] [Impact Index Per Article: 11.1] [Received: 06/13/2014] [Accepted: 12/19/2014] [Indexed: 01/24/2023] Open
Abstract
High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but detecting and characterizing CNV from exome sequencing is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for whole exome sequencing data. The Poisson latent factor model in CODEX includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based recursive segmentation procedure that explicitly models the count-based exome sequencing data. CODEX is compared to existing methods on a population analysis of HapMap samples from the 1000 Genomes Project, and shown to be more accurate on three microarray-based validation data sets. We further evaluate performance on 222 neuroblastoma samples with matched normals and focus on a well-studied rare somatic CNV within the ATRX gene. We show that the cross-sample normalization procedure of CODEX removes more noise than normalizing the tumor against the matched normal and that the segmentation procedure performs well in detecting CNVs with nested structures.
Affiliation(s)
- Yuchao Jiang
- Genomics and Computational Biology Graduate Program, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Derek A Oldridge
- Medical Scientist Training Program, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Division of Oncology and Center for Childhood Cancer Research, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Sharon J Diskin
- Division of Oncology and Center for Childhood Cancer Research, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Abramson Family Cancer Research Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Nancy R Zhang
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
|
223
|
Oleksiewicz U, Tomczak K, Woropaj J, Markowska M, Stępniak P, Shah PK. Computational characterisation of cancer molecular profiles derived using next generation sequencing. Contemp Oncol (Pozn) 2015; 19:A78-91. [PMID: 25691827 PMCID: PMC4322529 DOI: 10.5114/wo.2014.47137] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/30/2023] Open
Abstract
Our current understanding of cancer genetics is grounded on the principle that cancer arises from a clone that has accumulated the requisite somatically acquired genetic aberrations, leading to the malignant transformation. It also results in aberrant gene and protein expression. Next generation sequencing (NGS) or deep sequencing platforms are being used to create large catalogues of changes in copy numbers, mutations, structural variations, gene fusions, gene expression, and other types of information for cancer patients. However, inferring different types of biological changes from raw reads generated using the sequencing experiments is algorithmically and computationally challenging. In this article, we outline common steps for the quality control and processing of NGS data. We highlight the importance of accurate and application-specific alignment of these reads and the methodological steps and challenges in obtaining different types of information. We comment on the importance of integrating these data and building infrastructure to analyse them. We also provide exhaustive lists of available software to obtain information and point the readers to articles comparing software for deeper insight in specialised areas. We hope that the article will guide readers in choosing the right tools for analysing oncogenomic datasets.
Affiliation(s)
- Urszula Oleksiewicz
- Laboratory of Gene Therapy, Department of Cancer Immunology, The Greater Poland Cancer Centre, Poznan, Poland; Department of Cancer Immunology and Diagnostics, Chair of Medical Biotechnology, Poznan University of Medical Sciences, Poznan, Poland; These authors contributed equally to this paper
| | - Katarzyna Tomczak
- Laboratory of Gene Therapy, Department of Cancer Immunology, The Greater Poland Cancer Centre, Poznan, Poland; Department of Cancer Immunology and Diagnostics, Chair of Medical Biotechnology, Poznan University of Medical Sciences, Poznan, Poland; Postgraduate School of Molecular Medicine, Medical University of Warsaw, Warsaw; These authors contributed equally to this paper
| | - Jakub Woropaj
- Poznan University of Economics, Poznań, Poland; These authors contributed equally to this paper
| | - Parantu K Shah
- Institute for Applied Cancer Science, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
|
224
|
|
225
|
Lin K, Smit S, Bonnema G, Sanchez-Perez G, de Ridder D. Making the difference: integrating structural variation detection tools. Brief Bioinform 2014; 16:852-64. [PMID: 25504367 DOI: 10.1093/bib/bbu047] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 08/20/2014] [Indexed: 01/01/2023] Open
Abstract
From prokaryotes to eukaryotes, phenotypic variation, adaptation and speciation have been associated with structural variation between genomes of individuals within the same species. Many computer algorithms detecting such variations (callers) have recently been developed, spurred by the advent of the next-generation sequencing technology. Such callers mainly exploit split-read mapping or paired-end read mapping. However, as different callers are geared towards different types of structural variation, there is still no single caller that can be considered a community standard; instead, increasingly the various callers are combined in integrated pipelines. In this article, we review a wide range of callers, discuss challenges in the integration step and present a survey of pipelines used in population genomics studies. Based on our findings, we provide general recommendations on how to set up such pipelines. Finally, we present an outlook on future challenges in structural variation detection.
|
226
|
Somatic mosaicism in the human genome. Genes (Basel) 2014; 5:1064-94. [PMID: 25513881 PMCID: PMC4276927 DOI: 10.3390/genes5041064] [Citation(s) in RCA: 97] [Impact Index Per Article: 9.7] [Received: 10/20/2014] [Revised: 11/26/2014] [Accepted: 11/28/2014] [Indexed: 12/17/2022] Open
Abstract
Somatic mosaicism refers to the occurrence of two genetically distinct populations of cells within an individual, derived from a postzygotic mutation. In contrast to inherited mutations, somatic mosaic mutations may affect only a portion of the body and are not transmitted to progeny. These mutations affect varying genomic sizes ranging from single nucleotides to entire chromosomes and have been implicated in disease, most prominently cancer. The phenotypic consequences of somatic mosaicism are dependent upon many factors including the developmental time at which the mutation occurs, the areas of the body that are affected, and the pathophysiological effect(s) of the mutation. The advent of second-generation sequencing technologies has augmented existing array-based and cytogenetic approaches for the identification of somatic mutations. We outline the strengths and weaknesses of these techniques and highlight recent insights into the role of somatic mosaicism in causing cancer, neurodegenerative, monogenic, and complex disease.
|
227
|
Yang JF, Ding XF, Chen L, Mat WK, Xu MZ, Chen JF, Wang JM, Xu L, Poon WS, Kwong A, Leung GKK, Tan TC, Yu CH, Ke YB, Xu XY, Ke XY, Ma RC, Chan JC, Wan WQ, Zhang LW, Kumar Y, Tsang SY, Li S, Wang HY, Xue H. Copy number variation analysis based on AluScan sequences. J Clin Bioinforma 2014; 4:15. [PMID: 25558350 PMCID: PMC4273479 DOI: 10.1186/s13336-014-0015-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 09/03/2014] [Accepted: 11/12/2014] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND AluScan combines inter-Alu PCR using multiple Alu-based primers with opposite orientations and next-generation sequencing to capture a huge number of Alu-proximal genomic sequences for investigation. Its requirement of only sub-microgram quantities of DNA facilitates the examination of large numbers of samples. However, the special features of AluScan data rendered difficult the calling of copy number variation (CNV) directly using the calling algorithms designed for whole genome sequencing (WGS) or exome sequencing. RESULTS In this study, an AluScanCNV package has been assembled for efficient CNV calling from AluScan sequencing data employing a Geary-Hinkley transformation (GHT) of read-depth ratios between either paired test-control samples, or between test samples and a reference template constructed from reference samples, to call the localized CNVs, followed by use of a GISTIC-like algorithm to identify recurrent CNVs and circular binary segmentation (CBS) to reveal large extended CNVs. To evaluate the utility of CNVs called from AluScan data, the AluScans from 23 non-cancer and 38 cancer genomes were analyzed in this study. The glioma samples analyzed yielded the familiar extended copy-number losses on chromosomes 1p and 9. Also, the recurrent somatic CNVs identified from liver cancer samples were similar to those reported for liver cancer WGS with respect to a striking enrichment of copy-number gains in chromosomes 1q and 8q. When localized or recurrent CNV-features capable of distinguishing between liver and non-liver cancer samples were selected by correlation-based machine learning, a highly accurate separation of the liver and non-liver cancer classes was attained. CONCLUSIONS The results obtained from non-cancer and cancerous tissues indicated that the AluScanCNV package can be employed to call localized, recurrent and extended CNVs from AluScan sequences. 
Moreover, both the localized and recurrent CNVs identified by this method could be subjected to machine-learning selection to yield distinguishing CNV-features that were capable of separating between liver cancers and other types of cancers. Since the method is applicable to any human DNA sample with or without the availability of a paired control, it can also be employed to analyze the constitutional CNVs of individuals.
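The Geary-Hinkley transformation named in the abstract maps a ratio of two approximately normal quantities to an approximately standard normal score. A minimal Python sketch of the textbook form follows. The function name, the parameter names, and the idea of supplying per-window read-depth means and standard deviations are illustrative assumptions; the abstract does not say how the AluScanCNV package estimates these quantities.

```python
import math


def geary_hinkley(z, mu_x, mu_y, sd_x, sd_y, rho=0.0):
    """Transform a ratio z = x / y into an approximately N(0, 1) score.

    Assumes x ~ N(mu_x, sd_x^2) and y ~ N(mu_y, sd_y^2) with correlation rho,
    and that y is very unlikely to be near zero or negative.
    All argument names are hypothetical, chosen for this illustration.
    """
    numerator = mu_y * z - mu_x
    variance = (sd_y ** 2) * z ** 2 - 2.0 * rho * sd_x * sd_y * z + sd_x ** 2
    return numerator / math.sqrt(variance)


# Hypothetical example: test and control windows both expected at depth 100,
# each with standard deviation 10, and an observed depth ratio of 1.5.
score = geary_hinkley(1.5, mu_x=100.0, mu_y=100.0, sd_x=10.0, sd_y=10.0)
```

A ratio equal to the expected ratio mu_x / mu_y gives a score of zero, and large absolute scores flag windows whose depth ratio is unlikely under the no-change model.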
Affiliation(s)
- Jian-Feng Yang
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Xiao-Fan Ding
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Lei Chen
- National Center for Liver Cancer Research and Eastern Hepatobiliary Surgery Hospital, 225 Changhai Road, Shanghai, 200438 China
| | - Wai-Kin Mat
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Michelle Zhi Xu
- Department of Oncology, Nanjing First Hospital, No. 68 Changle Road, Nanjing, 210006 China
| | - Jin-Fei Chen
- Department of Oncology, Nanjing First Hospital, No. 68 Changle Road, Nanjing, 210006 China
| | - Jian-Min Wang
- Department of Hematology, Changhai Hospital, Second Military Medical University, 174 Changhai Road, Shanghai, 200433 China
| | - Lin Xu
- Department of Thoracic Surgery, Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Nanjing Medical University Affiliated Cancer Hospital, Cancer Institute of Jiangsu Province, Baiziting 42, Nanjing, 210009 China
| | - Wai-Sang Poon
- Division of Neurosurgery, Department of Surgery, Prince of Wales Hospital, Chinese University of Hong Kong, 30-32 Ngan Shing Street, Sha Tin, Hong Kong, China
| | - Ava Kwong
- Division of Neurosurgery, Department of Surgery, Li Ka Shing Faculty of Medicine, University of Hong Kong, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong, China
| | - Gilberto Ka-Kit Leung
- Division of Neurosurgery, Department of Surgery, Li Ka Shing Faculty of Medicine, University of Hong Kong, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong, China
| | - Tze-Ching Tan
- Department of Neurosurgery, Queen Elizabeth Hospital, 30 Gascoigne Road, Kowloon, Hong Kong, China
| | - Chi-Hung Yu
- Department of Neurosurgery, Queen Elizabeth Hospital, 30 Gascoigne Road, Kowloon, Hong Kong, China
| | - Yue-Bin Ke
- Shenzhen Center for Disease Control and Prevention, No 8 Longyuan Road, Nanshan district, Shenzhen City, 518055 China
| | - Xin-Yun Xu
- Shenzhen Center for Disease Control and Prevention, No 8 Longyuan Road, Nanshan district, Shenzhen City, 518055 China
| | - Xiao-Yan Ke
- Nanjing Brain Hospital and Nanjing Institute of Neuropsychiatry, Nanjing Medical University, Nanjing, 210029 China
| | - Ronald Cw Ma
- Department of Medicine and Therapeutics, 9th floor, Clinical Sciences Building, The Prince of Wales Hospital, Shatin, Hong Kong
| | - Juliana Cn Chan
- Department of Medicine and Therapeutics, 9th floor, Clinical Sciences Building, The Prince of Wales Hospital, Shatin, Hong Kong
| | - Wei-Qing Wan
- Department of Neurosurgery, Beijing Tiantan Hospital, 6 Tiantan Xili, Dongcheng District, Capital Medical University, Beijing, 100050 China
| | - Li-Wei Zhang
- Department of Neurosurgery, Beijing Tiantan Hospital, 6 Tiantan Xili, Dongcheng District, Capital Medical University, Beijing, 100050 China
| | - Yogesh Kumar
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Shui-Ying Tsang
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Shao Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Department of Automation, Tsinghua University, Beijing, 100084 China
| | - Hong-Yang Wang
- National Center for Liver Cancer Research and Eastern Hepatobiliary Surgery Hospital, 225 Changhai Road, Shanghai, 200438 China.,International Cooperation Laboratory on Signal Transduction, Eastern Hepatobiliary Surgery Hospital, 225 Changhai Road, Shanghai, 200438 China
| | - Hong Xue
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
|
228
|
Yi G, Qu L, Liu J, Yan Y, Xu G, Yang N. Genome-wide patterns of copy number variation in the diversified chicken genomes using next-generation sequencing. BMC Genomics 2014; 15:962. [PMID: 25378104 PMCID: PMC4239369 DOI: 10.1186/1471-2164-15-962] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Received: 03/04/2014] [Accepted: 10/13/2014] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Copy number variation (CNV) is important and widespread in the genome, and is a major cause of disease and phenotypic diversity. Herein, we performed a genome-wide CNV analysis in 12 diversified chicken genomes based on whole genome sequencing. RESULTS A total of 8,840 CNV regions (CNVRs) covering 98.2 Mb and representing 9.4% of the chicken genome were identified, ranging in size from 1.1 to 268.8 kb with an average of 11.1 kb. Sequencing-based predictions were confirmed at a high validation rate by two independent approaches, including array comparative genomic hybridization (aCGH) and quantitative PCR (qPCR). The Pearson's correlation coefficients between sequencing and aCGH results ranged from 0.435 to 0.755, and qPCR experiments revealed a positive validation rate of 91.71% and a false negative rate of 22.43%. In total, 2,214 (25.0%) predicted CNVRs span 2,216 (36.4%) RefSeq genes associated with specific biological functions. Besides two previously reported copy number variable genes EDN3 and PRLR, we also found some promising genes with potential in phenotypic variation. Two genes, FZD6 and LIMS1, related to disease susceptibility/resistance are covered by CNVRs. The highly duplicated SOCS2 may lead to higher bone mineral density. Entire or partial duplication of some genes like POPDC3 may have great economic importance in poultry breeding. CONCLUSIONS Our results based on extensive genetic diversity provide a more refined chicken CNV map and genome-wide gene copy number estimates, and warrant future CNV association studies for important traits in chickens.
Affiliation(s)
- Ning Yang
- Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China.
|
229
|
Gillet-Markowska A, Richard H, Fischer G, Lafontaine I. Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries. Bioinformatics 2014; 31:801-8. [PMID: 25380961 DOI: 10.1093/bioinformatics/btu730] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION The detection of structural variations (SVs) in short-range Paired-End (PE) libraries remains challenging because SV breakpoints can involve large dispersed repeated sequences, or carry inherent complexity, hardly resolvable with classical PE sequencing data. In contrast, large insert-size sequencing libraries (Mate-Pair libraries) provide higher physical coverage of the genome and give access to repeat-containing regions. They can thus theoretically overcome previous limitations as they are becoming routinely accessible. Nevertheless, broad insert size distributions and high rates of chimeric sequences are usually associated with this type of library, which makes the accurate annotation of SVs challenging. RESULTS Here, we present Ulysses, a tool that achieves drastically higher detection accuracy than existing tools, both on simulated and real mate-pair sequencing datasets from the 1000 Genomes Project. Ulysses achieves high specificity over the complete spectrum of variants by assessing, in a principled manner, the statistical significance of each possible variant (duplications, deletions, translocations, insertions and inversions) against an explicit model for the generation of experimental noise. This statistical model proves particularly useful for the detection of low frequency variants. SV detection performed on a large insert Mate-Pair library from a breast cancer sample revealed a high level of somatic duplications in the tumor and, to a lesser extent, in the blood sample as well. Altogether, these results show that Ulysses is a valuable tool for the characterization of somatic mosaicism in human tissues and in cancer genomes.
Affiliation(s)
- Alexandre Gillet-Markowska
- Sorbonne Universités, UPMC University Paris 06, UMR 7238, Biologie Computationnelle et Quantitative and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, F-75005 Paris, France
| | - Hugues Richard
- Sorbonne Universités, UPMC University Paris 06, UMR 7238, Biologie Computationnelle et Quantitative and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, F-75005 Paris, France
| | - Gilles Fischer
- Sorbonne Universités, UPMC University Paris 06, UMR 7238, Biologie Computationnelle et Quantitative and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, F-75005 Paris, France
| | - Ingrid Lafontaine
- Sorbonne Universités, UPMC University Paris 06, UMR 7238, Biologie Computationnelle et Quantitative and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, F-75005 Paris, France
|
230
|
Guo Y, Zhao S, Lehmann BD, Sheng Q, Shaver TM, Stricker TP, Pietenpol JA, Shyr Y. Detection of internal exon deletion with ExonDel. BMC Bioinformatics 2014; 15:332. [PMID: 25322818 PMCID: PMC4288651 DOI: 10.1186/1471-2105-15-332] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 03/04/2014] [Accepted: 08/20/2014] [Indexed: 01/15/2023] Open
Abstract
Background Exome sequencing allows researchers to study the human genome in unprecedented detail. Among the many types of variants detectable through exome sequencing, one of the most overlooked types of mutation is internal deletion of exons. Internal exon deletions (IEDs) are the absence of consecutive exons in a gene. Such deletions have potentially significant biological meaning, and they are often too short to be considered copy number variation. Therefore, the need exists for efficient detection of such deletions using exome sequencing data. Results We present ExonDel, a tool specially designed to detect homozygous exon deletions efficiently. We tested ExonDel on exome sequencing data generated from 16 breast cancer cell lines and identified both novel and known IEDs. Subsequently, we verified our findings using RNAseq and PCR technologies. Further comparisons with multiple sequencing-based CNV tools showed that ExonDel is capable of detecting unique IEDs not found by other CNV tools. Conclusions ExonDel is an efficient way to screen for novel and known IEDs using exome sequencing data. ExonDel and its source code can be downloaded freely at https://github.com/slzhao/ExonDel.
Affiliation(s)
- Yan Guo
- Vanderbilt Ingram Cancer Center, Center for Quantitative Sciences, 2220 Pierce Ave, 549 Preston Research Building, Nashville, TN 37232, USA.
|
231
|
Magi A, Tattini L, Cifola I, D'Aurizio R, Benelli M, Mangano E, Battaglia C, Bonora E, Kurg A, Seri M, Magini P, Giusti B, Romeo G, Pippucci T, De Bellis G, Abbate R, Gensini GF. EXCAVATOR: detecting copy number variants from whole-exome sequencing data. Genome Biol 2014; 14:R120. [PMID: 24172663 PMCID: PMC4053953 DOI: 10.1186/gb-2013-14-10-r120] [Citation(s) in RCA: 183] [Impact Index Per Article: 18.3] [Received: 06/15/2013] [Accepted: 10/30/2013] [Indexed: 12/11/2022] Open
Abstract
We developed a novel software tool, EXCAVATOR, for the detection of copy number variants (CNVs) from whole-exome sequencing data. EXCAVATOR combines a three-step normalization procedure with a novel heterogeneous hidden Markov model algorithm and a calling method that classifies genomic regions into five copy number states. We validate EXCAVATOR on three datasets and compare the results with three other methods. These analyses show that EXCAVATOR outperforms the other methods and is therefore a valuable tool for the investigation of CNVs in large-scale projects, as well as in clinical research and diagnostics. EXCAVATOR is freely available at http://sourceforge.net/projects/excavatortool/.
|
232
|
Liu B, Morrison CD, Johnson CS, Trump DL, Qin M, Conroy JC, Wang J, Liu S. Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges. Oncotarget 2014; 4:1868-81. [PMID: 24240121 PMCID: PMC3875755 DOI: 10.18632/oncotarget.1537] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Indexed: 12/27/2022] Open
Abstract
Accurate detection of somatic copy number variations (CNVs) is an essential part of cancer genome analysis, and plays an important role in oncotarget identifications. Next generation sequencing (NGS) holds the promise to revolutionize somatic CNV detection. In this review, we provide an overview of current analytic tools used for CNV detection in NGS-based cancer studies. We summarize the NGS data types used for CNV detection, decipher the principles for data preprocessing, segmentation, and interpretation, and discuss the challenges in somatic CNV detection. This review aims to provide a guide to the analytic tools used in NGS-based cancer CNV studies, and to discuss the important factors that researchers need to consider when analyzing NGS data for somatic CNV detections.
Affiliation(s)
- Biao Liu
- Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY
|
233
|
Ji H, Lu J, Wang J, Li H, Lin X. Combined examination of sequence and copy number variations in human deafness genes improves diagnosis for cases of genetic deafness. BMC Ear, Nose and Throat Disorders 2014; 14:9. [PMID: 25342930 PMCID: PMC4194081 DOI: 10.1186/1472-6815-14-9] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 04/15/2014] [Accepted: 08/26/2014] [Indexed: 11/11/2022]
Abstract
Background Copy number variations (CNVs) are the major type of structural variation in the human genome, and are more common than DNA sequence variations in populations. CNVs are important factors for human genetic and phenotypic diversity. Many CNVs have been associated with either resistance to diseases or identified as the cause of diseases. Currently little is known about the role of CNVs in causing deafness. CNVs are currently not analyzed by conventional genetic analysis methods to study deafness. Here we detected both DNA sequence variations and CNVs affecting 80 genes known to be required for normal hearing. Methods Coding regions of the deafness genes were captured by a hybridization-based method and processed through the standard next-generation sequencing (NGS) protocol using the Illumina platform. Samples hybridized together in the same reaction were analyzed to obtain CNVs. A read depth based method was used to measure CNVs at the resolution of a single exon. Results were validated by the quantitative PCR (qPCR) based method. Results Among 79 sporadic cases clinically diagnosed with sensorineural hearing loss, we identified previously-reported disease-causing sequence mutations in 16 cases. In addition, we identified a total of 97 CNVs (72 CNV gains and 25 CNV losses) in 27 deafness genes. The CNVs included homozygous deletions which may directly give rise to deleterious effects on protein functions known to be essential for hearing, as well as heterozygous deletions and CNV gains compounded with sequence mutations in deafness genes that could potentially harm gene functions. Conclusions We studied how CNVs in known deafness genes may result in deafness. Data provided here served as a basis to explain how CNVs disrupt normal functions of deafness genes. These results may significantly expand our understanding about how various types of genetic mutations cause deafness in humans.
Affiliation(s)
- Haiting Ji
- Department of Otolaryngology, Eye & ENT Hospital, Fudan University, #83 Fenyang Road, Shanghai 200031, P.R China
| | - Jingqiao Lu
- Department of Otolaryngology, Emory University School of Medicine, 615 Michael Street, Atlanta, GA 30322-3030, USA
| | - Jianjun Wang
- Department of Otolaryngology, Emory University School of Medicine, 615 Michael Street, Atlanta, GA 30322-3030, USA
| | - Huawei Li
- Department of Otolaryngology, Eye & ENT Hospital, Fudan University, #83 Fenyang Road, Shanghai 200031, P.R China
| | - Xi Lin
- Department of Otolaryngology, Emory University School of Medicine, 615 Michael Street, Atlanta, GA 30322-3030, USA
|
234
|
Muñoz-Minjares J, Cabal-Aragón J, Shmaliy YS. Confidence masks for genome DNA copy number variations in applications to HR-CGH array measurements. Biomed Signal Process Control 2014. [DOI: 10.1016/j.bspc.2014.06.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
|
235
|
Kadalayil L, Rafiq S, Rose-Zerilli MJJ, Pengelly RJ, Parker H, Oscier D, Strefford JC, Tapper WJ, Gibson J, Ennis S, Collins A. Exome sequence read depth methods for identifying copy number changes. Brief Bioinform 2014; 16:380-92. [PMID: 25169955 DOI: 10.1093/bib/bbu027] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Received: 04/20/2014] [Accepted: 07/10/2014] [Indexed: 01/04/2023] Open
Abstract
Copy number variants (CNVs) play important roles in a number of human diseases and in pharmacogenetics. Powerful methods exist for CNV detection in whole genome sequencing (WGS) data, but such data are costly to obtain. Many disease causal CNVs span or are found in genome coding regions (exons), which makes CNV detection using whole exome sequencing (WES) data attractive. If reliably validated against WGS-based CNVs, exome-derived CNVs have potential applications in a clinical setting. Several algorithms have been developed to exploit exome data for CNV detection and comparisons made to find the most suitable methods for particular data samples. The results are not consistent across studies. Here, we review some of the exome CNV detection methods based on depth of coverage profiles and examine their performance to identify problems contributing to discrepancies in published results. We also present a streamlined strategy that uses a single metric, the likelihood ratio, to compare exome methods, and we demonstrated its utility using the VarScan 2 and eXome Hidden Markov Model (XHMM) programs using paired normal and tumour exome data from chronic lymphocytic leukaemia patients. We use array-based somatic CNV (SCNV) calls as a reference standard to compute prevalence-independent statistics, such as sensitivity, specificity and likelihood ratio, for validation of the exome-derived SCNVs. We also account for factors known to influence the performance of exome read depth methods, such as CNV size and frequency, while comparing our findings with published results.
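The prevalence-independent statistics mentioned in the abstract (sensitivity, specificity, likelihood ratio) can be computed from a confusion table of exome calls against the array-based reference. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper, which does not state whether the positive or the negative likelihood ratio is the one reported.

```python
def likelihood_ratios(tp, fp, tn, fn):
    """Return (sensitivity, specificity, LR+, LR-) from confusion counts.

    tp/fn: reference CNVs that the exome method did / did not call.
    fp/tn: reference-negative regions that were / were not called.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (fp + tn)
    # LR+ = sensitivity / (1 - specificity); infinite if there are no false positives.
    lr_positive = (
        sensitivity / false_positive_rate if false_positive_rate > 0 else float("inf")
    )
    # LR- = (1 - sensitivity) / specificity; infinite if specificity is zero.
    lr_negative = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return sensitivity, specificity, lr_positive, lr_negative


# Hypothetical counts for one method on one sample set.
sens, spec, lr_pos, lr_neg = likelihood_ratios(tp=90, fp=5, tn=95, fn=10)
```

Because both ratios depend only on sensitivity and specificity, they can be compared across data sets in which the fraction of true CNV regions differs.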
|
236
|
Duan J, Zhang JG, Wan M, Deng HW, Wang YP. Population clustering based on copy number variations detected from next generation sequencing data. J Bioinform Comput Biol 2014; 12:1450021. [PMID: 25152046 DOI: 10.1142/s0219720014500218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Copy number variations (CNVs) can be used as significant bio-markers and next generation sequencing (NGS) provides a high resolution detection of these CNVs. But how to extract features from CNVs and further apply them to genomic studies such as population clustering has become a big challenge. In this paper, we propose a novel method for population clustering based on CNVs from NGS. First, CNVs are extracted from each sample to form a feature matrix. Then, this feature matrix is decomposed into the source matrix and weight matrix with non-negative matrix factorization (NMF). The source matrix consists of common CNVs that are shared by all the samples from the same group, and the weight matrix indicates the corresponding level of CNVs from each sample. Therefore, using NMF of CNVs one can differentiate samples from different ethnic groups, i.e. population clustering. To validate the approach, we applied it to the analysis of both simulation data and two real data sets from the 1000 Genomes Project. The results on simulation data demonstrate that the proposed method can recover the true common CNVs with high quality. The results on the first real data analysis show that the proposed method can cluster two family trios with different ancestries into two ethnic groups and the results on the second real data analysis show that the proposed method can be applied to the whole-genome with large sample size consisting of multiple groups. Both results demonstrate the potential of the proposed method for population clustering.
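The decomposition described in the abstract can be illustrated with a small pure-Python sketch. It swaps in the classic multiplicative-update NMF for the Frobenius objective, which may differ from the authors' solver. The matrix orientation (samples in rows, genomic regions in columns, so that the weight matrix W is samples by groups and the source matrix H is groups by regions), the helper names, and the toy data are all assumptions made for illustration.

```python
import random


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (samples x regions) as W H."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Toy feature matrix: 4 samples x 6 regions, two groups with disjoint CNV regions.
V = [[2, 2, 2, 0, 0, 0],
     [1, 1, 1, 0, 0, 0],
     [0, 0, 0, 3, 3, 3],
     [0, 0, 0, 1, 1, 1]]
W, H = nmf(V, rank=2)
# One simple clustering rule: assign each sample to its largest weight.
labels = [row.index(max(row)) for row in W]
```

Assigning each sample to the component with the largest weight is only one possible clustering rule; the abstract does not say how group membership is read off the weight matrix.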
Affiliation(s)
- Junbo Duan
- Department of Biomedical Engineering, Xi'an Jiaotong University, Xi'an, P. R. China
|
237
|
Gillespie RL, O'Sullivan J, Ashworth J, Bhaskar S, Williams S, Biswas S, Kehdi E, Ramsden SC, Clayton-Smith J, Black GC, Lloyd IC. Personalized diagnosis and management of congenital cataract by next-generation sequencing. Ophthalmology 2014; 121:2124-37.e1-2. [PMID: 25148791 DOI: 10.1016/j.ophtha.2014.06.006] [Citation(s) in RCA: 122] [Impact Index Per Article: 12.2] [Received: 03/04/2014] [Revised: 05/02/2014] [Accepted: 06/04/2014] [Indexed: 12/21/2022] Open
Abstract
PURPOSE To assess the utility of integrating genomic data from next-generation sequencing and phenotypic data to enhance the diagnosis of bilateral congenital cataract (CC). DESIGN Evaluation of diagnostic technology. PARTICIPANTS Thirty-six individuals diagnosed with nonsyndromic or syndromic bilateral congenital cataract were selected for investigation through a single ophthalmic genetics clinic. METHODS Participants underwent a detailed ophthalmic examination, accompanied by dysmorphology assessment where appropriate. Lenticular, ocular, and systemic phenotypes were recorded. Mutations were detected using a custom-designed target enrichment that permitted parallel analysis of 115 genes associated with CC by high-throughput, next-generation DNA sequencing (NGS). Thirty-six patients and a known positive control were tested. Suspected pathogenic variants were confirmed by bidirectional Sanger sequencing in relevant probands and other affected family members. MAIN OUTCOME MEASURES Molecular genetic results and details of clinical phenotypes were identified. RESULTS Next-generation DNA sequencing technologies are able to determine the precise genetic cause of CC in 75% of individuals, and 85% of patients with nonsyndromic CC were found to have likely pathogenic mutations, all of which occurred in highly conserved domains known to be vital for normal protein function. The pick-up rate in patients with syndromic CC also was high, with 63% having potential disease-causing mutations. CONCLUSIONS This analysis demonstrates the clinical utility of this test, providing examples where it altered clinical management, directed care pathways, and enabled more accurate genetic counseling. This comprehensive screen will extend access to genetic testing and lead to improved diagnostic and management outcomes through a stratified medicine approach. Establishing more robust genotype-phenotype correlations will advance knowledge of cataract-forming mechanisms.
Affiliation(s)
- Rachel L Gillespie
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom
| | - James O'Sullivan
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom
| | - Jane Ashworth
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom; Manchester Royal Eye Hospital, Manchester Academic Health Science Centre, The University of Manchester, Central Manchester Foundation Trust, Manchester, United Kingdom
| | - Sanjeev Bhaskar
- Manchester Centre for Genomic Medicine, Central Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom
| | - Simon Williams
- Manchester Centre for Genomic Medicine, Central Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom
| | - Susmito Biswas
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom; Manchester Royal Eye Hospital, Manchester Academic Health Science Centre, The University of Manchester, Central Manchester Foundation Trust, Manchester, United Kingdom
| | - Elias Kehdi
- Manchester Royal Eye Hospital, Manchester Academic Health Science Centre, The University of Manchester, Central Manchester Foundation Trust, Manchester, United Kingdom
| | - Simon C Ramsden
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom; Manchester Centre for Genomic Medicine, Central Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom
| | - Jill Clayton-Smith
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom; Manchester Centre for Genomic Medicine, Central Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom
| | - Graeme C Black
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom; Manchester Centre for Genomic Medicine, Central Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom.
| | - I Christopher Lloyd
- Manchester Centre for Genomic Medicine, Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester Academic Health Science Centre, Saint Mary's Hospital, Manchester, United Kingdom; Manchester Royal Eye Hospital, Manchester Academic Health Science Centre, The University of Manchester, Central Manchester Foundation Trust, Manchester, United Kingdom
|
238
|
Pryszcz LP, Németh T, Gácser A, Gabaldón T. Unexpected genomic variability in clinical and environmental strains of the pathogenic yeast Candida parapsilosis. Genome Biol Evol 2014; 5:2382-92. [PMID: 24259314 PMCID: PMC3879973 DOI: 10.1093/gbe/evt185] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Indexed: 01/07/2023] Open
Abstract
Invasive candidiasis is the most commonly reported invasive fungal infection worldwide. Although Candida albicans remains the main cause, the incidence of emerging Candida species, such as C. parapsilosis is increasing. It has been postulated that C. parapsilosis clinical isolates result from a recent global expansion of a virulent clone. However, the availability of a single genome for this species has so far prevented testing this hypothesis at genomic scales. We present here the sequence of three additional strains from clinical and environmental samples. Our analyses reveal unexpected patterns of genomic variation, shared among distant strains, that argue against the clonal expansion hypothesis. All strains carry independent expansions involving an arsenite transporter homolog, pointing to the existence of directional selection in the environment, and independent origins of the two clinical isolates. Furthermore, we report the first evidence for the existence of recombination in this species. Altogether, our results shed new light onto the dynamics of genome evolution in C. parapsilosis.
Affiliation(s)
- Leszek P Pryszcz
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), Barcelona, Spain
|
239
|
Lighten J, van Oosterhout C, Bentzen P. Critical review of NGS analyses for de novo genotyping multigene families. Mol Ecol 2014; 23:3957-72. [DOI: 10.1111/mec.12843] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Received: 03/25/2014] [Revised: 06/08/2014] [Accepted: 06/17/2014] [Indexed: 01/16/2023]
Affiliation(s)
- Jackie Lighten
- Department of Biology; Marine Gene Probe Laboratory; Dalhousie University; Halifax Nova Scotia Canada
| | - Cock van Oosterhout
- School of Environmental Sciences; University of East Anglia; Norwich Research Park; Norwich UK
| | - Paul Bentzen
- Department of Biology; Marine Gene Probe Laboratory; Dalhousie University; Halifax Nova Scotia Canada
|
240
|
Tessereau C, Lesecque Y, Monnet N, Buisson M, Barjhoux L, Léoné M, Feng B, Goldgar DE, Sinilnikova OM, Mousset S, Duret L, Mazoyer S. Estimation of the RNU2 macrosatellite mutation rate by BRCA1 mutation tracing. Nucleic Acids Res 2014; 42:9121-30. [PMID: 25034697 PMCID: PMC4132748 DOI: 10.1093/nar/gku639] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/02/2022] Open
Abstract
Large tandem repeat sequences have been poorly investigated as severe technical limitations and their frequent absence from the genome reference hinder their analysis. Extensive allelotyping of this class of variation has not been possible until now and their mutational dynamics are still poorly known. In order to estimate the mutation rate of a macrosatellite, we analysed in detail the RNU2 locus, which displays at least 50 different alleles containing 5-82 copies of a 6.1 kb repeat unit. Mining data from the 1000 Genomes Project allowed us to precisely estimate copy numbers of the RNU2 repeat unit using read depth of coverage. This further revealed significantly different mean values in various recent modern human populations, favoring a scenario of fast evolution of this locus. Its proximity to a disease gene with numerous founder mutations, BRCA1, within the same linkage disequilibrium block, offered the unique opportunity to trace RNU2 arrays over a large timescale. Analysis of the transmission of RNU2 arrays associated with one ‘private’ mutation in an extended kindred and four founder mutations in multiple kindreds gave an estimation by maximum likelihood of 5 × 10⁻³ mutations per generation, which is close to that of microsatellites.
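The read-depth idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that reads from every repeat copy collapse onto a single reference unit, that a nearby single-copy region supplies the baseline depth, and that the sample is diploid. The function name and its parameters are hypothetical.

```python
def estimate_copy_number(repeat_depth, flank_depth, ploidy=2):
    """Estimate the total number of repeat units summed over all alleles.

    repeat_depth: mean read depth over the single repeat unit in the reference
    flank_depth:  mean read depth over a single-copy flanking region
    ploidy:       number of chromosome copies covering the flanking region
    """
    if flank_depth <= 0:
        raise ValueError("flank_depth must be positive")
    # Each haploid copy of a single-copy region contributes flank_depth / ploidy,
    # so the depth ratio scaled by ploidy gives the total unit count.
    return ploidy * repeat_depth / flank_depth


total_units = estimate_copy_number(repeat_depth=600, flank_depth=30)
print(total_units)  # 40.0 units summed over both alleles
```

Under these assumptions the estimate is a sum over both alleles; splitting it into per-allele array sizes would need additional information such as pedigree data.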
Affiliation(s)
- Chloé Tessereau
- Genetics of Breast Cancer Team, Cancer Research Centre of Lyon, CNRS UMR5286, Inserm U1052, Université Lyon 1, Centre Léon Bérard, Lyon, France; Genomic Vision, Bagneux, Paris, France
| | - Yann Lesecque
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR5558, Université Lyon 1, France
| | - Nastasia Monnet
- Genetics of Breast Cancer Team, Cancer Research Centre of Lyon, CNRS UMR5286, Inserm U1052, Université Lyon 1, Centre Léon Bérard, Lyon, France
| | - Monique Buisson
- Genetics of Breast Cancer Team, Cancer Research Centre of Lyon, CNRS UMR5286, Inserm U1052, Université Lyon 1, Centre Léon Bérard, Lyon, France
| | - Laure Barjhoux
- Genetics of Breast Cancer Team, Cancer Research Centre of Lyon, CNRS UMR5286, Inserm U1052, Université Lyon 1, Centre Léon Bérard, Lyon, France
| | - Mélanie Léoné
- Unité Mixte de Génétique Constitutionnelle des Cancers Fréquents, Hospices Civils de Lyon/Centre Léon Bérard, Lyon, France
| | - Bingjian Feng
- Department of Dermatology and Huntsman Cancer Institute University of Utah School of Medicine, Salt Lake City, Utah, USA
| | - David E Goldgar
- Department of Dermatology and Huntsman Cancer Institute University of Utah School of Medicine, Salt Lake City, Utah, USA
| | - Olga M Sinilnikova
- Genetics of Breast Cancer Team, Cancer Research Centre of Lyon, CNRS UMR5286, Inserm U1052, Université Lyon 1, Centre Léon Bérard, Lyon, France; Unité Mixte de Génétique Constitutionnelle des Cancers Fréquents, Hospices Civils de Lyon/Centre Léon Bérard, Lyon, France
| | - Sylvain Mousset
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR5558, Université Lyon 1, France
| | - Laurent Duret
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR5558, Université Lyon 1, France
| | - Sylvie Mazoyer
- Genetics of Breast Cancer Team, Cancer Research Centre of Lyon, CNRS UMR5286, Inserm U1052, Université Lyon 1, Centre Léon Bérard, Lyon, France
|
241
|
Trappe K, Emde AK, Ehrlich HC, Reinert K. Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone. Bioinformatics 2014; 30:3484-90. [PMID: 25028727 DOI: 10.1093/bioinformatics/btu431] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The landscape of structural variation (SV) including complex duplication and translocation patterns is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation and struggle to correctly classify the type and exact size of SVs. RESULTS: We present Gustaf (Generic mUlti-SpliT Alignment Finder), a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of ≥ 30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. We show that Gustaf correctly identifies SVs, especially in the range from 30 to 100 bp, which we call the next-generation sequencing (NGS) twilight zone of SVs, as well as larger SVs >500 bp. Gustaf performs better than similar tools in our benchmark and is furthermore able to correctly identify size and location of dispersed duplications and translocations, which otherwise might be wrongly classified, for example, as large deletions.
Affiliation(s)
- Kathrin Trappe
- Department of Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany and New York Genome Center, New York, NY 10013, USA
| | - Anne-Katrin Emde
- Department of Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany and New York Genome Center, New York, NY 10013, USA
| | - Hans-Christian Ehrlich
- Department of Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany and New York Genome Center, New York, NY 10013, USA
| | - Knut Reinert
- Department of Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany and New York Genome Center, New York, NY 10013, USA
|
242
|
Abstract
High-throughput DNA sequencing has revolutionized the study of cancer genomics with numerous discoveries that are relevant to cancer diagnosis and treatment. The latest sequencing and analysis methods have successfully identified somatic alterations, including single-nucleotide variants, insertions and deletions, copy-number aberrations, structural variants and gene fusions. Additional computational techniques have proved useful for defining the mutations, genes and molecular networks that drive diverse cancer phenotypes and that determine clonal architectures in tumour samples. Collectively, these tools have advanced the study of genomic, transcriptomic and epigenomic alterations in cancer, and their association to clinical properties. Here, we review cancer genomics software and the insights that have been gained from their application.
|
243
|
Petersen BS, Spehlmann ME, Raedler A, Stade B, Thomsen I, Rabionet R, Rosenstiel P, Schreiber S, Franke A. Whole genome and exome sequencing of monozygotic twins discordant for Crohn's disease. BMC Genomics 2014; 15:564. [PMID: 24996980 PMCID: PMC4102722 DOI: 10.1186/1471-2164-15-564] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 04/25/2014] [Accepted: 06/27/2014] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Crohn’s disease (CD) is an inflammatory bowel disease caused by genetic and environmental factors. More than 160 susceptibility loci have been identified for IBD, yet a large part of the genetic variance remains unexplained. Recent studies have demonstrated genetic differences between monozygotic twins, who were long thought to be genetically completely identical. RESULTS: We aimed to test if somatic mutations play a role in CD etiology by sequencing the genomes and exomes of directly affected tissue from the bowel and blood samples of one and the blood-derived exomes of two further monozygotic discordant twin pairs. Our goal was the identification of mutations present only in the affected twins, pointing to novel candidates for CD susceptibility loci. We present a thorough genetic characterization of the sequenced individuals but detected no consistent differences within the twin pairs. An estimate of the CD susceptibility based on known CD loci however hinted at a higher mutational load in all three twin pairs compared to 1,920 healthy individuals. CONCLUSION: Somatic mosaicism does not seem to play a role in the discordance of monozygotic CD twins. Our study constitutes the first to perform whole genome sequencing for CD twins and therefore provides a valuable reference dataset for future studies. We present an example framework for mosaicism detection and point to the challenges in these types of analyses.
Affiliation(s)
- Britt-Sabina Petersen
- Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Schittenhelmstrasse 12, 24105 Kiel, Germany.
|
244
|
Keane TM, Wong K, Adams DJ, Flint J, Reymond A, Yalcin B. Identification of structural variation in mouse genomes. Front Genet 2014; 5:192. [PMID: 25071822 PMCID: PMC4079067 DOI: 10.3389/fgene.2014.00192] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 03/14/2014] [Accepted: 06/12/2014] [Indexed: 01/25/2023] Open
Abstract
Structural variation is variation in structure of DNA regions affecting DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with the recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variation and their respective association to human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of molecular architecture and functional consequences of structural variation.
Affiliation(s)
| | - Kim Wong
- Wellcome Trust Sanger Institute Hinxton, Cambridge, UK
| | - David J Adams
- Wellcome Trust Sanger Institute Hinxton, Cambridge, UK
| | | | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne Lausanne, Switzerland
| | - Binnaz Yalcin
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland; Institute of Genetics and Molecular and Cellular Biology, Illkirch, France
|
245
|
Huang S, Holt J, Kao CY, McMillan L, Wang W. A novel multi-alignment pipeline for high-throughput sequencing data. Database (Oxford) 2014; 2014:bau057. [PMID: 24948510 PMCID: PMC4062837 DOI: 10.1093/database/bau057] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022]
Abstract
Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins. Database URL: http://csbio.unc.edu/CCstatus/index.py?run=Pseudo.
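The merge step described in this abstract (align to several founder-specific references, then combine and annotate each read's origin) can be sketched as follows. This is an illustrative simplification rather than the published pipeline: it assumes each alignment is summarised by a mismatch count, keeps the lowest count per read, and records every reference that ties for it. The function name and the input layout are hypothetical.

```python
def merge_alignments(per_reference):
    """Merge per-reference alignments into one best hit per read.

    per_reference maps a reference name to {read_id: mismatch_count}.
    Returns {read_id: (best_mismatch_count, [references achieving it])}.
    """
    merged = {}
    for ref_name, hits in per_reference.items():
        for read_id, mismatches in hits.items():
            best = merged.get(read_id)
            if best is None or mismatches < best[0]:
                # First hit for this read, or a strictly better one.
                merged[read_id] = (mismatches, [ref_name])
            elif mismatches == best[0]:
                # Tie: the read is equally compatible with this reference.
                best[1].append(ref_name)
    return merged


alignments = {
    "founderA": {"r1": 0, "r2": 2},
    "founderB": {"r1": 1, "r2": 2, "r3": 0},
}
for read_id, (mismatches, origins) in merge_alignments(alignments).items():
    print(read_id, mismatches, origins)
```

In this toy input, `r1` is assigned to `founderA`, `r3` is rescued because it only aligns to `founderB`, and `r2` ties between both references, so its origin stays ambiguous.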
Affiliation(s)
- Shunping Huang
- Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - James Holt
- Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Chia-Yu Kao
- Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Leonard McMillan
- Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Wei Wang
- Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, Department of Computer Science, University of California, Los Angeles, CA 90095, USA
|
246
|
Abstract
The chromothripsis hypothesis suggests an extraordinary one-step catastrophic genomic event allowing a chromosome to 'shatter into many pieces' and reassemble into a functioning chromosome. Recent efforts have aimed to detect chromothripsis by looking for a genomic signature, characterized by a large number of breakpoints (50-250), but a limited number of oscillating copy number states (2-3) confined to a few chromosomes. The chromothripsis phenomenon has become widely reported in different cancers, but using inconsistent and sometimes relaxed criteria for determining that rearrangements occur simultaneously rather than progressively. We revisit the original simulation approach and show that the signature is not clearly exceptional, and can be explained using only progressive rearrangements. For example, 3.9% of progressively simulated chromosomes with 50-55 breakpoints were dominated by two or three copy number states. In addition, by adjusting the parameters of the simulation, the proposed footprint appears more frequently. Lastly, we provide an algorithm to find a sequence of progressive rearrangements that explains all observed breakpoints from a proposed chromothripsis chromosome. Thus, the proposed signature cannot be considered a sufficient proof for this extraordinary hypothesis. Great caution should be exercised when labeling complex rearrangements as chromothripsis from genome hybridization and sequencing experiments.
Affiliation(s)
- Marcus Kinsella
- Bioinformatics and Systems Biology Program, University of California, San Diego, CA, USA
| | - Anand Patel
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
|
247
|
Ekblom R, Smeds L, Ellegren H. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics 2014; 15:467. [PMID: 24923674 PMCID: PMC4070552 DOI: 10.1186/1471-2164-15-467] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 01/17/2014] [Accepted: 06/09/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Genome and transcriptome sequencing applications that rely on variation in sequence depth can be negatively affected if there are systematic biases in coverage. We have investigated patterns of local variation in sequencing coverage by utilising ultra-deep sequencing (>100,000X) of mtDNA obtained during sequencing of two vertebrate genomes, wolverine (Gulo gulo) and collared flycatcher (Ficedula albicollis). With such extreme depth, stochastic variation in coverage should be negligible, which allows us to provide a very detailed, fine-scale picture of sequence dependent coverage variation and sequencing error rates. RESULTS: Sequencing coverage showed up to six-fold variation across the complete mtDNA and this variation was highly repeatable in sequencing of multiple individuals of the same species. Moreover, coverage in orthologous regions was correlated between the two species and was negatively correlated with GC content. We also found a negative correlation between the site-specific sequencing error rate and coverage, with certain sequence motifs "CCNGCC" being particularly prone to high rates of error and low coverage. CONCLUSIONS: Our results demonstrate that inherent sequence characteristics govern variation in coverage and suggest that some of this variation, like GC content, should be controlled for in, for example, RNA-Seq and detection of copy number variation.
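One common way to control for GC content in depth-based analyses, as this abstract recommends, is to rescale each genomic bin by the median depth of bins with similar GC. The sketch below shows that generic median-ratio approach; it is not taken from the paper, and the function name, the `(gc_fraction, depth)` input layout and the 5% stratum width are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(bins, gc_step=0.05):
    """Rescale per-bin depth so every GC stratum shares the global median.

    bins: list of (gc_fraction, depth) tuples, one per genomic window.
    Returns corrected depths in the same order as the input.
    """
    # Group depths by GC stratum (windows with similar GC content).
    strata = defaultdict(list)
    for gc, depth in bins:
        strata[round(gc / gc_step)].append(depth)

    global_median = median(depth for _, depth in bins)
    stratum_median = {key: median(values) for key, values in strata.items()}

    corrected = []
    for gc, depth in bins:
        m = stratum_median[round(gc / gc_step)]
        # Guard against strata with zero median depth.
        corrected.append(depth * global_median / m if m > 0 else 0.0)
    return corrected


windows = [(0.30, 100), (0.30, 120), (0.60, 50), (0.60, 60)]
print(gc_normalize(windows))
```

In the toy input the GC-rich windows have half the raw depth of the GC-poor ones; after correction both strata are centred on the same value, so remaining differences are more likely to reflect copy number than GC bias.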
Affiliation(s)
- Robert Ekblom
- Department of Ecology and Genetics, Uppsala University, Uppsala SE-75236, Sweden.
|
248
|
Exon expression QTL (eeQTL) analysis highlights distant genomic variations associated with splicing regulation. Quant Biol 2014. [DOI: 10.1007/s40484-014-0031-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
249
|
Teer JK. An improved understanding of cancer genomics through massively parallel sequencing. Transl Cancer Res 2014; 3:243-259. [PMID: 26146607 PMCID: PMC4486294 DOI: 10.3978/j.issn.2218-676x.2014.05.05] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/21/2022]
Abstract
DNA sequencing technology advances have enabled genetic investigation of more samples in a shorter time than has previously been possible. Furthermore, the ability to analyze and understand large sequencing datasets has improved due to concurrent advances in sequence data analysis methods and software tools. Constant improvements to both technology and analytic approaches in this fast moving field are evidenced by many recent publications of computational methods, as well as biological results linking genetic events to human disease. Cancer in particular has been the subject of intense investigation, owing to the genetic underpinnings of this complex collection of diseases. New massively-parallel sequencing (MPS) technologies have enabled the investigation of thousands of samples, divided across tens of different tumor types, resulting in new driver gene identification, mutagenic pattern characterization, and other newly uncovered features of tumor biology. This review will focus both on methods and recent results: current analytical approaches to DNA and RNA sequencing will be presented followed by a review of recent pan-cancer sequencing studies. This overview of methods and results will not only highlight the recent advances in cancer genomics, but also the methods and tools used to accomplish these advancements in a constantly and rapidly improving field.
Affiliation(s)
- Jamie K Teer
- H. Lee Moffitt Cancer Center and Research Institute, 12902 Magnolia Dr., Tampa, FL 33612, Tel: 813-745-2650
|
250
|
Yu Z, Liu Y, Shen Y, Wang M, Li A. CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data. Bioinformatics 2014; 30:2576-83. [PMID: 24845652 PMCID: PMC4155249 DOI: 10.1093/bioinformatics/btu346] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Indexed: 12/26/2022]
Abstract
Motivation: Whole-genome sequencing of tumor samples has been demonstrated as an efficient approach for comprehensive analysis of genomic aberrations in cancer genome. Critical issues such as tumor impurity and aneuploidy, GC-content and mappability bias have been reported to complicate identification of copy number alteration and loss of heterozygosity in complex tumor samples. Therefore, efficient computational methods are required to address these issues. Results: We introduce CLImAT (CNA and LOH Assessment in Impure and Aneuploid Tumors), a bioinformatics tool for identification of genomic aberrations from tumor samples using whole-genome sequencing data. Without requiring a matched normal sample, CLImAT takes integrated analysis of read depth and allelic frequency and provides extensive data processing procedures including GC-content and mappability correction of read depth and quantile normalization of B-allele frequency. CLImAT accurately identifies copy number alteration and loss of heterozygosity even for highly impure tumor samples with aneuploidy. We evaluate CLImAT on both simulated and real DNA sequencing data to demonstrate its ability to infer tumor impurity and ploidy and identify genomic aberrations in complex tumor samples. Availability and implementation: The CLImAT software package can be freely downloaded at http://bioinformatics.ustc.edu.cn/CLImAT/. Contact: aoli@ustc.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhenhua Yu
- School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| | - Yuanning Liu
- School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| | - Yi Shen
- School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| | - Minghui Wang
- School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| | - Ao Li
- School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
|