1. Zhu K, Jones MG, Luebeck J, Bu X, Yi H, Hung KL, Wong ITL, Zhang S, Mischel PS, Chang HY, Bafna V. CoRAL accurately resolves extrachromosomal DNA genome structures with long-read sequencing. bioRxiv 2024:2024.02.15.580594. [PMID: 38405779 PMCID: PMC10888815 DOI: 10.1101/2024.02.15.580594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Extrachromosomal DNA (ecDNA) is a central mechanism for focal oncogene amplification in cancer, occurring in approximately 15% of early-stage cancers and 30% of late-stage cancers. EcDNAs drive tumor formation, evolution, and drug resistance by dynamically modulating oncogene copy number and rewiring gene-regulatory networks. Elucidating the genomic architecture of ecDNA amplifications is critical for understanding tumor pathology and developing more effective therapies. Paired-end short-read (Illumina) sequencing and mapping have been used to represent ecDNA amplifications as a breakpoint graph, where the inferred architecture of ecDNA is encoded as a cycle in the graph. Traversals of the breakpoint graph have been used to successfully predict ecDNA presence in cancer samples. However, short-read technologies are intrinsically limited in the identification of breakpoints, the phasing of complex rearrangements and internal duplications, and the deconvolution of cell-to-cell heterogeneity of ecDNA structures. Long-read technologies, such as those from Oxford Nanopore Technologies, have the potential to improve inference because the longer reads are better at mapping structural variants and are more likely to span rearranged or duplicated regions. Here, we propose CoRAL (Complete Reconstruction of Amplifications with Long reads) for reconstructing ecDNA architectures using long-read data. CoRAL reconstructs likely cyclic architectures using quadratic programming that simultaneously optimizes parsimony of reconstruction, explained copy number, and consistency of long-read mapping. CoRAL substantially improves reconstructions in extensive simulations and 9 datasets from previously characterized cell lines as compared to previous short-read-based tools. As long-read usage becomes widespread, we anticipate that CoRAL will be a valuable tool for profiling the landscape and evolution of focal amplifications in tumors.
Affiliation(s)
- Kaiyuan Zhu
  - Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
  - These authors contributed equally to this work
- Matthew G. Jones
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
  - These authors contributed equally to this work
- Jens Luebeck
  - Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
- Xinxin Bu
  - Bioinformatics Undergraduate Program, School of Biological Sciences, UC San Diego, La Jolla, CA, USA
- Hyerim Yi
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- King L. Hung
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Ivy Tsz-Lo Wong
  - Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
  - Sarafan Chemistry, Engineering, and Medicine for Human Health (Sarafan ChEM-H), Stanford University, Stanford, CA, USA
- Shu Zhang
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Paul S. Mischel
  - Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
  - Sarafan Chemistry, Engineering, and Medicine for Human Health (Sarafan ChEM-H), Stanford University, Stanford, CA, USA
- Howard Y. Chang
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Howard Hughes Medical Institute, Stanford University, Stanford, CA, USA
- Vineet Bafna
  - Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
  - Halicioglu Data Science Institute, UC San Diego, La Jolla, CA, USA
2. Alfayyadh MM, Maksemous N, Sutherland HG, Lea RA, Griffiths LR. Unravelling the Genetic Landscape of Hemiplegic Migraine: Exploring Innovative Strategies and Emerging Approaches. Genes (Basel) 2024; 15:443. [PMID: 38674378 PMCID: PMC11049430 DOI: 10.3390/genes15040443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024] [Open Access]
Abstract
Migraine is a severe, debilitating neurovascular disorder. Hemiplegic migraine (HM) is a rare and debilitating neurological condition with a strong genetic basis. Sequencing technologies have improved the diagnosis and our understanding of the molecular pathophysiology of HM. Linkage analysis and sequencing studies in HM families have identified pathogenic variants in ion channels and related genes, including CACNA1A, ATP1A2, and SCN1A, that cause HM. However, approximately 75% of HM patients are negative for these mutations, indicating there are other genes involved in disease causation. In this review, we explored our current understanding of the genetics of HM. The evidence presented herein summarises the current knowledge of the genetics of HM, which can be expanded further to explain the remaining heritability of this debilitating condition. Innovative bioinformatics and computational strategies to cover the entire genetic spectrum of HM are also discussed in this review.
Affiliation(s)
- Lyn R. Griffiths
  - Centre for Genomics and Personalised Health, Genomics Research Centre, School of Biomedical Sciences, Queensland University of Technology (QUT), Brisbane, QLD 4059, Australia; (M.M.A.); (N.M.); (H.G.S.); (R.A.L.)
3. Louw N, Carstens N, Lombard Z. Incorporating CNV analysis improves the yield of exome sequencing for rare monogenic disorders-an important consideration for resource-constrained settings. Front Genet 2023; 14:1277784. [PMID: 38155715 PMCID: PMC10753787 DOI: 10.3389/fgene.2023.1277784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 11/22/2023] [Indexed: 12/30/2023] [Open Access]
Abstract
Exome sequencing (ES) is a recommended first-tier diagnostic test for many rare monogenic diseases. It allows for the detection of both single-nucleotide variants (SNVs) and copy number variants (CNVs) in coding exonic regions of the genome in a single test, and this dual analysis is a valuable approach, especially in limited-resource settings. SNVs are well studied; however, CNV analysis tools have not yet been incorporated into variant calling pipelines as part of routine diagnostic testing, and chromosomal microarray is still more widely used to detect CNVs. Research shows that combined SNV and CNV analysis can lead to a diagnostic yield of up to 58%, increasing the yield by as much as 18% over an SNV-only pipeline. Importantly, this is achieved at the cost of additional computation only, without incurring any additional sequencing costs. This mini review provides an overview of CNV analysis from exome data and of the current recommendations for this type of analysis. We also present an overview of standard practices in rare monogenic disease research in resource-limited settings. We present evidence that integrating CNV detection tools into a standard exome sequencing analysis pipeline improves diagnostic yield and should be considered a significantly beneficial addition, with relatively low cost implications. Routine implementation in underrepresented populations and limited-resource settings will promote generation and sharing of CNV datasets and provide momentum to build core centers for this niche within genomic medicine.
Affiliation(s)
- Nadja Louw
  - Division of Human Genetics, National Health Laboratory Service and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Nadia Carstens
  - Division of Human Genetics, National Health Laboratory Service and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
  - Genomics Platform, South African Medical Research Council, Cape Town, South Africa
- Zané Lombard
  - Division of Human Genetics, National Health Laboratory Service and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
4. Choo ZN, Behr JM, Deshpande A, Hadi K, Yao X, Tian H, Takai K, Zakusilo G, Rosiene J, Da Cruz Paula A, Weigelt B, Setton J, Riaz N, Powell SN, Busam K, Shoushtari AN, Ariyan C, Reis-Filho J, de Lange T, Imieliński M. Most large structural variants in cancer genomes can be detected without long reads. Nat Genet 2023; 55:2139-2148. [PMID: 37945902 PMCID: PMC10703688 DOI: 10.1038/s41588-023-01540-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 06/01/2021] [Accepted: 09/19/2023] [Indexed: 11/12/2023]
Abstract
Short-read sequencing is the workhorse of cancer genomics yet is thought to miss many structural variants (SVs), particularly large chromosomal alterations. To characterize missing SVs in short-read whole genomes, we analyzed 'loose ends'-local violations of mass balance between adjacent DNA segments. In the landscape of loose ends across 1,330 high-purity cancer whole genomes, most large (>10-kb) clonal SVs were fully resolved by short reads in the 87% of the human genome where copy number could be reliably measured. Some loose ends represent neotelomeres, which we propose as a hallmark of the alternative lengthening of telomeres phenotype. These pan-cancer findings were confirmed by long-molecule profiles of 38 breast cancer and melanoma cases. Our results indicate that aberrant homologous recombination is unlikely to drive the majority of large cancer SVs. Furthermore, analysis of mass balance in short-read whole genome data provides a surprisingly complete picture of cancer chromosomal structure.
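The "mass balance" idea in this abstract can be illustrated with a deliberately simplified sketch. It assumes integer copy numbers for two adjacent segments and known copy numbers for the variant junctions attached at their shared boundary; the function name and parameters are hypothetical and this is not the authors' implementation, which fits copy numbers and junctions jointly across the whole genome graph.

```python
def loose_ends(cn_left, cn_right, jcn_left=0, jcn_right=0):
    """Return (loose_left, loose_right) for the boundary between two adjacent segments.

    cn_left / cn_right  : integer copy number of the left and right segment
    jcn_left / jcn_right: total copy number of variant junctions attached to the
                          right end of the left segment / left end of the right segment
    (All names are illustrative assumptions.)
    """
    # Copies of each segment end that are not explained by a variant junction.
    free_left = cn_left - jcn_left
    free_right = cn_right - jcn_right
    if free_left < 0 or free_right < 0:
        raise ValueError("junction copy number exceeds segment copy number")
    # At most this many copies can simply continue along the reference.
    reference_copies = min(free_left, free_right)
    # Whatever remains on either side violates mass balance: a loose end.
    return free_left - reference_copies, free_right - reference_copies


# A copy-number drop from 4 to 2 explained by a junction of copy number 2: balanced.
print(loose_ends(4, 2, jcn_left=2))  # (0, 0)
# The same drop with no junction detected: two unexplained copies on the left side.
print(loose_ends(4, 2))              # (2, 0)
```

In this toy model a nonzero result marks a copy-number change that no detected junction accounts for, which is the kind of unexplained breakend the abstract calls a loose end.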
Affiliation(s)
- Zi-Ning Choo
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Tri-institutional MD PhD Program, Weill Cornell Medicine, New York, NY, USA
  - Physiology and Biophysics PhD Program, Weill Cornell Medicine, New York, NY, USA
- Julie M Behr
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Tri-institutional PhD Program in Computational Biology and Medicine, New York, NY, USA
- Aditya Deshpande
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Tri-institutional PhD Program in Computational Biology and Medicine, New York, NY, USA
- Kevin Hadi
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Physiology and Biophysics PhD Program, Weill Cornell Medicine, New York, NY, USA
- Xiaotong Yao
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Tri-institutional PhD Program in Computational Biology and Medicine, New York, NY, USA
- Huasong Tian
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
- Kaori Takai
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
- George Zakusilo
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
- Joel Rosiene
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
- Britta Weigelt
  - Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Jeremy Setton
  - Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Nadeem Riaz
  - Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Simon N Powell
  - Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Klaus Busam
  - Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Titia de Lange
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
- Marcin Imieliński
  - New York Genome Center, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
  - Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
  - Department of Pathology, NYU Grossman School of Medicine, New York, NY, USA
5. Pajuste FD, Remm M. GeneToCN: an alignment-free method for gene copy number estimation directly from next-generation sequencing reads. Sci Rep 2023; 13:17765. [PMID: 37853040 PMCID: PMC10584998 DOI: 10.1038/s41598-023-44636-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 10/10/2023] [Indexed: 10/20/2023] [Open Access]
Abstract
Genomes exhibit large regions with segmental copy number variation, many of which include entire genes and are multiallelic. We have developed a computational method GeneToCN that counts the frequencies of gene-specific k-mers in FASTQ files and uses this information to infer copy number of the gene. We validated the copy number predictions for amylase genes (AMY1, AMY2A, AMY2B) using experimental data from digital droplet PCR (ddPCR) on 39 individuals and observed a strong correlation (R = 0.99) between GeneToCN predictions and experimentally determined copy numbers. An additional validation on FCGR3 genes showed a higher concordance for FCGR3A compared to two other methods, but reduced accuracy for FCGR3B. We further tested the method on three different genomic regions (SMN, NPY4R, and LPA Kringle IV-2 domain). Predicted copy number distributions of these genes in a set of 500 individuals from the Estonian Biobank were in good agreement with the previously published studies. In addition, we investigated the possibility to use GeneToCN on sequencing data generated by different technologies by comparing copy number predictions from Illumina, PacBio, and Oxford Nanopore data of the same sample. Despite the differences in variability of k-mer frequencies, all three sequencing technologies give similar predictions with GeneToCN.
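The alignment-free idea described in this abstract, counting gene-specific k-mers in raw reads and converting their frequency into a copy number, can be sketched as follows. This is a minimal illustration, not the GeneToCN implementation: the function names are hypothetical, and normalising against a separate set of k-mers assumed to come from a diploid single-copy region is an assumption made here for the example. Reverse-complement k-mers and sequencing errors are ignored.

```python
from statistics import median


def count_target_kmers(reads, targets, k):
    """Count how often each k-mer in `targets` occurs across all reads."""
    counts = dict.fromkeys(targets, 0)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


def estimate_copy_number(reads, gene_kmers, reference_kmers, k, ploidy=2):
    """Estimate gene copy number from k-mer depth (illustrative sketch).

    gene_kmers      : k-mers assumed unique to the gene of interest
    reference_kmers : k-mers assumed to lie in a region present at `ploidy` copies
    """
    gene_depth = median(count_target_kmers(reads, gene_kmers, k).values())
    ref_depth = median(count_target_kmers(reads, reference_kmers, k).values())
    if ref_depth == 0:
        raise ValueError("no coverage on reference k-mers")
    # Depth ratio scaled by the assumed copy number of the reference region.
    return ploidy * gene_depth / ref_depth


# Toy data: the gene k-mer is seen 6 times, the reference k-mer 4 times.
toy_reads = ["ACGT"] * 6 + ["TTGA"] * 4
print(estimate_copy_number(toy_reads, {"ACG"}, {"TTG"}, k=3))  # 3.0
```

Using the median rather than the mean keeps a few repetitive or error-prone k-mers from dominating the estimate; real k-mer lengths would be much longer than the `k=3` used for this toy input.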
Affiliation(s)
- Fanny-Dhelia Pajuste
  - Institute of Molecular and Cell Biology, University of Tartu, 23 Riia Str., 51010, Tartu, Estonia
- Maido Remm
  - Institute of Molecular and Cell Biology, University of Tartu, 23 Riia Str., 51010, Tartu, Estonia
6. Mielczarek M, Frąszczak M, Zielak-Steciwko AE, Nowak B, Hofman B, Pierścińska J, Kruszyński W, Szyda J. An effect of large-scale deletions and duplications on transcript expression. Funct Integr Genomics 2022; 23:19. [PMID: 36564645 PMCID: PMC9789009 DOI: 10.1007/s10142-022-00946-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Revised: 12/14/2022] [Accepted: 12/15/2022] [Indexed: 12/25/2022]
Abstract
Since copy number variants (CNVs) have been recognized as an important source of genetic and transcriptomic variation, we aimed to characterize the impact of CNVs located within coding, intergenic, upstream, and downstream gene regions on the expression of transcripts. Deletions occurred most often in introns, while duplications occurred most often in coding regions. Transcript expression was lower for deleted coding (P = 0.008) and intronic regions (P = 1.355 × 10⁻¹⁰), but it was not changed in the case of upstream and downstream gene regions (P = 0.085). Moreover, expression was decreased if a duplication occurred in the coding region (P = 8.318 × 10⁻⁵). Furthermore, a negative correlation (r = -0.27) between transcript length and its expression was observed. The correlation between the percentage of the transcript that was deleted or duplicated and the transcript expression level was not significant for any of the genomic regions considered in five out of six animals. The exceptions were deletions in coding regions (P = 0.004) and duplications in introns (P = 0.01) in one individual. CNVs in coding (deletions, duplications) and intronic (deletions) regions are important modulators of transcripts, reducing their expression level. We hypothesize that deletions have severe consequences because they interrupt genes. The negative correlation between the size of the transcript and its expression level found in this study is consistent with the hypothesis that selection favours shorter introns and a moderate number of exons in highly expressed genes. This may explain the reduction of transcript expression by duplications. We did not find a correlation between the size of deletions or duplications and transcript expression level, suggesting that expression is modulated by CNVs regardless of their size.
Affiliation(s)
- Magda Mielczarek
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Magdalena Frąszczak
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Anna E Zielak-Steciwko
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Błażej Nowak
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Bartłomiej Hofman
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Jagoda Pierścińska
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Wojciech Kruszyński
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
- Joanna Szyda
  - Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland
7. Rahman A, Medvedev P. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Res 2022; 32:1746-1753. [PMID: 35896425 PMCID: PMC9528984 DOI: 10.1101/gr.276601.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Accepted: 07/26/2022] [Indexed: 11/24/2022]
Abstract
Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from underassemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e., they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in underassembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low, then even error-free data result in unsafe unitigs; also, unitigs may unnecessarily split palindromes in half if special care is not taken. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.
Affiliation(s)
- Amatur Rahman
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
8. Wang T, Sun J, Zhang X, Wang WJ, Zhou Q. CNV-P: a machine-learning framework for predicting high confident copy number variations. PeerJ 2021; 9:e12564. [PMID: 34917425 PMCID: PMC8645205 DOI: 10.7717/peerj.12564] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/04/2021] [Accepted: 11/08/2021] [Indexed: 12/27/2022] [Open Access]
Abstract
Background: Copy-number variants (CNVs) have been recognized as one of the major causes of genetic disorders. Reliable detection of CNVs from genome sequencing data is in strong demand for disease research. However, current software for detecting CNVs has high false-positive rates and needs further improvement. Methods: Here, we propose a novel post-processing approach for CNV prediction (CNV-P), a machine-learning framework that can efficiently remove false-positive fragments from the results of CNV detection tools. A series of CNV signals around the putative CNV fragments, such as read depth (RD), split reads (SR) and read pairs (RP), were defined as features to train a classifier. Results: The prediction results on several real biological datasets showed that our models could accurately classify CNVs at over 90% precision and 85% recall, which greatly improves on the performance of state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different CNV sizes and sequencing platforms. Conclusions: Our framework for classifying high-confidence CNVs could improve both basic research and clinical diagnosis of genetic diseases.
Affiliation(s)
- Jinghua Sun
  - BGI-Shenzhen, Shenzhen, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Xiuqing Zhang
  - BGI-Shenzhen, Shenzhen, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China; Guangdong Enterprise Key Laboratory of Human Disease Genomics, Beishan Industrial Zone, Shenzhen, China
9. Yuan Y, Bayer PE, Batley J, Edwards D. Current status of structural variation studies in plants. Plant Biotechnol J 2021; 19:2153-2163. [PMID: 34101329 PMCID: PMC8541774 DOI: 10.1111/pbi.13646] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 02/18/2021] [Revised: 05/31/2021] [Accepted: 06/03/2021] [Indexed: 05/23/2023]
Abstract
Structural variations (SVs) including gene presence/absence variations and copy number variations are a common feature of genomes in plants and, together with single nucleotide polymorphisms and epigenetic differences, are responsible for the heritable phenotypic diversity observed within and between species. Understanding the contribution of SVs to plant phenotypic variation is important for plant breeders to assist in producing improved varieties. The low resolution of early genetic technologies and inefficient methods have previously limited our understanding of SVs in plants. However, with the rapid expansion in genomic technologies, it is possible to assess SVs with an ever-greater resolution and accuracy. Here, we review the current status of SV studies in plants, examine the roles that SVs play in phenotypic traits, compare current technologies and assess future challenges for SV studies.
Affiliation(s)
- Yuxuan Yuan
  - School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, WA, Australia
  - School of Life Sciences and State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Philipp E. Bayer
  - School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, WA, Australia
- Jacqueline Batley
  - School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, WA, Australia
- David Edwards
  - School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, WA, Australia
10. Karimi E, Mahmoudian F, Reyes SOL, Bargir UA, Madkaikar M, Artac H, Sabzevari A, Lu N, Azizi G, Abolhassani H. Approach to genetic diagnosis of inborn errors of immunity through next-generation sequencing. Mol Immunol 2021; 137:57-66. [PMID: 34216999 DOI: 10.1016/j.molimm.2021.06.018] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/03/2021] [Revised: 06/18/2021] [Accepted: 06/23/2021] [Indexed: 01/02/2023]
Abstract
Patients with inborn errors of immunity (IEI) present with heterogeneous clinical and immunological phenotypes; a correct molecular diagnosis is therefore crucial for classification and subsequent therapeutic management. On the other hand, IEI are a group of rare congenital diseases with highly diverse features and, in most cases, an as yet unknown genetic etiology. Next-generation sequencing has facilitated genetic examination of rare inherited disorders in recent years, thus allowing a suitable molecular diagnosis in IEI patients. This review aimed to investigate the current findings about these techniques in the field of IEI, suggesting an efficient stepwise approach to the molecular diagnosis of inborn errors of immunity.
Affiliation(s)
- Esmat Karimi
  - Department of Cellular and Molecular Medicine, College of Medicine, University of Arizona, Tucson, AZ, 85721, USA; Research Center for Immunodeficiencies, Pediatrics Center of Excellence, Children's Medical Center, Tehran University of Medical Science, Tehran, Iran
- Fatemeh Mahmoudian
  - Department of Molecular Medicine, School of Advanced Technologies in Medicine, Tehran University of Medical Sciences, Tehran, Iran
- Saul O Lugo Reyes
  - Immune Deficiencies Lab, National Institute of Pediatrics, Mexico City, Mexico
- Umair Ahmed Bargir
  - Department of Pediatric Immunology and Leukocyte Biology, ICMR-National Institute of Immunohaematology, Mumbai, India
- Manisha Madkaikar
  - Department of Pediatric Immunology and Leukocyte Biology, ICMR-National Institute of Immunohaematology, Mumbai, India
- Hasibe Artac
  - Department of Pediatric Immunology and Allergy, Faculty of Medicine, Selcuk University, Konya, Turkey
- Araz Sabzevari
  - CinnaGen Medical Biotechnology Research Center, Alborz University of Medical Sciences, Karaj, Iran
- Na Lu
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Gholamreza Azizi
  - Non-communicable Diseases Research Center, Alborz University of Medical Sciences, Karaj, Iran
- Hassan Abolhassani
  - Research Center for Immunodeficiencies, Pediatrics Center of Excellence, Children's Medical Center, Tehran University of Medical Science, Tehran, Iran; Division of Clinical Immunology, Department of Biosciences and Nutrition, Karolinska Institute, Stockholm, Sweden; Division of Clinical Immunology, Department of Laboratory Medicine, Karolinska Institute at Karolinska University Hospital Huddinge, Stockholm, Sweden
11. Hadi K, Yao X, Behr JM, Deshpande A, Xanthopoulakis C, Tian H, Kudman S, Rosiene J, Darmofal M, DeRose J, Mortensen R, Adney EM, Shaiber A, Gajic Z, Sigouros M, Eng K, Wala JA, Wrzeszczyński KO, Arora K, Shah M, Emde AK, Felice V, Frank MO, Darnell RB, Ghandi M, Huang F, Dewhurst S, Maciejowski J, de Lange T, Setton J, Riaz N, Reis-Filho JS, Powell S, Knowles DA, Reznik E, Mishra B, Beroukhim R, Zody MC, Robine N, Oman KM, Sanchez CA, Kuhner MK, Smith LP, Galipeau PC, Paulson TG, Reid BJ, Li X, Wilkes D, Sboner A, Mosquera JM, Elemento O, Imielinski M. Distinct Classes of Complex Structural Variation Uncovered across Thousands of Cancer Genome Graphs. Cell 2020; 183:197-210.e32. [PMID: 33007263 DOI: 10.1016/j.cell.2020.08.006] [Citation(s) in RCA: 123] [Impact Index Per Article: 41.0] [Received: 11/09/2019] [Revised: 04/08/2020] [Accepted: 08/03/2020] [Indexed: 12/12/2022]
Abstract
Cancer genomes often harbor hundreds of somatic DNA rearrangement junctions, many of which cannot be easily classified into simple (e.g., deletion) or complex (e.g., chromothripsis) structural variant classes. Applying a novel genome graph computational paradigm to analyze the topology of junction copy number (JCN) across 2,778 tumor whole-genome sequences, we uncovered three novel complex rearrangement phenomena: pyrgo, rigma, and tyfonas. Pyrgo are "towers" of low-JCN duplications associated with early-replicating regions, superenhancers, and breast or ovarian cancers. Rigma comprise "chasms" of low-JCN deletions enriched in late-replicating fragile sites and gastrointestinal carcinomas. Tyfonas are "typhoons" of high-JCN junctions and fold-back inversions associated with expressed protein-coding fusions, breakend hypermutation, and acral, but not cutaneous, melanomas. Clustering of tumors according to genome graph-derived features identified subgroups associated with DNA repair defects and poor prognosis.
Affiliation(s)
- Kevin Hadi
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA
| | - Xiaotong Yao
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA; Tri-institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10021, USA
| | - Julie M Behr
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA; Tri-institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10021, USA
| | - Aditya Deshpande
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA; Tri-institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10021, USA
| | | | - Huasong Tian
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA
- Sarah Kudman
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Joel Rosiene
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA
- Madison Darmofal
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA; Tri-institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Emily M Adney
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA
- Alon Shaiber
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Zoran Gajic
- New York Genome Center, New York, NY 10013, USA
- Michael Sigouros
- Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Kenneth Eng
- Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, USA
- Jeremiah A Wala
- Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Departments of Medical Oncology and Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02215, USA; School of Medicine, University of California, San Francisco, San Francisco, CA 94143, USA
- Minita Shah
- New York Genome Center, New York, NY 10013, USA
- Mayu O Frank
- New York Genome Center, New York, NY 10013, USA; Laboratory of Molecular Neuro-Oncology and Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
- Robert B Darnell
- New York Genome Center, New York, NY 10013, USA; Laboratory of Molecular Neuro-Oncology and Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
- Mahmoud Ghandi
- Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
- Franklin Huang
- Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; School of Medicine, University of California, San Francisco, San Francisco, CA 94143, USA
- Sally Dewhurst
- Laboratory of Cell Biology and Genetics, The Rockefeller University, New York, NY 10065, USA
- John Maciejowski
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Titia de Lange
- Laboratory of Cell Biology and Genetics, The Rockefeller University, New York, NY 10065, USA
- Jeremy Setton
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Nadeem Riaz
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA; Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA; Immunogenomics and Precision Oncology Platform, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Jorge S Reis-Filho
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA; Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Simon Powell
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- David A Knowles
- New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10027, USA
- Ed Reznik
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Bud Mishra
- Departments of Computer Science, Mathematics and Cell Biology, Courant Institute and NYU School of Medicine, New York University, New York, NY 10012, USA
- Rameen Beroukhim
- Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Departments of Medical Oncology and Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Kenji M Oman
- Divisions of Human Biology and Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Carissa A Sanchez
- Divisions of Human Biology and Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Mary K Kuhner
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Lucian P Smith
- Department of Bioengineering, University of Washington, Seattle, WA 98195, USA
- Patricia C Galipeau
- Divisions of Human Biology and Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Thomas G Paulson
- Divisions of Human Biology and Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Brian J Reid
- Divisions of Human Biology and Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Xiaohong Li
- Divisions of Human Biology and Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- David Wilkes
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Andrea Sboner
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, USA
- Juan Miguel Mosquera
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Olivier Elemento
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, USA
- Marcin Imielinski
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10021, USA; New York Genome Center, New York, NY 10013, USA; Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA; Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, USA.
|
12
|
Dentro SC, Leshchiner I, Haase K, Tarabichi M, Wintersinger J, Deshwar AG, Yu K, Rubanova Y, Macintyre G, Demeulemeester J, Vázquez-García I, Kleinheinz K, Livitz DG, Malikic S, Donmez N, Sengupta S, Anur P, Jolly C, Cmero M, Rosebrock D, Schumacher SE, Fan Y, Fittall M, Drews RM, Yao X, Watkins TBK, Lee J, Schlesner M, Zhu H, Adams DJ, McGranahan N, Swanton C, Getz G, Boutros PC, Imielinski M, Beroukhim R, Sahinalp SC, Ji Y, Peifer M, Martincorena I, Markowetz F, Mustonen V, Yuan K, Gerstung M, Spellman PT, Wang W, Morris QD, Wedge DC, Van Loo P. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell 2021; 184:2239-2254.e39. [PMID: 33831375 PMCID: PMC8054914 DOI: 10.1016/j.cell.2021.03.009] [Citation(s) in RCA: 229] [Impact Index Per Article: 76.3] [Received: 04/06/2020] [Revised: 09/21/2020] [Accepted: 03/03/2021] [Indexed: 02/07/2023]
Abstract
Intra-tumor heterogeneity (ITH) is a mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin, and drivers of ITH across cancer types are poorly understood. To address this, we extensively characterize ITH across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types. Nearly all informative samples (95.1%) contain evidence of distinct subclonal expansions with frequent branching relationships between subclones. We observe positive selection of subclonal driver mutations across most cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations as well as dynamic changes in mutational processes between subclonal expansions. Our results underline the importance of ITH and its drivers in tumor evolution and provide a pan-cancer resource of comprehensively annotated subclonal events from whole-genome sequencing data.
Affiliation(s)
- Stefan C Dentro
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK; Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK; Big Data Institute, University of Oxford, Oxford OX3 7LF, UK
- Kerstin Haase
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Maxime Tarabichi
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK; Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK
- Jeff Wintersinger
- University of Toronto, Toronto, ON M5S 3E1, Canada; Vector Institute, Toronto, ON M5G 1L7, Canada
- Amit G Deshwar
- University of Toronto, Toronto, ON M5S 3E1, Canada; Vector Institute, Toronto, ON M5G 1L7, Canada
- Kaixian Yu
- The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Yulia Rubanova
- University of Toronto, Toronto, ON M5S 3E1, Canada; Vector Institute, Toronto, ON M5G 1L7, Canada
- Geoff Macintyre
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge CB2 0RE, UK
- Jonas Demeulemeester
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK; Department of Human Genetics, University of Leuven, 3000 Leuven, Belgium
- Ignacio Vázquez-García
- Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK; University of Cambridge, Cambridge CB2 0QQ, UK; Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA; Irving Institute for Cancer Dynamics, Columbia University, New York, NY 10027, USA
- Kortine Kleinheinz
- German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany; Heidelberg University, 69120 Heidelberg, Germany
- Salem Malikic
- Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, MD 20892, USA
- Nilgun Donmez
- Simon Fraser University, Burnaby, BC V5A 1S6, Canada; Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Pavana Anur
- Molecular and Medical Genetics, Oregon Health & Science University, Portland, OR 97231, USA
- Clemency Jolly
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Marek Cmero
- University of Melbourne, Melbourne, VIC 3010, Australia; Walter + Eliza Hall Institute, Melbourne, VIC 3000, Australia
- Yu Fan
- The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Matthew Fittall
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Ruben M Drews
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge CB2 0RE, UK
- Xiaotong Yao
- Weill Cornell Medicine, New York, NY 10065, USA; New York Genome Center, New York, NY 10013, USA
- Thomas B K Watkins
- Cancer Evolution and Genome Instability Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Juhee Lee
- University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Hongtu Zhu
- The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- David J Adams
- Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK
- Nicholas McGranahan
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London WC1E 6BT, UK; Cancer Genome Evolution Research Group, University College London Cancer Institute, London WC1E 6DD, UK
- Charles Swanton
- Cancer Evolution and Genome Instability Laboratory, The Francis Crick Institute, London NW1 1AT, UK; Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London WC1E 6BT, UK; Department of Medical Oncology, University College London Hospitals, London NW1 2BU, UK
- Gad Getz
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Massachusetts General Hospital Center for Cancer Research, Charlestown, MA 02129, USA; Massachusetts General Hospital, Department of Pathology, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02215, USA
- Paul C Boutros
- University of Toronto, Toronto, ON M5S 3E1, Canada; Ontario Institute for Cancer Research, Toronto, ON M5G 0A3, Canada; University of California, Los Angeles, Los Angeles, CA 90095, USA
- Marcin Imielinski
- Weill Cornell Medicine, New York, NY 10065, USA; New York Genome Center, New York, NY 10013, USA
- Rameen Beroukhim
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Dana-Farber Cancer Institute, Boston, MA 02215, USA
- S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, MD 20892, USA
- Yuan Ji
- NorthShore University HealthSystem, Evanston, IL 60201, USA; The University of Chicago, Chicago, IL 60637, USA
- Martin Peifer
- Department of Translational Genomics, Center for Integrated Oncology Cologne-Bonn, Medical Faculty, University of Cologne, 50931 Cologne, Germany
- Florian Markowetz
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge CB2 0RE, UK
- Ville Mustonen
- Organismal and Evolutionary Biology Research Programme, Department of Computer Science, Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Ke Yuan
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge CB2 0RE, UK; School of Computing Science, University of Glasgow, Glasgow G12 8RZ, UK
- Moritz Gerstung
- Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge CB10 1SD, UK; European Molecular Biology Laboratory, Genome Biology Unit, 69117 Heidelberg, Germany
- Paul T Spellman
- Molecular and Medical Genetics, Oregon Health & Science University, Portland, OR 97231, USA
- Wenyi Wang
- The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Quaid D Morris
- University of Toronto, Toronto, ON M5S 3E1, Canada; Vector Institute, Toronto, ON M5G 1L7, Canada; Ontario Institute for Cancer Research, Toronto, ON M5G 0A3, Canada; Computational and Systems Biology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- David C Wedge
- Big Data Institute, University of Oxford, Oxford OX3 7LF, UK; Oxford NIHR Biomedical Research Centre, Oxford OX4 2PG, UK; Manchester Cancer Research Centre, University of Manchester, Manchester M20 4GJ, UK
- Peter Van Loo
- Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, UK.
|
13
|
Yuan X, Yu J, Xi J, Yang L, Shang J, Li Z, Duan J. CNV_IFTV: An Isolation Forest and Total Variation-Based Detection of CNVs from Short-Read Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:539-549. [PMID: 31180897 DOI: 10.1109/tcbb.2019.2920889] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 06/09/2023]
Abstract
Accurate detection of copy number variations (CNVs) from short-read sequencing data is challenging due to the uneven distribution of reads and the unbalanced amplitudes of gains and losses. The direct use of read depths to measure CNVs tends to limit performance. Thus, robust computational approaches equipped with appropriate statistics are required to detect CNV regions and boundaries. This study proposes a new method called CNV_IFTV to address this need. CNV_IFTV assigns an anomaly score to each genome bin through a collection of isolation trees. The trees are trained with the isolation forest algorithm on subsamples of the measured read depths. With the anomaly scores, CNV_IFTV uses a total variation model to smooth adjacent bins, leading to a denoised score profile. Finally, a statistical model is established to test the denoised scores for calling CNVs. CNV_IFTV is tested on both simulated and real data in comparison to several peer methods. The results indicate that the proposed method outperforms the peer methods. CNV_IFTV is a reliable tool for detecting CNVs from short-read sequencing data, even at low coverage and low tumor purity. The detection results on tumor samples can help evaluate known cancer genes and predict target drugs for disease diagnosis.
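As a rough illustration of the first two steps the abstract describes (isolation-tree anomaly scoring of bins, then total variation smoothing of the scores), here is a minimal Python sketch. It is not the CNV_IFTV implementation: the function names `anomaly_scores` and `tv_smooth`, all parameter defaults, and the choice of a projected-gradient solver for the total variation step are assumptions made for illustration, and the final statistical test for calling CNVs is omitted.

```python
import math
import random


def _avg_path(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def _grow(values, depth, max_depth, rng):
    """Recursively split a list of read depths at random cut points."""
    lo, hi = min(values), max(values)
    if depth >= max_depth or lo == hi:
        return {"size": len(values)}
    cut = rng.uniform(lo, hi)
    left = [v for v in values if v < cut]
    right = [v for v in values if v >= cut]
    if not left or not right:  # degenerate cut: stop splitting here
        return {"size": len(values)}
    return {"cut": cut,
            "left": _grow(left, depth + 1, max_depth, rng),
            "right": _grow(right, depth + 1, max_depth, rng)}


def _path_length(tree, x):
    """Depth at which x lands, plus a correction for unsplit leaves."""
    depth = 0
    while "cut" in tree:
        tree = tree["left"] if x < tree["cut"] else tree["right"]
        depth += 1
    return depth + _avg_path(tree["size"])


def anomaly_scores(depths, n_trees=100, sample_size=32, seed=0):
    """Isolation-forest score per bin; values near 1 are the easiest to isolate."""
    depths = list(depths)
    if len(depths) < 2:
        raise ValueError("need at least two bins")
    rng = random.Random(seed)
    k = max(2, min(sample_size, len(depths)))
    max_depth = math.ceil(math.log2(k))
    trees = [_grow(rng.sample(depths, k), 0, max_depth, rng) for _ in range(n_trees)]
    norm = _avg_path(k)
    return [2.0 ** (-sum(_path_length(t, d) for t in trees) / (n_trees * norm))
            for d in depths]


def tv_smooth(y, lam=0.5, n_iter=500, tau=0.25):
    """1D total variation denoising: minimise 0.5*||x - y||^2 + lam * sum|x[i+1] - x[i]|.

    Solved approximately by projected gradient ascent on the dual problem
    (an assumed solver, chosen here only because it is short).
    """
    n = len(y)
    z = [0.0] * max(n - 1, 0)  # one dual variable per adjacent pair of bins
    x = list(y)
    for _ in range(n_iter):
        # Primal from dual: x = y - D^T z, where (D x)[i] = x[i+1] - x[i].
        x = [y[i] - ((z[i - 1] if i > 0 else 0.0) - (z[i] if i < n - 1 else 0.0))
             for i in range(n)]
        # Gradient step on the dual, then clip to the box [-lam, lam].
        z = [max(-lam, min(lam, z[i] + tau * (x[i + 1] - x[i]))) for i in range(n - 1)]
    return x
```

In this sketch one would call `anomaly_scores(read_depths)` on per-bin read depths and pass the result to `tv_smooth`, with `lam` controlling how strongly adjacent bins are pulled toward a common value. The total variation penalty is used rather than a moving average because it tends to flatten noise while keeping sharp boundaries between segments.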
|
14
|
Aganezov S, Raphael BJ. Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples. Genome Res 2020; 30:1274-1290. [PMID: 32887685 PMCID: PMC7545144 DOI: 10.1101/gr.256701.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/03/2019] [Accepted: 08/07/2020] [Indexed: 12/25/2022]
Abstract
Many cancer genomes are extensively rearranged with aberrant chromosomal karyotypes. Deriving these karyotypes from high-throughput DNA sequencing of bulk tumor samples is complicated because most tumors are a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic mutations. We introduce a new algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes from DNA sequencing data from a bulk tumor sample. RCK leverages evolutionary constraints on the somatic mutational process in cancer to reduce ambiguity in the deconvolution of admixed sequencing data into multiple haplotype-specific cancer karyotypes. RCK models mixtures containing an arbitrary number of derived genomes and allows the incorporation of information both from short-read and long-read DNA sequencing technologies. We compare RCK to existing approaches on 17 primary and metastatic prostate cancer samples. We find that RCK infers cancer karyotypes that better explain the DNA sequencing data and conform to a reasonable evolutionary model. RCK's reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment. RCK is freely available as open source software.
Affiliation(s)
- Sergey Aganezov
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
|
15
|
Jia H, Wei H, Zhu D, Ma J, Yang H, Wang R, Feng X. PASA: Identifying More Credible Structural Variants of Hedou12. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1493-1503. [PMID: 31425044 DOI: 10.1109/tcbb.2019.2934463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Although many structural variant detection approaches for human genomes are described in the literature, little is known about how effective these tools are on plant genomes. Moreover, these tools have frequently been shown to report many false structural variants. In this paper, we detect three kinds of structural variants, namely deletions, insertions, and inversions, in the Hedou12 genome relative to the Williams82 genome. To find more potential structural variants, we develop new principles for detecting the discordant and split read mapping sets that support structural variants. To improve the precision of structural variant detection, we propose two new probability models based on sequencing characteristics, which use the sequencing parameters of the Hedou12 genome and the parameters of the Hedou12 paired-end reads aligned to Williams82 to evaluate the probability that a potential structural variant occurs. To remove false members from the potential structural variants, we propose a set cover model that formally describes which potential structural variants should be accepted so that the sum of their probabilities is as high as possible. This yields a solution with more credible structural variants, which can be verified by comparison with DELLY version 0.5.8 and LUMPY version 0.2.2.3. Our algorithm has been verified to find deletions, insertions, and inversions in Hedou12 relative to Williams82 that both DELLY and LUMPY fail to find.
|
16
|
Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data. PLoS Comput Biol 2020; 16:e1008012. [PMID: 32658894 PMCID: PMC7377518 DOI: 10.1371/journal.pcbi.1008012] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/07/2019] [Revised: 07/23/2020] [Accepted: 06/03/2020] [Indexed: 12/22/2022]
Abstract
Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods (Ginkgo, HMMcopy, and CopyNumber) on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.

Copy number aberrations, or CNAs, refer to evolutionary events that act on cancer genomes by deleting segments of the genomes or introducing new copies of existing segments. These events have been implicated in various types of cancer; consequently, their accurate detection could shed light on the initiation and progression of tumors, as well as on the development of potential targeted therapeutics. Single-cell DNA sequencing technologies are now producing the type of data that would allow such detection at the resolution of individual cells. However, to achieve this detection task, methods have to implement several steps of “data wrangling” and deal with technical artifacts. In this work, we benchmarked three widely used methods for CNA detection from single-cell DNA data, namely Ginkgo, HMMcopy, and CopyNumber. To accomplish this study, we developed a novel simulator and devised a phylogeny-based measure of potentially erroneous CNA calls. We find that none of these methods has high accuracy, and all of them can be computationally very demanding. These findings call for the development of more accurate and more efficient methods for CNA detection from single-cell DNA data.
|
17
|
Wei YC, Huang GH. CONY: A Bayesian procedure for detecting copy number variations from sequencing read depths. Sci Rep 2020; 10:10493. [PMID: 32591545 PMCID: PMC7319969 DOI: 10.1038/s41598-020-64353-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/11/2019] [Accepted: 04/15/2020] [Indexed: 12/26/2022]
Abstract
Copy number variations (CNVs) are genomic structural mutations consisting of abnormal numbers of fragment copies. Next-generation sequencing of read-depth signals mirrors these variants. Some tools used to predict CNVs by depth have been published, but most of these tools can be applied to only a specific data type due to modeling limitations. We develop a tool for copy number variation detection by a Bayesian procedure, i.e., CONY, that adopts a Bayesian hierarchical model and an efficient reversible-jump Markov chain Monte Carlo inference algorithm for whole genome sequencing of read-depth data. CONY can be applied not only to individual samples for estimating the absolute number of copies but also to case-control pairs for detecting patient-specific variations. We evaluate the performance of CONY and compare CONY with competing approaches through simulations and by using experimental data from the 1000 Genomes Project. CONY outperforms the other methods in terms of accuracy in both single-sample and paired-samples analyses. In addition, CONY performs well regardless of whether the data coverage is high or low. CONY is useful for detecting both absolute and relative CNVs from read-depth data sequences. The package is available at https://github.com/weiyuchung/CONY.
Affiliation(s)
- Yu-Chung Wei
- Graduate Institute of Statistics and Information Science, National Changhua University of Education, No.1 Jinde Road, Changhua City, Changhua County, 50007, Taiwan
- Guan-Hua Huang
- Institute of Statistics, National Chiao Tung University, 1001 University Road, Hsinchu, 30010, Taiwan.
|
18
|
Friedrich S, Barbulescu R, Helleday T, Sonnhammer ELL. MetaCNV - a consensus approach to infer accurate copy numbers from low coverage data. BMC Med Genomics 2020; 13:76. [PMID: 32487140 PMCID: PMC7268502 DOI: 10.1186/s12920-020-00731-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/25/2019] [Accepted: 05/20/2020] [Indexed: 12/23/2022]
Abstract
Background: The majority of copy number callers require high read coverage data, which is often achieved with elevated material input, which in turn increases the heterogeneity of tissue samples. However, to gain insights into smaller areas within a tissue sample, e.g. a cancerous area in a heterogeneous tissue sample, less material is used for sequencing, which results in lower read coverage. Therefore, more focus needs to be put on copy number calling that is sensitive enough for low coverage data.
Results: We present MetaCNV, a copy number caller that infers reliable copy numbers for human genomes with a consensus approach. MetaCNV specializes in low coverage data, but also performs well on normal and high coverage data. MetaCNV integrates the results of multiple copy number callers and infers absolute and unbiased copy numbers for the entire genome. MetaCNV is based on a meta-model that bypasses the weaknesses of current calling models while combining the strengths of existing approaches. Here we apply MetaCNV based on ReadDepth, SVDetect, and CNVnator to real and simulated datasets in order to demonstrate how the approach improves copy number calling.
Conclusions: MetaCNV, available at https://bitbucket.org/sonnhammergroup/metacnv, provides accurate copy number prediction on low coverage data and performs well on high coverage data.
Affiliation(s)
- Stefanie Friedrich
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden.
- Remus Barbulescu
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
- Thomas Helleday
- Department of Oncology-Pathology, Science for Life Laboratory, Karolinska Institutet, Solna, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
|
19
|
Berlow NE, Grasso CS, Quist MJ, Cheng M, Gandour-Edwards R, Hernandez BS, Michalek JE, Ryan C, Spellman P, Pal R, Million LS, Renneker M, Keller C. Deep Functional and Molecular Characterization of a High-Risk Undifferentiated Pleomorphic Sarcoma. Sarcoma 2020; 2020:6312480. [PMID: 32565715 PMCID: PMC7285280 DOI: 10.1155/2020/6312480] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/09/2019] [Revised: 02/07/2020] [Accepted: 02/10/2020] [Indexed: 11/29/2022]
Abstract
Nonrhabdomyosarcoma soft-tissue sarcomas (STSs) are a class of 50+ cancers arising in muscle and soft tissues of children, adolescents, and adults. Rarity of each subtype often precludes subtype-specific preclinical research, leaving many STS patients with limited treatment options should frontline therapy be insufficient. When clinical options are exhausted, personalized therapy assignment approaches may help direct patient care. Here, we report the results of an adult female STS patient with relapsed undifferentiated pleomorphic sarcoma (UPS) who self-drove exploration of a wide array of personalized Clinical Laboratory Improvement Amendments (CLIA)-level and research-level diagnostics, including state-of-the-art genomic and proteomic testing, ex vivo live cell chemosensitivity testing, a patient-derived xenograft model, and immunoscoring. Her therapeutic choices were also diverse, including neoadjuvant chemotherapy, radiation therapy, and surgeries. Adjuvant and recurrence strategies included off-label and natural medicines, several immunotherapies, and N-of-1 approaches. Identified treatment options, especially those validated during the in vivo study, were not introduced into the course of clinical treatment but did provide plausible treatment regimens based on FDA-approved clinical agents.
Affiliation(s)
- Noah E. Berlow
- Children's Cancer Therapy Development Institute, Beaverton, OR 97005, USA
- Electrical and Computer Engineering, Texas Tech University, Lubbock, TX 79409, USA
- Division of Hematology-Oncology, University of California, Los Angeles, CA 90095, USA
- Catherine S. Grasso
- Division of Hematology-Oncology, University of California, Los Angeles, CA 90095, USA
- Michael J. Quist
- Division of Hematology-Oncology, University of California, Los Angeles, CA 90095, USA
- Regina Gandour-Edwards
- Department of Pathology & Laboratory Medicine, UC Davis Health System, Sacramento, CA 95817, USA
- Brian S. Hernandez
- Department of Epidemiology and Biostatistics, University of Texas Health Science Center San Antonio, San Antonio, TX 78229, USA
- Joel E. Michalek
- Department of Epidemiology and Biostatistics, University of Texas Health Science Center San Antonio, San Antonio, TX 78229, USA
- Christopher Ryan
- School of Medicine, Oregon Health and Science University, Portland, OR 97239, USA
- Paul Spellman
- School of Medicine, Oregon Health and Science University, Portland, OR 97239, USA
- Ranadip Pal
- Electrical and Computer Engineering, Texas Tech University, Lubbock, TX 79409, USA
- Lynn S. Million
- Department of Radiation Oncology, Stanford University, Stanford, CA 94305, USA
- Mark Renneker
- Patient-Directed Consultations, San Francisco, CA 94116, USA
- Department of Family Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Charles Keller
- Children's Cancer Therapy Development Institute, Beaverton, OR 97005, USA
- Electrical and Computer Engineering, Texas Tech University, Lubbock, TX 79409, USA
20
|
Lange S, Engleitner T, Mueller S, Maresch R, Zwiebel M, González-Silva L, Schneider G, Banerjee R, Yang F, Vassiliou GS, Friedrich MJ, Saur D, Varela I, Rad R. Analysis pipelines for cancer genome sequencing in mice. Nat Protoc 2020; 15:266-315. [PMID: 31907453 DOI: 10.1038/s41596-019-0234-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2019] [Accepted: 08/27/2019] [Indexed: 12/14/2022]
Abstract
Mouse models of human cancer have transformed our ability to link genetics, molecular mechanisms and phenotypes. Both reverse and forward genetics in mice are currently gaining momentum through advances in next-generation sequencing (NGS). Methodologies to analyze sequencing data were, however, developed for humans and hence do not account for species-specific differences in genome structures and experimental setups. Here, we describe standardized computational pipelines specifically tailored to the analysis of mouse genomic data. We present novel tools and workflows for the detection of different alteration types, including single-nucleotide variants (SNVs), small insertions and deletions (indels), copy-number variations (CNVs), loss of heterozygosity (LOH) and complex rearrangements, such as in chromothripsis. Workflows have been extensively validated and cross-compared using multiple methodologies. We also give step-by-step guidance on the execution of individual analysis types, provide advice on data interpretation and make the complete code available online. The protocol takes 2-7 d, depending on the desired analyses.
Affiliation(s)
- Sebastian Lange
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany
- Department of Medicine II, Klinikum rechts der Isar, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- Thomas Engleitner
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- Sebastian Mueller
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- Roman Maresch
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- Maximilian Zwiebel
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- Laura González-Silva
- Instituto de Biomedicina y Biotecnología de Cantabria, Universidad de Cantabria-CSIC, Santander, Spain
- Günter Schneider
- Department of Medicine II, Klinikum rechts der Isar, School of Medicine, Technische Universität München, Munich, Germany
- George S Vassiliou
- The Wellcome Trust Sanger Institute, Cambridge, UK
- Wellcome Trust-MRC Stem Cell Institute, Biomedical Campus, University of Cambridge, Cambridge, UK
- Department of Haematology, Cambridge University Hospitals NHS Trust, Cambridge, UK
- Mathias J Friedrich
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany
- Department of Medicine II, Klinikum rechts der Isar, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- Dieter Saur
- Department of Medicine II, Klinikum rechts der Isar, School of Medicine, Technische Universität München, Munich, Germany
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany
- German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany
- Institute for Experimental Cancer Therapy, School of Medicine, Technische Universität München, Munich, Germany
- Ignacio Varela
- Instituto de Biomedicina y Biotecnología de Cantabria, Universidad de Cantabria-CSIC, Santander, Spain
- Roland Rad
- Institute of Molecular Oncology and Functional Genomics, School of Medicine, Technische Universität München, Munich, Germany.
- Department of Medicine II, Klinikum rechts der Isar, School of Medicine, Technische Universität München, Munich, Germany.
- Center for Translational Cancer Research (TranslaTUM), School of Medicine, Technische Universität München, Munich, Germany.
- German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany.
21
|
Alzaid E, Allali AE. PostSV: A Post-Processing Approach for Filtering Structural Variations. Bioinform Biol Insights 2020; 14:1177932219892957. [PMID: 32009779 PMCID: PMC6974750 DOI: 10.1177/1177932219892957] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2019] [Accepted: 11/09/2019] [Indexed: 11/25/2022] Open
Abstract
Genomic structural variations are significant causes of genome diversity and complex diseases. With advances in sequencing technologies, many algorithms have been designed to identify structural differences using next-generation sequencing (NGS) data. Due to repetitions in the human genome and the short reads produced by NGS, the discovery of structural variants (SVs) by state-of-the-art SV callers is not always accurate. To improve performance, multiple SV callers are often used to detect variants. However, most SV callers suffer from high false-positive rates, which diminishes the overall performance, especially in low-coverage genomes. In this article, we propose a post-processing classification-based algorithm that can be used to filter structural variation predictions produced by SV callers. Novel features are defined from putative SV predictions using reads at the local regions around the breakpoints. Several classifiers are employed to classify the candidate predictions and remove false positives. We test our classifier models on simulated and real genomes and show that the proposed approach improves the performance of state-of-the-art algorithms.
Affiliation(s)
- Eman Alzaid
- Computer Science Department, King Saud University, Riyadh, Saudi Arabia; Department of Computer Science, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia
- Achraf El Allali
- Computer Science Department, King Saud University, Riyadh, Saudi Arabia
22
|
Jiang Y, Jiang Y, Wang S, Zhang Q, Ding X. Optimal sequencing depth design for whole genome re-sequencing in pigs. BMC Bioinformatics 2019; 20:556. [PMID: 31703550 PMCID: PMC6839175 DOI: 10.1186/s12859-019-3164-z] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2019] [Accepted: 10/16/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND As whole-genome sequencing is becoming a routine technique, it is important to identify a cost-effective depth of sequencing for such studies. However, the relationship between sequencing depth and biological results from the aspects of whole-genome coverage, variant discovery power and the quality of variants is unclear, especially in pigs. We sequenced the genomes of three Yorkshire boars at an approximately 20X depth on the Illumina HiSeq X Ten platform and downloaded whole-genome sequencing data for three Duroc and three Landrace pigs with an approximately 20X depth for each individual. Then, we downsampled the deep genome data by extracting twelve different proportions of 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9 paired reads from the original bam files to mimic the sequence data of the same individuals at sequencing depths of 1.09X, 2.18X, 3.26X, 4.35X, 6.53X, 8.70X, 10.88X, 13.05X, 15.22X, 17.40X, 19.57X and 21.75X to evaluate the influence of genome coverage, the variant discovery rate and genotyping accuracy as a function of sequencing depth. In addition, SNP chip data for Yorkshire pigs were used as a validation for the comparison of single-sample calling and multisample calling algorithms. RESULTS Our results indicated that 10X is an ideal practical depth for achieving plateau coverage and discovering accurate variants, which achieved greater than 99% genome coverage. The number of false-positive variants was increased dramatically at a depth of less than 4X, which covered 95% of the whole genome. In addition, the comparison of multi- and single-sample calling showed that multisample calling was more sensitive than single-sample calling, especially at lower depths. The number of variants discovered under multisample calling was 13-fold and 2-fold higher than that under single-sample calling at 1X and 22X, respectively. A large difference was observed when the depth was less than 4.38X. However, more false-positive variants were detected under multisample calling. CONCLUSIONS Our research will inform important study design decisions regarding whole-genome sequencing depth. Our results will be helpful for choosing the appropriate depth to achieve the same power for studies performed under limited budgets.
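The relationship between the downsampling proportions and the reported depths in this abstract can be checked with simple arithmetic. The following is a minimal sketch, assuming that each depth is the full-data depth (21.75X) multiplied by the proportion of read pairs kept, and that the twelfth depth corresponds to the undownsampled data. The variable and function names are hypothetical and are not taken from the authors' code.

```python
# Illustrative arithmetic only; names here are hypothetical, not from the paper.
FULL_DEPTH = 21.75  # mean depth of the full data set, as reported in the abstract

# The eleven proportions listed in the abstract. The twelfth depth (21.75X) is
# assumed here to be the undownsampled data, i.e. a proportion of 1.0.
PROPORTIONS = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def expected_depth(full_depth, proportion):
    """Expected mean depth after keeping a given fraction of read pairs."""
    return full_depth * proportion


depths = [expected_depth(FULL_DEPTH, p) for p in PROPORTIONS]

for p, d in zip(PROPORTIONS, depths):
    print(f"proportion {p:.2f} -> about {d:.2f}X")
```

Under this assumption the products agree with the reported depths up to rounding; for example, 0.2 of 21.75X is 4.35X and 0.4 of 21.75X is 8.70X.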
Affiliation(s)
- Yifan Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
- Yao Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
- Sheng Wang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
- Qin Zhang
- Shandong Provincial Key Laboratory of Animal Biotechnology and Disease Control and Prevention, College of Animal Science and Technology, Shandong Agricultural University, Taian, 271001 China
- Xiangdong Ding
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
23
|
Standage DS, Brown CT, Hormozdiari F. Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants. iScience 2019; 18:28-36. [PMID: 31377530 PMCID: PMC6682328 DOI: 10.1016/j.isci.2019.07.032] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2019] [Revised: 06/24/2019] [Accepted: 07/19/2019] [Indexed: 01/05/2023] Open
Abstract
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy.
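The core idea described in this abstract, finding high-abundance k-mers that occur in the individual of interest but not in relatives, can be illustrated with a toy example. This is a sketch under simplifying assumptions (exact string k-mers, no reverse complements, no sequencing-error model, a hypothetical `min_abundance` threshold) and is not Kevlar's implementation or API.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k):
    """Count k-mer occurrences across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def novel_kmers(proband_reads, relative_read_sets, k, min_abundance=2):
    """Return k-mers abundant in the proband and absent from all relatives.

    min_abundance is a hypothetical threshold chosen for illustration.
    """
    proband = count_kmers(proband_reads, k)
    seen_in_relatives = set()
    for reads in relative_read_sets:
        seen_in_relatives.update(count_kmers(reads, k))
    return {
        kmer: n
        for kmer, n in proband.items()
        if n >= min_abundance and kmer not in seen_in_relatives
    }


# Toy data: the proband carries a single A>T change relative to both parents.
parents = [["ACGTACGT"], ["ACGTACGT"]]
proband = ["ACGTTCGT", "ACGTTCGT"]
result = novel_kmers(proband, parents, k=4)
print(sorted(result))
```

In this toy case only the k-mers spanning the changed base are reported; reads containing such k-mers would then be grouped for variant calling, as the abstract describes.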
Affiliation(s)
- Daniel S Standage
- Population Health and Reproduction, University of California, Davis, USA.
- C Titus Brown
- Population Health and Reproduction, University of California, Davis, USA; Genome Center, University of California, Davis, USA.
- Fereydoun Hormozdiari
- Genome Center, University of California, Davis, USA; MIND Institute, University of California, Davis, USA; Biochemistry and Molecular Medicine, University of California, Davis, 1 Shields Avenue, Davis, CA 95616, USA.
24
|
Baaijens JA, Van der Roest B, Köster J, Stougie L, Schönhuth A. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics 2019; 35:5086-5094. [DOI: 10.1093/bioinformatics/btz443] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2018] [Revised: 04/17/2019] [Accepted: 05/27/2019] [Indexed: 11/14/2022] Open
Abstract
Motivation
Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly is the reconstruction of strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains is an important step for various treatment-related reasons. Reference genome independent (‘de novo’) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. The remaining challenge is to reconstruct full-length haplotypes together with their abundances from such contigs.
Results
We present Virus-VG as a de novo approach to viral haplotype reconstruction from preassembled contigs. Our method constructs a variation graph from the short input contigs without making use of a reference genome. Then, to obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of being compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Benchmarking experiments on challenging simulated and real datasets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates compared to the state-of-the-art viral quasispecies assemblers.
Availability and implementation
Virus-VG is freely available at https://bitbucket.org/jbaaijens/virus-vg.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jasmijn A Baaijens
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Johannes Köster
- Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Medical Oncology, Dana Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Leen Stougie
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Department of Econometrics and Operations Research, Vrije Universiteit, Amsterdam, Netherlands
- INRIA-Erable, Grenoble, France
- Alexander Schönhuth
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- INRIA-Erable, Grenoble, France
- Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
25
|
Jin Y, Chen G, Xiao W, Hong H, Xu J, Guo Y, Xiao W, Shi T, Shi L, Tong W, Ning B. Sequencing XMET genes to promote genotype-guided risk assessment and precision medicine. SCIENCE CHINA-LIFE SCIENCES 2019; 62:895-904. [PMID: 31114935 DOI: 10.1007/s11427-018-9479-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/05/2018] [Accepted: 12/06/2018] [Indexed: 12/26/2022]
Abstract
High-throughput next-generation sequencing (NGS) is a shotgun approach applied in a parallel fashion by which the genome is fragmented and sequenced in small pieces and then analyzed either by aligning to a known reference genome or by de novo assembly without a reference genome. This technology has led to an explosion of sequencing-related projects in multidisciplinary fields of science. However, due to the limitations of sequencing-based chemistry, the length of sequencing reads and the complexity of genes, it is difficult to determine the sequences of some portions of the human genome, leaving gaps in genomic data that frustrate further analysis. In particular, some complex genes are difficult to sequence or map accurately because they contain high GC-content and/or low-complexity regions, and complicated pseudogenes, such as the genes encoding xenobiotic metabolizing enzymes and transporters (XMETs). The genetic variants in XMET genes are critical for predicting inter-individual variability in drug efficacy, drug safety and susceptibility to environmental toxicity. We summarized and discussed challenges, wet-lab methods, and bioinformatics algorithms in sequencing "complex" XMET genes, which may provide insightful information in the application of NGS technology for implementation in toxicogenomics and pharmacogenomics.
Affiliation(s)
- Yaqiong Jin
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, 100045, China
- Geng Chen
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Wenming Xiao
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Huixiao Hong
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Joshua Xu
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Yongli Guo
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, 100045, China
- Wenzhong Xiao
- Department of Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, 02114, USA
- Tieliu Shi
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Cancer Center; Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, 200433, China
- Weida Tong
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Baitang Ning
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA.
26
|
Deshpande V, Luebeck J, Nguyen NPD, Bakhtiari M, Turner KM, Schwab R, Carter H, Mischel PS, Bafna V. Exploring the landscape of focal amplifications in cancer using AmpliconArchitect. Nat Commun 2019; 10:392. [PMID: 30674876 PMCID: PMC6344493 DOI: 10.1038/s41467-018-08200-y] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2018] [Accepted: 12/13/2018] [Indexed: 01/17/2023] Open
Abstract
Focal oncogene amplification and rearrangements drive tumor growth and evolution in multiple cancer types. We present AmpliconArchitect (AA), a tool to reconstruct the fine structure of focally amplified regions using whole genome sequencing (WGS) and validate it extensively on multiple simulated and real datasets, across a wide range of coverage and copy numbers. Analysis of AA-reconstructed amplicons in a pan-cancer dataset reveals many novel properties of copy number amplifications in cancer. These findings support a model in which focal amplifications arise due to the formation and replication of extrachromosomal DNA. Applying AA to 68 viral-mediated cancer samples, we identify a large fraction of amplicons with specific structural signatures suggestive of hybrid, human-viral extrachromosomal DNA. AA reconstruction, integrated with metaphase fluorescence in situ hybridization (FISH) and PacBio sequencing on the cell line UPCI:SCC090, confirms the extrachromosomal origin and fine structure of a Forkhead box E1 (FOXE1)-containing hybrid amplicon.
Affiliation(s)
- Viraj Deshpande
- Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.
- Jens Luebeck
- Bioinformatics and Systems Biology Program, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Nam-Phuong D Nguyen
- Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Kristen M Turner
- Ludwig Institute for Cancer Research, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Richard Schwab
- Department of Medicine, Division of Hematology-Oncology, School of Medicine, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Hannah Carter
- Department of Medicine, Division of Medical Genetics, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Moores Cancer Center, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Paul S Mischel
- Ludwig Institute for Cancer Research, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Moores Cancer Center, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Department of Pathology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Vineet Bafna
- Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.
27
|
Roca I, González-Castro L, Fernández H, Couce ML, Fernández-Marmiesse A. Free-access copy-number variant detection tools for targeted next-generation sequencing data. MUTATION RESEARCH-REVIEWS IN MUTATION RESEARCH 2019; 779:114-125. [DOI: 10.1016/j.mrrev.2019.02.005] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/13/2018] [Revised: 12/25/2018] [Accepted: 02/22/2019] [Indexed: 01/23/2023]
28
|
Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018; 19:23-40. [PMID: 27742661 DOI: 10.1093/bib/bbw096] [Citation(s) in RCA: 75] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2016] [Indexed: 12/15/2022] Open
Abstract
Since the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical or computational challenges in de novo assembly still remain, although many bright ideas and heuristics have been suggested to tackle the challenges in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graphs (Hamiltonian and Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of the short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, long-read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
29
|
Smadbeck JB, Johnson SH, Smoley SA, Gaitatzes A, Drucker TM, Zenka RM, Kosari F, Murphy SJ, Hoppman N, Aypar U, Sukov WR, Jenkins RB, Kearney HM, Feldman AL, Vasmatzis G. Copy number variant analysis using genome-wide mate-pair sequencing. Genes Chromosomes Cancer 2018; 57:459-470. [PMID: 29726617 DOI: 10.1002/gcc.5] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2017] [Revised: 04/23/2018] [Accepted: 04/29/2018] [Indexed: 02/06/2023] Open
Abstract
Copy number variation (CNV) is a common form of structural variation detected in human genomes, occurring as both constitutional and somatic events. Cytogenetic techniques like chromosomal microarray (CMA) are widely used in analyzing CNVs. However, CMA techniques cannot resolve the full nature of these structural variations (i.e. the orientation and location of associated breakpoint junctions) and must be combined with other cytogenetic techniques, such as karyotyping or FISH, to do so. This makes the development of a next-generation sequencing (NGS) approach capable of resolving both CNVs and breakpoint junctions desirable. Mate-pair sequencing (MPseq) is an NGS technology designed to find large structural rearrangements across the entire genome. Here we present an algorithm capable of performing copy number analysis from mate-pair sequencing data. The algorithm uses a step-wise procedure involving normalization, segmentation, and classification of the sequencing data. The segmentation technique combines both read depth and discordant mate-pair reads to increase the sensitivity and resolution of CNV calls. The method is particularly suited to MPseq, which is designed to detect breakpoint junctions at high resolution. This allows the classification step to accurately calculate copy number levels at the relatively low read depth of MPseq. Here we compare results for a series of hematological cancer samples that were tested with CMA and MPseq. We demonstrate comparable sensitivity to the state-of-the-art CMA technology, with the benefit of improved breakpoint resolution. The algorithm provides a powerful analytical tool for the analysis of MPseq results in cancer.
Affiliation(s)
- James B Smadbeck
- Center for Individualized Medicine - Biomarker Discovery, Mayo Clinic, Rochester, Minnesota
- Sarah H Johnson
- Center for Individualized Medicine - Biomarker Discovery, Mayo Clinic, Rochester, Minnesota
- Stephanie A Smoley
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Roman M Zenka
- Bioinformatics Systems, Mayo Clinic, Rochester, Minnesota
- Farhad Kosari
- Center for Individualized Medicine - Biomarker Discovery, Mayo Clinic, Rochester, Minnesota
- Stephen J Murphy
- Center for Individualized Medicine - Biomarker Discovery, Mayo Clinic, Rochester, Minnesota
- Nicole Hoppman
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Umut Aypar
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- William R Sukov
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Robert B Jenkins
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Hutton M Kearney
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Andrew L Feldman
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- George Vasmatzis
- Center for Individualized Medicine - Biomarker Discovery, Mayo Clinic, Rochester, Minnesota; Department of Molecular Medicine, Mayo Clinic, Rochester, Minnesota
30
|
Abstract
Integrated analysis of structural variants (SVs) and copy number alterations in aneuploid cancer genomes is key to understanding tumor genome complexity. A recently developed algorithm, Weaver, can estimate, for the first time, allele-specific copy number of SVs and their interconnectivity in aneuploid cancer genomes. However, one major limitation is that not all SVs identified by Weaver are phased. In this article, we develop a general convex programming framework that predicts the interconnectivity of unphased SVs with possibly noisy allele-specific copy number estimations as input. We demonstrated through applications to both simulated data and HeLa whole-genome sequencing data that our method is robust to the noise in the input copy numbers and can predict SV phasings with high specificity. We found that our method can make consistent predictions with Weaver even if a large proportion of the input variants are unphased. We also applied our method to The Cancer Genome Atlas (TCGA) ovarian cancer whole-genome sequencing samples to phase SVs left unphased by Weaver. Our work provides an important new algorithmic framework for recovering more complete allele-specific cancer genome graphs.
Affiliation(s)
- Ashok Rajaraman
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
- Jian Ma
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
31
Dharanipragada P, Vogeti S, Parekh N. iCopyDAV: Integrated platform for copy number variations-Detection, annotation and visualization. PLoS One 2018; 13:e0195334. [PMID: 29621297 PMCID: PMC5886540 DOI: 10.1371/journal.pone.0195334] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 12/29/2017] [Accepted: 03/20/2018] [Indexed: 12/14/2022]
Abstract
Discovery of copy number variations (CNVs), a major category of structural variations, has dramatically changed our understanding of differences between individuals and provides an alternate paradigm for the genetic basis of human diseases. CNVs include both copy gain and copy loss events, and their genome-wide detection is now possible using high-throughput, low-cost next generation sequencing (NGS) methods. However, accurate detection of CNVs from NGS data is not straightforward due to non-uniform coverage of reads resulting from various systemic biases. We have developed an integrated platform, iCopyDAV, to handle some of these issues in CNV detection in whole genome NGS data. It has a modular framework comprising five major modules: data pre-treatment, segmentation, variant calling, annotation and visualization. An important feature of iCopyDAV is the functional annotation module that enables the user to identify and prioritize CNVs encompassing various functional elements, genomic features and disease-associations. Parallelization of the segmentation algorithms makes the iCopyDAV platform accessible even on a desktop. Here we show the effect of sequencing coverage, read length, bin size, data pre-treatment and segmentation approaches on accurate detection of the complete spectrum of CNVs. Performance of iCopyDAV is evaluated on both simulated data and real data for different sequencing depths. It is an open-source integrated pipeline available at https://github.com/vogetihrsh/icopydav and as a Docker image at http://bioinf.iiit.ac.in/icopydav/.
Affiliation(s)
- Prashanthi Dharanipragada
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
- Sriharsha Vogeti
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
- Nita Parekh
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
32
Wang W, Sun W, Wang W, Szatkiewicz J. A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection. BMC Bioinformatics 2018; 19:74. [PMID: 29490610 PMCID: PMC5831535 DOI: 10.1186/s12859-018-2077-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/09/2017] [Accepted: 02/20/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has a quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate here the utility of our approach in the application of detecting copy number variants (CNVs) using a real example. RESULTS: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection. CONCLUSIONS: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
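The tiling-window step mentioned in this abstract can be illustrated with a minimal sketch. It assumes reads are represented only by 0-based start positions and that windows are fixed-width and non-overlapping. The function name `windowed_read_counts` and its parameters are hypothetical and are not taken from GENSENG or R-GENSENG; the GLM+NB fitting and the randomized estimator themselves are not shown.

```python
def windowed_read_counts(read_starts, genome_length, window_size):
    """Count reads per fixed-width, non-overlapping window.

    Illustrative only: real pipelines read alignments from BAM/CRAM files
    and usually apply mapping-quality and duplicate filters first.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Ceiling division so a final partial window is still counted.
    n_windows = (genome_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:  # ignore out-of-range positions
            counts[pos // window_size] += 1
    return counts


# Toy example: five reads on a 100 bp "genome" with 10 bp windows.
counts = windowed_read_counts([5, 12, 18, 95, 99], genome_length=100, window_size=10)
print(counts)  # [1, 2, 0, 0, 0, 0, 0, 0, 0, 2]
```

The resulting per-window counts are the kind of input that read-count models operate on; a randomized approach such as the one described would then fit its model on a sample of these windows rather than on all of them.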
Affiliation(s)
- WeiBo Wang
- Department of Computer Science, University of North Carolina at Chapel Hill, 201 S. Columbia St., Chapel Hill, 27599-3175 USA
- Wei Sun
- Biostatistics Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 19024 USA
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, 580 Portola Plaza, Los Angeles, 90095-1596 USA
- Jin Szatkiewicz
- Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, Chapel Hill, 27599-7264 USA
33
Abstract
Differences between genomes can be due to single nucleotide variants (SNPs), translocations, inversions and copy number variants (CNVs, gain or loss of DNA). The latter can range from sub-microscopic events to complete chromosomal aneuploidies. Small CNVs are often benign but those larger than 250 kb are strongly associated with morbid consequences such as developmental disorders and cancer. Detecting CNVs within and between populations is essential to better understand the plasticity of our genome and to elucidate its possible contribution to disease or phenotypic traits. While the link between SNPs and disease susceptibility has been well studied, to date there are still very few published CNV genome-wide association studies, probably because CNV analysis remains a slightly more complex task than SNP analysis (both in terms of bioinformatics workflow and uncertainty in the CNV calling, leading to high false positive rates and unknown false negative rates). This chapter aims at explaining computational methods for the analysis of CNVs, ranging from study design, data processing and quality control, up to genome-wide association study with clinical traits.
Affiliation(s)
- Aurélien Macé
- Institute of Social and Preventive Medicine, University Hospital of Lausanne, Lausanne, Switzerland; Department of Computational Biology, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Zoltán Kutalik
- Institute of Social and Preventive Medicine, University Hospital of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
34
do Nascimento F, Guimaraes KS. Copy Number Variations Detection: Unravelling the Problem in Tangible Aspects. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1237-1250. [PMID: 27295681 DOI: 10.1109/tcbb.2016.2576441] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 06/06/2023]
Abstract
Among the important genomic variants associated with susceptibility and resistance to complex diseases, copy number variations (CNVs) have emerged as a prevalent class of structural variation. Following the flood of next-generation sequencing data, numerous publicly available tools have been developed to provide computational strategies to identify CNVs with improved accuracy. This review goes beyond scrutinizing the main approaches widely used for structural variant detection in general, including Split-Read, Paired-End Mapping, Read-Depth, and Assembly-based methods. In this paper, (1) we characterize the relevant technical details around the detection of CNVs, which can affect the estimation of breakpoints and number of copies, (2) we pinpoint the most important insights related to GC-content and mappability biases, and (3) we discuss the paramount caveats in the tool evaluation process. The points brought out in this study emphasize common assumptions, a variety of possible limitations, valuable insights, and directions for desirable contributions to the state-of-the-art in CNV detection tools.
35
McPherson AW, Roth A, Ha G, Chauve C, Steif A, de Souza CPE, Eirew P, Bouchard-Côté A, Aparicio S, Sahinalp SC, Shah SP. ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 2017; 18:140. [PMID: 28750660 PMCID: PMC5530528 DOI: 10.1186/s13059-017-1267-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 03/16/2017] [Accepted: 07/03/2017] [Indexed: 11/10/2022]
Abstract
Somatic evolution of malignant cells produces tumors composed of multiple clonal populations, distinguished in part by rearrangements and copy number changes affecting chromosomal segments. Whole genome sequencing mixes the signals of sampled populations, diluting the signals of clone-specific aberrations, and complicating estimation of clone-specific genotypes. We introduce ReMixT, a method to unmix tumor and contaminating normal signals and jointly predict mixture proportions, clone-specific segment copy number, and clone specificity of breakpoints. ReMixT is free, open-source software and is available at http://bitbucket.org/dranew/remixt.
Affiliation(s)
- Andrew W. McPherson
- Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, 2329 West Mall, Vancouver, BC Canada
- Andrew Roth
- Department of Statistics, Oxford University, 24-29 St Giles, Oxford, United Kingdom
- Ludwig Institute for Cancer Research, Oxford University, Old Road Campus Research Building, Headington, Oxford, United Kingdom
- Gavin Ha
- Dana-Farber Cancer Institute, 450 Brookline Ave, Oxford, Boston USA
- Eli and Edythe L. Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA USA
- Cedric Chauve
- Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, BC Canada
- Adi Steif
- Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC Canada
- Camila P. E. de Souza
- Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, 2329 West Mall, Vancouver, BC Canada
- Peter Eirew
- Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC Canada
- Alexandre Bouchard-Côté
- Department of Statistics, University of British Columbia, 2329 West Mall, Vancouver, BC Canada
- Sam Aparicio
- Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, 2329 West Mall, Vancouver, BC Canada
- S. Cenk Sahinalp
- Vancouver Prostate Centre, 2660 Oak Street, Vancouver, Canada
- Department of Computer Science, Indiana University Bloomington, 107 S. Indiana Avenue, Bloomington, IN USA
- Sohrab P. Shah
- Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, 2329 West Mall, Vancouver, BC Canada
36
Schröder J, Wirawan A, Schmidt B, Papenfuss AT. CLOVE: classification of genomic fusions into structural variation events. BMC Bioinformatics 2017; 18:346. [PMID: 28728542 PMCID: PMC5520322 DOI: 10.1186/s12859-017-1760-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/20/2016] [Accepted: 07/13/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND: A precise understanding of structural variants (SVs) in DNA is important in the study of cancer and population diversity. Many methods have been designed to identify SVs from DNA sequencing data. However, the problem remains challenging because existing approaches suffer from low sensitivity, precision, and positional accuracy. Furthermore, many existing tools only identify breakpoints, and do not collect related breakpoints and classify them as a particular type of SV. Due to the rapidly increasing usage of high throughput sequencing technologies in this area, there is an urgent need for algorithms that can accurately classify complex genomic rearrangements (involving more than one breakpoint or fusion). RESULTS: We present CLOVE, an algorithm for integrating the results of multiple breakpoint or SV callers and classifying the results as a particular SV. CLOVE is based on a graph data structure that is created from the breakpoint information. The algorithm looks for patterns in the graph that are characteristic of more complex rearrangement types. CLOVE is able to integrate the results of multiple callers, producing a consensus call. CONCLUSIONS: We demonstrate using simulated and real data that re-classified SV calls produced by CLOVE improve on the raw call set of existing SV algorithms, particularly in terms of accuracy. CLOVE is freely available from http://www.github.com/PapenfussLab.
Affiliation(s)
- Jan Schröder
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC, 3052, Australia; Department of Computing and Information Systems, University of Melbourne, Melbourne, VIC, Australia; Bioinformatics and Cancer Genomics, Peter MacCallum Cancer Centre, East Melbourne, VIC, 3000, Australia
- Adrianto Wirawan
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany
- Bertil Schmidt
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany
- Anthony T Papenfuss
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC, 3052, Australia; Bioinformatics and Cancer Genomics, Peter MacCallum Cancer Centre, East Melbourne, VIC, 3000, Australia; Department of Medical Biology, University of Melbourne, Melbourne, VIC, 3010, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, VIC, 3010, Australia; Department of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, 3010, Australia
37
Chen Y, Zhao L, Wang Y, Cao M, Gelowani V, Xu M, Agrawal SA, Li Y, Daiger SP, Gibbs R, Wang F, Chen R. SeqCNV: a novel method for identification of copy number variations in targeted next-generation sequencing data. BMC Bioinformatics 2017; 18:147. [PMID: 28253855 PMCID: PMC5335817 DOI: 10.1186/s12859-017-1566-3] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 04/16/2016] [Accepted: 02/24/2017] [Indexed: 12/15/2022]
Abstract
BACKGROUND: Targeted next-generation sequencing (NGS) has been widely used as a cost-effective way to identify the genetic basis of human disorders. Copy number variations (CNVs) contribute significantly to human genomic variability, and some of them can lead to disease. However, effective detection of CNVs from targeted capture sequencing data remains challenging. RESULTS: Here we present SeqCNV, a novel CNV calling method designed to use capture NGS data. SeqCNV extracts the read depth information and utilizes the maximum penalized likelihood estimation (MPLE) model to identify the copy number ratio and CNV boundary. We applied SeqCNV to both bacterial artificial clone (BAC) and human patient NGS data to identify CNVs. These CNVs were validated by array comparative genomic hybridization (aCGH). CONCLUSIONS: SeqCNV is able to robustly identify CNVs of different sizes using capture NGS data. Compared with other CNV-calling methods, SeqCNV shows a significant improvement in both sensitivity and specificity.
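As a rough illustration of the read-depth signal that a method of this kind starts from, the sketch below computes per-target log2 copy-number ratios between a test sample and a reference sample after scaling for total depth. This is not SeqCNV's maximum penalized likelihood model; the function name `log2_copy_ratios` and the pseudocount are assumptions made for the example.

```python
import math


def log2_copy_ratios(test_depths, ref_depths, pseudocount=0.5):
    """Per-target log2(test / reference) depth ratio (illustrative only).

    The test sample is rescaled so both samples have the same total depth;
    a small pseudocount (an assumption) avoids division by zero and log(0).
    """
    if len(test_depths) != len(ref_depths):
        raise ValueError("depth lists must have the same length")
    test_total = sum(test_depths)
    ref_total = sum(ref_depths)
    if test_total == 0 or ref_total == 0:
        raise ValueError("total depth must be non-zero in both samples")
    scale = ref_total / test_total
    return [
        math.log2((t * scale + pseudocount) / (r + pseudocount))
        for t, r in zip(test_depths, ref_depths)
    ]


# Toy example: the third target has half the depth of the reference.
ratios = log2_copy_ratios([100, 100, 50, 100], [100, 100, 100, 100])
```

In this toy input the third target yields a clearly negative log2 ratio, which is the kind of signal a segmentation or likelihood model would then turn into a CNV call with boundaries.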
Affiliation(s)
- Yong Chen
- Shanghai Key Lab of Intelligent Information Processing, Shanghai, China; School of Computer Science and Technology, Fudan University, Shanghai, China
- Li Zhao
- Structural and Computational Biology & Molecular Biophysics Graduate Program, Baylor College of Medicine, Houston, TX, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Yi Wang
- School of Life Sciences, Fudan University, Shanghai, China
- Ming Cao
- University of Texas Health Science Center, Houston, TX, USA
- Violet Gelowani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Mingchu Xu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Smriti A Agrawal
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Yumei Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Stephen P Daiger
- Department of Ophthalmology and Visual Sciences, University of Texas Health Science Center, Houston, TX, USA
- Richard Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Fei Wang
- Shanghai Key Lab of Intelligent Information Processing, Shanghai, China; School of Computer Science and Technology, Fudan University, Shanghai, China
- Rui Chen
- Structural and Computational Biology & Molecular Biophysics Graduate Program, Baylor College of Medicine, Houston, TX, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
38
Dzamba M, Ramani AK, Buczkowicz P, Jiang Y, Yu M, Hawkins C, Brudno M. Identification of complex genomic rearrangements in cancers using CouGaR. Genome Res 2016; 27:107-117. [PMID: 27986820 PMCID: PMC5204335 DOI: 10.1101/gr.211201.116] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 06/09/2016] [Accepted: 11/10/2016] [Indexed: 12/17/2022]
Abstract
The genomic alterations associated with cancers are numerous and varied, involving both isolated and large-scale complex genomic rearrangements (CGRs). Although the underlying mechanisms are not well understood, CGRs have been implicated in tumorigenesis. Here, we introduce CouGaR, a novel method for characterizing the genomic structure of amplified CGRs, leveraging both depth of coverage (DOC) and discordant pair-end mapping techniques. We applied our method to whole-genome sequencing (WGS) samples from The Cancer Genome Atlas and identify amplified CGRs in at least 5.2% (10+ copies) to 17.8% (6+ copies) of the samples. Furthermore, ∼95% of these amplified CGRs contain genes previously implicated in tumorigenesis, indicating the importance and widespread occurrence of CGRs in cancers. Additionally, CouGaR identified the occurrence of 'chromoplexy' in nearly 63% of all prostate cancer samples and 30% of all bladder cancer samples. To further validate the accuracy of our method, we experimentally tested 17 predicted fusions in two pediatric glioma samples and validated 15 of these (88%) with precise resolution of the breakpoints via qPCR experiments and Sanger sequencing, with nearly perfect copy count concordance. Additionally, to further help display and understand the structure of CGRs, we have implemented CouGaR-viz, a generic stand-alone tool for visualization of the copy count of regions, breakpoints, and relevant genes.
Affiliation(s)
- Misko Dzamba
- Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada
- Arun K Ramani
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, M5G 0A4, Canada
- Pawel Buczkowicz
- Division of Pathology, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, M5G 1E8, Canada; Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children, Toronto, Ontario, M5G 0A4, Canada; Department of Laboratory Medicine and Pathobiology, Faculty of Medicine, University of Toronto, Toronto, Ontario, M5G 1E8, Canada
- Yue Jiang
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, M5G 0A4, Canada
- Man Yu
- Division of Pathology, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, M5G 1E8, Canada; Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children, Toronto, Ontario, M5G 0A4, Canada; Department of Laboratory Medicine and Pathobiology, Faculty of Medicine, University of Toronto, Toronto, Ontario, M5G 1E8, Canada
- Cynthia Hawkins
- Division of Pathology, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, M5G 1E8, Canada; Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children, Toronto, Ontario, M5G 0A4, Canada; Department of Laboratory Medicine and Pathobiology, Faculty of Medicine, University of Toronto, Toronto, Ontario, M5G 1E8, Canada
- Michael Brudno
- Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, M5G 0A4, Canada
39
Malekpour SA, Pezeshk H, Sadeghi M. PSE-HMM: genome-wide CNV detection from NGS data using an HMM with Position-Specific Emission probabilities. BMC Bioinformatics 2016; 18:30. [PMID: 27809781 PMCID: PMC5445519 DOI: 10.1186/s12859-016-1296-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/22/2016] [Accepted: 10/20/2016] [Indexed: 11/23/2022]
Abstract
BACKGROUND: Copy Number Variation (CNV) is envisaged to be a major source of large structural variations in the human genome. In recent years, many studies have applied Next Generation Sequencing (NGS) data to CNV detection. However, more accurate computational tools are still needed. RESULTS: In this study, mate pair NGS data are used for CNV detection in a Hidden Markov Model (HMM). The proposed HMM has position-specific emission probabilities, i.e. a Gaussian mixture distribution. Each component in the Gaussian mixture distribution captures a different type of aberration that is observed in the mate pairs after they are mapped to the reference genome. These aberrations may include any increase (decrease) in the insert size or change in the direction of mate pairs that are mapped to the reference genome. This HMM with Position-Specific Emission probabilities (PSE-HMM) is utilized for the genome-wide detection of deletions and tandem duplications. The performance of PSE-HMM is evaluated on a simulated dataset and also on real data from a Yoruban HapMap individual, NA18507. CONCLUSIONS: PSE-HMM is effective in taking observation dependencies into account and reaches a high accuracy in detecting genome-wide CNVs. MATLAB programs are available at http://bs.ipm.ir/softwares/PSE-HMM/.
Affiliation(s)
- Seyed Amir Malekpour
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, 14155-6455, Iran
- Hamid Pezeshk
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, 14155-6455, Iran; School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
40
Paternal Age Explains a Major Portion of De Novo Germline Mutation Rate Variability in Healthy Individuals. PLoS One 2016; 11:e0164212. [PMID: 27723766 PMCID: PMC5056704 DOI: 10.1371/journal.pone.0164212] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 05/09/2016] [Accepted: 09/21/2016] [Indexed: 11/19/2022]
Abstract
De novo mutations (DNM) are an important source of rare variants and are increasingly being linked to the development of many diseases. Recently, the paternal age effect has been the focus of a number of studies that attempt to explain the observation that increasing paternal age increases the risk for a number of diseases. Using disease-free familial quartets we show that there is a strong positive correlation between paternal age and germline DNM in healthy subjects. We also observed that germline CNVs do not follow the same trend, suggesting a different mechanism. Finally, we observed that DNM were not evenly distributed across the genome, which adds support to the existence of DNM hotspots.
41
Liu L, Huang J, Wang K, Li L, Li Y, Yuan J, Wei S. Identification of hallmarks of lung adenocarcinoma prognosis using whole genome sequencing. Oncotarget 2016; 6:38016-28. [PMID: 26497366 PMCID: PMC4741981 DOI: 10.18632/oncotarget.5697] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 06/25/2015] [Accepted: 09/30/2015] [Indexed: 11/25/2022]
Abstract
In conjunction with clinical characteristics, prognostic biomarkers are essential for choosing optimal therapies to lower the mortality of lung adenocarcinoma. Whole genome sequencing (WGS) of 7 cancerous-noncancerous tissue pairs was performed to explore the comparative copy number variations (CNVs) associated with lung adenocarcinoma. The frequencies of top ranked CNVs were verified in an independent set of 114 patients and then the roles of target CNVs in disease prognosis were assessed in 313 patients. The WGS yielded 2604 CNVs. After frequency validation and biological function screening of top 10 CNVs, 9 mutant driver genes from 7 CNVs were further analyzed for an association with survival. Compared with the PBXIP1 amplified copy number, unamplified carriers had a 0.62-fold (95%CI = 0.43–0.91) decreased risk of death. Compared with an amplified TERT, those with an unamplified TERT had a 35% reduction (95% CI = 3%–56%) in risk of lung adenocarcinoma progression. Cases with both unamplified PBXIP1 and TERT had a median 34.32-month extension of overall survival and 34.55-month delay in disease progression when compared with both amplified CNVs. This study demonstrates that CNVs of TERT and PBXIP1 have the potential to translate into the clinic and be used to improve outcomes for patients with this fatal disease.
Affiliation(s)
- Li Liu
- Department of Epidemiology and Biostatistics, and the Ministry of Education Key Lab of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, PR China
- Jiao Huang
- Department of Epidemiology and Biostatistics, and the Ministry of Education Key Lab of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, PR China
- Ke Wang
- Department of Epidemiology and Biostatistics, and the Ministry of Education Key Lab of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, PR China
- Li Li
- Department of Epidemiology and Biostatistics, and the Ministry of Education Key Lab of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, PR China
- Yangkai Li
- Department of Thoracic Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, PR China
- Jingsong Yuan
- Department of Radiation Oncology, Center for Radiological Research, Columbia University Medical Center, New York, NY, USA
- Sheng Wei
- Department of Epidemiology and Biostatistics, and the Ministry of Education Key Lab of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, PR China
42
Li Y, Zhou S, Schwartz DC, Ma J. Allele-Specific Quantification of Structural Variations in Cancer Genomes. Cell Syst 2016; 3:21-34. [PMID: 27453446 PMCID: PMC4965314 DOI: 10.1016/j.cels.2016.05.007] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/24/2016] [Revised: 05/13/2016] [Accepted: 05/24/2016] [Indexed: 12/21/2022]
Abstract
Aneuploidy and structural variations (SVs) generate cancer genomes containing a mixture of rearranged genomic segments with extensive somatic copy number alterations. However, existing methods can identify either SVs or allele-specific copy number alterations, but not both simultaneously, which provides a limited view of cancer genome structure. Here we introduce Weaver, an algorithm for the quantification and analysis of allele-specific copy numbers of SVs. Weaver uses a Markov Random Field to estimate joint probabilities of allele-specific copy number of SVs and their inter-connectivity based on paired-end whole-genome sequencing data. Weaver also predicts the timing of SVs relative to chromosome amplifications. We demonstrate the accuracy of Weaver using simulations and findings from whole-genome Optical Mapping. We apply Weaver to generate allele-specific copy numbers of SVs for MCF-7 and HeLa cell lines, and identify recurrent SV patterns in 44 TCGA ovarian cancer whole-genome sequencing datasets. Our approach provides a more complete assessment of the complex genomic architectures inherent to many cancer genomes.
Affiliation(s)
- Yang Li
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Shiguo Zhou
- Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics, University of Wisconsin-Madison, Madison, WI 53706, USA
- David C Schwartz
- Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Jian Ma
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
|
43
|
Xia LC, Sakshuwong S, Hopmans ES, Bell JM, Grimes SM, Siegmund DO, Ji HP, Zhang NR. A genome-wide approach for detecting novel insertion-deletion variants of mid-range size. Nucleic Acids Res 2016; 44:e126. [PMID: 27325742 PMCID: PMC5009736 DOI: 10.1093/nar/gkw481] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 12/12/2015] [Accepted: 05/15/2016] [Indexed: 11/14/2022]
Abstract
We present SWAN, a statistical framework for robust detection of genomic structural variants in next-generation sequencing data, and an analysis of mid-range size insertions and deletions (<10 Kb) for whole genome analysis and DNA mixtures. To identify these mid-range size events, SWAN collectively uses information from read-pair, read-depth and one-end-mapped reads through statistical likelihoods based on Poisson field models. SWAN also uses soft-clip/split read remapping to supplement the likelihood analysis and determine variant boundaries. The accuracy of SWAN is demonstrated by in silico spike-ins and by identification of known variants in the NA12878 genome. We used SWAN to identify a set of novel mid-range insertions/deletions that were confirmed by targeted deep re-sequencing. An R package implementation of SWAN is open source and freely available.
Affiliation(s)
- Li C Xia
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA; Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 18014, USA
- Sukolsak Sakshuwong
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Erik S Hopmans
- Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- John M Bell
- Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- Susan M Grimes
- Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- David O Siegmund
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA; Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- Nancy R Zhang
- Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 18014, USA
|
44
|
Talevich E, Shain AH, Botton T, Bastian BC. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput Biol 2016; 12:e1004873. [PMID: 27100738 PMCID: PMC4839673 DOI: 10.1371/journal.pcbi.1004873] [Citation(s) in RCA: 1112] [Impact Index Per Article: 139.0] [Received: 09/09/2015] [Accepted: 03/16/2016] [Indexed: 01/19/2023]
Abstract
Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massively parallel sequencing is increasingly used to infer copy number information from variations in the read depth in sequencing data. However, this approach has limitations in the case of targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number at equivalent to 100-kilobase resolution genome-wide from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit to copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that it is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for integration into existing analysis pipelines. CNVkit is freely available from https://github.com/etal/cnvkit.
Affiliation(s)
- Eric Talevich
- Department of Dermatology, University of California, San Francisco, San Francisco, California, United States of America
- Department of Pathology, University of California, San Francisco, San Francisco, California, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, United States of America
- A. Hunter Shain
- Department of Dermatology, University of California, San Francisco, San Francisco, California, United States of America
- Department of Pathology, University of California, San Francisco, San Francisco, California, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, United States of America
- Thomas Botton
- Department of Dermatology, University of California, San Francisco, San Francisco, California, United States of America
- Department of Pathology, University of California, San Francisco, San Francisco, California, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, United States of America
- Boris C. Bastian
- Department of Dermatology, University of California, San Francisco, San Francisco, California, United States of America
- Department of Pathology, University of California, San Francisco, San Francisco, California, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, United States of America
|
45
|
Guan P, Sung WK. Structural variation detection using next-generation sequencing data: A comparative technical review. Methods 2016; 102:36-49. [PMID: 26845461 DOI: 10.1016/j.ymeth.2016.01.020] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 10/18/2015] [Revised: 01/09/2016] [Accepted: 01/31/2016] [Indexed: 12/11/2022]
Abstract
Structural variations (SVs) are mutations in the genome of size at least fifty nucleotides. They contribute to the phenotypic differences among healthy individuals, cause severe diseases and even cancers by breaking or linking genes. Thus, it is crucial to systematically profile SVs in the genome. In the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed due to the significant cost reduction of NGS experiments and their ability to unbiasedly detect SVs to the base-pair resolution. These SV detection methods vary in both sensitivity and specificity, since they use different SV-property-dependent and library-property-dependent features. As a result, predictions from different SV callers are often inconsistent. In addition, noise in the data (both platform-specific sequencing error and artificial chimeric reads) impedes the specificity of SV detection. Poorly characterized regions in the human genome (e.g., repeat regions) greatly impact read mapping and in turn affect the SV calling accuracy. Calling of complex SVs requires specialized SV callers. Apart from accuracy, the processing speed of an SV caller is another factor determining its usability. Knowing the pros and cons of different SV calling techniques and the objectives of the biological study are essential for biologists and bioinformaticians to make informed decisions. This paper describes different components in the SV calling pipeline and reviews the techniques used by existing SV callers. Through a simulation study, we also demonstrate that library properties, especially insert size, greatly impact the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for specific biological studies.
Affiliation(s)
- Peiyong Guan
- School of Computing, National University of Singapore, 117543, Singapore
- Wing-Kin Sung
- School of Computing, National University of Singapore, 117543, Singapore; Computational & Mathematical Biology Group, Genome Institute of Singapore, 138672, Singapore.
|
46
|
Tattini L, D'Aurizio R, Magi A. Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Front Bioeng Biotechnol 2015; 3:92. [PMID: 26161383 PMCID: PMC4479793 DOI: 10.3389/fbioe.2015.00092] [Citation(s) in RCA: 155] [Impact Index Per Article: 17.2] [Received: 12/10/2014] [Accepted: 06/10/2015] [Indexed: 01/16/2023]
Abstract
Structural variants are genomic rearrangements larger than 50 bp accounting for around 1% of the variation among human genomes. They impact on phenotypic diversity and play a role in various diseases including neurological/neurocognitive disorders and cancer development and progression. Dissecting structural variants from next-generation sequencing data presents several challenges and a number of approaches have been proposed in the literature. In this mini review, we describe and summarize the latest tools – and their underlying algorithms – designed for the analysis of whole-genome sequencing, whole-exome sequencing, custom captures, and amplicon sequencing data, pointing out the major advantages/drawbacks. We also report a summary of the most recent applications of third-generation sequencing platforms. This assessment provides a guided indication – with particular emphasis on human genetics and copy number variants – for researchers involved in the investigation of these genomic events.
Affiliation(s)
- Lorenzo Tattini
- Department of Neurosciences, Psychology, Pharmacology and Child Health, University of Florence, Florence, Italy
- Romina D'Aurizio
- Laboratory of Integrative Systems Medicine (LISM), Institute of Informatics and Telematics and Institute of Clinical Physiology, National Research Council, Pisa, Italy
- Alberto Magi
- Department of Clinical and Experimental Medicine, University of Florence, Florence, Italy
|
47
|
Lelieveld SH, Spielmann M, Mundlos S, Veltman JA, Gilissen C. Comparison of Exome and Genome Sequencing Technologies for the Complete Capture of Protein-Coding Regions. Hum Mutat 2015; 36:815-22. [PMID: 25973577 PMCID: PMC4755152 DOI: 10.1002/humu.22813] [Citation(s) in RCA: 128] [Impact Index Per Article: 14.2] [Received: 02/19/2015] [Accepted: 04/30/2015] [Indexed: 01/20/2023]
Abstract
For next-generation sequencing technologies, sufficient base-pair coverage is the foremost requirement for the reliable detection of genomic variants. We investigated whether whole-genome sequencing (WGS) platforms offer improved coverage of coding regions compared with whole-exome sequencing (WES) platforms, and compared single-base coverage for a large set of exome and genome samples. We find that WES platforms have improved considerably in recent years, but at comparable sequencing depth, WGS outperforms WES in terms of covered coding regions. At higher sequencing depth (95x–160x), WES successfully captures 95% of the coding regions with a minimal coverage of 20x, compared with 98% for WGS at 87-fold coverage. Three different assessments of sequence coverage bias showed consistent biases for WES but not for WGS. We found no clear differences between the technologies in their ability to achieve complete coverage of 2,759 clinically relevant genes. We show that WES performs comparably to WGS in terms of covered bases if sequenced at two to three times higher coverage. This does, however, come at the cost of substantially more sequencing biases in WES approaches. Our findings will guide laboratories to make an informed decision on which sequencing platform and coverage to choose.
Affiliation(s)
- Stefan H Lelieveld
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, 6525 GA, The Netherlands
- Malte Spielmann
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Berlin, Germany; Max Planck Institute for Molecular Genetics, Berlin, Germany
- Stefan Mundlos
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Berlin, Germany; Max Planck Institute for Molecular Genetics, Berlin, Germany
- Joris A Veltman
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, 6525 GA, The Netherlands; Department of Clinical Genetics, Maastricht University Medical Centre, Maastricht, The Netherlands
- Christian Gilissen
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, 6525 GA, The Netherlands
|
48
|
Wang W, Wang W, Sun W, Crowley JJ, Szatkiewicz JP. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing. Nucleic Acids Res 2015; 43:e90. [PMID: 25883151 PMCID: PMC4538801 DOI: 10.1093/nar/gkv319] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 10/07/2014] [Accepted: 03/27/2015] [Indexed: 11/14/2022]
Abstract
Copy-number variants (CNVs) are a major form of genetic variation and a risk factor for various human diseases, so it is crucial to accurately detect and characterize them. It is conceivable that allele-specific reads from high-throughput sequencing data could be leveraged to both enhance CNV detection and produce allele-specific copy number (ASCN) calls. Although statistical methods have been developed to detect CNVs using whole-genome sequence (WGS) and/or whole-exome sequence (WES) data, information from allele-specific read counts has not yet been adequately exploited. In this paper, we develop an integrated method, called AS-GENSENG, which incorporates allele-specific read counts in CNV detection and estimates ASCN using either WGS or WES data. To evaluate the performance of AS-GENSENG, we conducted extensive simulations, generated empirical data using existing WGS and WES data sets and validated predicted CNVs using an independent methodology. We conclude that AS-GENSENG not only predicts accurate ASCN calls but also improves the accuracy of total copy number calls, owing to its unique ability to exploit information from both total and allele-specific read counts while accounting for various experimental biases in sequence data. Our novel, user-friendly and computationally efficient method and a complete analytic protocol are freely available at https://sourceforge.net/projects/asgenseng/.
Affiliation(s)
- WeiBo Wang
- Department of Computer Science, University of North Carolina at Chapel Hill, NC 27599-3175, USA
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Wei Sun
- Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7400, USA
- James J Crowley
- Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA
- Jin P Szatkiewicz
- Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA
|
49
|
Escaramís G, Docampo E, Rabionet R. A decade of structural variants: description, history and methods to detect structural variation. Brief Funct Genomics 2015; 14:305-14. [PMID: 25877305 DOI: 10.1093/bfgp/elv014] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Indexed: 12/25/2022]
Abstract
In the past decade, the view on genomic structural variation (SV) has been changed completely. SVs, previously considered rare events, are now recognized as the largest source of interindividual genetic variation affecting more bases than single nucleotide polymorphisms, variable number of tandem repeats and other small genetic variants. They have also been shown to play a role in phenotypic variation and in disease. In this review, the authors will provide an introduction to SV; a short historical perspective on the research of this source of genomic variation; a description of the types of structural variants, and on how they may have arisen; and an overview on methods of detecting structural variants, focusing on the analysis of high-throughput sequencing data.
|
50
|
Pirooznia M, Goes FS, Zandi PP. Whole-genome CNV analysis: advances in computational approaches. Front Genet 2015; 6:138. [PMID: 25918519 PMCID: PMC4394692 DOI: 10.3389/fgene.2015.00138] [Citation(s) in RCA: 119] [Impact Index Per Article: 13.2] [Received: 01/26/2015] [Accepted: 03/23/2015] [Indexed: 01/04/2023]
Abstract
Accumulating evidence indicates that DNA copy number variation (CNV) is likely to make a significant contribution to human diversity and also play an important role in disease susceptibility. Recent advances in genome sequencing technologies have enabled the characterization of a variety of genomic features, including CNVs. This has led to the development of several bioinformatics approaches to detect CNVs from next-generation sequencing data. Here, we review recent advances in CNV detection from whole genome sequencing. We discuss the informatics approaches and current computational tools that have been developed as well as their strengths and limitations. This review will assist researchers and analysts in choosing the most suitable tools for CNV analysis as well as provide suggestions for new directions in future development.
Affiliation(s)
- Mehdi Pirooznia
- Mood Disorders Center, Department of Psychiatry and Behavioral Sciences, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Fernando S Goes
- Mood Disorders Center, Department of Psychiatry and Behavioral Sciences, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Peter P Zandi
- Mood Disorders Center, Department of Psychiatry and Behavioral Sciences, School of Medicine, Johns Hopkins University, Baltimore, MD, USA; Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
|