1
Höps W, Rausch T, Jendrusch M, Korbel JO, Sedlazeck FJ. Impact and characterization of serial structural variations across humans and great apes. Nat Commun 2024; 15:8007. [PMID: 39266513 PMCID: PMC11393467 DOI: 10.1038/s41467-024-52027-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Accepted: 08/23/2024] [Indexed: 09/14/2024] [Open Access]
Abstract
Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals (https://github.com/WHops/NAHRwhals), a method to infer repeat-mediated series of SVs in long-read genomic assemblies. Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.
Affiliation(s)
- Wolfram Höps
- European Molecular Biology Laboratory, Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany
- Tobias Rausch
- European Molecular Biology Laboratory, Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany
- Molecular Medicine Partnership Unit, European Molecular Biology Laboratory, University of Heidelberg, Heidelberg, Germany
- Michael Jendrusch
- European Molecular Biology Laboratory, Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany
- Jan O Korbel
- European Molecular Biology Laboratory, Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany.
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Department of Computer Science, Rice University, Houston, TX, USA
2
Foltz SM, Li Y, Yao L, Terekhanova NV, Weerasinghe A, Gao Q, Dong G, Schindler M, Cao S, Sun H, Jayasinghe RG, Fulton RS, Fronick CC, King J, Kohnen DR, Fiala MA, Chen K, DiPersio JF, Vij R, Ding L. Somatic mutation phasing and haplotype extension using linked-reads in multiple myeloma. bioRxiv: the preprint server for biology 2024:2024.08.09.607342. [PMID: 39149342 PMCID: PMC11326269 DOI: 10.1101/2024.08.09.607342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Somatic mutation phasing informs our understanding of cancer-related events, like driver mutations. We generated linked-read whole genome sequencing data for 23 samples across disease stages from 14 multiple myeloma (MM) patients and systematically assigned somatic mutations to haplotypes using linked-reads. Here, we report the reconstructed cancer haplotypes and phase blocks from several MM samples and show how phase block length can be extended by integrating samples from the same individual. We also uncover phasing information in genes frequently mutated in MM, including DIS3, HIST1H1E, KRAS, NRAS, and TP53, phasing 79.4% of 20,705 high-confidence somatic mutations. In some cases, this enabled us to interpret clonal evolution models at higher resolution using pairs of phased somatic mutations. For example, our analysis of one patient suggested that two NRAS hotspot mutations occurred on the same haplotype but were independent events in different subclones. Given sufficient tumor purity and data quality, our framework illustrates how haplotype-aware analysis of somatic mutations in cancer can be beneficial for some cancer cases.
Affiliation(s)
- Steven M. Foltz
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Yize Li
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Lijun Yao
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Nadezhda V. Terekhanova
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Amila Weerasinghe
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Qingsong Gao
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Guanlan Dong
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Moses Schindler
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Song Cao
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Hua Sun
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Reyka G. Jayasinghe
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Robert S. Fulton
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Catrina C. Fronick
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Justin King
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Daniel R. Kohnen
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Mark A. Fiala
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Ken Chen
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- John F. DiPersio
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Ravi Vij
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Li Ding
- Department of Medicine, Washington University in St. Louis, St. Louis, MO, 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO, 63108, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO, 63110, USA
- Department of Genetics, Washington University in St. Louis, St. Louis, MO, 63110, USA
3
Tan KT, Slevin MK, Leibowitz ML, Garrity-Janger M, Shan J, Li H, Meyerson M. Neotelomeres and telomere-spanning chromosomal arm fusions in cancer genomes revealed by long-read sequencing. Cell Genomics 2024; 4:100588. [PMID: 38917803 PMCID: PMC11293586 DOI: 10.1016/j.xgen.2024.100588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 11/09/2023] [Accepted: 05/30/2024] [Indexed: 06/27/2024]
Abstract
Alterations in the structure and location of telomeres are pivotal in cancer genome evolution. Here, we applied both long-read and short-read genome sequencing to assess telomere repeat-containing structures in cancers and cancer cell lines. Using long-read genome sequences that span telomeric repeats, we defined four types of telomere repeat variations in cancer cells: neotelomeres where telomere addition heals chromosome breaks, chromosomal arm fusions spanning telomere repeats, fusions of neotelomeres, and peri-centromeric fusions with adjoined telomere and centromere repeats. These results provide a framework for the systematic study of telomeric repeats in cancer genomes, which could serve as a model for understanding the somatic evolution of other repetitive genomic elements.
Affiliation(s)
- Kar-Tong Tan
- Dana-Farber Cancer Institute, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02215, USA
- Mitchell L Leibowitz
- Dana-Farber Cancer Institute, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02215, USA
- Max Garrity-Janger
- Dana-Farber Cancer Institute, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02215, USA
- Jidong Shan
- Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, USA
- Heng Li
- Dana-Farber Cancer Institute, Boston, MA 02215, USA; Harvard Medical School, Boston, MA 02215, USA.
- Matthew Meyerson
- Dana-Farber Cancer Institute, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02215, USA.
4
Bai X, Duren Z, Wan L, Xia LC. Joint inference of clonal structure using single-cell genome and transcriptome sequencing data. NAR Genom Bioinform 2024; 6:lqae017. [PMID: 38486887 PMCID: PMC10939367 DOI: 10.1093/nargab/lqae017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 11/19/2023] [Accepted: 01/29/2024] [Indexed: 03/17/2024] [Open Access]
Abstract
Recent advances in high-throughput single-cell genome (scDNA) and transcriptome (scRNA) sequencing technologies have enabled cell-resolved investigation of tissue clones. However, it remains challenging to cluster and couple single cells for heterogeneous scRNA and scDNA data generated from the same specimen. In this study, we present a computational framework called CCNMF, which employs a novel Coupled-Clone Non-negative Matrix Factorization technique to jointly infer clonal structure for matched scDNA and scRNA data. CCNMF couples multi-omics single cells by linking copy number and gene expression profiles through their general concordance. It successfully resolved the underlying coexisting clones with high correlations between the clonal genome and transcriptome from the same specimen. We validated that CCNMF can achieve high accuracy and robustness using both simulated benchmarks and real-world applications, including an ovarian cancer cell line mixture, a gastric cancer cell line, and a primary gastric cancer. In summary, CCNMF provides a powerful tool for integrating multi-omics single-cell data, enabling simultaneous resolution of genomic and transcriptomic clonal architecture. This computational framework facilitates the understanding of how cellular gene expression changes in conjunction with clonal genome alterations, shedding light on the cellular genomic differences between subclones that contribute to tumor evolution.
Affiliation(s)
- Xiangqi Bai
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Zhana Duren
- Center for Human Genetics and Department of Genetics and Biochemistry, Clemson University, Greenwood, SC 29646, USA
- Lin Wan
- NCMIS, LSC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Li C Xia
- Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou, 510006, China
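
The coupling idea in the CCNMF abstract above (two non-negative factorizations linked through the concordance of copy number and expression) can be illustrated with a toy sketch. This is not the published CCNMF objective or code: the function name `coupled_nmf`, the quadratic penalty `lam/2 * ||W_dna - W_rna||^2` used as the coupling term, and the synthetic data are all assumptions made for illustration, and both matrices are assumed to share the same rows (genes) on comparable scales.

```python
import numpy as np


def coupled_nmf(X_dna, X_rna, k, lam=1.0, n_iter=200, seed=0, eps=1e-9):
    """Toy coupled NMF: X_dna ~ W_dna @ H_dna and X_rna ~ W_rna @ H_rna.

    Rows are shared genes, columns are cells. The penalty
    lam/2 * ||W_dna - W_rna||^2 is a hypothetical stand-in for the coupling
    term; it is NOT the objective used by the CCNMF authors.
    """
    rng = np.random.default_rng(seed)
    n_genes = X_dna.shape[0]
    W_dna = rng.random((n_genes, k)) + eps
    W_rna = rng.random((n_genes, k)) + eps
    H_dna = rng.random((k, X_dna.shape[1])) + eps
    H_rna = rng.random((k, X_rna.shape[1])) + eps

    def loss():
        # Two reconstruction errors plus the (assumed) coupling penalty.
        return (0.5 * np.sum((X_dna - W_dna @ H_dna) ** 2)
                + 0.5 * np.sum((X_rna - W_rna @ H_rna) ** 2)
                + 0.5 * lam * np.sum((W_dna - W_rna) ** 2))

    losses = [loss()]
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        H_dna *= (W_dna.T @ X_dna) / (W_dna.T @ W_dna @ H_dna + eps)
        H_rna *= (W_rna.T @ X_rna) / (W_rna.T @ W_rna @ H_rna + eps)
        W_dna *= (X_dna @ H_dna.T + lam * W_rna) / (
            W_dna @ (H_dna @ H_dna.T) + lam * W_dna + eps)
        W_rna *= (X_rna @ H_rna.T + lam * W_dna) / (
            W_rna @ (H_rna @ H_rna.T) + lam * W_rna + eps)
        losses.append(loss())
    return W_dna, H_dna, W_rna, H_rna, losses


# Synthetic example: 30 genes, 3 clones, 20 scDNA cells and 25 scRNA cells.
rng = np.random.default_rng(1)
true_W = rng.random((30, 3))
X_dna = true_W @ rng.random((3, 20))
X_rna = true_W @ rng.random((3, 25))

W_dna, H_dna, W_rna, H_rna, losses = coupled_nmf(X_dna, X_rna, k=3)
clones_dna = H_dna.argmax(axis=0)  # clone label per scDNA cell
clones_rna = H_rna.argmax(axis=0)  # clone label per scRNA cell
```

Assigning each cell to the component with the largest loading in `H_dna` or `H_rna` places both cell populations in one shared clone index, which is the sense in which the two data types are coupled in this sketch.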
5
Yang C, Zhang Z, Huang Y, Xie X, Liao H, Xiao J, Veldsman WP, Yin K, Fang X, Zhang L. LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome. Gigascience 2024; 13:giae028. [PMID: 38869148 PMCID: PMC11170215 DOI: 10.1093/gigascience/giae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 03/15/2024] [Accepted: 05/09/2024] [Indexed: 06/14/2024] [Open Access]
Abstract
BACKGROUND: Linked-read sequencing technologies generate high-base quality short reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to 1 specific sequencing platform.
FINDINGS: To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome. LRTK provides functions to perform linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically and provides users with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on linked reads from simulation, mock community, and real datasets for both human genome and metagenome. We showcased LRTK's ability to generate comparative performance results from preceding benchmark studies and to report these results in publication-ready HTML document plots.
CONCLUSIONS: LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.
Affiliation(s)
- Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Zhenmiao Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Yufen Huang
- BGI Research, Shenzhen 518083, China
- BGI Genomics, Shenzhen 518083, China
- Herui Liao
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR 999077, Hong Kong
- Jin Xiao
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Werner Pieter Veldsman
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Kejing Yin
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Xiaodong Fang
- BGI Genomics, Shenzhen 518083, China
- BGI Research, Sanya 572025, China
- Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Institute for Research and Continuing Education, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
6
Tan KT, Slevin MK, Leibowitz ML, Garrity-Janger M, Li H, Meyerson M. Neotelomeres and Telomere-Spanning Chromosomal Arm Fusions in Cancer Genomes Revealed by Long-Read Sequencing. bioRxiv: the preprint server for biology 2023:2023.11.30.569101. [PMID: 38077026 PMCID: PMC10705422 DOI: 10.1101/2023.11.30.569101] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 12/23/2023]
Abstract
Alterations in the structure and location of telomeres are key events in cancer genome evolution. However, previous genomic approaches, unable to span long telomeric repeat arrays, could not characterize the nature of these alterations. Here, we applied both long-read and short-read genome sequencing to assess telomere repeat-containing structures in cancers and cancer cell lines. Using long-read genome sequences that span telomeric repeat arrays, we defined four types of telomere repeat variations in cancer cells: neotelomeres where telomere addition heals chromosome breaks, chromosomal arm fusions spanning telomere repeats, fusions of neotelomeres, and peri-centromeric fusions with adjoined telomere and centromere repeats. Analysis of lung adenocarcinoma genome sequences identified somatic neotelomere and telomere-spanning fusion alterations. These results provide a framework for the systematic study of telomeric repeat arrays in cancer genomes, which could serve as a model for understanding the somatic evolution of other repetitive genomic elements.
Affiliation(s)
- Kar-Tong Tan
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Michael K. Slevin
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Center for Cancer Genomics, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Mitchell L. Leibowitz
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Max Garrity-Janger
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, USA
- Matthew Meyerson
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Genetics, Harvard Medical School, Boston, MA 02215, USA
- Center for Cancer Genomics, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Lead contact
7
Qu J, Li S, Yu D. Detection of complex chromosome rearrangements using optical genome mapping. Gene 2023; 884:147688. [PMID: 37543218 DOI: 10.1016/j.gene.2023.147688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Revised: 07/15/2023] [Accepted: 08/02/2023] [Indexed: 08/07/2023]
Abstract
Chromosomal structural variations (SVs) are a main cause of human genetic disease. Karyotyping, chromosomal microarray analysis (CMA), and fluorescence in situ hybridization (FISH) form the backbone of current routine diagnostics (CRD), but each method has its own limitations. CRD cannot identify cryptic balanced SVs and complex SVs, even when these techniques are performed simultaneously or sequentially. Optical genome mapping (OGM) is a novel technology that can identify several classes of SVs with higher resolution, but studies on the applicability of OGM to difficult and complicated chromosomal SVs, and on its comparison with CRD, are lacking. Herein, seven patients with definite complicated SVs involving at least two breakpoints (BPs) were recruited for this study. The BPs and SVs detected by OGM were compared with those from CRD. OGM detected all BPs in five samples and some of the BPs in two samples. The undetected BPs were all close to the repeat-rich gap region. In addition, OGM detected additional SVs, including a cryptic balanced translocation and two additional complex chromosomal rearrangements (CCRs). OGM also yielded additional information, such as the orientation of acentric fragments, BP positions, and genes mapped in the BP region, for all cases. The accuracy of the additional SVs and BPs detected by OGM was verified by a FISH panel, next-generation sequencing, and Sanger sequencing. Taken together, OGM exhibits better performance than CRD in detecting chromosomal SVs. We suggest that OGM be used in clinical examination to improve the efficiency and accuracy of genetic disease diagnosis, supplemented by FISH or karyotyping to compensate for SVs in the repeat-rich gap region if necessary.
Affiliation(s)
- Jiangbo Qu
- Center for Medical Genetics and Prenatal Diagnosis, Key Laboratory of Birth Defect Prevention and Genetic Medicine of Shandong Health Commission, Key Laboratory of Birth Regulation and Control Technology of National Health Commission of China, Shandong Provincial Maternal and Child Health Care Hospital Affiliated to Qingdao University, Jinan 250014, Shandong, China.
- Shuo Li
- Genetic Testing Center, Qingdao Women and Children's Hospital, Qingdao 266034, Shandong, China.
- Dongyi Yu
- Center for Medical Genetics and Prenatal Diagnosis, Key Laboratory of Birth Defect Prevention and Genetic Medicine of Shandong Health Commission, Key Laboratory of Birth Regulation and Control Technology of National Health Commission of China, Shandong Provincial Maternal and Child Health Care Hospital Affiliated to Qingdao University, Jinan 250014, Shandong, China.
8
Laufer VA, Glover TW, Wilson TE. Applications of advanced technologies for detecting genomic structural variation. Mutation Research. Reviews in Mutation Research 2023; 792:108475. [PMID: 37931775 PMCID: PMC10792551 DOI: 10.1016/j.mrrev.2023.108475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 09/07/2023] [Accepted: 11/02/2023] [Indexed: 11/08/2023]
Abstract
Chromosomal structural variation (SV) encompasses a heterogenous class of genetic variants that exerts strong influences on human health and disease. Despite their importance, many structural variants (SVs) have remained poorly characterized at even a basic level, a discrepancy predicated upon the technical limitations of prior genomic assays. However, recent advances in genomic technology can identify and localize SVs accurately, opening new questions regarding SV risk factors and their impacts in humans. Here, we first define and classify human SVs and their generative mechanisms, highlighting characteristics leveraged by various SV assays. We next examine the first-ever gapless assembly of the human genome and the technical process of assembling it, which required third-generation sequencing technologies to resolve structurally complex loci. The new portions of that "telomere-to-telomere" and subsequent pangenome assemblies highlight aspects of SV biology likely to develop in the near-term. We consider the strengths and limitations of the most promising new SV technologies and when they or longstanding approaches are best suited to meeting salient goals in the study of human SV in population-scale genomics research, clinical, and public health contexts. It is a watershed time in our understanding of human SV when new approaches are expected to fundamentally change genomic applications.
Affiliation(s)
- Vincent A Laufer
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
- Thomas W Glover
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
- Thomas E Wilson
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
9
Weisweiler M, Stich B. Benchmarking of structural variant detection in the tetraploid potato genome using linked-read sequencing. Genomics 2023; 115:110568. [PMID: 36702293 DOI: 10.1016/j.ygeno.2023.110568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 01/12/2023] [Accepted: 01/18/2023] [Indexed: 01/25/2023]
Abstract
It has recently been shown that structural variants (SV) can have a higher impact on gene expression variation compared to single nucleotide variants (SNV) in different plant species. Additionally, SV were associated with phenotypic variation in several crops. However, compared to the established SV detection based on short-read sequencing, fewer approaches have been described for linked-read-based SV calling. We therefore evaluated the performance of six linked-read SV callers compared to an established short-read SV caller based on simulated linked-reads in tetraploid potato. The objectives of our study were to i) compare the performance of SV callers based on linked-read sequencing to short-read sequencing, ii) examine the influence of SV type, SV length, haplotype incidence (HI), as well as sequencing coverage on the SV calling performance in the tetraploid potato genome, and iii) evaluate the accuracy of detecting insertions by linked-read compared to short-read sequencing. We observed high breakpoint resolution (BPR) when detecting short SV and slightly lower BPR for large SV. Our observations highlighted the importance of short-read signals provided by Manta and LinkedSV to detect short SV. Manta and NAIBR performed well for detecting larger deletions, inversions, and duplications. Detected large SV were weakly influenced by the HI. Furthermore, we illustrated that large insertions can be assembled by Novel-X. Our results suggest the usage of the short-read and linked-read SV callers Manta, NAIBR, LinkedSV, and Novel-X based on at least 90x linked-read sequencing coverage to ensure the detection of a broad range of SV in the tetraploid potato genome.
Affiliation(s)
- Marius Weisweiler
- Institute for Quantitative Genetics and Genomics of Plants, Universitätsstraße 1, 40225 Düsseldorf, Germany
- Benjamin Stich
- Institute for Quantitative Genetics and Genomics of Plants, Universitätsstraße 1, 40225 Düsseldorf, Germany; Cluster of Excellence on Plant Sciences, From Complex Traits towards Synthetic Modules, Universitätsstraße 1, 40225 Düsseldorf, Germany; Max Planck Institute for Plant Breeding Research, Carl-von-Linne-Weg 10, 50829 Köln, Germany.
10
Muñoz-Barrera A, Rubio-Rodríguez LA, Díaz-de Usera A, Jáspez D, Lorenzo-Salazar JM, González-Montelongo R, García-Olivares V, Flores C. From Samples to Germline and Somatic Sequence Variation: A Focus on Next-Generation Sequencing in Melanoma Research. Life (Basel) 2022; 12:1939. [PMID: 36431075 PMCID: PMC9695713 DOI: 10.3390/life12111939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 11/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] [Open Access]
Abstract
Next-generation sequencing (NGS) applications have flourished in the last decade, permitting the identification of cancer driver genes and profoundly expanding the possibilities of genomic studies of cancer, including melanoma. Here we aimed to present a technical review across many of the methodological approaches brought by the use of NGS applications with a focus on assessing germline and somatic sequence variation. We provide cautionary notes and discuss key technical details involved in library preparation, the most common problems with the samples, and guidance to circumvent them. We also provide an overview of the sequence-based methods for cancer genomics, exposing the pros and cons of targeted sequencing vs. exome or whole-genome sequencing (WGS), the fundamentals of the most common commercial platforms, and a comparison of throughputs and key applications. Details of the steps and the main software involved in the bioinformatics processing of the sequencing results, from preprocessing to variant prioritization and filtering, are also provided in the context of the full spectrum of genetic variation (SNVs, indels, CNVs, structural variation, and gene fusions). Finally, we put the emphasis on selected bioinformatic pipelines behind (a) short-read WGS identification of small germline and somatic variants, (b) detection of gene fusions from transcriptomes, and (c) de novo assembly of genomes from long-read WGS data. Overall, we provide comprehensive guidance across the main methodological procedures involved in obtaining sequencing results for the most common short- and long-read NGS platforms, highlighting key applications in melanoma research.
Affiliation(s)
- Adrián Muñoz-Barrera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Luis A. Rubio-Rodríguez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Ana Díaz-de Usera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
- David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- José M. Lorenzo-Salazar
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Rafaela González-Montelongo
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Víctor García-Olivares
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Carlos Flores
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, 28029 Madrid, Spain
- Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, 35450 Las Palmas de Gran Canaria, Spain
| |
|
11
|
Linked-read whole-genome sequencing resolves common and private structural variants in multiple myeloma. Blood Adv 2022; 6:5009-5023. [PMID: 35675515 PMCID: PMC9631623 DOI: 10.1182/bloodadvances.2021006720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2021] [Accepted: 05/31/2022] [Indexed: 01/18/2023]
Abstract
Linked-read WGS can be performed without DNA purification and allows for resolution of the diverse structural variants found in MM. Linked-read WGS can, as a standalone assay, provide comprehensive genetics in myeloma and other diseases with complex genomes.
Multiple myeloma (MM) is an incurable and aggressive plasma cell malignancy characterized by a complex karyotype with multiple structural variants (SVs) and copy-number variations (CNVs). Linked-read whole-genome sequencing (lrWGS) allows for refined detection and reconstruction of SVs by providing long-range genetic information from standard short-read sequencing. This makes lrWGS an attractive solution for capturing the full genomic complexity of MM. Here we show that high-quality lrWGS data can be generated from low numbers of cells subjected to fluorescence-activated cell sorting (FACS) without DNA purification. Using this protocol, we analyzed MM cells after FACS from 37 patients with MM using lrWGS. We found high concordance between lrWGS and fluorescence in situ hybridization (FISH) for the detection of recurrent translocations and CNVs. Outside of the regions investigated by FISH, we identified >150 additional SVs and CNVs across the cohort. Analysis of the lrWGS data allowed for resolution of the structure of diverse SVs affecting the MYC and t(11;14) loci, causing the duplication of genes and gene regulatory elements. In addition, we identified private SVs causing the dysregulation of genes recurrently involved in translocations with the IGH locus and show that these can alter the molecular classification of MM. Overall, we conclude that lrWGS allows for the detection of aberrations critical for MM prognostics and provides a feasible route for providing comprehensive genetics. Implementing lrWGS could provide more accurate clinical prognostics, facilitate genomic medicine initiatives, and greatly improve the stratification of patients included in clinical trials.
|
12
|
Guo J, Shi C, Chen X, Wang O, Liu P, Yang H, Xu X, Zhang W, Zhu H. stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads. Front Genet 2021; 12:636239. [PMID: 33815469 PMCID: PMC8012683 DOI: 10.3389/fgene.2021.636239] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/01/2020] [Accepted: 02/04/2021] [Indexed: 11/13/2022]
Abstract
Co-barcoded reads originating from long DNA fragments (mean length >30 kbp) maintain both single base level accuracy and long-range genomic information. We propose a pipeline, stLFRsv, to detect structural variation using co-barcoded reads. stLFRsv identifies abnormal large gaps between co-barcoded reads to detect potential breakpoints and reconstruct complex structural variants (SVs). Haplotype phasing by co-barcoded reads increases the signal to noise ratio, and barcode sharing profiles are used to filter out false positives. We integrate the short read SV caller smoove for smaller variants with stLFRsv. The integrated pipeline was evaluated on the well-characterized genome HG002/NA24385, and 74.5% precision and a 22.4% recall rate were obtained for deletions. stLFRsv revealed some large variants not included in the benchmark set that were verified by long reads or assembly. For the HG001/NA12878 genome, stLFRsv also achieved the best performance for both resource usage and the detection of large variants. Our work indicates that co-barcoded read technology has the potential to improve genome completeness.
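The gap-based breakpoint idea described in this abstract can be illustrated with a minimal sketch. It is not the stLFRsv implementation: the function name `find_barcode_gaps`, the input tuple layout, and the 50 kbp `max_gap` threshold are all assumptions chosen for illustration. In real data a single barcode may tag more than one DNA molecule, so a large gap is only a candidate signal, which is why the abstract mentions filtering by barcode sharing profiles.

```python
from collections import defaultdict

def find_barcode_gaps(reads, max_gap=50_000):
    """Report unusually large gaps between reads sharing a barcode.

    reads: iterable of (barcode, chrom, start, end) tuples (hypothetical layout).
    max_gap: illustrative threshold in bp; not a value taken from stLFRsv.
    Returns a list of (barcode, chrom, gap_start, gap_end) candidates.
    """
    # Group read intervals by barcode and chromosome.
    by_key = defaultdict(list)
    for barcode, chrom, start, end in reads:
        by_key[(barcode, chrom)].append((start, end))

    gaps = []
    for (barcode, chrom), intervals in sorted(by_key.items()):
        intervals.sort()
        prev_end = intervals[0][1]
        for start, end in intervals[1:]:
            # A gap far larger than expected is a candidate breakpoint signal.
            if start - prev_end > max_gap:
                gaps.append((barcode, chrom, prev_end, start))
            prev_end = max(prev_end, end)
    return gaps

example_reads = [
    ("BC1", "chr1", 100, 250),
    ("BC1", "chr1", 300, 450),
    ("BC1", "chr1", 200_000, 200_150),
    ("BC2", "chr1", 100, 250),
]
example_gaps = find_barcode_gaps(example_reads)
```

A production caller would additionally require support from several barcodes before reporting a breakpoint; this sketch only shows the per-barcode gap scan.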
Affiliation(s)
- Junfu Guo
- BGI-Tianjin, BGI-Shenzhen, Tianjin, China
| | - Chang Shi
- BGI-Tianjin, BGI-Shenzhen, Tianjin, China
| | - Xi Chen
- BGI-Tianjin, BGI-Shenzhen, Tianjin, China
| | - Ou Wang
- BGI-Shenzhen, Shenzhen, China
| | - Ping Liu
- MGI, BGI-Shenzhen, Shenzhen, China
| | - Huanming Yang
- Guangdong Provincial Academician Workstation of BGI Synthetic Genomics, BGI-Shenzhen, Shenzhen, China
| | - Xun Xu
- Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen, China
| | | | | |
|
13
|
Noninvasive prenatal test of single-gene disorders by linked-read direct haplotyping: application in various diseases. Eur J Hum Genet 2020; 29:463-470. [PMID: 33235377 DOI: 10.1038/s41431-020-00759-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/04/2020] [Revised: 08/26/2020] [Accepted: 10/20/2020] [Indexed: 11/08/2022]
Abstract
Direct haplotyping enables noninvasive prenatal testing (NIPT) without analyzing a proband, which is a promising strategy for pregnancies at risk of an inherited single-gene disorder. Here, we aimed to expand the scope of single-gene disorders to which NIPT using linked-read direct haplotyping would be applicable. Three families at risk of myotonic dystrophy type 1, lipoid congenital adrenal hyperplasia, and Fukuyama congenital muscular dystrophy were recruited. All cases exhibited distinct characteristics that are often encountered as hurdles (i.e., repeat expansion, identical variants in both parents, and novel variants with retrotransposon insertion) in the universal clinical application of NIPT. Direct haplotyping of parental genomes was performed by linked-read sequencing, combined with allele-specific PCR, if necessary. Target DMPK, STAR, and FKTN genes in the maternal plasma DNA were sequenced. Posterior risk calculations and an Anderson-Darling test were performed to deduce the maternal and paternal inheritance, respectively. In all cases, we could predict the inheritance of the maternal mutant allele with > 99.9% confidence, while paternal mutant alleles were not predicted to be inherited. Our study indicates that direct haplotyping and posterior risk calculation can be applied with subtle modifications to NIPT for the detection of an expanded range of diseases.
|
14
|
Integrative analysis of structural variations using short-reads and linked-reads yields highly specific and sensitive predictions. PLoS Comput Biol 2020; 16:e1008397. [PMID: 33226985 PMCID: PMC7721175 DOI: 10.1371/journal.pcbi.1008397] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/10/2020] [Revised: 12/07/2020] [Accepted: 09/24/2020] [Indexed: 11/19/2022]
Abstract
Genetic diseases are driven by aberrations of the human genome. Identification of such aberrations, including structural variations (SVs), is key to our understanding. Conventional short-reads whole genome sequencing (cWGS) can identify SVs to base-pair resolution, but utilizes only short-range information and suffers from a high false discovery rate (FDR). Linked-reads sequencing (10XWGS) utilizes long-range information by linkage of short-reads originating from the same large DNA molecule. This can mitigate alignment-based artefacts, especially in repetitive regions, and should enable better prediction of SVs. However, an unbiased evaluation of this technology is not available. In this study, we performed a comprehensive analysis of different types and sizes of SVs predicted by both technologies and validated with an independent PCR-based approach. The SVs commonly identified by both technologies were highly specific, while the validation rate dropped for uncommon events. A particularly high FDR was observed for SVs only found by 10XWGS. To improve FDR and sensitivity, statistical models for both technologies were trained. Using our approach, we characterized SVs from the MCF7 cell line and a primary breast cancer tumor with high precision. This approach improves SV prediction and can therefore help in understanding the underlying genetics in various diseases.
Cancer and many other diseases are often driven by structural rearrangements in the patients. Their precise identification is necessary to understand the evolution of the disease and to find a cure. In this study, we have compared two sequencing technologies for the identification of structural variations, i.e. Illumina's short-reads and 10X Genomics linked-reads sequencing. Short-reads sequencing is already known to have a high false discovery rate for structural variations, while an unbiased performance evaluation of linked-reads sequencing is missing. Hence, we evaluate the performance of these two technologies using computational and PCR-based methodologies. Moreover, we also present a statistical approach to increase their performance, supporting better detection of structural variations and thus further research into disease biology.
|
15
|
Gallant J, Mouton J, Ummels R, Ten Hagen-Jongman C, Kriel N, Pain A, Warren RM, Bitter W, Heunis T, Sampson SL. Identification of gene fusion events in Mycobacterium tuberculosis that encode chimeric proteins. NAR Genom Bioinform 2020; 2:lqaa033. [PMID: 33575588 PMCID: PMC7671302 DOI: 10.1093/nargab/lqaa033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/06/2020] [Revised: 04/16/2020] [Accepted: 05/05/2020] [Indexed: 02/07/2023]
Abstract
Mycobacterium tuberculosis is a facultative intracellular pathogen responsible for causing tuberculosis. The harsh environment in which M. tuberculosis survives requires this pathogen to continuously adapt in order to maintain an evolutionary advantage. However, the apparent absence of horizontal gene transfer in M. tuberculosis imposes restrictions in the ways by which evolution can occur. Large-scale changes in the genome can be introduced through genome reduction, recombination events and structural variation. Here, we identify a functional chimeric protein in the ppe38-71 locus, the absence of which is known to have an impact on protein secretion and virulence. To examine whether this approach was used more often by this pathogen, we further develop software that detects potential gene fusion events from multigene deletions using whole genome sequencing data. With this software we could identify a number of other putative gene fusion events within the genomes of M. tuberculosis isolates. We were able to demonstrate the expression of one of these gene fusions at the protein level using mass spectrometry. Therefore, gene fusions may provide an additional means of evolution for M. tuberculosis in its natural environment whereby novel chimeric proteins and functions can arise.
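As a rough illustration of how a multigene deletion can be screened for a potential gene fusion, the sketch below checks whether a deletion begins inside one gene and ends inside another gene on the same strand, so that the remaining 5' and 3' portions could be joined. This is an assumed simplification, not the software described in the abstract: the `Gene` record, the `candidate_fusion` function, and the example coordinates are hypothetical, and reading-frame preservation is deliberately not checked.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

def candidate_fusion(
    deletion_start: int, deletion_end: int, genes: Sequence[Gene]
) -> Optional[Tuple[str, str]]:
    """Return (left_gene, right_gene) if the deletion could fuse two genes.

    The deletion removes positions deletion_start..deletion_end inclusive.
    Both genes must keep at least one base and lie on the same strand.
    """
    left = right = None
    for gene in genes:
        # Deletion begins inside this gene, leaving its left part intact.
        if gene.start < deletion_start <= gene.end:
            left = gene
        # Deletion ends inside this gene, leaving its right part intact.
        if gene.start <= deletion_end < gene.end:
            right = gene
    if left and right and left.name != right.name and left.strand == right.strand:
        return (left.name, right.name)
    return None

# Hypothetical gene models used only to exercise the function.
example_genes = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 500, 900, "+"),
    Gene("geneC", 1000, 1300, "-"),
]
fusion_hit = candidate_fusion(300, 700, example_genes)
```

A candidate reported this way would still need the kind of protein-level confirmation the abstract describes, for example by mass spectrometry.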
Affiliation(s)
- James Gallant
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Department of Biomedical Science, Faculty of Medicine and Health Science, Stellenbosch University, Tygerberg, Cape Town 7505, South Africa.,Section of Molecular Microbiology, Amsterdam Institute for Molecules, Medicines and Systems, Vrije Universiteit Amsterdam, 1081 HZ Amsterdam, The Netherlands
| | - Jomien Mouton
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Department of Biomedical Science, Faculty of Medicine and Health Science, Stellenbosch University, Tygerberg, Cape Town 7505, South Africa
| | - Roy Ummels
- Medical Microbiology and Infection Control, Vrije Universiteit Amsterdam, Amsterdam UMC, 1081 HZ Amsterdam, The Netherlands
| | - Corinne Ten Hagen-Jongman
- Section of Molecular Microbiology, Amsterdam Institute for Molecules, Medicines and Systems, Vrije Universiteit Amsterdam, 1081 HZ Amsterdam, The Netherlands
| | - Nastassja Kriel
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Department of Biomedical Science, Faculty of Medicine and Health Science, Stellenbosch University, Tygerberg, Cape Town 7505, South Africa
| | - Arnab Pain
- Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia.,Global Station for Zoonosis Control, GI-CoRE, Hokkaido University, 001-0020, N20 W10 Kita-ku, Sapporo, Japan
| | - Robin M Warren
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Department of Biomedical Science, Faculty of Medicine and Health Science, Stellenbosch University, Tygerberg, Cape Town 7505, South Africa
| | - Wilbert Bitter
- Section of Molecular Microbiology, Amsterdam Institute for Molecules, Medicines and Systems, Vrije Universiteit Amsterdam, 1081 HZ Amsterdam, The Netherlands.,Medical Microbiology and Infection Control, Vrije Universiteit Amsterdam, Amsterdam UMC, 1081 HZ Amsterdam, The Netherlands
| | - Tiaan Heunis
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Department of Biomedical Science, Faculty of Medicine and Health Science, Stellenbosch University, Tygerberg, Cape Town 7505, South Africa.,Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne NE2 4HH, UK
| | - Samantha L Sampson
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Department of Biomedical Science, Faculty of Medicine and Health Science, Stellenbosch University, Tygerberg, Cape Town 7505, South Africa
| |
|
16
|
Karaoğlanoğlu F, Ricketts C, Ebren E, Rasekh ME, Hajirasouliha I, Alkan C. VALOR2: characterization of large-scale structural variants using linked-reads. Genome Biol 2020; 21:72. [PMID: 32192518 PMCID: PMC7083023 DOI: 10.1186/s13059-020-01975-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/17/2019] [Accepted: 02/24/2020] [Indexed: 12/31/2022]
Abstract
Most existing methods for structural variant detection focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced structural variants with no gain or loss of genomic segments, for example, inversions and translocations, is a particularly challenging task. Furthermore, there are very few algorithms to predict the insertion locus of large interspersed segmental duplications and characterize translocations. Here, we propose novel algorithms to characterize large interspersed segmental duplications, inversions, deletions, and translocations using linked-read sequencing data. We redesign our earlier algorithm, VALOR, and implement our new algorithms in a new software package, called VALOR2.
Affiliation(s)
- Fatih Karaoğlanoğlu
- Department of Computer Engineering, Bilkent University, Ankara, 06800 Turkey
| | - Camir Ricketts
- Tri-Institutional Computational Biology & Medicine Program, Cornell University, 1300 York Ave, New York, 10065 NY USA
- Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065 NY USA
| | - Ezgi Ebren
- Department of Computer Engineering, Bilkent University, Ankara, 06800 Turkey
| | - Marzieh Eslami Rasekh
- Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, 02215 MA USA
| | - Iman Hajirasouliha
- Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065 NY USA
- Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, 1300 York Ave, New York, 10065 NY USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara, 06800 Turkey
- Bilkent-Hacettepe Health Sciences and Technologies Program, Bilkent University, Ankara, 06800 Turkey
| |
|
17
|
Abstract
Identifying structural variation (SV) is essential for genome interpretation but has been historically difficult due to limitations inherent to available genome technologies. Detection methods that use ensemble algorithms and emerging sequencing technologies have enabled the discovery of thousands of SVs, uncovering information about their ubiquity, relationship to disease and possible effects on biological mechanisms. Given the variability in SV type and size, along with unique detection biases of emerging genomic platforms, multiplatform discovery is necessary to resolve the full spectrum of variation. Here, we review modern approaches for investigating SVs and proffer that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.
Affiliation(s)
- Steve S Ho
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Ryan E Mills
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| |
|
18
|
Zhang Y, Kang Z, Lv D, Zhang X, Liao Y, Li Y, Liu R, Li P, Tong M, Tian J, Shao Y, Huang C, Ge D, Zhang J, Bai W, Wang Y, Liu Q, Li Z, Yan J. Longitudinal whole-genome sequencing reveals the evolution of MPAL. Cancer Genet 2020; 240:59-65. [DOI: 10.1016/j.cancergen.2019.11.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/06/2019] [Revised: 10/21/2019] [Accepted: 11/21/2019] [Indexed: 12/30/2022]
|
19
|
Shin G, Greer SU, Xia LC, Lee H, Zhou J, Boles TC, Ji HP. Targeted short read sequencing and assembly of re-arrangements and candidate gene loci provide megabase diplotypes. Nucleic Acids Res 2019; 47:e115. [PMID: 31350896 PMCID: PMC6821272 DOI: 10.1093/nar/gkz661] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/07/2018] [Revised: 07/02/2019] [Accepted: 07/18/2019] [Indexed: 11/12/2022]
Abstract
The human genome is composed of two haplotypes, otherwise called diplotypes, which denote phased polymorphisms and structural variations (SVs) that are derived from both parents. Diplotypes place genetic variants in the context of cis-related variants from a diploid genome. As a result, they provide valuable information about hereditary transmission, context of SV, regulation of gene expression and other features which are informative for understanding human genetics. Successful diplotyping with short read whole genome sequencing generally requires either a large population or parent-child trio samples. To overcome these limitations, we developed a targeted sequencing method for generating megabase (Mb)-scale haplotypes with short reads. One selects specific 0.1-0.2 Mb high molecular weight DNA targets with custom-designed Cas9-guide RNA complexes followed by sequencing with barcoded linked reads. To test this approach, we designed three assays, targeting the BRCA1 gene, the entire 4-Mb major histocompatibility complex locus and 18 well-characterized SVs, respectively. Using an integrated alignment- and assembly-based approach, we generated comprehensive variant diplotypes spanning the entirety of the targeted loci and characterized SVs with exact breakpoints. Our results were comparable in quality to long read sequencing.
Affiliation(s)
- GiWon Shin
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Stephanie U Greer
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Li C Xia
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Jun Zhou
- Sage Science, Inc., Beverly, MA 01915, USA
| | | | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA.,Stanford Genome Technology Center, Stanford University, Palo Alto, CA 94304, USA
| |
|
20
|
Wellenreuther M, Mérot C, Berdan E, Bernatchez L. Going beyond SNPs: The role of structural genomic variants in adaptive evolution and species diversification. Mol Ecol 2019; 28:1203-1209. [PMID: 30834648 DOI: 10.1111/mec.15066] [Citation(s) in RCA: 120] [Impact Index Per Article: 24.0] [Received: 02/18/2019] [Accepted: 02/28/2019] [Indexed: 12/17/2022]
Affiliation(s)
- Maren Wellenreuther
- The New Zealand Institute for Plant & Food Research Ltd, Nelson, New Zealand.,School of Biological Sciences, University of Auckland, Auckland, New Zealand
| | - Claire Mérot
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, Quebec, Canada
| | - Emma Berdan
- Department of Marine Sciences, University of Gothenburg, Gothenburg, Sweden
| | - Louis Bernatchez
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, Quebec, Canada
| |
|
21
|
Nicolussi A, Belardinilli F, Silvestri V, Mahdavian Y, Valentini V, D'Inzeo S, Petroni M, Zani M, Ferraro S, Di Giulio S, Fabretti F, Fratini B, Gradilone A, Ottini L, Giannini G, Coppa A, Capalbo C. Identification of novel BRCA1 large genomic rearrangements by a computational algorithm of amplicon-based Next-Generation Sequencing data. PeerJ 2019; 7:e7972. [PMID: 31741787 PMCID: PMC6859874 DOI: 10.7717/peerj.7972] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/29/2019] [Accepted: 10/01/2019] [Indexed: 12/30/2022]
Abstract
Background Genetic testing for BRCA1/2 germline mutations in hereditary breast/ovarian cancer patients requires screening for single nucleotide variants, small insertions/deletions and large genomic rearrangements (LGRs). These studies have long been run by Sanger sequencing and multiplex ligation-dependent probe amplification (MLPA). The recent introduction of next-generation sequencing (NGS) platforms dramatically improved the speed and the efficiency of DNA testing for nucleotide variants, while the possibility of correctly detecting LGRs by this means is still debated. The purpose of this study was to establish whether and to what extent the development of an analytical algorithm could help us translate NGS sequencing via an Ion Torrent PGM platform into a tool suitable to identify LGRs in hereditary breast-ovarian cancer patients. Methods We first used NGS data of a group of three patients (training set), previously screened in our laboratory by conventional methods, to develop an algorithm for the calculation of the dosage quotient (DQ) to be compared with the Ion Reporter (IR) analysis. Then, we tested the optimized pipeline with a consecutive cohort of 85 uncharacterized probands (validation set) also subjected to MLPA analysis. Characterization of the breakpoints of three novel BRCA1 LGRs was obtained via long-range PCR and direct sequencing of the DNA products. Results In our cohort, the newly defined DQ-based algorithm detected 3/3 BRCA1 LGRs, demonstrating 100% sensitivity and 100% negative predictive value (NPV) (95% CI [87.6–99.9]) compared to 2/3 cases detected by IR (66.7% sensitivity and 98.2% NPV (95% CI [85.6–99.9])). Interestingly, DQ and IR shared 12 positive results, but exon deletion calls matched in only five cases, two of which were confirmed by MLPA. The breakpoints of the 3 novel BRCA1 deletions, involving exons 16–17, 21–22 and 20, have been characterized.
Conclusions Our study defined a DQ-based algorithm to identify BRCA1 LGRs using NGS data. If confirmed on larger data sets, this tool could guide the selection of samples to be subjected to MLPA analysis, leading to significant savings in time and money.
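The sensitivity and NPV figures quoted in this abstract follow from standard confusion-matrix definitions, sketched below. The true-positive and false-negative counts (3/3 for DQ, 2/3 for IR) come from the abstract; the true-negative count used in the NPV example is a hypothetical placeholder, because the abstract does not report it.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real positives that the test detects."""
    return true_pos / (true_pos + false_neg)

def negative_predictive_value(true_neg: int, false_neg: int) -> float:
    """Fraction of negative calls that are truly negative."""
    return true_neg / (true_neg + false_neg)

# Counts taken from the abstract: DQ found 3 of 3 LGRs, IR found 2 of 3.
dq_sensitivity = sensitivity(true_pos=3, false_neg=0)
ir_sensitivity = sensitivity(true_pos=2, false_neg=1)

# Hypothetical true-negative count, for illustration only.
hypothetical_npv = negative_predictive_value(true_neg=50, false_neg=1)
```

With no false negatives the NPV is 100% regardless of the number of true negatives, which is consistent with the DQ result reported above.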
Affiliation(s)
- Arianna Nicolussi
- Department of Experimental Medicine, University of Roma "La Sapienza", Roma, Italy
| | | | - Valentina Silvestri
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Yasaman Mahdavian
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Virginia Valentini
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Sonia D'Inzeo
- U.O.C. Microbiology and Virology Laboratory, A.O. San Camillo Forlanini, Roma, Italy
| | - Marialaura Petroni
- Istituto Italiano di Tecnologia, Center for Life Nano Science @ Sapienza, Roma, Italy
| | - Massimo Zani
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Sergio Ferraro
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Stefano Di Giulio
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Francesca Fabretti
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Beatrice Fratini
- Department of Experimental Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Angela Gradilone
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Laura Ottini
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Giuseppe Giannini
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy.,Istituto Pasteur-Fondazione Cenci Bolognetti, Roma, Italy
| | - Anna Coppa
- Department of Experimental Medicine, University of Roma "La Sapienza", Roma, Italy
| | - Carlo Capalbo
- Department of Molecular Medicine, University of Roma "La Sapienza", Roma, Italy
| |
|
22
|
Iwata S, Nakadai H, Fukushi D, Jose M, Nagahara M, Iwamoto T. Simple and large-scale chromosomal engineering of mouse zygotes via in vitro and in vivo electroporation. Sci Rep 2019; 9:14713. [PMID: 31604975 PMCID: PMC6789149 DOI: 10.1038/s41598-019-50900-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/04/2019] [Accepted: 09/19/2019] [Indexed: 01/25/2023]
Abstract
The clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 system has facilitated dramatic progress in the field of genome engineering. Whilst microinjection of the Cas9 protein and a single guide RNA (sgRNA) into mouse zygotes is a widespread method for producing genetically engineered mice, in vitro and in vivo electroporation (which are much more convenient strategies) have recently been developed. However, it remains unknown whether these electroporation methods are able to manipulate genomes at the chromosome level. In the present study, we used these techniques to introduce chromosomal inversions of several megabases (Mb) in length in mouse zygotes. Using in vitro electroporation, we successfully introduced a 7.67 Mb inversion, which is longer than any previously reported inversion produced using microinjection-based methods. Additionally, using in vivo electroporation, we also introduced a long chromosomal inversion by targeting an allele in F1 hybrid mice. To our knowledge, the present study is the first report of target-specific chromosomal inversions in mammalian zygotes using electroporation.
Affiliation(s)
- Satoru Iwata
- Center for Education in Laboratory Animal Research, Chubu University, Kasugai, Japan.
- Department of Biomedical Sciences, College of Life and Health Sciences, Chubu University, Kasugai, Japan.
- College of Bioscience and Biotechnology, Chubu University, Kasugai, Japan.
| | - Hitomi Nakadai
- Department of Biomedical Sciences, College of Life and Health Sciences, Chubu University, Kasugai, Japan
| | - Daisuke Fukushi
- Department of Genetics, Institute for Developmental Research, Aichi Developmental Disability Center, Kasugai, Japan
| | - Mami Jose
- Department of Biomedical Sciences, College of Life and Health Sciences, Chubu University, Kasugai, Japan
| | - Miki Nagahara
- Center for Education in Laboratory Animal Research, Chubu University, Kasugai, Japan
| | - Takashi Iwamoto
- Center for Education in Laboratory Animal Research, Chubu University, Kasugai, Japan
- Department of Biomedical Sciences, College of Life and Health Sciences, Chubu University, Kasugai, Japan
| |
|
23
|
Darby CA, Fitch JR, Brennan PJ, Kelly BJ, Bir N, Magrini V, Leonard J, Cottrell CE, Gastier-Foster JM, Wilson RK, Mardis ER, White P, Langmead B, Schatz MC. Samovar: Single-Sample Mosaic Single-Nucleotide Variant Calling with Linked Reads. iScience 2019; 18:1-10. [PMID: 31271967 PMCID: PMC6609817 DOI: 10.1016/j.isci.2019.05.037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 04/09/2019] [Revised: 05/06/2019] [Accepted: 05/24/2019] [Indexed: 12/25/2022]
Abstract
Linked-read sequencing greatly improves haplotype assembly over standard paired-end analysis. The detection of mosaic single-nucleotide variants benefits from haplotype assembly when the model is informed by the mapping between constituent reads and linked reads. Samovar evaluates haplotype-discordant reads identified through linked-read sequencing, thus enabling phasing and mosaic variant detection across the entire genome. Samovar trains a random forest model to score candidate sites using a dataset that considers read quality, phasing, and linked-read characteristics. Samovar calls mosaic single-nucleotide variants (SNVs) within a single sample with accuracy comparable with what previously required trios or matched tumor/normal pairs and outperforms single-sample mosaic variant callers at minor allele frequency 5%-50% with at least 30X coverage. Samovar finds somatic variants in both tumor and normal whole-genome sequencing from 13 pediatric cancer cases that can be corroborated with high recall with whole exome sequencing. Samovar is available open-source at https://github.com/cdarby/samovar under the MIT license.
Collapse
Affiliation(s)
- Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - James R Fitch
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Patrick J Brennan
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Benjamin J Kelly
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Natalie Bir
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Vincent Magrini
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Jeffrey Leonard
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA; Department of Neurosurgery, Nationwide Children's Hospital, Columbus, OH, USA
| | - Catherine E Cottrell
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Julie M Gastier-Foster
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Richard K Wilson
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Elaine R Mardis
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Peter White
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Biology, Johns Hopkins University, Baltimore, MD, USA; Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| |
|
24
|
Abstract
More than a decade ago, the term "next-generation" sequencing was coined to describe what were, at the time, revolutionary new methods to sequence RNA and DNA at a faster pace and lower cost than could be achieved by standard bench-top protocols. Since then, the field of DNA sequencing has evolved at a rapid pace, with new breakthroughs allowing capacity to increase exponentially and cost to decrease dramatically. As genome-scale sequencing has become routine, a paradigm shift is occurring in genomics, which combines the power of high-throughput, rapid sequencing with large-scale studies. These new approaches to genetic discovery will have a direct impact on fields such as personalized medicine, evolution, and biodiversity. This work reviews recent technology advances and methods in next-generation sequencing and highlights current large-scale sequencing efforts driving the evolution of the genomics space.
Affiliation(s)
- Shawn E Levy
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806
| | - Braden E Boone
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806
| |
|
25
|
Marks P, Garcia S, Barrio AM, Belhocine K, Bernate J, Bharadwaj R, Bjornson K, Catalanotti C, Delaney J, Fehr A, Fiddes IT, Galvin B, Heaton H, Herschleb J, Hindson C, Holt E, Jabara CB, Jett S, Keivanfar N, Kyriazopoulou-Panagiotopoulou S, Lek M, Lin B, Lowe A, Mahamdallie S, Maheshwari S, Makarewicz T, Marshall J, Meschi F, O'Keefe CJ, Ordonez H, Patel P, Price A, Royall A, Ruark E, Seal S, Schnall-Levin M, Shah P, Stafford D, Williams S, Wu I, Xu AW, Rahman N, MacArthur D, Church DM. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res 2019; 29:635-645. [PMID: 30894395 PMCID: PMC6442396 DOI: 10.1101/gr.234443.118] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Received: 01/09/2018] [Accepted: 02/21/2019] [Indexed: 02/07/2023]
Abstract
Large-scale population analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole-genome sequencing. However, these short-read approaches fail to give a complete picture of a genome. They struggle to identify structural events, cannot access repetitive regions, and fail to resolve the human genome into haplotypes. Here, we describe an approach that retains long-range information while maintaining the advantages of short reads. Starting from ∼1 ng of high molecular weight DNA, we produce barcoded short-read libraries. Novel informatic approaches allow the barcoded short reads to be associated with their original long molecules, producing a novel data type known as "Linked-Reads". This approach allows for simultaneous detection of small and large variants from a single library. In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. Linked-Reads allow mapping to 38 Mb of sequence not accessible to short reads, adding sequence in 423 difficult-to-sequence genes including disease-relevant genes STRC, SMN1, and SMN2. Both Linked-Read whole-genome and whole-exome sequencing identify complex structural variations, including balanced events and single exon deletions and duplications. Further, Linked-Reads extend the region of high-confidence calls by 68.9 Mb. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
Affiliation(s)
- Adrian Fehr
- 10x Genomics, Pleasanton, California 94566, USA
| | - Esty Holt
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Monkol Lek
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Bill Lin
- 10x Genomics, Pleasanton, California 94566, USA
| | - Adam Lowe
- 10x Genomics, Pleasanton, California 94566, USA
| | - Shazia Mahamdallie
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Jamie Marshall
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Elise Ruark
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Sheila Seal
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Preyas Shah
- 10x Genomics, Pleasanton, California 94566, USA
| | - Indira Wu
- 10x Genomics, Pleasanton, California 94566, USA
| | - Nazneen Rahman
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Daniel MacArthur
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| |
|
26
|
Ma ZS, Li L, Ye C, Peng M, Zhang YP. Hybrid assembly of ultra-long Nanopore reads augmented with 10x-Genomics contigs: Demonstrated with a human genome. Genomics 2018; 111:1896-1901. [PMID: 30594583 DOI: 10.1016/j.ygeno.2018.12.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 10/02/2018] [Revised: 11/17/2018] [Accepted: 12/24/2018] [Indexed: 10/27/2022]
Abstract
Third-generation sequencing (3GS) technologies generate ultra-long reads (up to 1 Mb), which make it possible to eliminate gaps and effectively resolve repeats in genome assembly. However, 3GS technologies suffer from high base-level error rates (15%-40%) and high sequencing costs. To address these issues, the hybrid assembly strategy, which utilizes both 3GS reads and inexpensive NGS (next-generation sequencing) short reads, was invented. Here, we use 10×-Genomics® technology, which integrates a novel bar-coding strategy with Illumina® NGS and has the advantage of revealing long-range sequence information, to replace common NGS short reads for hybrid assembly of long erroneous 3GS reads. We demonstrate the feasibility of integrating 3GS with 10×-Genomics technologies for a new strategy of hybrid de novo genome assembly by utilizing the DBG2OLC and Sparc software packages, previously developed by the authors for regular hybrid assembly. Using a human genome as an example, we show that with only 7× coverage of ultra-long Nanopore® reads, augmented with 10× reads, our approach achieved nearly the same level of quality as non-hybrid assembly with 35× coverage of Nanopore reads. Compared with the assembly with 10×-Genomics reads alone, our assembly is gapless at a slightly higher cost. These results suggest that our new hybrid assembly with ultra-long 3GS reads augmented with 10×-Genomics reads offers a low-cost (less than ¼ the cost of the non-hybrid assembly) and computationally lightweight (taking only 109 calendar hours with a peak memory usage of 61 GB on a dual-CPU office workstation) solution for extending the wide applications of 3GS technologies.
Affiliation(s)
- Zhanshan Sam Ma
- Computational Biology and Medical Ecology Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China; Kunming College of Life Science, Chinese Academy of Sciences, Kunming, 650223, China.
| | - Lianwei Li
- Computational Biology and Medical Ecology Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Kunming College of Life Science, Chinese Academy of Sciences, Kunming, 650223, China
| | - Chengxi Ye
- Computational Biology and Medical Ecology Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Minsheng Peng
- Molecular Evolution and Genome Diversity Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Kunming College of Life Science, Chinese Academy of Sciences, Kunming, 650223, China; KIZ/CUHK Joint Laboratory of Bio-resources and Molecular Research in Common Diseases, Kunming 650223, China
| | - Ya-Ping Zhang
- Molecular Evolution and Genome Diversity Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China; Kunming College of Life Science, Chinese Academy of Sciences, Kunming, 650223, China; KIZ/CUHK Joint Laboratory of Bio-resources and Molecular Research in Common Diseases, Kunming 650223, China.
| |
|
27
|
Xia LC, Ai D, Lee H, Andor N, Li C, Zhang NR, Ji HP. SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 2018; 7:5049476. [PMID: 29982625 PMCID: PMC6057526 DOI: 10.1093/gigascience/giy081] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 01/22/2018] [Revised: 05/22/2018] [Accepted: 06/26/2018] [Indexed: 11/29/2022] Open
Abstract
Background: Simulating genome sequence data with variant features facilitates the development and benchmarking of structural variant analysis programs. However, there are only a few data simulators that provide structural variants in silico and even fewer that provide variants with different allelic fractions and haplotypes. Findings: We developed SVEngine, an open-source tool to address this need. SVEngine simulates next-generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file, and/or a clonal phylogeny tree file (NEWICK). Subsequently, it simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs), and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine's flexible design process enables one to specify size, position, and allelic fraction for deletions, insertions, duplications, inversions, and translocations. Finally, SVEngine simulates sequence data that replicate the characteristics of a sequencing library with mixed sizes of DNA insert molecules. SVEngine is highly parallelized to reduce the simulation time. Conclusions: We demonstrated the versatile features of SVEngine and its improved runtime compared with other available simulators. SVEngine's features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated SVEngine's accuracy by simulating genome-wide structural variants of NA12878 and a heterogeneous cancer genome. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift, and neighboring hanging read pairs for representative variant types. Structural variant callers Lumpy and Manta and tumor heterogeneity estimator THetA2 were able to perform realistically on the simulated data. SVEngine is implemented as a standard Python package and is freely available for academic use.
Affiliation(s)
- Li Charlie Xia
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
- Department of Statistics, the Wharton School, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 18014
| | - Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, 30 Xueyuan Road, Haidian District, Beijing 100083 P. R. China
| | - Hojoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
| | - Noemi Andor
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
| | - Chao Li
- School of Mathematics and Physics, University of Science and Technology Beijing, 30 Xueyuan Road, Haidian District, Beijing 100083 P. R. China
| | - Nancy R Zhang
- Department of Statistics, the Wharton School, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 18014
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
- Stanford Genome Technology Center, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304
| |
|