1
Lai J, Yang Y, Liu Y, Scharpf RB, Karchin R. Assessing the merits: an opinion on the effectiveness of simulation techniques in tumor subclonal reconstruction. Bioinformatics Advances 2024; 4:vbae094. [PMID: 38948008 PMCID: PMC11213631 DOI: 10.1093/bioadv/vbae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 05/28/2024] [Accepted: 06/15/2024] [Indexed: 07/02/2024]
Abstract
Summary: Neoplastic tumors originate from a single cell, and their evolution can be traced through lineages characterized by mutations, copy number alterations, and structural variants. These lineages are reconstructed and mapped onto evolutionary trees with algorithmic approaches. However, without ground truth benchmark sets, the validity of an algorithm remains uncertain, limiting potential clinical applicability. With a growing number of algorithms available, there is an urgent need for standardized benchmark sets to evaluate their merits. Benchmark sets rely on in silico simulations of tumor sequence, but there are no accepted standards for simulation tools, presenting a major obstacle to progress in this field. Availability and implementation: All analysis done in the paper was based on publicly available data from the publication of each accessed tool.
Affiliation(s)
- Jiaying Lai
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
- Yi Yang
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
- Yunzhou Liu
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
- Robert B Scharpf
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21231, United States
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD 21231, United States
- Rachel Karchin
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21231, United States
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD 21231, United States
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States

2
Lai J, Liu Y, Scharpf RB, Karchin R. Evaluation of simulation methods for tumor subclonal reconstruction. arXiv 2024:arXiv:2402.09599v1. [PMID: 38410652 PMCID: PMC10896360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2024]
Abstract
Most neoplastic tumors originate from a single cell, and their evolution can be genetically traced through lineages characterized by common alterations such as small somatic mutations (SSMs), copy number alterations (CNAs), structural variants (SVs), and aneuploidies. Due to the complexity of these alterations in most tumors and the errors introduced by sequencing protocols and calling algorithms, tumor subclonal reconstruction algorithms are necessary to recapitulate the DNA sequence composition and tumor evolution in silico. With a growing number of these algorithms available, there is a pressing need for consistent and comprehensive benchmarking, which relies on realistic tumor sequencing generated by simulation tools. Here, we examine the current simulation methods, identifying their strengths and weaknesses, and provide recommendations for their improvement. Our review also explores potential new directions for research in this area. This work aims to serve as a resource for understanding and enhancing tumor genomic simulations, contributing to the advancement of the field.
Affiliation(s)
- Jiaying Lai
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD
- Yunzhou Liu
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD
- Robert B. Scharpf
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD
- Rachel Karchin
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD

3
Yang X, Xu X, Breuss MW, Antaki D, Ball LL, Chung C, Shen J, Li C, George RD, Wang Y, Bae T, Cheng Y, Abyzov A, Wei L, Alexandrov LB, Sebat JL, Gleeson JG. Control-independent mosaic single nucleotide variant detection with DeepMosaic. Nat Biotechnol 2023; 41:870-877. [PMID: 36593400 PMCID: PMC10314968 DOI: 10.1038/s41587-022-01559-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 11/13/2020] [Accepted: 10/10/2022] [Indexed: 01/04/2023]
Abstract
Mosaic variants (MVs) reflect mutagenic processes during embryonic development and environmental exposure, accumulate with aging and underlie diseases such as cancer and autism. The detection of noncancer MVs has been computationally challenging due to the sparse representation of nonclonally expanded MVs. Here we present DeepMosaic, combining an image-based visualization module for single nucleotide MVs and a convolutional neural network-based classification module for control-independent MV detection. DeepMosaic was trained on 180,000 simulated or experimentally assessed MVs, and was benchmarked on 619,740 simulated MVs and 530 independent biologically tested MVs from 16 genomes and 181 exomes. DeepMosaic achieved higher accuracy compared with existing methods on biological data, with a sensitivity of 0.78, specificity of 0.83 and positive predictive value of 0.96 on noncancer whole-genome sequencing data, as well as doubling the validation rate over previous best-practice methods on noncancer whole-exome sequencing data (0.43 versus 0.18). DeepMosaic represents an accurate MV classifier for noncancer samples that can be implemented as an alternative or complement to existing methods.
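The sensitivity, specificity and positive predictive value quoted in this abstract follow the standard confusion-matrix definitions. The sketch below only illustrates that arithmetic; the function name `classifier_metrics` and the counts are hypothetical and are not taken from the DeepMosaic paper.

```python
def classifier_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return sensitivity, specificity and positive predictive value."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all real variants
        "specificity": tn / (tn + fp),  # true negatives among all non-variants
        "ppv": tp / (tp + fp),          # real variants among all positive calls
    }


if __name__ == "__main__":
    # Hypothetical validation counts, chosen only for illustration.
    metrics = classifier_metrics(tp=80, fp=5, tn=45, fn=20)
    print({name: round(value, 2) for name, value in metrics.items()})
```

A high positive predictive value alongside a lower sensitivity, as in the abstract, means that most reported calls validate while some true variants are still missed.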
Affiliation(s)
- Xiaoxu Yang
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Xin Xu
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Martin W Breuss
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
  - Department of Pediatrics, Section of Genetics and Metabolism, University of Colorado School of Medicine, Aurora, CO, USA
- Danny Antaki
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Laurel L Ball
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Changuk Chung
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Jiawei Shen
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Chen Li
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Renee D George
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Yifan Wang
  - Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Taejeong Bae
  - Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Yuhe Cheng
  - Department of Cellular and Molecular Medicine, UC San Diego, La Jolla, CA, USA
  - Department of Bioengineering, UC San Diego, La Jolla, CA, USA
  - Moores Cancer Center, UC San Diego, La Jolla, CA, USA
- Alexej Abyzov
  - Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Liping Wei
  - Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, China
- Ludmil B Alexandrov
  - Department of Cellular and Molecular Medicine, UC San Diego, La Jolla, CA, USA
  - Department of Bioengineering, UC San Diego, La Jolla, CA, USA
  - Moores Cancer Center, UC San Diego, La Jolla, CA, USA
- Jonathan L Sebat
  - Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
  - Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
- Joseph G Gleeson
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
  - Rady Children's Institute for Genomic Medicine, San Diego, CA, USA

4
Duncavage EJ, Coleman JF, de Baca ME, Kadri S, Leon A, Routbort M, Roy S, Suarez CJ, Vanderbilt C, Zook JM. Recommendations for the Use of in Silico Approaches for Next-Generation Sequencing Bioinformatic Pipeline Validation: A Joint Report of the Association for Molecular Pathology, Association for Pathology Informatics, and College of American Pathologists. J Mol Diagn 2023; 25:3-16. [PMID: 36244574 DOI: 10.1016/j.jmoldx.2022.09.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 01/24/2022] [Revised: 09/14/2022] [Accepted: 09/28/2022] [Indexed: 11/21/2022]
Abstract
In silico approaches for next-generation sequencing (NGS) data modeling have utility in the clinical laboratory as a tool for clinical assay validation. In silico NGS data can take a variety of forms, including pure simulated data or manipulated data files in which variants are inserted into existing data files. In silico data enable simulation of a range of variants that may be difficult to obtain from a single physical sample. Such data allow laboratories to more accurately test the performance of clinical bioinformatics pipelines without sequencing additional cases. For example, clinical laboratories may use in silico data to simulate low variant allele fraction variants to test the analytical sensitivity of variant calling software or simulate a range of insertion/deletion sizes to determine the performance of insertion/deletion calling software. In this article, the Working Group reviews the different types of in silico data with their strengths and limitations, methods to generate in silico data, and how data can be used in the clinical molecular diagnostic laboratory. Survey data indicate how in silico NGS data are currently being used. Finally, potential applications for which in silico data may become useful in the future are presented.
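The low variant allele fraction scenario mentioned in this abstract can be illustrated with a small sketch. It is not a method from the report: it assumes a plain binomial read-sampling model with no sequencing error, and the names `simulate_alt_reads` and `detection_rate` and the `min_alt_reads` threshold are hypothetical.

```python
import random


def simulate_alt_reads(depth: int, vaf: float, rng: random.Random) -> int:
    """Draw the number of variant-supporting reads at one site.

    Assumption: each of `depth` reads independently carries the variant
    with probability `vaf` (binomial sampling, no sequencing error).
    """
    return sum(1 for _ in range(depth) if rng.random() < vaf)


def detection_rate(depth: int, vaf: float, min_alt_reads: int,
                   n_sites: int = 1000, seed: int = 0) -> float:
    """Fraction of simulated sites with at least `min_alt_reads` variant reads.

    `min_alt_reads` stands in for a caller's evidence threshold (hypothetical).
    """
    rng = random.Random(seed)
    hits = sum(
        simulate_alt_reads(depth, vaf, rng) >= min_alt_reads
        for _ in range(n_sites)
    )
    return hits / n_sites


if __name__ == "__main__":
    for vaf in (0.01, 0.05, 0.10):
        rate = detection_rate(depth=500, vaf=vaf, min_alt_reads=5)
        print(f"VAF={vaf:.2f}  fraction of sites detectable: {rate:.3f}")
```

Sweeping `vaf` and `depth` in this way gives a rough upper bound on the analytical sensitivity that a pipeline could reach under the assumed threshold, before any real caller behavior is considered.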
Affiliation(s)
- Eric J Duncavage
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri
- Joshua F Coleman
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, University of Utah, Salt Lake City, Utah
- Monica E de Baca
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Pacific Pathology Partners, Seattle, Washington
- Sabah Kadri
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, Anne and Robert H Lurie Children's Hospital of Chicago, Chicago, Illinois
- Annette Leon
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Color Health, Burlingame, California
- Mark Routbort
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Hematopathology, MD Anderson Cancer Center, Houston, Texas
- Somak Roy
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology and Laboratory Medicine, Cincinnati Children's Hospital, Cincinnati, Ohio
- Carlos J Suarez
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, Stanford University, Palo Alto, California
- Chad Vanderbilt
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Justin M Zook
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Biomarker and Genomic Sciences Group, National Institute of Standards and Technology, Gaithersburg, Maryland

5
Lei Y, Meng Y, Guo X, Ning K, Bian Y, Li L, Hu Z, Anashkina AA, Jiang Q, Dong Y, Zhu X. Overview of structural variation calling: Simulation, identification, and visualization. Comput Biol Med 2022; 145:105534. [DOI: 10.1016/j.compbiomed.2022.105534] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/20/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 12/11/2022]

6
Identification of Copy Number Alterations from Next-Generation Sequencing Data. Advances in Experimental Medicine and Biology 2022; 1361:55-74. [DOI: 10.1007/978-3-030-91836-1_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]

7
Yang X, Breuss MW, Xu X, Antaki D, James KN, Stanley V, Ball LL, George RD, Wirth SA, Cao B, Nguyen A, McEvoy-Venneri J, Chai G, Nahas S, Van Der Kraan L, Ding Y, Sebat J, Gleeson JG. Developmental and temporal characteristics of clonal sperm mosaicism. Cell 2021; 184:4772-4783.e15. [PMID: 34388390 PMCID: PMC8496133 DOI: 10.1016/j.cell.2021.07.024] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/01/2021] [Revised: 05/12/2021] [Accepted: 07/14/2021] [Indexed: 01/07/2023]
Abstract
Throughout development and aging, human cells accumulate mutations resulting in genomic mosaicism and genetic diversity at the cellular level. Mosaic mutations present in the gonads can affect both the individual and the offspring and subsequent generations. Here, we explore patterns and temporal stability of clonal mosaic mutations in male gonads by sequencing ejaculated sperm. Through 300× whole-genome sequencing of blood and sperm from healthy men, we find each ejaculate carries on average 33.3 ± 12.1 (mean ± SD) clonal mosaic variants, nearly all of which are detected in serial sampling, with the majority absent from sampled somal tissues. Their temporal stability and mutational signature suggest origins during embryonic development from a largely immutable stem cell niche. Clonal mosaicism likely contributes a transmissible, predicted pathogenic exonic variant for 1 in 15 men, representing a life-long threat of transmission for these individuals and a significant burden on human population health.
Affiliation(s)
- Xiaoxu Yang
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Martin W Breuss
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Xin Xu
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Danny Antaki
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Kiely N James
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Valentina Stanley
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Laurel L Ball
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Renee D George
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Sara A Wirth
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Beibei Cao
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- An Nguyen
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Jennifer McEvoy-Venneri
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Guoliang Chai
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Shareef Nahas
  - Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Yan Ding
  - Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA
- Jonathan Sebat
  - Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA 92093, USA; Department of Psychiatry, University of California, San Diego, La Jolla, CA 92093, USA; Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093, USA; Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
- Joseph G Gleeson
  - Department of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA; Rady Children's Institute for Genomic Medicine, San Diego, CA 92123, USA

8
Alosaimi S, Bandiang A, van Biljon N, Awany D, Thami PK, Tchamga MSS, Kiran A, Messaoud O, Hassan RIM, Mugo J, Ahmed A, Bope CD, Allali I, Mazandu GK, Mulder NJ, Chimusa ER. A broad survey of DNA sequence data simulation tools. Brief Funct Genomics 2020; 19:49-59. [PMID: 31867604 DOI: 10.1093/bfgp/elz033] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/07/2019] [Revised: 10/27/2019] [Accepted: 11/04/2019] [Indexed: 11/12/2022]
Abstract
In silico DNA sequence generation is a powerful technology to evaluate and validate bioinformatics tools, and accordingly more than 35 DNA sequence simulation tools have been developed. With such a diverse array of tools to choose from, an important question is: Which tool should be used for a desired outcome? This question is largely unanswered as documentation for many of these DNA simulation tools is sparse. To address this, we performed a review of DNA sequence simulation tools developed to date and evaluated 20 state-of-the-art DNA sequence simulation tools on their ability to produce accurate reads based on their implemented sequence error model. We provide a succinct description of each tool and suggest which tool is most appropriate for different scenarios. Given the multitude of similar yet non-identical tools, researchers can use this review as a guide to inform their choice of DNA sequence simulation tool. This paves the way towards assessing existing tools in a unified framework, as well as enabling analysis of different simulation scenarios within the same framework.
Affiliation(s)
- Shatha Alosaimi
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Armand Bandiang
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Noelle van Biljon
  - Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Denis Awany
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Prisca K Thami
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
- Milaine S S Tchamga
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Anmol Kiran
  - Malawi-Liverpool-Wellcome Trust Clinical Research Programme, Blantyre, Malawi; Edinburgh University, Edinburgh, UK
- Olfa Messaoud
  - Université de Tunis El Manar, Institut Pasteur de Tunis, LR16IPT05 Génomique Biomédicale et Oncogénétique, Tunis, 1002, Tunisia
- Radia Ismaeel Mohammed Hassan
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Jacquiline Mugo
  - Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Azza Ahmed
  - Centre for Bioinformatics and Systems Biology, Faculty of Science, University of Khartoum, Sudan
- Christian D Bope
  - Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Imane Allali
  - Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Gaston K Mazandu
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; African Institute for Mathematical Sciences (AIMS), Cape Town, South Africa
- Nicola J Mulder
  - Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Emile R Chimusa
  - Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa

9
Yu Z, Du F, Ban R, Zhang Y. SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles. BMC Bioinformatics 2020; 21:331. [PMID: 32703148 PMCID: PMC7379788 DOI: 10.1186/s12859-020-03665-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/08/2018] [Accepted: 07/16/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND: A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. In particular, positional and genomic contextual information is not effectively utilized for reliably characterizing base substitution patterns, and the positional and contextual differences in Phred quality scores have not been fully investigated. Thus, a more effective and efficient bioinformatics tool is sorely required. RESULTS: Here, we introduce a novel tool, SimuSCoP, to reliably emulate complex DNA sequencing data. The base substitution patterns and the statistical behavior of quality scores in Illumina sequencing data are fully explored and integrated into the simulation model for reliably emulating datasets for different applications. In addition, an integrated and easy-to-use pipeline is employed in SimuSCoP to facilitate end-to-end simulation of complex samples, and high runtime efficiency is achieved by implementing the tool to run in multithreading with low memory consumption. These features enable SimuSCoP to achieve substantial improvements in reliability, functionality, practicality and runtime efficiency. The tool is comprehensively evaluated in multiple aspects including consistency of profiles, simulation of genomic variations and complex tumor samples, and the results demonstrate the advantages of SimuSCoP over existing tools. CONCLUSIONS: SimuSCoP, a new bioinformatics tool, is developed to learn informative profiles from real sequencing data and reliably mimic complex data by introducing various genomic variations. We believe that the presented work will catalyse new development of downstream bioinformatics methods for analyzing sequencing data.
Affiliation(s)
- Zhenhua Yu
  - School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Fang Du
  - School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Rongjun Ban
  - Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
- Yuanwei Zhang
  - Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China

10
Jang H, Lee H. Multiresolution correction of GC bias and application to identification of copy number alterations. Bioinformatics 2020; 35:3890-3897. [PMID: 30865265 DOI: 10.1093/bioinformatics/btz174] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/24/2018] [Revised: 03/03/2019] [Accepted: 03/12/2019] [Indexed: 01/03/2023]
Abstract
MOTIVATION: Whole-genome sequencing (WGS) data are affected by various sequencing biases such as GC bias and mappability bias. These biases degrade performance on detection of genetic variations such as copy number alterations. The existing methods use a relation between the GC proportion and depth of coverage (DOC) of markers by means of regression models. Nonetheless, severity of the GC bias varies from sample to sample. We developed a new method for correction of GC bias on the basis of multiresolution analysis. We used a translation-invariant wavelet transform to decompose biased raw signals into high- and low-frequency coefficients. Then, we modeled the relation between GC proportion and DOC of the genomic regions and constructed new control DOC signals that reflect the GC bias. The control DOC signals are used for normalizing genomic sequences by correcting the GC bias. RESULTS: When we applied our method to simulated sequencing data with various degrees of GC bias, our method showed more robust performance on correcting the GC bias than the other methods did. We also applied our method to real-world cancer sequencing datasets and successfully identified cancer-related focal alterations even when cancer genomes were not normalized to normal control samples. In conclusion, our method can be employed for WGS data with different degrees of GC bias. AVAILABILITY AND IMPLEMENTATION: The code is available at http://gcancer.org/wabico. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
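As a rough illustration of the GC-versus-depth relation that this abstract builds on, the sketch below rescales per-window depth by the median depth of windows with similar GC content. This is a plain GC-bin median normalization chosen for brevity; it is not the wavelet-based method of the paper, and the function name `gc_correct`, the bin count and the toy data are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_correct(depths, gc_fractions, n_bins=20):
    """Rescale per-window depth by the median depth of windows with similar GC.

    Assumption: a simple GC-bin median normalization, used only to
    illustrate the idea; it is NOT the multiresolution wavelet method.
    """
    if len(depths) != len(gc_fractions):
        raise ValueError("depths and gc_fractions must have equal length")

    # Group window indices by GC bin (index clamped so gc == 1.0 is valid).
    bins = defaultdict(list)
    for i, gc in enumerate(gc_fractions):
        bins[min(int(gc * n_bins), n_bins - 1)].append(i)

    global_median = median(depths)
    corrected = [0.0] * len(depths)
    for idx in bins.values():
        bin_median = median(depths[i] for i in idx)
        # Guard against zero-coverage bins to avoid division by zero.
        scale = global_median / bin_median if bin_median > 0 else 0.0
        for i in idx:
            corrected[i] = depths[i] * scale
    return corrected


if __name__ == "__main__":
    # Toy windows: depth rises with GC content although copy number is flat.
    gc = [0.30, 0.32, 0.50, 0.52, 0.70, 0.72]
    depth = [20.0, 22.0, 30.0, 32.0, 40.0, 42.0]
    print(gc_correct(depth, gc, n_bins=5))
```

After correction every GC bin shares the same median depth, so the remaining variation within a bin is what a copy number caller would interpret.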
Affiliation(s)
- Ho Jang
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
- Hyunju Lee
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea

11
Yuan X, Gao M, Bai J, Duan J. SVSR: A Program to Simulate Structural Variations and Generate Sequencing Reads for Multiple Platforms. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1082-1091. [PMID: 30334804 DOI: 10.1109/tcbb.2018.2876527] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/08/2023]
Abstract
Structural variation accounts for a major fraction of mutations in the human genome and confers susceptibility to complex diseases. Next generation sequencing along with the rapid development of computational methods provides a cost-effective procedure to detect such variations. Simulation of structural variations and sequencing reads with real characteristics is essential for benchmarking the computational methods. Here, we develop a new program, SVSR, to simulate five types of structural variations (indels, tandem duplication, CNVs, inversions, and translocations) and SNPs for the human genome and to generate sequencing reads with features from popular platforms (Illumina, SOLiD, 454, and Ion Torrent). We adopt a selection model trained from real data to predict copy number states, starting from the first site of a particular genome to the end. Furthermore, we utilize references of microbial genomes to produce insertion fragments and design probabilistic models to imitate inversions and translocations. Moreover, we create platform-specific errors and base quality profiles to generate normal, tumor, or normal-tumor mixture reads. Experimental results show that SVSR could capture more features that are realistic and generate datasets with satisfactory quality scores. SVSR is able to evaluate the performance of structural variation detection methods and guide the development of new computational methods.
|
12
|
Xing Y, Dabney AR, Li X, Wang G, Gill CA, Casola C. SECNVs: A Simulator of Copy Number Variants and Whole-Exome Sequences From Reference Genomes. Front Genet 2020; 11:82. [PMID: 32153642 PMCID: PMC7046838 DOI: 10.3389/fgene.2020.00082] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/05/2019] [Accepted: 01/24/2020] [Indexed: 01/26/2023] Open
Abstract
Copy number variants are duplications and deletions of the genome that play an important role in phenotypic changes and human disease. Many software applications have been developed to detect copy number variants using either whole-genome sequencing or whole-exome sequencing data. However, there is poor agreement in the results from these applications. Simulated datasets containing copy number variants allow comprehensive comparisons of the operating characteristics of existing and novel copy number variant detection methods. Several software applications have been developed to simulate copy number variants and other structural variants in whole-genome sequencing data. However, none of the applications reliably simulate copy number variants in whole-exome sequencing data. We have developed and tested Simulator of Exome Copy Number Variants (SECNVs), a fast, robust and customizable software application for simulating copy number variants and whole-exome sequences from a reference genome. SECNVs is easy to install, implements a wide range of commands to customize simulations, can output multiple samples at once, and incorporates a pipeline to output rearranged genomes, short reads and BAM files in a single command. Variants generated by SECNVs are detected with high sensitivity and precision by tools commonly used to detect copy number variants. SECNVs is publicly available at https://github.com/YJulyXing/SECNVs.
Affiliation(s)
- Yue Xing
- Interdisciplinary Program in Genetics, Texas A&M University, College Station, TX, United States
- Department of Statistics, Texas A&M University, College Station, TX, United States
- Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, United States
| | - Alan R. Dabney
- Department of Statistics, Texas A&M University, College Station, TX, United States
| | - Xiao Li
- Department of Molecular and Cellular Medicine, Texas A&M University, College Station, TX, United States
| | - Guosong Wang
- Department of Animal Science, Texas A&M University, College Station, TX, United States
| | - Clare A. Gill
- Department of Animal Science, Texas A&M University, College Station, TX, United States
| | - Claudio Casola
- Department of Ecosystem Science and Management, Texas A&M University, College Station, TX, United States
| |
|
13
|
Breuss MW, Antaki D, George RD, Kleiber M, James KN, Ball LL, Hong O, Mitra I, Yang X, Wirth SA, Gu J, Garcia CAB, Gujral M, Brandler WM, Musaev D, Nguyen A, McEvoy-Venneri J, Knox R, Sticca E, Botello MCC, Uribe Fenner J, Pérez MC, Arranz M, Moffitt AB, Wang Z, Hervás A, Devinsky O, Gymrek M, Sebat J, Gleeson JG. Autism risk in offspring can be assessed through quantification of male sperm mosaicism. Nat Med 2020; 26:143-150. [PMID: 31873310 PMCID: PMC7032648 DOI: 10.1038/s41591-019-0711-0] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 08/15/2019] [Accepted: 11/21/2019] [Indexed: 01/28/2023]
Abstract
De novo mutations arising on the paternal chromosome make the largest known contribution to autism risk, and correlate with paternal age at the time of conception. The recurrence risk for autism spectrum disorders is substantial, leading many families to decline future pregnancies, but the potential impact of assessing parental gonadal mosaicism has not been considered. We measured sperm mosaicism using deep-whole-genome sequencing, for variants both present in an offspring and evident only in father's sperm, and identified single-nucleotide, structural and short tandem-repeat variants. We found that mosaicism quantification can stratify autism spectrum disorders recurrence risk due to de novo mutations into a vast majority with near 0% recurrence and a small fraction with a substantially higher and quantifiable risk, and we identify novel mosaic variants at risk for transmission to a future offspring. This suggests, therefore, that genetic counseling would benefit from the addition of sperm mosaicism assessment.
Affiliation(s)
- Martin W Breuss
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Danny Antaki
- Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
| | - Renee D George
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Morgan Kleiber
- Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Kiely N James
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Laurel L Ball
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Oanh Hong
- Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
| | - Ileena Mitra
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Xiaoxu Yang
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Sara A Wirth
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Jing Gu
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Camila A B Garcia
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Madhusudan Gujral
- Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
| | - William M Brandler
- Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
| | - Damir Musaev
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - An Nguyen
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Jennifer McEvoy-Venneri
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | - Renatta Knox
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
- Department of Child Neurology, Weill Cornell Medical College, New York, NY, USA
| | - Evan Sticca
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| | | | - Javiera Uribe Fenner
- Child and Adolescent Mental Health Unit, Hospital Universitari Mútua de Terrassa, Barcelona, Spain
| | | | - Maria Arranz
- Fundació Docència i Recerca Mútua Terrassa, Barcelona, Spain
| | - Andrea B Moffitt
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, NY, USA
| | - Zihua Wang
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, NY, USA
| | - Amaia Hervás
- Research Laboratory Unit, Fundació Docència i Recerca Mútua Terrassa, Barcelona, Spain
| | - Orrin Devinsky
- Department of Neurology, Epilepsy Division, New York University School of Medicine, New York, NY, USA
| | - Melissa Gymrek
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Jonathan Sebat
- Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
| | - Joseph G Gleeson
- Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA
- Rady Children's Institute for Genomic Medicine, San Diego, CA, USA
| |
|