1. Pipes L, Nielsen R. A rapid phylogeny-based method for accurate community profiling of large-scale metabarcoding datasets. eLife 2024; 13:e85794. [PMID: 39145536 PMCID: PMC11377034 DOI: 10.7554/elife.85794] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Accepted: 08/14/2024] [Indexed: 08/16/2024]
Abstract
Environmental DNA (eDNA) is becoming an increasingly important tool in diverse scientific fields from ecological biomonitoring to wastewater surveillance of viruses. The fundamental challenge in eDNA analyses has been the bioinformatical assignment of reads to taxonomic groups. It has long been known that full probabilistic methods for phylogenetic assignment are preferable, but unfortunately, such methods are computationally intensive and are typically inapplicable to modern next-generation sequencing data. We present a fast approximate likelihood method for phylogenetic assignment of DNA sequences. Applying the new method to several mock communities and simulated datasets, we show that it identifies more reads at both high and low taxonomic levels more accurately than other leading methods. The advantage of the method is particularly apparent in the presence of polymorphisms and/or sequencing errors and when the true species is not represented in the reference database.
Affiliation(s)
- Lenore Pipes
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
- Rasmus Nielsen
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
  - GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
2. Artymiuk CJ, Basu S, Koganti T, Tandale P, Balan J, Dina MA, Barr Fritcher EG, Wu X, Ashworth T, He R, Viswanatha DS. Clinical Validation of a Targeted Next-Generation Sequencing Panel for Lymphoid Malignancies. J Mol Diagn 2024; 26:583-598. [PMID: 38582399 DOI: 10.1016/j.jmoldx.2024.03.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 02/16/2024] [Accepted: 03/22/2024] [Indexed: 04/08/2024]
Abstract
Lymphoid malignancies are a heterogeneous group of hematological disorders characterized by a diverse range of morphologic, immunophenotypic, and clinical features. Next-generation sequencing (NGS) is increasingly being applied to delineate the complex nature of these malignancies and identify high-value biomarkers with diagnostic, prognostic, or therapeutic benefit. However, there are various challenges in using NGS routinely to characterize lymphoid malignancies, including pre-analytic issues, such as sequencing DNA from formalin-fixed, paraffin-embedded tissue, and optimizing the bioinformatic workflow for accurate variant calling and filtering. This study reports the clinical validation of a custom capture-based NGS panel to test for molecular markers in a range of lymphoproliferative diseases and histiocytic neoplasms. The fully validated clinical assay represents an accurate and sensitive tool for detection of single-nucleotide variants and small insertion/deletion events to facilitate the characterization and management of patients with hematologic cancers specifically of lymphoid origin.
Affiliation(s)
- Cody J Artymiuk
  - Molecular Hematopathology Laboratory, Mayo Clinic, Rochester, Minnesota
- Shubham Basu
  - Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
- Tejaswi Koganti
  - Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
- Michelle A Dina
  - Molecular Hematopathology Laboratory, Mayo Clinic, Rochester, Minnesota
- Xianglin Wu
  - Clinical Genome Sequencing Laboratory, Mayo Clinic, Rochester, Minnesota
- Taylor Ashworth
  - Clinical Genome Sequencing Laboratory, Mayo Clinic, Rochester, Minnesota
- Rong He
  - Hematopathology Division, Mayo Clinic, Rochester, Minnesota
3. Sergi A, Beltrame L, Marchini S, Masseroli M. Integrated approach to generate artificial samples with low tumor fraction for somatic variant calling benchmarking. BMC Bioinformatics 2024; 25:180. [PMID: 38720249 PMCID: PMC11077792 DOI: 10.1186/s12859-024-05793-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 04/19/2024] [Indexed: 05/12/2024]
Abstract
BACKGROUND: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. RESULTS: Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. CONCLUSIONS: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, which facilitates rapid prototyping and benchmarking of algorithms for this type of dataset, and the important need to advance low-fraction variant calling techniques.
Affiliation(s)
- Aldo Sergi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy
  - IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Milan, Rozzano, Italy
- Luca Beltrame
  - IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Milan, Rozzano, Italy
- Sergio Marchini
  - IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Milan, Rozzano, Italy
- Marco Masseroli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy
4. Kalleberg J, Rissman J, Schnabel RD. Overcoming Limitations to Deep Learning in Domesticated Animals with TrioTrain. bioRxiv 2024:2024.04.15.589602. [PMID: 38659907 PMCID: PMC11042298 DOI: 10.1101/2024.04.15.589602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Variant calling across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human genomes. Here, we use bovine genomes to assess the limits of human-genome-trained models in other species. We introduce the first multi-species DV model that achieves a lower Mendelian Inheritance Error (MIE) rate during single-sample genotyping. Our novel approach, TrioTrain, automates extending DV for species without Genome In A Bottle (GIAB) resources and uses region shuffling to mitigate barriers for SLURM-based clusters. To offset imperfect truth labels for animal genomes, we remove Mendelian discordant variants before training, where models are tuned to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to build 30 model iterations across five phases. We observe remarkable performance across phases when testing the GIAB human trios with a mean SNP F1 score >0.990. In HG002, our phase 4 bovine model identifies more variants at a lower MIE rate than DeepTrio. In bovine F1-hybrid genomes, our model substantially reduces inheritance errors with a mean MIE rate of 0.03 percent. Although constrained by imperfect labels, we find that multi-species, trio-based training produces a robust variant calling model. Our research demonstrates that exclusively training with human genomes restricts the application of deep-learning approaches for comparative genomics.
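The abstract above reports results as a Mendelian Inheritance Error (MIE) rate. As a rough illustration of what that metric measures, the following minimal sketch counts trio genotype calls in which the offspring genotype cannot be formed from one allele of each parent. It assumes diploid autosomal sites with fully called genotypes; the function names and the tuple encoding of genotypes are hypothetical and are not taken from TrioTrain or DeepVariant.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Return True if the child genotype can be built from one paternal
    and one maternal allele (diploid, autosomal, no missing calls)."""
    return any(
        sorted((pat, mat)) == sorted(child)
        for pat, mat in product(father, mother)
    )


def mie_rate(trios):
    """Fraction of (child, father, mother) genotype triples that violate
    Mendelian inheritance. Returns 0.0 for an empty input."""
    trios = list(trios)
    if not trios:
        return 0.0
    errors = sum(
        not is_mendelian_consistent(child, father, mother)
        for child, father, mother in trios
    )
    return errors / len(trios)


# Toy data: alleles are encoded as 0 (reference) and 1 (alternate).
example_trios = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from father, 1 from mother
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: father cannot pass on a 1
]
example_rate = mie_rate(example_trios)
```

Real pipelines also have to handle missing genotypes, sex chromosomes, and multi-allelic sites, which this sketch deliberately ignores.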
Affiliation(s)
- Jenna Kalleberg
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Jacob Rissman
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Robert D Schnabel
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
  - University of Missouri, Genetics Area Program, Columbia, MO, 65201 USA
5. Roth C, Venu V, Job V, Lubbers N, Sanbonmatsu KY, Steadman CR, Starkenburg SR. Improved quality metrics for association and reproducibility in chromatin accessibility data using mutual information. BMC Bioinformatics 2023; 24:441. [PMID: 37990143 PMCID: PMC10664258 DOI: 10.1186/s12859-023-05553-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 10/30/2023] [Indexed: 11/23/2023]
Abstract
BACKGROUND: Correlation metrics are widely utilized in genomics analysis and often implemented with little regard to assumptions of normality, homoscedasticity, and independence of values. This is especially true when comparing values between replicated sequencing experiments that probe chromatin accessibility, such as assays for transposase-accessible chromatin via sequencing (ATAC-seq). Such data can possess several regions across the human genome with little to no sequencing depth and are thus non-normal with a large portion of zero values. Despite distributed use in the epigenomics field, few studies have evaluated and benchmarked how correlation and association statistics behave across ATAC-seq experiments with known differences or the effects of removing specific outliers from the data. Here, we developed a computational simulation of ATAC-seq data to elucidate the behavior of correlation statistics and to compare their accuracy under set conditions of reproducibility. RESULTS: Using these simulations, we monitored the behavior of several correlation statistics, including the Pearson's R and Spearman's ρ coefficients as well as Kendall's τ and Top-Down correlation. We also test the behavior of association measures, including the coefficient of determination R², Kendall's W, and normalized mutual information. Our experiments reveal an insensitivity of most statistics, including Spearman's ρ, Kendall's τ, and Kendall's W, to increasing differences between simulated ATAC-seq replicates. The removal of co-zeros (regions lacking mapped sequenced reads) between simulated experiments greatly improves the estimates of correlation and association. After removing co-zeros, the R² coefficient and normalized mutual information display the best performance, having a closer one-to-one relationship with the known portion of shared, enhanced loci between simulated replicates. When comparing values between experimental ATAC-seq data using a random forest model, mutual information best predicts ATAC-seq replicate relationships. CONCLUSIONS: Collectively, this study demonstrates how measures of correlation and association can behave in epigenomics experiments. We provide improved strategies for quantifying relationships in these increasingly prevalent and important chromatin accessibility assays.
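To make the co-zero idea in the abstract above concrete, here is a small sketch that drops regions with zero signal in both replicates before computing Pearson's correlation. It is a toy illustration with made-up counts, not the authors' simulation or code; `remove_co_zeros` and `pearson_r` are hypothetical helper names.

```python
import math


def remove_co_zeros(rep_a, rep_b):
    """Drop positions where both replicates have zero signal."""
    kept = [(a, b) for a, b in zip(rep_a, rep_b) if not (a == 0 and b == 0)]
    return [a for a, _ in kept], [b for _, b in kept]


def pearson_r(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-region read counts: six regions have no reads in either
# replicate, and the four covered regions disagree with each other.
rep_a = [0, 0, 0, 0, 0, 0, 5, 3, 8, 1]
rep_b = [0, 0, 0, 0, 0, 0, 2, 6, 3, 4]

r_all = pearson_r(rep_a, rep_b)
filtered_a, filtered_b = remove_co_zeros(rep_a, rep_b)
r_filtered = pearson_r(filtered_a, filtered_b)
```

In this toy case the shared zeros pull the correlation upward (`r_all` is positive) even though the covered regions are negatively related (`r_filtered` is negative), which is one way co-zeros can distort a reproducibility estimate.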
Affiliation(s)
- Cullen Roth
  - Los Alamos National Laboratory, Genomics and Bioanalytics, Los Alamos, NM, USA
- Vrinda Venu
  - Los Alamos National Laboratory, Climate, Ecosystems, and Environmental Science, Los Alamos, NM, USA
- Vanessa Job
  - Los Alamos National Laboratory, High Performance Computing and Design, Los Alamos, NM, USA
- Nicholas Lubbers
  - Los Alamos National Laboratory, Information Sciences, Los Alamos, NM, USA
- Karissa Y Sanbonmatsu
  - Los Alamos National Laboratory, Theoretical Biology and Biophysics, Los Alamos, NM, USA
- Christina R Steadman
  - Los Alamos National Laboratory, Climate, Ecosystems, and Environmental Science, Los Alamos, NM, USA
- Shawn R Starkenburg
  - Los Alamos National Laboratory, Genomics and Bioanalytics, Los Alamos, NM, USA
6. Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 92] [Impact Index Per Article: 92.0] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Oxford Nanopore Technologies Inc., Oxford, UK
- Monika Cechova
  - Faculty of Informatics, Masaryk University, Brno, Czech Republic
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- Savannah J Hoyt
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Dylan J Taylor
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Nicolas Altemose
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
- Paul W Hook
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Mikko Rautiainen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ivan A Alexandrov
  - Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
  - Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
- Jamie Allen
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Mobin Asri
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Andrey V Bzikadze
  - Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Chen-Shan Chin
  - GeneDX Holdings Corp, Stamford, CT, USA
  - Foundation of Biological Data Science, Belmont, CA, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Genetics, University of Cambridge, Cambridge, UK
- Carlos Garcia Giron
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Ariel Gershman
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jennifer L Gerton
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - University of Kansas Medical Center, Kansas City, MO, USA
- Patrick G S Grady
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Genomics Research Centre, Human Technopole, Milan, Italy
- Leanne Haggerty
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Reza Halabian
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
- Nancy F Hansen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Robert Harris
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Gabrielle A Hartley
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- William T Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Marina Haukness
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Jakob Heinz
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Sarah E Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Stephen Hwang
  - XDBio Program, Johns Hopkins University, Baltimore, MD, USA
- Miten Jain
  - Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
- Rupesh K Kesharwani
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Alexandra P Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Glennis A Logsdon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Julian K Lucas
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Wojciech Makalowski
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
- Christopher Markovic
  - Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ann M Mc Cartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Rajiv C McCoy
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Jennifer McDaniel
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Brandy M McNulty
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
  - Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
  - UCL Queen Square Institute of Neurology, UCL, London, UK
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Terence D Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Hugh E Olsen
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Nathan D Olson
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Luis F Paulin
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Tamara Potapova
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Fedor Ryabov
  - Masters Program in National Research University Higher School of Economics, Moscow, Russia
- Steven L Salzberg
  - Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Alaina Shumate
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Likhitha Surapaneni
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Angela M Taravella Oill
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Winston Timp
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Marta Tomaszkiewicz
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
  - Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
- Mitchell R Vollger
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Brian P Walenz
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Allison C Watwood
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Melissa A Wilson
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Samantha Zarate
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Yiming Zhu
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Justin M Zook
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Rachel J O'Neill
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
  - Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
- Michael C Schatz
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Karen H Miga
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Kateryna D Makova
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
7. Roder AE, Johnson KEE, Knoll M, Khalfan M, Wang B, Schultz-Cherry S, Banakis S, Kreitman A, Mederos C, Youn JH, Mercado R, Wang W, Chung M, Ruchnewitz D, Samanovic MI, Mulligan MJ, Lässig M, Luksza M, Das S, Gresham D, Ghedin E. Optimized quantification of intra-host viral diversity in SARS-CoV-2 and influenza virus sequence data. mBio 2023; 14:e0104623. [PMID: 37389439 PMCID: PMC10470513 DOI: 10.1128/mbio.01046-23] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 04/25/2023] [Accepted: 05/02/2023] [Indexed: 07/01/2023]
Abstract
High error rates of viral RNA-dependent RNA polymerases lead to diverse intra-host viral populations during infection. Errors made during replication that are not strongly deleterious to the virus can lead to the generation of minority variants. However, accurate detection of minority variants in viral sequence data is complicated by errors introduced during sample preparation and data analysis. We used synthetic RNA controls and simulated data to test seven variant-calling tools across a range of allele frequencies and simulated coverages. We show that choice of variant caller and use of replicate sequencing have the most significant impact on single-nucleotide variant (SNV) discovery and demonstrate how both allele frequency and coverage thresholds impact both false discovery and false-negative rates. When replicates are not available, using a combination of multiple callers with more stringent cutoffs is recommended. We use these parameters to find minority variants in sequencing data from SARS-CoV-2 clinical specimens and provide guidance for studies of intra-host viral diversity using either single replicate data or data from technical replicates. Our study provides a framework for rigorous assessment of technical factors that impact SNV identification in viral samples and establishes heuristics that will inform and improve future studies of intra-host variation, viral diversity, and viral evolution. IMPORTANCE: When viruses replicate inside a host cell, the virus replication machinery makes mistakes. Over time, these mistakes create mutations that result in a diverse population of viruses inside the host. Mutations that are neither lethal to the virus nor strongly beneficial can lead to minority variants that are minor members of the virus population. However, preparing samples for sequencing can also introduce errors that resemble minority variants, resulting in the inclusion of false-positive data if not filtered correctly. In this study, we aimed to determine the best methods for identification and quantification of these minority variants by testing the performance of seven commonly used variant-calling tools. We used simulated and synthetic data to test their performance against a true set of variants and then used these studies to inform variant identification in data from SARS-CoV-2 clinical specimens. Together, analyses of our data provide extensive guidance for future studies of viral diversity and evolution.
Collapse
Affiliation(s)
- A. E. Roder
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
| | - K. E. E. Johnson
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA
| | - M. Knoll
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA
| | - M. Khalfan
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA
| | - B. Wang
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA
| | - S. Schultz-Cherry
- Department of Infectious Diseases, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
| | - S. Banakis
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
| | - A. Kreitman
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
| | - C. Mederos
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
| | - J.-H. Youn
- Department of Laboratory Medicine, NIH, Bethesda, Maryland, USA
| | - R. Mercado
- Department of Laboratory Medicine, NIH, Bethesda, Maryland, USA
| | - W. Wang
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
| | - M. Chung
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
| | - D. Ruchnewitz
- Institute for Biological Physics, University of Cologne, Cologne, Germany
| | - M. I. Samanovic
- Department of Medicine, New York University Langone Vaccine Center, New York, New York, USA
| | - M. J. Mulligan
- Department of Medicine, New York University Langone Vaccine Center, New York, New York, USA
| | - M. Lässig
- Institute for Biological Physics, University of Cologne, Cologne, Germany
| | - M. Luksza
- Department of Oncological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - S. Das
- Department of Laboratory Medicine, NIH, Bethesda, Maryland, USA
| | - D. Gresham
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA
| | - E. Ghedin
- Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA
|
8
|
Glick L, Mayrose I. The Effect of Methodological Considerations on the Construction of Gene-Based Plant Pan-genomes. Genome Biol Evol 2023; 15:evad121. [PMID: 37401440 PMCID: PMC10340445 DOI: 10.1093/gbe/evad121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 06/21/2023] [Accepted: 06/28/2023] [Indexed: 07/05/2023] Open
Abstract
Pan-genomics is an emerging approach for studying the genetic diversity within plant populations. In contrast to common resequencing studies that compare whole genome sequencing data with a single reference genome, the construction of a pan-genome (PG) involves the direct comparison of multiple genomes to one another, thereby enabling the detection of genomic sequences and genes not present in the reference, as well as the analysis of gene content diversity. Although multiple studies describing PGs of various plant species have been published in recent years, a better understanding regarding the effect of the computational procedures used for PG construction could guide researchers in making more informed methodological decisions. Here, we examine the effect of several key methodological factors on the obtained gene pool and on gene presence-absence detections by constructing and comparing multiple PGs of Arabidopsis thaliana and cultivated soybean, as well as conducting a meta-analysis on published PGs. These factors include the construction method, the sequencing depth, and the extent of input data used for gene annotation. We observe substantial differences between PGs constructed using three common procedures (de novo assembly and annotation, map-to-pan, and iterative assembly) and that results are dependent on the extent of the input data. Specifically, we report low agreement between the gene content inferred using different procedures and input data. Our results should increase the awareness of the community to the consequences of methodological decisions made during the process of PG construction and emphasize the need for further investigation of commonly applied methodologies.
Affiliation(s)
- Lior Glick
- Department of Life Sciences, School of Plant Sciences and Food Security, Tel-Aviv University, Tel Aviv, Israel
| | - Itay Mayrose
- Department of Life Sciences, School of Plant Sciences and Food Security, Tel-Aviv University, Tel Aviv, Israel
|
9
|
Shen W, Sellers HL, Choate LA, Stein MI, Tandale PP, Tan J, Setlem R, Sakai Y, Fadra N, Sosa C, McClelland SP, Barnett SS, Rasmussen KJ, Runke CK, Smoley SA, Tillmans LS, Marcou CA, Rowsey RA, Thorland EC, Boczek NJ, Kearney HM. Clinical Validation of Tagmentation-Based Genome Sequencing for Germline Disorders. J Mol Diagn 2023; 25:524-531. [PMID: 37088140 DOI: 10.1016/j.jmoldx.2023.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 03/09/2023] [Accepted: 04/04/2023] [Indexed: 04/25/2023] Open
Abstract
Genome sequencing (GS) is a powerful clinical tool used for the comprehensive diagnosis of germline disorders. GS library preparation typically involves mechanical DNA fragmentation, end repair, and bead-based library size selection followed by adapter ligation, which can require a large amount of input genomic DNA. Tagmentation using bead-linked transposomes can simplify the library preparation process and reduce the DNA input requirement. Here we describe the clinical validation of tagmentation-based PCR-free GS as a clinical test for rare germline disorders. Compared with the Genome-in-a-Bottle Consortium benchmark variant sets, GS had a recall >99.7% and a precision of 99.8% for single nucleotide variants and small insertion-deletions. GS also exhibited 100% sensitivity for clinically reported sequence variants and the copy number variants examined. Furthermore, GS detected mitochondrial sequence variants above 5% heteroplasmy and showed reliable detection of disease-relevant repeat expansions and SMN1 homozygous loss. Our results indicate that while lowering DNA input requirements and reducing library preparation time, GS enables uniform coverage across the genome as well as robust detection of various types of genetic alterations. With the advantage of comprehensive profiling of multiple types of genetic alterations, GS is positioned as an ideal first-tier diagnostic test for germline disorders.
Affiliation(s)
- Wei Shen
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota.
| | - Heidi L Sellers
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Lauren A Choate
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Mariam I Stein
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Pratyush P Tandale
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Jiayu Tan
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Rohit Setlem
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Yuta Sakai
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Numrah Fadra
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Carlos Sosa
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Shawn P McClelland
- Division of Computational Biology, Mayo Clinic Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota
| | - Sarah S Barnett
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Kristen J Rasmussen
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Cassandra K Runke
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Stephanie A Smoley
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Lori S Tillmans
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Cherisse A Marcou
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Ross A Rowsey
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Erik C Thorland
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Nicole J Boczek
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
| | - Hutton M Kearney
- Division of Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota.
|
10
|
Hoskins I, Sun S, Cote A, Roth FP, Cenik C. satmut_utils: a simulation and variant calling package for multiplexed assays of variant effect. Genome Biol 2023; 24:82. [PMID: 37081510 PMCID: PMC10116734 DOI: 10.1186/s13059-023-02922-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/26/2022] [Accepted: 04/04/2023] [Indexed: 04/22/2023] Open
Abstract
The impact of millions of individual genetic variants on molecular phenotypes in coding sequences remains unknown. Multiplexed assays of variant effect (MAVEs) are scalable methods to annotate relevant variants, but existing software lacks standardization, requires cumbersome configuration, and does not scale to large targets. We present satmut_utils as a flexible solution for simulation and variant quantification. We then benchmark MAVE software using simulated and real MAVE data. We finally determine mRNA abundance for thousands of cystathionine beta-synthase variants using two experimental methods. The satmut_utils package enables high-performance analysis of MAVEs and reveals the capability of variants to alter mRNA abundance.
Affiliation(s)
- Ian Hoskins
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
| | - Song Sun
- The Donnelly Centre and Departments of Molecular Genetics and Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Atina Cote
- The Donnelly Centre and Departments of Molecular Genetics and Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Frederick P Roth
- The Donnelly Centre and Departments of Molecular Genetics and Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Can Cenik
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA.
|
11
|
Performance evaluation of six popular short-read simulators. Heredity (Edinb) 2023; 130:55-63. [PMID: 36496447 PMCID: PMC9905089 DOI: 10.1038/s41437-022-00577-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/06/2022] [Revised: 11/10/2022] [Accepted: 11/11/2022] [Indexed: 12/14/2022] Open
Abstract
High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas "gold-standard" empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design; yet the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators (ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim) and discuss important considerations for selecting suitable models for benchmarking.
|
12
|
Duncavage EJ, Coleman JF, de Baca ME, Kadri S, Leon A, Routbort M, Roy S, Suarez CJ, Vanderbilt C, Zook JM. Recommendations for the Use of in Silico Approaches for Next-Generation Sequencing Bioinformatic Pipeline Validation: A Joint Report of the Association for Molecular Pathology, Association for Pathology Informatics, and College of American Pathologists. J Mol Diagn 2023; 25:3-16. [PMID: 36244574 DOI: 10.1016/j.jmoldx.2022.09.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 01/24/2022] [Revised: 09/14/2022] [Accepted: 09/28/2022] [Indexed: 11/21/2022] Open
Abstract
In silico approaches for next-generation sequencing (NGS) data modeling have utility in the clinical laboratory as a tool for clinical assay validation. In silico NGS data can take a variety of forms, including pure simulated data or manipulated data files in which variants are inserted into existing data files. In silico data enable simulation of a range of variants that may be difficult to obtain from a single physical sample. Such data allow laboratories to more accurately test the performance of clinical bioinformatics pipelines without sequencing additional cases. For example, clinical laboratories may use in silico data to simulate low variant allele fraction variants to test the analytical sensitivity of variant calling software or simulate a range of insertion/deletion sizes to determine the performance of insertion/deletion calling software. In this article, the Working Group reviews the different types of in silico data with their strengths and limitations, methods to generate in silico data, and how data can be used in the clinical molecular diagnostic laboratory. Survey data indicate how in silico NGS data are currently being used. Finally, potential applications for which in silico data may become useful in the future are presented.
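As a rough illustration of the "manipulated data" idea mentioned in this abstract, the sketch below spikes an alternate allele into the bases observed at one site so that a chosen variant allele fraction (VAF) is reached. It is a toy model under stated assumptions, not a method from the report: the function name `spike_variant`, the list-of-bases pileup representation, and the fixed random seed are all hypothetical, and real tools edit aligned reads rather than a bare list of bases.

```python
import random
from typing import List


def spike_variant(pileup: List[str], alt: str, vaf: float, seed: int = 0) -> List[str]:
    """Return a copy of `pileup` in which round(vaf * depth) randomly chosen reads carry `alt`.

    `pileup` holds the base each read shows at one site (hypothetical representation).
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must be between 0 and 1")
    rng = random.Random(seed)               # seeded so the simulation is reproducible
    n_alt = round(vaf * len(pileup))        # number of reads that will carry the variant
    chosen = set(rng.sample(range(len(pileup)), n_alt))
    return [alt if i in chosen else base for i, base in enumerate(pileup)]
```

Running a variant caller on data modified this way at a series of VAF values (for example 0.01, 0.02, 0.05) gives a simple estimate of the lowest allele fraction the pipeline detects reliably at a given depth.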
Affiliation(s)
- Eric J Duncavage
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri.
| | - Joshua F Coleman
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, University of Utah, Salt Lake City, Utah
| | - Monica E de Baca
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Pacific Pathology Partners, Seattle, Washington
| | - Sabah Kadri
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, Anne and Robert H Lurie Children's Hospital of Chicago, Chicago, Illinois
| | - Annette Leon
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Color Health, Burlingame, California
| | - Mark Routbort
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Hematopathology, MD Anderson Cancer Center, Houston, Texas
| | - Somak Roy
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology and Laboratory Medicine, Cincinnati Children's Hospital, Cincinnati, Ohio
| | - Carlos J Suarez
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, Stanford University, Palo Alto, California
| | - Chad Vanderbilt
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
| | - Justin M Zook
- In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland; Biomarker and Genomic Sciences Group, National Institute of Standards and Technology, Gaithersburg, Maryland
|
13
|
Scandino R, Calabrese F, Romanel A. Synggen: fast and data-driven generation of synthetic heterogeneous NGS cancer data. Bioinformatics 2022; 39:6885441. [PMID: 36484701 PMCID: PMC9825741 DOI: 10.1093/bioinformatics/btac792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 11/02/2022] [Accepted: 12/08/2022] [Indexed: 12/13/2022] Open
Abstract
SUMMARY Whole-exome and targeted sequencing are widely used both in translational cancer genomics and in precision medicine. Benchmarking the computational methods and tools that are under continuous development is fundamental for the correct interpretation of somatic genomic profiling results. To this end, we developed synggen, a tool for the fast generation of large-scale, realistic, and heterogeneous synthetic cancer whole-exome and targeted sequencing datasets, which enables the incorporation of phased germline single nucleotide polymorphisms and complex allele-specific somatic genomic events. The performance and effectiveness of synggen in generating synthetic cancer data are shown across different scenarios and across platforms with distinct characteristics. AVAILABILITY AND IMPLEMENTATION synggen is freely available at https://bitbucket.org/CibioBCG/synggen/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Riccardo Scandino
- Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento 38123, Italy
| | - Federico Calabrese
- Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento 38123, Italy
|
14
|
Pan J, Li X, Zhang M, Lu Y, Zhu Y, Wu K, Wu Y, Wang W, Chen B, Liu Z, Wang X, Gao J. TransFlow: a Snakemake workflow for transmission analysis of Mycobacterium tuberculosis whole-genome sequencing data. Bioinformatics 2022; 39:6873737. [PMID: 36469333 PMCID: PMC9825751 DOI: 10.1093/bioinformatics/btac785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2022] [Revised: 10/26/2022] [Accepted: 12/02/2022] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION Whole-genome sequencing (WGS) is increasingly used to aid the understanding of Mycobacterium tuberculosis (MTB) transmission. The epidemiological analysis of tuberculosis based on WGS requires a diverse collection of bioinformatics tools. Effectively using these analysis tools in a scalable and reproducible way can be challenging, especially for non-experts. RESULTS Here, we present TransFlow (Transmission Workflow), a user-friendly, fast, efficient and comprehensive WGS-based transmission analysis pipeline. TransFlow combines several state-of-the-art tools to take transmission analysis from raw sequencing data, through quality control, sequence alignment and variant calling, into downstream transmission clustering, transmission network reconstruction and transmission risk factor inference, together with summary statistics and data visualization in a summary report. TransFlow relies on Snakemake and Conda to resolve dependencies among consecutive processing steps and can be easily adapted to any computation environment. AVAILABILITY AND IMPLEMENTATION TransFlow is freely available at https://github.com/cvn001/transflow. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Mingwu Zhang
- The Institute of TB Control, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, Zhejiang 310051, China
| | - Yewei Lu
- Key Laboratory of Precision Medicine in Diagnosis and Monitoring Research of Zhejiang Province, Hangzhou, Zhejiang 310020, China
| | - Yelei Zhu
- The Institute of TB Control, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, Zhejiang 310051, China
| | - Kunyang Wu
- The Institute of TB Control, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, Zhejiang 310051, China
| | - Yiwen Wu
- Department of Medical Oncology, Zhejiang Chinese Medical University, Hangzhou, Zhejiang 310053, China
| | - Weixin Wang
- Key Laboratory of Precision Medicine in Diagnosis and Monitoring Research of Zhejiang Province, Hangzhou, Zhejiang 310020, China
| | - Bin Chen
- The Institute of TB Control, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, Zhejiang 310051, China
| | - Zhengwei Liu
- To whom correspondence should be addressed.
| | | | - Junshun Gao
- To whom correspondence should be addressed.
|
15
|
Sultanov D, Hochwagen A. Varying strength of selection contributes to the intragenomic diversity of rRNA genes. Nat Commun 2022; 13:7245. [PMID: 36434003 PMCID: PMC9700816 DOI: 10.1038/s41467-022-34989-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/09/2022] [Accepted: 11/14/2022] [Indexed: 11/27/2022] Open
Abstract
Ribosome biogenesis in eukaryotes is supported by hundreds of ribosomal RNA (rRNA) gene copies that are encoded in the ribosomal DNA (rDNA). The multiple copies of rRNA genes are thought to have low sequence diversity within one species. Here, we present species-wide rDNA sequence analysis in Saccharomyces cerevisiae that challenges this view. We show that rDNA copies in this yeast are heterogeneous, both among and within isolates, and that many variants avoided fixation or elimination over evolutionary time. The sequence diversity landscape across the rDNA shows clear functional stratification, suggesting different copy-number thresholds for selection that contribute to rDNA diversity. Notably, nucleotide variants in the most conserved rDNA regions are sufficiently deleterious to exhibit signatures of purifying selection even when present in only a small fraction of rRNA gene copies. Our results portray a complex evolutionary landscape that shapes rDNA sequence diversity within a single species and reveal unexpectedly strong purifying selection of multi-copy genes.
Affiliation(s)
- Daniel Sultanov
- Department of Biology, New York University, New York, NY 10003, USA
| | - Andreas Hochwagen
- Department of Biology, New York University, New York, NY 10003, USA
|
16
|
Pipes L, Chen Z, Afanaseva S, Nielsen R. Estimating the relative proportions of SARS-CoV-2 haplotypes from wastewater samples. Cell Rep Methods 2022; 2:100313. [PMID: 36159190 PMCID: PMC9485417 DOI: 10.1016/j.crmeth.2022.100313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2022] [Revised: 06/27/2022] [Accepted: 09/14/2022] [Indexed: 12/02/2022]
Abstract
Wastewater surveillance has become essential for monitoring the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The quantification of SARS-CoV-2 RNA in wastewater correlates with the coronavirus disease 2019 (COVID-19) caseload in a community. However, estimating the proportions of different SARS-CoV-2 haplotypes has remained technically difficult. We present a phylogenetic imputation method for improving the SARS-CoV-2 reference database and a method for estimating the relative proportions of SARS-CoV-2 haplotypes from wastewater samples. The phylogenetic imputation method uses the global SARS-CoV-2 phylogeny and imputes based on the maximum of the posterior probability of each nucleotide. We show that the imputation method has error rates comparable to, or lower than, typical sequencing error rates, which substantially improves the reference database and allows for accurate inferences of haplotype composition. Our method for estimating relative proportions of haplotypes uses an initial step to remove unlikely haplotypes and an expectation maximization (EM) algorithm for obtaining maximum likelihood estimates of the proportions of different haplotypes in a sample. Using simulations with a reference database of >3 million SARS-CoV-2 genomes, we show that the estimated proportions reflect the true proportions given sufficiently high sequencing depth.
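The expectation maximization step described in this abstract can be illustrated with a generic mixture-proportion EM. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a precomputed matrix `lik[i][k]` giving the probability of read `i` under haplotype `k`, it omits the initial haplotype-filtering step, and the function name and parameters (`em_proportions`, `n_iter`, `tol`) are hypothetical.

```python
from typing import List, Sequence


def em_proportions(lik: Sequence[Sequence[float]], n_iter: int = 200, tol: float = 1e-9) -> List[float]:
    """Estimate haplotype proportions by EM, given lik[i][k] = P(read i | haplotype k)."""
    n_haps = len(lik[0])
    p = [1.0 / n_haps] * n_haps                 # start from uniform proportions
    for _ in range(n_iter):
        totals = [0.0] * n_haps
        for row in lik:
            # E-step: posterior probability that this read came from each haplotype.
            weights = [p[k] * row[k] for k in range(n_haps)]
            s = sum(weights)
            if s == 0.0:
                continue                        # read is incompatible with every haplotype
            for k in range(n_haps):
                totals[k] += weights[k] / s
        norm = sum(totals)
        if norm == 0.0:
            return p                            # no informative reads; keep current estimate
        # M-step: new proportions are the normalized expected read counts.
        new_p = [t / norm for t in totals]
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p
```

Each iteration cannot decrease the likelihood, and the estimates approach the maximum likelihood proportions as read depth grows, which matches the abstract's remark that accuracy depends on sufficiently high sequencing depth.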
Affiliation(s)
- Lenore Pipes
- Department of Integrative Biology, University of California-Berkeley, 4098 Valley Life Sciences Building, Berkeley, CA 94720, USA
| | - Zihao Chen
- School of Mathematical Sciences, Peking University, Beijing 100871, China
| | - Svetlana Afanaseva
- Department of Integrative Biology, University of California-Berkeley, 4098 Valley Life Sciences Building, Berkeley, CA 94720, USA
| | - Rasmus Nielsen
- Department of Integrative Biology, University of California-Berkeley, 4098 Valley Life Sciences Building, Berkeley, CA 94720, USA
- GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
|
17
|
Vilgalys TP, Fogel AS, Anderson JA, Mututua RS, Warutere JK, Siodi IL, Kim SY, Voyles TN, Robinson JA, Wall JD, Archie EA, Alberts SC, Tung J. Selection against admixture and gene regulatory divergence in a long-term primate field study. Science 2022; 377:635-641. [PMID: 35926022 PMCID: PMC9682493 DOI: 10.1126/science.abm4917] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Indexed: 12/13/2022]
Abstract
Genetic admixture is central to primate evolution. We combined 50 years of field observations of immigration and group demography with genomic data from ~9 generations of hybrid baboons to investigate the consequences of admixture in the wild. Despite no obvious fitness costs to hybrids, we found signatures of selection against admixture similar to those described for archaic hominins. These patterns were concentrated near genes where ancestry is strongly associated with gene expression. Our analyses also show that introgression is partially predictable across the genome. This study demonstrates the value of integrating genomic and field data for revealing how "genomic signatures of selection" (e.g., reduced introgression in low-recombination regions) manifest in nature; moreover, it underscores the importance of other primates as living models for human evolution.
Affiliation(s)
- Tauras P. Vilgalys
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA; Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Arielle S. Fogel
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA; University Program in Genetics and Genomics, Duke University, Durham, NC, USA
| | - Jordan A. Anderson
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA
| | | | | | | | - Sang Yoon Kim
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA
| | - Tawni N. Voyles
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA
| | | | - Jeffrey D. Wall
- Institute for Human Genetics, University of California, San Francisco, CA, USA
| | - Elizabeth A. Archie
- Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA
| | - Susan C. Alberts
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA; Department of Biology, Duke University, Durham, NC, USA; Duke University Population Research Institute, Duke University, Durham, NC, USA
| | - Jenny Tung
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA; Department of Biology, Duke University, Durham, NC, USA; Duke University Population Research Institute, Duke University, Durham, NC, USA; Canadian Institute for Advanced Research, Toronto, Canada; Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany; Corresponding author
|
18
|
Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022; 12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022] Open
Abstract
Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.
Affiliation(s)
| | - Kiwoong Nam
- DGIMI, Univ Montpellier, INRAE, Montpellier, France.
|
19
|
Prodanov T, Bansal V. Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing. Nat Commun 2022; 13:3221. [PMID: 35680869 PMCID: PMC9184528 DOI: 10.1038/s41467-022-30930-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2021] [Accepted: 05/20/2022] [Indexed: 11/09/2022] Open
Abstract
The human genome contains hundreds of low-copy repeats (LCRs) that are challenging to analyze using short-read sequencing technologies due to extensive copy number variation and ambiguity in read mapping. Copy number and sequence variants in more than 150 duplicated genes that overlap LCRs have been implicated in monogenic and complex human diseases. We describe a computational tool, Parascopy, for estimating the aggregate and paralog-specific copy number of duplicated genes using whole-genome sequencing (WGS). Parascopy is an efficient method that jointly analyzes reads mapped to different repeat copies without the need for global realignment. It leverages multiple samples to mitigate sequencing bias and to identify reliable paralogous sequence variants (PSVs) that differentiate repeat copies. Analysis of WGS data for 2504 individuals from diverse populations showed that Parascopy is robust to sequencing bias, has higher accuracy compared to existing methods and enables prioritization of pathogenic copy number changes in duplicated genes.
Affiliation(s)
- Timofey Prodanov
- Bioinformatics and Systems Biology Graduate Program, University of California, La Jolla, San Diego, CA, 92093, USA
| | - Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, La Jolla, San Diego, CA, 92093, USA.
| |
|
20
|
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2022; 10:80. [PMID: 35847383 PMCID: PMC9243550 DOI: 10.12688/f1000research.39214.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/10/2022] [Indexed: 11/20/2022]
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Affiliation(s)
| | - Marco Fabbri
- European Commission Joint Research Centre, Ispra, Italy
| | | | | | - Guy Van den Eede
- European Commission Joint Research Centre, Ispra, Italy
- European Commission Joint Research Centre, Geel, Belgium
| | - Erik Alm
- The European Centre for Disease Prevention and Control, Stockholm, Sweden
| | - Derya Aytan-Aktug
- National Food Institute, Technical University of Denmark, Lyngby, Denmark
| | | | - Catherine Carrillo
- Ottawa Laboratory – Carling, Canadian Food Inspection Agency, Ottawa, Ontario, Canada
| | | | - Kok-Gan Chan
- International Genome Centre, Jiangsu University, Zhenjiang, China
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
| | - Teresa Coque
- Servicio de Microbiología, Hospital Universitario Ramón y Cajal, Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), Madrid, Spain
- Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Carlos III Health Institute, Madrid, Spain
| | | | - Ivo Gut
- Centro Nacional de Análisis Genómico, Centre for Genomic Regulation (CNAG-CRG), Barcelona Institute of Technology, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - Paul Hammer
- BIOMES. NGS GmbH c/o Technische Hochschule Wildau, Wildau, Germany
| | - Gemma L. Kay
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
| | - Jean-Yves Madec
- Unité Antibiorésistance et Virulence Bactériennes, ANSES Site de Lyon, Lyon, France
| | - Alison E. Mather
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
- University of East Anglia, Norwich, UK
| | | | - Thierry Naas
- French-NRC for CPEs, Service de Bactériologie-Hygiène, Hôpital de Bicêtre, Le Kremlin-Bicêtre, France
| | | | - Silke Peter
- Institute of Medical Microbiology and Hygiene, University of Tübingen, Tübingen, Germany
| | - Arthur Pightling
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, USA
| | | | - John Rossen
- Department of Medical Microbiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | | | - Robert Schlaberg
- Department of Pathology, University of Utah, Salt Lake City, UT, USA
| | - Kevin Vanneste
- Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
| | - Lukas M. Weber
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Present address: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | | | | |
|
21
|
Cherukuri PF, Soe MM, Condon DE, Bartaria S, Meis K, Gu S, Frost FG, Fricke LM, Lubieniecki KP, Lubieniecka JM, Pyatt RE, Hajek C, Boerkoel CF, Carmichael L. Establishing analytical validity of BeadChip array genotype data by comparison to whole-genome sequence and standard benchmark datasets. BMC Med Genomics 2022; 15:56. [PMID: 35287663 PMCID: PMC8919546 DOI: 10.1186/s12920-022-01199-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/08/2021] [Accepted: 02/28/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Clinical use of genotype data requires high positive predictive value (PPV) and thorough understanding of the genotyping platform characteristics. BeadChip arrays, such as the Global Screening Array (GSA), potentially offer a high-throughput, low-cost clinical screen for known variants. We hypothesize that quality assessment and comparison to whole-genome sequence and benchmark data establish the analytical validity of GSA genotyping. METHODS To test this hypothesis, we selected 263 samples from Coriell, generated GSA genotypes in triplicate, generated whole genome sequence (rWGS) genotypes, assessed the quality of each set of genotypes, and compared each set of genotypes to each other and to the 1000 Genomes Phase 3 (1KG) genotypes, a performance benchmark. For 59 genes (MAP59), we also performed theoretical and empirical evaluation of variants deemed medically actionable predispositions. RESULTS Quality analyses detected sample contamination and increased assay failure along the chip margins. Comparison to benchmark data demonstrated that > 82% of the GSA assays had a PPV of 1. GSA assays targeting transitions, genomic regions of high complexity, and common variants performed better than those targeting transversions, regions of low complexity, and rare variants. Comparison of GSA data to rWGS and 1KG data showed > 99% performance across all measured parameters. Consistent with predictions from prior studies, the GSA detection of variation within the MAP59 genes was 3/261. CONCLUSION We establish the analytical validity of GSA assays using quality analytics and comparison to benchmark and rWGS data. GSA assays meet the standards of a clinical screen although assays interrogating rare variants, transversions, and variants within low-complexity regions require careful evaluation.
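The positive predictive value reported per assay can be illustrated with a short calculation. This is a sketch under stated assumptions, not the authors' procedure: it treats PPV as TP / (TP + FP) over non-reference array calls, counts a call as a true positive only when the benchmark genotype is identical, and uses hypothetical sample names and genotype strings.

```python
def ppv(calls, truth, ref="0/0"):
    """Positive predictive value of one assay across samples.

    calls: sample -> genotype reported by the array (assumed string encoding)
    truth: sample -> genotype in the benchmark set
    Only non-reference array calls with a benchmark genotype are scored.
    Returns None when the assay made no scorable non-reference call.
    """
    tp = fp = 0
    for sample, genotype in calls.items():
        if genotype == ref or sample not in truth:
            continue
        if truth[sample] == genotype:
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if (tp + fp) else None

# Hypothetical genotypes for a single assay, for illustration only.
array_calls = {"s1": "0/1", "s2": "0/1", "s3": "0/0", "s4": "1/1"}
benchmark = {"s1": "0/1", "s2": "0/0", "s3": "0/0", "s4": "1/1"}
assay_ppv = ppv(array_calls, benchmark)  # two concordant calls, one discordant
```

An assay with a PPV of 1 in this scheme is one whose non-reference calls all agree with the benchmark; the study's exact scoring rules may differ.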
Affiliation(s)
- Praveen F Cherukuri
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA.,Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA.,Sanford Research Center, Sioux Falls, SD, USA
| | - Melissa M Soe
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - David E Condon
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA.,Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
| | - Shubhi Bartaria
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - Kaitlynn Meis
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - Shaopeng Gu
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - Frederick G Frost
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - Lindsay M Fricke
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - Krzysztof P Lubieniecki
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA.,Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA.,Sanford Research Center, Sioux Falls, SD, USA
| | - Joanna M Lubieniecka
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA.,Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA.,Sanford Research Center, Sioux Falls, SD, USA
| | - Robert E Pyatt
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA.,Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
| | - Catherine Hajek
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA.,Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
| | - Cornelius F Boerkoel
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| | - Lynn Carmichael
- Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
| |
|
22
|
MQuad enables clonal substructure discovery using single cell mitochondrial variants. Nat Commun 2022; 13:1205. [PMID: 35260582 PMCID: PMC8904442 DOI: 10.1038/s41467-022-28845-0] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 04/14/2021] [Accepted: 02/14/2022] [Indexed: 02/08/2023]
Abstract
Mitochondrial mutations are increasingly recognised as informative endogenous genetic markers that can be used to reconstruct cellular clonal structure using single-cell RNA or DNA sequencing data. However, identifying informative mtDNA variants in noisy and sparse single-cell sequencing data is still challenging, with few computational methods available. Here we present an open source computational tool, MQuad, that accurately calls clonally informative mtDNA variants in a population of single cells, and an analysis suite for complete clonality inference, based on single cell RNA, DNA or ATAC sequencing data. Through a variety of simulated and experimental single cell sequencing data, we showed that MQuad can identify mitochondrial variants with both high sensitivity and specificity, outperforming existing methods by a large margin. Furthermore, we demonstrate its wide applicability in different single cell sequencing protocols, particularly in complementing single-nucleotide and copy-number variations to extract finer clonal resolution. Mitochondrial variants are informative endogenous barcodes for clonal substructure. Here, the authors developed a computational method, MQuad, to effectively detect these clonally informative mtDNA variants from single-cell RNA, DNA or ATAC sequencing data.
|
23
|
Long EM, Bradbury PJ, Romay MC, Buckler ES, Robbins KR. Genome-wide Imputation Using the Practical Haplotype Graph in the Heterozygous Crop Cassava. G3-GENES GENOMES GENETICS 2021; 12:6423990. [PMID: 34751380 PMCID: PMC8728015 DOI: 10.1093/g3journal/jkab383] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/09/2021] [Accepted: 10/14/2021] [Indexed: 11/13/2022]
Abstract
Genomic applications such as genomic selection and genome-wide association have become increasingly common since the advent of genome sequencing. The cost of sequencing has decreased in the past two decades; however, genotyping costs are still prohibitive to gathering large datasets for these genomic applications, especially in nonmodel species where resources are less abundant. Genotype imputation makes it possible to infer whole-genome information from limited input data, making large sampling for genomic applications more feasible. Imputation becomes increasingly difficult in heterozygous species where haplotypes must be phased. The practical haplotype graph (PHG) is a recently developed tool that can accurately impute genotypes, using a reference panel of haplotypes. We showcase the ability of the PHG to impute genomic information in the highly heterozygous crop cassava (Manihot esculenta). Accurately phased haplotypes were sampled from runs of homozygosity across a diverse panel of individuals to populate PHG, which proved more accurate than relying on computational phasing methods. The PHG achieved high imputation accuracy, using sparse skim-sequencing input, which translated to substantial genomic prediction accuracy in cross-validation testing. The PHG showed improved imputation accuracy, compared to a standard imputation tool Beagle, especially in predicting rare alleles.
Affiliation(s)
- Evan M Long
- Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
| | - Peter J Bradbury
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA.,United States Department of Agriculture-Agricultural Research Service, Robert W. Holley, Center for Agriculture and Health, Ithaca, NY 14853, USA
| | - M Cinta Romay
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA
| | - Edward S Buckler
- Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA.,Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA.,United States Department of Agriculture-Agricultural Research Service, Robert W. Holley, Center for Agriculture and Health, Ithaca, NY 14853, USA
| | - Kelly R Robbins
- Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
| |
|
24
|
Ahmed AE, Allen JM, Bhat T, Burra P, Fliege CE, Hart SN, Heldenbrand JR, Hudson ME, Istanto DD, Kalmbach MT, Kapraun GD, Kendig KI, Kendzior MC, Klee EW, Mattson N, Ross CA, Sharif SM, Venkatakrishnan R, Fadlelmola FM, Mainzer LS. Design considerations for workflow management systems use in production genomics research and the clinic. Sci Rep 2021; 11:21680. [PMID: 34737383 PMCID: PMC8569008 DOI: 10.1038/s41598-021-99288-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/13/2021] [Accepted: 09/15/2021] [Indexed: 01/22/2023]
Abstract
The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article seeks to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.
Affiliation(s)
- Azza E Ahmed
- Faculty of Science, Center for Bioinformatics and Systems Biology, University of Khartoum, 11111, Khartoum, Sudan.
- Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, 11111, Khartoum, Sudan.
- Bernoulli Institute, University of Groningen, 9747 AG, Groningen, The Netherlands.
| | - Joshua M Allen
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Tajesvi Bhat
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Prakruthi Burra
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| | - Christina E Fliege
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Steven N Hart
- Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
| | - Jacob R Heldenbrand
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Matthew E Hudson
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Dave Deandre Istanto
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Michael T Kalmbach
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
| | - Gregory D Kapraun
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
| | - Katherine I Kendig
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Matthew Charles Kendzior
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
| | - Eric W Klee
- Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
| | - Nate Mattson
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
| | - Christian A Ross
- Laboratory Pathology and Extramural Applications, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
| | - Sami M Sharif
- Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, 11111, Khartoum, Sudan
| | - Ramshankar Venkatakrishnan
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Faisal M Fadlelmola
- Faculty of Science, Center for Bioinformatics and Systems Biology, University of Khartoum, 11111, Khartoum, Sudan
| | - Liudmila S Mainzer
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| |
|
25
|
Stephens Z, Milosevic D, Kipp B, Grebe S, Iyer RK, Kocher JPA. PB-Motif-A Method for Identifying Gene/Pseudogene Rearrangements With Long Reads: An Application to CYP21A2 Genotyping. Front Genet 2021; 12:716586. [PMID: 34394200 PMCID: PMC8355628 DOI: 10.3389/fgene.2021.716586] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/28/2021] [Accepted: 07/05/2021] [Indexed: 12/30/2022]
Abstract
Long read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy with simulated sequence rearrangements of PMS2 and its pseudogene PMS2CL using simulated reads sweeping over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with clinical diagnosis obtained from multiplex ligation-dependent amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.
Affiliation(s)
- Zachary Stephens
- Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL, United States
| | | | | | | | - Ravishankar K Iyer
- Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL, United States
| | | |
|
26
|
Desai S, Rane A, Joshi A, Dutt A. IPD 2.0: To derive insights from an evolving SARS-CoV-2 genome. BMC Bioinformatics 2021; 22:247. [PMID: 33985433 PMCID: PMC8118100 DOI: 10.1186/s12859-021-04172-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/05/2021] [Accepted: 05/05/2021] [Indexed: 02/08/2023]
Abstract
BACKGROUND Rapid analysis of SARS-CoV-2 genomic data plays a crucial role in surveillance and in adopting measures to control the spread of Covid-19. Fast, inclusive and adaptive methods are required for the heterogeneous SARS-CoV-2 sequence data generated at an unprecedented rate. RESULTS We present an updated version of the SARS-CoV-2 analysis module of our automated computational pipeline, Infectious Pathogen Detector (IPD) 2.0, to perform genomic analysis to understand the variability and dynamics of the virus. It adopts the recent clade nomenclature and achieves a clade prediction accuracy of 92.8%. IPD 2.0 also contains a SARS-CoV-2 updater module, allowing automatic upgrading of the variant database using genome sequences from GISAID. As a proof of principle, analyzing 208,911 SARS-CoV-2 genome sequences, we generate an extensive database of 2.58 million sample-wise variants. A comparative account of lineage-specific mutations in the newer SARS-CoV-2 strains emerging in the UK, South Africa and Brazil and data reported from India identifies overlapping and lineage-specific acquired mutations, suggesting repetitive convergent and adaptive evolution. CONCLUSIONS A novel and dynamic feature of the SARS-CoV-2 module of IPD 2.0 makes it a contemporary tool to analyze the diverse and growing genomic strains of the virus and serve as a vital tool to help facilitate rapid genomic surveillance in a population to identify variants involved in breakthrough infections. IPD 2.0 is freely available from http://www.actrec.gov.in/pi-webpages/AmitDutt/IPD/IPD.html and the web-application is available at http://ipd.actrec.gov.in/ipdweb/ .
Affiliation(s)
- Sanket Desai
- Integrated Cancer Genomics Laboratory, Advanced Centre for Treatment, Research, and Education in Cancer, Tata Memorial Centre, Kharghar, Navi Mumbai, Maharashtra, 410210, India
- Homi Bhabha National Institute, Training School Complex, Anushakti Nagar, Mumbai, Maharashtra, 400094, India
| | - Aishwarya Rane
- Integrated Cancer Genomics Laboratory, Advanced Centre for Treatment, Research, and Education in Cancer, Tata Memorial Centre, Kharghar, Navi Mumbai, Maharashtra, 410210, India
| | - Asim Joshi
- Integrated Cancer Genomics Laboratory, Advanced Centre for Treatment, Research, and Education in Cancer, Tata Memorial Centre, Kharghar, Navi Mumbai, Maharashtra, 410210, India
- Homi Bhabha National Institute, Training School Complex, Anushakti Nagar, Mumbai, Maharashtra, 400094, India
| | - Amit Dutt
- Integrated Cancer Genomics Laboratory, Advanced Centre for Treatment, Research, and Education in Cancer, Tata Memorial Centre, Kharghar, Navi Mumbai, Maharashtra, 410210, India.
- Homi Bhabha National Institute, Training School Complex, Anushakti Nagar, Mumbai, Maharashtra, 400094, India.
- Adjunct Faculty, Institute of Advanced Virology, Kerala State Council for Science, Technology and Environment, Govt. of Kerala, Thonnakkal, Kerala, 695317, India.
| |
|
27
|
Kısakol B, Sarıhan Ş, Ergün MA, Baysan M. Detailed evaluation of cancer sequencing pipelines in different microenvironments and heterogeneity levels. Turk J Biol 2021; 45:114-126. [PMID: 33907494 PMCID: PMC8068765 DOI: 10.3906/biy-2008-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2020] [Accepted: 02/03/2021] [Indexed: 11/25/2022]
Abstract
The importance of next generation sequencing (NGS) in cancer research is rising as access to this key technology becomes easier for researchers. The sequence data created by NGS technologies must be processed by various bioinformatics algorithms within a pipeline in order to convert raw data to meaningful information. Mapping and variant calling are the two main steps of these analysis pipelines, and many algorithms are available for these steps. Therefore, detailed benchmarking of these algorithms in different scenarios is crucial for the efficient utilization of sequencing technologies. In this study, we compared the performance of twelve pipelines (three mapping and four variant discovery algorithms) with recommended settings to capture single nucleotide variants. We observed significant discrepancies in variant calls among the tested pipelines for different heterogeneity levels in real and simulated samples, with overall high specificity and low sensitivity. In addition to the individual evaluation of pipelines, we also constructed and tested the performance of pipeline combinations. In these analyses, we observed that certain pipelines complement each other much better than others and perform better than individual pipelines. This suggests that adhering to a single pipeline is not optimal for cancer sequencing analysis and sample heterogeneity should be considered in algorithm optimization.
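How combining pipelines can trade sensitivity against false calls can be sketched with set operations on variant calls. This is an illustrative assumption about how such combinations might be scored, not the study's evaluation code: variants are represented as hashable keys, the combination rules are plain union and intersection, and precision is used in place of specificity because specificity would also require a count of true negatives.

```python
def sensitivity_precision(calls, truth):
    """Return (sensitivity, precision) of a call set against a truth set."""
    tp = len(calls & truth)
    sensitivity = tp / len(truth) if truth else None
    precision = tp / len(calls) if calls else None
    return sensitivity, precision

def combine(call_sets, mode):
    """Combine several pipelines' call sets by 'union' or 'intersection'."""
    sets = [set(s) for s in call_sets]
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical variant identifiers, for illustration only.
pipeline_a = {1, 2, 3}
pipeline_b = {3, 4, 5}
truth = {1, 2, 3, 4}

union_scores = sensitivity_precision(combine([pipeline_a, pipeline_b], "union"), truth)
inter_scores = sensitivity_precision(combine([pipeline_a, pipeline_b], "intersection"), truth)
```

In this toy case the union recovers every true variant at the cost of one false call, while the intersection keeps only calls both pipelines agree on; which rule is preferable depends on the application.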
Affiliation(s)
- Batuhan Kısakol
- Department of Physiology and Medical Physics, Centre for Systems Medicine, Royal College of Surgeons in Ireland, Dublin, Ireland
| | - Şahin Sarıhan
- Computer Engineering Department, Faculty of Engineering, Marmara University, İstanbul, Turkey
| | - Mehmet Arif Ergün
- Computer Engineering Department, Faculty of Computer and Informatics Engineering, İstanbul Technical University, İstanbul, Turkey
| | - Mehmet Baysan
- Computer Engineering Department, Faculty of Computer and Informatics Engineering, İstanbul Technical University, İstanbul, Turkey
| |
|
28
|
Shah RN, Ruthenburg AJ. Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads. PLoS Comput Biol 2021; 17:e1008926. [PMID: 33872311 PMCID: PMC8084338 DOI: 10.1371/journal.pcbi.1008926] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/22/2020] [Revised: 04/29/2021] [Accepted: 03/30/2021] [Indexed: 11/18/2022]
Abstract
Next-generation sequencing (NGS) has transformed molecular biology and contributed to many seminal insights into genomic regulation and function. Apart from whole-genome sequencing, an NGS workflow involves alignment of the sequencing reads to the genome of study, after which the resulting alignments can be used for downstream analyses. However, alignment is complicated by repetitive sequences; many reads align to more than one genomic locus, with 15-30% of the genome not being uniquely mappable by short-read NGS. This problem is typically addressed by discarding reads that do not uniquely map to the genome, but this practice can lead to systematic distortion of the data. Previous studies that developed methods for handling ambiguously mapped reads were often of limited applicability or were computationally intensive, hindering their broader usage. In this work, we present SmartMap: an algorithm that augments industry-standard aligners to enable usage of ambiguously mapped reads by assigning weights to each alignment with Bayesian analysis of the read distribution and alignment quality. SmartMap is computationally efficient, utilizing far fewer weighting iterations than previously thought necessary to process alignments and, as such, analyzing more than a billion alignments of NGS reads in approximately one hour on a desktop PC. By applying SmartMap to peak-type NGS data, including MNase-seq, ChIP-seq, and ATAC-seq in three organisms, we can increase read depth by up to 53% and increase the mapped proportion of the genome by up to 18% compared to analyses utilizing only uniquely mapped reads. We further show that SmartMap enables the analysis of more than 140,000 repetitive elements that could not be analyzed by traditional ChIP-seq workflows, and we utilize this method to gain insight into the epigenetic regulation of different classes of repetitive elements. These data emphasize both the dangers of discarding ambiguously mapped reads and their power for driving biological discovery.
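The weighting idea described in this abstract can be illustrated with a minimal sketch. It is not SmartMap's actual implementation: the function name `reweight_alignments`, the fixed-size coverage bins, the pseudocount, and the update rule (weight proportional to alignment score times local weighted coverage) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reweight_alignments(reads, n_iter=5, bin_size=100, pseudocount=1.0):
    """Hypothetical sketch of iterative reweighting of multi-mapped reads.

    reads: list of reads; each read is a list of candidate alignments
           given as (chrom, pos, align_score) with align_score > 0.
    Returns one list of weights per read; each list sums to 1.
    """
    # Initial weights depend on alignment score alone.
    weights = []
    for alns in reads:
        total = sum(score for _, _, score in alns)
        weights.append([score / total for _, _, score in alns])

    for _ in range(n_iter):
        # Weighted read coverage per genomic bin under the current weights.
        coverage = defaultdict(float)
        for alns, ws in zip(reads, weights):
            for (chrom, pos, _), w in zip(alns, ws):
                coverage[(chrom, pos // bin_size)] += w

        # Assumed update: score (likelihood-like term) times smoothed local
        # coverage (prior-like term), renormalised per read.
        new_weights = []
        for alns in reads:
            raw = [
                score * (coverage[(chrom, pos // bin_size)] + pseudocount)
                for chrom, pos, score in alns
            ]
            norm = sum(raw)
            new_weights.append([r / norm for r in raw])
        weights = new_weights

    return weights


# Two uniquely mapped reads on chr1 and one read that maps equally well
# to chr1 and chr2; the ambiguous read shifts toward the covered locus.
example = [
    [("chr1", 10, 1.0)],
    [("chr1", 30, 1.0)],
    [("chr1", 20, 1.0), ("chr2", 5000, 1.0)],
]
print(reweight_alignments(example))
```

In this toy setup uniquely mapped reads always keep weight 1, while an ambiguous read is pulled toward whichever candidate locus has more supporting coverage.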
Affiliation(s)
- Rohan N. Shah
- Pritzker School of Medicine, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- Department of Molecular Biology and Cell Genetics, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
| | - Alexander J. Ruthenburg
- Department of Molecular Biology and Cell Genetics, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- Department of Biochemistry and Molecular Biology, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
| |
|
29
|
Schmeing S, Robinson MD. ReSeq simulates realistic Illumina high-throughput sequencing data. Genome Biol 2021; 22:67. [PMID: 33608040 PMCID: PMC7896392 DOI: 10.1186/s13059-021-02265-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Accepted: 01/07/2021] [Indexed: 12/18/2022] Open
Abstract
In high-throughput sequencing data, performance comparisons between computational tools are essential for making informed decisions at each step of a project. Simulations are a critical part of method comparisons, but for standard Illumina sequencing of genomic DNA, they are often oversimplified, which leads to optimistic results for most tools. ReSeq improves the authenticity of synthetic data by extracting and reproducing key components from real data. Major advancements are the inclusion of systematic errors, a fragment-based coverage model and sampling-matrix estimates based on two-dimensional margins. These improvements lead to more faithful performance evaluations. ReSeq is available at https://github.com/schmeing/ReSeq.
Affiliation(s)
- Stephan Schmeing
- Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland; SIB Swiss Institute of Bioinformatics, Winterthurerstrasse 190, Zurich, 8057, Switzerland.
| | - Mark D Robinson
- Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland; SIB Swiss Institute of Bioinformatics, Winterthurerstrasse 190, Zurich, 8057, Switzerland.
| |
|
30
|
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2021; 10:80. [PMID: 35847383 PMCID: PMC9243550 DOI: 10.12688/f1000research.39214.1] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 03/10/2022] [Indexed: 10/31/2024] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain "live" (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines' implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Affiliation(s)
| | - Marco Fabbri
- European Commission Joint Research Centre, Ispra, Italy
| | | | | | - Guy Van den Eede
- European Commission Joint Research Centre, Ispra, Italy
- European Commission Joint Research Centre, Geel, Belgium
| | - Erik Alm
- The European Centre for Disease Prevention and Control, Stockholm, Sweden
| | - Derya Aytan-Aktug
- National Food Institute, Technical University of Denmark, Lyngby, Denmark
| | | | - Catherine Carrillo
- Ottawa Laboratory – Carling, Canadian Food Inspection Agency, Ottawa, Ontario, Canada
| | | | - Kok-Gan Chan
- International Genome Centre, Jiangsu University, Zhenjiang, China
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
| | - Teresa Coque
- Servicio de Microbiología, Hospital Universitario Ramón y Cajal, Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), Madrid, Spain
- Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Carlos III Health Institute, Madrid, Spain
| | | | - Ivo Gut
- Centro Nacional de Análisis Genómico, Centre for Genomic Regulation (CNAG-CRG), Barcelona Institute of Technology, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - Paul Hammer
- BIOMES. NGS GmbH c/o Technische Hochschule Wildau, Wildau, Germany
| | - Gemma L. Kay
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
| | - Jean-Yves Madec
- Unité Antibiorésistance et Virulence Bactériennes, ANSES Site de Lyon, Lyon, France
| | - Alison E. Mather
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
- University of East Anglia, Norwich, UK
| | | | - Thierry Naas
- French-NRC for CPEs, Service de Bactériologie-Hygiène, Hôpital de Bicêtre, Le Kremlin-Bicêtre, France
| | | | - Silke Peter
- Institute of Medical Microbiology and Hygiene, University of Tübingen, Tübingen, Germany
| | - Arthur Pightling
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, USA
| | | | - John Rossen
- Department of Medical Microbiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | | | - Robert Schlaberg
- Department of Pathology, University of Utah, Salt Lake City, UT, USA
| | - Kevin Vanneste
- Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
| | - Lukas M. Weber
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Present address: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | | | | |
|
31
|
Alosaimi S, Bandiang A, van Biljon N, Awany D, Thami PK, Tchamga MSS, Kiran A, Messaoud O, Hassan RIM, Mugo J, Ahmed A, Bope CD, Allali I, Mazandu GK, Mulder NJ, Chimusa ER. A broad survey of DNA sequence data simulation tools. Brief Funct Genomics 2020; 19:49-59. [PMID: 31867604 DOI: 10.1093/bfgp/elz033] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2019] [Revised: 10/27/2019] [Accepted: 11/04/2019] [Indexed: 11/12/2022] Open
Abstract
In silico DNA sequence generation is a powerful technology to evaluate and validate bioinformatics tools, and accordingly more than 35 DNA sequence simulation tools have been developed. With such a diverse array of tools to choose from, an important question is: Which tool should be used for a desired outcome? This question is largely unanswered as documentation for many of these DNA simulation tools is sparse. To address this, we performed a review of DNA sequence simulation tools developed to date and evaluated 20 state-of-the-art DNA sequence simulation tools on their ability to produce accurate reads based on their implemented sequence error model. We provide a succinct description of each tool and suggest which tool is most appropriate for different scenarios. Given the multitude of similar yet non-identical tools, researchers can use this review as a guide to inform their choice of DNA sequence simulation tool. This paves the way towards assessing existing tools in a unified framework, as well as enabling different simulation scenario analysis within the same framework.
Affiliation(s)
- Shatha Alosaimi
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Armand Bandiang
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Noelle van Biljon
- Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Denis Awany
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Prisca K Thami
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
| | - Milaine S S Tchamga
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Anmol Kiran
- Malawi-Liverpool-Wellcome Trust Clinical Research Programme, Blantyre, Malawi; Edinburgh University, Edinburgh, UK
| | - Olfa Messaoud
- Université de Tunis El Manar, Institut Pasteur de Tunis, LR16IPT05 Génomique Biomédicale et Oncogénétique, Tunis, 1002, Tunisia
| | - Radia Ismaeel Mohammed Hassan
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Jacquiline Mugo
- Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Azza Ahmed
- Centre for Bioinformatics and Systems Biology, Faculty of Science, University of Khartoum, Sudan
| | - Christian D Bope
- Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Imane Allali
- Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Gaston K Mazandu
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; African Institute for Mathematical Sciences (AIMS), Cape Town, South Africa
| | - Nicola J Mulder
- Computational Biology Division, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Emile R Chimusa
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| |
|
32
|
|
33
|
Zhao S, Agafonov O, Azab A, Stokowy T, Hovig E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep 2020; 10:20222. [PMID: 33214604 PMCID: PMC7678823 DOI: 10.1038/s41598-020-77218-4] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2020] [Accepted: 11/02/2020] [Indexed: 12/30/2022] Open
Abstract
Advances in next-generation sequencing technology have enabled whole genome sequencing (WGS) to be widely used for identification of causal variants in a spectrum of genetic-related disorders, and provided new insight into how genetic polymorphisms affect disease phenotypes. The development of different bioinformatics pipelines has continuously improved the variant analysis of WGS data. However, there is a necessity for a systematic performance comparison of these pipelines to provide guidance on the application of WGS-based scientific and clinical genomics. In this study, we evaluated the performance of three variant calling pipelines (GATK, DRAGEN and DeepVariant) using the Genome in a Bottle Consortium, "synthetic-diploid" and simulated WGS datasets. DRAGEN and DeepVariant show better accuracy in SNP and indel calling, with no significant differences in their F1-score. DRAGEN platform offers accuracy, flexibility and a highly-efficient execution speed, and therefore superior performance in the analysis of WGS data on a large scale. The combination of DRAGEN and DeepVariant also suggests a good balance of accuracy and efficiency as an alternative solution for germline variant detection in further applications. Our results facilitate the standardization of benchmarking analysis of bioinformatics pipelines for reliable variant detection, which is critical in genetics-based medical research and clinical applications.
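For readers unfamiliar with the F1-score used to compare the pipelines, the following minimal sketch shows how it is computed from true-positive, false-positive and false-negative variant counts. The function name and the example counts are hypothetical and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from variant-call counts.

    tp: calls that match the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only.
print(precision_recall_f1(tp=90, fp=10, fn=30))
```

Because F1 is a harmonic mean, a caller cannot score well by maximising sensitivity at the cost of many false calls, or the reverse.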
Affiliation(s)
- Sen Zhao
- Department of Tumor Biology, Institute of Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, 0310, Oslo, Norway
| | | | - Abdulrahman Azab
- Center for Bioinformatics, Department of Informatics, University of Oslo, 0316, Oslo, Norway
- Division of Research Computing, University Center for Information Technology (USIT), University of Oslo, 0316, Oslo, Norway
| | - Tomasz Stokowy
- Computational Biology Unit, Institute of Informatics, University of Bergen, 5008, Bergen, Norway
- Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
| | - Eivind Hovig
- Department of Tumor Biology, Institute of Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, 0310, Oslo, Norway.
- Center for Bioinformatics, Department of Informatics, University of Oslo, 0316, Oslo, Norway.
| |
|
34
|
Nam K, Nhim S, Robin S, Bretaudeau A, Nègre N, d'Alençon E. Positive selection alone is sufficient for whole genome differentiation at the early stage of speciation process in the fall armyworm. BMC Evol Biol 2020; 20:152. [PMID: 33187468 PMCID: PMC7663868 DOI: 10.1186/s12862-020-01715-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2020] [Accepted: 10/28/2020] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND The process of speciation involves differentiation of whole genome sequences between a pair of diverging taxa. In the absence of a geographic barrier and in the presence of gene flow, genomic differentiation may occur when the homogenizing effect of recombination is overcome across the whole genome. The fall armyworm is observed as two sympatric strains with different host-plant preferences across the entire habitat. These two strains exhibit a very low level of genetic differentiation across the whole genome, suggesting that genomic differentiation occurred at an early stage of speciation. In this study, we aim at identifying critical evolutionary forces responsible for genomic differentiation in the fall armyworm. RESULTS These two strains exhibit a low level of genomic differentiation (FST = 0.0174), while 99.2% of 200 kb windows have genetically differentiated sequences (FST > 0). We found that the combined effect of mild positive selection and genetic linkage to selectively targeted loci is responsible for the genomic differentiation. However, a single event of very strong positive selection appears not to be responsible for genomic differentiation. The contribution of chromosomal inversions or tight genetic linkage among positively selected loci causing reproductive barriers is not supported by our data. Phylogenetic analysis shows that the genomic differentiation occurred by sub-setting of genetic variants in one strain from the other. CONCLUSIONS From these results, we concluded that genomic differentiation may occur at the early stage of a speciation process in the fall armyworm and that mild positive selection targeting many loci alone is a sufficient evolutionary force for generating the pattern of genomic differentiation. This genomic differentiation may provide a condition for accelerated genomic differentiation by synergistic effects among linkage disequilibrium generated by following events of positive selection. Our study highlights genomic differentiation as a key evolutionary factor connecting positive selection to divergent selection.
Affiliation(s)
- Kiwoong Nam
- DGIMI, Univ Montpellier, INRAE, Montpellier, France.
| | - Sandra Nhim
- DGIMI, Univ Montpellier, INRAE, Montpellier, France
| | - Stéphanie Robin
- INRAE, UMR-IGEPP, BioInformatics Platform for Agroecosystems Arthropods, Campus Beaulieu, Rennes, France
- INRIA, IRISA, GenOuest Core Facility, Campus de Beaulieu, Rennes, France
| | - Anthony Bretaudeau
- INRAE, UMR-IGEPP, BioInformatics Platform for Agroecosystems Arthropods, Campus Beaulieu, Rennes, France
- INRIA, IRISA, GenOuest Core Facility, Campus de Beaulieu, Rennes, France
| | | | | |
|
35
|
Yu Z, Du F, Ban R, Zhang Y. SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles. BMC Bioinformatics 2020; 21:331. [PMID: 32703148 PMCID: PMC7379788 DOI: 10.1186/s12859-020-03665-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2018] [Accepted: 07/16/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. In particular, positional and genomic contextual information is not effectively utilized for reliably characterizing base substitution patterns, and the positional and contextual differences in Phred quality scores have not been fully investigated. Thus, a more effective and efficient bioinformatics tool is sorely required. RESULTS Here, we introduce a novel tool, SimuSCoP, to reliably emulate complex DNA sequencing data. The base substitution patterns and the statistical behavior of quality scores in Illumina sequencing data are fully explored and integrated into the simulation model for reliably emulating datasets for different applications. In addition, an integrated and easy-to-use pipeline is employed in SimuSCoP to facilitate end-to-end simulation of complex samples, and high runtime efficiency is achieved by implementing the tool to run in multithreading with low memory consumption. These features enable SimuSCoP to achieve substantial improvements in reliability, functionality, practicality and runtime efficiency. The tool is comprehensively evaluated in multiple aspects including consistency of profiles, simulation of genomic variations and complex tumor samples, and the results demonstrate the advantages of SimuSCoP over existing tools. CONCLUSIONS SimuSCoP, a new bioinformatics tool, is developed to learn informative profiles from real sequencing data and reliably mimic complex data by introducing various genomic variations. We believe that the presented work will catalyse new development of downstream bioinformatics methods for analyzing sequencing data.
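As background for the Phred quality scores mentioned in this abstract, the sketch below shows the standard relationship Q = -10 log10(p) between a quality score Q and the base-call error probability p, together with decoding of a FASTQ quality string. The helper names are hypothetical, the Phred+33 offset is an assumption that holds for current Illumina FASTQ files, and none of this reflects SimuSCoP's internal model.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by Phred quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score for error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 assumes Phred+33 (Sanger / Illumina 1.8+) encoding.
    """
    return [ord(ch) - offset for ch in qual]


scores = decode_quality_string("I5!")
print(scores, [phred_to_error_prob(q) for q in scores])
```

A simulator that models quality scores by read position and sequence context would sample values such as these per base; the conversion above only fixes their meaning.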
Affiliation(s)
- Zhenhua Yu
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China.
| | - Fang Du
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China
| | - Rongjun Ban
- Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
| | - Yuanwei Zhang
- Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
| |
|
36
|
Noguera-Julian M, Lee ER, Shafer RW, Kantor R, Ji H. Dry Panels Supporting External Quality Assessment Programs for Next Generation Sequencing-Based HIV Drug Resistance Testing. Viruses 2020; 12:v12060666. [PMID: 32575676 PMCID: PMC7354622 DOI: 10.3390/v12060666] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2020] [Revised: 06/18/2020] [Accepted: 06/18/2020] [Indexed: 12/18/2022] Open
Abstract
External quality assessment (EQA) is a keystone element in the validation and implementation of next generation sequencing (NGS)-based HIV drug resistance testing (DRT). Software validation and evaluation is a critical element in NGS EQA programs. While the development, sharing, and adoption of wet lab protocols is coupled with the increasing access to NGS technology worldwide, rendering it easy to produce NGS data for HIV-DRT, bioinformatic data analysis remains a bottleneck for most of the diagnostic laboratories. Several computational tools have been made available, via free or commercial sources, to automate the conversion of raw NGS data into an actionable clinical report. Although different software platforms yield equivalent results when identical raw NGS datasets are analyzed for variations at higher abundance, discrepancies arise when variations at lower frequencies are considered. This implies that validation and performance assessment of the bioinformatics tools applied in NGS HIV-DRT is critical, and the origins of the observed discrepancies should be determined. Well-characterized reference NGS datasets with ground truth on the genotype composition at all examined loci and the exact frequencies of HIV variations they may harbor, so-called dry panels, would be essential in such cases. The strategic design and construction of such panels are challenging but imperative tasks in support of EQA programs for NGS-based HIV-DRT and the validation of relevant bioinformatics tools. Here, we present criteria that can guide the design of such dry panels, which were discussed in the Second International Winnipeg Symposium themed for EQA strategies for NGS HIVDR assays.
Affiliation(s)
- Marc Noguera-Julian
- IrsiCaixa AIDS Research Institute, Hospital Germans Trias i Pujol, s/n, Catalonia, 08196 Badalona, Spain
- Chair in AIDS and Related Illnesses, Centre for Health and Social Care Research (CESS), Faculty of Medicine, University of Vic, Central University of Catalonia, Can Baumann. Ctra. de Roda, 70, 08500 Vic, Spain
| | - Emma R. Lee
- National HIV and Retrovirology Laboratories, National Microbiology Laboratory at JC Wilt Infectious Diseases Research Centre, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
| | | | - Rami Kantor
- Division of Infectious Diseases, Brown University Alpert Medical School, Providence, RI 02903, USA
| | - Hezhao Ji
- National HIV and Retrovirology Laboratories, National Microbiology Laboratory at JC Wilt Infectious Diseases Research Centre, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
- Department of Medical Microbiology and Infectious Diseases, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB R3E 0J9, Canada
| |
|
37
|
Delhomme TM, Avogbe PH, Gabriel AAG, Alcala N, Leblay N, Voegele C, Vallée M, Chopard P, Chabrier A, Abedi-Ardekani B, Gaborieau V, Holcatova I, Janout V, Foretová L, Milosavljevic S, Zaridze D, Mukeriya A, Brambilla E, Brennan P, Scelo G, Fernandez-Cuesta L, Byrnes G, Calvez-Kelm FL, McKay JD, Foll M. Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data. NAR Genom Bioinform 2020; 2:lqaa021. [PMID: 32363341 PMCID: PMC7182099 DOI: 10.1093/nargab/lqaa021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2019] [Revised: 01/28/2020] [Accepted: 04/16/2020] [Indexed: 12/22/2022] Open
Abstract
The emergence of next-generation sequencing (NGS) has revolutionized the way of reaching a genome sequence, with the promise of potentially providing a comprehensive characterization of DNA variations. Nevertheless, detecting somatic mutations is still a difficult problem, in particular when trying to identify low abundance mutations, such as subclonal mutations, tumour-derived alterations in body fluids or somatic mutations from histological normal tissue. The main challenge is to precisely distinguish between sequencing artefacts and true mutations, particularly when the latter are so rare they reach similar abundance levels as artefacts. Here, we present needlestack, a highly sensitive variant caller, which directly learns from the data the level of systematic sequencing errors to accurately call mutations. Needlestack is based on the idea that the sequencing error rate can be dynamically estimated from analysing multiple samples together. We show that the sequencing error rate varies across alterations, illustrating the need to precisely estimate it. We evaluate the performance of needlestack for various types of variations, and we show that needlestack is robust among positions and outperforms existing state-of-the-art methods for low abundance mutations. Needlestack, along with its source code, is freely available on the GitHub platform: https://github.com/IARCbioinfo/needlestack.
Affiliation(s)
- Tiffany M Delhomme
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Patrice H Avogbe
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Aurélie A G Gabriel
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Nicolas Alcala
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Noemie Leblay
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Catherine Voegele
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Maxime Vallée
- Genetic Epidemiology Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Priscilia Chopard
- Genetic Epidemiology Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Amélie Chabrier
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Behnoush Abedi-Ardekani
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Valérie Gaborieau
- Genetic Epidemiology Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Ivana Holcatova
- Institute of Hygiene and Epidemiology, Charles University, 1st Faculty of Medicine, 116 36 Prague, Czech Republic
| | - Vladimir Janout
- Faculty of Health Sciences, Palacky University, 775 15 Olomouc, Czech Republic
| | - Lenka Foretová
- Department of Cancer Epidemiology and Genetics, Masaryk Memorial Cancer Institute, 656 53 Brno, Czech Republic
| | - Sasa Milosavljevic
- International Organization for Cancer Prevention and Research (IOCPR), 11070 Belgrade, Serbia
| | - David Zaridze
- Russian N.N. Blokhin Cancer Research Centre, 115478 Moscow, The Russian Federation
| | - Anush Mukeriya
- Russian N.N. Blokhin Cancer Research Centre, 115478 Moscow, The Russian Federation
| | - Elisabeth Brambilla
- Centre Hospitalier Universitaire de Grenoble Département d’Anatomie et Cytologie Pathologiques, CS 10217 38043 Grenoble, France
| | - Paul Brennan
- Genetic Epidemiology Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Ghislaine Scelo
- Genetic Epidemiology Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Lynnette Fernandez-Cuesta
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Graham Byrnes
- Section of Environment and Radiation, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Florence L Calvez-Kelm
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - James D McKay
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
| | - Matthieu Foll
- Genetic Cancer Susceptibility Group, Section of Genetics, International Agency for Research on Cancer (IARC-WHO), 150 cours Albert Thomas, 69008 Lyon, France
|
38
|
Giguere C, Dubey HV, Sarsani VK, Saddiki H, He S, Flaherty P. SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data. BMC Bioinformatics 2020; 21:215. [PMID: 32456609 PMCID: PMC7249349 DOI: 10.1186/s12859-020-03550-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/03/2020] [Accepted: 05/18/2020] [Indexed: 11/21/2022]
Abstract
Background: Recently, it has become possible to collect next-generation DNA sequencing data sets that are composed of multiple samples from multiple biological units where each of these samples may be from a single cell or bulk tissue. However, no tool yet exists for simulating DNA sequencing data from such a nested sampling arrangement with single-cell and bulk samples so that developers of analysis methods can assess accuracy and precision. Results: We have developed a tool that simulates DNA sequencing data from hierarchically grouped (correlated) samples where each sample is designated bulk or single-cell. Our tool uses a simple configuration file to define the experimental arrangement and can be integrated into software pipelines for testing of variant callers or other genomic tools. Conclusions: The DNA sequencing data generated by our simulator is representative of real data and integrates seamlessly with standard downstream analysis tools.
Affiliation(s)
- Collin Giguere
- Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
| | - Harsh Vardhan Dubey
- Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
| | - Vishal Kumar Sarsani
- Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
| | - Hachem Saddiki
- School of Public Health, University of Massachusetts Amherst, Amherst, 01003, USA
| | - Shai He
- Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
| | - Patrick Flaherty
- Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA.
|
39
|
Tattini L, Tellini N, Mozzachiodi S, D'Angiolo M, Loeillet S, Nicolas A, Liti G. Accurate Tracking of the Mutational Landscape of Diploid Hybrid Genomes. Mol Biol Evol 2020; 36:2861-2877. [PMID: 31397846 PMCID: PMC6878955 DOI: 10.1093/molbev/msz177] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 12/22/2022]
Abstract
Mutations, recombinations, and genome duplications may promote genetic diversity and trigger evolutionary processes. However, quantifying these events in diploid hybrid genomes is challenging. Here, we present an integrated experimental and computational workflow to accurately track the mutational landscape of yeast diploid hybrids (MuLoYDH) in terms of single-nucleotide variants, small insertions/deletions, copy-number variants, aneuploidies, and loss-of-heterozygosity. Pairs of haploid Saccharomyces parents were combined to generate ancestor hybrids with phased genomes and varying levels of heterozygosity. These diploids were evolved under different laboratory protocols, in particular mutation accumulation experiments. Variant simulations enabled the efficient integration of competitive and standard mapping of short reads, depending on local levels of heterozygosity. Experimental validations proved the high accuracy and resolution of our computational approach. Finally, applying MuLoYDH to four different diploids revealed striking genetic background effects. Homozygous Saccharomyces cerevisiae showed a ∼4-fold higher mutation rate compared with its closely related species S. paradoxus. Intraspecies hybrids unveiled that a substantial fraction of the genome (∼250 bp per generation) was shaped by loss-of-heterozygosity, a process strongly inhibited in interspecies hybrids by high levels of sequence divergence between homologous chromosomes. In contrast, interspecies hybrids exhibited higher single-nucleotide mutation rates compared with intraspecies hybrids. MuLoYDH provided an unprecedented quantitative insight into the evolutionary processes that mold diploid yeast genomes and can be generalized to other genetic systems.
Affiliation(s)
- Lorenzo Tattini
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
| | - Nicolò Tellini
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
| | | | | | - Sophie Loeillet
- CNRS UMR3244, Institut Curie, PSL Research University, Paris, France
| | - Alain Nicolas
- CNRS UMR3244, Institut Curie, PSL Research University, Paris, France
| | - Gianni Liti
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
|
40
|
Jandrasits C, Kröger S, Haas W, Renard BY. Computational pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters. PLoS Comput Biol 2019; 15:e1007527. [PMID: 31815935 PMCID: PMC6922483 DOI: 10.1371/journal.pcbi.1007527] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/01/2019] [Revised: 12/19/2019] [Accepted: 11/03/2019] [Indexed: 12/30/2022]
Abstract
Next-generation sequencing based base-by-base distance measures have become an integral complement to epidemiological investigation of infectious disease outbreaks. This study introduces PANPASCO, a computational pan-genome mapping based, pairwise distance method that is highly sensitive to differences between cases, even when located in regions of lineage specific reference genomes. We show that our approach is superior to previously published methods in several datasets and across different Mycobacterium tuberculosis lineages, as its characteristics allow the comparison of a high number of diverse samples in one analysis, a scenario that becomes more and more likely with the increased usage of whole-genome sequencing in transmission surveillance.

Tuberculosis is still a threat to global health. It is essential to detect and interrupt transmissions to stop the spread of this infectious disease. With the rising use of next-generation sequencing methods, its application in the surveillance of Mycobacterium tuberculosis has become increasingly important in recent years. The main goal of molecular surveillance is the identification of patient-patient transmission and cluster detection. The mutation rate of M. tuberculosis is very low and stable. Therefore, many existing methods for comparative analysis of isolates provide inadequate results since their resolution is too limited. There is a need for a method that takes every detectable difference into account. We developed PANPASCO, a novel approach for comparing pairs of isolates using all genomic information available for each pair. We combine improved SNP-distance calculation with the use of a pan-genome incorporating more than 100 M. tuberculosis reference genomes representing lineages 1-4 for read mapping prior to variant detection. We thereby enable the collective analysis and comparison of similar and diverse isolates associated with different M. tuberculosis strains.
Affiliation(s)
| | - Stefan Kröger
- Respiratory Infections Unit, Robert Koch Institute, Berlin, Germany
| | - Walter Haas
- Respiratory Infections Unit, Robert Koch Institute, Berlin, Germany
|
41
|
Kendig KI, Baheti S, Bockol MA, Drucker TM, Hart SN, Heldenbrand JR, Hernaez M, Hudson ME, Kalmbach MT, Klee EW, Mattson NR, Ross CA, Taschuk M, Wieben ED, Wiepert M, Wildman DE, Mainzer LS. Sentieon DNASeq Variant Calling Workflow Demonstrates Strong Computational Performance and Accuracy. Front Genet 2019; 10:736. [PMID: 31481971 PMCID: PMC6710408 DOI: 10.3389/fgene.2019.00736] [Citation(s) in RCA: 113] [Impact Index Per Article: 22.6] [Received: 05/09/2019] [Accepted: 07/12/2019] [Indexed: 12/22/2022]
Abstract
As reliable, efficient genome sequencing becomes ubiquitous, the need for similarly reliable and efficient variant calling becomes increasingly important. The Genome Analysis Toolkit (GATK), maintained by the Broad Institute, is currently the widely accepted standard for variant calling software. However, alternative solutions may provide faster variant calling without sacrificing accuracy. One such alternative is Sentieon DNASeq, a toolkit analogous to GATK but built on a highly optimized backend. We conducted an independent evaluation of the DNASeq single-sample variant calling pipeline in comparison to that of GATK. Our results support the near-identical accuracy of the two software packages, showcase optimal scalability and great speed from Sentieon, and describe computational performance considerations for the deployment of DNASeq.
Affiliation(s)
- Katherine I Kendig
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Saurabh Baheti
- Department of Research Services, Mayo Clinic, Rochester, MN, United States
| | - Matthew A Bockol
- Department of Executive IT Administration, Mayo Clinic, Rochester, MN, United States
| | - Travis M Drucker
- Department of Executive IT Administration, Mayo Clinic, Rochester, MN, United States
| | - Steven N Hart
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Jacob R Heldenbrand
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Matthew E Hudson
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States; Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Michael T Kalmbach
- Department of Executive IT Administration, Mayo Clinic, Rochester, MN, United States
| | - Eric W Klee
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Nathan R Mattson
- Department of Executive IT Administration, Mayo Clinic, Rochester, MN, United States
| | - Christian A Ross
- Department of Executive IT Administration, Mayo Clinic, Rochester, MN, United States
| | - Morgan Taschuk
- Genome Sequence Informatics, Ontario Institute for Cancer Research, Toronto ON, Canada
| | - Eric D Wieben
- Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, United States
| | - Mathieu Wiepert
- Department of Executive IT Administration, Mayo Clinic, Rochester, MN, United States
| | - Derek E Wildman
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States; Department of Molecular and Integrative Physiology, Mayo Clinic, Rochester, MN, United States
| | - Liudmila S Mainzer
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
|
42
|
Ahmed AE, Heldenbrand J, Asmann Y, Fadlelmola FM, Katz DS, Kendig K, Kendzior MC, Li T, Ren Y, Rodriguez E, Weber MR, Wozniak JM, Zermeno J, Mainzer LS. Managing genomic variant calling workflows with Swift/T. PLoS One 2019; 14:e0211608. [PMID: 31287816 PMCID: PMC6615596 DOI: 10.1371/journal.pone.0211608] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/16/2019] [Accepted: 06/08/2019] [Indexed: 12/30/2022]
Abstract
Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet a “best” workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at-scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures.
The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.
Affiliation(s)
- Azza E. Ahmed
- Centre for Bioinformatics & Systems Biology, Faculty of Science, University of Khartoum, Khartoum, Sudan
- Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, Khartoum, Sudan
| | - Jacob Heldenbrand
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Yan Asmann
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, Florida, United States of America
| | - Faisal M. Fadlelmola
- Centre for Bioinformatics & Systems Biology, Faculty of Science, University of Khartoum, Khartoum, Sudan
| | - Daniel S. Katz
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Katherine Kendig
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Matthew C. Kendzior
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Tiffany Li
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Yingxue Ren
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, Florida, United States of America
| | - Elliott Rodriguez
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Matthew R. Weber
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Justin M. Wozniak
- Argonne National Laboratory, Argonne, Illinois, United States of America
| | - Jennie Zermeno
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
| | - Liudmila S. Mainzer
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States of America
|
43
|
Stephens Z, Wang C, Iyer RK, Kocher JP. Detection and visualization of complex structural variants from long reads. BMC Bioinformatics 2018; 19:508. [PMID: 30577744 PMCID: PMC6302372 DOI: 10.1186/s12859-018-2539-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 11/17/2022]
Abstract
Background: With applications in cancer, drug metabolism, and disease etiology, understanding structural variation in the human genome is critical in advancing the thrusts of individualized medicine. However, structural variants (SVs) remain challenging to detect with high sensitivity using short read sequencing technologies. This problem is exacerbated when considering complex SVs comprised of multiple overlapping or nested rearrangements. Longer reads, such as those from Pacific Biosciences platforms, often span multiple breakpoints of such events, and thus provide a way to unravel small-scale complexities in SVs with higher confidence. Results: We present CORGi (COmplex Rearrangement detection with Graph-search), a method for the detection and visualization of complex local genomic rearrangements. This method leverages the ability of long reads to span multiple breakpoints to untangle SVs that appear very complicated with respect to a reference genome. We validated our approach against both simulated long reads, and real data from two long read sequencing technologies. We demonstrate the ability of our method to identify breakpoints inserted in synthetic data with high accuracy, and the ability to detect and plot SVs from NA12878 germline, achieving 88.4% concordance between the two sets of sequence data. The patterns of complexity we find in many NA12878 SVs match known mechanisms associated with DNA replication and structural variant formation, and highlight the ability of our method to automatically label complex SVs with an intuitive combination of adjacent or overlapping reference transformations. Conclusions: CORGi is a method for interrogating genomic regions suspected to contain local rearrangements using long reads. Using pairwise alignments and graph search CORGi produces labels and visualizations for local SVs of arbitrary complexity.
Affiliation(s)
- Zachary Stephens
- Coordinated Science Lab, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | | | - Ravishankar K Iyer
- Coordinated Science Lab, University of Illinois at Urbana-Champaign, Urbana, IL, USA
|