1
Batisti Biffignandi G, Bellinzona G, Petazzoni G, Sassera D, Zuccotti GV, Bandi C, Baldanti F, Comandatore F, Gaiarsa S. P-DOR, an easy-to-use pipeline to reconstruct bacterial outbreaks using genomics. Bioinformatics 2023; 39:btad571. [PMID: 37701995 PMCID: PMC10533420 DOI: 10.1093/bioinformatics/btad571] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/01/2023] [Revised: 08/24/2023] [Accepted: 09/12/2023] [Indexed: 09/14/2023] Open
Abstract
SUMMARY: Bacterial healthcare-associated infections (HAIs) are a major threat worldwide, which can be counteracted by establishing effective infection control measures, guided by constant surveillance and timely epidemiological investigations. Genomics is crucial in modern epidemiology but lacks standard methods and user-friendly software accessible to users without strong bioinformatics proficiency. To overcome these issues, we developed P-DOR, a novel tool for rapid bacterial outbreak characterization. P-DOR accepts genome assemblies as input, automatically selects a background of publicly available genomes using k-mer distances, and adds it to the analysis dataset before inferring a single-nucleotide polymorphism (SNP)-based phylogeny. Epidemiological clusters are identified considering the phylogenetic tree topology and SNP distances. By analyzing the SNP-distance distribution, the user can gauge the correct threshold. Patient metadata can be provided as well, to give a spatio-temporal representation of the outbreak. The entire pipeline is fast and scalable and can also be run on low-end computers.
AVAILABILITY AND IMPLEMENTATION: P-DOR is implemented in Python3 and R and can be installed using conda environments. It is available from GitHub (https://github.com/SteMIDIfactory/P-DOR) under the GPL-3.0 license.
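The abstract above describes grouping genomes by SNP distance under a user-chosen threshold. A minimal sketch of that idea follows; it is not P-DOR's code. The function names, the toy aligned sequences, and the single-linkage rule are assumptions made for illustration, and the sketch ignores the phylogenetic tree topology that P-DOR also takes into account.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count differing positions between two aligned sequences, skipping gaps and Ns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    ignore = {"-", "N"}
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x not in ignore and y not in ignore
    )


def cluster_by_threshold(aligned, threshold):
    """Single-linkage clustering: link two genomes if their SNP distance <= threshold."""
    parent = {name: name for name in aligned}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(aligned, 2):
        if snp_distance(aligned[a], aligned[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in aligned:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical toy alignment (not real data).
toy = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACGAACGTAT",
    "background_1": "TCGAACCTGT",
}
clusters = cluster_by_threshold(toy, threshold=1)
```

With a threshold of 1, the three isolates chain together through pairwise distances of 1, while the background genome stays in its own cluster. Raising or lowering the threshold changes the grouping, which is why inspecting the SNP-distance distribution first is useful.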
Affiliation(s)
- Greta Bellinzona
  - Department of Biology and Biotechnology, University of Pavia, Pavia, 27100, Italy
- Greta Petazzoni
  - Department of Medical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, Pavia, 27100, Italy
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, 27100, Italy
- Davide Sassera
  - Department of Biology and Biotechnology, University of Pavia, Pavia, 27100, Italy
  - Fondazione IRCCS Policlinico San Matteo, Pavia, 27100, Italy
- Gian Vincenzo Zuccotti
  - Department of Biomedical and Clinical Sciences, Pediatric Clinical Research Center Romeo ed Enrica Invernizzi, University of Milan, Milan, 20157, Italy
  - Pediatric Department, Buzzi Children’s Hospital, Milan, 20154, Italy
- Claudio Bandi
  - Department of Biosciences, Pediatric Clinical Research Center Romeo ed Enrica Invernizzi, University of Milan, Milan, 20133, Italy
- Fausto Baldanti
  - Department of Medical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, Pavia, 27100, Italy
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, 27100, Italy
- Francesco Comandatore
  - Department of Biomedical and Clinical Sciences, Pediatric Clinical Research Center Romeo ed Enrica Invernizzi, University of Milan, Milan, 20157, Italy
- Stefano Gaiarsa
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, 27100, Italy
2
Jiang Z, Zhang H, Ahearn TU, Garcia-Closas M, Chatterjee N, Zhu H, Zhan X, Zhao N. The sequence kernel association test for multicategorical outcomes. Genet Epidemiol 2023; 47:432-449. [PMID: 37078108 DOI: 10.1002/gepi.22527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Revised: 03/29/2023] [Accepted: 03/30/2023] [Indexed: 04/21/2023]
Abstract
Disease heterogeneity is ubiquitous in biomedical and clinical studies. In genetic studies, researchers are increasingly interested in understanding the distinct genetic underpinning of subtypes of diseases. However, existing set-based analysis methods for genome-wide association studies are either inadequate or inefficient to handle such multicategorical outcomes. In this paper, we proposed a novel set-based association analysis method, SKAT-MC, the sequence kernel association test (SKAT) for multicategorical outcomes (nominal or ordinal), which jointly evaluates the relationship between a set of variants (common and rare) and disease subtypes. Through comprehensive simulation studies, we showed that SKAT-MC effectively preserves the nominal type I error rate while substantially increasing the statistical power compared to existing methods under various scenarios. We applied SKAT-MC to the Polish breast cancer study (PBCS) and identified that the gene FGFR2 was significantly associated with estrogen receptor (ER)+ and ER- breast cancer subtypes. We also investigated educational attainment using UK Biobank data (N = 127,127) with SKAT-MC and identified 21 significant genes in the genome. Consequently, SKAT-MC is a powerful and efficient analysis tool for genetic association studies with multicategorical outcomes. A freely distributed R package, SKAT-MC, can be accessed at https://github.com/Zhiwen-Owen-Jiang/SKATMC.
Affiliation(s)
- Zhiwen Jiang
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Haoyu Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Thomas U Ahearn
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Montserrat Garcia-Closas
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Nilanjan Chatterjee
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Xiang Zhan
  - Department of Biostatistics, Peking University, Beijing, China
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA
3
Yang Z, Guarracino A, Biggs PJ, Black MA, Ismail N, Wold JR, Merriman TR, Prins P, Garrison E, de Ligt J. Pangenome graphs in infectious disease: a comprehensive genetic variation analysis of Neisseria meningitidis leveraging Oxford Nanopore long reads. Front Genet 2023; 14:1225248. [PMID: 37636268 PMCID: PMC10448961 DOI: 10.3389/fgene.2023.1225248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Accepted: 08/01/2023] [Indexed: 08/29/2023] Open
Abstract
Whole genome sequencing has revolutionized infectious disease surveillance for tracking and monitoring the spread and evolution of pathogens. However, using a linear reference genome for genomic analyses may introduce biases, especially when studies are conducted on highly variable bacterial genomes of the same species. Pangenome graphs provide an efficient model for representing and analyzing multiple genomes and their variants as a graph structure that includes all types of variations. In this study, we present a practical bioinformatics pipeline that employs the PanGenome Graph Builder and the Variation Graph toolkit to build pangenomes from assembled genomes, align whole genome sequencing data and call variants against a graph reference. The pangenome graph enables the identification of structural variants, rearrangements, and small variants (e.g., single nucleotide polymorphisms and insertions/deletions) simultaneously. We demonstrate that using a pangenome graph, instead of a single linear reference genome, improves mapping rates and variant calling for both simulated and real datasets of the pathogen Neisseria meningitidis. Overall, pangenome graphs offer a promising approach for comparative genomics and comprehensive genetic variation analysis in infectious disease. Moreover, this innovative pipeline, leveraging pangenome graphs, can bridge variant analysis, genome assembly, population genetics, and evolutionary biology, expanding the reach of genomic understanding and applications.
Affiliation(s)
- Zuyu Yang
  - Institute of Environmental Science and Research, Porirua, New Zealand
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
  - Genomics Research Centre, Human Technopole, Milan, Italy
- Patrick J. Biggs
  - Molecular Biosciences Group, School of Natural Sciences, Massey University, Palmerston North, New Zealand
  - Molecular Epidemiology and Public Health Laboratory, School of Veterinary Science, Massey University, Palmerston North, New Zealand
- Michael A. Black
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
- Nuzla Ismail
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
- Jana Renee Wold
  - School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Tony R. Merriman
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
  - Division of Clinical Immunology and Rheumatology, University of Alabama at Birmingham, Birmingham, AL, United States
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
- Joep de Ligt
  - Institute of Environmental Science and Research, Porirua, New Zealand
4
Woods A, Kramer ST, Xu D, Jiang W. Secure Comparisons of Single Nucleotide Polymorphisms Using Secure Multiparty Computation: Method Development. JMIR Bioinformatics and Biotechnology 2023; 4:e44700. [PMID: 38935952 PMCID: PMC11135223 DOI: 10.2196/44700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 05/21/2023] [Accepted: 06/09/2023] [Indexed: 06/29/2024]
Abstract
BACKGROUND: While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party.
OBJECTIVE: In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference.
METHODS: Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority.
RESULTS: We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model.
CONCLUSIONS: Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security.
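The set operations named in this abstract, and the Jaccard similarity built from them, can be illustrated in the clear. The sketch below is plaintext only and provides no security whatsoever; the paper's contribution is computing these quantities under secure multiparty computation, which is not reproduced here. The SNP encoding as (chromosome, position, allele) tuples and the two toy panels are assumptions for illustration.

```python
def jaccard_similarity(snps_a, snps_b):
    """Jaccard similarity |A & B| / |A | B| of two SNP sets."""
    union = snps_a | snps_b
    if not union:
        # Convention chosen for this sketch: two empty panels count as identical.
        return 1.0
    return len(snps_a & snps_b) / len(union)


# Hypothetical SNP panels encoded as (chromosome, position, allele).
panel_a = {("chr1", 1050, "A"), ("chr1", 2200, "T"), ("chr2", 310, "G")}
panel_b = {("chr1", 1050, "A"), ("chr2", 310, "G"), ("chr3", 77, "C")}

shared = panel_a & panel_b        # intersection
combined = panel_a | panel_b      # union
only_in_a = panel_a - panel_b     # set difference
in_one_only = panel_a ^ panel_b   # symmetric difference

similarity = jaccard_similarity(panel_a, panel_b)
```

Here two of the four distinct SNPs are shared, so the similarity is 0.5. A value near 1 would suggest the same individual, and intermediate values could hint at relatedness, subject to the panel used.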
Affiliation(s)
- Andrew Woods
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Skyler T Kramer
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Wei Jiang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
5
Duncavage EJ, Coleman JF, de Baca ME, Kadri S, Leon A, Routbort M, Roy S, Suarez CJ, Vanderbilt C, Zook JM. Recommendations for the Use of in Silico Approaches for Next-Generation Sequencing Bioinformatic Pipeline Validation: A Joint Report of the Association for Molecular Pathology, Association for Pathology Informatics, and College of American Pathologists. J Mol Diagn 2023; 25:3-16. [PMID: 36244574 DOI: 10.1016/j.jmoldx.2022.09.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 01/24/2022] [Revised: 09/14/2022] [Accepted: 09/28/2022] [Indexed: 11/21/2022] Open
Abstract
In silico approaches for next-generation sequencing (NGS) data modeling have utility in the clinical laboratory as a tool for clinical assay validation. In silico NGS data can take a variety of forms, including pure simulated data or manipulated data files in which variants are inserted into existing data files. In silico data enable simulation of a range of variants that may be difficult to obtain from a single physical sample. Such data allow laboratories to more accurately test the performance of clinical bioinformatics pipelines without sequencing additional cases. For example, clinical laboratories may use in silico data to simulate low variant allele fraction variants to test the analytical sensitivity of variant calling software or simulate a range of insertion/deletion sizes to determine the performance of insertion/deletion calling software. In this article, the Working Group reviews the different types of in silico data with their strengths and limitations, methods to generate in silico data, and how data can be used in the clinical molecular diagnostic laboratory. Survey data indicate how in silico NGS data are currently being used. Finally, potential applications for which in silico data may become useful in the future are presented.
Affiliation(s)
- Eric J Duncavage
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri
- Joshua F Coleman
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Pathology, University of Utah, Salt Lake City, Utah
- Monica E de Baca
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Pacific Pathology Partners, Seattle, Washington
- Sabah Kadri
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Pathology, Anne and Robert H Lurie Children's Hospital of Chicago, Chicago, Illinois
- Annette Leon
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Color Health, Burlingame, California
- Mark Routbort
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Hematopathology, MD Anderson Cancer Center, Houston, Texas
- Somak Roy
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Pathology and Laboratory Medicine, Cincinnati Children's Hospital, Cincinnati, Ohio
- Carlos J Suarez
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Pathology, Stanford University, Palo Alto, California
- Chad Vanderbilt
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Justin M Zook
  - In Silico Pipeline Validation Working Group of the Clinical Practice Committee, Association for Molecular Pathology, Rockville, Maryland
  - Biomarker and Genomic Sciences Group, National Institute of Standards and Technology, Gaithersburg, Maryland
6
Hirao AS, Watanabe Y, Hasegawa Y, Takagi T, Ueno S, Kaneko S. Mutational effects of chronic gamma radiation throughout the life cycle of Arabidopsis thaliana: Insight into radiosensitivity in the reproductive stage. The Science of the Total Environment 2022; 838:156224. [PMID: 35644386 DOI: 10.1016/j.scitotenv.2022.156224] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/02/2022] [Revised: 05/17/2022] [Accepted: 05/21/2022] [Indexed: 06/15/2023]
Abstract
Organisms living on Earth have always been exposed to natural sources of ionizing radiation, but following recent nuclear disasters, these background levels have often increased regionally due to the addition of man-made sources of radiation. To assess the mutational effects of ubiquitously present radiation on plants, we performed a whole-genome resequencing analysis of mutations induced by chronic irradiation throughout the life cycle of Arabidopsis thaliana grown under controlled conditions. We obtained resequencing data from 36 second generation post-mutagenesis (M2) progeny derived from 12 first generation (M1) lines grown under gamma-irradiation conditions, ranging from 0.0 to 2.0 Gray per day (Gy/day), to identify de novo mutations, including single base substitutions (SBSs) and small insertions/deletions (INDELs). The relationship between de novo mutation frequency and radiation dose rate from 0.0 to 2.0 Gy/day was assessed by statistical modeling. The increase in de novo mutations in response to irradiation dose fit the negative binomial model, which accounted for the high variability of mutation frequency observed. Among the different types of mutations, SBSs were more prevalent than INDELs, and deletions were more frequent than insertions. Furthermore, we observed that the mutational effects of chronic radiation were greater during the reproductive stage. These results will provide valuable insights into practical strategies for analyzing mutational effects in wild plants growing in environments with various mutagens.
Affiliation(s)
- Akira S Hirao
  - Faculty of Symbiotic Systems Science, Fukushima University, 1 Kanayagawa, Fukushima, Fukushima 960-1296, Japan
  - National Research Institute of Fisheries Science, Japan Fisheries Research and Education Agency, 2-12-4 Fukuura, Kanazawa, Yokohama, Kanagawa 236-8648, Japan
- Yoshito Watanabe
  - Fukushima Project Headquarters, National Institute of Radiological Sciences, National Institutes for Quantum and Radiological Science and Technology, 4-9-1 Anagawa, Inage-ku, Chiba 263-8555, Japan
- Yoichi Hasegawa
  - Department of Forest Molecular Genetics and Biotechnology, Forestry and Forest Products Research Institute, Forest Research and Management Organization, 1 Matsunosato, Tsukuba, Ibaraki, Japan
- Toshihito Takagi
  - Graduate School of Symbiotic Systems Science and Technology, Fukushima University, 1 Kanayagawa, Fukushima, Fukushima, Japan
- Saneyoshi Ueno
  - Department of Forest Molecular Genetics and Biotechnology, Forestry and Forest Products Research Institute, Forest Research and Management Organization, 1 Matsunosato, Tsukuba, Ibaraki, Japan
- Shingo Kaneko
  - Faculty of Symbiotic Systems Science, Fukushima University, 1 Kanayagawa, Fukushima, Fukushima 960-1296, Japan
  - Institute of Environmental Radioactivity, Fukushima University, 1 Kanayagawa, Fukushima, Fukushima, Japan
7
Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022; 12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022] Open
Abstract
Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.
Affiliation(s)
- Kiwoong Nam
  - DGIMI, Univ Montpellier, INRAE, Montpellier, France
8
Shirafuji S, Torikai H. A novel ergodic cellular automaton gene network model towards efficient hardware-based genome simulator. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2022; 2022:2232-2235. [PMID: 36086611 DOI: 10.1109/embc48229.2022.9871858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In this paper, a novel ergodic cellular automaton model of the Hes1 mRNA and Hes1 protein network is presented. Detailed analyses reveal that the presented network model can reproduce a typical nonlinear bifurcation phenomenon observed in a conventional delay differential equation model of the Hes1 mRNA and Hes1 protein network. Furthermore, the presented network model is implemented on a field programmable gate array and its operation is validated by experiments. It is shown that the presented network model uses far fewer circuit elements and much less power than the delay differential equation network model. Hence, the results of this paper provide fundamental knowledge for designing an efficient hardware-based gene network simulator.
9
Li J, Llorente B, Liti G, Yue JX. RecombineX: A generalized computational framework for automatic high-throughput gamete genotyping and tetrad-based recombination analysis. PLoS Genet 2022; 18:e1010047. [PMID: 35533184 PMCID: PMC9119626 DOI: 10.1371/journal.pgen.1010047] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/23/2022] [Revised: 05/19/2022] [Accepted: 04/14/2022] [Indexed: 01/09/2023] Open
Abstract
Meiotic recombination is an essential biological process that ensures faithful chromosome segregation and promotes parental allele shuffling. Tetrad analysis is a powerful approach to quantify the genetic makeups and recombination landscapes of meiotic products. Here we present RecombineX (https://github.com/yjx1217/RecombineX), a generalized computational framework that automates the full workflow of marker identification, gamete genotyping, and tetrad-based recombination profiling based on any organism or genetic background with batch processing capability. Aside from conventional reference-based analysis, RecombineX can also perform analysis based on parental genome assemblies, which facilitates analyzing meiotic recombination landscapes in their native genomic contexts. Additional features such as copy number variation profiling and missing genotype inference further enhance downstream analysis. RecombineX also includes a dedicated module for simulating the genomes and reads of recombinant tetrads, which enables fine-tuned simulation-based hypothesis testing. This simulation module revealed the power and accuracy of RecombineX even when analyzing tetrads with very low sequencing depths (e.g., 1-2X). Tetrad sequencing data from the budding yeast Saccharomyces cerevisiae and the green alga Chlamydomonas reinhardtii were further used to demonstrate the accuracy and robustness of RecombineX for organisms with both small and large genomes, establishing RecombineX as an all-around one-stop solution for future tetrad analysis. Interestingly, our re-analysis of the budding yeast tetrad sequencing data with RecombineX and Oxford Nanopore sequencing revealed two unusual structural rearrangement events that had not been noticed before, which exemplify the occasional genome instability triggered by meiosis.
Affiliation(s)
- Jing Li
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Sun Yat-sen University Cancer Center, Guangzhou, China
  - Université Côte d’Azur, CNRS, INSERM, IRCAN, Nice, France
- Bertrand Llorente
  - Aix-Marseille Université, CNRS, INSERM, CRCM, Institut Paoli-Calmettes, Marseille, France
- Gianni Liti
  - Université Côte d’Azur, CNRS, INSERM, IRCAN, Nice, France
- Jia-Xing Yue
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Sun Yat-sen University Cancer Center, Guangzhou, China
  - Université Côte d’Azur, CNRS, INSERM, IRCAN, Nice, France
10
Chen D, Randhawa GS, Soltysiak MPM, de Souza CPE, Kari L, Singh SM, Hill KA. SomaticSiMu: A mutational signature simulator. Bioinformatics 2022; 38:2619-2620. [PMID: 35258549 DOI: 10.1093/bioinformatics/btac128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2021] [Revised: 02/01/2022] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: SomaticSiMu is an in silico simulator of single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. SomaticSiMu outputs simulated DNA sequences and mutational catalogues with imposed mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates, and built-in visualization tools of the simulated mutations. Simulated datasets are useful as a ground truth to test the accuracy and sensitivity of DNA sequence classification tools and mutational signature extraction tools under different experimental scenarios. The reliability of SomaticSiMu was affirmed by 1) supervised machine learning classification of simulated sequences with different mutation types and burdens, and 2) mutational signature extraction from simulated mutational catalogues.
AVAILABILITY AND IMPLEMENTATION: SomaticSiMu is written in Python 3.8.3. The open-source code, documentation, and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
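To make the idea of simulating substitutions with a ground-truth catalogue concrete, here is a minimal sketch. It is not SomaticSiMu's implementation or interface: the function name, the uniform per-base mutation rate, and the catalogue format are assumptions. Real mutational signatures weight substitutions by sequence context, which this sketch deliberately omits.

```python
import random

BASES = "ACGT"


def simulate_substitutions(sequence, rate, seed=0):
    """Apply single base substitutions at a uniform per-base rate.

    Returns the mutated sequence and a catalogue of (position, ref, alt)
    tuples that serves as the ground truth for the simulated mutations.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    mutated = list(sequence)
    catalogue = []
    for pos, ref in enumerate(sequence):
        # Only mutate unambiguous bases; leave N or other symbols untouched.
        if ref in BASES and rng.random() < rate:
            alt = rng.choice([b for b in BASES if b != ref])
            mutated[pos] = alt
            catalogue.append((pos, ref, alt))
    return "".join(mutated), catalogue


# Hypothetical input sequence for illustration.
reference = "ACGT" * 50
mutated_sequence, truth_catalogue = simulate_substitutions(reference, rate=0.1, seed=1)
```

The returned catalogue records every change that was made, so a downstream classifier or signature-extraction tool can be scored against it.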
Affiliation(s)
- David Chen
  - Department of Biology, Western University, London, Ontario, Canada
- Gurjit S Randhawa
  - School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Camila P E de Souza
  - Department of Statistical and Actuarial Sciences, Western University, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Shiva M Singh
  - Department of Biology, Western University, London, Ontario, Canada
- Kathleen A Hill
  - Department of Biology, Western University, London, Ontario, Canada
11
Liao H, Cai D, Sun Y. VirStrain: a strain identification tool for RNA viruses. Genome Biol 2022; 23:38. [PMID: 35101081 PMCID: PMC8801933 DOI: 10.1186/s13059-022-02609-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/21/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022] Open
Abstract
Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool taking short reads as input with viral strain composition as output. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy.
Affiliation(s)
- Herui Liao
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
12
Lorente-Leal V, Farrell D, Romero B, Álvarez J, de Juan L, Gordon SV. Performance and Agreement Between WGS Variant Calling Pipelines Used for Bovine Tuberculosis Control: Toward International Standardization. Front Vet Sci 2022; 8:780018. [PMID: 34970617 PMCID: PMC8712436 DOI: 10.3389/fvets.2021.780018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Accepted: 11/25/2021] [Indexed: 11/29/2022] Open
Abstract
Whole genome sequencing (WGS) and allied variant calling pipelines are a valuable tool for the control and eradication of infectious diseases, since they allow the assessment of the genetic relatedness of strains of animal pathogens. In the context of the control of tuberculosis (TB) in livestock, mainly caused by Mycobacterium bovis, these tools offer a high-resolution alternative to traditional molecular methods in the study of herd breakdown events. However, despite the increased use and efforts in the standardization of WGS methods in human tuberculosis around the world, the application of these WGS-enabled approaches to control TB in livestock is still in early development. Our study pursued an initial evaluation of the performance and agreement of four publicly available pipelines for the analysis of M. bovis WGS data (vSNP, SNiPgenie, BovTB, and MTBseq) on a set of simulated Illumina reads generated from a real-world setting with high TB prevalence in cattle and wildlife in the Republic of Ireland. The overall performance of the evaluated pipelines was high, with recall and precision rates above 99% once repeat-rich and problematic regions were removed from the analyses. In addition, when the same filters were applied, distances between inferred phylogenetic trees were similar and pairwise comparison revealed that most of the differences were due to the positioning of polytomies. Hence, under the studied conditions, all pipelines offer similar performance for variant calling to underpin real-world studies of M. bovis transmission dynamics.
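The abstract reports recall and precision once repeat-rich and problematic regions are excluded. The sketch below shows how such filtering changes the two metrics. The positions, masked intervals, and function names are hypothetical, and variants are reduced to bare genome positions for brevity, so it does not reflect how any of the four evaluated pipelines compute their results.

```python
def in_masked_region(pos, masked_regions):
    """True if a position falls inside any (start, end) interval, inclusive."""
    return any(start <= pos <= end for start, end in masked_regions)


def precision_recall(called, truth, masked_regions=()):
    """Precision and recall of called variant positions against a truth set.

    Positions inside masked regions are dropped from both sets before scoring.
    """
    called = {p for p in called if not in_masked_region(p, masked_regions)}
    truth = {p for p in truth if not in_masked_region(p, masked_regions)}
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: two false positives and one missed variant sit in masked regions.
truth_positions = {100, 250, 400, 900}
called_positions = {100, 250, 400, 520, 560}
masked = [(500, 600), (850, 950)]

unfiltered = precision_recall(called_positions, truth_positions)
filtered = precision_recall(called_positions, truth_positions, masked)
```

In this toy case the unfiltered scores are 0.6 precision and 0.75 recall, and both rise to 1.0 after masking, because every error lies in a masked interval.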
Affiliation(s)
- Víctor Lorente-Leal
  - VISAVET Health Surveillance Center, Universidad Complutense de Madrid, Madrid, Spain; Animal Health Department, Faculty of Veterinary Medicine, Universidad Complutense de Madrid, Madrid, Spain
- Damien Farrell
  - UCD School of Veterinary Medicine, University College Dublin, Dublin, Ireland
- Beatriz Romero
  - VISAVET Health Surveillance Center, Universidad Complutense de Madrid, Madrid, Spain; Animal Health Department, Faculty of Veterinary Medicine, Universidad Complutense de Madrid, Madrid, Spain
- Julio Álvarez
  - VISAVET Health Surveillance Center, Universidad Complutense de Madrid, Madrid, Spain; Animal Health Department, Faculty of Veterinary Medicine, Universidad Complutense de Madrid, Madrid, Spain
- Lucía de Juan
  - VISAVET Health Surveillance Center, Universidad Complutense de Madrid, Madrid, Spain; Animal Health Department, Faculty of Veterinary Medicine, Universidad Complutense de Madrid, Madrid, Spain
- Stephen V Gordon
  - UCD School of Veterinary Medicine, University College Dublin, Dublin, Ireland
|
13
|
Liu Y, Jiang T, Gao Y, Liu B, Zang T, Wang Y. Psi-Caller: A Lightweight Short Read-Based Variant Caller With High Speed and Accuracy. Front Cell Dev Biol 2021; 9:731424. [PMID: 34485311 PMCID: PMC8414796 DOI: 10.3389/fcell.2021.731424] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/27/2021] [Accepted: 07/15/2021] [Indexed: 01/23/2023] Open
Abstract
With the rapid development of short-read sequencing technologies, many population-scale resequencing studies have been carried out in recent years to study the associations between human genome variants and various phenotypes. Variant calling is one of the core bioinformatics tasks in such studies, used to comprehensively discover genomic variants in sequenced samples. Many efforts have been made to develop short read-based variant calling approaches; however, state-of-the-art tools are still computationally expensive. Meanwhile, cutting-edge genomics studies also place higher requirements on the yields of variant calling. Herein, we propose Partial-Order Alignment-based single nucleotide variant (SNV) and Indel caller (Psi-caller), a lightweight variant calling algorithm that simultaneously achieves high performance and yield. Psi-caller divides candidate variant sites into three categories according to the complexity and location of their signatures, and employs methods including a binomial model, partial-order alignment, and de Bruijn graph-based local assembly to handle the respective categories when calling and genotyping SNVs/Indels. Benchmarks on simulated and real short-read sequencing data sets demonstrate that Psi-caller is times faster than state-of-the-art tools with higher or equal sensitivity and accuracy. It has the potential to handle large-scale data sets well in cutting-edge genomics studies.
Affiliation(s)
- Yadong Liu
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Tao Jiang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Yan Gao
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Bo Liu
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Tianyi Zang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, China
|
14
|
Kühl MA, Stich B, Ries DC. Mutation-Simulator: fine-grained simulation of random mutations in any genome. Bioinformatics 2021; 37:568-569. [PMID: 32780803 PMCID: PMC8088320 DOI: 10.1093/bioinformatics/btaa716] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 03/18/2020] [Revised: 06/12/2020] [Accepted: 08/05/2020] [Indexed: 01/11/2023] Open
Abstract
SUMMARY: Mutation-Simulator allows the introduction of various types of sequence alterations in reference sequences, with reasonable compute-time even for large eukaryotic genomes. Its intuitive system for fine-grained control over mutation rates along the sequence enables the mimicking of natural mutation patterns. Using standard file formats for input and output data, it can easily be integrated into any development and benchmarking workflow for high-throughput sequencing applications. AVAILABILITY AND IMPLEMENTATION: Mutation-Simulator is written in Python 3 and the source code, documentation, help and use cases are available on the GitHub page at https://github.com/mkpython3/Mutation-Simulator. It is free for use under the GPL 3 license.
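The idea of position-dependent mutation rates can be sketched in a few lines of Python. This is not Mutation-Simulator's code or interface: the `mutate` function, its `region_rates` parameter, and the restriction to single-base substitutions are assumptions made purely for illustration.

```python
import random

BASES = "ACGT"


def mutate(sequence, default_rate, region_rates=(), seed=None):
    """Introduce random substitutions into a DNA string.

    region_rates is an iterable of (start, end, rate) tuples using 0-based,
    half-open coordinates; positions outside every region use default_rate.
    Returns the mutated sequence and a list of (1-based position, ref, alt).
    """
    rng = random.Random(seed)
    mutated, changes = [], []
    for i, base in enumerate(sequence):
        rate = default_rate
        for start, end, region_rate in region_rates:
            if start <= i < end:
                rate = region_rate
                break
        # Only unambiguous bases are mutated; N and other symbols pass through.
        if base in BASES and rng.random() < rate:
            alt = rng.choice([b for b in BASES if b != base])
            mutated.append(alt)
            changes.append((i + 1, base, alt))
        else:
            mutated.append(base)
    return "".join(mutated), changes


# A hypothetical hotspot in the first four bases, no mutations elsewhere.
new_seq, changes = mutate("ACGT" * 5, 0.0, region_rates=[(0, 4, 1.0)], seed=1)
```

Returning the list of changes alongside the sequence keeps a record of the simulated truth, which is what a benchmarking workflow would compare variant calls against.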
Affiliation(s)
- M A Kühl
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
- B Stich
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
- D C Ries
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
|
15
|
Das JK, Roy S. A study on non-synonymous mutational patterns in structural proteins of SARS-CoV-2. Genome 2021; 64:665-678. [PMID: 33788636 DOI: 10.1139/gen-2020-0157] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 12/20/2022]
Abstract
SARS-CoV-2 is mutating and creating divergent variants across the world. An in-depth investigation of the amino acid substitutions in the genomic signature of SARS-CoV-2 proteins is essential for understanding its host adaptation and infection biology. A total of 9587 SARS-CoV-2 structural protein sequences collected from 49 different countries are used to characterize protein-wise variants, substitution patterns (type and location), and major substitution changes. The majority of the substitutions are distinct, mostly occur in a particular location, and lead to a change in an amino acid's biochemical properties. In terms of mutational changes, envelope (E) and membrane (M) proteins are relatively more stable than nucleocapsid (N) and spike (S) proteins. Several co-occurring substitutions are observed, particularly in S and N proteins. Substitutions specific to active sub-domains reveal that the heptapeptide repeat, fusion peptides, and transmembrane domains in the S protein, and the N-terminal and C-terminal domains in the N protein, are remarkably mutated. We also observe a few deleterious mutations in the above domains. The overall study on non-synonymous mutation in structural proteins of SARS-CoV-2 at the start of the pandemic indicates a diversity amongst virus sequences.
Affiliation(s)
- Jayanta Kumar Das
  - Department of Pediatrics, Johns Hopkins University School of Medicine, Maryland, USA
- Swarup Roy
  - Network Reconstruction & Analysis (NetRA) Lab, Department of Computer Applications, Sikkim University, Gangtok, India
|
16
|
Whibley A, Kelley JL, Narum SR. The changing face of genome assemblies: Guidance on achieving high-quality reference genomes. Mol Ecol Resour 2021; 21:641-652. [PMID: 33326691 DOI: 10.1111/1755-0998.13312] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 09/17/2020] [Revised: 12/08/2020] [Accepted: 12/11/2020] [Indexed: 12/20/2022]
Abstract
The quality of genome assemblies has improved rapidly in recent years due to continual advances in sequencing technology, assembly approaches, and quality control. In the field of molecular ecology, this has led to the development of exceptional quality genome assemblies that will be important long-term resources for broader studies into ecological, conservation, evolutionary, and population genomics of naturally occurring species. Moreover, the extent to which a single reference genome represents the diversity within a species varies: pan-genomes will become increasingly important ecological genomics resources, particularly in systems found to have considerable presence-absence variation in their functional content. Here, we highlight advances in technology that have raised the bar for genome assembly and provide guidance on standards to achieve exceptional quality reference genomes. Key recommendations include the following: (a) Genome assemblies should include long-read sequencing except in rare cases where it is effectively impossible to acquire adequately preserved samples needed for high molecular weight DNA standards. (b) At least one scaffolding approach, such as Hi-C or optical mapping, should be included with genome assembly. (c) Genome assemblies should be carefully evaluated; this may involve utilising short read data for genome polishing, error correction, k-mer analyses, and estimating the percent of reads that map back to an assembly. Finally, a genome assembly is most valuable if all data and methods are made publicly available and the utility of a genome for further studies is verified through examples. While these recommendations are based on current technology, we anticipate that future advances will push the field further and the molecular ecology community should continue to adopt new approaches that attain the highest quality genome assemblies.
Affiliation(s)
- Shawn R Narum
  - University of Idaho, Moscow, ID, USA; Columbia River Inter-Tribal Fish Commission, Hagerman, ID, USA
|
17
|
Camiolo S, Suárez NM, Chalka A, Venturini C, Breuer J, Davison AJ. GRACy: A tool for analysing human cytomegalovirus sequence data. Virus Evol 2020; 7:veaa099. [PMID: 33505707 PMCID: PMC7816668 DOI: 10.1093/ve/veaa099] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/30/2022] Open
Abstract
Modern DNA sequencing has instituted a new era in human cytomegalovirus (HCMV) genomics. A key development has been the ability to determine the genome sequences of HCMV strains directly from clinical material. This involves the application of complex and often non-standardized bioinformatics approaches to analysing data of variable quality in a process that requires substantial manual intervention. To relieve this bottleneck, we have developed GRACy (Genome Reconstruction and Annotation of Cytomegalovirus), an easy-to-use toolkit for analysing HCMV sequence data. GRACy automates and integrates modules for read filtering, genotyping, genome assembly, genome annotation, variant analysis, and data submission. These modules were tested extensively on simulated and experimental data and outperformed generic approaches. GRACy is written in Python and is embedded in a graphical user interface with all required dependencies installed by a single command. It runs on the Linux operating system and is designed to allow the future implementation of a cross-platform version. GRACy is distributed under a GPL 3.0 license and is freely available at https://bioinformatics.cvr.ac.uk/software/ with the manual and a test dataset.
Affiliation(s)
- Nicolás M Suárez
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Antonia Chalka
  - Division of Infection & Immunity, Roslin Institute, R(D)SVM, University of Edinburgh, Edinburgh, UK
- Cristina Venturini
  - Division of Infection and Immunity, University College London, London, UK
- Judith Breuer
  - Division of Infection and Immunity, University College London, London, UK
- Andrew J Davison
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
|
18
|
Lamb HJ, Hayes BJ, Nguyen LT, Ross EM. The Future of Livestock Management: A Review of Real-Time Portable Sequencing Applied to Livestock. Genes (Basel) 2020; 11:E1478. [PMID: 33317066 PMCID: PMC7763041 DOI: 10.3390/genes11121478] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/28/2020] [Revised: 11/10/2020] [Accepted: 12/01/2020] [Indexed: 12/12/2022] Open
Abstract
Oxford Nanopore Technologies' MinION has proven to be a valuable tool within human and microbial genetics. Its capacity to produce long reads in real time has opened up unique applications for portable sequencing. Examples include tracking the recent African swine fever outbreak in China and providing a diagnostic tool for disease in the cassava plant in Eastern Africa. Here we review the current applications of Oxford Nanopore sequencing in livestock, then focus on proposed applications in livestock agriculture for rapid diagnostics, base modification detection, reference genome assembly and genomic prediction. In particular, we propose a future application: 'crush-side genotyping' for real-time on-farm genotyping for extensive industries such as northern Australian beef production. An initial in silico experiment to assess the feasibility of crush-side genotyping demonstrated promising results. SNPs were called from simulated Nanopore data, which included the relatively high base call error rate that is characteristic of the data, and calling parameters were varied to understand the feasibility of SNP calling at low coverages in a heterozygous population. With optimised genotype calling parameters, over 85% of the 10,000 simulated SNPs were correctly called at coverages as low as 6×. These results provide preliminary evidence that Oxford Nanopore sequencing has potential to be used for real-time SNP genotyping in extensive livestock operations.
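Why low coverage is hard for heterozygous sites can be shown with a small binomial calculation in Python. This is not the simulation used in the review: the function name, the `min_reads` threshold, and the assumptions of error-free reads and equal sampling of both alleles are simplifications introduced here for illustration.

```python
from math import comb


def prob_het_detectable(coverage, min_reads=1):
    """Probability that both alleles of a heterozygous site are each seen
    at least min_reads times, assuming every read samples either allele
    with probability 0.5 and contains no base-calling errors.
    """
    # k is the number of reads carrying the first allele, Binomial(coverage, 0.5).
    favourable = sum(
        comb(coverage, k) for k in range(min_reads, coverage - min_reads + 1)
    )
    return favourable / 2 ** coverage


# At 6x coverage: 62/64 when one supporting read per allele is enough,
# 50/64 when two supporting reads per allele are required.
p_one = prob_het_detectable(6, min_reads=1)
p_two = prob_het_detectable(6, min_reads=2)
```

Requiring more supporting reads per allele guards against sequencing errors but lowers the fraction of heterozygous sites that can be called at all, which is the trade-off behind tuning genotype-calling parameters for low coverage.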
Affiliation(s)
- Harrison J. Lamb
  - Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St. Lucia, QLD 4067, Australia; (B.J.H.); (L.T.N.); (E.M.R.)
|
19
|
Jung H, Ventura T, Chung JS, Kim WJ, Nam BH, Kong HJ, Kim YO, Jeon MS, Eyun SI. Twelve quick steps for genome assembly and annotation in the classroom. PLoS Comput Biol 2020; 16:e1008325. [PMID: 33180771 PMCID: PMC7660529 DOI: 10.1371/journal.pcbi.1008325] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Indexed: 12/13/2022] Open
Abstract
Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.
Affiliation(s)
- Hyungtaek Jung
  - School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia
  - Centre for Agriculture and Bioeconomy, Queensland University of Technology, Brisbane, Queensland, Australia
- Tomer Ventura
  - Genecology Research Centre, School of Science and Engineering, University of the Sunshine Coast, Sippy Downs, Queensland, Australia
- J. Sook Chung
  - Institute of Marine and Environmental Technology, University of Maryland Center for Environmental Science, Baltimore, Maryland, United States of America
- Woo-Jin Kim
  - Genetics and Breeding Research Center, National Institute of Fisheries Science, Geoje, Korea
- Bo-Hye Nam
  - Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea
- Hee Jeong Kong
  - Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea
- Young-Ok Kim
  - Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea
- Min-Seung Jeon
  - Department of Life Science, Chung-Ang University, Seoul, Korea
- Seong-il Eyun
  - Department of Life Science, Chung-Ang University, Seoul, Korea
|
20
|
Li Y, Wang S, Bi C, Qiu Z, Li M, Gao X. DeepSimulator1.5: a more powerful, quicker and lighter simulator for Nanopore sequencing. Bioinformatics 2020; 36:2578-2580. [PMID: 31913436 PMCID: PMC7178411 DOI: 10.1093/bioinformatics/btz963] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/12/2019] [Revised: 11/17/2019] [Accepted: 01/03/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Nanopore sequencing is one of the leading third-generation sequencing technologies. A number of computational tools have been developed to facilitate the processing and analysis of the Nanopore data. Previously, we have developed DeepSimulator1.0 (DS1.0), which is the first simulator for Nanopore sequencing to produce both the raw electrical signals and the reads. However, although DS1.0 can produce high-quality reads, for some sequences, the divergence between the simulated raw signals and the real signals can be large. Furthermore, the Nanopore sequencing technology has evolved greatly since DS1.0 was released. It is thus necessary to update DS1.0 to accommodate those changes. RESULTS: We propose DeepSimulator1.5 (DS1.5), all three modules of which have been updated substantially from DS1.0. As for the sequence generator, we updated the sample read length distribution to reflect the newest real reads' features. In terms of the signal generator, which is the core of DeepSimulator, we added one more pore model, the context-independent pore model, which is much faster than the previous context-dependent one. Furthermore, to make the generated signals more similar to the real ones, we added a low-pass filter to post-process the pore model signals. Regarding the basecaller, we added the support for the newest official basecaller, Guppy, which can support both GPU and CPU. In addition, multiple optimizations, related to multiprocessing control, memory and storage management, have been implemented to make DS1.5 a much more amenable and lighter simulator than DS1.0. AVAILABILITY AND IMPLEMENTATION: The main program and the data are available at https://github.com/lykaust15/DeepSimulator. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yu Li
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Sheng Wang
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Tencent AI Lab, Shenzhen 518000, China
- Chongwei Bi
  - Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Zhaowen Qiu
  - Institute of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Mo Li
  - Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Xin Gao
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
|
21
|
He C, Lin G, Wei H, Tang H, White FF, Valent B, Liu S. Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences. NAR Genom Bioinform 2020; 2:lqaa075. [PMID: 33575622 PMCID: PMC7671381 DOI: 10.1093/nargab/lqaa075] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/18/2020] [Revised: 08/02/2020] [Accepted: 09/01/2020] [Indexed: 12/25/2022] Open
Abstract
Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
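The comparison between read-derived and assembly copy numbers can be sketched in Python. The formula used here, log2((c + m) / (m * (n + 1))), is an assumption about the shape of such a metric and should be checked against the published definition; the function names and toy values are hypothetical, and real analyses would count k-mers with a dedicated tool.

```python
from collections import Counter
from math import log2


def count_kmers(sequences, k):
    """Count every k-mer occurring in an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kad(read_count, mode, assembly_copies):
    """Abundance difference for one k-mer (assumed form of the metric).

    read_count: occurrences of the k-mer in the short reads (c)
    mode: typical read count of a single-copy k-mer (m)
    assembly_copies: occurrences of the k-mer in the assembly (n)
    """
    return log2((read_count + mode) / (mode * (assembly_copies + 1)))


mode = 30  # hypothetical read depth of single-copy k-mers
good = kad(30, mode, 1)      # reads and assembly agree
error = kad(0, mode, 1)      # in the assembly but unsupported by reads
missing = kad(30, mode, 0)   # supported by reads but absent from the assembly
```

Under this form, agreement gives 0, an unsupported assembly k-mer gives -1, and a read-supported k-mer missing from the assembly gives 1, so values far from zero flag candidate errors, deletions, or redundancy.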
Affiliation(s)
- Cheng He
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
- Guifang Lin
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
- Hairong Wei
  - College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI 49931, USA
- Haibao Tang
  - Center for Genomics and Biotechnology and Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fujian 350002, China
- Frank F White
  - Department of Plant Pathology, University of Florida, Gainesville, FL 32611-0680, USA
- Barbara Valent
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
- Sanzhen Liu
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
|
22
|
Xing Y, Li X, Gao X, Dong Q. MicroGMT: A Mutation Tracker for SARS-CoV-2 and Other Microbial Genome Sequences. Front Microbiol 2020; 11:1502. [PMID: 32670259 PMCID: PMC7330013 DOI: 10.3389/fmicb.2020.01502] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/27/2020] [Accepted: 06/10/2020] [Indexed: 12/12/2022] Open
Abstract
With the continued spread of the SARS-CoV-2 virus around the world, researchers often need to quickly identify novel mutations in newly sequenced SARS-CoV-2 genomes to study the molecular evolution and epidemiology of the virus. We have developed a Python package, MicroGMT, which takes either raw sequence reads or assembled genome sequences as input and compares them against database sequences to identify and characterize indels and point mutations. Although our default setting is optimized for the SARS-CoV-2 virus, the package can also be applied to any other microbial genome. The software is freely available on GitHub at https://github.com/qunfengdong/MicroGMT.
Affiliation(s)
- Yue Xing
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, United States
- Xiao Li
  - Department of Molecular and Cellular Medicine, Texas A&M University, College Station, TX, United States
- Xiang Gao
  - Department of Medicine, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, United States
- Qunfeng Dong
  - Department of Medicine, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, United States; Center for Biomedical Informatics, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, United States
|
23
|
O’Donnell S, Fischer G. MUM&Co: accurate detection of all SV types through whole-genome alignment. Bioinformatics 2020; 36:3242-3243. [DOI: 10.1093/bioinformatics/btaa115] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 07/02/2019] [Revised: 01/31/2020] [Accepted: 02/18/2020] [Indexed: 11/13/2022] Open
Abstract
Summary
MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de-novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is particularly unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details.
Availability and implementation
https://github.com/SAMtoBAM/MUMandCo.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Samuel O’Donnell
  - Laboratory of Computational and Quantitative Biology, CNRS, Institut de Biologie Paris-Seine, Sorbonne Université, Paris F-75005, France
- Gilles Fischer
  - Laboratory of Computational and Quantitative Biology, CNRS, Institut de Biologie Paris-Seine, Sorbonne Université, Paris F-75005, France
|
24
|
Juan L, Wang Y, Jiang J, Yang Q, Jiang Q, Wang Y. PGsim: A Comprehensive and Highly Customizable Personal Genome Simulator. Front Bioeng Biotechnol 2020; 8:28. [PMID: 32047747 PMCID: PMC6997238 DOI: 10.3389/fbioe.2020.00028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2019] [Accepted: 01/13/2020] [Indexed: 11/26/2022] Open
Abstract
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. This is because sequencing a large number of individual genomes is costly and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. Privacy and legal barriers also prevent many applications of real data. Large sequencing projects in recent years have provided a deeper understanding of the human genome. However, there is a lack of tools that leverage known data to simulate personal genomes as realistically as possible. Here, we designed and developed PGsim, a comprehensive and highly customizable individual genome simulator that fully uses existing knowledge, such as variant allele frequencies in global or major world populations, mutation probability differences between protein-coding regions and non-coding regions, transition/transversion (Ti/Tv) ratios, Indel incidence, Indel length distribution, structural variation sites, and pathogenic mutation sites. Users can flexibly control the proportion and quantity of known variants, common variants, novel variants in both coding and non-coding regions, and special variants through detailed parameter settings. To ensure that the simulated personal genome has sufficient randomness, PGsim makes the generated variants more realistic and reliable in terms of variant distribution, proportion, and population characteristics. PGsim is able to employ a huge volume database as background data to simulate personal genomes and does not require SQL database support. Users can easily change the variant databases used as needed. As a Perl script, there is no obstacle to running PGsim on any version of macOS or Linux, and no libraries, packages, interpreters, compilers, or other dependencies need to be installed in advance. The PGsim tool is publicly available at https://github.com/lrjuan/PGsim.
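One of the properties listed above, the transition/transversion (Ti/Tv) ratio, is simple to compute for a set of simulated SNVs. The Python sketch below is independent of PGsim (which is a Perl script); the function names and the toy variant list are hypothetical and only illustrate the check a user might run on simulator output.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snvs):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    transitions = transversions = 0
    for ref, alt in snvs:
        if is_transition(ref, alt):
            transitions += 1
        elif ref != alt:
            transversions += 1
    # No transversions observed: the ratio is undefined, reported here as infinity.
    return transitions / transversions if transversions else float("inf")


# Two transitions (A>G, C>T) and one transversion (A>C).
ratio = titv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

Comparing this ratio between simulated and real variant sets is one quick sanity check on whether a simulator's substitution spectrum looks plausible.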
Affiliation(s)
- Liran Juan
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yongtian Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jingyi Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qi Yang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|