1
Wennmann JT, Lim FS, Senger S, Gani M, Jehle JA, Keilwagen J. Haplotype determination of the Bombyx mori nucleopolyhedrovirus by Nanopore sequencing and linkage of single nucleotide variants. J Gen Virol 2024; 105. [PMID: 38767624 DOI: 10.1099/jgv.0.001983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Naturally occurring isolates of baculoviruses, such as the Bombyx mori nucleopolyhedrovirus (BmNPV), usually consist of numerous genetically different haplotypes. Deciphering the different haplotypes of such isolates is hampered by the large size of the dsDNA genome, as well as the short read length of next generation sequencing (NGS) techniques that are widely applied for baculovirus isolate characterization. In this study, we addressed this challenge by combining the accuracy of NGS to determine single nucleotide variants (SNVs) as genetic markers with the long read length of Nanopore sequencing technique. This hybrid approach allowed the comprehensive analysis of genetically homogeneous and heterogeneous isolates of BmNPV. Specifically, this allowed the identification of two putative major haplotypes in the heterogeneous isolate BmNPV-Ja by SNV position linkage. SNV positions, which were determined based on NGS data, were linked by the long Nanopore reads in a Position Weight Matrix. Using a modified Expectation-Maximization algorithm, the Nanopore reads were assigned according to the occurrence of variable SNV positions by machine learning. The cohorts of reads were de novo assembled, which led to the identification of BmNPV haplotypes. The method demonstrated the strength of the combined approach of short- and long-read sequencing techniques to decipher the genetic diversity of baculovirus isolates.
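The read-assignment step described in this abstract can be illustrated with a small mixture-model sketch. The Python below is not the authors' implementation: it runs a plain (unmodified) EM over two position weight matrices, and the function names, the `-` symbol for SNV positions a read does not cover, the pseudocount, and the farthest-read seeding are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def hamming(a, b):
    """Mismatches between two reads, counting only positions both reads cover."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-" and x != y)


def normalise(counts):
    total = sum(counts.values())
    return {base: c / total for base, c in counts.items()}


def em_assign_reads(reads, n_iter=30, pseudocount=0.1):
    """Split reads into two cohorts from their alleles at SNV positions.

    reads: equal-length strings over A/C/G/T, one character per SNV position,
           with '-' where the read does not cover the position.
    Returns (labels, pwms): a hard label per read and one PWM per cohort.
    """
    n_pos = len(reads[0])
    # Assumed seeding: the first read and the read that differs most from it.
    seed_a = reads[0]
    seed_b = max(reads, key=lambda r: hamming(seed_a, r))
    pwms = []
    for seed in (seed_a, seed_b):
        pwms.append([
            normalise({b: pseudocount + (1.0 if seed[pos] == b else 0.0)
                       for b in ALPHABET})
            for pos in range(n_pos)
        ])
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each cohort for each read.
        resp = []
        for read in reads:
            logp = []
            for k in range(2):
                lp = math.log(weights[k])
                for pos, base in enumerate(read):
                    if base != "-":
                        lp += math.log(pwms[k][pos][base])
                logp.append(lp)
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate each PWM and the mixture weights.
        for k in range(2):
            for pos in range(n_pos):
                counts = {b: pseudocount for b in ALPHABET}
                for read, r in zip(reads, resp):
                    if read[pos] != "-":
                        counts[read[pos]] += r[k]
                pwms[k][pos] = normalise(counts)
            # Floor keeps log(weight) defined if one cohort empties out.
            weights[k] = max(sum(r[k] for r in resp) / len(reads), 1e-9)

    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return labels, pwms


def consensus(pwm):
    """Most probable base at each SNV position of one cohort."""
    return "".join(max(col, key=col.get) for col in pwm)


# Toy data: reads drawn from two made-up SNV patterns, ACGT and TGAC.
reads = ["ACG-", "AC-T", "-CGT", "TGA-", "TG-C", "-GAC", "ACGT", "TGAC"]
labels, pwms = em_assign_reads(reads)
print(labels)                      # [0, 0, 0, 1, 1, 1, 0, 1]
print([consensus(p) for p in pwms])  # ['ACGT', 'TGAC']
```

In the study each cohort of reads is subsequently assembled de novo; that step is outside the scope of this sketch.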
Affiliation(s)
- Jörg T Wennmann
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Fang-Shiang Lim
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Sergei Senger
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Mudasir Gani
  - Division of Entomology, Faculty of Agriculture, Sher-e-Kashmir University of Agricultural Sciences & Technology, Kashmir 193 201, J&K, India
- Johannes A Jehle
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Jens Keilwagen
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biosafety in Plant Biotechnology, Ernst-Baur-Str. 27, 06484 Quedlinburg, Germany
2
García-López R, Taboada B, Zárate S, Muñoz-Medina JE, Salas-Lais AG, Herrera-Estrella A, Boukadida C, Vazquez-Perez JA, Gómez-Gil B, Sanchez-Flores A, Arias CF. Exploration of low-frequency allelic variants of SARS-CoV-2 genomes reveals coinfections in Mexico occurred during periods of VOCs turnover. Microb Genom 2024; 10:001220. [PMID: 38512312 PMCID: PMC11004493 DOI: 10.1099/mgen.0.001220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Accepted: 03/04/2024] [Indexed: 03/22/2024]
Abstract
A total of 14 973 alleles in 29 661 sequenced samples collected between March 2021 and January 2023 by the Mexican Consortium for Genomic Surveillance (CoViGen-Mex) and collaborators were used to construct a thorough map of mutations of the Mexican SARS-CoV-2 genomic landscape containing Intra-Patient Minor Allelic Variants (IPMAVs), which are low-frequency alleles not ordinarily present in a genomic consensus sequence. This additional information proved critical in identifying putative coinfecting variants included alongside the most common variants, B.1.1.222, B.1.1.519, and variants of concern (VOCs) Alpha, Gamma, Delta, and Omicron. A total of 379 coinfection events were recorded in the dataset (a rate of 1.28 %), resulting in the first such catalogue in Mexico. The most common putative coinfections occurred during the spread of Delta or after the introduction of Omicron BA.2 and its descendants. Coinfections occurred constantly during periods of variant turnover when more than one variant shared the same niche and high infection rate was observed, which was dependent on the local variants and time. Coinfections might occur at a higher frequency than customarily reported, but they are often ignored as only the consensus sequence is reported for lineage identification.
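As a quick check of the coinfection rate quoted in this abstract, the two reported counts can be divided directly. The variable names below are illustrative only; the numbers are taken from the abstract.

```python
# Counts reported in the abstract.
coinfection_events = 379
sequenced_samples = 29_661

# Rate of coinfection as a percentage of all sequenced samples.
rate_percent = 100 * coinfection_events / sequenced_samples
print(f"{rate_percent:.2f} %")  # 1.28 %
```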
Affiliation(s)
- Rodrigo García-López
  - Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Blanca Taboada
  - Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Selene Zárate
  - Posgrado en Ciencias Genómicas, Universidad Autónoma de la Ciudad de México, Mexico City, Mexico
- José Esteban Muñoz-Medina
  - Coordinación de Calidad de Insumos y Laboratorios Especializados, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Angel Gustavo Salas-Lais
  - Coordinación de Calidad de Insumos y Laboratorios Especializados, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Alfredo Herrera-Estrella
  - Laboratorio Nacional de Genómica para la Biodiversidad-Unidad de Genómica Avanzada, Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Guanajuato, Mexico
- Celia Boukadida
  - Centro de Investigación en Enfermedades Infecciosas, Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Mexico City, Mexico
- Joel Armando Vazquez-Perez
  - Laboratorio de Biología Molecular de Enfermedades Emergentes y EPOC, Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Mexico City, Mexico
- Bruno Gómez-Gil
  - Centro de Investigación en Alimentación y Desarrollo AC, Unidad Mazatlán, Mazatlán, Sinaloa, Mexico
- Alejandro Sanchez-Flores
  - Unidad Universitaria de Secuenciación Masiva y Bioinformática, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Carlos F. Arias
  - Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
3
Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI, Carrara M, Chen C, Dondi A, Dragan M, Dreifuss D, John A, Langer B, Okoniewski M, du Plessis L, Schmitt U, Singer F, Stadler T, Beerenwinkel N. V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation. Gigascience 2024; 13:giae065. [PMID: 39347649 PMCID: PMC11440432 DOI: 10.1093/gigascience/giae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 06/11/2024] [Accepted: 08/13/2024] [Indexed: 10/01/2024]
Abstract
The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, scaling to large sample sizes, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting 2 large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.
Affiliation(s)
- Lara Fuhrmann
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Kim Philipp Jablonski
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Ivan Topolsky
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Aashil A Batavia
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Nico Borgsmüller
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Pelin Icer Baykal
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Matteo Carrara
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Chaoran Chen
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Arthur Dondi
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Monica Dragan
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- David Dreifuss
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Anika John
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Benjamin Langer
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
- Louis du Plessis
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Uwe Schmitt
  - Scientific IT Services, ETH Zurich, Zurich 8092, Switzerland
- Franziska Singer
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Tanja Stadler
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
4
Freire B, Ladra S, Parama JR, Salmela L. ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1550-1562. [PMID: 35853050 DOI: 10.1109/tcbb.2022.3190282] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, solving a min-cost flow over a flow network built for each pair of adjacent vertices based on their paired-end information creates an approximate paired assembly graph with suggested frequency values as edge labels, which is the first frequency estimation. Then, original haplotypes are obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster using at most half of the memory than previous methods, while maintaining, and in some cases outperforming, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than the de Bruijn graph-based approaches.
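The starting point named in this abstract, a de Bruijn graph built from the reads, can be sketched in a few lines. The Python below is only an illustration of that first step: the function names and the toy sequences are invented here, and the min-cost flow and path reconstruction stages of ViQUF are not reproduced.

```python
from collections import Counter, defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to its successor (k-1)-mers with k-mer counts."""
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. where haplotypes diverge."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two made-up haplotypes that differ by a single nucleotide (C vs A),
# sampled at a 3:1 ratio.
hap_major = "ATGCGTAC"
hap_minor = "ATGAGTAC"
reads = [hap_major] * 3 + [hap_minor]

graph = de_bruijn(reads, k=4)
print(branching_nodes(graph))   # ['ATG']
print(dict(graph["ATG"]))       # {'TGC': 3, 'TGA': 1}
```

The edge counts at the branching node already hint at the relative haplotype frequencies (3:1 in this toy case), which is the kind of signal a flow-based method can refine.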
5
Baaijens JA, Zulli A, Ott IM, Nika I, van der Lugt MJ, Petrone ME, Alpert T, Fauver JR, Kalinich CC, Vogels CBF, Breban MI, Duvallet C, McElroy KA, Ghaeli N, Imakaev M, Mckenzie-Bennett MF, Robison K, Plocik A, Schilling R, Pierson M, Littlefield R, Spencer ML, Simen BB, Hanage WP, Grubaugh ND, Peccia J, Baym M. Lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques. Genome Biol 2022; 23:236. [PMID: 36348471 PMCID: PMC9643916 DOI: 10.1186/s13059-022-02805-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 10/03/2021] [Accepted: 10/25/2022] [Indexed: 11/09/2022]
Abstract
Effectively monitoring the spread of SARS-CoV-2 mutants is essential to efforts to counter the ongoing pandemic. Predicting lineage abundance from wastewater, however, is technically challenging. We show that by sequencing SARS-CoV-2 RNA in wastewater and applying algorithms initially used for transcriptome quantification, we can estimate lineage abundance in wastewater samples. We find high variability in signal among individual samples, but the overall trends match those observed from sequencing clinical samples. Thus, while clinical sequencing remains a more sensitive technique for population surveillance, wastewater sequencing can be used to monitor trends in mutant prevalence in situations where clinical sequencing is unavailable.
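Transcript quantifiers commonly estimate abundances with an EM algorithm over groups of reads that are compatible with the same set of references. The Python below sketches that idea for lineages; it is an assumption-laden illustration rather than the pipeline used in the paper, and the function name, the input layout, and the read counts are all invented here. Reference length normalisation is deliberately left out.

```python
def estimate_abundances(compat_counts, lineages, n_iter=200):
    """EM estimate of lineage proportions from ambiguous read assignments.

    compat_counts: {frozenset of lineage names: number of reads compatible
                    with exactly that set of lineages}
    Returns {lineage: estimated proportion}.
    """
    theta = {name: 1.0 / len(lineages) for name in lineages}
    total = sum(compat_counts.values())
    for _ in range(n_iter):
        expected = {name: 0.0 for name in lineages}
        # E-step: split each group of reads among its compatible lineages
        # in proportion to the current abundance estimates.
        for compatible, count in compat_counts.items():
            denom = sum(theta[name] for name in compatible)
            if denom == 0.0:
                continue
            for name in compatible:
                expected[name] += count * theta[name] / denom
        # M-step: new proportions are the normalised expected read counts.
        theta = {name: expected[name] / total for name in lineages}
    return theta


# Hypothetical sample: 30 reads unique to Alpha, 10 unique to Delta,
# 60 reads that match both lineages equally well.
counts = {
    frozenset({"Alpha"}): 30,
    frozenset({"Delta"}): 10,
    frozenset({"Alpha", "Delta"}): 60,
}
abundances = estimate_abundances(counts, ["Alpha", "Delta"])
print({name: round(value, 3) for name, value in abundances.items()})
# {'Alpha': 0.75, 'Delta': 0.25}
```

The ambiguous reads end up divided 3:1, matching the ratio of the uniquely assigned reads, which is the behaviour that makes this family of methods useful for closely related lineages.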
Affiliation(s)
- Jasmijn A Baaijens
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Alessandro Zulli
  - Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Isabel M Ott
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Ioanna Nika
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mart J van der Lugt
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mary E Petrone
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Tara Alpert
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Joseph R Fauver
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  - Department of Epidemiology, University of Nebraska Medical Center, Omaha, NE, USA
- Chaney C Kalinich
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Chantal B F Vogels
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Mallery I Breban
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- William P Hanage
  - Center for Communicable Disease Dynamics and Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Nathan D Grubaugh
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Jordan Peccia
  - Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Michael Baym
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
6
Venturini C, Pang J, Tamuri AU, Roy S, Atkinson C, Griffiths P, Breuer J, Goldstein RA. Haplotype assignment of longitudinal viral deep sequencing data using covariation of variant frequencies. Virus Evol 2022; 8:veac093. [PMID: 36478783 PMCID: PMC9719071 DOI: 10.1093/ve/veac093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2021] [Revised: 09/15/2022] [Accepted: 10/05/2022] [Indexed: 11/13/2022]
Abstract
Longitudinal deep sequencing of viruses can provide detailed information about intra-host evolutionary dynamics including how viruses interact with and transmit between hosts. Many analyses require haplotype reconstruction, identifying which variants are co-located on the same genomic element. Most current methods to perform this reconstruction are based on a high density of variants and cannot perform this reconstruction for slowly evolving viruses. We present a new approach, HaROLD (HAplotype Reconstruction Of Longitudinal Deep sequencing data), which performs this reconstruction based on identifying co-varying variant frequencies using a probabilistic framework. We illustrate HaROLD on both RNA and DNA viruses with synthetic Illumina paired read data created from mixed human cytomegalovirus (HCMV) and norovirus genomes, and clinical datasets of HCMV and norovirus samples, demonstrating high accuracy, especially when longitudinal samples are available.
Affiliation(s)
- Cristina Venturini
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
- Juanita Pang
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK
- Asif U Tamuri
  - Research IT Services, University College London, London WC1E 6BT, UK
- Sunando Roy
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
- Claire Atkinson
  - Institute for Immunity and Transplantation, University College London, London NW3 2PP, UK
- Paul Griffiths
  - Institute for Immunity and Transplantation, University College London, London NW3 2PP, UK
- Judith Breuer
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
  - Great Ormond Street Hospital for Children, London WC1N 3JH, UK
- Richard A Goldstein
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
7
Molina-Mora JA, Cordero-Laurent E, Calderón-Osorno M, Chacón-Ramírez E, Duarte-Martínez F. Metagenomic pipeline for identifying co-infections among distinct SARS-CoV-2 variants of concern: study cases from Alpha to Omicron. Sci Rep 2022; 12:9377. [PMID: 35672431 PMCID: PMC9172093 DOI: 10.1038/s41598-022-13113-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/23/2022] [Accepted: 05/03/2022] [Indexed: 01/04/2023]
Abstract
Concomitant infection or co-infection with distinct SARS-CoV-2 genotypes has been reported as part of the epidemiological surveillance of the COVID-19 pandemic. In the context of the spread of more transmissible variants during 2021, co-infections are important not only because of possible changes in the clinical outcome, but also because of the chance to generate new genotypes by recombination. However, few approaches have developed bioinformatic pipelines to identify co-infections. Here we present a metagenomic pipeline based on the inference of multiple fragments similar to amplicon sequence variants (ASV-like) from sequencing data and a custom SARS-CoV-2 database to identify the concomitant presence of divergent SARS-CoV-2 genomes, i.e., variants of concern (VOCs). This approach was compared to another strategy based on whole-genome (metagenome) assembly. Using single or paired sequencing datasets of COVID-19 cases with distinct SARS-CoV-2 VOCs, each approach was used to predict the VOC classes (Alpha, Beta, Gamma, Delta, Omicron or non-VOC and their combinations). The performance of each pipeline was assessed using the ground-truth or expected VOC classes. Subsequently, the ASV-like pipeline was used to analyze 1021 cases of COVID-19 from Costa Rica to investigate the possible occurrence of co-infections. After the implementation of the two approaches, an accuracy of 96.2% was revealed for the ASV-like inference approach, which contrasts with the misclassification found (accuracy 46.2%) for the whole-genome assembly strategy. The custom SARS-CoV-2 database used for the ASV-like analysis can be updated according to the appearance of new VOCs to track co-infections with eventual new genotypes. In addition, the application of the ASV-like approach to all the 1021 sequenced samples from Costa Rica in the period October 12th-December 21st, 2021 found that none corresponded to co-infections with VOCs.
In conclusion, we developed a metagenomic pipeline based on ASV-like inference for the identification of co-infection with distinct SARS-CoV-2 VOCs, in which an outstanding accuracy was achieved. Due to the epidemiological, clinical, and molecular relevance of the concomitant infection with distinct genotypes, this work represents another piece in the process of the surveillance of the COVID-19 pandemic in Costa Rica and worldwide.
Affiliation(s)
- Jose Arturo Molina-Mora
  - Centro de Investigación en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
- Estela Cordero-Laurent
  - Instituto Costarricense de Investigación y Enseñanza en Nutrición y Salud (INCIENSA), Tres Ríos, Cartago, Costa Rica
- Melany Calderón-Osorno
  - Instituto Costarricense de Investigación y Enseñanza en Nutrición y Salud (INCIENSA), Tres Ríos, Cartago, Costa Rica
- Edgar Chacón-Ramírez
  - Centro de Investigación en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
- Francisco Duarte-Martínez
  - Instituto Costarricense de Investigación y Enseñanza en Nutrición y Salud (INCIENSA), Tres Ríos, Cartago, Costa Rica
8
Liao H, Cai D, Sun Y. VirStrain: a strain identification tool for RNA viruses. Genome Biol 2022; 23:38. [PMID: 35101081 PMCID: PMC8801933 DOI: 10.1186/s13059-022-02609-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/21/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022]
Abstract
Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool taking short reads as input with viral strain composition as output. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy.
Affiliation(s)
- Herui Liao
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
9
Luo X, Kang X, Schönhuth A. Strainline: full-length de novo viral haplotype reconstruction from noisy long reads. Genome Biol 2022; 23:29. [PMID: 35057847 PMCID: PMC8771625 DOI: 10.1186/s13059-021-02587-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/25/2021] [Accepted: 12/17/2021] [Indexed: 12/02/2022]
Abstract
Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in prevention, control and treatment of viral diseases. Current methods can either handle only relatively accurate short-read data or collapse haplotype-specific variations into a consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirms this novelty and demonstrates the superiority of Strainline.
10
Melnyk A, Mohebbi F, Knyazev S, Sahoo B, Hosseini R, Skums P, Zelikovsky A, Patterson M. From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering. J Comput Biol 2021; 28:1113-1129. [PMID: 34698508 DOI: 10.1089/cmb.2021.0302] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/24/2022]
Abstract
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus such as never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
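The abstract does not define clustering entropy, so the Python below shows one plausible reading, stated here as an assumption: the entropy of a cluster is the sum of per-column Shannon entropies of its aligned sequences, and the entropy of a clustering is the size-weighted mean over clusters. Function names and the toy sequences are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the characters in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_entropy(seqs):
    """Sum of column entropies over equal-length, aligned sequences."""
    return sum(column_entropy(col) for col in zip(*seqs))


def clustering_entropy(clusters):
    """Cluster entropies averaged with weights proportional to cluster size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * cluster_entropy(cluster)
               for cluster in clusters)


# Four made-up sequences forming two obvious groups.
good = [["ACGT", "ACGT"], ["TCGA", "TCGA"]]   # groups kept apart
bad = [["ACGT", "TCGA"], ["ACGT", "TCGA"]]    # groups mixed together

print(clustering_entropy(good))  # 0.0
print(clustering_entropy(bad))   # 2.0
```

Under this definition a lower value means more homogeneous clusters, which is consistent with the abstract treating lower entropy as the better outcome.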
Affiliation(s)
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Fatemeh Mohebbi
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Bikram Sahoo
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Roya Hosseini
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
  - World-Class Research Center "Digital Biodesign and Personalized Healthcare," I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
11
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmueller T, Sczyrba A, Dilthey A, Klawonn F, McHardy AC. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol 2021; 22:212. [PMID: 34281604 PMCID: PMC8287296 DOI: 10.1186/s13059-021-02426-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/03/2020] [Accepted: 06/29/2021] [Indexed: 01/03/2023]
Abstract
With viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
Affiliation(s)
- Adrian Fritz
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Andreas Bremges
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Zhi-Luo Deng
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till Robin Lesker
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Jasper Götting
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
  - Institute of Virology, Hannover Medical School, Hannover, Germany
- Tina Ganzenmueller
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
  - Institute of Virology, Hannover Medical School, Hannover, Germany
  - Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- Alexander Sczyrba
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Alexander Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Frank Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  - Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Alice Carolyn McHardy
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| |
Collapse
|
12
|
Bendall ML, Gibson KM, Steiner MC, Rentia U, Pérez-Losada M, Crandall KA. HAPHPIPE: Haplotype Reconstruction and Phylodynamics for Deep Sequencing of Intrahost Viral Populations. Mol Biol Evol 2021; 38:1677-1690. [PMID: 33367849 PMCID: PMC8042772 DOI: 10.1093/molbev/msaa315] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/16/2023] Open
Abstract
Deep sequencing of viral populations using next-generation sequencing (NGS) offers opportunities to understand and investigate evolution, transmission dynamics, and population genetics. Currently, the standard practice for processing NGS data to study viral populations is to summarize all the observed sequences from a sample as a single consensus sequence, thus discarding valuable information about the intrahost viral molecular epidemiology. Furthermore, existing analytical pipelines may only analyze genomic regions involved in drug resistance, thus are not suited for full viral genome analysis. Here, we present HAPHPIPE, a HAplotype and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes. The HAPHPIPE protocol includes modules for quality trimming, error correction, de novo assembly, alignment, and haplotype reconstruction. The resulting consensus sequences, haplotypes, and alignments can be further analyzed using a variety of phylogenetic and population genetic software. HAPHPIPE is designed to provide users with a single pipeline to rapidly analyze sequences from viral populations generated from NGS platforms and provide quality output properly formatted for downstream evolutionary analyses.
Affiliation(s)
- Matthew L Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Keylie M Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Margaret C Steiner
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Uzma Rentia
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
- Keith A Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
13
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that still remains open. Here we propose CliqueSNV based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables identifying minority haplotypes with the frequency below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short and long bases sequencing platforms.
Affiliation(s)
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
  - International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
14
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de Bruijn graph. Bioinformatics 2021; 37:473-481. [PMID: 32926162 DOI: 10.1093/bioinformatics/btaa782] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/13/2019] [Revised: 03/11/2020] [Accepted: 09/02/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION: RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate.
RESULTS: We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo.
AVAILABILITY AND IMPLEMENTATION: viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Borja Freire
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Susana Ladra
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Jose R Paramá
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland
15
Detecting and phasing minor single-nucleotide variants from long-read sequencing data. Nat Commun 2021; 12:3032. [PMID: 34031367 PMCID: PMC8144375 DOI: 10.1038/s41467-021-23289-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 08/24/2020] [Accepted: 04/15/2021] [Indexed: 02/04/2023] Open
Abstract
Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.
16
Cao C, He J, Mak L, Perera D, Kwok D, Wang J, Li M, Mourier T, Gavriliuc S, Greenberg M, Morrissy AS, Sycuro LK, Yang G, Jeffares DC, Long Q. Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding. Mol Biol Evol 2021; 38:2660-2672. [PMID: 33547786 PMCID: PMC8136496 DOI: 10.1093/molbev/msab037] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/30/2022] Open
Abstract
DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrated that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single position-based analysis.
Affiliation(s)
- Chen Cao
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Jingni He
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Cardiology, Xiangya Hospital, Central South University, Changsha, China
- Lauren Mak
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Present address: Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY, USA
- Deshan Perera
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Devin Kwok
  - Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- Jia Wang
  - Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
- Minghao Li
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Tobias Mourier
  - Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Stefan Gavriliuc
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Matthew Greenberg
  - Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- A Sorana Morrissy
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Laura K Sycuro
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Microbiology, Immunology, and Infectious Diseases, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
- Guang Yang
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
- Daniel C Jeffares
  - Department of Biology, York Biomedical Research Institute, University of York, York, United Kingdom
- Quan Long
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
  - Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
  - Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, AB, Canada
  - Corresponding author
17
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmüller T, Sczyrba A, Dilthey A, Klawonn F, McHardy A. Haploflow: Strain-resolved de novo assembly of viral genomes. bioRxiv 2021:2021.01.25.428049. [PMID: 33532769 PMCID: PMC7852260 DOI: 10.1101/2021.01.25.428049] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/22/2023]
Abstract
In viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.
Affiliation(s)
- A. Fritz
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
- A. Bremges
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
- Z.-L. Deng
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- T.-R. Lesker
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- J. Götting
  - DZIF, German Centre for Infection Research
  - Institute of Virology, Hannover Medical School, Hannover, Germany
- T. Ganzenmüller
  - DZIF, German Centre for Infection Research
  - Institute of Virology, Hannover Medical School, Hannover, Germany
  - Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- A. Sczyrba
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- A. Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- F. Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  - Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- A.C. McHardy
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
18
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371 PMCID: PMC8485218 DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023] Open
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
19
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infect Genet Evol 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed, in regard to highest precision, recall, and lowest UniFrac distance values, the other haplotype reconstruction tools followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
Affiliation(s)
- Anton Eliseev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Pavel Avdeyev
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
20
|
Gibson KM, Steiner MC, Rentia U, Bendall ML, Pérez-Losada M, Crandall KA. Validation of Variant Assembly Using HAPHPIPE with Next-Generation Sequence Data from Viruses. Viruses 2020; 12:E758. [PMID: 32674515 PMCID: PMC7412389 DOI: 10.3390/v12070758] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/28/2020] [Revised: 07/03/2020] [Accepted: 07/06/2020] [Indexed: 01/04/2023] Open
Abstract
Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.
Affiliation(s)
- Keylie M. Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Margaret C. Steiner
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Uzma Rentia
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Matthew L. Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4169-007 Vairão, Portugal
- Keith A. Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
21
Deng ZL, Dhingra A, Fritz A, Götting J, Münch PC, Steinbrück L, Schulz TF, Ganzenmüller T, McHardy AC. Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses. Brief Bioinform 2020; 22:5868070. [PMID: 34020538 PMCID: PMC8138829 DOI: 10.1093/bib/bbaa123] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/21/2019] [Revised: 05/18/2020] [Accepted: 05/19/2020] [Indexed: 02/06/2023] Open
Abstract
Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a 'G.G' context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
Affiliation(s)
- Zhi-Luo Deng
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
- Adrian Fritz
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
- Philipp C Münch
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research and Max von Pettenkofer Institute in Ludwig Maximilian University of Munich
- Alice C McHardy
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
|
22
|
Wang M, Li J, Zhang X, Han Y, Yu D, Zhang D, Yuan Z, Yang Z, Huang J, Zhang X. An integrated software for virus community sequencing data analysis. BMC Genomics 2020; 21:363. [PMID: 32414327 PMCID: PMC7227348 DOI: 10.1186/s12864-020-6744-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/12/2020] [Accepted: 04/21/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND A virus community is the spectrum of viral strains populating an infected host, which plays a key role in pathogenesis and therapy response in viral infectious diseases. However, an automatic and dedicated pipeline for interpreting virus community sequencing data has not yet been developed. RESULTS We developed Quasispecies Analysis Package (QAP), an integrated software platform to address the problems associated with making biological interpretations from massive viral population sequencing data. QAP provides quantitative insight into virus ecology by first introducing the definition "virus OTU" and supports a wide range of viral community analyses and result visualizations. Various forms of QAP were developed to serve a broad range of users, including a command line, a graphical user interface and a web server. The utilities of QAP were thoroughly evaluated with high-throughput sequencing data from hepatitis B virus, hepatitis C virus, influenza virus and human immunodeficiency virus, and the results showed highly accurate viral quasispecies characteristics related to biological phenotypes. CONCLUSIONS QAP provides a complete solution for virus community high-throughput sequencing data analysis, and it should facilitate the analysis of virus quasispecies in clinical applications.
Affiliation(s)
- Mingjie Wang
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Jianfeng Li
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025, China
- Xiaonan Zhang
- Key Lab of Medicine Molecular Virology of MOE/MOH, Shanghai Medical School, Fudan University, Shanghai, 200032, China
- Yue Han
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Demin Yu
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Donghua Zhang
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Zhenghong Yuan
- Key Lab of Medicine Molecular Virology of MOE/MOH, Shanghai Medical School, Fudan University, Shanghai, 200032, China
- Zhitao Yang
- Emergency Department, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Jinyan Huang
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025, China
- Xinxin Zhang
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China; Clinical Research Center, Ruijin Hospital North, Shanghai Jiaotong University, School of Medicine, Shanghai, 201821, China
|
23
|
Alamil M, Hughes J, Berthier K, Desbiez C, Thébaud G, Soubeyrand S. Inferring epidemiological links from deep sequencing data: a statistical learning approach for human, animal and plant diseases. Philos Trans R Soc Lond B Biol Sci 2020; 374:20180258. [PMID: 31056055 PMCID: PMC6553606 DOI: 10.1098/rstb.2018.0258] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/23/2022] Open
Abstract
Pathogen sequence data have been exploited to infer who infected whom, by using empirical and model-based approaches. Most of these approaches exploit one pathogen sequence per infected host (e.g. individual, household, field). However, modern sequencing techniques can reveal the polymorphic nature of within-host populations of pathogens. Thus, these techniques provide a subsample of the pathogen variants that were present in the host at the sampling time. Such data are expected to give more insight on epidemiological links than a single sequence per host. In general, a mechanistic viewpoint to transmission and micro-evolution has been followed to infer epidemiological links from these data. Here, we investigate an alternative approach grounded on statistical learning. The idea consists of learning the structure of epidemiological links with a pseudo-evolutionary model applied to training data obtained from contact tracing, for example, and using this initial stage to infer links for the whole dataset. Such an approach has the potential to be particularly valuable in the case of a risk of erroneous mechanistic assumptions, it is sufficiently parsimonious to allow the handling of big datasets in the future, and it is versatile enough to be applied to very different contexts from animal, human and plant epidemiology. This article is part of the theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: approaches and important themes’. This issue is linked with the subsequent theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: epidemic forecasting and control’.
Affiliation(s)
- M Alamil
- 1 BioSP, INRA, 84914 Avignon, France
- J Hughes
- 2 MRC-University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK
- K Berthier
- 3 Pathologie Végétale, INRA, 84140 Montfavet, France
- C Desbiez
- 3 Pathologie Végétale, INRA, 84140 Montfavet, France
- G Thébaud
- 4 BGPI, INRA, Univ. Montpellier, SupAgro, Cirad, 34398 Montpellier, France
|
24
|
Gibson KM, Jair K, Castel AD, Bendall ML, Wilbourn B, Jordan JA, Crandall KA, Pérez-Losada M. A cross-sectional study to characterize local HIV-1 dynamics in Washington, DC using next-generation sequencing. Sci Rep 2020; 10:1989. [PMID: 32029767 PMCID: PMC7004982 DOI: 10.1038/s41598-020-58410-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/23/2019] [Accepted: 12/31/2019] [Indexed: 11/08/2022] Open
Abstract
Washington, DC continues to experience a generalized HIV-1 epidemic. We characterized the local phylodynamics of HIV-1 in DC using next-generation sequencing (NGS) data. Viral samples from 68 participants from 2016 through 2017 were sequenced and paired with epidemiological data. Phylogenetic and network inferences, drug resistant mutations (DRMs), subtypes and HIV-1 diversity estimations were completed. Haplotypes were reconstructed to infer transmission clusters. Phylodynamic inferences based on the HIV-1 polymerase (pol) and envelope genes (env) were compared. Higher HIV-1 diversity (n.s.) was seen in men who have sex with men, heterosexual, and male participants in DC. 54.0% of the participants contained at least one DRM. The 40-49 year-olds showed the highest prevalence of DRMs (22.9%). Phylogenetic analysis of pol and env sequences grouped 31.9-33.8% of the participants into clusters. HIV-TRACE grouped 2.9-12.8% of participants when using consensus sequences and 9.0-64.2% when using haplotypes. NGS allowed us to characterize the local phylodynamics of HIV-1 in DC more broadly and accurately, given a better representation of its diversity and dynamics. Reconstructed haplotypes provided novel and deeper phylodynamic insights, which led to networks linking a higher number of participants. Our understanding of the HIV-1 epidemic was expanded with the powerful coupling of HIV-1 NGS data with epidemiological data.
Grants
- P30 AI117970 NIAID NIH HHS
- U01 AI069503 NIAID NIH HHS
- UM1 AI069503 NIAID NIH HHS
- This study was supported by the DC Cohort Study (U01 AI69503-03S2), a supplement from the Women’s Interagency Study for HIV-1 (410722_GR410708), a DC D-CFAR pilot award, and a 2015 HIV-1 Phylodynamics Supplement award from the District of Columbia for AIDS Research, an NIH funded program (AI117970), which is supported by the following NIH Co-Funding and Participating Institutes and Centers: NIAID, NCI, NICHD, NHLBI, NIDA, NIMH, NIA, FIC, NIGMS, NIDDK and OAR. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Affiliation(s)
- Keylie M Gibson
- Computational Biology Institute, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Kamwing Jair
- Department of Epidemiology, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Amanda D Castel
- Department of Epidemiology, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Matthew L Bendall
- Computational Biology Institute, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Brittany Wilbourn
- Department of Epidemiology, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Jeanne A Jordan
- Department of Epidemiology, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Keith A Crandall
- Computational Biology Institute, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Department of Biostatistics and Bioinformatics, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Marcos Pérez-Losada
- Computational Biology Institute, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- Department of Biostatistics and Bioinformatics, The Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA
- CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
|
25
|
Pérez-Losada M, Arenas M, Galán JC, Bracho MA, Hillung J, García-González N, González-Candelas F. High-throughput sequencing (HTS) for the analysis of viral populations. Infect Genet Evol 2020; 80:104208. [PMID: 32001386 DOI: 10.1016/j.meegid.2020.104208] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 09/30/2019] [Revised: 01/21/2020] [Accepted: 01/24/2020] [Indexed: 12/12/2022]
Abstract
The development of High-Throughput Sequencing (HTS) technologies is having a major impact on the genomic analysis of viral populations. Current HTS platforms can capture nucleic acid variation across millions of genes for both selected amplicons and full viral genomes. HTS has already facilitated the discovery of new viruses, hinted at new taxonomic classifications and provided a deeper and broader understanding of their diversity, population and genetic structure. Hence, HTS has already replaced standard Sanger sequencing in basic and applied research fields, but the next step is its implementation as a routine technology for the analysis of viruses in clinical settings. The most likely application of this implementation will be the analysis of viral genomics, because the huge population sizes, high mutation rates and very fast replacement of viral populations have demonstrated the limited information obtained with Sanger technology. In this review, we describe new technologies and provide guidelines for the high-throughput sequencing and genetic and evolutionary analyses of viral populations and metaviromes, including software applications. With the development of new HTS technologies, new and refurbished molecular and bioinformatic tools are also constantly being developed to process and integrate HTS data. These allow assembling viral genomes and inferring viral population diversity and dynamics. Finally, we also present several applications of these approaches to the analysis of viral clinical samples including transmission clusters and outbreak characterization.
Affiliation(s)
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão 4485-661, Portugal
- Miguel Arenas
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain; Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
- Juan Carlos Galán
- Microbiology Service, Hospital Ramón y Cajal, Madrid, Spain; CIBER in Epidemiology and Public Health, Spain
- Mª Alma Bracho
- CIBER in Epidemiology and Public Health, Spain; Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain
- Julia Hillung
- Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain
- Neris García-González
- Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain
- Fernando González-Candelas
- CIBER in Epidemiology and Public Health, Spain; Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain
|
26
|
Brandão PE, Hora AS, Silva SOS, Taniwaki SA, Berg M. Extinction and emergence of genomic haplotypes during the evolution of Avian coronavirus in chicken embryos. Genet Mol Biol 2020; 43:e20190064. [PMID: 32338275 PMCID: PMC7249780 DOI: 10.1590/1678-4685-gmb-2019-0064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2019] [Accepted: 06/20/2019] [Indexed: 11/21/2022] Open
Abstract
Avian coronavirus (AvCoV) is ubiquitously present in poultry as a multitude of virus lineages. Studies on AvCoV phenotypic traits are dependent on the isolation of field strains in chicken embryonated eggs, but the mutant spectrum of each isolate is not considered. This manuscript reports the previously unknown HTS (high throughput sequencing)-based complete genome haplotyping of AvCoV isolates after passages of two field strains in chicken embryonated eggs. For the first and third passages of strain 23/2013, virus loads were 6.699 log copies/μL and 6 log copies/μL and, for 38/2013, 5.699 log copies/μL and 2.699 log copies/μL of reaction, respectively. The first passage of strain 23/2013 contained no variant haplotype, while, for the third passage, five putative variant haplotypes were found, with >99.9% full genome identity with each other and with the dominant genome. Regarding strain 38/2013, five variant haplotypes were found for the first passage, with >99.9% full genome identity with each other and with the dominant genome, and a single variant haplotype was found. Extinction and emergence of haplotypes with polymorphisms in genes involved in receptor binding and regulation of RNA synthesis were observed, suggesting that phenotypic traits of AvCoV isolates are a result of their mutant spectrum.
Affiliation(s)
- Paulo E Brandão
- Universidade de São Paulo, Faculdade de Medicina Veterinária e Zootecnia, Departamento de Medicina Veterinária Preventiva e Saúde Animal, São Paulo, SP, Brazil
- Aline S Hora
- Universidade Federal de Uberlândia, Faculdade de Medicina Veterinária, Uberlândia, MG, Brazil
- Sheila O S Silva
- Universidade de São Paulo, Faculdade de Medicina Veterinária e Zootecnia, Departamento de Medicina Veterinária Preventiva e Saúde Animal, São Paulo, SP, Brazil
- Sueli A Taniwaki
- Universidade de São Paulo, Faculdade de Medicina Veterinária e Zootecnia, Departamento de Medicina Veterinária Preventiva e Saúde Animal, São Paulo, SP, Brazil
- Mikael Berg
- Swedish University of Agricultural Sciences, Department of Biomedical Sciences and Veterinary Public Health, Section of Virology, Uppsala, Sweden
|
27
|
Detecting viral sequences in NGS data. Curr Opin Virol 2019; 39:41-48. [DOI: 10.1016/j.coviro.2019.07.010] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 06/05/2019] [Revised: 07/29/2019] [Accepted: 07/30/2019] [Indexed: 01/03/2023]
|
28
|
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics 2019; 34:2927-2935. [PMID: 29617936 DOI: 10.1093/bioinformatics/bty202] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/28/2017] [Accepted: 04/02/2018] [Indexed: 12/29/2022] Open
Abstract
Motivation RNA virus populations contain different but genetically related strains, all infecting an individual host. Reconstruction of the viral haplotypes is a fundamental step to characterize the virus population, predict their viral phenotypes and finally provide important information for clinical treatment and prevention. Advances of the next-generation sequencing technologies open up new opportunities to assemble full-length haplotypes. However, error-prone short reads, high similarities between related strains, and an unknown number of haplotypes pose computational challenges for reference-free haplotype reconstruction. There is still much room to improve the performance of existing haplotype assembly tools. Results In this work, we developed a de novo haplotype reconstruction tool named PEHaplo, which employs paired-end reads to distinguish highly similar strains for viral quasispecies data. It was applied on both simulated and real quasispecies data, and the results were benchmarked against several recently published de novo haplotype reconstruction tools. The comparison shows that PEHaplo outperforms the benchmarked tools in a comprehensive set of metrics. Availability and implementation The source code and the documentation of PEHaplo are available at https://github.com/chjiao/PEHaplo. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiao Chen
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Yingchao Zhao
- School of Computing and Information Sciences, Caritas Institute of Higher Education, Hong Kong, China
- Yanni Sun
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
|
29
|
Baaijens JA, Van der Roest B, Köster J, Stougie L, Schönhuth A. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics 2019; 35:5086-5094. [DOI: 10.1093/bioinformatics/btz443] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 08/11/2018] [Revised: 04/17/2019] [Accepted: 05/27/2019] [Indexed: 11/14/2022] Open
Abstract
Motivation
Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly is the reconstruction of strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains is an important step for various treatment-related reasons. Reference genome independent (‘de novo’) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. The remaining challenge is to reconstruct full-length haplotypes together with their abundances from such contigs.
Results
We present Virus-VG as a de novo approach to viral haplotype reconstruction from preassembled contigs. Our method constructs a variation graph from the short input contigs without making use of a reference genome. Then, to obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of being compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Benchmarking experiments on challenging simulated and real datasets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates compared to the state-of-the-art viral quasispecies assemblers.
Availability and implementation
Virus-VG is freely available at https://bitbucket.org/jbaaijens/virus-vg.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jasmijn A Baaijens
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Johannes Köster
- Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Medical Oncology, Dana Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Leen Stougie
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Department of Econometrics and Operations Research, Vrije Universiteit, Amsterdam, Netherlands
- INRIA-Erable, Grenoble, France
- Alexander Schönhuth
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- INRIA-Erable, Grenoble, France
- Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
|
30
|
Barik S, Das S, Vikalo H. QSdpR: Viral quasispecies reconstruction via correlation clustering. Genomics 2018; 110:375-381. [DOI: 10.1016/j.ygeno.2017.12.007] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/05/2017] [Revised: 12/03/2017] [Accepted: 12/13/2017] [Indexed: 02/05/2023]
|
31
|
Wong TKF, Ranjard L, Lin Y, Rodrigo AG. HaploJuice: accurate haplotype assembly from a pool of sequences with known relative concentrations. BMC Bioinformatics 2018; 19:389. [PMID: 30348075 PMCID: PMC6198429 DOI: 10.1186/s12859-018-2424-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/25/2018] [Accepted: 10/09/2018] [Indexed: 11/10/2022] Open
Abstract
Background Pooling techniques, where multiple sub-samples are mixed in a single sample, are widely used to take full advantage of high-throughput DNA sequencing. Recently, Ranjard et al. (PLoS ONE 13:0195090, 2018) proposed a pooling strategy without the use of barcodes. Three sub-samples were mixed in different known proportions (i.e. 62.5%, 25% and 12.5%), and a method was developed to use these proportions to reconstruct the three haplotypes effectively. Results HaploJuice provides an alternative haplotype reconstruction algorithm for Ranjard et al.’s pooling strategy. HaploJuice significantly increases the accuracy by first identifying the empirical proportions of the three mixed sub-samples and then assembling the haplotypes using a dynamic programming approach. HaploJuice was evaluated against five different assembly algorithms, Hmmfreq (Ranjard et al., PLoS ONE 13:0195090, 2018), ShoRAH (Zagordi et al., BMC Bioinformatics 12:119, 2011), SAVAGE (Baaijens et al., Genome Res 27:835-848, 2017), PredictHaplo (Prabhakaran et al., IEEE/ACM Trans Comput Biol Bioinform 11:182-91, 2014) and QuRe (Prosperi and Salemi, Bioinformatics 28:132-3, 2012). Using simulated and real data sets, HaploJuice reconstructed the true sequences with the highest coverage and the lowest error rate. Conclusion HaploJuice provides high accuracy in haplotype reconstruction, making Ranjard et al.’s pooling strategy more efficient, feasible, and applicable, with the benefit of reducing the sequencing cost.
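The pooling idea in this abstract can be illustrated with a small sketch. It is not HaploJuice's algorithm: the abstract describes estimating empirical proportions and then using dynamic programming, whereas the code below assumes the nominal proportions (62.5%, 25%, 12.5%) and calls each site independently by brute force. The function names `call_site` and `reconstruct` and the input layout (one base-to-frequency dictionary per site) are hypothetical.

```python
from itertools import product

# Nominal mixing proportions quoted in the abstract (assumed exact here).
PROPORTIONS = (0.625, 0.25, 0.125)


def call_site(observed, proportions=PROPORTIONS, bases="ACGT"):
    """Pick one base per sub-sample so that expected base frequencies
    best match the observed ones (least squares, brute force)."""
    best, best_err = None, float("inf")
    for assignment in product(bases, repeat=len(proportions)):
        expected = {b: 0.0 for b in bases}
        for base, p in zip(assignment, proportions):
            expected[base] += p
        err = sum((expected[b] - observed.get(b, 0.0)) ** 2 for b in bases)
        if err < best_err:
            best, best_err = assignment, err
    return best


def reconstruct(site_freqs, proportions=PROPORTIONS):
    """Return one haplotype string per sub-sample, most abundant first."""
    calls = [call_site(freqs, proportions) for freqs in site_freqs]
    return ["".join(c[i] for c in calls) for i in range(len(proportions))]


sites = [
    {"A": 1.0},
    {"A": 0.63, "G": 0.37},
    {"C": 0.74, "T": 0.26},
    {"G": 0.86, "T": 0.14},
]
print(reconstruct(sites))  # ['AACG', 'AGTG', 'AGCT']
```

Because the three proportions have distinct subset sums, each observed frequency pattern points to one assignment of bases to sub-samples. Calling sites independently ignores read linkage and noise in the realized proportions, which is the part the abstract attributes to the actual method.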
Affiliation(s)
- Thomas K F Wong
- The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
- Louis Ranjard
- The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
- Yu Lin
- College of Engineering and Computer Science, The Australian National University, Acton ACT, 2601, Australia
- Allen G Rodrigo
- The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
|
32
|
Ahn S, Ke Z, Vikalo H. Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics 2018; 34:i23-i31. [PMID: 29949976 PMCID: PMC6022648 DOI: 10.1093/bioinformatics/bty291] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 01/22/2023] Open
Abstract
Motivation As RNA viruses mutate and adapt to environmental changes, often developing resistance to anti-viral vaccines and drugs, they form an ensemble of viral strains--a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small. Results This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze HTS data and reconstruct viral quasispecies characterized by highly uneven frequencies of its components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities 1-10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains. Availability and implementation TenSQR is available at https://github.com/SoYeonA/TenSQR. Supplementary information Supplementary data are available at Bioinformatics online.
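The "infer the most abundant strain, remove its reads, repeat" loop described in this abstract can be sketched as follows. This is a deliberately simplified stand-in: a column-wise majority consensus replaces the tensor factorization step, and reads are assumed to be equal-length strings over the variable sites with `-` marking uncovered positions. The names `successive_removal`, `consensus` and `hamming` are hypothetical and are not part of TenSQR.

```python
from collections import Counter


def hamming(read, hap):
    """Mismatches between a read and a haplotype at covered positions."""
    return sum(1 for r, h in zip(read, hap) if r != "-" and r != h)


def consensus(reads):
    """Column-wise majority base, ignoring uncovered ('-') positions."""
    cons = []
    for i in range(len(reads[0])):
        counts = Counter(r[i] for r in reads if r[i] != "-")
        cons.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(cons)


def successive_removal(reads, max_mismatch=0, max_strains=10):
    """Repeatedly call the dominant strain and drop the reads it explains.

    Returns (haplotype, fraction of all reads) pairs, in order of inference.
    """
    remaining = list(reads)
    total = len(remaining)
    strains = []
    while remaining and len(strains) < max_strains:
        hap = consensus(remaining)
        matched = [r for r in remaining if hamming(r, hap) <= max_mismatch]
        if not matched:
            break  # the consensus is a chimera that explains no read
        strains.append((hap, len(matched) / total))
        remaining = [r for r in remaining if hamming(r, hap) > max_mismatch]
    return strains


reads = ["ACGT"] * 6 + ["TCGA"] * 3 + ["ACGA"]
for hap, frac in successive_removal(reads):
    print(hap, frac)
```

On this toy input the loop recovers `ACGT`, then `TCGA`, then the rare `ACGA`, which only becomes the majority once the reads of the two more abundant strains have been removed. A majority consensus can produce a chimeric sequence when strains are close in abundance, which is one reason a real method needs a stronger clustering step.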
Affiliation(s)
- Soyeon Ahn
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Ziqi Ke
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Haris Vikalo
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
|
33
|
aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity. J Comput Biol 2018; 25:637-648. [DOI: 10.1089/cmb.2017.0249] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 02/05/2023] Open
|
34
|
Leviyang S, Griva I, Ita S, Johnson WE. A penalized regression approach to haplotype reconstruction of viral populations arising in early HIV/SIV infection. Bioinformatics 2018; 33:2455-2463. [PMID: 28379346 DOI: 10.1093/bioinformatics/btx187] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 10/03/2016] [Accepted: 03/29/2017] [Indexed: 12/14/2022] Open
Abstract
Motivation Next generation sequencing (NGS) has been increasingly applied to characterize viral evolution during HIV and SIV infections. In particular, NGS datasets sampled during the initial months of infection are characterized by relatively low levels of diversity as well as convergent evolution at multiple loci dispersed across the viral genome. Consequently, fully characterizing viral evolution from NGS datasets requires haplotype reconstruction across large regions of the viral genome. Existing haplotype reconstruction algorithms have not been developed with the particular characteristics of early HIV/SIV infection in mind, raising the possibility that better performance could be achieved through a specifically designed algorithm. Results Here, we introduce a haplotype reconstruction algorithm, RegressHaplo, specifically designed for low diversity and convergent evolution regimes. The algorithm uses a penalized regression that balances a data fitting term with a penalty term that encourages solutions with few haplotypes. The regression covariates are a large set of potential haplotypes and fitting the regression is made computationally feasible by the low diversity setting. Using simulated and in vivo datasets, we compare RegressHaplo to PredictHaplo and QuRe, two existing haplotype reconstruction algorithms. RegressHaplo performs better than these algorithms on simulated datasets with relatively low diversity levels. We suggest RegressHaplo as a novel tool for the investigation of early infection HIV/SIV datasets and, more generally, low diversity viral NGS datasets. Contact sr286@georgetown.edu. Availability and Implementation https://github.com/SLeviyang/RegressHaplo.
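The trade-off this abstract describes, a data-fitting term balanced against a penalty that favours few haplotypes, can be sketched with a small non-negative penalized least-squares fit. The sketch assumes an L1-style penalty solved by coordinate descent, which may differ from the penalty RegressHaplo actually uses. The biallelic `0`/`1` encoding and the names `design_rows` and `fit_frequencies` are hypothetical.

```python
def design_rows(haplotypes, alleles="01"):
    """One row per (site, allele): 1.0 where a candidate carries that allele."""
    n_sites = len(haplotypes[0])
    return [[1.0 if h[s] == a else 0.0 for h in haplotypes]
            for s in range(n_sites) for a in alleles]


def fit_frequencies(haplotypes, allele_freqs, lam=0.01, sweeps=200):
    """Minimise 0.5*||A w - b||^2 + lam*sum(w) subject to w >= 0.

    allele_freqs lists, for each site in order, the observed frequency of
    allele '0' followed by that of allele '1'.
    """
    A = design_rows(haplotypes)
    b = allele_freqs
    n_rows, n_cols = len(A), len(haplotypes)
    w = [0.0] * n_cols
    for _ in range(sweeps):
        for k in range(n_cols):
            rho, norm = 0.0, 0.0
            for i in range(n_rows):
                if A[i][k] == 0.0:
                    continue
                # prediction from all candidates except k
                others = sum(A[i][j] * w[j] for j in range(n_cols) if j != k)
                rho += A[i][k] * (b[i] - others)
                norm += A[i][k] ** 2
            # soft-threshold, then clip at zero to keep weights non-negative
            w[k] = max(0.0, (rho - lam) / norm) if norm else 0.0
    return w


candidates = ["000", "110", "001", "101"]
# Frequencies generated from a 70% "000" + 30% "110" mixture.
freqs = [0.7, 0.3, 0.7, 0.3, 1.0, 0.0]
print(fit_frequencies(candidates, freqs))
```

For this input the fit assigns weights close to 0.7 and 0.3 to the two true haplotypes (slightly shrunk by the penalty) and zero to the other two candidates. Larger values of `lam` push more candidates to zero at the cost of more shrinkage.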
Collapse
Affiliation(s)
- Sivan Leviyang
- Department of Mathematics and Statistics, Georgetown University, Washington DC, 20057, USA
| | - Igor Griva
- Department of Mathematics, George Mason University, Fairfax, VA 22030, USA
| | - Sergio Ita
- Department of Medicine, University of California - San Diego, La Jolla, CA 92093, USA
| | - Welkin E Johnson
- Department of Biology, Boston College, Chestnut Hill, MA 02467, USA
| |
|
35
|
Wymant C, Hall M, Ratmann O, Bonsall D, Golubchik T, de Cesare M, Gall A, Cornelissen M, Fraser C. PHYLOSCANNER: Inferring Transmission from Within- and Between-Host Pathogen Genetic Diversity. Mol Biol Evol 2018; 35:719-733. [PMID: 29186559 PMCID: PMC5850600 DOI: 10.1093/molbev/msx304] [Citation(s) in RCA: 96] [Impact Index Per Article: 16.0] [Indexed: 02/07/2023] Open
Abstract
A central feature of pathogen genomics is that different infectious particles (virions and bacterial cells) within an infected individual may be genetically distinct, with patterns of relatedness among infectious particles being the result of both within-host evolution and transmission from one host to the next. Here, we present a new software tool, phyloscanner, which analyses pathogen diversity from multiple infected hosts. phyloscanner provides unprecedented resolution into the transmission process, allowing inference of the direction of transmission from sequence data alone. Multiply infected individuals are also identified, as they harbor subpopulations of infectious particles that are not connected by within-host evolution, except where recombinant types emerge. Low-level contamination is flagged and removed. We illustrate phyloscanner on both viral and bacterial pathogens, namely HIV-1 sequenced on Illumina and Roche 454 platforms, HCV sequenced with the Oxford Nanopore MinION platform, and Streptococcus pneumoniae with sequences from multiple colonies per individual. phyloscanner is available from https://github.com/BDI-pathogens/phyloscanner.
Affiliation(s)
- Chris Wymant
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, Medical Research Council Centre for Outbreak Analysis and Modelling, Imperial College London, London, United Kingdom
| | - Matthew Hall
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, Medical Research Council Centre for Outbreak Analysis and Modelling, Imperial College London, London, United Kingdom
| | - Oliver Ratmann
- Department of Infectious Disease Epidemiology, Medical Research Council Centre for Outbreak Analysis and Modelling, Imperial College London, London, United Kingdom
- Department of Mathematics, Imperial College London, London, United Kingdom
| | - David Bonsall
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, United Kingdom
- Peter Medawar Building for Pathogen Research, Nuffield Department of Medicine and the NIHR Oxford BRC, University of Oxford, United Kingdom
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, United Kingdom
| | - Tanya Golubchik
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, United Kingdom
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, United Kingdom
| | - Mariateresa de Cesare
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, United Kingdom
| | - Astrid Gall
- Department of Veterinary Medicine, University of Cambridge, Cambridge, United Kingdom
| | - Marion Cornelissen
- Laboratory of Experimental Virology, Department of Medical Microbiology, Center for Infection and Immunity Amsterdam (CINIMA), Academic Medical Center of the University of Amsterdam, Amsterdam, The Netherlands
| | - Christophe Fraser
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, Medical Research Council Centre for Outbreak Analysis and Modelling, Imperial College London, London, United Kingdom
| | | |
|
36
|
Karagiannis K, Simonyan V, Chumakov K, Mazumder R. Separation and assembly of deep sequencing data into discrete sub-population genomes. Nucleic Acids Res 2017; 45:10989-11003. [PMID: 28977510 PMCID: PMC5737798 DOI: 10.1093/nar/gkx755] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 09/07/2016] [Accepted: 08/16/2017] [Indexed: 12/15/2022] Open
Abstract
Sequence heterogeneity is a common characteristic of RNA viruses that is often referred to as sub-populations or quasispecies. Traditional techniques used for assembly of short sequence reads produced by deep sequencing, such as de-novo assemblers, ignore the underlying diversity. Here, we introduce a novel algorithm that simultaneously assembles discrete sequences of multiple genomes present in populations. Using in silico data we were able to detect populations at as low as 0.1% frequency with complete global genome reconstruction and in a single sample detected 16 resolved sequences with no mismatches. We also applied the algorithm to high throughput sequencing data obtained for viruses present in sewage samples and successfully detected multiple sub-populations and recombination events in these diverse mixtures. High sensitivity of the algorithm also enables genomic analysis of heterogeneous pathogen genomes from patient samples and accurate detection of intra-host diversity, enabling not just basic research in personalized medicine but also accurate diagnostics and monitoring drug therapies, which are critical in clinical and regulatory decision-making process.
Affiliation(s)
- Konstantinos Karagiannis
- Department of Biochemistry and Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA; Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD 20993, USA
| | - Vahan Simonyan
- Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD 20993, USA
| | - Konstantin Chumakov
- Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD 20993, USA
| | - Raja Mazumder
- Department of Biochemistry and Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA; McCormick Genomic and Proteomic Center, George Washington University, Washington, DC 20037, USA
| |
|
37
|
Effects of Mutations on Replicative Fitness and Major Histocompatibility Complex Class I Binding Affinity Are Among the Determinants Underlying Cytotoxic-T-Lymphocyte Escape of HIV-1 Gag Epitopes. mBio 2017; 8:mBio.01050-17. [PMID: 29184023 PMCID: PMC5705913 DOI: 10.1128/mbio.01050-17] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/20/2022] Open
Abstract
Certain “protective” major histocompatibility complex class I (MHC-I) alleles, such as B*57 and B*27, are associated with long-term control of HIV-1 in vivo mediated by the CD8+ cytotoxic-T-lymphocyte (CTL) response. However, the mechanism of such superior protection is not fully understood. Here we combined high-throughput fitness profiling of mutations in HIV-1 Gag, in silico prediction of MHC-peptide binding affinity, and analysis of intraperson virus evolution to systematically compare differences with respect to CTL escape mutations between epitopes targeted by protective MHC-I alleles and those targeted by nonprotective MHC-I alleles. We observed that the effects of mutations on both viral replication and MHC-I binding affinity are among the determinants of CTL escape. Mutations in Gag epitopes presented by protective MHC-I alleles are associated with significantly higher fitness cost and lower reductions in binding affinity with respect to MHC-I. A linear regression model accounting for the effect of mutations on both viral replicative capacity and MHC-I binding can explain the protective efficacy of MHC-I alleles. Finally, we found a consistent pattern in the evolution of Gag epitopes in long-term nonprogressors versus progressors. Overall, our results suggest that certain protective MHC-I alleles allow superior control of HIV-1 by targeting epitopes where mutations typically incur high fitness costs and small reductions in MHC-I binding affinity. Understanding the mechanism of viral control achieved in long-term nonprogressors with protective HLA alleles provides insights for developing functional cure of HIV infection. Through the characterization of CTL escape mutations in infected persons, previous researchers hypothesized that protective alleles target epitopes where escape mutations significantly reduce viral replicative capacity. However, these studies were usually limited to a few mutations observed in vivo. 
Here we utilized our recently developed high-throughput fitness profiling method to quantitatively measure the fitness of mutations across the entirety of HIV-1 Gag. The data enabled us to integrate the results with in silico prediction of MHC-peptide binding affinity and analysis of intraperson virus evolution to systematically determine the differences in CTL escape mutations between epitopes targeted by protective HLA alleles and those targeted by nonprotective HLA alleles. We observed that the effects of Gag epitope mutations on HIV replicative fitness and MHC-I binding affinity are among the major determinants of CTL escape.
|
38
|
Cesana D, Santoni de Sio FR, Rudilosso L, Gallina P, Calabria A, Beretta S, Merelli I, Bruzzesi E, Passerini L, Nozza S, Vicenzi E, Poli G, Gregori S, Tambussi G, Montini E. HIV-1-mediated insertional activation of STAT5B and BACH2 trigger viral reservoir in T regulatory cells. Nat Commun 2017; 8:498. [PMID: 28887441 PMCID: PMC5591266 DOI: 10.1038/s41467-017-00609-1] [Citation(s) in RCA: 70] [Impact Index Per Article: 10.0] [Received: 08/08/2016] [Accepted: 07/12/2017] [Indexed: 12/13/2022] Open
Abstract
HIV-1 insertions targeting BACH2 or MLK2 are enriched and persist for decades in hematopoietic cells from patients under combination antiretroviral therapy. However, it is unclear how these insertions provide such selective advantage to infected cell clones. Here, we show that in 30/87 (34%) patients under combination antiretroviral therapy, BACH2, and STAT5B are activated by insertions triggering the formation of mRNAs that contain viral sequences fused by splicing to their first protein-coding exon. These chimeric mRNAs, predicted to express full-length proteins, are enriched in T regulatory and T central memory cells, but not in other T lymphocyte subsets or monocytes. Overexpression of BACH2 or STAT5B in primary T regulatory cells increases their proliferation and survival without compromising their function. Hence, we provide evidence that HIV-1-mediated insertional activation of BACH2 and STAT5B favor the persistence of a viral reservoir in T regulatory cells in patients under combination antiretroviral therapy. HIV insertions in hematopoietic cells are enriched in BACH2 or MLK2 genes, but the selective advantages conferred are unknown. Here, the authors show that BACH2 and additionally STAT5B are activated by viral insertions, generating chimeric mRNAs specifically enriched in T regulatory cells favoring their persistence.
Affiliation(s)
- Daniela Cesana
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy.
| | - Francesca R Santoni de Sio
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Laura Rudilosso
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Pierangela Gallina
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Andrea Calabria
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Stefano Beretta
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Viale Sarca 336, Milan, 20126, Italy; National Research Council, Institute for Biomedical Technologies, Via Fratelli Cervi 93, Segrate, 20090, Italy
| | - Ivan Merelli
- National Research Council, Institute for Biomedical Technologies, Via Fratelli Cervi 93, Segrate, 20090, Italy
| | - Elena Bruzzesi
- Department of Infectious Diseases, IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Laura Passerini
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Silvia Nozza
- Department of Infectious Diseases, IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Elisa Vicenzi
- Viral Pathogens and Biosafety Unit, Division of Immunology, Transplantation and Infectious Diseases, IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Guido Poli
- AIDS Immunopathogenesis Unit, Division of Immunology, Transplantation and Infectious Diseases, IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy; Vita-Salute San Raffaele University School of Medicine, Milan, 20132, Italy
| | - Silvia Gregori
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Giuseppe Tambussi
- Department of Infectious Diseases, IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Eugenio Montini
- San Raffaele Telethon Institute for Gene Therapy (SR-Tiget), IRCCS, San Raffaele Scientific Institute, Milan, 20132, Italy.
| |
|
39
|
Time-Sampled Population Sequencing Reveals the Interplay of Selection and Genetic Drift in Experimental Evolution of Potato Virus Y. J Virol 2017; 91:JVI.00690-17. [PMID: 28592544 PMCID: PMC5533922 DOI: 10.1128/jvi.00690-17] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 04/21/2017] [Accepted: 05/28/2017] [Indexed: 11/20/2022] Open
Abstract
RNA viruses are one of the fastest-evolving biological entities. Within their hosts, they exist as genetically diverse populations (i.e., viral mutant swarms), which are sculpted by different evolutionary mechanisms, such as mutation, natural selection, and genetic drift, and also the interactions between genetic variants within the mutant swarms. To elucidate the mechanisms that modulate the population diversity of an important plant-pathogenic virus, we performed evolution experiments with Potato virus Y (PVY) in potato genotypes that differ in their defense response against the virus. Using deep sequencing of small RNAs, we followed the temporal dynamics of standing and newly generated variations in the evolving viral lineages. A time-sampled approach allowed us to (i) reconstruct theoretical haplotypes in the starting population by using clustering of single nucleotide polymorphisms' trajectories and (ii) use quantitative population genetics approaches to estimate the contribution of selection and genetic drift, and their interplay, to the evolution of the virus. We detected imprints of strong selective sweeps and narrow genetic bottlenecks, followed by the shift in frequency of selected haplotypes. Comparison of patterns of viral evolution in differently susceptible host genotypes indicated possible diversifying evolution of PVY in the less-susceptible host (efficient in the accumulation of salicylic acid). IMPORTANCE: High diversity of within-host populations of RNA viruses is an important aspect of their biology, since they represent a reservoir of genetic variants, which can enable quick adaptation of viruses to a changing environment. This study focuses on an important plant virus, Potato virus Y, and describes, at high resolution, temporal changes in the structure of viral populations within different potato genotypes. A novel and easy-to-implement computational approach was established to cluster single nucleotide polymorphisms into viral haplotypes from very short sequencing reads. During the experiment, a shift in the frequency of selected viral haplotypes was observed after a narrow genetic bottleneck, indicating an important role of the genetic drift in the evolution of the virus. On the other hand, a possible case of diversifying selection of the virus was observed in less susceptible host genotypes.
|
40
|
Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res 2017; 27:835-848. [PMID: 28396522 PMCID: PMC5411778 DOI: 10.1101/gr.215038.116] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Received: 08/22/2016] [Accepted: 03/10/2017] [Indexed: 11/24/2022]
Abstract
A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE on two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
Affiliation(s)
| | | | - Eric Rivals
- LIRMM, CNRS and Université de Montpellier, 34095 Montpellier, France
- Institut Biologie Computationnelle, CNRS and Université de Montpellier, 34095 Montpellier, France
| | | |
|
41
|
aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity. Lecture Notes in Computer Science 2017. [DOI: 10.1007/978-3-319-56970-3_22] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/28/2022]
|
42
|
Artyomenko A, Wu NC, Mangul S, Eskin E, Sun R, Zelikovsky A. Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants. J Comput Biol 2016; 24:558-570. [PMID: 27901586 DOI: 10.1089/cmb.2016.0146] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 12/14/2022] Open
Abstract
As a result of a high rate of mutations and recombination events, an RNA-virus exists as a heterogeneous "swarm" of mutant variants. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, high error rate limits the ability to reconstruct heterogeneous viral population composed of rare, related mutant variants. In this article, we present two single-nucleotide variants (2SNV), a method able to tolerate the high error rate of the single-molecule protocol and reconstruct mutant variants. 2SNV uses linkage between single-nucleotide variations to efficiently distinguish them from read errors. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. Our method is able to accurately reconstruct clone with frequency of 0.2% and distinguish clones that differed in only two nucleotides distantly located on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction.
Affiliation(s)
| | - Nicholas C Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California
| | - Serghei Mangul
- Department of Computer Science, University of California, Los Angeles, Los Angeles, California; Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, Los Angeles, California
| | - Ren Sun
- Molecular and Medical Pharmacology, University of California, Los Angeles, Los Angeles, California
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, Georgia
| |
|
43
|
Posada-Cespedes S, Seifert D, Beerenwinkel N. Recent advances in inferring viral diversity from high-throughput sequencing data. Virus Res 2016; 239:17-32. [PMID: 27693290 DOI: 10.1016/j.virusres.2016.09.016] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 06/24/2016] [Revised: 09/23/2016] [Accepted: 09/24/2016] [Indexed: 02/05/2023]
Abstract
Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus population. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.
Affiliation(s)
- Susana Posada-Cespedes
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
| | - David Seifert
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
| | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland.
| |
|
44
|
Rose R, Constantinides B, Tapinos A, Robertson DL, Prosperi M. Challenges in the analysis of viral metagenomes. Virus Evol 2016; 2:vew022. [PMID: 29492275 PMCID: PMC5822887 DOI: 10.1093/ve/vew022] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Indexed: 11/17/2022] Open
Abstract
Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.
Affiliation(s)
- Rebecca Rose
- BioInfoExperts, Norfolk, VA, USA; Computational and Evolutionary Biology Faculty of Life Sciences, University of Manchester, Manchester, UK; Department of Epidemiology, University of Florida, Gainesville, FL, USA
| | - Bede Constantinides
- BioInfoExperts, Norfolk, VA, USA; Computational and Evolutionary Biology Faculty of Life Sciences, University of Manchester, Manchester, UK; Department of Epidemiology, University of Florida, Gainesville, FL, USA
| | - Avraam Tapinos
- BioInfoExperts, Norfolk, VA, USA; Computational and Evolutionary Biology Faculty of Life Sciences, University of Manchester, Manchester, UK; Department of Epidemiology, University of Florida, Gainesville, FL, USA
| | - David L Robertson
- BioInfoExperts, Norfolk, VA, USA; Computational and Evolutionary Biology Faculty of Life Sciences, University of Manchester, Manchester, UK; Department of Epidemiology, University of Florida, Gainesville, FL, USA
| | - Mattia Prosperi
- BioInfoExperts, Norfolk, VA, USA; Computational and Evolutionary Biology Faculty of Life Sciences, University of Manchester, Manchester, UK; Department of Epidemiology, University of Florida, Gainesville, FL, USA
| |
|
45
|
Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants. Lecture Notes in Computer Science 2016. [DOI: 10.1007/978-3-319-31957-5_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/17/2022]
|
46
|
A Comprehensive Analysis of Primer IDs to Study Heterogeneous HIV-1 Populations. J Mol Biol 2015; 428:238-250. [PMID: 26711506 DOI: 10.1016/j.jmb.2015.12.012] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 08/07/2015] [Revised: 11/25/2015] [Accepted: 12/16/2015] [Indexed: 01/01/2023]
Abstract
Determining the composition of viral populations is becoming increasingly important in the field of medical virology. While recently developed computational tools for viral haplotype analysis allow for correcting sequencing errors, they do not always allow for the removal of errors occurring in the upstream experimental protocol, such as PCR errors. Primer IDs (pIDs) are one method to address this problem by harnessing redundant template resampling for error correction. By using a reference mixture of five HIV-1 strains, we show how pIDs can be useful for estimating key experimental parameters, such as the substitution rate of the PCR process and the reverse transcription (RT) error rate. In addition, we introduce a hidden Markov model for determining the recombination rate of the RT PCR process. We found no strong sequence-specific bias in pID abundances (the same RT efficiencies as compared to commonly used short, specific RT primers) and no effects of pIDs on the estimated distribution of the references viruses.
|
47
|
Jayasundara D, Saeed I, Chang BC, Tang SL, Halgamuge SK. Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness. BMC Bioinformatics 2015; 16 Suppl 18:S3. [PMID: 26678073 PMCID: PMC4682401 DOI: 10.1186/1471-2105-16-s18-s3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation ride on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. RESULTS: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. CONCLUSIONS: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors. AVAILABILITY: http://sourceforge.net/projects/viquas/.
Affiliation(s)
- Duleepa Jayasundara
- Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
- I Saeed
- Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
- BC Chang
- Yourgene Bioscience, No. 376-5, Fuxing Rd., Shu-Lin District, New Taipei City, Taiwan
- Sen-Lin Tang
- Biodiversity Research Center, Academia Sinica, Taipei 11529, Nan-Kang, Taiwan
- Saman K Halgamuge
- Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
|
48
|
Wu SH, Rodrigo AG. Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals. BMC Bioinformatics 2015; 16:357. [PMID: 26536860 PMCID: PMC4634753 DOI: 10.1186/s12859-015-0810-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2015] [Accepted: 10/30/2015] [Indexed: 11/17/2022]
Abstract
Background: Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for the evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA using the frequencies of nucleotides at each site only without reconstructing the full-length alignment nor the phylogeny. Results: We used simulations to evaluate the performance of these algorithms, and our results demonstrate that LS performs poorly because bootstrap 95% Confidence Intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, ABC-MCMC 95% Highest Posterior Density (HPD) intervals recovered from ABC-MCMC enclosed the true parameter values with a rate approximately equivalent to that obtained using BEAST, a program that implements a Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because there is a loss of information with the use of sitewise nucleotide frequencies alone, the ABC-MCMC 95% HPDs are larger than those obtained by BEAST. Conclusion: We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95% HPDs.
One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences, and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.
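The summary statistic described in this abstract, the frequency of each nucleotide at each site, can be illustrated with a minimal sketch. This is not the SF-ABC implementation; it assumes that reads are already placed on a shared 0-based coordinate system as `(start, sequence)` pairs without indels, and the function name `sitewise_frequencies` is hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def sitewise_frequencies(reads, genome_length):
    """Compute per-site nucleotide frequencies from short reads.

    Assumptions: each read is a (start, sequence) pair on a shared 0-based
    coordinate system, with no insertions or deletions. Sites without any
    coverage are reported as None.
    """
    counts = [Counter() for _ in range(genome_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            # Ignore bases outside the region and ambiguous calls such as N.
            if 0 <= position < genome_length and base in NUCLEOTIDES:
                counts[position][base] += 1

    frequencies = []
    for site in counts:
        depth = sum(site.values())
        if depth == 0:
            frequencies.append(None)
        else:
            frequencies.append({b: site[b] / depth for b in NUCLEOTIDES})
    return frequencies


if __name__ == "__main__":
    toy_reads = [(0, "ACG"), (1, "CGT"), (1, "TGT")]
    for position, freq in enumerate(sitewise_frequencies(toy_reads, 5)):
        print(position, freq)
```

Each read is visited once, so building this statistic takes time proportional to the total number of sequenced bases; the linkage between sites within a read is discarded, which is the loss of information the abstract refers to.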
Affiliation(s)
- Steven H Wu
- Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA; Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA
- Allen G Rodrigo
- Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA; The National Evolutionary Synthesis Center, Durham, NC, 27705, USA
|
49
|
Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering. APPL INTELL 2015. [DOI: 10.1007/s10489-015-0714-6] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Indexed: 11/26/2022]
|
50
|
Pulido-Tamayo S, Sánchez-Rodríguez A, Swings T, Van den Bergh B, Dubey A, Steenackers H, Michiels J, Fostier J, Marchal K. Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations. Nucleic Acids Res 2015; 43:e105. [PMID: 25990729 PMCID: PMC4652744 DOI: 10.1093/nar/gkv478] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 02/20/2015] [Accepted: 04/29/2015] [Indexed: 11/23/2022]
Abstract
Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequency at which the haplotypes occur. However, this reconstruction is technically not trivial, especially not in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus obviates the reconstruction of genome-wide haplotypes based on sequence overlap information. Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As was shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.
Affiliation(s)
- Sergio Pulido-Tamayo
- Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium; Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Aminael Sánchez-Rodríguez
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Departamento de Ciencias Naturales, Universidad Técnica Particular de Loja, San Cayetano Alto S/N, EC1101608 Loja, Ecuador
- Toon Swings
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Bram Van den Bergh
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Akanksha Dubey
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Hans Steenackers
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Jan Michiels
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Jan Fostier
- Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium
- Kathleen Marchal
- Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium; Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
|