1
do Lago BV, Bezerra CS, Moreira DA, Parente TE, Portilho MM, Pessôa R, Sanabani SS, Villar LM. Genetic diversity of hepatitis B virus quasispecies in different biological compartments reveals distinct genotypes. Sci Rep 2023; 13:17023. [PMID: 37813888 PMCID: PMC10562391 DOI: 10.1038/s41598-023-43655-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 09/26/2023] [Indexed: 10/11/2023]
Abstract
The selection pressure imposed by the host immune system impacts hepatitis B virus (HBV) quasispecies variability. This study evaluates HBV genetic diversity in different biological fluids. Twenty paired serum, oral fluid, and dried blood spot (DBS) samples from chronic HBV carriers were analyzed using both Sanger and next generation sequencing (NGS). The mean HBV viral load in serum was 5.19 ± 4.3 log IU/mL (median 5.29, IQR 3.01-7.93). Genotype distribution was: HBV/A1 55% (11/20), A2 15% (3/20), D3 10% (2/20), F2 15% (3/20), and F4 5% (1/20). Genotype agreement between serum and oral fluid was 100% (genetic distances 0.0-0.006), while that between serum and DBS was 80% (genetic distances 0.0-0.115). Two individuals presented discordant genotypes in serum and DBS. Minor population analysis revealed a mixed population. All samples displayed mutations in polymerase and/or surface genes. Major population analysis of the polymerase pointed to positions H122 and M129 as the most polymorphic (≥ 75% variability), followed by V163 (55%) and I253 (50%). Neither Sanger nor NGS detected any antiviral primary resistance mutations in the major populations. Minor population analysis, however, demonstrated the rtM204I resistance mutation in all individuals, ranging from 2.8 to 7.5% in serum, 2.5 to 6.3% in oral fluid, and 3.6 to 7.2% in DBS. This study demonstrated that different fluids can be used to assess HBV diversity; nonetheless, genotypic differences according to biological compartments can be observed.
Affiliation(s)
- Bárbara Vieira do Lago
  - Laboratório de Hepatites Virais, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Rio de Janeiro, Brazil
- Cristianne Sousa Bezerra
  - Laboratório de Hepatites Virais, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Rio de Janeiro, Brazil
  - Departamento de Educação, Instituto Federal de Educação, Ciência e Tecnologia do Ceará, Fortaleza, Ceará, Brazil
- Daniel Andrade Moreira
  - Laboratório de Genômica Aplicada e Bioinovações, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
- Thiago Estevam Parente
  - Laboratório de Genômica Aplicada e Bioinovações, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
- Rodrigo Pessôa
  - Postgraduate Program in Translational Medicine, Department of Medicine, Federal University of Sao Paulo (UNIFESP), São Paulo, Brazil
- Sabri Saeed Sanabani
  - Laboratory of Medical Investigation (LIM) 03, Clinics Hospital, Faculty of Medicine, University of São Paulo, São Paulo, Brazil
- Livia Melo Villar
  - Laboratório de Hepatites Virais, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Rio de Janeiro, Brazil
2
Lee JH, Kim HS. Current laboratory tests for diagnosis of hepatitis B virus infection. Int J Clin Pract 2021; 75:e14812. [PMID: 34487586 DOI: 10.1111/ijcp.14812] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/14/2021] [Accepted: 09/03/2021] [Indexed: 12/31/2022]
Abstract
BACKGROUND Hepatitis B virus (HBV) has a long history in human infectious diseases. HBV infection can progress chronically, leading to cancer. After introduction of a vaccine, the overall incidence rate of HBV infection has decreased, although it remains a health problem in many countries. PURPOSE The aim of this review was to summarise current diagnostic efforts for HBV infection and future HBV diagnosis perspectives. METHODS We reviewed and summarised current laboratory diagnosis related to HBV infection in clinical practice. RESULTS There have been various serologic- and molecular-based methods to diagnose acute or chronic HBV infection. Since intrahepatic covalently closed circular DNAs (cccDNAs) function as robust HBV replication templates, cure of chronic HBV infection is limited. Recently, new biomarkers such as hepatitis B virus core-related antigen (HBcrAg) and HBV RNA have emerged that appear to reflect intrahepatic cccDNA status. These new biomarkers should be validated before clinical usage. CONCLUSION An effective diagnostic approach and current updated knowledge of treatment response monitoring are important for HBV infection management. New ultrasensitive and accurate immunologic methods may pave the way to manage HBV infection in parallel with the immunotherapy era.
Affiliation(s)
- Jong-Han Lee
  - Department of Laboratory Medicine, Yonsei University Wonju College of Medicine, Wonju, Republic of Korea
- Hyon-Suk Kim
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
3
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmueller T, Sczyrba A, Dilthey A, Klawonn F, McHardy AC. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol 2021; 22:212. [PMID: 34281604 PMCID: PMC8287296 DOI: 10.1186/s13059-021-02426-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/03/2020] [Accepted: 06/29/2021] [Indexed: 01/03/2023]
Abstract
With viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
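As a rough illustration of the data structure the abstract names, the sketch below builds a small de Bruijn graph from reads and records k-mer coverage. It is not Haploflow's flow algorithm, which the abstract does not detail; the function names `build_de_bruijn` and `branching_nodes` and the toy reads are assumptions made for this example only. The point it shows is that two related strains share most of the graph and diverge at branching nodes, where the coverage of each branch hints at strain abundance.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers.

    Returns (edges, coverage), where edges maps a node to its set of
    successor nodes and coverage counts how often each k-mer was seen.
    """
    coverage = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            coverage[read[i:i + k]] += 1
    edges = defaultdict(set)
    for kmer in coverage:
        edges[kmer[:-1]].add(kmer[1:])
    return edges, coverage


def branching_nodes(edges):
    """Nodes with more than one successor, i.e. places where sequences diverge."""
    return sorted(node for node, succ in edges.items() if len(succ) > 1)


# Hypothetical toy data: a major strain seen three times and a minor strain once.
reads = ["ACGTAC"] * 3 + ["ACGGAC"]
edges, coverage = build_de_bruijn(reads, k=3)
print(branching_nodes(edges))
print(coverage["CGT"], coverage["CGG"])
```

In this toy input the graph branches at node `CG`, and the two outgoing k-mers carry coverage 3 and 1, mirroring the 3:1 strain mixture. A strain-aware assembler would need to go further and decide which paths through such branches belong to the same genome.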
Affiliation(s)
- Adrian Fritz
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Andreas Bremges
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Zhi-Luo Deng
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till Robin Lesker
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Jasper Götting
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
  - Institute of Virology, Hannover Medical School, Hannover, Germany
- Tina Ganzenmueller
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
  - Institute of Virology, Hannover Medical School, Hannover, Germany
  - Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- Alexander Sczyrba
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Alexander Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Frank Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  - Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Alice Carolyn McHardy
  - Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
4
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022]
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that still remains open. Here we propose CliqueSNV based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables identifying minority haplotypes with the frequency below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short and long bases sequencing platforms.
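The idea of separating real minority variants from noise by looking for linked mutations can be sketched in a few lines. The example below is only an illustration of that intuition, not CliqueSNV's actual statistical test or clique-merging step: the function name `linked_pairs`, the thresholds `min_ratio` and `min_count`, and the toy reads are all assumptions, and reads are assumed to be full-length and already aligned to the reference. It flags pairs of mismatches that appear together on the same read far more often than independent errors would predict.

```python
from collections import Counter
from itertools import combinations


def linked_pairs(reads, reference, min_ratio=5.0, min_count=2):
    """Return mismatch pairs that co-occur on reads more often than expected.

    Each mismatch is a (position, base) tuple. For a pair (a, b), the expected
    co-occurrence under independence is count(a) * count(b) / number_of_reads.
    """
    n = len(reads)
    single = Counter()
    pair = Counter()
    for read in reads:
        snvs = [(i, b) for i, (b, r) in enumerate(zip(read, reference)) if b != r]
        single.update(snvs)
        pair.update(combinations(snvs, 2))
    linked = []
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n
        if observed >= min_count and observed / expected >= min_ratio:
            linked.append((a, b, observed))
    return sorted(linked)


# Hypothetical toy data: 100 aligned reads of length 6.
reference = "AAAAAA"
reads = (
    ["AAAAAA"] * 94                      # majority haplotype
    + ["ACAAGA"] * 3                     # minority haplotype: C@1 and G@4 travel together
    + ["ACAAAA", "AAAAGA", "AAATAA"]     # isolated, error-like mismatches
)
print(linked_pairs(reads, reference))
```

Here the pair C at position 1 and G at position 4 is seen together on 3 reads, against an expectation of 0.16 under independence, so it is reported even though the haplotype sits at 3% frequency. The isolated mismatch at position 3 never forms a pair and is ignored.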
Affiliation(s)
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
  - International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
5
Alipanahi B, Muggli MD, Jundi M, Noyes NR, Boucher C. Metagenome SNP calling via read-colored de Bruijn graphs. Bioinformatics 2021; 36:5275-5281. [PMID: 32049324 DOI: 10.1093/bioinformatics/btaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/16/2018] [Revised: 01/08/2020] [Accepted: 02/03/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Metagenomics refers to the study of complex samples containing the genetic contents of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species are unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%) as the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahar Alipanahi
  - Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Martin D Muggli
  - Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Musa Jundi
  - Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Noelle R Noyes
  - Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Christina Boucher
  - Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
6
Muralidharan HS, Shah N, Meisel JS, Pop M. Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins. Front Microbiol 2021; 12:638561. [PMID: 33717033 PMCID: PMC7945042 DOI: 10.3389/fmicb.2021.638561] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/07/2020] [Accepted: 02/04/2021] [Indexed: 01/03/2023]
Abstract
High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented, due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs where contigs are nodes and edges are inferred based on paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins, and captures a broader set of the genes of the organisms being reconstructed.
Affiliation(s)
- Harihara Subrahmaniam Muralidharan
  - Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
- Nidhi Shah
  - Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
- Jacquelyn S Meisel
  - Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
- Mihai Pop
  - Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
7
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmüller T, Sczyrba A, Dilthey A, Klawonn F, McHardy A. Haploflow: Strain-resolved de novo assembly of viral genomes. bioRxiv 2021:2021.01.25.428049. [PMID: 33532769 PMCID: PMC7852260 DOI: 10.1101/2021.01.25.428049] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/22/2023]
Abstract
In viral infections, multiple related viral strains are often present, due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.
Affiliation(s)
- A. Fritz
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
- A. Bremges
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
- Z.-L. Deng
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- T.-R. Lesker
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- J. Götting
  - DZIF, German Centre for Infection Research
  - Institute of Virology, Hannover Medical School, Hannover, Germany
- T. Ganzenmüller
  - DZIF, German Centre for Infection Research
  - Institute of Virology, Hannover Medical School, Hannover, Germany
  - Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- A. Sczyrba
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- A. Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- F. Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  - Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- A.C. McHardy
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
8
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371 PMCID: PMC8485218 DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023]
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
9
Streamlined Subpopulation, Subtype, and Recombination Analysis of HIV-1 Half-Genome Sequences Generated by High-Throughput Sequencing. mSphere 2020; 5:5/5/e00551-20. [PMID: 33055255 PMCID: PMC7565892 DOI: 10.1128/msphere.00551-20] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
High-throughput sequencing (HTS) has been widely used to characterize HIV-1 genome sequences. There are no algorithms currently that can directly determine genotype and quasispecies population using short HTS reads generated from long genome sequences without additional software. To establish a robust subpopulation, subtype, and recombination analysis workflow, we amplified the HIV-1 3′-half genome from plasma samples of 65 HIV-1-infected individuals and sequenced the entire amplicon (∼4,500 bp) by HTS. With direct analysis of raw reads using HIVE-hexahedron, we showed that 48% of samples harbored 2 to 13 subpopulations.
We identified various subtypes (17 A1s, 4 Bs, 27 Cs, 6 CRF02_AGs, and 11 unique recombinant forms) and defined recombinant breakpoints of 10 recombinants. These results were validated with viral genome sequences generated by single genome sequencing (SGS) or the analysis of consensus sequence of the HTS reads. The HIVE-hexahedron workflow is more sensitive and accurate than just evaluating the consensus sequence and also more cost-effective than SGS. IMPORTANCE The highly recombinogenic nature of human immunodeficiency virus type 1 (HIV-1) leads to recombination and emergence of quasispecies. It is important to reliably identify subpopulations to understand the complexity of a viral population for drug resistance surveillance and vaccine development. High-throughput sequencing (HTS) provides improved resolution over Sanger sequencing for the analysis of heterogeneous viral subpopulations. However, current methods of analysis of HTS reads are unable to fully address accurate population reconstruction. Hence, there is a dire need for a more sensitive, accurate, user-friendly, and cost-effective method to analyze viral quasispecies. For this purpose, we have improved the HIVE-hexahedron algorithm that we previously developed with in silico short sequences to analyze raw HTS short reads. The significance of this study is that our standalone algorithm enables a streamlined analysis of quasispecies, subtype, and recombination patterns from long HIV-1 genome regions without the need of additional sequence analysis tools. Distinct viral populations and recombination patterns identified by HIVE-hexahedron are further validated by comparison with sequences obtained by single genome sequencing (SGS).
10
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infect Genet Evol 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes.
PredictHaplo outperformed, in regard to highest precision, recall, and lowest UniFrac distance values, the other haplotype reconstruction tools followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
Affiliation(s)
- Anton Eliseev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Pavel Avdeyev
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
| |
Collapse
|
11
|
Li X, Saadat S, Hu H, Li X. BHap: a novel approach for bacterial haplotype reconstruction. Bioinformatics 2020; 35:4624-4631. [PMID: 31004480 DOI: 10.1093/bioinformatics/btz280] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 11/26/2018] [Revised: 03/07/2019] [Accepted: 04/13/2019] [Indexed: 12/13/2022]
Abstract
MOTIVATION: Bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task because they were usually developed for viral rather than bacterial populations. RESULTS: In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next-generation sequencing data. On simulated and experimental datasets, BHap reconstructed haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap had a low susceptibility to sequencing errors, could reconstruct haplotypes at low coverage and could handle a wide range of mutation rates. BHap outperformed existing approaches, with higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes. AVAILABILITY AND IMPLEMENTATION: The BHap tool is available at http://www.cs.ucf.edu/~xiaoman/BHap/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
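For readers unfamiliar with the metrics quoted in this abstract, the following is a minimal sketch of how precision, recall and F1 can be computed for a set of reconstructed haplotypes. It assumes that a predicted haplotype counts as correct only when it matches a true haplotype exactly; the matching criterion actually used in the BHap evaluation may differ, and the sequences are hypothetical.

```python
def haplotype_scores(true_haplotypes, predicted_haplotypes):
    """Return (precision, recall, F1) under exact-match scoring (an assumption)."""
    true_set = set(true_haplotypes)
    pred_set = set(predicted_haplotypes)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy example: four true haplotypes, three predictions, two correct.
truth = ["ACGT", "ACGA", "TCGA", "TCGT"]
predicted = ["ACGT", "ACGA", "GGGG"]

precision, recall, f1 = haplotype_scores(truth, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Because the abstract reports averages over several datasets, the averaged F1 does not have to equal the harmonic mean of the averaged precision and recall.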
Affiliation(s)
- Xin Li
  - Department of Computer Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
- Samaneh Saadat
  - Department of Computer Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
- Haiyan Hu
  - Department of Computer Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
- Xiaoman Li
  - Burnett School of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
|
12
|
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics 2019; 34:2927-2935. [PMID: 29617936 DOI: 10.1093/bioinformatics/bty202] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/28/2017] [Accepted: 04/02/2018] [Indexed: 12/29/2022]
Abstract
Motivation: RNA virus populations contain different but genetically related strains, all infecting an individual host. Reconstruction of the viral haplotypes is a fundamental step to characterize the virus population, predict their viral phenotypes and finally provide important information for clinical treatment and prevention. Advances in next-generation sequencing technologies open up new opportunities to assemble full-length haplotypes. However, error-prone short reads, high similarities between related strains, and an unknown number of haplotypes pose computational challenges for reference-free haplotype reconstruction. There is still much room to improve the performance of existing haplotype assembly tools. Results: In this work, we developed a de novo haplotype reconstruction tool named PEHaplo, which employs paired-end reads to distinguish highly similar strains in viral quasispecies data. It was applied to both simulated and real quasispecies data, and the results were benchmarked against several recently published de novo haplotype reconstruction tools. The comparison shows that PEHaplo outperforms the benchmarked tools in a comprehensive set of metrics. Availability and implementation: The source code and the documentation of PEHaplo are available at https://github.com/chjiao/PEHaplo. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiao Chen
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Yingchao Zhao
  - School of Computing and Information Sciences, Caritas Institute of Higher Education, Hong Kong, China
- Yanni Sun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
|
13
|
Baaijens JA, Van der Roest B, Köster J, Stougie L, Schönhuth A. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics 2019; 35:5086-5094. [DOI: 10.1093/bioinformatics/btz443] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 08/11/2018] [Revised: 04/17/2019] [Accepted: 05/27/2019] [Indexed: 11/14/2022]
Abstract
Motivation
Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly is the reconstruction of strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains is an important step for various treatment-related reasons. Reference genome independent (‘de novo’) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. The remaining challenge is to reconstruct full-length haplotypes together with their abundances from such contigs.
Results
We present Virus-VG as a de novo approach to viral haplotype reconstruction from preassembled contigs. Our method constructs a variation graph from the short input contigs without making use of a reference genome. Then, to obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of being compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Benchmarking experiments on challenging simulated and real datasets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates compared to the state-of-the-art viral quasispecies assemblers.
Availability and implementation
Virus-VG is freely available at https://bitbucket.org/jbaaijens/virus-vg.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jasmijn A Baaijens
  - Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Johannes Köster
  - Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
  - Medical Oncology, Dana Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Leen Stougie
  - Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
  - Department of Econometrics and Operations Research, Vrije Universiteit, Amsterdam, Netherlands
  - INRIA-Erable, Grenoble, France
- Alexander Schönhuth
  - Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
  - INRIA-Erable, Grenoble, France
  - Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
|
15
|
Ahn S, Ke Z, Vikalo H. Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics 2018; 34:i23-i31. [PMID: 29949976 PMCID: PMC6022648 DOI: 10.1093/bioinformatics/bty291] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 01/22/2023]
Abstract
Motivation: As RNA viruses mutate and adapt to environmental changes, often developing resistance to anti-viral vaccines and drugs, they form an ensemble of viral strains, a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small. Results: This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze HTS data and reconstruct viral quasispecies characterized by highly uneven frequencies of their components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities of 1-10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains. Availability and implementation: TenSQR is available at https://github.com/SoYeonA/TenSQR. Supplementary information: Supplementary data are available at Bioinformatics online.
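The successive-removal idea described in this abstract (infer the most abundant strain, remove its reads, repeat) can be sketched with a toy example. This is not the TenSQR algorithm: it replaces tensor factorization with a simple majority vote and assumes full-length, equal-length, pre-aligned reads. The function names, thresholds and reads are all hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def successive_strain_removal(reads, max_mismatches=1, min_reads=2):
    """Toy stand-in for successive strain inference with read removal.

    Repeatedly takes the most common remaining read as a strain, assigns all
    reads within `max_mismatches` of it, records the strain frequency, and
    removes the assigned reads before inferring the next strain.
    """
    remaining = list(reads)
    strains = []
    while remaining:
        candidate, _ = Counter(remaining).most_common(1)[0]
        assigned = [r for r in remaining if hamming(r, candidate) <= max_mismatches]
        if len(assigned) < min_reads:
            break  # too little support: treat what is left as noise
        strains.append((candidate, len(assigned) / len(reads)))
        remaining = [r for r in remaining if hamming(r, candidate) > max_mismatches]
    return strains

# Hypothetical reads: a dominant strain (one read with an error), a rare strain, one outlier.
reads = ["AAAAAA"] * 6 + ["AAAAAT"] + ["CCCCAA"] * 2 + ["GGGGGG"]

print(successive_strain_removal(reads))  # [('AAAAAA', 0.7), ('CCCCAA', 0.2)]
```

Removing the reads of the dominant strain first is what lets the rare strain stand out in the second round, which is the intuition the abstract gives for recovering low-frequency strains.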
Affiliation(s)
- Soyeon Ahn
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Ziqi Ke
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Haris Vikalo
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
|
16
|
aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity. J Comput Biol 2018; 25:637-648. [DOI: 10.1089/cmb.2017.0249] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 02/05/2023]
|
17
|
Wu IC, Liu WC, Chang TT. Applications of next-generation sequencing analysis for the detection of hepatocellular carcinoma-associated hepatitis B virus mutations. J Biomed Sci 2018; 25:51. [PMID: 29859540 PMCID: PMC5984823 DOI: 10.1186/s12929-018-0442-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/11/2018] [Accepted: 04/30/2018] [Indexed: 02/06/2023]
Abstract
BACKGROUND: Next-generation sequencing (NGS) is a powerful and high-throughput method for the detection of viral mutations. This article provides a brief overview of the optimization of NGS analysis for hepatocellular carcinoma (HCC)-associated hepatitis B virus (HBV) mutations, and of the hepatocarcinogenesis of relevant mutations. MAIN BODY: For the application of NGS analysis to the HBV genome, four noteworthy steps were identified in testing. First, a sample-specific reference sequence was the most effective mapping reference for NGS. Second, elongating the end of the reference sequence improved mapping performance at the end of the genome. Third, resetting the origin of the mapping reference sequence could probe deletion mutations and variants at a certain location with common mutations. Fourth, using a platform-specific cut-off value to distinguish authentic minority variants from technical artifacts was found to be highly effective. A systematic literature review found 167 previously studied HBV single nucleotide variants (SNVs), and 12 SNVs were determined to be associated with HCC by meta-analysis. From comprehensive research using an HBV genome-wide NGS analysis, 60 NGS-defined HCC-associated SNVs with their pathogenic frequencies were identified, with 19 reported previously. All 12 HCC-associated SNVs proved by meta-analysis were confirmed by NGS analysis, except for C1766T and T1768A, which were mainly expressed in genotypes A and D, but including the subgroup analysis of A1762T. Of the 41 novel NGS-defined HCC-associated SNVs, 31.7% (13/41) had cut-off values of SNV frequency lower than 20%. This showed that NGS could be used to detect HCC-associated SNVs with low SNV frequency. Most SNV II (the minor strains in the majority of non-HCC patients) had either low (< 20%) or high (> 80%) SNV frequencies in HCC patients, a characteristic U-shaped distribution pattern. The cut-off values of SNV frequency for HCC-associated SNVs represent their pathogenic frequencies. The pathogenic frequencies of HCC-associated SNV II also showed a U-shaped distribution. Hepatocarcinogenesis induced by HBV mutated proteins through cellular pathways was reviewed. CONCLUSION: NGS analysis is useful to discover novel HCC-associated HBV SNVs, especially those with low SNV frequency. The hepatocarcinogenetic mechanisms of novel HCC-associated HBV SNVs defined by NGS analysis deserve further investigation.
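The fourth step in this abstract (a cut-off that separates authentic minority variants from technical artifacts) and the low/high frequency bands behind the U-shaped distribution can be sketched as follows. The cut-off of 1%, the variant names and the frequencies are hypothetical placeholders; the abstract does not state the platform-specific values.

```python
def passes_cutoff(variant_frequencies, cutoff):
    """Keep variants whose observed frequency is at or above the cut-off."""
    return {name: freq for name, freq in variant_frequencies.items() if freq >= cutoff}

def frequency_band(freq, low=0.20, high=0.80):
    """Classify an SNV frequency using the < 20% and > 80% bands from the abstract."""
    if freq < low:
        return "low"
    if freq > high:
        return "high"
    return "intermediate"

# Hypothetical per-sample SNV frequencies (fractions of reads carrying the variant).
observed = {"snv_a": 0.85, "snv_b": 0.12, "snv_c": 0.004}

kept = passes_cutoff(observed, cutoff=0.01)  # 0.01 is an assumed cut-off
print(kept)  # {'snv_a': 0.85, 'snv_b': 0.12}
print({name: frequency_band(freq) for name, freq in kept.items()})
# {'snv_a': 'high', 'snv_b': 'low'}
```

A U-shaped distribution corresponds to most retained variants falling into the "low" or "high" band and few into "intermediate".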
Affiliation(s)
- I-Chin Wu
  - Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, 138 Sheng-Li Road, Tainan, 70403, Taiwan, Republic of China
  - Infectious Disease and Signaling Research Center, National Cheng Kung University, Tainan, Taiwan, Republic of China
- Wen-Chun Liu
  - Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, 138 Sheng-Li Road, Tainan, 70403, Taiwan, Republic of China
  - Infectious Disease and Signaling Research Center, National Cheng Kung University, Tainan, Taiwan, Republic of China
- Ting-Tsung Chang
  - Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, 138 Sheng-Li Road, Tainan, 70403, Taiwan, Republic of China
|
18
|
Leviyang S, Griva I, Ita S, Johnson WE. A penalized regression approach to haplotype reconstruction of viral populations arising in early HIV/SIV infection. Bioinformatics 2018; 33:2455-2463. [PMID: 28379346 DOI: 10.1093/bioinformatics/btx187] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 10/03/2016] [Accepted: 03/29/2017] [Indexed: 12/14/2022]
Abstract
Motivation: Next generation sequencing (NGS) has been increasingly applied to characterize viral evolution during HIV and SIV infections. In particular, NGS datasets sampled during the initial months of infection are characterized by relatively low levels of diversity as well as convergent evolution at multiple loci dispersed across the viral genome. Consequently, fully characterizing viral evolution from NGS datasets requires haplotype reconstruction across large regions of the viral genome. Existing haplotype reconstruction algorithms have not been developed with the particular characteristics of early HIV/SIV infection in mind, raising the possibility that better performance could be achieved through a specifically designed algorithm. Results: Here, we introduce a haplotype reconstruction algorithm, RegressHaplo, specifically designed for low diversity and convergent evolution regimes. The algorithm uses a penalized regression that balances a data fitting term with a penalty term that encourages solutions with few haplotypes. The regression covariates are a large set of potential haplotypes and fitting the regression is made computationally feasible by the low diversity setting. Using simulated and in vivo datasets, we compare RegressHaplo to PredictHaplo and QuRe, two existing haplotype reconstruction algorithms. RegressHaplo performs better than these algorithms on simulated datasets with relatively low diversity levels. We suggest RegressHaplo as a novel tool for the investigation of early infection HIV/SIV datasets and, more generally, low diversity viral NGS datasets. Contact: sr286@georgetown.edu. Availability and implementation: https://github.com/SLeviyang/RegressHaplo.
Affiliation(s)
- Sivan Leviyang
  - Department of Mathematics and Statistics, Georgetown University, Washington DC, 20057, USA
- Igor Griva
  - Department of Mathematics, George Mason University, Fairfax, VA 22030, USA
- Sergio Ita
  - Department of Medicine, University of California - San Diego, La Jolla, CA 92093, USA
- Welkin E Johnson
  - Department of Biology, Boston College, Chestnut Hill, MA 02467, USA
|
19
|
Karagiannis K, Simonyan V, Chumakov K, Mazumder R. Separation and assembly of deep sequencing data into discrete sub-population genomes. Nucleic Acids Res 2017; 45:10989-11003. [PMID: 28977510 PMCID: PMC5737798 DOI: 10.1093/nar/gkx755] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 09/07/2016] [Accepted: 08/16/2017] [Indexed: 12/15/2022]
Abstract
Sequence heterogeneity is a common characteristic of RNA viruses that is often referred to as sub-populations or quasispecies. Traditional techniques used for assembly of short sequence reads produced by deep sequencing, such as de-novo assemblers, ignore the underlying diversity. Here, we introduce a novel algorithm that simultaneously assembles discrete sequences of multiple genomes present in populations. Using in silico data we were able to detect populations at as low as 0.1% frequency with complete global genome reconstruction and in a single sample detected 16 resolved sequences with no mismatches. We also applied the algorithm to high throughput sequencing data obtained for viruses present in sewage samples and successfully detected multiple sub-populations and recombination events in these diverse mixtures. High sensitivity of the algorithm also enables genomic analysis of heterogeneous pathogen genomes from patient samples and accurate detection of intra-host diversity, enabling not just basic research in personalized medicine but also accurate diagnostics and monitoring drug therapies, which are critical in clinical and regulatory decision-making process.
Affiliation(s)
- Konstantinos Karagiannis
  - Department of Biochemistry and Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
  - Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD 20993, USA
- Vahan Simonyan
  - Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD 20993, USA
- Konstantin Chumakov
  - Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD 20993, USA
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, DC 20037, USA
|
20
|
Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res 2017; 27:835-848. [PMID: 28396522 PMCID: PMC5411778 DOI: 10.1101/gr.215038.116] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Received: 08/22/2016] [Accepted: 03/10/2017] [Indexed: 11/24/2022]
Abstract
A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE on two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
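The overlap graph mentioned in this abstract can be illustrated with a minimal sketch in which nodes are reads and a directed edge links two reads when a suffix of one exactly matches a prefix of the other. SAVAGE itself decides on edges using statistical criteria that tolerate sequencing errors; the exact-match rule, the `min_overlap` parameter and the reads below are simplifying assumptions.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0 if below min_overlap."""
    shortest_allowed = max(min_overlap, 1)
    for k in range(min(len(a), len(b)), shortest_allowed - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def build_overlap_graph(reads, min_overlap=3):
    """Return {(i, j): overlap} for every ordered read pair with a sufficient exact overlap."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            k = overlap_length(a, b, min_overlap)
            if k:
                edges[(i, j)] = k
    return edges

# Hypothetical error-free reads tiling one short sequence.
reads = ["ACGTAC", "TACGGA", "GGATTT"]

print(build_overlap_graph(reads))  # {(0, 1): 3, (1, 2): 3}
```

Following the edges 0 -> 1 -> 2 and merging the overlapping reads spells out a longer contig, which is the basic operation an overlap-based assembler iterates.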
Affiliation(s)
- Eric Rivals
  - LIRMM, CNRS and Université de Montpellier, 34095 Montpellier, France
  - Institut Biologie Computationnelle, CNRS and Université de Montpellier, 34095 Montpellier, France
|
21
|
Brass JRJ, Owens RA, Matoušek J, Steger G. Viroid quasispecies revealed by deep sequencing. RNA Biol 2017; 14:317-325. [PMID: 28027000 PMCID: PMC5367258 DOI: 10.1080/15476286.2016.1272745] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 09/12/2016] [Revised: 12/04/2016] [Accepted: 12/12/2016] [Indexed: 10/20/2022]
Abstract
Viroids are non-coding single-stranded circular RNA molecules that replicate autonomously in infected host plants causing mild to lethal symptoms. Their genomes contain about 250-400 nucleotides, depending on viroid species. Members of the family Pospiviroidae, like the Potato spindle tuber viroid (PSTVd), replicate via an asymmetric rolling-circle mechanism using the host DNA-dependent RNA-Polymerase II in the nucleus, while members of Avsunviroidae are replicated in a symmetric rolling-circle mechanism probably by the nuclear-encoded polymerase in chloroplasts. Viroids induce the production of viroid-specific small RNAs (vsRNA) that can direct (post-)transcriptional gene silencing against host transcripts or genomic sequences. Here, we used deep sequencing to analyze vsRNAs from plants infected with different PSTVd variants to elucidate the PSTVd quasispecies that evolved during infection. We recovered several novel as well as previously known PSTVd variants that were obviously competent in replication and identified common strand-specific mutations. The calculated mean error rate per nucleotide position was less than [Formula: see text], quite comparable to the value of [Formula: see text] reported for a member of Avsunviroidae. The resulting error threshold allows the synthesis of longer-than-unit-length replication intermediates as required by the asymmetric rolling-circle mechanism of members of Pospiviroidae.
Affiliation(s)
- Joseph R. J. Brass
  - Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany
- Robert A. Owens
  - United States Department of Agriculture, Agricultural Research Service, Molecular Plant Pathology Laboratory, Beltsville, MD, USA
- Jaroslav Matoušek
  - Biology Centre, CAS, v. v. i., Institute of Plant Molecular Biology, Branišovská, České Budějovice, Czech Republic
- Gerhard Steger
  - Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany
|
22
|
aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity. LECTURE NOTES IN COMPUTER SCIENCE 2017. [DOI: 10.1007/978-3-319-56970-3_22] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/28/2022]
|
23
|
Ghurye JS, Cepeda-Espinoza V, Pop M. Metagenomic Assembly: Overview, Challenges and Applications. THE YALE JOURNAL OF BIOLOGY AND MEDICINE 2016; 89:353-362. [PMID: 27698619 PMCID: PMC5045144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.
Affiliation(s)
- Mihai Pop
  - To whom all correspondence should be addressed: Mihai Pop, Department of Computer Science and Center of Bioinformatics and Computational Biology, University of Maryland, Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building. Rm. 3120F, College Park, MD 20742, Phone Number: 301-405-7245,
|
24
|
Posada-Cespedes S, Seifert D, Beerenwinkel N. Recent advances in inferring viral diversity from high-throughput sequencing data. Virus Res 2016; 239:17-32. [PMID: 27693290 DOI: 10.1016/j.virusres.2016.09.016] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 06/24/2016] [Revised: 09/23/2016] [Accepted: 09/24/2016] [Indexed: 02/05/2023]
Abstract
Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus population. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.
Affiliation(s)
- Susana Posada-Cespedes
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB, Basel, Switzerland
- David Seifert
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB, Basel, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB, Basel, Switzerland
|
25
|
Rybicka M, Stalke P, Bielawski KP. Current molecular methods for the detection of hepatitis B virus quasispecies. Rev Med Virol 2016; 26:369-81. [PMID: 27506508 DOI: 10.1002/rmv.1897] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 03/25/2016] [Revised: 06/16/2016] [Accepted: 06/22/2016] [Indexed: 01/20/2023]
Abstract
Chronic HBV infection affects more than 240 million people worldwide and is associated with a broad range of clinical manifestations including liver cirrhosis, liver failure and hepatocellular carcinoma. Because of the lack of an efficient cure for chronic hepatitis B, the main goal of antiviral therapy is the prevention of liver disease progression coupled with prolonged survival of patients. Because HBV viral load has been shown to be a crucial determinant of the progression of liver damage, these goals can be achieved as long as HBV replication can be suppressed. Unfortunately, long-term therapy with the low-to-moderate genetic barrier drugs, which are still recommended in a majority of developing countries, is strongly associated with HBV resistance development and treatment failure. In such cases, the precise and accurate determination of drug-resistant variants in an individual patient before treatment is important for a proper choice of potent first-line therapy. Nowadays, a number of techniques are available to study HBV quasispecies evolution. This review describes the advantages and limitations of various assays detecting drug-resistant HBV variants. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Magda Rybicka
  - Intercollegiate Faculty of Biotechnology, University of Gdansk and Medical University of Gdansk, Gdansk, Poland
- Piotr Stalke
  - Department of Infectious Diseases, Medical University of Gdansk, Gdansk, Poland
- Krzysztof Piotr Bielawski
  - Intercollegiate Faculty of Biotechnology, University of Gdansk and Medical University of Gdansk, Gdansk, Poland
|
26
|
Chen N, Trible BR, Kerrigan MA, Tian K, Rowland RRR. ORF5 of porcine reproductive and respiratory syndrome virus (PRRSV) is a target of diversifying selection as infection progresses from acute infection to virus rebound. INFECTION GENETICS AND EVOLUTION 2016; 40:167-175. [PMID: 26961593 DOI: 10.1016/j.meegid.2016.03.002] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 01/30/2016] [Revised: 02/28/2016] [Accepted: 03/02/2016] [Indexed: 02/05/2023]
Abstract
Genetic variation in both structural and nonstructural genes is a key factor in the capacity of porcine reproductive and respiratory syndrome virus (PRRSV) to evade host defenses and persist within animals, farms and metapopulations. However, the exact mechanisms by which genetic variation contributes to immune evasion remain unclear. In a study to understand the role of host genetics in disease resistance, a population of pigs was experimentally infected with a type 2 PRRSV isolate. Four pigs that showed virus rebound at 42 days post-infection (dpi) were analyzed by 454 sequencing to characterize the rebound quasispecies. Deep sequencing of variable regions in nsp1, nsp2, ORF3 and ORF5 showed the largest number of nucleotide substitutions at day 28 compared to days 4 and 42 post-infection. Differences in genetic variation were also found when comparing tonsil with serum. The dN/dS ratios showed that the same regions evolved under negative selection. However, eight amino acid sites were identified as possessing significant levels of positive selection, including A27V and N32S substitutions in the GP5 ectodomain region. These changes may alter GP5 peptide signal sequence processing and N-glycosylation, respectively. The results indicate that the greatest genetic diversity occurs during the transition between acute and rebound stages of infection, and the introduction of mutations that may result in a gain of fitness provides a potential mechanism for persistence.
Affiliation(s)
- Nanhua Chen
  - Department of Diagnostic Medicine and Pathobiology, College of Veterinary Medicine, Kansas State University, Manhattan, KS 66506, United States; College of Veterinary Medicine, Yangzhou University, Yangzhou, Jiangsu 225009, PR China
- Benjamin R Trible
  - Department of Diagnostic Medicine and Pathobiology, College of Veterinary Medicine, Kansas State University, Manhattan, KS 66506, United States
- Maureen A Kerrigan
  - Department of Diagnostic Medicine and Pathobiology, College of Veterinary Medicine, Kansas State University, Manhattan, KS 66506, United States
- Kegong Tian
  - OIE Porcine Reproductive and Respiratory Syndrome Reference Laboratory, Beijing, PR China
- Raymond R R Rowland
  - Department of Diagnostic Medicine and Pathobiology, College of Veterinary Medicine, Kansas State University, Manhattan, KS 66506, United States
|
27
|
Bellecave P, Recordon-Pinson P, Fleury H. Evaluation of Automatic Analysis of Ultradeep Pyrosequencing Raw Data to Determine Percentages of HIV Resistance Mutations in Patients Followed-Up in Hospital. AIDS Res Hum Retroviruses 2016; 32:85-92. [PMID: 26529549 DOI: 10.1089/aid.2015.0201] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
A major obstacle to using next generation sequencing (NGS) technology in clinical routine practice is reliable data analysis. Thousands of sequences need to be aligned and validated, to exclude sequencing artifacts and generate accurate results. We compared two analysis pipelines for Roche 454 ultradeep pyrosequencing (UDPS) raw data generated from HIV-1 clinical samples: a commercial and fully automated Web-based software NGS HIV-1 Module (SmartGene, Zug, Switzerland) vs. the Amplicon Variant Analyzer software (AVA, 454 Life Sciences; Roche). Results were also compared to those obtained with Sanger sequencing. HIV-1 reverse transcriptase and protease genes from 34 plasma samples were submitted to Sanger sequencing and GS Junior UDPS. Raw UDPS data (sff files) from all samples were analyzed with AVA 2.7 software plus manual review of the alignments and the fully automated SmartGene NGS HIV-1 Module prototype (SMG). Results obtained with both analysis pipelines showed good correlation (85.0%). Divergent results were mainly observed at homopolymer positions, such as K101, where the frame-aware alignment and error corrections of the automated approach were more efficient and more accurate, both in terms of detecting and quantifying drug resistance mutations. Our study shows that NGS data can easily be analyzed via a fully automated analysis pipeline, here the SmartGene NGS HIV-1 Module, thus minimizing the need for manual review of alignments by the user, otherwise essential to ensure accurate results. Such automated analysis pipelines may facilitate the adoption of NGS platforms in the routine clinical laboratory.
Affiliation(s)
- Pantxika Bellecave
  - CNRS-UMR 5234, Microbiologie Fondamentale et Pathogénicité, Université Bordeaux Segalen, Bordeaux, France
  - Centre Hospitalier Universitaire de Bordeaux (CHU), Laboratoire de Virologie, Bordeaux, France
- Patricia Recordon-Pinson
  - CNRS-UMR 5234, Microbiologie Fondamentale et Pathogénicité, Université Bordeaux Segalen, Bordeaux, France
  - Centre Hospitalier Universitaire de Bordeaux (CHU), Laboratoire de Virologie, Bordeaux, France
- Hervé Fleury
  - CNRS-UMR 5234, Microbiologie Fondamentale et Pathogénicité, Université Bordeaux Segalen, Bordeaux, France
  - Centre Hospitalier Universitaire de Bordeaux (CHU), Laboratoire de Virologie, Bordeaux, France
|
28
|
Jayasundara D, Saeed I, Chang BC, Tang SL, Halgamuge SK. Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness. BMC Bioinformatics 2015; 16 Suppl 18:S3. [PMID: 26678073 PMCID: PMC4682401 DOI: 10.1186/1471-2105-16-s18-s3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation rely on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. RESULTS: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. CONCLUSIONS: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors. AVAILABILITY: http://sourceforge.net/projects/viquas/.
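The precision, recall and F-score figures quoted in this abstract can be read against a small sketch like the one below. It is only an illustration under stated assumptions: the function name `spectrum_scores`, the toy sequences, and the exact-match criterion for counting a reconstructed haplotype as correct are all hypothetical and are not taken from ViQuaS, ShoRAH, or the paper.

```python
def spectrum_scores(true_haplotypes, reconstructed):
    """Precision, recall and F-score of a reconstructed haplotype set.

    Assumption: a reconstructed haplotype counts as correct only if it
    matches a true haplotype exactly (hypothetical criterion).
    """
    truth = set(true_haplotypes)
    called = set(reconstructed)
    true_pos = len(truth & called)
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy data: three true strains, four reconstructed ones (two are spurious).
truth = ["ACGT", "ACGA", "TCGT"]
called = ["ACGT", "ACGA", "ACGG", "TTGT"]
precision, recall, f_score = spectrum_scores(truth, called)
print(precision, recall, f_score)
```

In this toy case the two spurious haplotypes lower precision but leave recall untouched, which is the kind of trade-off the abstract refers to when it reports precision gains without loss of recall.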
Affiliation(s)
- Duleepa Jayasundara
  - Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
- I Saeed
  - Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
- BC Chang
  - Yourgene Bioscience, No. 376-5, Fuxing Rd., Shu-Lin District, New Taipei City, Taiwan
- Sen-Lin Tang
  - Biodiversity Research Center, Academia Sinica, Taipei 11529, Nan-Kang, Taiwan
- Saman K Halgamuge
  - Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
|
29
|
Dilernia DA, Chien JT, Monaco DC, Brown MPS, Ende Z, Deymier MJ, Yue L, Paxinos EE, Allen S, Tirado-Ramos A, Hunter E. Multiplexed highly-accurate DNA sequencing of closely-related HIV-1 variants using continuous long reads from single molecule, real-time sequencing. Nucleic Acids Res 2015; 43:e129. [PMID: 26101252 PMCID: PMC4787755 DOI: 10.1093/nar/gkv630] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2015] [Revised: 06/02/2015] [Accepted: 06/05/2015] [Indexed: 01/30/2023] Open
Abstract
Single Molecule, Real-Time (SMRT) Sequencing (Pacific Biosciences, Menlo Park, CA, USA) provides the longest continuous DNA sequencing reads currently available. However, the relatively high error rate in the raw read data requires novel analysis methods to deconvolute sequences derived from complex samples. Here, we present a workflow of novel computer algorithms able to reconstruct viral variant genomes present in mixtures with an accuracy of >QV50. This approach relies exclusively on Continuous Long Reads (CLR), which are the raw reads generated during SMRT Sequencing. We successfully implement this workflow for simultaneous sequencing of mixtures containing up to forty different >9 kb HIV-1 full genomes. This was achieved using a single SMRT Cell for each mixture and desktop computing power. This novel approach opens the possibility of solving complex sequencing tasks that currently lack a solution.
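To put the ">QV50" figure in context, the sketch below converts a quality value to an error probability, assuming QV here follows the standard Phred convention (QV = -10 log10 of the error probability). The function names and the 9000-base genome length used in the example are illustrative assumptions, not values taken from the paper's workflow.

```python
import math


def qv_to_error_probability(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_probability_to_qv(p):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(p)


def expected_errors(qv, genome_length):
    """Expected number of erroneous bases in a sequence of the given length."""
    return genome_length * qv_to_error_probability(qv)


p50 = qv_to_error_probability(50)       # one error per 100,000 bases
errors_9kb = expected_errors(50, 9000)  # hypothetical ~9 kb genome
print(p50, errors_9kb)
```

Under that assumption, a reconstruction at QV50 would be expected to contain well under one erroneous base per 9 kb genome.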
Affiliation(s)
- Jung-Ting Chien
  - Emory Vaccine Center, Emory University, Atlanta, GA, 30329, USA
- Zachary Ende
  - Emory Vaccine Center, Emory University, Atlanta, GA, 30329, USA
- Ling Yue
  - Emory Vaccine Center, Emory University, Atlanta, GA, 30329, USA
- Susan Allen
  - Pathology and Laboratory Medicine, Emory University, Atlanta, 30322, GA
- Eric Hunter
  - Emory Vaccine Center, Emory University, Atlanta, GA, 30329, USA; Pathology and Laboratory Medicine, Emory University, Atlanta, 30322, GA
|
30
|
Choudhury MA, Lott WB, Banu S, Cheng AY, Teo YY, Ong RTH, Aaskov J. Nature and Extent of Genetic Diversity of Dengue Viruses Determined by 454 Pyrosequencing. PLoS One 2015; 10:e0142473. [PMID: 26566128 PMCID: PMC4643897 DOI: 10.1371/journal.pone.0142473] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2013] [Accepted: 10/22/2015] [Indexed: 12/23/2022] Open
Abstract
Dengue virus (DENV) populations are characteristically highly diverse. Regular lineage extinction and replacement is an important dynamic feature of DENV, and most DENV lineage turnover events are associated with increased incidence of disease. The role of genetic diversity in DENV lineage extinctions is not understood. We investigated the nature and extent of genetic diversity in the envelope (E) gene of DENV serotype 1 representing different lineage histories. A region of the DENV genome spanning the E gene was amplified and sequenced by Roche/454 pyrosequencing. The pyrosequencing results identified distinct sub-populations (haplotypes) for each DENV-1 E gene. A phylogenetic tree was constructed with the consensus DENV-1 E gene nucleotide sequences, and the sequences of each constructed haplotype showed that the haplotypes segregated with the Sanger consensus sequence of the population from which they were drawn. Haplotypes determined through pyrosequencing identified a recombinant DENV genome that could not be identified through Sanger sequencing. Nucleotide-level sequence diversities of DENV-1 populations determined from SNP analysis were very low, estimated at 0.009–0.01. There were also no stop codon, frameshift or non-frameshift mutations observed in the E genes of any lineage. No significant correlations between the accumulation of deleterious mutations or increasing genetic diversity and lineage extinction were observed (p>0.5). Although our hypothesis that accumulation of deleterious mutations over time led to the extinction and replacement of DENV lineages was ultimately not supported by the data, our data do highlight the significant technical issues that must be resolved in the way in which population diversity is measured for DENV and other viruses. The results provide an insight into the within-population genetic structure and diversity of DENV-1 populations.
Affiliation(s)
- Md Abu Choudhury
  - Menzies Health Institute Queensland, Griffith University, Brisbane, Australia
  - Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Australia
- William B Lott
  - Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Australia
  - School of Chemistry, Physics, and Mechanical Engineering, Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia
- Shahera Banu
  - Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Australia
- Anthony Youzhi Cheng
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
- Yik-Ying Teo
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
  - Life Sciences Institute, National University of Singapore, Singapore, Singapore
  - Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore, Singapore
- Rick Twee-Hee Ong
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
- John Aaskov
  - Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Australia
|
31
|
Wu SH, Rodrigo AG. Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals. BMC Bioinformatics 2015; 16:357. [PMID: 26536860 PMCID: PMC4634753 DOI: 10.1186/s12859-015-0810-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2015] [Accepted: 10/30/2015] [Indexed: 11/17/2022] Open
Abstract
Background: Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full-length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA, using only the frequencies of nucleotides at each site and without reconstructing either the full-length alignment or the phylogeny. Results: We used simulations to evaluate the performance of these algorithms, and our results demonstrate that LS performs poorly because bootstrap 95% confidence intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, the ABC-MCMC 95% Highest Posterior Density (HPD) intervals enclosed the true parameter values at a rate approximately equivalent to that obtained using BEAST, a program that implements a Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because information is lost when sitewise nucleotide frequencies alone are used, the ABC-MCMC 95% HPDs are larger than those obtained by BEAST. Conclusion: We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95% HPDs. One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences, and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.
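The summary statistic this abstract relies on, the frequency of each nucleotide at each site, can be computed in a single pass over the reads. The sketch below is a minimal illustration of that step only, not of LS or ABC-MCMC and not the SF-ABC code; the function name `sitewise_frequencies`, the `(start, sequence)` read layout, and the toy reads are assumptions made for the example.

```python
from collections import Counter


def sitewise_frequencies(reads, length):
    """Per-site nucleotide frequencies from short reads.

    reads: iterable of (start, sequence) pairs, where start is the 0-based
    position of the read on a shared coordinate system (assumed layout).
    """
    counts = [Counter() for _ in range(length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length and base in "ACGT":
                counts[pos][base] += 1
    freqs = []
    for site in counts:
        depth = sum(site.values())
        freqs.append({b: (site[b] / depth if depth else 0.0) for b in "ACGT"})
    return freqs


# Four overlapping toy reads covering a 6-base region.
reads = [(0, "ACGT"), (2, "GTAA"), (1, "CGTA"), (2, "TTAA")]
freqs = sitewise_frequencies(reads, 6)
print(freqs[2])
```

Each read is visited once, so the cost grows linearly with the number of reads, in keeping with the scaling behaviour the abstract describes for the summary-statistic approach.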
Affiliation(s)
- Steven H Wu
  - Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA; Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA
- Allen G Rodrigo
  - Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA; The National Evolutionary Synthesis Center, Durham, NC, 27705, USA
|
32
|
Chedom DF, Murcia PR, Greenman CD. Inferring the Clonal Structure of Viral Populations from Time Series Sequencing. PLoS Comput Biol 2015; 11:e1004344. [PMID: 26571026 PMCID: PMC4646700 DOI: 10.1371/journal.pcbi.1004344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2014] [Accepted: 05/17/2015] [Indexed: 11/18/2022] Open
Abstract
RNA virus populations undergo processes of mutation and selection, resulting in a mixed population of viral particles. High-throughput sequencing of a viral population therefore contains a mixed signal of the underlying clones. We would like to identify the underlying evolutionary structures. We use two sources of information to attempt this: within-segment linkage information and mutation prevalence. We demonstrate that clone haplotypes, their prevalence, and maximum parsimony reticulate evolutionary structures can be identified, although the solutions may not be unique, even for complete sets of information. This is applied to a chain of influenza infection, where we infer evolutionary structures, including reassortment, and demonstrate some of the difficulties of interpretation that arise from deep sequencing due to artifacts such as template switching during PCR amplification.
Affiliation(s)
- Donatien F. Chedom
  - The Genome Analysis Centre, Norwich Research Park, Norwich, United Kingdom
- Pablo R. Murcia
  - MRC-University of Glasgow Centre for Virus Research, United Kingdom
- Chris D. Greenman
  - The Genome Analysis Centre, Norwich Research Park, Norwich, United Kingdom
  - School of Computing Sciences, University of East Anglia, Norwich, United Kingdom
|
33
|
High-resolution genetic profile of viral genomes: why it matters. Curr Opin Virol 2015; 14:62-70. [DOI: 10.1016/j.coviro.2015.08.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2015] [Revised: 08/07/2015] [Accepted: 08/07/2015] [Indexed: 12/12/2022]
|
34
|
Pulido-Tamayo S, Sánchez-Rodríguez A, Swings T, Van den Bergh B, Dubey A, Steenackers H, Michiels J, Fostier J, Marchal K. Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations. Nucleic Acids Res 2015; 43:e105. [PMID: 25990729 PMCID: PMC4652744 DOI: 10.1093/nar/gkv478] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2015] [Accepted: 04/29/2015] [Indexed: 11/23/2022] Open
Abstract
Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequency at which the haplotypes occur. However, this reconstruction is technically not trivial, especially in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus hinders the reconstruction of genome-wide haplotypes based on sequence overlap information. Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As was shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.
Affiliation(s)
- Sergio Pulido-Tamayo
  - Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium; Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Aminael Sánchez-Rodríguez
  - Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Departamento de Ciencias Naturales, Universidad Técnica Particular de Loja, San Cayetano Alto S/N, EC1101608 Loja, Ecuador
- Toon Swings
  - Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Bram Van den Bergh
  - Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Akanksha Dubey
  - Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Hans Steenackers
  - Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Jan Michiels
  - Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
- Jan Fostier
  - Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium
- Kathleen Marchal
  - Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium; Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
|
35
|
Yano Y, Azuma T, Hayashi Y. Variations and mutations in the hepatitis B virus genome and their associations with clinical characteristics. World J Hepatol 2015; 7:583-92. [PMID: 25848482 PMCID: PMC4381181 DOI: 10.4254/wjh.v7.i3.583] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/28/2014] [Revised: 11/27/2014] [Accepted: 12/29/2014] [Indexed: 02/06/2023] Open
Abstract
Hepatitis B virus (HBV) infection is a major global issue, because chronic HBV infection is strongly associated with liver cancer. HBV has spread worldwide with various mutations and variations. This variability, called quasispecies, arises from the lack of proofreading capacity of the viral reverse transcriptase. So far, thousands of studies have reported that the variability of the genome is closely related to geographic distribution and clinical characteristics. Recent technological advances, including capillary sequencers and next generation sequencers, have made it easier to analyze mutations. The variability of the HBV genome is related not only to the antigenicity of HBs-antigen but also to resistance to antiviral therapies. Understanding these variations is important for the development of diagnostic tools and appropriate therapy for chronic hepatitis B. In this review, recent publications on HBV mutations and variations are summarized.
Affiliation(s)
- Yoshihiko Yano
  - Yoshihiko Yano, Takeshi Azuma, Department of Gastroenterology, Kobe University Graduate School of Medicine, Kusunoki-cho, Kobe 650-0017, Japan
- Takeshi Azuma
  - Yoshihiko Yano, Takeshi Azuma, Department of Gastroenterology, Kobe University Graduate School of Medicine, Kusunoki-cho, Kobe 650-0017, Japan
- Yoshitake Hayashi
  - Yoshihiko Yano, Takeshi Azuma, Department of Gastroenterology, Kobe University Graduate School of Medicine, Kusunoki-cho, Kobe 650-0017, Japan
|
36
|
Li F, Zhang D, Li Y, Jiang D, Luo S, Du N, Chen W, Deng L, Zeng C. Whole genome characterization of hepatitis B virus quasispecies with massively parallel pyrosequencing. Clin Microbiol Infect 2015; 21:280-7. [DOI: 10.1016/j.cmi.2014.10.007] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2014] [Revised: 10/04/2014] [Accepted: 10/10/2014] [Indexed: 01/19/2023]
|
37
|
Seifert D, Beerenwinkel N. Estimating Fitness of Viral Quasispecies from Next-Generation Sequencing Data. Curr Top Microbiol Immunol 2015; 392:181-200. [PMID: 26318139 DOI: 10.1007/82_2015_462] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
The quasispecies model is ubiquitous in the study of viruses. While having led to a number of insights that have stood the test of time, the quasispecies model has mostly been discussed in a theoretical fashion with little support of data. With next-generation sequencing (NGS), this situation is changing and a wealth of data can now be produced in a time- and cost-efficient manner. NGS can, after removal of technical errors, yield an exceedingly detailed picture of the viral population structure. The widespread availability of cross-sectional data can be used to study fitness landscapes of viral populations in the quasispecies model. This chapter highlights methods that estimate the strength of selection in selective sweeps, assesses marginal fitness effects of quasispecies, and finally infers the fitness landscape of a viral quasispecies, all on the basis of NGS data.
|
38
|
Aguirre de Cárcer D, Angly FE, Alcamí A. Evaluation of viral genome assembly and diversity estimation in deep metagenomes. BMC Genomics 2014; 15:989. [PMID: 25407630 PMCID: PMC4247695 DOI: 10.1186/1471-2164-15-989] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2014] [Accepted: 10/30/2014] [Indexed: 01/21/2023] Open
Abstract
Background: Viruses have unique properties, such as small genomes and regions of high similarity, whose effects on metagenomic assemblies have not been characterized so far. This study uses diverse in silico simulated viromes to evaluate how extensively genomes can be assembled using different sequencing platforms and assemblers. Further, it investigates the suitability of different methods to estimate viral diversity in metagenomes. Results: We created in silico metagenomes mimicking various platforms at different sequencing depths. The CLC assembler proved subpar compared to IDBA_UD and CAMERA, which are metagenomic-specific. Up to a saturation point, Illumina platforms proved more capable of reconstructing large portions of viral genomes compared to 454. Read length was an important factor for limiting chimericity, while scaffolding marginally improved contig length and accuracy. The genome length of the various viruses in the metagenomes did not significantly affect genome reconstruction, but the co-existence of highly similar genomes was detrimental. When evaluating diversity estimation tools, we found that PHACCS results were more accurate than those from CatchAll and clustering, which were both orders of magnitude above expected. Conclusions: Assemblers designed specifically for the analysis of metagenomes should be used to facilitate the creation of high-quality long contigs. Despite the high coverage possible, scientists should not expect to always obtain complete genomes, because their reconstruction may be hindered by co-existing species bearing highly similar genomic regions. Further development of metagenomics-oriented assemblers may help bypass these limitations in future studies. Meanwhile, the lack of fully reconstructed communities keeps methods to estimate viral diversity relevant. While none of the three methods tested had absolute precision, only PHACCS was deemed suitable for comparative studies.
Affiliation(s)
- Daniel Aguirre de Cárcer
  - Centro de Biología Molecular Severo Ochoa, Consejo Superior de Investigaciones Científicas (CSIC)-Universidad Autónoma de Madrid, Madrid, Spain
|
39
|
Jayasundara D, Saeed I, Maheswararajah S, Chang B, Tang SL, Halgamuge SK. ViQuaS: an improved reconstruction pipeline for viral quasispecies spectra generated by next-generation sequencing. Bioinformatics 2014; 31:886-96. [DOI: 10.1093/bioinformatics/btu754] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
|
40
|
Sequencing pools of individuals — mining genome-wide polymorphism data without big funding. Nat Rev Genet 2014; 15:749-63. [DOI: 10.1038/nrg3803] [Citation(s) in RCA: 512] [Impact Index Per Article: 51.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
|
41
|
Shao W, Kearney MF, Boltz VF, Spindler JE, Mellors JW, Maldarelli F, Coffin JM. PAPNC, a novel method to calculate nucleotide diversity from large scale next generation sequencing data. J Virol Methods 2014; 203:73-80. [PMID: 24681054 PMCID: PMC4104926 DOI: 10.1016/j.jviromet.2014.03.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2014] [Revised: 03/10/2014] [Accepted: 03/11/2014] [Indexed: 02/06/2023]
Abstract
Estimating viral diversity in infected patients can provide insight into pathogen evolution and emergence of drug resistance. With the widespread adoption of deep sequencing, it is important to develop tools to accurately calculate population diversity from very large datasets. Current methods for estimating diversity that are based on multiple alignments are not practical to apply to such data. In this study, the authors report a novel method (Pairwise Alignment Positional Nucleotide Counting, PAPNC) for estimating population diversity from 454 sequence data. The diversity measurements determined using this method were comparable to those calculated by average pairwise difference (APD) of multiply aligned sequences using MEGA5. Diversities were estimated for 9 patient plasma HIV samples sequenced with Titanium 454 technology and by single-genome sequencing (SGS). Diversities calculated from deep sequencing using PAPNC ranged from 0.002 to 0.021, while APD measurements calculated from SGS data ranged approximately from 0.001 to 0.018, with the difference being attributable to PCR error (contributing background diversity of 0.0016 in a control sample). Comparison of APDs estimated from 100 sets of sequences drawn at random from 454-generated data and from corresponding SGS data showed very close correlation between the two methods, with R² of 0.96, and differing on average by about 1% (after correction for PCR error). The authors have developed a novel method that is suitable for calculating genetic diversities for large scale datasets from next generation sequencing. It can be implemented easily as a function in available variation calling programs like SAMtools or haplotype reconstruction software for nucleotide genetic diversity calculation. A Perl script implementing this method is available upon request.
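The general idea of deriving a diversity figure from per-position nucleotide counts, rather than from a multiple alignment, can be sketched as follows. This is not the authors' Perl script: the function name `positional_diversity`, the input layout, and the choice to average the per-site pairwise difference over covered sites are assumptions made for illustration.

```python
def positional_diversity(counts_per_site):
    """Mean per-site probability that two reads drawn without replacement differ.

    counts_per_site: list of dicts mapping base -> read count at that
    position (hypothetical input layout).
    """
    total = 0.0
    used = 0
    for site in counts_per_site:
        n = sum(site.values())
        if n < 2:
            continue  # a pairwise difference needs at least two reads
        same_pairs = sum(c * (c - 1) for c in site.values())
        total += 1.0 - same_pairs / (n * (n - 1))
        used += 1
    return total / used if used else 0.0


# Toy counts at four positions, 100 reads deep each.
counts = [
    {"A": 100},
    {"A": 98, "G": 2},
    {"C": 100},
    {"T": 50, "C": 50},
]
diversity = positional_diversity(counts)
print(diversity)
```

Because only counts are needed, the calculation touches each position once regardless of how many reads were sequenced, which is why a count-based approach suits very deep datasets.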
Affiliation(s)
- Wei Shao
- Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, Frederick, MD, United States.
| | - Mary F Kearney
- HIV Drug Resistance Program, NCI, Frederick, MD, United States
| | - Valerie F Boltz
- HIV Drug Resistance Program, NCI, Frederick, MD, United States
| | | | - John W Mellors
- Division of Infectious Diseases, University of Pittsburgh, Pittsburgh, PA, United States
| | | | - John M Coffin
- Department of Molecular Biology and Microbiology, Tufts University, Boston, MA, United States
|
42
|
Mangul S, Wu NC, Mancuso N, Zelikovsky A, Sun R, Eskin E. Accurate viral population assembly from ultra-deep sequencing data. Bioinformatics 2014; 30:i329-37. [PMID: 24932001 PMCID: PMC4058922 DOI: 10.1093/bioinformatics/btu295] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. RESULTS: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation-maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads. AVAILABILITY: Our tool VGA is freely available at http://genetics.cs.ucla.edu/vga/
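The abstract states that abundances are estimated with an expectation-maximization algorithm but gives no details. The following is a generic sketch of how EM can apportion ambiguous reads among candidate variants, not VGA's actual implementation. The function name, the read-to-variant compatibility sets, and the iteration count are all assumptions made for illustration, and the sketch ignores effects such as variant length and uneven coverage.

```python
def em_abundances(compat, n_variants, iterations=200):
    """Estimate variant frequencies from read-to-variant compatibility sets.

    compat: list of sets; each set holds the indices of the variants that a
    read is consistent with (hypothetical input format, not from the source).
    """
    freqs = [1.0 / n_variants] * n_variants  # uniform starting point
    for _ in range(iterations):
        expected = [0.0] * n_variants
        # E-step: split each read among its compatible variants in
        # proportion to the current frequency estimates.
        for variants in compat:
            weight = sum(freqs[v] for v in variants)
            if weight == 0.0:
                continue
            for v in variants:
                expected[v] += freqs[v] / weight
        # M-step: renormalise the expected read counts into frequencies.
        total = sum(expected)
        freqs = [e / total for e in expected]
    return freqs


# Toy data: 6 reads unique to variant 0, 2 unique to variant 1, 2 ambiguous.
toy_compat = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 2
print(em_abundances(toy_compat, n_variants=2))
```

In this toy case the ambiguous reads end up shared in the same 3:1 ratio as the uniquely assigned reads, which is the behaviour one would expect from a mixture-proportion EM.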
Affiliation(s)
- Serghei Mangul
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Nicholas C Wu
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Nicholas Mancuso
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Alex Zelikovsky
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Ren Sun
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Eleazar Eskin
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
|
43
|
HIV-1 quasispecies delineation by tag linkage deep sequencing. PLoS One 2014; 9:e97505. [PMID: 24842159 PMCID: PMC4026136 DOI: 10.1371/journal.pone.0097505] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2014] [Accepted: 04/17/2014] [Indexed: 12/16/2022] Open
Abstract
Trade-offs between throughput, read length, and error rates in high-throughput sequencing limit certain applications such as monitoring viral quasispecies. Here, we describe a molecular-based tag linkage method that allows assemblage of short sequence reads into long DNA fragments. It enables haplotype phasing with high accuracy and sensitivity to interrogate individual viral sequences in a quasispecies. This approach is demonstrated to deduce ∼2000 unique 1.3 kb viral sequences from HIV-1 quasispecies in vivo and after passaging ex vivo with a detection limit of ∼0.005% to ∼0.001%. Reproducibility of the method is validated quantitatively and qualitatively by a technical replicate. This approach can improve monitoring of the genetic architecture and evolution dynamics in any quasispecies population.
|
44
|
Töpfer A, Marschall T, Bull RA, Luciani F, Schönhuth A, Beerenwinkel N. Viral quasispecies assembly via maximal clique enumeration. PLoS Comput Biol 2014; 10:e1003515. [PMID: 24675810 PMCID: PMC3967922 DOI: 10.1371/journal.pcbi.1003515] [Citation(s) in RCA: 76] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2013] [Accepted: 01/31/2014] [Indexed: 11/25/2022] Open
Abstract
Virus populations can display high genetic diversity within individual hosts. The intra-host collection of viral haplotypes, called viral quasispecies, is an important determinant of virulence, pathogenesis, and treatment outcome. We present HaploClique, a computational approach to reconstruct the structure of a viral quasispecies from next-generation sequencing data as obtained from bulk sequencing of mixed virus samples. We develop a statistical model for paired-end reads accounting for mutations, insertions, and deletions. Using an iterative maximal clique enumeration approach, read pairs are assembled into haplotypes of increasing length, eventually enabling global haplotype assembly. The performance of our quasispecies assembly method is assessed on simulated data for varying population characteristics and sequencing technology parameters. Owing to its paired-end handling, HaploClique compares favorably to state-of-the-art haplotype inference methods. It can reconstruct error-free full-length haplotypes from low coverage samples and detect large insertions and deletions at low frequencies. We applied HaploClique to sequencing data derived from a clinical hepatitis C virus population of an infected patient and discovered a novel deletion of length 357±167 bp that was validated by two independent long-read sequencing experiments. HaploClique is available at https://github.com/armintoepfer/haploclique. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2-5.
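Maximal clique enumeration on a read graph can be sketched as follows. This is a deliberately simplified illustration, not HaploClique's algorithm: the abstract describes a statistical model for paired-end reads, whereas the sketch below assumes single reads given as hypothetical `(start, sequence)` tuples and links two reads only when they overlap by at least `min_overlap` bases and agree exactly over the overlap. Cliques are enumerated with the classic Bron-Kerbosch procedure without pivoting.

```python
def compatible(a, b, min_overlap=2):
    """True if reads a and b, given as (start, sequence), agree on their overlap."""
    (start_a, seq_a), (start_b, seq_b) = a, b
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    if hi - lo < min_overlap:
        return False
    return seq_a[lo - start_a:hi - start_a] == seq_b[lo - start_b:hi - start_b]


def build_graph(reads):
    """Adjacency sets linking every pair of mutually compatible reads."""
    adj = {i: set() for i in range(len(reads))}
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if compatible(reads[i], reads[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration (no pivoting) of all maximal cliques."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy reads from two hypothetical haplotypes, ACGTAA and ACCTAA.
toy_reads = [(0, "ACGT"), (2, "GTAA"), (0, "ACCT"), (2, "CTAA")]
print(maximal_cliques(build_graph(toy_reads)))
```

Each maximal clique groups reads that are mutually consistent and could therefore be merged into a longer haplotype segment; repeating the merge-and-enumerate step on the merged segments corresponds loosely to the iterative assembly the abstract describes.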
Affiliation(s)
- Armin Töpfer
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Rowena A. Bull
- Inflammation and Infection Research Centre, School of Medical Sciences, UNSW, Sydney, Australia
- Fabio Luciani
- Inflammation and Infection Research Centre, School of Medical Sciences, UNSW, Sydney, Australia
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
|
45
|
Prabhakaran S, Rey M, Zagordi O, Beerenwinkel N, Roth V. HIV Haplotype Inference Using a Propagating Dirichlet Process Mixture Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:182-191. [PMID: 26355517 DOI: 10.1109/tcbb.2013.145] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
This paper presents a new computational technique for the identification of HIV haplotypes. HIV tends to generate many potentially drug-resistant mutants within the HIV-infected patient and being able to identify these different mutants is important for efficient drug administration. With the view of identifying the mutants, we aim at analyzing short deep sequencing data called reads. From a statistical perspective, the analysis of such data can be regarded as a nonstandard clustering problem due to missing pairwise similarity measures between non-overlapping reads. To overcome this problem we propagate a Dirichlet Process Mixture Model by sequentially updating the prior information from successive local analyses. The model is verified using both simulated and real sequencing data.
|
46
|
Poh WT, Xia E, Chin-Inmanu K, Wong LP, Cheng AY, Malasit P, Suriyaphol P, Teo YY, Ong RTH. Viral quasispecies inference from 454 pyrosequencing. BMC Bioinformatics 2013; 14:355. [PMID: 24308284 PMCID: PMC4234478 DOI: 10.1186/1471-2105-14-355] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2013] [Accepted: 11/15/2013] [Indexed: 02/05/2023] Open
Abstract
Background: Many potentially life-threatening infectious viruses are highly mutable in nature. Characterizing the fittest variants within a quasispecies from infected patients is expected to allow unprecedented opportunities to investigate the relationship between quasispecies diversity and disease epidemiology. The advent of next-generation sequencing technologies has allowed the study of virus diversity with high-throughput sequencing, although these methods come with higher rates of errors which can artificially increase diversity. Results: Here we introduce a novel computational approach that incorporates base quality scores from next-generation sequencers to reconstruct viral genome sequences while simultaneously inferring the number of variants present within a quasispecies. Comparisons on simulated and clinical data on dengue virus suggest that the novel approach provides a more accurate inference of the underlying number of variants within the quasispecies, which is vital for clinical efforts in mapping the within-host viral diversity. Sequence alignments generated by our approach are also found to exhibit lower rates of error. Conclusions: The ability to infer the viral quasispecies colony that is present within a human host provides the potential for a more accurate classification of the viral phenotype. Understanding the genomics of viruses will be relevant not just to studying how to control or even eradicate these viral infectious diseases, but also in learning about the innate protection in the human host against the viruses.
Affiliation(s)
- Yik-Ying Teo
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore.
|
47
|
Rodriguez-Frias F, Buti M, Tabernero D, Homs M. Quasispecies structure, cornerstone of hepatitis B virus infection: Mass sequencing approach. World J Gastroenterol 2013; 19:6995-7023. [PMID: 24222943 PMCID: PMC3819535 DOI: 10.3748/wjg.v19.i41.6995] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/28/2013] [Revised: 07/23/2013] [Accepted: 09/17/2013] [Indexed: 02/06/2023] Open
Abstract
Hepatitis B virus (HBV) is a DNA virus with complex replication, and high replication and mutation rates, leading to a heterogeneous viral population. The population is comprised of genomes that are closely related, but not identical; hence, HBV is considered a viral quasispecies. Quasispecies variability may be somewhat limited by the high degree of overlapping between the HBV coding regions, which is especially important in the P and S gene overlapping regions, but is less significant in the X and preCore/Core genes. Despite this restriction, several clinically and pathologically relevant variants have been characterized along the viral genome. Next-generation sequencing (NGS) approaches enable high-throughput analysis of thousands of clonally amplified regions and are powerful tools for characterizing genetic diversity in viral strains. In the present review, we update the information regarding HBV variability and present a summary of the various NGS approaches available for research in this virus. In addition, we provide an analysis of the clinical implications of HBV variants and their study by NGS.
|
48
|
Aita T, Ichihashi N, Yomo T. Probabilistic model based error correction in a set of various mutant sequences analyzed by next-generation sequencing. Comput Biol Chem 2013; 47:221-30. [PMID: 24184706 DOI: 10.1016/j.compbiolchem.2013.09.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2013] [Revised: 09/13/2013] [Accepted: 09/27/2013] [Indexed: 01/14/2023]
Abstract
To analyze the evolutionary dynamics of a mutant population in an evolutionary experiment, it is necessary to sequence a vast number of mutants by high-throughput (next-generation) sequencing technologies, which enable rapid and parallel analysis of multikilobase sequences. However, the observed sequences include many errors of base call. Therefore, if next-generation sequencing is applied to analysis of a heterogeneous population of various mutant sequences, it is necessary to discriminate between true bases as point mutations and errors of base call in the observed sequences, and to subject the sequences to error-correction processes. To address this issue, we have developed a novel method of error correction based on the Potts model and a maximum a posteriori probability (MAP) estimate of its parameters corresponding to the "true sequences". Our method of error correction utilizes (1) the "quality scores" which are assigned to individual bases in the observed sequences and (2) the neighborhood relationship among the observed sequences mapped in sequence space. The computer experiments of error correction of artificially generated sequences supported the effectiveness of our method, showing that 50-90% of errors were removed. Interestingly, this method is analogous to a probabilistic model based method of image restoration developed in the field of information engineering.
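How quality scores can feed a maximum a posteriori (MAP) base call is sketched below in a much simpler setting than the Potts model the abstract describes. The sketch treats a single alignment column in isolation and ignores the neighborhood relationship among sequences. It relies on the standard Phred convention, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The uniform prior over bases, the assumption that an error is equally likely to yield any of the three other bases, and the function names are illustrative assumptions, not details from the paper.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score into a base-call error probability."""
    return 10 ** (-q / 10)


def map_base(observations, alphabet="ACGT"):
    """MAP estimate of the true base at one position.

    observations: list of (called_base, phred_score) with phred_score > 0.
    Assumptions (illustrative only): uniform prior over the alphabet and
    errors spread evenly over the three alternative bases.
    """
    log_post = {}
    for candidate in alphabet:
        log_p = 0.0
        for base, q in observations:
            err = phred_to_error(q)
            log_p += math.log(1.0 - err) if base == candidate else math.log(err / 3.0)
        log_post[candidate] = log_p
    # Normalise in log space to avoid underflow with many observations.
    peak = max(log_post.values())
    weights = {b: math.exp(lp - peak) for b, lp in log_post.items()}
    norm = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / norm


# Two high-quality A calls outweigh one low-quality G call.
print(map_base([("A", 30), ("A", 30), ("G", 10)]))
```

A low-quality discordant call is therefore treated as a probable base-calling error rather than a true point mutation, which is the distinction the abstract is concerned with.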
Affiliation(s)
- Takuyo Aita
- Exploratory Research for Advanced Technology, Japan Science and Technology Agency, Yamadaoka 1-5, Suita, Osaka, Japan
|
49
|
Prosperi MCF, Yin L, Nolan DJ, Lowe AD, Goodenow MM, Salemi M. Empirical validation of viral quasispecies assembly algorithms: state-of-the-art and challenges. Sci Rep 2013; 3:2837. [PMID: 24089188 PMCID: PMC3789152 DOI: 10.1038/srep02837] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2013] [Accepted: 09/13/2013] [Indexed: 11/22/2022] Open
Abstract
Next generation sequencing (NGS) is superseding Sanger technology for analysing intra-host viral populations, in terms of genome length and resolution. We introduce two new empirical validation data sets and test the available viral population assembly software. Two intra-host viral population 'quasispecies' samples (type-1 human immunodeficiency and hepatitis C virus) were Sanger-sequenced, and plasmid clone mixtures at controlled proportions were shotgun-sequenced using Roche's 454 sequencing platform. The performance of different assemblers was compared in terms of phylogenetic clustering and recombination with the Sanger clones. Phylogenetic clustering showed that all assemblers captured a proportion of the most divergent lineages, but none were able to provide a high precision/recall tradeoff. Estimated variant frequencies mildly correlated with the original. Given the limitations of currently available algorithms identified by our empirical validation, the development and exploitation of additional data sets is needed, in order to establish an efficient framework for viral population reconstruction using NGS.
Affiliation(s)
- Mattia C. F. Prosperi
- University of Manchester, Faculty of Medical and Human Sciences, Northwest Institute of Bio-Health Informatics, Centre for Health Informatics, Institute of Population Health, Manchester, UK
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Li Yin
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- David J. Nolan
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Amanda D. Lowe
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- Maureen M. Goodenow
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- Marco Salemi
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- Emerging Pathogens Institute, Gainesville, Florida, USA
|
50
|
Improved detection of rare HIV-1 variants using 454 pyrosequencing. PLoS One 2013; 8:e76502. [PMID: 24098517 PMCID: PMC3788733 DOI: 10.1371/journal.pone.0076502] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2013] [Accepted: 08/27/2013] [Indexed: 01/21/2023] Open
Abstract
454 pyrosequencing, a massively parallel sequencing (MPS) technology, is often used to study HIV genetic variation. However, the substantial mismatch error rate of the PCR required to prepare HIV-containing samples for pyrosequencing has limited the detection of rare variants within viral populations to those present above ~1%. To improve detection of rare variants, we varied PCR enzymes and conditions to identify those that combined high sensitivity with a low error rate. Substitution errors were found to vary up to 3-fold between the different enzymes tested. The sensitivity of each enzyme, which impacts the number of templates amplified for pyrosequencing, was shown to vary, although not consistently across genes and different samples. We also describe an amplicon-based method to improve the consistency of read coverage over stretches of the HIV-1 genome. Twenty-two primers were designed to amplify 11 overlapping amplicons in the HIV-1 clade B gag-pol and env gp120 coding regions to encompass 4.7 kb of the viral genome per sample at sensitivities as low as 0.01-0.2%.
|