1. Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infection, Genetics and Evolution: Journal of Molecular Epidemiology and Evolutionary Genetics in Infectious Diseases 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity, with a variety of lower-frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed the other haplotype reconstruction tools, with the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
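As an illustration of the precision and recall figures mentioned in this abstract, the following is a minimal sketch of how the two metrics could be computed for a set of reconstructed haplotypes. It assumes that a reconstructed haplotype counts as correct only when it matches a true haplotype exactly; the matching criterion used in the study itself may differ, and the function name `precision_recall` and the toy sequences are hypothetical.

```python
def precision_recall(true_haplotypes, predicted_haplotypes):
    """Return (precision, recall) for reconstructed haplotypes.

    Assumption (not from the abstract): a prediction is a true positive
    only if it is identical to one of the true haplotype sequences.
    """
    true_set = set(true_haplotypes)
    pred_set = set(predicted_haplotypes)  # duplicates are counted once
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall


# Toy example: four true haplotypes, three reconstructed ones,
# two of which are correct.
true_haps = ["ACGT", "ACGA", "TCGT", "TTTT"]
predicted_haps = ["ACGT", "ACGA", "GGGG"]
precision, recall = precision_recall(true_haps, predicted_haps)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case the caller reports fewer haplotypes than truly exist, so recall is lower than precision, which is the pattern of underestimation the abstract describes for most tools.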
Affiliation(s)
- Anton Eliseev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Pavel Avdeyev
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
2. Huang SW, Hung SJ, Wang JR. Application of deep sequencing methods for inferring viral population diversity. J Virol Methods 2019; 266:95-102. [PMID: 30690049 DOI: 10.1016/j.jviromet.2019.01.013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/07/2018] [Revised: 01/11/2019] [Accepted: 01/24/2019] [Indexed: 12/13/2022]
Abstract
The first deep sequencing method was announced in 2005. Owing to the growing volume of sequencing data and the falling cost of each sequencing dataset, this innovative technique was soon applied to genetic investigations of genome diversity in various viruses, particularly RNA viruses. These deep sequencing findings documented viral epidemiology and evolution and provided high-resolution data on the genetic changes in viral populations. Here, we review deep sequencing platforms that have been applied in viral quasispecies studies. Further, we discuss recent deep sequencing studies on viral inter- and intrahost evolution, drug resistance, and humoral immune selection, especially in emerging and re-emerging viruses. Deep sequencing methods are becoming the standard for providing comprehensive results on viral population diversity, and their applications are discussed.
Affiliation(s)
- Sheng-Wen Huang
  - National Mosquito-Borne Diseases Control Research Center, National Health Research Institutes, Tainan, Taiwan
- Su-Jhen Hung
  - Department of Medical Laboratory Science and Biotechnology, National Cheng Kung University, Tainan, Taiwan
- Jen-Ren Wang
  - Department of Medical Laboratory Science and Biotechnology, National Cheng Kung University, Tainan, Taiwan; Center of Infectious Disease and Signaling Research, National Cheng Kung University, Tainan, Taiwan; Department of Pathology, National Cheng Kung University Hospital, Tainan, Taiwan; National Institute of Infectious Diseases and Vaccinology, National Health Research Institutes, Tainan, Taiwan
3. Posada-Cespedes S, Seifert D, Beerenwinkel N. Recent advances in inferring viral diversity from high-throughput sequencing data. Virus Res 2016; 239:17-32. [PMID: 27693290 DOI: 10.1016/j.virusres.2016.09.016] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 06/24/2016] [Revised: 09/23/2016] [Accepted: 09/24/2016] [Indexed: 02/05/2023]
Abstract
Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. To account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus populations. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates, are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.
Affiliation(s)
- Susana Posada-Cespedes
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
- David Seifert
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
4. Liu Y, Chiaromonte F, Ross H, Malhotra R, Elleder D, Poss M. Error correction and statistical analyses for intra-host comparisons of feline immunodeficiency virus diversity from high-throughput sequencing data. BMC Bioinformatics 2015; 16:202. [PMID: 26123018 PMCID: PMC4486422 DOI: 10.1186/s12859-015-0607-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/06/2014] [Accepted: 04/29/2015] [Indexed: 11/16/2022] [Open Access]
Abstract
Background: Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are co-infected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3′ half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology.
Results: Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponential-normal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from high-throughput sequencing. Similar to accounting for over-dispersion in counts, this accounts for error-inflated variability in frequencies, and quite effectively reproduces empirically observed distributions. After obtaining error-corrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dual and single infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or infection status by tissue interaction. Altogether, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads). We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy.
Conclusions: Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of high-throughput sequencing data in understanding viral diversity. We found that co-infection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0607-z) contains supplementary material, which is available to authorized users.
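To make the notion of minor allele frequencies concrete, here is a minimal sketch of the simple thresholding baseline that the abstract lists among the "other approaches". It is not the authors' exponential-normal convolution model. The sketch assumes gap-free-or-`-`-padded aligned reads of equal length, defines the minor allele frequency at a site as one minus the frequency of the most common base (other definitions exist), and uses hypothetical function names and an arbitrary illustrative cutoff.

```python
from collections import Counter


def minor_allele_frequencies(aligned_reads):
    """Per-site minor allele frequency for equal-length aligned reads.

    Assumption: the minor allele frequency at a site is 1 minus the
    frequency of the most common base; '-' (gap) characters are ignored.
    """
    if not aligned_reads:
        return []
    frequencies = []
    for site in range(len(aligned_reads[0])):
        counts = Counter(read[site] for read in aligned_reads if read[site] != "-")
        total = sum(counts.values())
        if total == 0:
            # No base observed at this site: report zero diversity.
            frequencies.append(0.0)
            continue
        major_count = counts.most_common(1)[0][1]
        frequencies.append(1.0 - major_count / total)
    return frequencies


def threshold_frequencies(frequencies, cutoff):
    """Zero out frequencies below the cutoff, treating them as sequencing error."""
    return [f if f >= cutoff else 0.0 for f in frequencies]


reads = ["ACGT", "ACGT", "ACGT", "ATGT"]  # toy alignment
mafs = minor_allele_frequencies(reads)
print(mafs)
print(threshold_frequencies(mafs, cutoff=0.3))
```

A hard cutoff like this discards every variant below the threshold, real or not, which is one plausible reason such a baseline can weaken the effects the abstract reports.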
Affiliation(s)
- Yang Liu
  - Department of Statistics, The Pennsylvania State University, University Park, PA, 16802, USA; The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, PA, 16802, USA; The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
- Howard Ross
  - Bioinformatics Institute, School of Biological Sciences, University of Auckland, Auckland, 1142, New Zealand
- Raunaq Malhotra
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Daniel Elleder
  - Department of Biology, The Pennsylvania State University, University Park, PA, 16802, USA; The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA; Current address: Institute of Molecular Genetics, Academy of Sciences of the Czech Republic, Videnska 1083, Prague, 14000, Czech Republic
- Mary Poss
  - Department of Biology, The Pennsylvania State University, University Park, PA, 16802, USA; Department of Veterinary and Biomedical Sciences, The Pennsylvania State University, University Park, PA, 16802, USA; The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
5. Archer J, Whiteley G, Casewell NR, Harrison RA, Wagstaff SC. VTBuilder: a tool for the assembly of multi isoform transcriptomes. BMC Bioinformatics 2014; 15:389. [PMID: 25465054 PMCID: PMC4260244 DOI: 10.1186/s12859-014-0389-8] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 03/03/2014] [Accepted: 11/19/2014] [Indexed: 01/10/2023] [Open Access]
Abstract
Background: Within many research areas, such as transcriptomics, the millions of short DNA fragments (reads) produced by current sequencing platforms need to be assembled into transcript sequences before they can be utilized. Despite recent advances in assembly software, creating such transcripts from read data harboring isoform variation remains challenging. This is because current approaches fail to identify all variants present or they create chimeric transcripts within which relationships between co-evolving sites and other evolutionary factors are disrupted. We present VTBuilder, a tool for constructing non-chimeric transcripts from read data that has been sequenced from sources containing isoform complexity.
Results: We validated VTBuilder using reads simulated from 54 Sanger sequenced transcripts (SSTs) expressed in the venom gland of the saw-scaled viper, Echis ocellatus. The SSTs were selected to represent genes from major co-expressed toxin groups known to harbor isoform variants. From the simulated reads, VTBuilder constructed 55 transcripts, 50 of which had a greater than 99% sequence similarity to 48 of the SSTs. In contrast, using the popular assembler tool Trinity (r2013-02-25), only 14 transcripts were constructed with a similar level of sequence identity to just 11 SSTs. Furthermore, VTBuilder produced transcripts with a similar length distribution to the SSTs, while those produced by Trinity were considerably shorter. To demonstrate that our approach can be scaled to real-world data, we assembled the venom gland transcriptome of the African puff adder Bitis arietans using paired-end reads sequenced on Illumina's MiSeq platform. VTBuilder constructed 1481 transcripts from 5 million reads and, following annotation, all major toxin genes were recovered, demonstrating reconstruction of complex underlying sequence and isoform diversity.
Conclusion: Unlike other approaches, VTBuilder strives to maintain the relationships between co-evolving sites within the constructed transcripts, and thus increases transcript utility for a wide range of research areas ranging from transcriptomics to phylogenetics and including the monitoring of drug-resistant parasite populations. Additionally, improving the quality of transcripts assembled from read data will have an impact on future studies that query these data. VTBuilder has been implemented in Java and is available, under the GPL GPU V0.3 license, from http://www.lstmed.ac.uk/vtbuilder.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0389-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- John Archer
  - Department of Parasitology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Gareth Whiteley
  - Department of Parasitology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Nicholas R Casewell
  - Department of Parasitology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Robert A Harrison
  - Department of Parasitology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Simon C Wagstaff
  - Department of Parasitology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK