1
Wennmann JT, Lim FS, Senger S, Gani M, Jehle JA, Keilwagen J. Haplotype determination of the Bombyx mori nucleopolyhedrovirus by Nanopore sequencing and linkage of single nucleotide variants. J Gen Virol 2024; 105. [PMID: 38767624 DOI: 10.1099/jgv.0.001983] [Indexed: 05/22/2024]
Abstract
Naturally occurring isolates of baculoviruses, such as the Bombyx mori nucleopolyhedrovirus (BmNPV), usually consist of numerous genetically different haplotypes. Deciphering the different haplotypes of such isolates is hampered by the large size of the dsDNA genome, as well as the short read length of next generation sequencing (NGS) techniques that are widely applied for baculovirus isolate characterization. In this study, we addressed this challenge by combining the accuracy of NGS to determine single nucleotide variants (SNVs) as genetic markers with the long read length of Nanopore sequencing technique. This hybrid approach allowed the comprehensive analysis of genetically homogeneous and heterogeneous isolates of BmNPV. Specifically, this allowed the identification of two putative major haplotypes in the heterogeneous isolate BmNPV-Ja by SNV position linkage. SNV positions, which were determined based on NGS data, were linked by the long Nanopore reads in a Position Weight Matrix. Using a modified Expectation-Maximization algorithm, the Nanopore reads were assigned according to the occurrence of variable SNV positions by machine learning. The cohorts of reads were de novo assembled, which led to the identification of BmNPV haplotypes. The method demonstrated the strength of the combined approach of short- and long-read sequencing techniques to decipher the genetic diversity of baculovirus isolates.
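The read-assignment step summarised in this abstract can be illustrated with a minimal sketch. It uses a plain EM mixture model with one position weight matrix per cluster, not the modified algorithm of the paper, whose details are not given here. The function name `em_assign_reads`, the `'-'` symbol for SNV positions a read does not cover, the pseudocount, and the random initialisation are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"

def em_assign_reads(reads, n_clusters=2, n_iter=50, pseudocount=0.5, seed=0):
    """Cluster reads by their bases at SNV positions with a plain EM mixture.

    reads: equal-length strings, one character per SNV position;
    '-' marks a position the read does not cover (hypothetical encoding).
    Returns (cluster weights, per-cluster PWMs, per-read responsibilities).
    """
    rng = random.Random(seed)
    n_pos = len(reads[0])
    # Random soft initial assignment of every read to every cluster.
    resp = []
    for _ in reads:
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: cluster weights and one position weight matrix per cluster.
        weights = [sum(r[k] for r in resp) / len(reads) for k in range(n_clusters)]
        pwms = []
        for k in range(n_clusters):
            pwm = []
            for j in range(n_pos):
                counts = {b: pseudocount for b in BASES}
                for read, r in zip(reads, resp):
                    if read[j] in counts:
                        counts[read[j]] += r[k]
                total = sum(counts.values())
                pwm.append({b: c / total for b, c in counts.items()})
            pwms.append(pwm)
        # E-step: posterior probability of each cluster given the read.
        for i, read in enumerate(reads):
            logp = []
            for k in range(n_clusters):
                lp = math.log(max(weights[k], 1e-300))  # guard against log(0)
                for j, base in enumerate(read):
                    if base in BASES:
                        lp += math.log(pwms[k][j][base])
                logp.append(lp)
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            total = sum(probs)
            resp[i] = [p / total for p in probs]
    return weights, pwms, resp
```

Taking the cluster with the highest responsibility for each read gives a hard assignment; in a workflow like the one described, each resulting cohort of reads would then be passed on to a separate assembly step.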
Affiliation(s)
- Jörg T Wennmann
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Fang-Shiang Lim
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Sergei Senger
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Mudasir Gani
  - Division of Entomology, Faculty of Agriculture, Sher-e-Kashmir University of Agricultural Sciences & Technology, Kashmir 193 201, J&K, India
- Johannes A Jehle
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Jens Keilwagen
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biosafety in Plant Biotechnology, Ernst-Baur-Str. 27, 06484 Quedlinburg, Germany
2
Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI, Carrara M, Chen C, Dondi A, Dragan M, Dreifuss D, John A, Langer B, Okoniewski M, du Plessis L, Schmitt U, Singer F, Stadler T, Beerenwinkel N. V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation. Gigascience 2024; 13:giae065. [PMID: 39347649 PMCID: PMC11440432 DOI: 10.1093/gigascience/giae065] [Received: 10/31/2023] [Revised: 06/11/2024] [Accepted: 08/13/2024] [Indexed: 10/01/2024]
Abstract
The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, scaling to large sample sizes, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting 2 large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.
Affiliation(s)
- Lara Fuhrmann
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Kim Philipp Jablonski
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Ivan Topolsky
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Aashil A Batavia
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Nico Borgsmüller
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Pelin Icer Baykal
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Matteo Carrara
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Chaoran Chen
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Arthur Dondi
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Monica Dragan
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- David Dreifuss
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Anika John
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Benjamin Langer
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
- Louis du Plessis
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Uwe Schmitt
  - Scientific IT Services, ETH Zurich, Zurich 8092, Switzerland
- Franziska Singer
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Tanja Stadler
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
3
Gafurov A, Baláž A, Amman F, Boršová K, Čabanová V, Klempa B, Bergthaler A, Vinař T, Brejová B. VirPool: model-based estimation of SARS-CoV-2 variant proportions in wastewater samples. BMC Bioinformatics 2022; 23:551. [PMID: 36536300 PMCID: PMC9761630 DOI: 10.1186/s12859-022-05100-3] [Received: 06/21/2022] [Accepted: 12/06/2022] [Indexed: 12/23/2022]
Abstract
BACKGROUND: The genomes of SARS-CoV-2 are classified into variants, some of which are monitored as variants of concern (e.g. the Delta variant B.1.617.2 or Omicron variant B.1.1.529). Proportions of these variants circulating in a human population are typically estimated by large-scale sequencing of individual patient samples. Sequencing a mixture of SARS-CoV-2 RNA molecules from wastewater provides a cost-effective alternative, but requires methods for estimating variant proportions in a mixed sample. RESULTS: We propose a new method based on a probabilistic model of sequencing reads, capturing sequence diversity present within individual variants, as well as sequencing errors. The algorithm is implemented in an open source Python program called VirPool. We evaluate the accuracy of VirPool on several simulated and real sequencing data sets from both Illumina and nanopore sequencing platforms, including wastewater samples from Austria and France monitoring the onset of the Alpha variant. CONCLUSIONS: VirPool is a versatile tool for wastewater and other mixed-sample analysis that can handle both short- and long-read sequencing data. Our approach does not require pre-selection of characteristic mutations for variant profiles, it is able to use the entire length of reads instead of just the most informative positions, and can also capture haplotype dependencies within a single read.
Affiliation(s)
- Askar Gafurov
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Andrej Baláž
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Fabian Amman
  - CeMM Research Center for Molecular Medicine, Austrian Academy of Sciences, Lazarettgasse 14 AKH BT 25.3, 1090 Vienna, Austria
  - Institute of Hygiene and Applied Immunology, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Kinderspitalsgasse 15, Vienna, 1090 Austria
- Kristína Boršová
  - Biomedical Research Center, Slovak Academy of Sciences, Bratislava, Slovakia
- Viktória Čabanová
  - Biomedical Research Center, Slovak Academy of Sciences, Bratislava, Slovakia
- Boris Klempa
  - Biomedical Research Center, Slovak Academy of Sciences, Bratislava, Slovakia
- Andreas Bergthaler
  - CeMM Research Center for Molecular Medicine, Austrian Academy of Sciences, Lazarettgasse 14 AKH BT 25.3, 1090 Vienna, Austria
  - Institute of Hygiene and Applied Immunology, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Kinderspitalsgasse 15, Vienna, 1090 Austria
- Tomáš Vinař
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Broňa Brejová
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
4
Jiao X, Imamichi H, Sherman BT, Nahar R, Dewar RL, Lane HC, Imamichi T, Chang W. QuasiSeq: profiling viral quasispecies via self-tuning spectral clustering with PacBio long sequencing reads. Bioinformatics 2022; 38:3192-3199. [PMID: 35532087 PMCID: PMC9890302 DOI: 10.1093/bioinformatics/btac313] [Received: 11/14/2021] [Revised: 04/27/2022] [Accepted: 05/04/2022] [Indexed: 02/04/2023]
Abstract
MOTIVATION: The existence of quasispecies in the viral population causes difficulties for disease prevention and treatment. High-throughput sequencing provides opportunity to determine rare quasispecies and long sequencing reads covering full genomes reduce quasispecies determination to a clustering problem. The challenge is high similarity of quasispecies and high error rate of long sequencing reads. RESULTS: We developed QuasiSeq using a novel signature-based self-tuning clustering method, SigClust, to profile viral mixtures with high accuracy and sensitivity. QuasiSeq can correctly identify quasispecies even using low-quality sequencing reads (accuracy <80%) and produce quasispecies sequences with high accuracy (≥99.55%). Using high-quality circular consensus sequencing reads, QuasiSeq can produce quasispecies sequences with 100% accuracy. QuasiSeq has higher sensitivity and specificity than similar published software. Moreover, the requirement of the computational resource can be controlled by the size of the signature, which makes it possible to handle big sequencing data for rare quasispecies discovery. Furthermore, parallel computation is implemented to process the clusters and further reduce the runtime. Finally, we developed a web interface for the QuasiSeq workflow with simple parameter settings based on the quality of sequencing data, making it easy to use for users without advanced data science skills. AVAILABILITY AND IMPLEMENTATION: QuasiSeq is open source and freely available at https://github.com/LHRI-Bioinformatics/QuasiSeq. The current release (v1.0.0) is archived and available at https://zenodo.org/badge/latestdoi/340494542. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaoli Jiao
  - Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- Hiromi Imamichi
  - Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA
- Brad T Sherman
  - Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- Rishub Nahar
  - Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- Robin L Dewar
  - Virus Isolation and Serology Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- H Clifford Lane
  - Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA
- Tomozumi Imamichi
  - Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- Weizhong Chang
  - Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
5
Liao H, Cai D, Sun Y. VirStrain: a strain identification tool for RNA viruses. Genome Biol 2022; 23:38. [PMID: 35101081 PMCID: PMC8801933 DOI: 10.1186/s13059-022-02609-x] [Received: 03/21/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022]
Abstract
Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool taking short reads as input with viral strain composition as output. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy.
Affiliation(s)
- Herui Liao
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
6
Wu Y, Zheng S, Liu T, Liu X, Tang H, He Y, Xu W, Li L, Yu W, Xing K, Xia X. Viral Haplotypes in COVID-19 Patients Associated With Prolonged Viral Shedding. Front Cell Infect Microbiol 2021; 11:715143. [PMID: 34858866 PMCID: PMC8631178 DOI: 10.3389/fcimb.2021.715143] [Received: 05/26/2021] [Accepted: 09/13/2021] [Indexed: 11/13/2022]
Abstract
Background: Recently, more patients who recovered from the novel coronavirus disease 2019 (COVID-19) may later test positive for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) again using reverse transcription-polymerase chain reaction (RT-PCR) testing. Even though it is still controversial about the possible explanation for clinical cases of long-term viral shedding, it remains unclear whether the persistent viral shedding means re-infection or recurrence. Methods: Specimens were collected from three COVID-19-confirmed patients, and whole-genome sequencing was performed on these clinical specimens during their first hospital admission with a high viral load of SARS-CoV-2. Laboratory tests were examined and analyzed throughout the whole course of the disease. Phylogenetic analysis was carried out for SARS-CoV-2 haplotypes. Results: We found haplotypes of SARS-CoV-2 co-infection in two COVID-19 patients (YW01 and YW03) with a long period of hospitalization. However, only one haplotype was observed in the other patient with chronic lymphocytic leukemia (YW02), which was verified as one kind of viral haplotype. Patients YW01 and YW02 were admitted to the hospital after being infected with COVID-19 as members of a family cluster, but they had different haplotype characteristics in the early stage of infection; YW01 and YW03 were from different infection sources; however, similar haplotypes were found together. Conclusion: These findings show that haplotype diversity of SARS-CoV-2 may result in viral adaptation for persistent shedding in multiple recurrences of COVID-19 patients, who met the discharge requirement. However, the correlation between haplotype diversity of SARS-CoV-2 virus and immune status is not absolute. It showed important implications for the clinical management strategies for COVID-19 patients with long-term hospitalization or cases of recurrence.
Affiliation(s)
- Yingping Wu
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Shufa Zheng
  - State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Centre for Infectious Diseases, Collaborative Innovation Centre for Diagnosis and Treatment of Infectious Diseases, First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Tian Liu
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Xueke Liu
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Huina Tang
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Yutong He
  - School of Life Sciences, Guangzhou University, Guangzhou, China
- Wei Xu
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Lele Li
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Wenxu Yu
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
- Ke Xing
  - School of Life Sciences, Guangzhou University, Guangzhou, China
- Xiaoping Xia
  - Fourth Affiliated Hospital, College of Medicine, Zhejiang University, Jinhua, China
7
Melnyk A, Mohebbi F, Knyazev S, Sahoo B, Hosseini R, Skums P, Zelikovsky A, Patterson M. From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering. J Comput Biol 2021; 28:1113-1129. [PMID: 34698508 DOI: 10.1089/cmb.2021.0302] [Indexed: 12/24/2022]
Abstract
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus as never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
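The idea of scoring a clustering by its entropy can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes one common definition: the Shannon entropy of each alignment column within a cluster, summed over columns and averaged over clusters weighted by cluster size. The function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def cluster_entropy(seqs):
    """Sum of column entropies over all positions of equal-length sequences."""
    n_pos = len(seqs[0])
    return sum(column_entropy([s[j] for s in seqs]) for j in range(n_pos))

def clustering_entropy(clusters):
    """Size-weighted mean of the per-cluster entropies (assumed definition)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Two homogeneous clusters score 0 bits; one mixed cluster scores higher.
separated = [["ACGT", "ACGT"], ["TGCA", "TGCA"]]
mixed = [["ACGT", "ACGT", "TGCA", "TGCA"]]
print(clustering_entropy(separated), clustering_entropy(mixed))
```

Under this definition a lower value means that sequences within each cluster agree more closely, which matches the sense in which the abstract reports lower entropies as better.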
Affiliation(s)
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Fatemeh Mohebbi
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Bikram Sahoo
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Roya Hosseini
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
  - World-Class Research Center "Digital Biodesign and Personalized Healthcare," I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
8
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022]
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that still remains open. Here we propose CliqueSNV based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables identifying minority haplotypes with the frequency below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short and long bases sequencing platforms.
Affiliation(s)
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
  - International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
9
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de Bruijn graph. Bioinformatics 2021; 37:473-481. [PMID: 32926162 DOI: 10.1093/bioinformatics/btaa782] [Received: 06/13/2019] [Revised: 03/11/2020] [Accepted: 09/02/2020] [Indexed: 12/28/2022]
Abstract
MOTIVATION: RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. RESULTS: We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. AVAILABILITY AND IMPLEMENTATION: viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
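For readers unfamiliar with the data structure named in this abstract, the following minimal sketch builds an ordinary (unpaired) de Bruijn graph from reads. It is not viaDBG, which is written in C++ and additionally uses error correction and paired-end information; the function names and the toy reads are assumptions for illustration only.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer in a
    read adds one edge from its prefix to its suffix, with a multiplicity."""
    edges = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]][kmer[1:]] += 1
    return edges

def branching_nodes(edges):
    """Nodes with more than one successor, where candidate strains diverge."""
    return sorted(node for node, succ in edges.items() if len(succ) > 1)

# Two toy reads that differ at a single base.
reads = ["ACGTAC", "ACGTTC"]
graph = build_de_bruijn(reads, k=3)
print(branching_nodes(graph))  # ['GT']
```

In this toy case the single differing base produces one branching node; with real data, sequencing errors create the same kind of branch, which is why error correction before graph construction matters for the approach described above.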
Affiliation(s)
- Borja Freire
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Susana Ladra
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Jose R Paramá
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland
10
Cao C, He J, Mak L, Perera D, Kwok D, Wang J, Li M, Mourier T, Gavriliuc S, Greenberg M, Morrissy AS, Sycuro LK, Yang G, Jeffares DC, Long Q. Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding. Mol Biol Evol 2021; 38:2660-2672. [PMID: 33547786 PMCID: PMC8136496 DOI: 10.1093/molbev/msab037] [Indexed: 12/30/2022]
Abstract
DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrated that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single position-based analysis.
Affiliation(s)
- Chen Cao
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Jingni He
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Cardiology, Xiangya Hospital, Central South University, Changsha, China
- Lauren Mak
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; present address: Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY, USA
- Deshan Perera
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Devin Kwok
- Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- Jia Wang
- Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
- Minghao Li
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Tobias Mourier
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Stefan Gavriliuc
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Matthew Greenberg
- Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- A Sorana Morrissy
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Laura K Sycuro
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Microbiology, Immunology, and Infectious Diseases, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
- Guang Yang
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
- Daniel C Jeffares
- Department of Biology, York Biomedical Research Institute, University of York, York, United Kingdom
- Quan Long
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada; Department of Medical Genetics, University of Calgary, Calgary, AB, Canada; Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, AB, Canada (corresponding author)
|
11
|
Cao C, Greenberg M, Long Q. WgLink: reconstructing whole-genome viral haplotypes using L0+L1-regularization. Bioinformatics 2021; 37:2744-2746. [PMID: 33532820 DOI: 10.1093/bioinformatics/btab076] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/14/2020] [Revised: 12/23/2020] [Accepted: 01/29/2021] [Indexed: 12/24/2022] Open
Abstract
Summary
Many tools can reconstruct viral sequences based on next generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0+L1-regularized regression synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools both on simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours.
Availability
Source code and binaries are freely available at https://github.com/theLongLab/wglink.
Supplementary information
Supplementary data are available at Bioinformatics online.
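The regularized-regression idea in the abstract above can be illustrated with a small sketch. This is not WgLink's implementation: the abstract does not give the objective or the solver, so the code below assumes a simplified model in which observed variant allele frequencies are a non-negative mixture of candidate haplotypes, uses a proximal gradient step for the L1 penalty, and substitutes a "keep the largest k" cutoff for the L0 penalty. The physical-linkage term is omitted, and all names (`estimate_frequencies`, `candidates`, `observed_vaf`, `max_haplotypes`) are hypothetical.

```python
def estimate_frequencies(candidates, observed_vaf, l1=0.01, max_haplotypes=2,
                         step=0.1, iterations=2000):
    """Fit non-negative haplotype frequencies to per-site variant allele frequencies.

    candidates: list of haplotypes, each a list of 0/1 alleles per variant site.
    observed_vaf: observed alternate-allele frequency at each site.
    """
    n_hap, n_site = len(candidates), len(observed_vaf)
    freqs = [1.0 / n_hap] * n_hap
    for _ in range(iterations):
        # Residual between predicted and observed allele frequency at each site.
        residual = [
            sum(freqs[h] * candidates[h][j] for h in range(n_hap)) - observed_vaf[j]
            for j in range(n_site)
        ]
        # Proximal gradient step: the L1 penalty shrinks every frequency,
        # and clipping at zero keeps the frequencies non-negative.
        freqs = [
            max(0.0, freqs[h] - step * (
                sum(residual[j] * candidates[h][j] for j in range(n_site)) + l1))
            for h in range(n_hap)
        ]
    # Stand-in for the L0 penalty (an assumption): keep only the largest frequencies.
    keep = sorted(range(n_hap), key=lambda h: freqs[h], reverse=True)[:max_haplotypes]
    total = sum(freqs[h] for h in keep)
    if total == 0.0:
        return [0.0] * n_hap
    # Renormalize the surviving haplotypes so that their frequencies sum to one.
    return [freqs[h] / total if h in keep else 0.0 for h in range(n_hap)]


# Toy data: two true haplotypes (70% and 30%) plus one spurious candidate.
candidates = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
observed_vaf = [0.7, 0.7, 0.3, 0.3]
frequencies = estimate_frequencies(candidates, observed_vaf)
```

On this toy input the first two candidates receive frequencies close to 0.7 and 0.3, and the spurious third candidate is dropped. The step size of 0.1 is chosen for this small example and would need adjusting for larger candidate sets.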
Affiliation(s)
- Chen Cao
- Department of Biochemistry & Molecular Biology, Alberta Children's Hospital Research Institute, Calgary, AB, T2N 4N1, Canada
- Matthew Greenberg
- Department of Mathematics & Statistics, Calgary, AB, T2N 4N1, Canada
- Quan Long
- Department of Biochemistry & Molecular Biology, Alberta Children's Hospital Research Institute, Calgary, AB, T2N 4N1, Canada; Department of Mathematics & Statistics, Calgary, AB, T2N 4N1, Canada; Department of Medical Genetics, Hotchkiss Brain Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada
|
12
|
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371 PMCID: PMC8485218 DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023] Open
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
|
13
|
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infect Genet Evol 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed the other haplotype reconstruction tools, with the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
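The precision and recall measures named in the abstract above can be made concrete with a minimal sketch. It assumes that a predicted haplotype counts as correct only when it matches a true haplotype exactly; the abstract does not state the matching criterion the authors used, and the function name and toy sequences are hypothetical.

```python
def precision_recall(true_haplotypes, predicted_haplotypes):
    """Exact-match precision and recall between two haplotype collections."""
    truth = set(true_haplotypes)
    predicted = set(predicted_haplotypes)
    true_positives = len(truth & predicted)
    # Precision: fraction of predicted haplotypes that are real.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of real haplotypes that were recovered.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example: four true haplotypes, three predictions, two of them correct.
truth = ["ACGT", "ACGA", "TCGA", "TCGT"]
predicted = ["ACGT", "TCGA", "GCGA"]
precision, recall = precision_recall(truth, predicted)
```

Here two of the three predictions are correct and two of the four true haplotypes are recovered, so recall is lower than precision. A caller that underestimates the true number of haplotypes, as the abstract reports for most tools, would show this kind of pattern.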
Affiliation(s)
- Anton Eliseev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Pavel Avdeyev
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
|
14
|
González R, Wu B, Li X, Martínez F, Elena SF. Mutagenesis Scanning Uncovers Evolutionary Constraints on Tobacco Etch Potyvirus Membrane-Associated 6K2 Protein. Genome Biol Evol 2019; 11:1207-1222. [PMID: 30918938 PMCID: PMC6482416 DOI: 10.1093/gbe/evz069] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Accepted: 03/26/2019] [Indexed: 12/30/2022] Open
Abstract
The high mutation rate of RNA viruses is a double-edged sword. On the one hand, most mutations jeopardize protein function; on the other hand, mutations are needed to fuel adaptation. The relevant question then is the ratio between beneficial and deleterious mutations. To evaluate this ratio, we created a mutant library of the 6K2 gene of tobacco etch potyvirus that contains every possible single-nucleotide substitution. The 6K2 protein anchors the virus replication complex to the network of endoplasmic reticulum membranes. The library was inoculated into the natural host Nicotiana tabacum, allowing competition among all these mutants and selection of those that are potentially viable. We identified 11 nonsynonymous mutations that remain in the viral population at measurable frequencies and evaluated their fitness. Some had fitness values higher than the wild-type and some were deleterious. The effect of these mutations on the structure, transmembrane properties, and function of 6K2 was evaluated in silico. In parallel, the effect of these mutations on infectivity, virus accumulation, symptom development, and subcellular localization was evaluated in the natural host. The α-helix H1 in the N-terminal part of 6K2 turned out to be under purifying selection, while most observed mutations affect the link between transmembrane α-helices H2 and H3, fusing them into a longer helix and increasing its rigidity. In general, these changes are associated with higher within-host fitness and development of milder or no symptoms. This finding suggests that in nature selection upon 6K2 may result from a tradeoff between within-host accumulation and severity of symptoms.
Affiliation(s)
- Rubén González
- Instituto de Biología Integrativa de Sistemas (I2SysBio), CSIC-Universitat de València, València, Spain
- Beilei Wu
- Instituto de Biología Molecular y Celular de Plantas (IBMCP), CSIC-Universitat Politècnica de València, València, Spain; Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
- Xianghua Li
- Systems Biology Program, Centre de Regulació Genòmica (CRG), The Barcelona Institute of Science and Technology, PRBB, Barcelona, Spain
- Fernando Martínez
- Instituto de Biología Molecular y Celular de Plantas (IBMCP), CSIC-Universitat Politècnica de València, València, Spain
- Santiago F Elena
- Instituto de Biología Integrativa de Sistemas (I2SysBio), CSIC-Universitat de València, València, Spain; The Santa Fe Institute, Santa Fe, New Mexico
|
15
|
Baaijens JA, Van der Roest B, Köster J, Stougie L, Schönhuth A. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics 2019; 35:5086-5094. [DOI: 10.1093/bioinformatics/btz443] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 08/11/2018] [Revised: 04/17/2019] [Accepted: 05/27/2019] [Indexed: 11/14/2022] Open
Abstract
Motivation
Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly is the reconstruction of strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains is an important step for various treatment-related reasons. Reference genome independent (‘de novo’) approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. The remaining challenge is to reconstruct full-length haplotypes together with their abundances from such contigs.
Results
We present Virus-VG as a de novo approach to viral haplotype reconstruction from preassembled contigs. Our method constructs a variation graph from the short input contigs without making use of a reference genome. Then, to obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of compatibility with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Benchmarking experiments on challenging simulated and real datasets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates compared to the state-of-the-art viral quasispecies assemblers.
Availability and implementation
Virus-VG is freely available at https://bitbucket.org/jbaaijens/virus-vg.
Supplementary information
Supplementary data are available at Bioinformatics online.
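The path-selection step described in the Results above can be sketched on a toy graph. This is an illustration under stated assumptions, not the Virus-VG implementation: the graph is assumed to be acyclic with a single source and sink, the objective is assumed to be the sum of absolute differences between node coverage and the summed abundance of the paths through each node, and no optimizer is included. The function names, node labels, and coverage values are hypothetical.

```python
def enumerate_paths(edges, source, sink):
    """List all source-to-sink paths of an acyclic graph, in sorted order."""
    paths = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            paths.append(path)
            continue
        # Extend the partial path by every successor of its last node.
        for successor in edges.get(path[-1], []):
            stack.append(path + [successor])
    return sorted(paths)


def coverage_error(paths, abundances, node_coverage):
    """Sum over nodes of |coverage explained by the paths - observed coverage|."""
    explained = {node: 0.0 for node in node_coverage}
    for path, abundance in zip(paths, abundances):
        for node in path:
            explained[node] += abundance
    return sum(abs(explained[node] - node_coverage[node]) for node in node_coverage)


# Toy variation graph: one bubble (a vs. b) between a shared start and end.
edges = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
node_coverage = {"s": 100.0, "a": 70.0, "b": 30.0, "t": 100.0}

paths = enumerate_paths(edges, "s", "t")
good_fit = coverage_error(paths, [70.0, 30.0], node_coverage)
poor_fit = coverage_error(paths, [50.0, 50.0], node_coverage)
```

The abundance assignment that matches the coverage split across the bubble gives a lower error than an even split, which is the kind of compatibility that a minimization over path abundances would reward.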
Affiliation(s)
- Jasmijn A Baaijens
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Johannes Köster
- Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Medical Oncology, Dana Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Leen Stougie
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Department of Econometrics and Operations Research, Vrije Universiteit, Amsterdam, Netherlands
- INRIA-Erable, Grenoble, France
- Alexander Schönhuth
- Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- INRIA-Erable, Grenoble, France
- Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
|