1
Wattanasombat S, Tongjai S. Easing genomic surveillance: A comprehensive performance evaluation of long-read assemblers across multi-strain mixture data of HIV-1 and other pathogenic viruses for constructing a user-friendly bioinformatic pipeline. F1000Res 2024; 13:556. [PMID: 38984017 PMCID: PMC11231628 DOI: 10.12688/f1000research.149577.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/14/2024] [Indexed: 07/11/2024] Open
Abstract
Background
Determining the appropriate computational requirements and software performance is essential for efficient genomic surveillance. The lack of standardized benchmarking complicates software selection, especially with limited resources.
Methods
We developed a containerized benchmarking pipeline to evaluate seven long-read assemblers (Canu, GoldRush, MetaFlye, Strainline, HaploDMF, iGDA, and RVHaplo) for viral haplotype reconstruction, using both simulated and experimental Oxford Nanopore sequencing data of HIV-1 and other viruses. Benchmarking was conducted on three computational systems to assess each assembler's performance, utilizing QUAST and BLASTN for quality assessment.
Results
Our findings show that assembler choice significantly impacts assembly time, with CPU and memory usage having minimal effect. Assembler selection also influences the size of the contigs, with a minimum read length of 2,000 nucleotides required for quality assembly. A 4,000-nucleotide read length improves quality further. Canu was efficient among de novo assemblers but not suitable for multi-strain mixtures, while GoldRush produced only consensus assemblies. Strainline and MetaFlye were suitable for metagenomic sequencing data, with Strainline requiring high memory and MetaFlye operable on low-specification machines. Among reference-based assemblers, iGDA had high error rates, RVHaplo showed the best runtime and accuracy but became ineffective with similar sequences, and HaploDMF, utilizing machine learning, had fewer errors with a slightly longer runtime.
Conclusions
The HIV-64148 pipeline, containerized using Docker, facilitates easy deployment and offers flexibility to select from a range of assemblers to match computational systems or study requirements. This tool aids in genome assembly and provides valuable information on HIV-1 sequences, enhancing viral evolution monitoring and understanding.
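The read-length thresholds reported in this abstract (2,000 nucleotides as a minimum, 4,000 for further improvement) can be illustrated with a minimal filtering sketch. This is not code from the HIV-64148 pipeline; the function name `filter_reads_by_length` and the toy reads are hypothetical, and reads are assumed to be held in memory as a dictionary of ID to sequence.

```python
# Hypothetical helper (not part of the cited pipeline): keep only reads
# whose length meets a minimum threshold before assembly.
def filter_reads_by_length(reads, min_length=2000):
    """reads: dict mapping read ID -> nucleotide sequence string."""
    return {rid: seq for rid, seq in reads.items() if len(seq) >= min_length}


# Toy reads of 1,500, 2,000 and 4,200 nucleotides (made-up data).
reads = {
    "read_a": "A" * 1500,
    "read_b": "C" * 2000,
    "read_c": "G" * 4200,
}

kept_2k = filter_reads_by_length(reads, min_length=2000)
kept_4k = filter_reads_by_length(reads, min_length=4000)

print(sorted(kept_2k))  # ['read_b', 'read_c']
print(sorted(kept_4k))  # ['read_c']
```

The stricter 4,000-nucleotide cut-off discards more data, so in practice it would have to be weighed against the remaining coverage.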
Affiliation(s)
- Sara Wattanasombat
  - Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand
- Siripong Tongjai
  - Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand
2
Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI, Carrara M, Chen C, Dondi A, Dragan M, Dreifuss D, John A, Langer B, Okoniewski M, du Plessis L, Schmitt U, Singer F, Stadler T, Beerenwinkel N. V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation. Gigascience 2024; 13:giae065. [PMID: 39347649 PMCID: PMC11440432 DOI: 10.1093/gigascience/giae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 06/11/2024] [Accepted: 08/13/2024] [Indexed: 10/01/2024] Open
Abstract
The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies pose a set of challenges for computational data analysis workflows, including rigorous quality control, scaling to large sample sizes, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting 2 large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.
Affiliation(s)
- Lara Fuhrmann
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Kim Philipp Jablonski
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Ivan Topolsky
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Aashil A Batavia
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Nico Borgsmüller
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Pelin Icer Baykal
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Matteo Carrara
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Chaoran Chen
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Arthur Dondi
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Monica Dragan
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- David Dreifuss
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Anika John
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Benjamin Langer
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
- Louis du Plessis
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Uwe Schmitt
  - Scientific IT Services, ETH Zurich, Zurich 8092, Switzerland
- Franziska Singer
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Tanja Stadler
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
3
Evans AB, Winkler CW, Anzick SL, Ricklefs SM, Sturdevant DE, Peterson KE. Zika virus diversity in mice is maintained during early vertical transmission from placenta to fetus, but reduced in fetal bodies and brains at late stages of infection. PLoS Negl Trop Dis 2023; 17:e0011657. [PMID: 37796973 PMCID: PMC10581492 DOI: 10.1371/journal.pntd.0011657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 10/17/2023] [Accepted: 09/11/2023] [Indexed: 10/07/2023] Open
Abstract
Since emerging in French Polynesia and Brazil in the 2010s, Zika virus (ZIKV) has been associated with fetal congenital disease. Previous studies have compared ancestral and epidemic ZIKV strains to identify strain differences that may contribute to vertical transmission and fetal disease. However, within-host diversity in ZIKV populations during vertical transmission has not been well studied. Here, we used the established anti-interferon treated Rag1-/- mouse model of ZIKV vertical transmission to compare genomic variation within ZIKV populations in matched placentas, fetal bodies, and fetal brains via RNASeq. At early stages of vertical transmission, the ZIKV populations in the matched placentas and fetal bodies were similar. Most ZIKV single nucleotide variants were present in both tissues, indicating little to no restriction in transmission of ZIKV variants from placenta to fetus. In contrast, at later stages of fetal infection there was a sharp reduction in ZIKV diversity in fetal bodies and fetal brains. All fetal brain ZIKV populations were comprised of one of two haplotypes, containing either a single variant or three variants together, as largely homogenous populations. In most cases, the dominant haplotype present in the fetal brain was also the dominant haplotype present in the matched fetal body. However, in two of ten fetal brains the dominant ZIKV haplotype was undetectable or present at low frequencies in the matched placenta and fetal body ZIKV populations, suggesting evidence of a strict selective bottleneck and possible selection for certain variants during neuroinvasion of ZIKV into fetal brains.
Affiliation(s)
- Alyssa B. Evans
  - Laboratory of Neurological Infections and Immunity, Neuroimmunology Section; Rocky Mountain Laboratories; National Institute of Allergy and Infectious Diseases (NIAID); National Institutes of Health (NIH); Hamilton, Montana, United States of America
- Clayton W. Winkler
  - Laboratory of Neurological Infections and Immunity, Neuroimmunology Section; Rocky Mountain Laboratories; National Institute of Allergy and Infectious Diseases (NIAID); National Institutes of Health (NIH); Hamilton, Montana, United States of America
- Sarah L. Anzick
  - Genomics Research Section, Research Technologies Branch; Rocky Mountain Laboratories; National Institute of Allergy and Infectious Diseases (NIAID); National Institutes of Health (NIH); Hamilton, Montana, United States of America
- Stacy M. Ricklefs
  - Genomics Research Section, Research Technologies Branch; Rocky Mountain Laboratories; National Institute of Allergy and Infectious Diseases (NIAID); National Institutes of Health (NIH); Hamilton, Montana, United States of America
- Dan E. Sturdevant
  - Genomics Research Section, Research Technologies Branch; Rocky Mountain Laboratories; National Institute of Allergy and Infectious Diseases (NIAID); National Institutes of Health (NIH); Hamilton, Montana, United States of America
- Karin E. Peterson
  - Laboratory of Neurological Infections and Immunity, Neuroimmunology Section; Rocky Mountain Laboratories; National Institute of Allergy and Infectious Diseases (NIAID); National Institutes of Health (NIH); Hamilton, Montana, United States of America
4
Freire B, Ladra S, Parama JR, Salmela L. ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1550-1562. [PMID: 35853050 DOI: 10.1109/tcbb.2022.3190282] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, solving a min-cost flow over a flow network built for each pair of adjacent vertices based on their paired-end information creates an approximate paired assembly graph with suggested frequency values as edge labels, which is the first frequency estimation. Then, original haplotypes are obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster using at most half of the memory than previous methods, while maintaining, and in some cases outperforming, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than the de Bruijn graph-based approaches.
5
Gregori J, Ibañez-Lligoña M, Quer J. Quantifying In-Host Quasispecies Evolution. Int J Mol Sci 2023; 24:ijms24021301. [PMID: 36674827 PMCID: PMC9867078 DOI: 10.3390/ijms24021301] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/28/2022] [Revised: 01/04/2023] [Accepted: 01/05/2023] [Indexed: 01/12/2023] Open
Abstract
What takes decades, centuries or millennia to happen in a natural ecosystem takes only days, weeks or months in a replicating viral quasispecies in a host, especially when under treatment. Some methods to quantify the evolution of a quasispecies are introduced and discussed, along with simple simulated examples to help in the interpretation and understanding of the results. The proposed methods treat the molecules in a quasispecies as individuals of competing species in an ecosystem, where the haplotypes are the competing species and the ecosystem is the quasispecies in a host, and the evolution of the system is quantified by monitoring changes in haplotype frequencies. The correlation between the proposed indices is also discussed, and the R code used to generate the simulations, the data and the plots is provided. The virtues of the proposed indices are finally shown on a clinical case.
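The idea of quantifying evolution through changes in haplotype frequencies can be sketched as follows. The abstract does not name its indices, so this example substitutes a generic measure, the total variation distance between two frequency distributions, purely as an illustration; it is an assumption and not one of the authors' proposed indices. The function name `frequency_shift` and the haplotype frequencies are hypothetical.

```python
# Illustrative stand-in (NOT an index from the cited paper): total variation
# distance between haplotype frequency distributions at two time points.
def frequency_shift(before, after):
    """before/after: dicts mapping haplotype ID -> relative frequency (sums to 1)."""
    haplotypes = set(before) | set(after)
    return 0.5 * sum(
        abs(before.get(h, 0.0) - after.get(h, 0.0)) for h in haplotypes
    )


# Made-up frequencies: hap1 declines while a new haplotype, hap3, emerges.
baseline = {"hap1": 0.75, "hap2": 0.25}
follow_up = {"hap1": 0.25, "hap2": 0.25, "hap3": 0.5}

shift = frequency_shift(baseline, follow_up)
print(shift)  # 0.5
```

A value of 0 means the two samples have identical haplotype composition, and 1 means they share no haplotypes at all.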
Affiliation(s)
- Josep Gregori
  - Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Marta Ibañez-Lligoña
  - Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
  - Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
- Josep Quer
  - Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
  - Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
6
Baaijens JA, Zulli A, Ott IM, Nika I, van der Lugt MJ, Petrone ME, Alpert T, Fauver JR, Kalinich CC, Vogels CBF, Breban MI, Duvallet C, McElroy KA, Ghaeli N, Imakaev M, Mckenzie-Bennett MF, Robison K, Plocik A, Schilling R, Pierson M, Littlefield R, Spencer ML, Simen BB, Hanage WP, Grubaugh ND, Peccia J, Baym M. Lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques. Genome Biol 2022; 23:236. [PMID: 36348471 PMCID: PMC9643916 DOI: 10.1186/s13059-022-02805-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 10/03/2021] [Accepted: 10/25/2022] [Indexed: 11/09/2022] Open
Abstract
Effectively monitoring the spread of SARS-CoV-2 mutants is essential to efforts to counter the ongoing pandemic. Predicting lineage abundance from wastewater, however, is technically challenging. We show that by sequencing SARS-CoV-2 RNA in wastewater and applying algorithms initially used for transcriptome quantification, we can estimate lineage abundance in wastewater samples. We find high variability in signal among individual samples, but the overall trends match those observed from sequencing clinical samples. Thus, while clinical sequencing remains a more sensitive technique for population surveillance, wastewater sequencing can be used to monitor trends in mutant prevalence in situations where clinical sequencing is unavailable.
Affiliation(s)
- Jasmijn A Baaijens
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Alessandro Zulli
  - Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Isabel M Ott
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Ioanna Nika
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mart J van der Lugt
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mary E Petrone
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Tara Alpert
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Joseph R Fauver
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  - Department of Epidemiology, University of Nebraska Medical Center, Omaha, NE, USA
- Chaney C Kalinich
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Chantal B F Vogels
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Mallery I Breban
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- William P Hanage
  - Center for Communicable Disease Dynamics and Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Nathan D Grubaugh
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Jordan Peccia
  - Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Michael Baym
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
7
Venturini C, Pang J, Tamuri AU, Roy S, Atkinson C, Griffiths P, Breuer J, Goldstein RA. Haplotype assignment of longitudinal viral deep sequencing data using covariation of variant frequencies. Virus Evol 2022; 8:veac093. [PMID: 36478783 PMCID: PMC9719071 DOI: 10.1093/ve/veac093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2021] [Revised: 09/15/2022] [Accepted: 10/05/2022] [Indexed: 11/13/2022] Open
Abstract
Longitudinal deep sequencing of viruses can provide detailed information about intra-host evolutionary dynamics including how viruses interact with and transmit between hosts. Many analyses require haplotype reconstruction, identifying which variants are co-located on the same genomic element. Most current methods to perform this reconstruction are based on a high density of variants and cannot perform this reconstruction for slowly evolving viruses. We present a new approach, HaROLD (HAplotype Reconstruction Of Longitudinal Deep sequencing data), which performs this reconstruction based on identifying co-varying variant frequencies using a probabilistic framework. We illustrate HaROLD on both RNA and DNA viruses with synthetic Illumina paired read data created from mixed human cytomegalovirus (HCMV) and norovirus genomes, and clinical datasets of HCMV and norovirus samples, demonstrating high accuracy, especially when longitudinal samples are available.
Affiliation(s)
- Cristina Venturini
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
- Juanita Pang
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK
- Asif U Tamuri
  - Research IT Services, University College London, London WC1E 6BT, UK
- Sunando Roy
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
- Claire Atkinson
  - Institute for Immunity and Transplantation, University College London, London NW3 2PP, UK
- Paul Griffiths
  - Institute for Immunity and Transplantation, University College London, London NW3 2PP, UK
- Judith Breuer
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
  - Great Ormond Street Hospital for Children, London WC1N 3JH, UK
- Richard A Goldstein
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK
  - Infection, Immunity, Inflammation, Institute of Child Health, University College London, London WC1E 6BT, UK
8
Moudi M, Vahidi Mehrjardi MY, Kalantar SM, Taheri M, Metanat Z, Ghasemi N, Dehghani M. Co-segregation of variant NSUN2 Lue198Arg among Iranian family with intellectual disability: a case report. Egyptian Journal of Medical Human Genetics 2022. [DOI: 10.1186/s43042-022-00293-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
Background
Intellectual disability is characterized by impairments in adaptive behavior and cognitive functioning manifested during the developmental period. Since disabilities are heterogeneous, variant analysis can help to confirm and accurately diagnose intellectual disability in children. Previous reports indicate that bi-allelic variants of the NSUN2 gene cause a group of neurological disorders, including non-syndromic autosomal recessive intellectual disability (NS-ARID), Dubowitz syndrome, and familial restrictive cardiomyopathy 1 (RCM1). We report on a consanguineous family with three siblings diagnosed with intellectual disability.
Case presentation
A 7-year-old female was referred to Ali-Asghar hospital, Zahedan, Iran, with clinical manifestations comprising moderate intellectual disability, ptosis, long face, and short stature. Chromosome banding, metabolic testing, and magnetic resonance imaging examinations revealed no abnormalities. Accordingly, other affected siblings born of the same parents were considered. Whole-exome sequencing (WES) was performed on the proband to screen for NS-ARID variants. The analysis identified a variant of uncertain significance (NM_017755.6: c.593T>G) in the NSUN2 gene in the proband. This variant was confirmed through Sanger sequencing of the affected and unaffected family members. In addition, the computational results showed that the L198R exchange could change the interaction between the wild-type residue and other residues in the protein. The affected patients with NS-ARID had similar clinical characteristics and genetic abnormalities.
Conclusion
Taken together, we describe the variant in three Iranian siblings; high-throughput sequencing technologies should help to identify further variants involved in the disease.
9
Cai D, Sun Y. Reconstructing viral haplotypes using long reads. Bioinformatics 2022; 38:2127-2134. [PMID: 35157018 DOI: 10.1093/bioinformatics/btac089] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/11/2021] [Revised: 01/19/2022] [Accepted: 02/08/2022] [Indexed: 02/03/2023] Open
Abstract
Motivation
Most RNA viruses lack strict proofreading during replication. Coupled with a high replication rate, some RNA viruses can form a virus population containing a group of genetically related but different haplotypes. Characterizing the haplotype composition in a virus population is thus important to understand viruses' evolution. Many attempts have been made to reconstruct viral haplotypes using next-generation sequencing (NGS) reads. However, the short length of NGS reads cannot cover distant single-nucleotide variants, making it difficult to reconstruct complete or near-complete haplotypes. Given the fast developments of third-generation sequencing technologies, a new opportunity has arisen for reconstructing full-length haplotypes with long reads.
Results
In this work, we developed a new tool, RVHaplo, to reconstruct haplotypes for known viruses from long reads. We tested it rigorously on both simulated and real viral sequencing data and compared it against other popular haplotype reconstruction tools. The results demonstrated that RVHaplo outperforms the state-of-the-art tools for viral haplotype reconstruction from long reads. In particular, RVHaplo can reconstruct rare (1% abundance) haplotypes that other tools usually miss.
Availability and implementation
The source code and the documentation of RVHaplo are available at https://github.com/dhcai21/RVHaplo.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
10
Liao H, Cai D, Sun Y. VirStrain: a strain identification tool for RNA viruses. Genome Biol 2022; 23:38. [PMID: 35101081 PMCID: PMC8801933 DOI: 10.1186/s13059-022-02609-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/21/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022] Open
Abstract
Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool taking short reads as input with viral strain composition as output. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy.
Affiliation(s)
- Herui Liao
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
11
Moudi M, Vahidi Mehrjardi MY, Hozhabri H, Metanat Z, Kalantar SM, Taheri M, Ghasemi N, Dehghani M. Novel variants underlying autosomal recessive neurodevelopmental disorders with intellectual disability in Iranian consanguineous families. J Clin Lab Anal 2022; 36:e24241. [PMID: 35019165 PMCID: PMC8842163 DOI: 10.1002/jcla.24241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Revised: 12/25/2021] [Accepted: 12/27/2021] [Indexed: 11/17/2022] Open
Abstract
Background
Intellectual disability (ID) is a heterogeneous group of neurodevelopmental disorders characterized by significant impairment in intellectual and adaptive functioning with onset during the developmental period. Whole-exome sequencing (WES)-based studies in consanguineous families with individuals affected by ID have shown a high burden of relevant variants. So far, over 700 genes have been reported in syndromic and non-syndromic ID. However, the genetic causes in more than 50% of ID patients remain unclear.
Methods
Whole-exome sequencing was applied to investigate variants associated with ID in ten patients from five Iranian consanguineous families diagnosed with autosomal recessive neurodevelopmental disorders with intellectual disability; Sanger sequencing and in silico analysis were then performed to confirm the causative mutation in the probands. Most patients presented with moderate-to-severe intellectual disability, developmental delay, seizure, speech problems, a high level of lactate, and onset before 10 years.
Results
Filtering the data identified by WES revealed two novel homozygous missense variants in the FBXO31 and TIMM50 genes and one previously reported mutation in the CEP290 gene in the probands. Sanger sequencing confirmed the presence of the homozygous variants in the TIMM50 and FBXO31 genes in six patients and two affected siblings in their respective families. Our computational results predicted that the variants are located in regions conserved across different species and affect protein stability.
Conclusion
Hence, we provide evidence for the pathogenicity of two novel variants in the patients, which will expand our knowledge about potential mutations involved in this heterogeneous disease.
Affiliation(s)
- Mahdiyeh Moudi
  - Department of Genetics, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  - Genetics of Non-Communicable Disease Research Center, Zahedan University of Medical Sciences, Zahedan, Iran
- Zahra Metanat
  - Department of Genetics, School of Medicine, Zahedan University of Medical Sciences, Zahedan, Iran
- Seyed Mehdi Kalantar
  - Department of Genetics, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
- Mohsen Taheri
  - Genetics of Non-Communicable Disease Research Center, Zahedan University of Medical Sciences, Zahedan, Iran
  - Department of Genetics, School of Medicine, Zahedan University of Medical Sciences, Zahedan, Iran
- Nasrin Ghasemi
  - Abortion Research Centre, Yazd Reproductive Sciences Institute, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
- Mohammadreza Dehghani
  - Medical Genetics Research Center, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
12
Van Poelvoorde LAE, Delcourt T, Coucke W, Herman P, De Keersmaecker SCJ, Saelens X, Roosens NHC, Vanneste K. Strategy and Performance Evaluation of Low-Frequency Variant Calling for SARS-CoV-2 Using Targeted Deep Illumina Sequencing. Front Microbiol 2021; 12:747458. [PMID: 34721349 PMCID: PMC8548777 DOI: 10.3389/fmicb.2021.747458] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/26/2021] [Accepted: 09/21/2021] [Indexed: 12/24/2022] Open
Abstract
The ongoing COVID-19 pandemic, caused by SARS-CoV-2, constitutes a tremendous global health issue. Continuous monitoring of the virus has become a cornerstone to make rational decisions on implementing societal and sanitary measures to curtail the virus spread. Additionally, emerging SARS-CoV-2 variants have increased the need for genomic surveillance to detect particular strains because of their potentially increased transmissibility, pathogenicity and immune escape. Targeted SARS-CoV-2 sequencing of diagnostic and wastewater samples has been explored as an epidemiological surveillance method for the competent authorities. Currently, only the consensus genome sequence of the most abundant strain is taken into consideration for analysis, but multiple variant strains are now circulating in the population. Consequently, in diagnostic samples, potential co-infection(s) by several different variants can occur or quasispecies can develop during an infection in an individual. In wastewater samples, multiple variant strains will often be simultaneously present. Currently, quality criteria are mainly available for constructing the consensus genome sequence, and some guidelines exist for the detection of co-infections and quasispecies in diagnostic samples. The performance of detection and quantification of low-frequency variants using whole genome sequencing (WGS) of SARS-CoV-2 remains largely unknown. Here, we evaluated the detection and quantification of mutations present at low abundances using the mutations defining the SARS-CoV-2 lineage B.1.1.7 (alpha variant) as a case study. Real sequencing data were in silico modified by introducing mutations of interest into raw wild-type sequencing data, or by mixing wild-type and mutant raw sequencing data, to construct mixed samples subjected to WGS using a tiling amplicon-based targeted metagenomics approach and Illumina sequencing. 
As anticipated, higher variation and lower sensitivity were observed at lower coverages and allelic frequencies. We found that detection of all low-frequency variants at an abundance of 10, 5, 3, and 1%, requires at least a sequencing coverage of 250, 500, 1500, and 10,000×, respectively. Although increasing variability of estimated allelic frequencies at decreasing coverages and lower allelic frequencies was observed, its impact on reliable quantification was limited. This study provides a highly sensitive low-frequency variant detection approach, which is publicly available at https://galaxy.sciensano.be, and specific recommendations for minimum sequencing coverages to detect clade-defining mutations at certain allelic frequencies. This approach will be useful to detect and quantify low-frequency variants in both diagnostic (e.g., co-infections and quasispecies) and wastewater [e.g., multiple variants of concern (VOCs)] samples.
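The coverage recommendations in this abstract can be read as a small lookup table. The sketch below is only an illustration of that reading, not the authors' tool: the names `REPORTED_MIN_COVERAGE` and `min_coverage_for` are hypothetical, and the fallback rule for frequencies between the reported points (use the next lower reported frequency) is an assumption that coverage requirements fall monotonically as abundance rises.

```python
# Coverage values stated in the abstract: allelic frequency (%) -> minimum coverage (x).
REPORTED_MIN_COVERAGE = {10.0: 250, 5.0: 500, 3.0: 1500, 1.0: 10_000}


def min_coverage_for(freq_percent):
    """Return a conservative minimum coverage for a target allelic frequency.

    Assumption (not from the abstract): for a frequency between two reported
    points, the requirement of the next LOWER reported frequency is used.
    Returns None below 1%, where the abstract reports no value.
    """
    eligible = [f for f in REPORTED_MIN_COVERAGE if f <= freq_percent]
    if not eligible:
        return None
    return REPORTED_MIN_COVERAGE[max(eligible)]


if __name__ == "__main__":
    for freq in (10, 4, 1, 0.5):
        print(freq, min_coverage_for(freq))
```

Frequencies under 1% return `None` rather than an extrapolated number, because the abstract gives no recommendation for that range.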
Affiliation(s)
- Laura A. E. Van Poelvoorde
- Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
- Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
- Thomas Delcourt
- Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
- Wim Coucke
- Quality of Laboratories, Sciensano, Brussels, Belgium
- Philippe Herman
- Expertise and Service Provision, Sciensano, Brussels, Belgium
- Xavier Saelens
- Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
- Kevin Vanneste
- Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
13
Gallardo CM, Wang S, Montiel-Garcia DJ, Little SJ, Smith DM, Routh AL, Torbett BE. MrHAMER yields highly accurate single molecule viral sequences enabling analysis of intra-host evolution. Nucleic Acids Res 2021; 49:e70. [PMID: 33849057 PMCID: PMC8266615 DOI: 10.1093/nar/gkab231] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/09/2021] [Revised: 03/12/2021] [Accepted: 03/31/2021] [Indexed: 12/31/2022] Open
Abstract
Technical challenges remain in the sequencing of RNA viruses due to their high intra-host diversity. This bottleneck is particularly pronounced when interrogating long-range co-evolved genetic interactions given the read-length limitations of next-generation sequencing platforms. This has hampered the direct observation of these genetic interactions that code for protein-protein interfaces with relevance in both drug and vaccine development. Here we overcome these technical limitations by developing a nanopore-based long-range viral sequencing pipeline that yields accurate single molecule sequences of circulating virions from clinical samples. We demonstrate its utility in observing the evolution of individual HIV Gag-Pol genomes in response to antiviral pressure. Our pipeline, called Multi-read Hairpin Mediated Error-correction Reaction (MrHAMER), yields >1000s of viral genomes per sample at 99.9% accuracy, maintains the original proportion of sequenced virions present in a complex mixture, and allows the detection of rare viral genomes with their associated mutations present at <1% frequency. This method facilitates scalable investigation of genetic correlates of resistance to both antiviral therapy and immune pressure and enables the identification of novel host-viral and viral-viral interfaces that can be modulated for therapeutic benefit.
Affiliation(s)
- Christian M Gallardo
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA; Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA
- Shiyi Wang
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA; Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA
- Daniel J Montiel-Garcia
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Susan J Little
- Division of Infectious Diseases and Global Public Health, University of California, San Diego, La Jolla, CA, USA
- Davey M Smith
- Division of Infectious Diseases and Global Public Health, University of California, San Diego, La Jolla, CA, USA; Veterans Affairs San Diego Healthcare System, San Diego, CA, USA
- Andrew L Routh
- Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX, USA; Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, TX, USA
- Bruce E Torbett
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA; Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA; Department of Pediatrics, University of Washington School of Medicine, Seattle, WA, USA
14
Bendall ML, Gibson KM, Steiner MC, Rentia U, Pérez-Losada M, Crandall KA. HAPHPIPE: Haplotype Reconstruction and Phylodynamics for Deep Sequencing of Intrahost Viral Populations. Mol Biol Evol 2021; 38:1677-1690. [PMID: 33367849 PMCID: PMC8042772 DOI: 10.1093/molbev/msaa315] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/16/2023] Open
Abstract
Deep sequencing of viral populations using next-generation sequencing (NGS) offers opportunities to understand and investigate evolution, transmission dynamics, and population genetics. Currently, the standard practice for processing NGS data to study viral populations is to summarize all the observed sequences from a sample as a single consensus sequence, thus discarding valuable information about the intrahost viral molecular epidemiology. Furthermore, existing analytical pipelines may only analyze genomic regions involved in drug resistance, thus are not suited for full viral genome analysis. Here, we present HAPHPIPE, a HAplotype and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes. The HAPHPIPE protocol includes modules for quality trimming, error correction, de novo assembly, alignment, and haplotype reconstruction. The resulting consensus sequences, haplotypes, and alignments can be further analyzed using a variety of phylogenetic and population genetic software. HAPHPIPE is designed to provide users with a single pipeline to rapidly analyze sequences from viral populations generated from NGS platforms and provide quality output properly formatted for downstream evolutionary analyses.
Affiliation(s)
- Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Margaret C Steiner
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Uzma Rentia
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
- Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
15
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan, preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that remains open. Here we propose CliqueSNV, which is based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables the identification of minority haplotypes with frequencies below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short- and long-read sequencing platforms.
Affiliation(s)
- Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA; Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA; Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
- International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
- World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia; Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA; World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
16
Fuhrmann L, Jablonski KP, Beerenwinkel N. Quantitative measures of within-host viral genetic diversity. Curr Opin Virol 2021; 49:157-163. [PMID: 34153841 DOI: 10.1016/j.coviro.2021.06.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/30/2021] [Revised: 06/03/2021] [Accepted: 06/07/2021] [Indexed: 12/22/2022]
Abstract
The genetic diversity of virus populations within their hosts is known to influence disease progression, treatment outcome, drug resistance, cell tropism, and transmission risk, and the study of dynamic changes of genetic heterogeneity can provide insights into the evolution of viruses. Several measures to quantify within-host genetic diversity capturing different aspects of diversity patterns in a sample or population are used, based on incidence, relative frequencies, pairwise distances, or phylogenetic trees. Here, we review and compare several of these measures.
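To make the families of measures named in this abstract concrete, here is a minimal sketch with one example from each of three families: haplotype richness (incidence), Shannon entropy (relative frequencies), and mean pairwise Hamming distance (pairwise distances). The choice of these three measures, the function names, and the toy haplotypes are illustrative assumptions; the abstract does not say which specific measures the review compares.

```python
import math
from itertools import combinations


def richness(freqs):
    """Incidence-based: number of haplotypes with non-zero frequency."""
    return sum(1 for p in freqs.values() if p > 0)


def shannon_entropy(freqs):
    """Frequency-based: -sum(p * ln p) over haplotype relative frequencies."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)


def mean_pairwise_distance(freqs):
    """Distance-based: expected Hamming distance between two haplotypes
    drawn independently from the population (equal-length sequences assumed)."""
    total = 0.0
    for (a, pa), (b, pb) in combinations(freqs.items(), 2):
        hamming = sum(x != y for x, y in zip(a, b))
        total += 2 * pa * pb * hamming  # factor 2 counts both draw orders
    return total


if __name__ == "__main__":
    # Hypothetical population: two haplotypes differing at one site.
    population = {"ACGT": 0.5, "ACGA": 0.5}
    print(richness(population))
    print(shannon_entropy(population))
    print(mean_pairwise_distance(population))
```

Tree-based measures are left out because they need a phylogeny as input, which this toy input does not provide.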
Affiliation(s)
- Lara Fuhrmann
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Kim Philipp Jablonski
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
17
Serrão de Andrade AA, Soares AER, Paula de Almeida LG, Ciapina LP, Pestana CP, Aquino CL, Medeiros MA, Ribeiro de Vasconcelos AT. Testing the genomic stability of the Brazilian yellow fever vaccine strain using next-generation sequencing data. Interface Focus 2021; 11:20200063. [PMID: 34123353 PMCID: PMC8193464 DOI: 10.1098/rsfs.2020.0063] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Accepted: 04/14/2021] [Indexed: 01/06/2023] Open
Abstract
The live attenuated yellow fever (YF) vaccine was developed in the 1930s. Currently, the 17D and 17DD attenuated substrains are used for vaccine production. The 17D strain is used for vaccine production by several countries, while the 17DD strain is used exclusively in Brazil. The cell passages carried out through the seed-lot system of vaccine production influence the presence of quasispecies causing changes in the stability and immunogenicity of attenuated genotypes by increasing attenuation or virulence. Using next-generation sequencing, we carried out genomic characterization and genetic diversity analysis between vaccine lots of the Brazilian YF vaccine, produced by BioManguinhos–Fiocruz, and used during 11 years of vaccination in Brazil. We present 20 assembled and annotated genomes from the Brazilian 17DD vaccine strain, eight single nucleotide polymorphisms and the quasispecies spectrum reconstruction for the 17DD vaccine, through a pipeline here introduced. The V2IDA pipeline provided a relationship between low genetic diversity, maintained through the seed lot system, and the confirmation of genetic stability of lots of the Brazilian vaccine against YF. Our study sets precedents for use of V2IDA in genetic diversity analysis and in silico stability investigation of attenuated viral vaccines, facilitating genetic surveillance during the vaccine production process.
Affiliation(s)
- Amanda Araújo Serrão de Andrade
- National Laboratory for Scientific Computing, Bioinformatics Laboratory (LABINFO), Avenida Getúlio Vargas, 333, Quitandinha 25651-075, Petrópolis, Rio de Janeiro, Brazil
- André E R Soares
- National Laboratory for Scientific Computing, Bioinformatics Laboratory (LABINFO), Avenida Getúlio Vargas, 333, Quitandinha 25651-075, Petrópolis, Rio de Janeiro, Brazil
- Luiz Gonzaga Paula de Almeida
- National Laboratory for Scientific Computing, Bioinformatics Laboratory (LABINFO), Avenida Getúlio Vargas, 333, Quitandinha 25651-075, Petrópolis, Rio de Janeiro, Brazil
- Luciane Prioli Ciapina
- National Laboratory for Scientific Computing, Bioinformatics Laboratory (LABINFO), Avenida Getúlio Vargas, 333, Quitandinha 25651-075, Petrópolis, Rio de Janeiro, Brazil
- Cristiane Pinheiro Pestana
- Fiocruz, Bio-Manguinhos, Recombinant Technology Laboratory (LATER), Brazilian Ministry of Health, Rio de Janeiro, Brazil
- Carolina Lessa Aquino
- Fiocruz, Bio-Manguinhos, Recombinant Technology Laboratory (LATER), Brazilian Ministry of Health, Rio de Janeiro, Brazil
- Marco Alberto Medeiros
- Fiocruz, Bio-Manguinhos, Recombinant Technology Laboratory (LATER), Brazilian Ministry of Health, Rio de Janeiro, Brazil
- Ana Tereza Ribeiro de Vasconcelos
- National Laboratory for Scientific Computing, Bioinformatics Laboratory (LABINFO), Avenida Getúlio Vargas, 333, Quitandinha 25651-075, Petrópolis, Rio de Janeiro, Brazil
18
Asada K, Kaneko S, Takasawa K, Machino H, Takahashi S, Shinkai N, Shimoyama R, Komatsu M, Hamamoto R. Integrated Analysis of Whole Genome and Epigenome Data Using Machine Learning Technology: Toward the Establishment of Precision Oncology. Front Oncol 2021; 11:666937. [PMID: 34055633 PMCID: PMC8149908 DOI: 10.3389/fonc.2021.666937] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 02/11/2021] [Accepted: 04/26/2021] [Indexed: 12/17/2022] Open
Abstract
With the completion of the International Human Genome Project, we have entered what is known as the post-genome era, and efforts to apply genomic information to medicine have become more active. In particular, with the announcement of the Precision Medicine Initiative by U.S. President Barack Obama in his State of the Union address at the beginning of 2015, "precision medicine," which aims to divide patients and potential patients into subgroups with respect to disease susceptibility, has become the focus of worldwide attention. The field of oncology is also actively adopting the precision oncology approach, which is based on molecular profiling, such as genomic information, to select the appropriate treatment. However, the current precision oncology is dominated by a method called targeted-gene panel (TGP), which uses next-generation sequencing (NGS) to analyze a limited number of specific cancer-related genes and suggest optimal treatments, but this method causes the problem that the number of patients who benefit from it is limited. In order to steadily develop precision oncology, it is necessary to integrate and analyze more detailed omics data, such as whole genome data and epigenome data. On the other hand, with the advancement of analysis technologies such as NGS, the amount of data obtained by omics analysis has become enormous, and artificial intelligence (AI) technologies, mainly machine learning (ML) technologies, are being actively used to make more efficient and accurate predictions. In this review, we will focus on whole genome sequencing (WGS) analysis and epigenome analysis, introduce the latest results of omics analysis using ML technologies for the development of precision oncology, and discuss the future prospects.
Affiliation(s)
- Ken Asada
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Syuzo Kaneko
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Ken Takasawa
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Hidenori Machino
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Satoshi Takahashi
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Norio Shinkai
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, Japan
- Ryo Shimoyama
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Masaaki Komatsu
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Ryuji Hamamoto
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, Japan
19
Dolan PT, Taguwa S, Rangel MA, Acevedo A, Hagai T, Andino R, Frydman J. Principles of dengue virus evolvability derived from genotype-fitness maps in human and mosquito cells. eLife 2021; 10:e61921. [PMID: 33491648 PMCID: PMC7880689 DOI: 10.7554/elife.61921] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/08/2020] [Accepted: 01/24/2021] [Indexed: 01/11/2023] Open
Abstract
Dengue virus (DENV) cycles between mosquito and mammalian hosts. To examine how DENV populations adapt to these different host environments, we used serial passage in human and mosquito cell lines and estimated fitness effects for all single-nucleotide variants in these populations using ultra-deep sequencing. This allowed us to determine the contributions of beneficial and deleterious mutations to the collective fitness of the population. Our analysis revealed that the continuous influx of a large burden of deleterious mutations counterbalances the effect of rare, host-specific beneficial mutations to shape the path of adaptation. Beneficial mutations preferentially map to intrinsically disordered domains in the viral proteome and cluster to defined regions in the genome. These phenotypically redundant adaptive alleles may facilitate host-specific DENV adaptation. Importantly, the evolutionary constraints described in our simple system mirror trends observed across DENV and Zika strains, indicating it recapitulates key biophysical and biological constraints shaping long-term viral evolution.
Affiliation(s)
- Patrick T Dolan
- Stanford University, Department of Biology, Stanford, United States
- University of California, Microbiology and Immunology, San Francisco, San Francisco, United States
- Shuhei Taguwa
- Stanford University, Department of Biology, Stanford, United States
- Ashley Acevedo
- University of California, Microbiology and Immunology, San Francisco, San Francisco, United States
- Tzachi Hagai
- Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Raul Andino
- University of California, Microbiology and Immunology, San Francisco, San Francisco, United States
- Judith Frydman
- Stanford University, Department of Biology, Stanford, United States
20
Posada-Céspedes S, Seifert D, Topolsky I, Jablonski KP, Metzner KJ, Beerenwinkel N. V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data. Bioinformatics 2021; 37:1673-1680. [PMID: 33471068 PMCID: PMC8289377 DOI: 10.1093/bioinformatics/btab015] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 06/11/2020] [Revised: 12/09/2020] [Accepted: 01/08/2021] [Indexed: 12/30/2022] Open
Abstract
Motivation High-throughput sequencing technologies are used increasingly not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations. Results To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. Availability and implementation V-pipe is freely available at https://github.com/cbg-ethz/V-pipe. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Susana Posada-Céspedes
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- David Seifert
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Ivan Topolsky
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Kim Philipp Jablonski
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Karin J Metzner
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, 8091, Switzerland; Institute of Medical Virology, University of Zurich, Zurich, 8091, Switzerland
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
21
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371 PMCID: PMC8485218 DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023] Open
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
22
Maclot F, Candresse T, Filloux D, Malmstrom CM, Roumagnac P, van der Vlugt R, Massart S. Illuminating an Ecological Blackbox: Using High Throughput Sequencing to Characterize the Plant Virome Across Scales. Front Microbiol 2020; 11:578064. [PMID: 33178159 PMCID: PMC7596190 DOI: 10.3389/fmicb.2020.578064] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 06/30/2020] [Accepted: 09/24/2020] [Indexed: 01/08/2023] Open
Abstract
The ecology of plant viruses began to be explored at the end of the 19th century. Since then, major advances have revealed mechanisms of virus-host-vector interactions in various environments. These advances have been accelerated by new technologies for virus detection and characterization, most recently including high throughput sequencing (HTS). HTS allows investigators, for the first time, to characterize all or nearly all viruses in a sample without a priori information about which viruses might be present. This powerful approach has spurred new investigation of the viral metagenome (virome). The rich virome datasets accumulated illuminate important ecological phenomena such as virus spread among host reservoirs (wild and domestic), effects of ecosystem simplification caused by human activities (and agriculture) on the biodiversity and the emergence of new viruses in crops. To be effective, however, HTS-based virome studies must successfully navigate challenges and pitfalls at each procedural step, from plant sampling to library preparation and bioinformatic analyses. This review summarizes major advances in plant virus ecology associated with technological developments, and then presents important considerations and best practices for HTS use in virome studies.
Affiliation(s)
- François Maclot
  - Plant Pathology Laboratory, Terra-Gembloux Agro-Bio Tech, Liège University, Gembloux, Belgium
- Denis Filloux
  - CIRAD, BGPI, Montpellier, France
  - BGPI, INRAE, CIRAD, Institut Agro, Montpellier University, Montpellier, France
- Carolyn M. Malmstrom
  - Department of Plant Biology and Graduate Program in Ecology, Evolution and Behavior, Michigan State University, East Lansing, MI, United States
- Philippe Roumagnac
  - CIRAD, BGPI, Montpellier, France
  - BGPI, INRAE, CIRAD, Institut Agro, Montpellier University, Montpellier, France
- René van der Vlugt
  - Laboratory of Virology, Wageningen University and Research Centre (WUR-PRI), Wageningen, Netherlands
- Sébastien Massart
  - Plant Pathology Laboratory, Terra-Gembloux Agro-Bio Tech, Liège University, Gembloux, Belgium
|
23
|
Gibson KM, Steiner MC, Rentia U, Bendall ML, Pérez-Losada M, Crandall KA. Validation of Variant Assembly Using HAPHPIPE with Next-Generation Sequence Data from Viruses. Viruses 2020; 12:E758. [PMID: 32674515 PMCID: PMC7412389 DOI: 10.3390/v12070758] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/28/2020] [Revised: 07/03/2020] [Accepted: 07/06/2020] [Indexed: 01/04/2023]
Abstract
Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.
Affiliation(s)
- Keylie M. Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Margaret C. Steiner
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Uzma Rentia
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Matthew L. Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4169-007 Vairão, Portugal
- Keith A. Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
|
24
|
Ji H, Sandstrom P, Paredes R, Harrigan PR, Brumme CJ, Avila Rios S, Noguera-Julian M, Parkin N, Kantor R. Are We Ready for NGS HIV Drug Resistance Testing? The Second "Winnipeg Consensus" Symposium. Viruses 2020; 12:E586. [PMID: 32471096 PMCID: PMC7354487 DOI: 10.3390/v12060586] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/20/2020] [Revised: 05/13/2020] [Accepted: 05/25/2020] [Indexed: 12/31/2022]
Abstract
HIV drug resistance is a major global challenge to successful and sustainable antiretroviral therapy. Next-generation sequencing (NGS)-based HIV drug resistance (HIVDR) assays enable more sensitive and quantitative detection of drug-resistance-associated mutations (DRMs) and outperform Sanger sequencing approaches in detecting lower abundance resistance mutations. While NGS is likely to become the new standard for routine HIVDR testing, many technical and knowledge gaps remain to be resolved before its generalized adoption in regular clinical care, public health, and research. Recognizing this, we conceived and launched an international symposium series on NGS HIVDR, to bring together leading experts in the field to address these issues through in-depth discussions and brainstorming. Following the first symposium in 2018 (Winnipeg, MB Canada, 21-22 February, 2018), a second "Winnipeg Consensus" symposium was held in September 2019 in Winnipeg, Canada, and was focused on external quality assurance strategies for NGS HIVDR assays. In this paper, we summarize this second symposium's goals and highlights.
Affiliation(s)
- Hezhao Ji
  - National HIV and Retrovirology Laboratories at JC Wilt Infectious Diseases Research Centre, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB R3E 0J9, Canada
- Paul Sandstrom
  - National HIV and Retrovirology Laboratories at JC Wilt Infectious Diseases Research Centre, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB R3E 0J9, Canada
- Roger Paredes
  - IrsiCaixa AIDS Research Institute, Hospital Germans Trias i Pujol, s/n, 08916 Badalona, Catalonia, Spain
  - Infectious Diseases Department, Hospital Germans Trias i Pujol, 08916 Badalona, Catalonia, Spain
- P. Richard Harrigan
  - Division of AIDS, Department of Medicine, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
- Chanson J. Brumme
  - British Columbia Centre for Excellence in HIV/AIDS, Vancouver, BC V6Z 1Y6, Canada
  - Division of Infectious Diseases, Department of Medicine, Faculty of Medicine, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
- Santiago Avila Rios
  - Centre for Research in Infectious Diseases, National Institute of Respiratory Diseases, Mexico City 14080, Mexico
- Marc Noguera-Julian
  - IrsiCaixa AIDS Research Institute, Hospital Germans Trias i Pujol, s/n, 08916 Badalona, Catalonia, Spain
  - Chair in AIDS and Related Illnesses, Centre for Health and Social Care Research (CESS), Faculty of Medicine, University of Vic–Central University of Catalonia (UVic–UCC), Can Baumann, Ctra. de Roda, 70, 08500 Vic, Spain
- Neil Parkin
  - Data First Consulting Inc., Sebastopol, CA 95472, USA
- Rami Kantor
  - Division of Infectious Diseases, Brown University Alpert Medical School, Providence, RI 02906, USA
|
25
|
Steiner MC, Gibson KM, Crandall KA. Drug Resistance Prediction Using Deep Learning Techniques on HIV-1 Sequence Data. Viruses 2020; 12:E560. [PMID: 32438586 PMCID: PMC7290575 DOI: 10.3390/v12050560] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 04/22/2020] [Revised: 05/08/2020] [Accepted: 05/17/2020] [Indexed: 12/20/2022]
Abstract
The fast replication rate and lack of repair mechanisms of human immunodeficiency virus (HIV) contribute to its high mutation frequency, with some mutations resulting in the evolution of resistance to antiretroviral therapies (ART). As such, studying HIV drug resistance allows for real-time evaluation of evolutionary mechanisms. Characterizing the biological process of drug resistance is also critically important for sustained effectiveness of ART. Investigating the link between "black box" deep learning methods applied to this problem and evolutionary principles governing drug resistance has been overlooked to date. Here, we utilized publicly available HIV-1 sequence data and drug resistance assay results for 18 ART drugs to evaluate the performance of three architectures (multilayer perceptron, bidirectional recurrent neural network, and convolutional neural network) for drug resistance prediction, jointly with biological analysis. We identified convolutional neural networks as the best performing architecture and displayed a correspondence between the importance of biologically relevant features in the classifier and overall performance. Our results suggest that the high classification performance of deep learning models is indeed dependent on drug resistance mutations (DRMs). These models heavily weighted several features that are not known DRM locations, indicating the utility of model interpretability to address causal relationships in viral genotype-phenotype data.
Affiliation(s)
- Margaret C. Steiner
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Keylie M. Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Keith A. Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
|