1
Wang W, Li Y, Ko S, Feng N, Zhang M, Liu JJ, Zheng S, Ren B, Yu YP, Luo JH, Tseng GC, Liu S. IFDlong: an isoform and fusion detector for accurate annotation and quantification of long-read RNA-seq data. bioRxiv 2024:2024.05.11.593690. [PMID: 38798496; PMCID: PMC11118288; DOI: 10.1101/2024.05.11.593690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Advancements in long-read transcriptome sequencing (long-RNA-seq) technology have revolutionized the study of isoform diversity. These full-length transcripts enhance the detection of various transcriptome structural variations, including novel isoforms, alternative splicing events, and fusion transcripts. By shifting the open reading frame or altering gene expressions, studies have proved that these transcript alterations can serve as crucial biomarkers for disease diagnosis and therapeutic targets. In this project, we proposed IFDlong, a bioinformatics and biostatistics tool to detect isoform and fusion transcripts using bulk or single-cell long-RNA-seq data. Specifically, the software performed gene and isoform annotation for each long-read, defined novel isoforms, quantified isoform expression by a novel expectation-maximization algorithm, and profiled the fusion transcripts. For evaluation, IFDlong pipeline achieved overall the best performance when compared with several existing tools in large-scale simulation studies. In both isoform and fusion transcript quantification, IFDlong is able to reach more than 0.8 Spearman's correlation with the truth, and more than 0.9 cosine similarity when distinguishing multiple alternative splicing events. In novel isoform simulation, IFDlong can successfully balance the sensitivity (higher than 90%) and specificity (higher than 90%). Furthermore, IFDlong has proved its accuracy and robustness in diverse in-house and public datasets on healthy tissues, cell lines and multiple types of diseases. Besides bulk long-RNA-seq, IFDlong pipeline has proved its compatibility to single-cell long-RNA-seq data. This new software may hold promise for significant impact on long-read transcriptome analysis. The IFDlong software is available at https://github.com/wenjiaking/IFDlong.
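The two agreement metrics quoted in this abstract (Spearman's correlation with the truth and cosine similarity) can be illustrated with a minimal sketch. This is not IFDlong's code: the function names and the toy inputs are assumptions made for illustration, and the abstract does not specify how ties or zero vectors are handled.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))
```

In a simulation setting, `x` would hold the simulated (true) expression values and `y` the estimated ones, in the same order.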
Affiliation(s)
- Wenjia Wang
  - Department of Biostatistics, School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Yuzhen Li
  - Department of Surgery, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Sungjin Ko
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Ning Feng
  - Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Manling Zhang
  - Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Jia-Jun Liu
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Songyang Zheng
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Baoguo Ren
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Yan P. Yu
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Jian-Hua Luo
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
  - Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA
- George C. Tseng
  - Department of Biostatistics, School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Silvia Liu
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
  - Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA
  - Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
2
Yoshitake K, Yanagisawa K, Sugimoto Y, Nakamura H, Mizusawa N, Miya M, Hamasaki K, Kobayashi T, Watabe S, Nishikiori K, Asakawa S. Pilot study of a comprehensive resource estimation method from environmental DNA using universal D-loop amplification primers. Funct Integr Genomics 2023; 23:96. [PMID: 36947319; PMCID: PMC10033627; DOI: 10.1007/s10142-023-01013-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 03/03/2023] [Accepted: 03/06/2023] [Indexed: 03/23/2023]
Abstract
Many studies have investigated the ability of environmental DNA (eDNA) to identify the species. However, when individual species are to be identified, accurate estimation of their abundance using traditional eDNA analyses is still difficult. We previously developed a novel analytical method called HaCeD-Seq (haplotype count from eDNA by sequencing), which focuses on the mitochondrial D-loop sequence for eels and tuna. In this study, universal D-loop primers were designed to enable the comprehensive detection of multiple fish species by a single sequence. To sequence the full-length D-loop with high accuracy, we performed nanopore sequencing with unique molecular identifiers (UMI). In addition, to determine the D-loop reference sequence, whole genome sequencing was performed with thin coverage, and complete mitochondrial genomes were determined. We developed a UMI-based Nanopore D-loop sequencing analysis pipeline and released it as open-source software. We detected 5 out of 15 species (33%) and 10 haplotypes out of 35 individuals (29%) among the detected species. This study demonstrates the possibility of comprehensively obtaining information related to population size from eDNA. In the future, this method can be used to improve the accuracy of fish resource estimation, which is currently highly dependent on fishing catches.
Affiliation(s)
- Kazutoshi Yoshitake
  - Department of Aquatic Bioscience, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, 113-8657, Tokyo, Japan
- Kyohei Yanagisawa
  - Department of Aquatic Bioscience, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, 113-8657, Tokyo, Japan
- Yuma Sugimoto
  - Tokyo Sea Life Park, 6-2-3 Rinkai-cho, Edogawa-ku, 134-8587, Tokyo, Japan
- Hiroshi Nakamura
  - Tokyo Sea Life Park, 6-2-3 Rinkai-cho, Edogawa-ku, 134-8587, Tokyo, Japan
- Nanami Mizusawa
  - School of Marine Biosciences, Kitasato University, 1-15-1 Kitasato, Minami-ku, Kanagawa, 252-0373, Sagamihara, Japan
- Masaki Miya
  - Department of Collection Management, Natural History Museum and Institute, Chiba, 260-8682, Japan
- Koji Hamasaki
  - Department of Marine Ecosystem Science, Atmosphere and Ocean Research Institute, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8564, Japan
  - Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8564, Japan
  - Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan
- Takanori Kobayashi
  - School of Marine Biosciences, Kitasato University, 1-15-1 Kitasato, Minami-ku, Kanagawa, 252-0373, Sagamihara, Japan
- Shugo Watabe
  - School of Marine Biosciences, Kitasato University, 1-15-1 Kitasato, Minami-ku, Kanagawa, 252-0373, Sagamihara, Japan
- Kazuomi Nishikiori
  - Tokyo Sea Life Park, 6-2-3 Rinkai-cho, Edogawa-ku, 134-8587, Tokyo, Japan
- Shuichi Asakawa
  - Department of Aquatic Bioscience, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, 113-8657, Tokyo, Japan
3
Gallardo CM, Wang S, Montiel-Garcia DJ, Little SJ, Smith DM, Routh AL, Torbett BE. MrHAMER yields highly accurate single molecule viral sequences enabling analysis of intra-host evolution. Nucleic Acids Res 2021; 49:e70. [PMID: 33849057; PMCID: PMC8266615; DOI: 10.1093/nar/gkab231] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/09/2021] [Revised: 03/12/2021] [Accepted: 03/31/2021] [Indexed: 12/31/2022]
Abstract
Technical challenges remain in the sequencing of RNA viruses due to their high intra-host diversity. This bottleneck is particularly pronounced when interrogating long-range co-evolved genetic interactions given the read-length limitations of next-generation sequencing platforms. This has hampered the direct observation of these genetic interactions that code for protein-protein interfaces with relevance in both drug and vaccine development. Here we overcome these technical limitations by developing a nanopore-based long-range viral sequencing pipeline that yields accurate single molecule sequences of circulating virions from clinical samples. We demonstrate its utility in observing the evolution of individual HIV Gag-Pol genomes in response to antiviral pressure. Our pipeline, called Multi-read Hairpin Mediated Error-correction Reaction (MrHAMER), yields >1000s of viral genomes per sample at 99.9% accuracy, maintains the original proportion of sequenced virions present in a complex mixture, and allows the detection of rare viral genomes with their associated mutations present at <1% frequency. This method facilitates scalable investigation of genetic correlates of resistance to both antiviral therapy and immune pressure and enables the identification of novel host-viral and viral-viral interfaces that can be modulated for therapeutic benefit.
Affiliation(s)
- Christian M Gallardo
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA
  - Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA
- Shiyi Wang
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA
  - Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA
- Daniel J Montiel-Garcia
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Susan J Little
  - Division of Infectious Diseases and Global Public Health, University of California, San Diego, La Jolla, CA, USA
- Davey M Smith
  - Division of Infectious Diseases and Global Public Health, University of California, San Diego, La Jolla, CA, USA
  - Veterans Affairs San Diego Healthcare System, San Diego, CA, USA
- Andrew L Routh
  - Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX, USA
  - Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, TX, USA
- Bruce E Torbett
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA
  - Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA
  - Department of Pediatrics, University of Washington School of Medicine, Seattle, WA, USA
4
Assessment of Circulating Nucleic Acids in Cancer: From Current Status to Future Perspectives and Potential Clinical Applications. Cancers (Basel) 2021; 13:cancers13143460. [PMID: 34298675; PMCID: PMC8307284; DOI: 10.3390/cancers13143460] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 05/19/2021] [Revised: 07/01/2021] [Accepted: 07/06/2021] [Indexed: 02/06/2023]
Abstract
Current approaches for cancer detection and characterization are based on radiological procedures coupled with tissue biopsies, despite relevant limitations in terms of overall accuracy and feasibility, including relevant patients' discomfort. Liquid biopsies enable the minimally invasive collection and analysis of circulating biomarkers released from cancer cells and stroma, representing therefore a promising candidate for the substitution or integration in the current standard of care. Despite the potential, the current clinical applications of liquid biopsies are limited to a few specific purposes. The lack of standardized procedures for the pre-analytical management of body fluids samples and the detection of circulating biomarkers is one of the main factors impacting the effective advancement in the applicability of liquid biopsies to clinical practice. The aim of this work, besides depicting current methods for samples collection, storage, quality check and biomarker extraction, is to review the current techniques aimed at analyzing one of the main circulating biomarkers assessed through liquid biopsy, namely cell-free nucleic acids, with particular regard to circulating tumor DNA (ctDNA). ctDNA current and potential applications are reviewed as well.
5
Callahan BJ, Grinevich D, Thakur S, Balamotis MA, Yehezkel TB. Ultra-accurate microbial amplicon sequencing with synthetic long reads. Microbiome 2021; 9:130. [PMID: 34090540; PMCID: PMC8179091; DOI: 10.1186/s40168-021-01072-3] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 03/29/2021] [Accepted: 04/06/2021] [Indexed: 05/08/2023]
Abstract
BACKGROUND: Out of the many pathogenic bacterial species that are known, only a fraction are readily identifiable directly from a complex microbial community using standard next generation DNA sequencing. Long-read sequencing offers the potential to identify a wider range of species and to differentiate between strains within a species, but attaining sufficient accuracy in complex metagenomes remains a challenge. METHODS: Here, we describe and analytically validate LoopSeq, a commercially available synthetic long-read (SLR) sequencing technology that generates highly accurate long reads from standard short reads. RESULTS: LoopSeq reads are sufficiently long and accurate to identify microbial genes and species directly from complex samples. LoopSeq perfectly recovered the full diversity of 16S rRNA genes from known strains in a synthetic microbial community. Full-length LoopSeq reads had a per-base error rate of 0.005%, which exceeds the accuracy reported for other long-read sequencing technologies. 18S-ITS and genomic sequencing of fungal and bacterial isolates confirmed that LoopSeq sequencing maintains that accuracy for reads up to 6 kb in length. LoopSeq full-length 16S rRNA reads could accurately classify organisms down to the species level in rinsate from retail meat samples, and could differentiate strains within species identified by the CDC as potential foodborne pathogens. CONCLUSIONS: The order-of-magnitude improvement in length and accuracy over standard Illumina amplicon sequencing achieved with LoopSeq enables accurate species-level and strain identification from complex- to low-biomass microbiome samples. The ability to generate accurate and long microbiome sequencing reads using standard short read sequencers will accelerate the building of quality microbial sequence databases and removes a significant hurdle on the path to precision microbial genomics.
Affiliation(s)
- Benjamin J. Callahan
  - Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, NC USA
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC USA
- Dmitry Grinevich
  - Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, NC USA
- Siddhartha Thakur
  - Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, NC USA
6
Liu S, Wu I, Yu YP, Balamotis M, Ren B, Ben Yehezkel T, Luo JH. Targeted transcriptome analysis using synthetic long read sequencing uncovers isoform reprograming in the progression of colon cancer. Commun Biol 2021; 4:506. [PMID: 33907296; PMCID: PMC8079361; DOI: 10.1038/s42003-021-02024-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/06/2020] [Accepted: 03/09/2021] [Indexed: 02/02/2023]
Abstract
The characterization of human gene expression is limited by short read lengths, high error rates and large input requirements. Here, we used a synthetic long read (SLR) sequencing approach, LoopSeq, to generate accurate sequencing reads that span full length transcripts using standard short read data. LoopSeq identified isoforms from control samples with 99.4% accuracy and a 0.01% per-base error rate, exceeding the accuracy reported for other long-read technologies. Applied to targeted transcriptome sequencing from colon cancers and their metastatic counterparts, LoopSeq revealed large scale isoform redistributions from benign colon mucosa to primary colon cancer and metastatic cancer and identified several previously unknown fusion isoforms. Strikingly, single nucleotide variants (SNVs) occurred dominantly in specific isoforms and some SNVs underwent isoform switching in cancer progression. The ability to use short reads to generate accurate long-read data as the raw unit of information holds promise as a widely accessible approach in transcriptome sequencing.
Affiliation(s)
- Silvia Liu
  - Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - High Throughput Genome Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - Pittsburgh Liver Research Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
- Indira Wu
  - Loop Genomics, Inc., San Jose, CA, 95138, USA
- Yan-Ping Yu
  - Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - High Throughput Genome Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - Pittsburgh Liver Research Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
- Baoguo Ren
  - Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - High Throughput Genome Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
- Jian-Hua Luo
  - Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - High Throughput Genome Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
  - Pittsburgh Liver Research Center, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15261, USA
7
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371; PMCID: PMC8485218; DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023]
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
8
Zurek PJ, Knyphausen P, Neufeld K, Pushpanath A, Hollfelder F. UMI-linked consensus sequencing enables phylogenetic analysis of directed evolution. Nat Commun 2020; 11:6023. [PMID: 33243970; PMCID: PMC7691348; DOI: 10.1038/s41467-020-19687-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 05/22/2020] [Accepted: 10/12/2020] [Indexed: 11/09/2022]
Abstract
The success of protein evolution campaigns is strongly dependent on the sequence context in which mutations are introduced, stemming from pervasive non-additive interactions between a protein's amino acids ('intra-gene epistasis'). Our limited understanding of such epistasis hinders the correct prediction of the functional contributions and adaptive potential of mutations. Here we present a straightforward unique molecular identifier (UMI)-linked consensus sequencing workflow (UMIC-seq) that simplifies mapping of evolutionary trajectories based on full-length sequences. Attaching UMIs to gene variants allows accurate consensus generation for closely related genes with nanopore sequencing. We exemplify the utility of this approach by reconstructing the artificial phylogeny emerging in three rounds of directed evolution of an amine dehydrogenase biocatalyst via ultrahigh throughput droplet screening. Uniquely, we are able to identify lineages and their founding variant, as well as non-additive interactions between mutations within a full gene showing sign epistasis. Access to deep and accurate long reads will facilitate prediction of key beneficial mutations and adaptive potential based on in silico analysis of large sequence datasets.
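The core idea of UMI-linked consensus calling, grouping reads that share a UMI and taking a per-position majority vote, can be sketched as follows. This is a simplified illustration and not the UMIC-seq implementation: it assumes the reads in each group are already aligned and error-free in the UMI itself, and the function name and the `min_reads` threshold are hypothetical. Real nanopore workflows cluster noisy UMIs and polish consensus sequences with dedicated tools.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_reads=2):
    """Build one consensus sequence per UMI by per-position majority vote.

    reads: iterable of (umi, sequence) pairs; sequences in a group are
           assumed to be pre-aligned (illustrative simplification).
    min_reads: hypothetical threshold; groups with fewer reads are skipped.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few reads to outvote sequencing errors
        # Only compare positions covered by every read in the group.
        length = min(len(s) for s in seqs)
        consensus[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus
```

For example, three reads tagged with the same UMI in which one read carries a single-base error yield the error-free sequence, because the erroneous base is outvoted at that position.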
Affiliation(s)
- Paul Jannis Zurek
  - Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
  - Johnson Matthey Plc, Cambridge, CB4 0WE, UK
- Philipp Knyphausen
  - Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
- Katharina Neufeld
  - Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
  - Johnson Matthey Plc, Cambridge, CB4 0WE, UK
- Florian Hollfelder
  - Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
9
Shaw LP, Doyle RM, Kavaliunaite E, Spencer H, Balloux F, Dixon G, Harris KA. Children With Cystic Fibrosis Are Infected With Multiple Subpopulations of Mycobacterium abscessus With Different Antimicrobial Resistance Profiles. Clin Infect Dis 2020; 69:1678-1686. [PMID: 30689761; PMCID: PMC6821159; DOI: 10.1093/cid/ciz069] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 10/08/2018] [Accepted: 01/21/2019] [Indexed: 12/12/2022]
Abstract
Background: Children with cystic fibrosis (CF) can develop life-threatening infections of Mycobacterium abscessus. These present a significant clinical challenge, particularly when the strains involved are resistant to antibiotics. Recent evidence of within-patient subclones of M. abscessus in adults with CF suggests the possibility that within-patient diversity may be relevant for the treatment of pediatric CF patients. Methods: We performed whole-genome sequencing (WGS) on 32 isolates of M. abscessus that were taken from multiple body sites of 2 patients with CF who were undergoing treatment at Great Ormond Street Hospital, United Kingdom, in 2015. Results: We found evidence of extensive diversity within patients over time. A clustering analysis of single nucleotide variants revealed that each patient harbored multiple subpopulations, which were differentially abundant between sputum, lung samples, chest wounds, and pleural fluid. The sputum isolates did not reflect the overall within-patient diversity and did not allow for the detection of subclones with mutations previously associated with macrolide resistance (rrl 2058/2059). Some variants were present at intermediate frequencies before the lung transplants. The time of the transplants coincided with extensive variation, suggesting that this event is particularly disruptive for the microbial community, but the transplants did not clear the M. abscessus infections and both patients died as a result of these infections. Conclusions: Isolates of M. abscessus from sputum do not always reflect the entire diversity present within the patient, which can include subclones with differing antimicrobial resistance profiles. An awareness of this phenotypic variability, with the sampling of multiple body sites in conjunction with WGS, may be necessary to ensure the best treatment for this vulnerable patient group.
Affiliation(s)
- Liam P Shaw
  - UCL Genetics Institute, University College London, London
  - Nuffield Department of Medicine, John Radcliffe Hospital, Oxford
- Ronan M Doyle
  - Department of Microbiology, Virology and Infection Control
  - National Institute for Health Research Biomedical Research Centre
- Ema Kavaliunaite
  - Paediatric Respiratory Medicine and Lung Transplantation, Great Ormond Street Hospital National Health Services Foundation Trust, London, United Kingdom
- Helen Spencer
  - Paediatric Respiratory Medicine and Lung Transplantation, Great Ormond Street Hospital National Health Services Foundation Trust, London, United Kingdom
- Garth Dixon
  - Department of Microbiology, Virology and Infection Control
  - National Institute for Health Research Biomedical Research Centre
- Kathryn A Harris
  - Department of Microbiology, Virology and Infection Control
  - National Institute for Health Research Biomedical Research Centre
10
Wang M, Li J, Zhang X, Han Y, Yu D, Zhang D, Yuan Z, Yang Z, Huang J, Zhang X. An integrated software for virus community sequencing data analysis. BMC Genomics 2020; 21:363. [PMID: 32414327; PMCID: PMC7227348; DOI: 10.1186/s12864-020-6744-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/12/2020] [Accepted: 04/21/2020] [Indexed: 02/06/2023]
Abstract
BACKGROUND: A virus community is the spectrum of viral strains populating an infected host, which plays a key role in pathogenesis and therapy response in viral infectious diseases. However automatic and dedicated pipeline for interpreting virus community sequencing data has not been developed yet. RESULTS: We developed Quasispecies Analysis Package (QAP), an integrated software platform to address the problems associated with making biological interpretations from massive viral population sequencing data. QAP provides quantitative insight into virus ecology by first introducing the definition "virus OTU" and supports a wide range of viral community analyses and results visualizations. Various forms of QAP were developed in consideration of broader users, including a command line, a graphical user interface and a web server. Utilities of QAP were thoroughly evaluated with high-throughput sequencing data from hepatitis B virus, hepatitis C virus, influenza virus and human immunodeficiency virus, and the results showed highly accurate viral quasispecies characteristics related to biological phenotypes. CONCLUSIONS: QAP provides a complete solution for virus community high throughput sequencing data analysis, and it would facilitate the easy analysis of virus quasispecies in clinical applications.
Affiliation(s)
- Mingjie Wang
  - Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Jianfeng Li
  - State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025, China
- Xiaonan Zhang
  - Key Lab of Medicine Molecular Virology of MOE/MOH, Shanghai Medical School, Fudan University, Shanghai, 200032, China
- Yue Han
  - Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Demin Yu
  - Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Donghua Zhang
  - Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Zhenghong Yuan
  - Key Lab of Medicine Molecular Virology of MOE/MOH, Shanghai Medical School, Fudan University, Shanghai, 200032, China
- Zhitao Yang
  - Emergency Department, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Jinyan Huang
  - State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025, China
- Xinxin Zhang
  - Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
  - Clinical Research Center, Ruijin Hospital North, Shanghai Jiaotong University, School of Medicine, Shanghai, 201821, China
11
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics 2019; 34:2927-2935. [PMID: 29617936; DOI: 10.1093/bioinformatics/bty202] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/28/2017] [Accepted: 04/02/2018] [Indexed: 12/29/2022]
Abstract
Motivation: RNA virus populations contain different but genetically related strains, all infecting an individual host. Reconstruction of the viral haplotypes is a fundamental step to characterize the virus population, predict their viral phenotypes and finally provide important information for clinical treatment and prevention. Advances of the next-generation sequencing technologies open up new opportunities to assemble full-length haplotypes. However, error-prone short reads, high similarities between related strains, an unknown number of haplotypes pose computational challenges for reference-free haplotype reconstruction. There is still much room to improve the performance of existing haplotype assembly tools. Results: In this work, we developed a de novo haplotype reconstruction tool named PEHaplo, which employs paired-end reads to distinguish highly similar strains for viral quasispecies data. It was applied on both simulated and real quasispecies data, and the results were benchmarked against several recently published de novo haplotype reconstruction tools. The comparison shows that PEHaplo outperforms the benchmarked tools in a comprehensive set of metrics. Availability and implementation: The source code and the documentation of PEHaplo are available at https://github.com/chjiao/PEHaplo. Supplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jiao Chen
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Yingchao Zhao
  - School of Computing and Information Sciences, Caritas Institute of Higher Education, Hong Kong, China
- Yanni Sun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
|
12
|
Barik S, Das S, Vikalo H. QSdpR: Viral quasispecies reconstruction via correlation clustering. Genomics 2018; 110:375-381. [DOI: 10.1016/j.ygeno.2017.12.007] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/05/2017] [Revised: 12/03/2017] [Accepted: 12/13/2017] [Indexed: 02/05/2023]
|
13
|
Ibrahim B, McMahon DP, Hufsky F, Beer M, Deng L, Mercier PL, Palmarini M, Thiel V, Marz M. A new era of virus bioinformatics. Virus Res 2018; 251:86-90. [PMID: 29751021 DOI: 10.1016/j.virusres.2018.05.009] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 04/17/2018] [Accepted: 05/07/2018] [Indexed: 01/09/2023]
Abstract
Despite the recognized excellence of virology and bioinformatics, these two communities have interacted surprisingly sporadically, aside from some pioneering work on HIV-1 and influenza. Bringing together the expertise of bioinformaticians and virologists is crucial, since very specific but fundamental computational approaches are required for virus research, particularly in an era of big data. Collaboration between virologists and bioinformaticians is necessary to improve existing analytical tools, cloud-based systems, computational resources, data sharing approaches, new diagnostic tools, and bioinformatic training. Here, we highlight current progress and discuss potential avenues for future developments in this promising era of virus bioinformatics. We end by presenting an overview of current technologies, and by outlining some of the major challenges and advantages that bioinformatics will bring to the field of virology.
Affiliation(s)
- Bashar Ibrahim
  - European Virus Bioinformatics Center, Jena, Germany; RNA Bioinformatics and High Throughput Analysis Jena, Friedrich Schiller University Jena, Jena, Germany
- Dino P McMahon
  - European Virus Bioinformatics Center, Jena, Germany; Host Parasite Evolution and Ecology, Institute of Biology, Free University of Berlin, Berlin, Germany; Department for Materials and Environment, BAM Federal Institute for Materials Research and Testing, Berlin, Germany
- Franziska Hufsky
  - European Virus Bioinformatics Center, Jena, Germany; RNA Bioinformatics and High Throughput Analysis Jena, Friedrich Schiller University Jena, Jena, Germany
- Martin Beer
  - European Virus Bioinformatics Center, Jena, Germany; Institute of Diagnostic Virology, Friedrich-Loeffler-Institute, Greifswald, Germany
- Li Deng
  - European Virus Bioinformatics Center, Jena, Germany; Institute of Virology, Helmholtz Zentrum Munich, Munich, Germany
- Philippe Le Mercier
  - European Virus Bioinformatics Center, Jena, Germany; Swiss-Prot Group, SIB, CMU, University of Geneva Medical School, Geneva, Switzerland
- Massimo Palmarini
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
- Volker Thiel
  - European Virus Bioinformatics Center, Jena, Germany; Federal Department of Home Affairs, Institute of Virology and Immunology, Bern and Mittelhausen, Switzerland; Department of Infectious Diseases and Pathobiology, University of Bern, Bern, Switzerland
- Manja Marz
  - European Virus Bioinformatics Center, Jena, Germany; RNA Bioinformatics and High Throughput Analysis Jena, Friedrich Schiller University Jena, Jena, Germany
|
14
|
Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet 2018; 19:269-285. [PMID: 29576615 PMCID: PMC6485430 DOI: 10.1038/nrg.2017.117] [Citation(s) in RCA: 313] [Impact Index Per Article: 52.2] [Indexed: 01/07/2023]
Abstract
Mutations, the fuel of evolution, are first manifested as rare DNA changes within a population of cells. Although next-generation sequencing (NGS) technologies have revolutionized the study of genomic variation between species and individual organisms, most have limited ability to accurately detect and quantify rare variants among the different genome copies in heterogeneous mixtures of cells or molecules. We describe the technical challenges in characterizing subclonal variants using conventional NGS protocols and the recent development of error correction strategies, both computational and experimental, including consensus sequencing of single DNA molecules. We also highlight major applications for low-frequency mutation detection in science and medicine, describe emerging methodologies and provide our vision for the future of DNA sequencing.
Affiliation(s)
- Jesse J Salk
  - Department of Pathology, University of Washington School of Medicine, Seattle, WA, USA
  - Department of Medicine, Divisions of Hematology and Medical Oncology, University of Washington School of Medicine, Seattle, WA, USA
  - Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA
- Michael W Schmitt
  - Department of Pathology, University of Washington School of Medicine, Seattle, WA, USA
  - Department of Medicine, Divisions of Hematology and Medical Oncology, University of Washington School of Medicine, Seattle, WA, USA
  - Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA
- Lawrence A Loeb
  - Department of Pathology, University of Washington School of Medicine, Seattle, WA, USA
  - Department of Biochemistry, University of Washington School of Medicine, Seattle, WA, USA
|
15
|
Karst SM, Dueholm MS, McIlroy SJ, Kirkegaard RH, Nielsen PH, Albertsen M. Retrieval of a million high-quality, full-length microbial 16S and 18S rRNA gene sequences without primer bias. Nat Biotechnol 2018; 36:190-195. [PMID: 29291348 DOI: 10.1038/nbt.4045] [Citation(s) in RCA: 99] [Impact Index Per Article: 16.5] [Received: 12/08/2016] [Accepted: 11/22/2017] [Indexed: 01/02/2023]
Abstract
Small subunit ribosomal RNA (SSU rRNA) genes, 16S in bacteria and 18S in eukaryotes, have been the standard phylogenetic markers used to characterize microbial diversity and evolution for decades. However, the reference databases of full-length SSU rRNA gene sequences are skewed to well-studied ecosystems and subject to primer bias and chimerism, which results in an incomplete view of the diversity present in a sample. We combine poly(A)-tailing and reverse transcription of SSU rRNA molecules with synthetic long-read sequencing to generate high-quality, full-length SSU rRNA sequences, without primer bias, at high throughput. We apply our approach to samples from seven different ecosystems and obtain more than a million SSU rRNA sequences from all domains of life, with an estimated raw error rate of 0.17%. We observe a large proportion of novel diversity, including several deeply branching phylum-level lineages putatively related to the Asgard Archaea. Our approach will enable expansion of the SSU rRNA reference databases by orders of magnitude, and contribute to a comprehensive census of the tree of life.
Affiliation(s)
- Søren M Karst
  - Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Denmark
- Morten S Dueholm
  - Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Denmark
- Simon J McIlroy
  - Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Denmark
- Rasmus H Kirkegaard
  - Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Denmark
- Per H Nielsen
  - Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Denmark
- Mads Albertsen
  - Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Denmark
|
16
|
Segawa H, Kukita Y, Kato K. HLA genotyping by next-generation sequencing of complementary DNA. BMC Genomics 2017; 18:914. [PMID: 29179676 PMCID: PMC5704545 DOI: 10.1186/s12864-017-4300-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/19/2017] [Accepted: 11/13/2017] [Indexed: 12/23/2022] Open
Abstract
Background: Genotyping of the human leucocyte antigen (HLA) is indispensable for various medical treatments. However, unambiguous genotyping is technically challenging due to high polymorphism of the corresponding genomic region. Next-generation sequencing is changing the landscape of genotyping. In addition to high data throughput, it has the advantage that DNA templates are derived from single molecules, which is a strong merit for the phasing problem. Although most currently developed technologies use genomic DNA, use of cDNA could enable genotyping with reduced costs in data production and analysis. We thus developed an HLA genotyping system based on next-generation sequencing of cDNA. Methods: Each HLA gene was divided into 3 or 4 target regions subjected to PCR amplification and subsequent sequencing with Ion Torrent PGM. The sequence data were then subjected to an automated analysis. The principle of the analysis was to construct candidate sequences generated from all possible combinations of variable bases and arrange them in decreasing order of the number of reads. Upon collecting candidate sequences from all target regions, 2 haplotypes were usually assigned. Cases not assigned 2 haplotypes were forwarded to 4 additional processes: selection of candidate sequences applying more stringent criteria, removal of artificial haplotypes, selection of candidate sequences with a relaxed threshold for sequence matching, and a countermeasure for incomplete sequences in the HLA database. Results: The genotyping system was evaluated using 30 samples; the overall accuracy was 97.0% at the field 3 level and 98.3% at the G group level. With one sample, genotyping of DPB1 was not completed due to short read size. We then developed a method for complete sequencing of individual molecules of the DPB1 gene, using the molecular barcode technology. Conclusion: The performance of the automatic genotyping system was comparable to that of systems developed in previous studies. Thus, next-generation sequencing of cDNA is a viable option for HLA genotyping. Electronic supplementary material: The online version of this article (10.1186/s12864-017-4300-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hidenobu Segawa
  - Department of Molecular and Medical Genetics, Research Institute, Osaka Medical Center for Cancer and Cardiovascular Diseases, 1-3-3 Nakamichi, Higashinari-ku, Osaka, 537-8511, Japan
- Yoji Kukita
  - Department of Molecular and Medical Genetics, Research Institute, Osaka Medical Center for Cancer and Cardiovascular Diseases, 1-3-3 Nakamichi, Higashinari-ku, Osaka, 537-8511, Japan
- Kikuya Kato
  - Laboratory of Medical Genomics, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, 630-0101, Japan
|
17
|
Zhu YO, Aw PPK, de Sessions PF, Hong S, See LX, Hong LZ, Wilm A, Li CH, Hue S, Lim SG, Nagarajan N, Burkholder WF, Hibberd M. Single-virion sequencing of lamivudine-treated HBV populations reveal population evolution dynamics and demographic history. BMC Genomics 2017; 18:829. [PMID: 29078745 PMCID: PMC5660452 DOI: 10.1186/s12864-017-4217-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/08/2017] [Accepted: 10/16/2017] [Indexed: 01/22/2023] Open
Abstract
Background: Viral populations are complex, dynamic, and fast evolving. The evolution of groups of closely related viruses in a competitive environment is termed quasispecies. To fully understand the role that quasispecies play in viral evolution, characterizing the trajectories of viral genotypes in an evolving population is the key. In particular, long-range haplotype information for thousands of individual viruses is critical; yet generating this information is non-trivial. Popular deep sequencing methods generate relatively short reads that do not preserve linkage information, while third generation sequencing methods have higher error rates that make detection of low frequency mutations a bioinformatics challenge. Here we applied BAsE-Seq, an Illumina-based single-virion sequencing technology, to eight samples from four chronic hepatitis B (CHB) patients, once before antiviral treatment and once after viral rebound due to resistance. Results: With single-virion sequencing, we obtained 248-8796 single-virion sequences per sample, which allowed us to find evidence for both hard and soft selective sweeps. We were able to reconstruct population demographic history that was independently verified by clinically collected data. We further verified four of the samples independently through PacBio SMRT and Illumina Pooled deep sequencing. Conclusions: Overall, we showed that single-virion sequencing yields insight into viral evolution and population dynamics in an efficient and high throughput manner. We believe that single-virion sequencing is widely applicable to the study of viral evolution in the context of drug resistance and host adaptation, allows differentiation between soft or hard selective sweeps, and may be useful in the reconstruction of intra-host viral population demographic history.
Affiliation(s)
- Yuan O Zhu
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Pauline P K Aw
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Shuzhen Hong
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Lee Xian See
  - Institute of Molecular and Cell Biology, Singapore, 138673, Singapore
- Lewis Z Hong
  - Institute of Molecular and Cell Biology, Singapore, 138673, Singapore
- Andreas Wilm
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Chen Hao Li
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Stephane Hue
  - London School of Hygiene and Tropical Medicine, London, UK
- Seng Gee Lim
  - National University Hospital, Singapore, 119074, Singapore
- Martin Hibberd
  - Genome Institute of Singapore, Singapore, 138672, Singapore; London School of Hygiene and Tropical Medicine, London, UK
|
18
|
Wang K, Lai S, Yang X, Zhu T, Lu X, Wu CI, Ruan J. Ultrasensitive and high-efficiency screen of de novo low-frequency mutations by o2n-seq. Nat Commun 2017; 8:15335. [PMID: 28530222 PMCID: PMC5458117 DOI: 10.1038/ncomms15335] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 09/14/2016] [Accepted: 03/21/2017] [Indexed: 12/15/2022] Open
Abstract
Detection of de novo, low-frequency mutations is essential for characterizing cancer genomes and heterogeneous cell populations. However, the screening capacity of current ultrasensitive NGS methods is inadequate owing to either low-efficiency read utilization or severe amplification bias. Here, we present o2n-seq, an ultrasensitive and high-efficiency NGS library preparation method for discovering de novo, low-frequency mutations. O2n-seq reduces the error rate of NGS to $10^{-5}$ to $10^{-8}$. The efficiency of its data usage is about 10-30 times higher than that of barcode-based strategies. For detecting mutations with allele frequency (AF) 1% in 4.6 Mb-sized genome, the sensitivity and specificity of o2n-seq reach 99% and 98.64%, respectively. For mutations with AF around 0.07% in phix174, o2n-seq detects all the mutations with 100% specificity. Moreover, we successfully apply o2n-seq to screen de novo, low-frequency mutations in human tumours. O2n-seq will help characterize the landscape of somatic mutations in research and clinical settings.
Affiliation(s)
- Kaile Wang
  - Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Pengfei Road No. 7, Dapeng New District, Shenzhen, Guangdong 518120, China
  - Key Laboratory of Genomics and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Chaoyang, Beijing 100101, China
  - University of Chinese Academy of Sciences, Shijingshan, Beijing 100049, China
- Shujuan Lai
  - Key Laboratory of Genomics and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Chaoyang, Beijing 100101, China
- Xiaoxu Yang
  - Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Haidian, Beijing 100871, China
- Tianqi Zhu
  - Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Haidian, Beijing 100190, China
  - Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- Xuemei Lu
  - Key Laboratory of Genomics and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Chaoyang, Beijing 100101, China
- Chung-I Wu
  - Key Laboratory of Genomics and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Chaoyang, Beijing 100101, China
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, Guangdong 510275, China
  - Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA
- Jue Ruan
  - Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Pengfei Road No. 7, Dapeng New District, Shenzhen, Guangdong 518120, China
|
19
|
Abstract
Whole-genome sequencing (WGS) of pathogens is becoming increasingly important not only for basic research but also for clinical science and practice. In virology, WGS is important for the development of novel treatments and vaccines, and for increasing the power of molecular epidemiology and evolutionary genomics. In this Opinion article, we suggest that WGS of viruses in a clinical setting will become increasingly important for patient care. We give an overview of different WGS methods that are used in virology and summarize their advantages and disadvantages. Although there are only partially addressed technical, financial and ethical issues in regard to the clinical application of viral WGS, this technique provides important insights into virus transmission, evolution and pathogenesis.
Affiliation(s)
- Charlotte J. Houldcroft
  - Department of Infection, Immunity and Inflammation, Great Ormond Street Institute of Child Health, University College London, London WC1N 1EH, UK
  - Division of Biological Anthropology, University of Cambridge, Cambridge CB2 3QG, UK
- Mathew A. Beale
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK
  - The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Judith Breuer
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK
  - Great Ormond Street Hospital for Children NHS Foundation Trust, London WC1N 3JH, UK
|
20
|
Error rates, PCR recombination, and sampling depth in HIV-1 whole genome deep sequencing. Virus Res 2016; 239:106-114. [PMID: 28039047 DOI: 10.1016/j.virusres.2016.12.009] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 06/29/2016] [Revised: 11/25/2016] [Accepted: 12/16/2016] [Indexed: 11/20/2022]
Abstract
Deep sequencing is a powerful and cost-effective tool to characterize the genetic diversity and evolution of virus populations. While modern sequencing instruments readily cover viral genomes many thousand-fold and very rare variants can in principle be detected, sequencing errors, amplification biases, and other artifacts can limit sensitivity and complicate data interpretation. For this reason, the number of studies using whole genome deep sequencing to characterize viral quasispecies in clinical samples is still limited. We have previously undertaken a large-scale whole genome deep sequencing study of HIV-1 populations. Here we discuss the challenges, error profiles, control experiments, and computational tests we developed to quantify the accuracy of variant frequency estimation.
|
21
|
Posada-Cespedes S, Seifert D, Beerenwinkel N. Recent advances in inferring viral diversity from high-throughput sequencing data. Virus Res 2016; 239:17-32. [PMID: 27693290 DOI: 10.1016/j.virusres.2016.09.016] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 06/24/2016] [Revised: 09/23/2016] [Accepted: 09/24/2016] [Indexed: 02/05/2023]
Abstract
Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus population. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.
Affiliation(s)
- Susana Posada-Cespedes
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
- David Seifert
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; SIB, Basel, Switzerland
|
22
|
Li C, Chng KR, Boey EJH, Ng AHQ, Wilm A, Nagarajan N. INC-Seq: accurate single molecule reads using nanopore sequencing. Gigascience 2016; 5:34. [PMID: 27485345 PMCID: PMC4970289 DOI: 10.1186/s13742-016-0140-7] [Citation(s) in RCA: 111] [Impact Index Per Article: 13.9] [Received: 04/09/2016] [Revised: 07/14/2016] [Accepted: 07/14/2016] [Indexed: 01/19/2023] Open
Abstract
Background: Nanopore sequencing provides a rapid, cheap and portable real-time sequencing platform with the potential to revolutionize genomics. However, several applications are limited by relatively high single-read error rates (>10%), including RNA-seq, haplotype sequencing and 16S sequencing. Results: We developed the Intramolecular-ligated Nanopore Consensus Sequencing (INC-Seq) as a strategy for obtaining long and accurate nanopore reads, starting with low input DNA. Applying INC-Seq for 16S rRNA-based bacterial profiling generated full-length amplicon sequences with a median accuracy >97%. Conclusions: INC-Seq reads enabled accurate species-level classification, identification of species at 0.1% abundance and robust quantification of relative abundances, providing a cheap and effective approach for pathogen detection and microbiome profiling on the MinION system. Electronic supplementary material: The online version of this article (doi:10.1186/s13742-016-0140-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Chenhao Li
  - Genome Institute of Singapore, Singapore, 138672, Singapore; Department of Computer Science, National University of Singapore, Singapore, 117417, Singapore
- Kern Rei Chng
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Andreas Wilm
  - Genome Institute of Singapore, Singapore, 138672, Singapore
- Niranjan Nagarajan
  - Genome Institute of Singapore, Singapore, 138672, Singapore; Department of Computer Science, National University of Singapore, Singapore, 117417, Singapore
|
23
|
Lavezzo E, Barzon L, Toppo S, Palù G. Third generation sequencing technologies applied to diagnostic microbiology: benefits and challenges in applications and data analysis. Expert Rev Mol Diagn 2016; 16:1011-23. [PMID: 27453996 DOI: 10.1080/14737159.2016.1217158] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 01/07/2023]
Abstract
Introduction: The diagnosis of infectious diseases is among the most successful areas of application of new generation sequencing technologies. The field has seen the development of numerous experimental and analytical approaches for the detection and the fine description of pathogenic and non-pathogenic microorganisms. Areas covered: Without claiming to be exhaustive with respect to all applications and methods developed over the years, this review focuses on the advantages and the issues brought by the new technologies, with an eye in particular to third generation sequencing methods. Both experimental procedures and algorithmic strategies are presented, following the most relevant publications which have led to progress in our ability to detect infectious agents. Expert commentary: The technical advance brought by third generation sequencing platforms has the potential to significantly expand the range of diagnostic tools that will be available to clinicians. Nonetheless, the implementation of these technologies in clinical practice is still far from being actionable and will temporally follow the path undertaken by second generation methods, which still require the setup of standardized pipelines in both wet and dry laboratory procedures.
Affiliation(s)
- Enrico Lavezzo
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Luisa Barzon
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Stefano Toppo
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Giorgio Palù
  - Department of Molecular Medicine, University of Padova, Padova, Italy
|
24
|
Sim S, Hibberd ML. Genomic approaches for understanding dengue: insights from the virus, vector, and host. Genome Biol 2016; 17:38. [PMID: 26931545 PMCID: PMC4774013 DOI: 10.1186/s13059-016-0907-2] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 01/01/2023] Open
Abstract
The incidence and geographic range of dengue have increased dramatically in recent decades. Climate change, rapid urbanization and increased global travel have facilitated the spread of both efficient mosquito vectors and the four dengue virus serotypes between population centers. At the same time, significant advances in genomics approaches have provided insights into host–pathogen interactions, immunogenetics, and viral evolution in both humans and mosquitoes. Here, we review these advances and the innovative treatment and control strategies that they are inspiring.
Affiliation(s)
- Shuzhen Sim
  - Infectious Diseases, Genome Institute of Singapore, Singapore, 138672, Singapore
- Martin L Hibberd
  - Infectious Diseases, Genome Institute of Singapore, Singapore, 138672, Singapore; Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, WC1E 7HT, UK
|
25
|
Abstract
A central challenge in the field of metabolic engineering is the efficient identification of a metabolic pathway genotype that maximizes specific productivity over a robust range of process conditions. Here we review current methods for optimizing specific productivity of metabolic pathways in living cells. New tools for library generation, computational analysis of pathway sequence-flux space, and high-throughput screening and selection techniques are discussed.
Affiliation(s)
- Justin R Klesmith
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Timothy A Whitehead
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA; Department of Biosystems and Agricultural Engineering, Michigan State University, East Lansing, MI 48824, USA
|
26
|
Cole C, Volden R, Dharmadhikari S, Scelfo-Dalbey C, Vollmers C. Highly Accurate Sequencing of Full-Length Immune Repertoire Amplicons Using Tn5-Enabled and Molecular Identifier–Guided Amplicon Assembly. THE JOURNAL OF IMMUNOLOGY 2016; 196:2902-7. [DOI: 10.4049/jimmunol.1502563] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 12/09/2015] [Accepted: 01/11/2016] [Indexed: 12/22/2022]
|
27
|
Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing. PLoS One 2016; 11:e0147229. [PMID: 26789840 PMCID: PMC4720449 DOI: 10.1371/journal.pone.0147229] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 09/19/2015] [Accepted: 12/30/2015] [Indexed: 12/26/2022] Open
Abstract
Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise.
|
28
|
Abstract
Despite having very limited coding capacity, RNA viruses are able to withstand challenge of antiviral drugs, cause epidemics in previously exposed human populations, and, in some cases, infect multiple host species. They are able to achieve this by virtue of their ability to multiply very rapidly, coupled with their extraordinary degree of genetic heterogeneity. RNA viruses exist not as single genotypes, but as a swarm of related variants, and this genomic diversity is an essential feature of their biology. RNA viruses have a variety of mechanisms that act in combination to determine their genetic heterogeneity. These include polymerase fidelity, error-mitigation mechanisms, genomic recombination, and different modes of genome replication. RNA viruses can vary in their ability to tolerate mutations, or “genetic robustness,” and several factors contribute to this. Finally, there is evidence that some RNA viruses exist close to a threshold where polymerase error rate has evolved to maximize the possible sequence space available, while avoiding the accumulation of a lethal load of deleterious mutations. We speculate that different viruses have evolved different error rates to complement the different “life-styles” they possess.
Affiliation(s)
- J.N. Barr
- University of Leeds, Leeds, United Kingdom
| |
- R. Fearns
- Boston University School of Medicine, Boston, MA, United States
| |
|
29
|
A Comprehensive Analysis of Primer IDs to Study Heterogeneous HIV-1 Populations. J Mol Biol 2015; 428:238-250. [PMID: 26711506 DOI: 10.1016/j.jmb.2015.12.012] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 08/07/2015] [Revised: 11/25/2015] [Accepted: 12/16/2015] [Indexed: 01/01/2023]
Abstract
Determining the composition of viral populations is becoming increasingly important in the field of medical virology. While recently developed computational tools for viral haplotype analysis allow for correcting sequencing errors, they do not always allow for the removal of errors occurring in the upstream experimental protocol, such as PCR errors. Primer IDs (pIDs) are one method to address this problem by harnessing redundant template resampling for error correction. By using a reference mixture of five HIV-1 strains, we show how pIDs can be useful for estimating key experimental parameters, such as the substitution rate of the PCR process and the reverse transcription (RT) error rate. In addition, we introduce a hidden Markov model for determining the recombination rate of the RT PCR process. We found no strong sequence-specific bias in pID abundances (the same RT efficiencies as compared to commonly used short, specific RT primers) and no effects of pIDs on the estimated distribution of the reference viruses.
|
30
|
Klesmith JR, Bacik JP, Michalczyk R, Whitehead TA. Comprehensive Sequence-Flux Mapping of a Levoglucosan Utilization Pathway in E. coli. ACS Synth Biol 2015; 4:1235-43. [PMID: 26369947 DOI: 10.1021/acssynbio.5b00131] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Indexed: 11/29/2022]
Abstract
Synthetic metabolic pathways often suffer from low specific productivity, and new methods that quickly assess pathway functionality for many thousands of variants are urgently needed. Here we present an approach that enables the rapid and parallel determination of sequence effects on flux for complete gene-encoding sequences. We show that this method can be used to determine the effects of over 8000 single point mutants of a pyrolysis oil catabolic pathway implanted in Escherichia coli. Experimental sequence-function data sets predicted whether fitness-enhancing mutations to the enzyme levoglucosan kinase resulted from enhanced catalytic efficiency or enzyme stability. A structure of one design incorporating 38 mutations elucidated the structural basis of high fitness mutations. One design incorporating 15 beneficial mutations supported a 15-fold improvement in growth rate and greater than 24-fold improvement in enzyme activity relative to the starting pathway. This technique can be extended to improve a wide variety of designed pathways.
Affiliation(s)
- Justin R. Klesmith
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
- John-Paul Bacik
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| |
- Ryszard Michalczyk
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| |
- Timothy A. Whitehead
- Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biosystems and Agricultural Engineering, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
31
|
High-resolution genetic profile of viral genomes: why it matters. Curr Opin Virol 2015; 14:62-70. [DOI: 10.1016/j.coviro.2015.08.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 05/19/2015] [Revised: 08/07/2015] [Accepted: 08/07/2015] [Indexed: 12/12/2022]
|
32
|
Rational Protein Engineering Guided by Deep Mutational Scanning. Int J Mol Sci 2015; 16:23094-110. [PMID: 26404267 PMCID: PMC4613353 DOI: 10.3390/ijms160923094] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/14/2015] [Revised: 09/04/2015] [Accepted: 09/13/2015] [Indexed: 11/16/2022]
Abstract
Sequence-function relationship in a protein is commonly determined by the three-dimensional protein structure followed by various biochemical experiments. However, with the explosive increase in the number of genome sequences, facilitated by recent advances in sequencing technology, the gap between protein sequences available and three-dimensional structures is rapidly widening. A recently developed method termed deep mutational scanning explores the functional phenotype of thousands of mutants via massive sequencing. Coupled with a highly efficient screening system, this approach assesses the phenotypic changes made by the substitution of each amino acid sequence that constitutes a protein. Such an informational resource provides the functional role of each amino acid sequence, thereby providing sufficient rationale for selecting target residues for protein engineering. Here, we discuss the current applications of deep mutational scanning and consider experimental design.
|