1
Li Y, Yang R. PxBLAT: an efficient python binding library for BLAT. BMC Bioinformatics 2024; 25:219. [PMID: 38898394 DOI: 10.1186/s12859-024-05844-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 06/13/2024] [Indexed: 06/21/2024] Open
Abstract
BACKGROUND With the surge in genomic data driven by advancements in sequencing technologies, the demand for efficient bioinformatics tools for sequence analysis has become paramount. BLAST-like alignment tool (BLAT), a sequence alignment tool, faces limitations in performance efficiency and integration with modern programming environments, particularly Python. This study introduces PxBLAT, a Python-based framework designed to enhance the capabilities of BLAT, focusing on usability, computational efficiency, and seamless integration within the Python ecosystem. RESULTS PxBLAT demonstrates significant improvements over BLAT in execution speed and data handling, as evidenced by comprehensive benchmarks conducted across various sample groups ranging from 50 to 600 samples. These experiments highlight a notable speedup, reducing execution time compared to BLAT. The framework also introduces user-friendly features such as improved server management, data conversion utilities, and shell completion, enhancing the overall user experience. Additionally, the provision of extensive documentation and comprehensive testing supports community engagement and facilitates the adoption of PxBLAT. CONCLUSIONS PxBLAT stands out as a robust alternative to BLAT, offering performance and user interaction enhancements. Its development underscores the potential for modern programming languages to improve bioinformatics tools, aligning with the needs of contemporary genomic research. By providing a more efficient, user-friendly tool, PxBLAT has the potential to impact genomic data analysis workflows, supporting faster and more accurate sequence analysis in a Python environment.
Affiliation(s)
- Yangyang Li
  - Department of Urology, Northwestern University Feinberg School of Medicine, 303 E Superior St, Chicago, IL, 60611, USA
- Rendong Yang
  - Department of Urology, Northwestern University Feinberg School of Medicine, 303 E Superior St, Chicago, IL, 60611, USA
  - Robert H. Lurie Comprehensive Cancer Center, Northwestern University Feinberg School of Medicine, 675 N St Clair St, Chicago, IL, 60611, USA
2
Amano N, Narumi S, Aizu K, Miyazawa M, Okamura K, Ohashi H, Katsumata N, Ishii T, Hasegawa T. Single-Exon Deletions of ZNRF3 Exon 2 Cause Congenital Adrenal Hypoplasia. J Clin Endocrinol Metab 2024; 109:641-648. [PMID: 37878959 DOI: 10.1210/clinem/dgad627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 10/17/2023] [Accepted: 10/18/2023] [Indexed: 10/27/2023]
Abstract
CONTEXT Primary adrenal insufficiency (PAI) is a life-threatening condition characterized by the inability of the adrenal cortex to produce sufficient steroid hormones. E3 ubiquitin protein ligase zinc and ring finger 3 (ZNRF3) is a negative regulator of Wnt/β-catenin signaling. R-spondin 1 (RSPO1) enhances Wnt/β-catenin signaling via binding and removal of ZNRF3 from the cell surface. OBJECTIVE This work aimed to explore a novel genetic form of PAI. METHODS We analyzed 9 patients with childhood-onset PAI of biochemically and genetically unknown etiology using array comparative genomic hybridization. To examine the functionality of the identified single-exon deletions of ZNRF3 exon 2, we performed three-dimensional (3D) structure modeling and in vitro functional studies. RESULTS We identified various-sized single-exon deletions encompassing ZNRF3 exon 2 in 3 patients who showed neonatal-onset adrenal hypoplasia with glucocorticoid and mineralocorticoid deficiencies. Reverse-transcriptase polymerase chain reaction (RT-PCR) analysis showed that the 3 distinct single-exon deletions were commonly transcribed into a 126-nucleotide deleted mRNA and translated into 42-amino acid deleted protein (ΔEx2-ZNRF3). Based on 3D structure modeling, we predicted that interaction between ZNRF3 and RSPO1 would be disturbed in ΔEx2-ZNRF3, suggesting loss of RSPO1-dependent activation of Wnt/β-catenin signaling. Cell-based functional assays with the TCF-LEF reporter showed that RSPO1-dependent activation of Wnt/β-catenin signaling was attenuated in cells expressing ΔEx2-ZNRF3 as compared with those expressing wild-type ZNRF3. CONCLUSION We provided genetic evidence linking deletions encompassing ZNRF3 exon 2 and congenital adrenal hypoplasia, which might be related to constitutive inactivation of Wnt/β-catenin signaling by ΔEx2-ZNRF3.
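The reading-frame arithmetic behind the reported transcript can be checked with a minimal sketch: a 126-nucleotide deletion is a whole number of codons, so it removes 42 amino acids without shifting the frame. The function name `in_frame_deletion` is hypothetical and is not taken from the study.

```python
def in_frame_deletion(deleted_nt: int) -> tuple[bool, int]:
    """Return (is_in_frame, whole_codons_removed) for a deletion of `deleted_nt` nucleotides."""
    # A deletion preserves the reading frame only if its length is a multiple of 3.
    codons, remainder = divmod(deleted_nt, 3)
    return remainder == 0, codons


# The 126-nucleotide deletion described in the abstract.
in_frame, residues_lost = in_frame_deletion(126)
print(in_frame, residues_lost)  # True 42
```

A deletion whose length is not divisible by three would instead shift the reading frame downstream of the breakpoint.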
Affiliation(s)
- Naoko Amano
  - Department of Pediatrics, Keio University School of Medicine, Tokyo, 160-8582, Japan
  - Department of Pediatrics, Saitama City Hospital, Saitama, 336-8522, Japan
- Satoshi Narumi
  - Department of Pediatrics, Keio University School of Medicine, Tokyo, 160-8582, Japan
  - Department of Molecular Endocrinology, National Research Institute for Child Health and Development, Tokyo, 157-8535, Japan
- Katsuya Aizu
  - Division of Endocrinology and Metabolism, Saitama Children's Medical Center, Saitama, 330-8777, Japan
- Mari Miyazawa
  - Department of Pediatrics, Kochi Health Sciences Center, Kochi, 781-8555, Japan
- Kohji Okamura
  - Department of Systems BioMedicine, National Center for Child Health and Development, Tokyo, 157-8535, Japan
- Hirofumi Ohashi
  - Division of Medical Genetics, Saitama Children's Medical Center, Saitama, 330-8777, Japan
- Noriyuki Katsumata
  - Department of Molecular Endocrinology, National Research Institute for Child Health and Development, Tokyo, 157-8535, Japan
- Tomohiro Ishii
  - Department of Pediatrics, Keio University School of Medicine, Tokyo, 160-8582, Japan
- Tomonobu Hasegawa
  - Department of Pediatrics, Keio University School of Medicine, Tokyo, 160-8582, Japan
3
Mazur FG, Morinisi LM, Martins JO, Guerra PPB, Freire CCM. Exploring Virome Diversity in Public Data in South America as an Approach for Detecting Viral Sources From Potentially Emerging Viruses. Front Genet 2022; 12:722857. [PMID: 35126446 PMCID: PMC8814814 DOI: 10.3389/fgene.2021.722857] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/09/2021] [Accepted: 11/29/2021] [Indexed: 11/13/2022] Open
Abstract
The South American continent presents a great diversity of biomes, whose ecosystems are constantly threatened by the expansion of human activity. The emergence and re-emergence of viral populations with impact on the human population and ecosystem has increased in recent decades. Given the growing accumulation of genomic data, we explore the potential of South American-related public databases to detect signals that contribute to virosphere research. Our study therefore aims to investigate public databases with emphasis on the surveillance of viruses with medical and ecological relevance. Herein, we profiled 120 "sequence read archives" metagenomes from 19 independent projects from the last decade. In a coarse view, our analyses identified only 0.38% of the total number of sequences as viral, with a higher proportion of RNA viruses. The metagenomes with the most important viral sequences in the analyzed environmental models were 1) aquatic samples from the Amazon River, 2) sewage from Brasilia, and 3) soil from the state of São Paulo, while the models of animal transmission were detected in mosquitoes from Rio de Janeiro and bats from Amazonia. Also, the classification of viral signals into operational taxonomic units (OTUs) (family) allowed us to infer from metadata a probable host range in the virome detected in each sample analyzed. Further, several motifs and viral sequences are related to specific viruses with emergence potential from the Togaviridae, Arenaviridae, and Flaviviridae families. In this context, the exploration of public databases allowed us to evaluate the scope and informative capacity of sequences from third-party public databases and to detect signals related to viruses of clinical or environmental importance, which allowed us to infer traits associated with probable transmission routes or signals of ecological disequilibrium. The evaluation of our results showed that in most cases the size and type of the reference database, the percentage of guanine-cytosine (GC), and the length of the query sequences greatly influence the taxonomic classification of the sequences. In sum, our findings describe how the exploration of public genomic data can be exploited as an approach for epidemiological surveillance and the understanding of the virosphere.
Affiliation(s)
- Caio C. M. Freire
  - Department of Genetics and Evolution, UFSCar, Federal University of São Carlos, São Carlos, Brazil
4
Fabiańska I, Borutzki S, Richter B, Tran HQ, Neubert A, Mayer D. LABRADOR-A Computational Workflow for Virus Detection in High-Throughput Sequencing Data. Viruses 2021; 13:v13122541. [PMID: 34960810 PMCID: PMC8704571 DOI: 10.3390/v13122541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Revised: 12/13/2021] [Accepted: 12/16/2021] [Indexed: 11/16/2022] Open
Abstract
High-throughput sequencing (HTS) allows detection of known and unknown viruses in samples of broad origin. This makes HTS well suited to determining whether biological products, such as vaccines, are free from adventitious agents, which could support or replace extensive testing using various in vitro and in vivo assays. Due to bioinformatics complexities, there is a need for standardized and reliable methods to manage HTS-generated data in this field. Thus, we developed LABRADOR, an analysis pipeline for adventitious virus detection. The pipeline consists of several third-party programs and is divided into two major parts: (i) direct read classification based on the comparison of characteristic profiles between reads and sequences deposited in the database, supported by alignment to the best-matching reference sequence, and (ii) de novo assembly of contigs and their classification at the nucleotide and amino acid levels. To meet the requirements published in guidelines for the safety of biologicals, we generated a custom nucleotide database with viral sequences. We tested our pipeline on publicly available HTS datasets and showed that LABRADOR can reliably detect viruses in mixtures of model viruses, vaccines and clinical samples.
5
Identifying proximal RNA interactions from cDNA-encoded crosslinks with ShapeJumper. PLoS Comput Biol 2021; 17:e1009632. [PMID: 34905538 PMCID: PMC8670686 DOI: 10.1371/journal.pcbi.1009632] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/22/2021] [Accepted: 11/11/2021] [Indexed: 01/07/2023] Open
Abstract
SHAPE-JuMP is a concise strategy for identifying close-in-space interactions in RNA molecules. Nucleotides in close three-dimensional proximity are crosslinked with a bi-reactive reagent that covalently links the 2'-hydroxyl groups of the ribose moieties. The identities of crosslinked nucleotides are determined using an engineered reverse transcriptase that jumps across crosslinked sites, resulting in a deletion in the cDNA that is detected using massively parallel sequencing. Here we introduce ShapeJumper, a bioinformatics pipeline to process SHAPE-JuMP sequencing data and to accurately identify through-space interactions, as observed in complex JuMP datasets. ShapeJumper identifies proximal interactions with near-nucleotide resolution using an alignment strategy that is optimized to tolerate the unique non-templated reverse-transcription profile of the engineered crosslink-traversing reverse-transcriptase. JuMP-inspired strategies are now poised to replace adapter-ligation for detecting RNA-RNA interactions in most crosslinking experiments.
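The idea that a crosslink is recorded as a deletion in the cDNA can be illustrated with a minimal sketch that reads long deletions out of a SAM-style CIGAR string. This is not ShapeJumper's alignment strategy, which the abstract describes only at a high level; the function name `deletion_spans` and the `min_len` threshold are hypothetical choices made for illustration.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def deletion_spans(ref_start, cigar, min_len=10):
    """Return (left, right) 0-based reference positions flanking each long deletion.

    `left` is the last aligned base before the deletion and `right` is the first
    aligned base after it; the pair is a candidate for two crosslinked nucleotides.
    `min_len` is an assumed cutoff separating jumps from ordinary short deletions.
    """
    pos = ref_start
    spans = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "DN" and n >= min_len:
            spans.append((pos - 1, pos + n))
        # Only these operations consume reference positions.
        if op in "MDN=X":
            pos += n
    return spans


# A read aligned at reference position 100 with a 35-base deletion after 20 matches.
print(deletion_spans(100, "20M35D30M"))  # [(119, 155)]
```

Insertions and soft clips are skipped when advancing the reference coordinate because they do not consume reference bases.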
6
Truong Nguyen PT, Plyusnin I, Sironen T, Vapalahti O, Kant R, Smura T. HAVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences. BMC Bioinformatics 2021; 22:373. [PMID: 34273961 PMCID: PMC8285700 DOI: 10.1186/s12859-021-04294-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 02/12/2021] [Accepted: 07/08/2021] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND SARS-CoV-2 related research has increased in importance worldwide since December 2019. Several new variants of SARS-CoV-2 have emerged globally, of which the most notable and concerning currently are the UK variant B.1.1.7, the South African variant B.1.351 and the Brazilian variant P.1. Detecting and monitoring novel variants is essential in SARS-CoV-2 surveillance. While there are several tools for assembling virus genomes and performing lineage analyses to investigate SARS-CoV-2, each performs only one or a few of these functions separately. RESULTS Because no publicly available pipeline could perform fast reference-based assemblies on raw SARS-CoV-2 sequences and also identify lineages to detect variants of concern, we have developed an open source bioinformatic pipeline called HAVoC (Helsinki university Analyzer for Variants of Concern). HAVoC can perform reference-based assembly of raw sequence reads and assign the corresponding lineages to SARS-CoV-2 sequences. CONCLUSIONS HAVoC is a pipeline utilizing several bioinformatic tools to perform multiple necessary analyses for investigating genetic variance among SARS-CoV-2 samples. The pipeline is particularly useful for those who need a more accessible and fast tool to detect and monitor the spread of SARS-CoV-2 variants of concern during local outbreaks. HAVoC is currently being used in Finland for monitoring the spread of SARS-CoV-2 variants. HAVoC user manual and source code are available at https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc, respectively.
Affiliation(s)
- Ilya Plyusnin
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
  - Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
- Tarja Sironen
  - Department of Virology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
- Olli Vapalahti
  - Department of Virology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
  - Department of Virology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
- Ravi Kant
  - Department of Virology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
- Teemu Smura
  - Department of Virology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Virology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
7
Zeng X, Zhao L, Shen C, Zhou Y, Li G, Sung WK. HIVID2: an accurate tool to detect virus integrations in the host genome. Bioinformatics 2021; 37:1821-1827. [PMID: 33453108 DOI: 10.1093/bioinformatics/btab031] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/20/2020] [Revised: 12/27/2020] [Accepted: 01/12/2021] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION Virus integration in the host genome is frequently reported to be closely associated with many human diseases, and the detection of virus integration is a critically challenging task. However, most existing tools show limited specificity and sensitivity. Therefore, the objective of this study is to develop a method for accurate detection of virus integration into host genomes. RESULTS Herein, we report a novel method termed HIVID2 that is a significant upgrade of HIVID. HIVID2 performs a paired-end combination (PE-combination) for potentially integrated reads. The resulting sequences are then remapped onto the reference genomes, and both split and discordant chimeric reads are used to identify accurate integration breakpoints with high confidence. HIVID2 represents a great improvement in specificity and sensitivity, and predicts breakpoints closer to the real integrations, compared with existing methods. The advantage of our method was demonstrated using both simulated and real data sets. HIVID2 uncovered novel integration breakpoints in well-known cervical cancer-related genes, including FHIT and LRP1B, which was verified using protein expression data. In addition, HIVID2 allows the user to decide whether to automatically perform advanced analysis using the identified virus integrations. By analyzing the simulated data and real data tests, we demonstrated that HIVID2 is not only more accurate than HIVID but also better than other existing programs with respect to both sensitivity and specificity. We believe that HIVID2 will help in enhancing future research associated with virus integration. AVAILABILITY HIVID2 can be accessed at https://github.com/zengxi-hada/HIVID2/. CONTACT Xi Zeng (zengxi@mail.hzau.edu.cn), Linghao Zhao (michael_yifan@126.com). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xi Zeng
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Linghao Zhao
  - Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai 200438, China
- Chenhang Shen
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Yi Zhou
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Guoliang Li
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Wing-Kin Sung
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
  - Department of Computer Science, National University of Singapore, Singapore, 117417, Singapore
  - Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672, Singapore
8
Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Sci Rep 2020; 10:19737. [PMID: 33184454 PMCID: PMC7665074 DOI: 10.1038/s41598-020-76881-x] [Citation(s) in RCA: 91] [Impact Index Per Article: 22.8] [Received: 05/15/2020] [Accepted: 11/03/2020] [Indexed: 01/16/2023] Open
Abstract
RNA-seq is currently considered the most powerful, robust and adaptable technique for measuring gene expression and transcription activation at genome-wide level. As the analysis of RNA-seq data is complex, it has prompted a large amount of research on algorithms and methods. This has resulted in a substantial increase in the number of options available at each step of the analysis. Consequently, there is no clear consensus about the most appropriate algorithms and pipelines that should be used to analyse RNA-seq data. In the present study, 192 pipelines using alternative methods were applied to 18 samples from two human cell lines and the performance of the results was evaluated. Raw gene expression signal was quantified by non-parametric statistics to measure precision and accuracy. Differential gene expression performance was estimated by testing 17 differential expression methods. The procedures were validated by qRT-PCR in the same samples. This study weighs up the advantages and disadvantages of the tested algorithms and pipelines providing a comprehensive guide to the different methods and procedures applied to the analysis of RNA-seq data, both for the quantification of the raw expression signal and for the differential gene expression.
9
Tong L, Wu PY, Phan JH, Hassazadeh HR, Tong W, Wang MD. Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction. Sci Rep 2020; 10:17925. [PMID: 33087762 PMCID: PMC7578822 DOI: 10.1038/s41598-020-74567-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 12/22/2019] [Accepted: 08/27/2020] [Indexed: 11/23/2022] Open
Abstract
To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline's performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and its impact was extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. In the end, we provided scenarios as guidelines for users to use these three metrics to select sensible RNA-seq pipelines for the improved accuracy, precision, and reliability of gene expression estimation, which lead to the improved downstream gene expression-based prediction of disease outcome.
Affiliation(s)
- Li Tong
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
- Po-Yen Wu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- John H Phan
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
- Hamid R Hassazadeh
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Weida Tong
  - National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- May D Wang
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
10
Noguera-Julian M, Lee ER, Shafer RW, Kantor R, Ji H. Dry Panels Supporting External Quality Assessment Programs for Next Generation Sequencing-Based HIV Drug Resistance Testing. Viruses 2020; 12:v12060666. [PMID: 32575676 PMCID: PMC7354622 DOI: 10.3390/v12060666] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/28/2020] [Revised: 06/18/2020] [Accepted: 06/18/2020] [Indexed: 12/18/2022] Open
Abstract
External quality assessment (EQA) is a keystone element in the validation and implementation of next generation sequencing (NGS)-based HIV drug resistance testing (DRT). Software validation and evaluation is a critical element in NGS EQA programs. While the development, sharing, and adoption of wet lab protocols is coupled with the increasing access to NGS technology worldwide, rendering it easy to produce NGS data for HIV-DRT, bioinformatic data analysis remains a bottleneck for most of the diagnostic laboratories. Several computational tools have been made available, via free or commercial sources, to automate the conversion of raw NGS data into an actionable clinical report. Although different software platforms yield equivalent results when identical raw NGS datasets are analyzed for variations at higher abundance, discrepancies arise when variations at lower frequencies are considered. This implies that validation and performance assessment of the bioinformatics tools applied in NGS HIV-DRT is critical, and the origins of the observed discrepancies should be determined. Well-characterized reference NGS datasets with ground truth on the genotype composition at all examined loci and the exact frequencies of HIV variations they may harbor, so-called dry panels, would be essential in such cases. The strategic design and construction of such panels are challenging but imperative tasks in support of EQA programs for NGS-based HIV-DRT and the validation of relevant bioinformatics tools. Here, we present criteria that can guide the design of such dry panels, which were discussed in the Second International Winnipeg Symposium themed for EQA strategies for NGS HIVDR assays.
Affiliation(s)
- Marc Noguera-Julian
  - IrsiCaixa AIDS Research Institute, Hospital Germans Trias i Pujol, s/n, Catalonia, 08196 Badalona, Spain
  - Chair in AIDS and Related Illnesses, Centre for Health and Social Care Research (CESS), Faculty of Medicine, University of Vic, Central University of Catalonia, Can Baumann. Ctra. de Roda, 70, 08500 Vic, Spain
- Emma R. Lee
  - National HIV and Retrovirology Laboratories, National Microbiology Laboratory at JC Wilt Infectious Diseases Research Centre, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
- Rami Kantor
  - Division of Infectious Diseases, Brown University Alpert Medical School, Providence, RI 02903, USA
- Hezhao Ji
  - National HIV and Retrovirology Laboratories, National Microbiology Laboratory at JC Wilt Infectious Diseases Research Centre, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
  - Department of Medical Microbiology and Infectious Diseases, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB R3E 0J9, Canada
11
Zapatka M, Borozan I, Brewer DS, Iskar M, Grundhoff A, Alawi M, Desai N, Sültmann H, Moch H, Cooper CS, Eils R, Ferretti V, Lichter P. The landscape of viral associations in human cancers. Nat Genet 2020; 52:320-330. [PMID: 32025001 PMCID: PMC8076016 DOI: 10.1038/s41588-019-0558-9] [Citation(s) in RCA: 220] [Impact Index Per Article: 55.0] [Received: 11/30/2018] [Accepted: 11/22/2019] [Indexed: 12/30/2022]
Abstract
Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, for which whole-genome and (for a subset) whole-transcriptome sequencing data from 2,658 cancers across 38 tumor types was aggregated, we systematically investigated potential viral pathogens using a consensus approach that integrated three independent pipelines. Viruses were detected in 382 genome and 68 transcriptome datasets. We found a high prevalence of known tumor-associated viruses such as Epstein-Barr virus (EBV), hepatitis B virus (HBV) and human papilloma virus (HPV; for example, HPV16 or HPV18). The study revealed significant exclusivity of HPV and driver mutations in head-and-neck cancer and the association of HPV with APOBEC mutational signatures, which suggests that impaired antiviral defense is a driving force in cervical, bladder and head-and-neck carcinoma. For HBV, HPV16, HPV18 and adeno-associated virus-2 (AAV2), viral integration was associated with local variations in genomic copy numbers. Integrations at the TERT promoter were associated with high telomerase expression evidently activating this tumor-driving process. High levels of endogenous retrovirus (ERV1) expression were linked to a worse survival outcome in patients with kidney cancer.
Affiliation(s)
- Marc Zapatka
- Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Ivan Borozan
- Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Daniel S Brewer
- Norwich Medical School, University of East Anglia, Norwich, UK
- Earlham Institute, Norwich, UK
| | - Murat Iskar
- Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Adam Grundhoff
- Heinrich-Pette-Institute, Leibniz Institute for Experimental Virology, Hamburg, Germany
- German Center for Infection Research (DZIF), Partner Site Hamburg-Borstel-Lübeck-Riems, Hamburg, Germany
| | - Malik Alawi
- Heinrich-Pette-Institute, Leibniz Institute for Experimental Virology, Hamburg, Germany
- Bioinformatics Core, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Nikita Desai
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, Francis Crick Institute, London, UK
| | - Holger Sültmann
- National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg, Germany
- Division of Cancer Genome Research, German Cancer Research Center (DKFZ), Heidelberg, Germany
- German Cancer Consortium (DKTK), Heidelberg, Germany
| | - Holger Moch
- Department of Pathology and Molecular Pathology, University and University Hospital Zürich, Zurich, Switzerland
| | - Colin S Cooper
- Norwich Medical School, University of East Anglia, Norwich, UK
- Earlham Institute, Norwich, UK
- Institute of Cancer Research, London, UK
- University of East Anglia, Norwich, UK
| | - Roland Eils
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute of Pharmacy and Molecular Biotechnology, Heidelberg University and BioQuant Center, Heidelberg, Germany
- Center for Digital Health, Berlin Institute of Health and Charité Universitätsmedizin Berlin, Berlin, Germany
| | - Vincent Ferretti
- Ontario Institute for Cancer Research, MaRS Centre, Toronto, Ontario, Canada
- Department of Biochemistry and Molecular Medicine, University of Montreal, Montreal, Québec, Canada
| | - Peter Lichter
- Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
- German Cancer Consortium (DKTK), Heidelberg, Germany.
| |
|
12
|
Abstract
Long double-stranded RNAs (dsRNAs) are abundantly expressed in animals, in which they frequently occur in introns and 3' untranslated regions of mRNAs. Functions of long, cellular dsRNAs are poorly understood, although deficiencies in adenosine deaminases that act on RNA, or ADARs, promote their recognition as viral dsRNA and an aberrant immune response. Diverse dsRNA-binding proteins bind cellular dsRNAs, hinting at additional roles. Understanding these roles is facilitated by mapping the genomic locations that express dsRNA in various tissues and organisms. ADAR editing provides a signature of dsRNA structure in cellular transcripts. In this review, we detail approaches to map ADAR editing sites and dsRNAs genome-wide, with particular focus on high-throughput sequencing methods and considerations for their successful application to the detection of editing sites and dsRNAs.
Affiliation(s)
- Daniel P Reich
- Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112
| | - Brenda L Bass
- Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112
| |
|
13
|
Wang M, Kong L. pblat: a multithread blat algorithm speeding up aligning sequences to genomes. BMC Bioinformatics 2019; 20:28. [PMID: 30646844 PMCID: PMC6334396 DOI: 10.1186/s12859-019-2597-8] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 03/27/2018] [Accepted: 01/03/2019] [Indexed: 11/17/2022] Open
Abstract
Background The blat program is a widely used sequence alignment tool. It is especially useful for aligning long sequences and for gapped mapping, which other fast sequence mappers designed for short reads cannot perform properly. However, blat is single-threaded, and mapping whole-genome or whole-transcriptome sequences to reference genomes can take days, making it unsuitable for large-scale sequencing projects and iterative analysis. Here, we present pblat (parallel blat), a parallelized blat algorithm with multithread and cluster computing support that rapidly fine-maps large-scale DNA/RNA sequences against genomes. Results The pblat algorithm takes advantage of modern multicore processors, and its run time decreases significantly as the number of threads increases. pblat uses almost the same amount of memory as blat. The results generated by pblat are identical to those generated by blat. The pblat tool is easy to install and runs on Linux and Mac OS systems. In addition, we provide a cluster version of pblat (pblat-cluster) that runs on computing clusters with MPI support. Conclusion pblat is open source and freely available to non-commercial users. It is easy to install and easy to use. pblat and pblat-cluster facilitate high-throughput mapping of large-scale genomic and transcript sequences to reference genomes with both high speed and high precision.
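The abstract's claim that a multithreaded run produces results identical to the single-threaded one can be illustrated with a minimal Python sketch. This is not pblat's implementation: `align_one` is a hypothetical toy stand-in for an aligner (an exact substring search), and the reference and query strings are invented. The sketch only shows the ordering property. Because of Python's global interpreter lock, threads here would not give a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def align_one(query: str, reference: str) -> tuple[str, int]:
    """Toy stand-in for an aligner (hypothetical): first exact-match offset, or -1."""
    return query, reference.find(query)


def align_serial(queries, reference):
    """Process queries one after another, like a single-threaded run."""
    return [align_one(q, reference) for q in queries]


def align_parallel(queries, reference, threads=4):
    """Distribute queries over worker threads and collect results in input order."""
    # Executor.map yields results in the order of the inputs, so the merged
    # output matches the serial run no matter which worker finishes first.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda q: align_one(q, reference), queries))


# Invented example data, for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTA"
queries = ["TAGC", "GGCT", "AAAA", "ACGT"]

print(align_parallel(queries, reference) == align_serial(queries, reference))
```

Collecting results in input order, rather than in completion order, is the design choice that keeps the parallel output comparable line by line with the serial output.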
Affiliation(s)
- Meng Wang
- Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, 100871, People's Republic of China
| | - Lei Kong
- Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, 100871, People's Republic of China.
| |
|
14
|
Analysis of Epstein-Barr Virus Genomes and Expression Profiles in Gastric Adenocarcinoma. J Virol 2018; 92:JVI.01239-17. [PMID: 29093097 DOI: 10.1128/jvi.01239-17] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 07/19/2017] [Accepted: 10/05/2017] [Indexed: 01/10/2023] Open
Abstract
Epstein-Barr virus (EBV) is a causative agent of a variety of lymphomas, nasopharyngeal carcinoma (NPC), and ∼9% of gastric carcinomas (GCs). An important question is whether particular EBV variants are more oncogenic than others, but conclusions are currently hampered by the lack of sequenced EBV genomes. Here, we contribute to this question by mining whole-genome sequences of 201 GCs to identify 13 EBV-positive GCs and by assembling 13 new EBV genome sequences, almost doubling the number of available GC-derived EBV genome sequences and providing the first non-Asian EBV genome sequences from GC. Whole-genome sequence comparisons of all EBV isolates sequenced to date (85 from tumors and 57 from healthy individuals) showed that most GC and NPC EBV isolates were closely related although American Caucasian GC samples were more distant, suggesting a geographical component. However, EBV GC isolates were found to contain some consistent changes in protein sequences regardless of geographical origin. In addition, transcriptome data available for eight of the EBV-positive GCs were analyzed to determine which EBV genes are expressed in GC. In addition to the expected latency proteins (EBNA1, LMP1, and LMP2A), specific subsets of lytic genes were consistently expressed that did not reflect a typical lytic or abortive lytic infection, suggesting a novel mechanism of EBV gene regulation in the context of GC. These results are consistent with a model in which a combination of specific latent and lytic EBV proteins promotes tumorigenesis.
IMPORTANCE Epstein-Barr virus (EBV) is a widespread virus that causes cancer, including gastric carcinoma (GC), in a small subset of individuals. An important question is whether particular EBV variants are more cancer associated than others, but more EBV sequences are required to address this question. Here, we have generated 13 new EBV genome sequences from GC, almost doubling the number of EBV sequences from GC isolates and providing the first EBV sequences from non-Asian GC. We further identify sequence changes in some EBV proteins common to GC isolates. In addition, gene expression analysis of eight of the EBV-positive GCs showed consistent expression of both the expected latency proteins and a subset of lytic proteins that was not consistent with typical lytic or abortive lytic expression. These results suggest that novel mechanisms activate expression of some EBV lytic proteins and that their expression may contribute to oncogenesis.
|
15
|
Cox JW, Ballweg RA, Taft DH, Velayutham P, Haslam DB, Porollo A. A fast and robust protocol for metataxonomic analysis using RNAseq data. Microbiome 2017; 5:7. [PMID: 28103917 PMCID: PMC5244565 DOI: 10.1186/s40168-016-0219-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 07/21/2016] [Accepted: 12/05/2016] [Indexed: 05/03/2023]
Abstract
BACKGROUND Metagenomics is a rapidly emerging field aimed to analyze microbial diversity and dynamics by studying the genomic content of the microbiota. Metataxonomics tools analyze high-throughput sequencing data, primarily from 16S rRNA gene sequencing and DNAseq, to identify microorganisms and viruses within a complex mixture. With the growing demand for analysis of the functional microbiome, metatranscriptome studies attract more interest. To make metatranscriptomic data sufficient for metataxonomics, new analytical workflows are needed to deal with sparse and taxonomically less informative sequencing data. RESULTS We present a new protocol, IMSA+A, for accurate taxonomy classification based on metatranscriptome data of any read length that can efficiently and robustly identify bacteria, fungi, and viruses in the same sample. The new protocol improves accuracy by using a conservative reference database, employing a new counting scheme, and by assembling shotgun reads. Assembly also reduces analysis runtime. Simulated data were utilized to evaluate the protocol by permuting common experimental variables. When applied to the real metatranscriptome data for mouse intestines colonized by ASF, the protocol showed superior performance in detection of the microorganisms compared to the existing metataxonomics tools. IMSA+A is available at https://github.com/JeremyCoxBMI/IMSA-A . CONCLUSIONS The developed protocol addresses the need for taxonomy classification from RNAseq data. Previously not utilized, i.e., unmapped to a reference genome, RNAseq reads can now be used to gather taxonomic information about the microbiota present in a biological sample without conducting additional sequencing. Any metatranscriptome pipeline that includes assembly of reads can add this analysis with minimal additional cost of compute time. The new protocol also creates an opportunity to revisit old metatranscriptome data, where taxonomic content may be important but was not analyzed.
Affiliation(s)
- Jeremy W Cox
- Department of Electrical Engineering and Computing Systems, University of Cincinnati, 2901 Woodside Drive, Cincinnati, OH, 45221, USA
- The Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, MLC 15012, Cincinnati, OH, 45229-3039, USA
| | - Richard A Ballweg
- The Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, MLC 15012, Cincinnati, OH, 45229-3039, USA
| | - Diana H Taft
- The Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, MLC 15012, Cincinnati, OH, 45229-3039, USA
| | - Prakash Velayutham
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH, 45229, USA
| | - David B Haslam
- Division of Infectious Diseases, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH, 45229, USA
| | - Aleksey Porollo
- The Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, MLC 15012, Cincinnati, OH, 45229-3039, USA.
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH, 45229, USA.
| |
|
16
|
Hjelmsø MH, Hellmér M, Fernandez-Cassi X, Timoneda N, Lukjancenko O, Seidel M, Elsässer D, Aarestrup FM, Löfström C, Bofill-Mas S, Abril JF, Girones R, Schultz AC. Evaluation of Methods for the Concentration and Extraction of Viruses from Sewage in the Context of Metagenomic Sequencing. PLoS One 2017; 12:e0170199. [PMID: 28099518 PMCID: PMC5242460 DOI: 10.1371/journal.pone.0170199] [Citation(s) in RCA: 92] [Impact Index Per Article: 13.1] [Received: 10/13/2016] [Accepted: 01/02/2017] [Indexed: 01/18/2023] Open
Abstract
Viral sewage metagenomics is a novel field of study used for surveillance, epidemiological studies, and evaluation of waste water treatment efficiency. In raw sewage, human waste is mixed with household, industrial and drainage water, and virus particles are therefore found only in low concentrations. This necessitates a sample concentration step to allow for sensitive virus detection. Additionally, viruses harbor a large diversity of both surface and genome structures, which makes universal viral genomic extraction difficult. Current studies have tackled these challenges in many different ways, employing a wide range of viral concentration and extraction procedures. However, there is limited knowledge of the efficacy and inherent biases associated with these methods with respect to viral sewage metagenomics, hampering the development of this field. Using next-generation sequencing, this study aimed to evaluate the efficiency of four commonly applied viral concentration techniques (precipitation with polyethylene glycol, organic flocculation with skim milk, monolithic adsorption filtration and glass wool filtration) and four extraction methods (Nucleospin RNA XS, QIAamp Viral RNA Mini Kit, NucliSENS® miniMAG®, or PowerViral® Environmental RNA/DNA Isolation Kit) to determine the viriome in a sewage sample. We found a significant influence of concentration and extraction protocols on the detected viriome. The viral richness was largest in samples extracted with QIAamp Viral RNA Mini Kit or PowerViral® Environmental RNA/DNA Isolation Kit. The highest viral specificity was found in samples concentrated by precipitation with polyethylene glycol or extracted with Nucleospin RNA XS. Detection of viral pathogens depended on the method used. These results contribute to the understanding of method-associated biases within the field of viral sewage metagenomics, making evaluation of the current literature easier and helping with the design of future studies.
Affiliation(s)
- Mathis Hjort Hjelmsø
- Research Group for Genomic Epidemiology, The National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Maria Hellmér
- Division of Microbiology and Production, The National Food Institute, Technical University of Denmark, Søborg, Denmark
| | - Xavier Fernandez-Cassi
- Laboratory of Virus Contaminants of Water and Food, Department of Genetics, Microbiology, and Statistics, University of Barcelona, Barcelona, Catalonia, Spain
| | - Natàlia Timoneda
- Laboratory of Virus Contaminants of Water and Food, Department of Genetics, Microbiology, and Statistics, University of Barcelona, Barcelona, Catalonia, Spain
- Institute of Biomedicine of the University of Barcelona, University of Barcelona, Barcelona, Catalonia, Spain
| | - Oksana Lukjancenko
- Research Group for Genomic Epidemiology, The National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Michael Seidel
- Institute of Hydrochemistry, Chair of Analytical Chemistry, Technical University of Munich, Munich, Germany
| | - Dennis Elsässer
- Institute of Hydrochemistry, Chair of Analytical Chemistry, Technical University of Munich, Munich, Germany
| | - Frank M. Aarestrup
- Research Group for Genomic Epidemiology, The National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Charlotta Löfström
- Division of Microbiology and Production, The National Food Institute, Technical University of Denmark, Søborg, Denmark
| | - Sílvia Bofill-Mas
- Laboratory of Virus Contaminants of Water and Food, Department of Genetics, Microbiology, and Statistics, University of Barcelona, Barcelona, Catalonia, Spain
| | - Josep F. Abril
- Laboratory of Virus Contaminants of Water and Food, Department of Genetics, Microbiology, and Statistics, University of Barcelona, Barcelona, Catalonia, Spain
- Institute of Biomedicine of the University of Barcelona, University of Barcelona, Barcelona, Catalonia, Spain
| | - Rosina Girones
- Laboratory of Virus Contaminants of Water and Food, Department of Genetics, Microbiology, and Statistics, University of Barcelona, Barcelona, Catalonia, Spain
| | - Anna Charlotte Schultz
- Division of Microbiology and Production, The National Food Institute, Technical University of Denmark, Søborg, Denmark
| |
|
17
|
Brumme CJ, Poon AFY. Promises and pitfalls of Illumina sequencing for HIV resistance genotyping. Virus Res 2016; 239:97-105. [PMID: 27993623 DOI: 10.1016/j.virusres.2016.12.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 09/20/2016] [Revised: 12/15/2016] [Accepted: 12/15/2016] [Indexed: 12/13/2022]
Abstract
Genetic sequencing ("genotyping") plays a critical role in the modern clinical management of HIV infection. This virus evolves rapidly within patients because of its error-prone reverse transcriptase and short generation time. Consequently, HIV variants with mutations that confer resistance to one or more antiretroviral drugs can emerge during sub-optimal treatment. There are now multiple HIV drug resistance interpretation algorithms that take the region of the HIV genome encoding the major drug targets as inputs; expert use of these algorithms can significantly improve clinical outcomes in HIV treatment. Next-generation sequencing has the potential to revolutionize HIV resistance genotyping by lowering the threshold at which rare but clinically significant HIV variants can be detected reproducibly, and by conferring improved cost-effectiveness in high-throughput scenarios. In this review, we discuss the relative merits and challenges of deploying the Illumina MiSeq instrument for clinical HIV genotyping.
Affiliation(s)
- Chanson J Brumme
- BC Centre for Excellence in HIV/AIDS, Vancouver, British Columbia, Canada
| | - Art F Y Poon
- Department of Pathology & Laboratory Medicine, Western University, London, Ontario, Canada.
| |
|
18
|
Ali R, Blackburn RM, Kozlakidis Z. Next-Generation Sequencing and Influenza Virus: A Short Review of the Published Implementation Attempts. HAYATI Journal of Biosciences 2016. [DOI: 10.1016/j.hjb.2016.12.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/09/2022] Open
|
19
|
Pinthong W, Muangruen P, Suriyaphol P, Mairiang D. A simple grid implementation with Berkeley Open Infrastructure for Network Computing using BLAST as a model. PeerJ 2016; 4:e2248. [PMID: 27547555 PMCID: PMC4974928 DOI: 10.7717/peerj.2248] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 02/03/2016] [Accepted: 06/22/2016] [Indexed: 11/20/2022] Open
Abstract
Development of high-throughput technologies, such as next-generation sequencing, allows thousands of experiments to be performed simultaneously while reducing resource requirements. Consequently, a massive amount of experimental data is now rapidly generated. Nevertheless, the data are not readily usable or meaningful until they are further analysed and interpreted. Due to the size of the data, a high-performance computer (HPC) is required for the analysis and interpretation. However, an HPC is expensive and difficult to access. Other means, such as cloud computing services and grid computing systems, were developed to give researchers the power of an HPC without the need to purchase and maintain one. In this study, we implemented grid computing in a computer training center environment using Berkeley Open Infrastructure for Network Computing (BOINC) as a job distributor and data manager, combining all desktop computers to virtualize the HPC. Fifty desktop computers were used to set up a grid system during off-hours. To test the performance of the grid system, we adapted the Basic Local Alignment Search Tool (BLAST) to the BOINC system. Sequencing results from the Illumina platform were aligned to the human genome database by BLAST on the grid system. The results and processing time were compared to those from a single desktop computer and an HPC. The estimated durations of BLAST analysis for 4 million sequence reads on a desktop PC, HPC and the grid system were 568, 24 and 5 days, respectively. Thus, the grid implementation of BLAST by BOINC is an efficient alternative to the HPC for sequence alignment. The grid implementation by BOINC also helped tap unused computing resources during off-hours and could be easily modified for other available bioinformatics software.
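The durations reported in this abstract translate directly into speedup ratios. The short Python sketch below does only that arithmetic on the three figures quoted (568, 24 and 5 days). The function and variable names are illustrative, and nothing beyond the abstract's numbers is assumed.

```python
# Estimated BLAST durations (days) for 4 million reads, as quoted in the abstract.
durations_days = {"desktop PC": 568, "HPC": 24, "grid": 5}


def speedup(baseline_days: float, target_days: float) -> float:
    """How many times faster the target is than the baseline."""
    return baseline_days / target_days


grid_vs_desktop = speedup(durations_days["desktop PC"], durations_days["grid"])
grid_vs_hpc = speedup(durations_days["HPC"], durations_days["grid"])

print(f"grid vs desktop PC: {grid_vs_desktop:.1f}x")
print(f"grid vs HPC: {grid_vs_hpc:.1f}x")
```

The desktop-to-grid ratio comes out larger than the 50 machines in the grid. The abstract does not say how the estimates were derived, so the ratio is best read as a rough comparison rather than a measured scaling factor.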
Affiliation(s)
- Watthanai Pinthong
- Department of Anatomy, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| | - Panya Muangruen
- Siriraj Information Technology Department, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| | - Prapat Suriyaphol
- Division of Bioinformatics and Data Management for Research, Department of Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| | - Dumrong Mairiang
- Medical Biotechnology Research Laboratory, The National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathumthani, Thailand
- Division of Dengue Hemorrhagic Fever Research, Department of Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| |
|
20
|
Flygare S, Simmon K, Miller C, Qiao Y, Kennedy B, Di Sera T, Graf EH, Tardif KD, Kapusta A, Rynearson S, Stockmann C, Queen K, Tong S, Voelkerding KV, Blaschke A, Byington CL, Jain S, Pavia A, Ampofo K, Eilbeck K, Marth G, Yandell M, Schlaberg R. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling. Genome Biol 2016; 17:111. [PMID: 27224977 PMCID: PMC4880956 DOI: 10.1186/s13059-016-0969-1] [Citation(s) in RCA: 110] [Impact Index Per Article: 13.8] [Received: 12/08/2015] [Accepted: 04/27/2016] [Indexed: 02/07/2023] Open
Abstract
Background High-throughput sequencing enables unbiased profiling of microbial communities, universal pathogen detection, and analysis of host response to infectious diseases. However, computation times and algorithmic inaccuracies have hindered adoption. Results We present Taxonomer, an ultrafast web tool for comprehensive metagenomics data analysis and interactive results visualization. Taxonomer is unique in providing integrated nucleotide and protein-based classification and simultaneous host messenger RNA (mRNA) transcript profiling. Using real-world case studies, we show that Taxonomer detects previously unrecognized infections and reveals antiviral host mRNA expression profiles. To facilitate data sharing across geographic distances in outbreak settings, Taxonomer is publicly available through a web-based user interface. Conclusions Taxonomer enables rapid, accurate, and interactive analyses of metagenomics data on personal computers and mobile devices.
Affiliation(s)
- Steven Flygare
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Keith Simmon
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
| | - Chase Miller
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Yi Qiao
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Brett Kennedy
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Tonya Di Sera
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Erin H Graf
- Department of Pathology, University of Utah, Salt Lake City, UT, USA
| | - Keith D Tardif
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT, USA
| | - Aurélie Kapusta
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Shawn Rynearson
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
| | - Chris Stockmann
- Department of Pediatrics, University of Utah, Salt Lake City, UT, USA
| | - Krista Queen
- Centers for Disease Control and Prevention, Atlanta, GA, USA
| | - Suxiang Tong
- Centers for Disease Control and Prevention, Atlanta, GA, USA
| | - Karl V Voelkerding
- Department of Pathology, University of Utah, Salt Lake City, UT, USA
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT, USA
| | - Anne Blaschke
- Department of Pediatrics, University of Utah, Salt Lake City, UT, USA
| | - Carrie L Byington
- Department of Pediatrics, University of Utah, Salt Lake City, UT, USA
| | - Seema Jain
- Centers for Disease Control and Prevention, Atlanta, GA, USA
| | - Andrew Pavia
- Department of Pediatrics, University of Utah, Salt Lake City, UT, USA
| | - Krow Ampofo
- Department of Pediatrics, University of Utah, Salt Lake City, UT, USA
| | - Karen Eilbeck
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- USTAR Center for Genetic Discovery, Salt Lake City, UT, USA
| | - Gabor Marth
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
- USTAR Center for Genetic Discovery, Salt Lake City, UT, USA
| | - Mark Yandell
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
- USTAR Center for Genetic Discovery, Salt Lake City, UT, USA
| | - Robert Schlaberg
- Department of Pathology, University of Utah, Salt Lake City, UT, USA
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT, USA
| |
|
21
|
Zhao S, Xi L, Quan J, Xi H, Zhang Y, von Schack D, Vincent M, Zhang B. QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization. BMC Genomics 2016; 17:39. [PMID: 26747388 PMCID: PMC4706714 DOI: 10.1186/s12864-015-2356-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 11/16/2015] [Accepted: 12/23/2015] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND RNA sequencing (RNA-seq), a next-generation sequencing technique for transcriptome profiling, is being increasingly used, in part driven by the decreasing cost of sequencing. Nevertheless, the analysis of the massive amounts of data generated by large-scale RNA-seq remains a challenge. Multiple algorithms pertinent to basic analyses have been developed, and there is an increasing need to automate the use of these tools so as to obtain results in an efficient and user-friendly manner. Increased automation and improved visualization of the results will help make the results and findings of the analyses readily available to experimental scientists. RESULTS By combining the best open source tools developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we have implemented QuickRNASeq, a pipeline for large-scale RNA-seq data analyses and visualization. The QuickRNASeq workflow consists of three main steps. In Step #1, each individual sample is processed, including mapping RNA-seq reads to a reference genome, counting the numbers of mapped reads, quality control of the aligned reads, and SNP (single nucleotide polymorphism) calling. Step #1 is computationally intensive, and can be processed in parallel. In Step #2, the results from individual samples are merged, and an integrated and interactive project report is generated. All analysis results in the report are accessible via a single HTML entry webpage. Step #3 is the data interpretation and presentation step. The rich visualization features implemented here allow end users to interactively explore the results of RNA-seq data analyses, and to gain more insights into RNA-seq datasets. In addition, we used a real-world dataset to demonstrate the simplicity and efficiency of QuickRNASeq in RNA-seq data analyses and interactive visualizations. The seamless integration of automated capabilities with interactive visualizations in QuickRNASeq is not available in other published RNA-seq pipelines. CONCLUSION The high degree of automation and interactivity in QuickRNASeq leads to a substantial reduction in the time and effort required prior to further downstream analyses and interpretation of the analysis findings. QuickRNASeq advances primary RNA-seq data analyses to the next level of automation, and is mature for public release and adoption.
Affiliation(s)
- Shanrong Zhao
- PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - Li Xi
- PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - Jie Quan
- Computational Sciences Center of Emphasis, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - Hualin Xi
- Computational Sciences Center of Emphasis, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - Ying Zhang
- PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - David von Schack
- PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - Michael Vincent
- PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| | - Baohong Zhang
- PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, 02139, USA.
| |
|
22
|
Zhao S, Xi L, Zhang B. Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be? PLoS One 2015; 10:e0141910. [PMID: 26559532 PMCID: PMC4641603 DOI: 10.1371/journal.pone.0141910] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 08/26/2015] [Accepted: 10/14/2015] [Indexed: 11/24/2022] Open
Abstract
In recent years, RNA-seq has emerged as a powerful technology for estimating gene and/or transcript expression, and RPKM (Reads Per Kilobase per Million reads) is widely used to represent the relative abundance of mRNAs for a gene. In general, the methods for gene quantification can be divided into two categories: the transcript-based approach and the ‘union exon’-based approach. The transcript-based approach is intrinsically more difficult because different isoforms of a gene typically have a high proportion of genomic overlap. The ‘union exon’-based approach, on the other hand, is much simpler and thus widely used in RNA-seq gene quantification. Biologically, a gene is expressed in one or more transcript isoforms. Therefore, the transcript-based approach is logically more meaningful than the ‘union exon’-based approach. Although gene quantification is a fundamental task in most RNA-seq studies, it remains unclear whether the ‘union exon’-based approach for RNA-seq gene quantification is good practice. In this paper, we carried out a side-by-side comparison of the ‘union exon’-based approach and the transcript-based approach in RNA-seq gene quantification. We found that gene expression levels are significantly underestimated by the ‘union exon’-based approach, and that the average RPKM from the ‘union exon’-based approach is less than 50% of the mean expression obtained from the transcript-based approach. The difference between the two approaches is primarily affected by the number of transcripts in a gene. We performed differential analysis at both the gene and transcript levels and found that more insights, such as isoform switches, are gained from isoform differential analysis. The accuracy of isoform quantification would improve if the read coverage pattern and exon-exon spanning reads were taken into account and incorporated into the EM (Expectation Maximization) algorithm. Our investigation discourages the use of the ‘union exon’-based approach in gene quantification despite its simplicity.
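The RPKM measure named in this abstract has a standard definition: read count divided by feature length in kilobases and by total mapped reads in millions. The Python sketch below applies it to an invented gene to show why dividing by a longer ‘union exon’ length lowers the estimate. All counts and lengths are hypothetical, and the sketch is not the paper's quantification pipeline, which uses an EM algorithm for isoform-level estimates.

```python
def rpkm(read_count: int, length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads."""
    # (reads / (length / 1e3)) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical gene: only a short isoform is expressed, but the union of all
# annotated exons across isoforms is much longer than that isoform.
reads_on_gene = 700
total_mapped_reads = 10_000_000
expressed_isoform_length_bp = 1_000
union_exon_length_bp = 3_500

transcript_based = rpkm(reads_on_gene, expressed_isoform_length_bp, total_mapped_reads)
union_exon_based = rpkm(reads_on_gene, union_exon_length_bp, total_mapped_reads)

print(f"transcript-based RPKM: {transcript_based:.1f}")
print(f"union-exon-based RPKM: {union_exon_based:.1f}")
```

With the same reads in the numerator, the longer denominator alone accounts for the lower union-exon value in this toy case. The gap grows as a gene gains annotated isoforms, which agrees with the abstract's remark that the difference depends mainly on the number of transcripts in a gene.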
Affiliation(s)
- Shanrong Zhao
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, Massachusetts, 02139, United States of America
- * E-mail: ;
| | - Li Xi
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, Massachusetts, 02139, United States of America
| | - Baohong Zhang
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, Massachusetts, 02139, United States of America
| |
Collapse
|
23
|
Shuda M, Guastafierro A, Geng X, Shuda Y, Ostrowski SM, Lukianov S, Jenkins FJ, Honda K, Maricich SM, Moore PS, Chang Y. Merkel Cell Polyomavirus Small T Antigen Induces Cancer and Embryonic Merkel Cell Proliferation in a Transgenic Mouse Model. PLoS One 2015; 10:e0142329. [PMID: 26544690 PMCID: PMC4636375 DOI: 10.1371/journal.pone.0142329] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 06/24/2015] [Accepted: 10/19/2015] [Indexed: 01/30/2023]
Abstract
Merkel cell polyomavirus (MCV) causes the majority of human Merkel cell carcinomas (MCC) and encodes a small T (sT) antigen that transforms immortalized rodent fibroblasts in vitro. To develop a mouse model for MCV sT-induced carcinogenesis, we generated transgenic mice with a flox-stop-flox MCV sT sequence homologously recombined at the ROSA locus (ROSAsT), allowing Cre-mediated, conditional MCV sT expression. Standard tamoxifen (TMX) administration to adult UbcCreERT2; ROSAsT mice, in which Cre is ubiquitously expressed, resulted in MCV sT expression in multiple organs that was uniformly lethal within 5 days. Conversely, most adult UbcCreERT2; ROSAsT mice survived low-dose tamoxifen administration but developed ear lobe dermal hyperkeratosis and hypergranulosis. Simultaneous MCV sT expression and conditional homozygous p53 deletion generated multi-focal, poorly-differentiated, highly anaplastic tumors in the spleens and livers of mice after 60 days of TMX treatment. Mouse embryonic fibroblasts from these mice induced to express MCV sT exhibited anchorage-independent cell growth. To examine Merkel cell pathology, MCV sT expression was also induced during mid-embryogenesis in Merkel cells of Atoh1CreERT2/+; ROSAsT mice, which led to significantly increased Merkel cell numbers in touch domes at late embryonic ages that normalized postnatally. Tamoxifen administration to adult Atoh1CreERT2/+; ROSAsT and Atoh1CreERT2/+; ROSAsT; p53flox/flox mice had no effects on Merkel cell numbers and did not induce tumor formation. Taken together, these results show that MCV sT stimulates progenitor Merkel cell proliferation in embryonic mice and is a bona fide viral oncoprotein that induces full cancer cell transformation in the p53-null setting.
MESH Headings
- Anaplasia
- Animals
- Antigens, Viral, Tumor/genetics
- Carcinoma, Merkel Cell/pathology
- Carcinoma, Merkel Cell/virology
- Cell Count
- Cell Differentiation
- Cell Line, Tumor
- Cell Proliferation
- Cell Transformation, Viral
- Disease Models, Animal
- Embryo, Mammalian/pathology
- Female
- Humans
- Liver/pathology
- Male
- Merkel Cells/pathology
- Merkel cell polyomavirus/immunology
- Merkel cell polyomavirus/physiology
- Mice
- Mice, Inbred C57BL
- Mice, Transgenic
- Pregnancy
- Skin Neoplasms/pathology
- Skin Neoplasms/virology
- Spleen/pathology
- Tumor Suppressor Protein p53/deficiency
- Tumor Suppressor Protein p53/genetics
Affiliation(s)
- Masahiro Shuda
  - Cancer Virology Program, University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Anna Guastafierro
  - Cancer Virology Program, University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Xuehui Geng
  - Richard King Mellon Institute for Pediatric Research, Department of Pediatrics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Yoko Shuda
  - Cancer Virology Program, University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Stephen M. Ostrowski
  - Department of Dermatology, Case Western Reserve University School of Medicine, Cleveland, Ohio, United States of America
- Stefan Lukianov
  - Cancer Virology Program, University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Frank J. Jenkins
  - Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Kord Honda
  - Department of Dermatology, Case Western Reserve University School of Medicine, Cleveland, Ohio, United States of America
- Stephen M. Maricich
  - Richard King Mellon Institute for Pediatric Research, Department of Pediatrics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Patrick S. Moore
  - Cancer Virology Program, University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Yuan Chang
  - Cancer Virology Program, University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
24
|
Haque MM, Bose T, Dutta A, Reddy CVSK, Mande SS. CS-SCORE: Rapid identification and removal of human genome contaminants from metagenomic datasets. Genomics 2015; 106:116-21. [DOI: 10.1016/j.ygeno.2015.04.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 01/02/2015] [Revised: 04/09/2015] [Accepted: 04/26/2015] [Indexed: 02/01/2023]
|
25
|
Whipple JM, Youssef OA, Aruscavage PJ, Nix DA, Hong C, Johnson WE, Bass BL. Genome-wide profiling of the C. elegans dsRNAome. RNA (New York, N.Y.) 2015; 21:786-800. [PMID: 25805852 PMCID: PMC4408787 DOI: 10.1261/rna.048801.114] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 11/01/2014] [Accepted: 12/23/2014] [Indexed: 06/01/2023]
Abstract
Recent studies hint that endogenous dsRNA plays an unexpected role in cellular signaling. However, a complete understanding of endogenous dsRNA signaling is hindered by an incomplete annotation of dsRNA-producing genes. To identify dsRNAs expressed in Caenorhabditis elegans, we developed a bioinformatics pipeline that identifies dsRNA by detecting clustered RNA editing sites, which are strictly limited to long dsRNA substrates of Adenosine Deaminases that act on RNA (ADAR). We compared two alignment algorithms for mapping both unique and repetitive reads and detected as many as 664 editing-enriched regions (EERs) indicative of dsRNA loci. EERs are visually enriched on the distal arms of autosomes and are predicted to possess strong internal secondary structures as well as sequence complementarity with other EERs, indicative of both intramolecular and intermolecular duplexes. Most EERs were associated with protein-coding genes, with ∼1.7% of all C. elegans mRNAs containing an EER, located primarily in very long introns and in annotated, as well as unannotated, 3' UTRs. In addition to numerous EERs associated with coding genes, we identified a population of prospective noncoding EERs that were distant from protein-coding genes and that had little or no coding potential. Finally, subsets of EERs are differentially expressed during development as well as during starvation and infection with bacterial or fungal pathogens. By combining RNA-seq with freely available bioinformatics tools, our workflow provides an easily accessible approach for the identification of dsRNAs, and more importantly, a catalog of the C. elegans dsRNAome.
Affiliation(s)
- Joseph M Whipple
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
- Osama A Youssef
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
- P Joseph Aruscavage
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
- David A Nix
  - Department of Oncological Sciences, Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, Utah 84112-5775, USA
- Changjin Hong
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, Massachusetts 02118, USA
- W Evan Johnson
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, Massachusetts 02118, USA
- Brenda L Bass
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
|
26
|
Zhao S, Zhang B. A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics 2015; 16:97. [PMID: 25765860 PMCID: PMC4339237 DOI: 10.1186/s12864-015-1308-8] [Citation(s) in RCA: 81] [Impact Index Per Article: 9.0] [Received: 09/30/2014] [Accepted: 01/30/2015] [Indexed: 01/09/2023]
Abstract
Background RNA-Seq has become increasingly popular in transcriptome profiling. One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. Acquiring a transcriptome expression profile requires genomic elements to be defined in the context of the genome. Multiple human genome annotation databases exist, including RefGene (RefSeq Gene), Ensembl, and the UCSC annotation database. The impact of the choice of an annotation on estimating gene expression remains insufficiently investigated. Results In this paper, we systematically characterized the impact of genome annotation choice on read mapping and transcriptome quantification by analyzing an RNA-Seq dataset generated by the Human Body Map 2.0 Project. The impact of a gene model on the mapping of non-junction reads differs from its impact on junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene model was used. By contrast, this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10–15% mapped alternatively. There are 21,958 common genes among RefGene, Ensembl, and UCSC annotations. When we compared the gene quantification results in RefGene and Ensembl annotations, 20% of genes were not expressed, and thus had a zero count in both annotations. Surprisingly, identical gene quantification results were obtained for only 16.3% (about one sixth) of genes. Approximately 28.1% of genes’ expression levels differed by 5% or higher, and of those, the relative expression levels for 9.3% of genes (equivalent to 2038) differed by 50% or greater. The case studies revealed that the gene definition differences in gene models frequently result in inconsistency in gene quantification.
Conclusions We demonstrated that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. Our research will help RNA-Seq data analysts to make an informed choice of gene model in practical RNA-Seq data analysis.
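The kind of per-gene comparison summarized in these results can be sketched as follows. This is only an illustration under stated assumptions: the gene names and counts are invented, the function names are hypothetical, and the relative difference is assumed to be |a - b| / max(a, b), which the abstract does not specify.

```python
from typing import Dict


def relative_difference(a: float, b: float) -> float:
    """Assumed definition: absolute difference scaled by the larger value."""
    if a == 0 and b == 0:
        return 0.0
    return abs(a - b) / max(a, b)


def summarize(counts_a: Dict[str, float], counts_b: Dict[str, float]) -> Dict[str, int]:
    """Tally genes common to both annotations into the abstract's categories."""
    summary = {
        "common": 0,
        "not_expressed": 0,   # zero count in both annotations
        "identical": 0,       # same non-zero count in both
        "diff_ge_5pct": 0,    # relative difference of 5% or more
        "diff_ge_50pct": 0,   # subset of the above with 50% or more
    }
    for gene in counts_a.keys() & counts_b.keys():
        a, b = counts_a[gene], counts_b[gene]
        summary["common"] += 1
        if a == 0 and b == 0:
            summary["not_expressed"] += 1
        elif a == b:
            summary["identical"] += 1
        else:
            diff = relative_difference(a, b)
            if diff >= 0.05:
                summary["diff_ge_5pct"] += 1
            if diff >= 0.50:
                summary["diff_ge_50pct"] += 1
    return summary


# Invented counts for two annotations of the same sample.
refgene = {"G1": 0, "G2": 100, "G3": 100, "G4": 100, "G5": 40, "G6": 5}
ensembl = {"G1": 0, "G2": 100, "G3": 98, "G4": 90, "G5": 100, "G7": 1}

result = summarize(refgene, ensembl)
for category, n in result.items():
    print(f"{category}: {n}")
```

Genes present in only one annotation (G6 and G7 here) are excluded, mirroring the abstract's restriction to common genes. Dividing each tally by `result["common"]` would give percentages comparable in form to those reported.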
Affiliation(s)
- Shanrong Zhao
  - Clinical Genetics and Bioinformatics, BioTherapeutics Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
- Baohong Zhang
  - Clinical Genetics and Bioinformatics, BioTherapeutics Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
|
27
|
Abstract
Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.
Affiliation(s)
- J Rodney Brister
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Danso Ako-Adjei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Yiming Bao
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Olga Blinkova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
28
|
Kuhn JH, Andersen KG, Bào Y, Bavari S, Becker S, Bennett RS, Bergman NH, Blinkova O, Bradfute S, Brister JR, Bukreyev A, Chandran K, Chepurnov AA, Davey RA, Dietzgen RG, Doggett NA, Dolnik O, Dye JM, Enterlein S, Fenimore PW, Formenty P, Freiberg AN, Garry RF, Garza NL, Gire SK, Gonzalez JP, Griffiths A, Happi CT, Hensley LE, Herbert AS, Hevey MC, Hoenen T, Honko AN, Ignatyev GM, Jahrling PB, Johnson JC, Johnson KM, Kindrachuk J, Klenk HD, Kobinger G, Kochel TJ, Lackemeyer MG, Lackner DF, Leroy EM, Lever MS, Mühlberger E, Netesov SV, Olinger GG, Omilabu SA, Palacios G, Panchal RG, Park DJ, Patterson JL, Paweska JT, Peters CJ, Pettitt J, Pitt L, Radoshitzky SR, Ryabchikova EI, Saphire EO, Sabeti PC, Sealfon R, Shestopalov AM, Smither SJ, Sullivan NJ, Swanepoel R, Takada A, Towner JS, van der Groen G, Volchkov VE, Volchkova VA, Wahl-Jensen V, Warren TK, Warfield KL, Weidmann M, Nichol ST. Filovirus RefSeq entries: evaluation and selection of filovirus type variants, type sequences, and names. Viruses 2014; 6:3663-82. [PMID: 25256396 PMCID: PMC4189044 DOI: 10.3390/v6093663] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 09/17/2014] [Accepted: 09/23/2014] [Indexed: 12/14/2022]
Abstract
Sequence determination of complete or coding-complete genomes of viruses is becoming common practice for supporting the work of epidemiologists, ecologists, virologists, and taxonomists. Sequencing duration and costs are rapidly decreasing, sequencing hardware is under modification for use by non-experts, and software is constantly being improved to simplify sequence data management and analysis. Thus, analysis of virus disease outbreaks on the molecular level is now feasible, including characterization of the evolution of individual virus populations in single patients over time. The increasing accumulation of sequencing data creates a management problem for the curators of commonly used sequence databases and an entry retrieval problem for end users. Therefore, utilizing the data to their fullest potential will require setting nomenclature and annotation standards for virus isolates and associated genomic sequences. The National Center for Biotechnology Information's (NCBI's) RefSeq is a non-redundant, curated database for reference (or type) nucleotide sequence records that supplies source data to numerous other databases. Building on recently proposed templates for filovirus variant naming, we report consensus decisions from a majority of past and currently active filovirus experts on the eight filovirus type variants and isolates to be represented in RefSeq, their final designations, and their associated sequences.
Affiliation(s)
- Jens H Kuhn
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Kristian G Andersen
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Yīmíng Bào
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Sina Bavari
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Stephan Becker
  - Institut für Virologie, Philipps-Universität Marburg, 35043 Marburg, Germany.
- Richard S Bennett
  - National Biodefense Analysis and Countermeasures Center, Fort Detrick, Frederick, MD 21702, USA.
- Nicholas H Bergman
  - National Biodefense Analysis and Countermeasures Center, Fort Detrick, Frederick, MD 21702, USA.
- Olga Blinkova
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- J Rodney Brister
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Alexander Bukreyev
  - Department of Pathology and Galveston National Laboratory, University of Texas Medical Branch, Galveston, TX 77555, USA.
- Kartik Chandran
  - Department of Microbiology and Immunology, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
- Alexander A Chepurnov
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Robert A Davey
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Ralf G Dietzgen
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Norman A Doggett
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Olga Dolnik
  - Institut für Virologie, Philipps-Universität Marburg, 35043 Marburg, Germany.
- John M Dye
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Sven Enterlein
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Paul W Fenimore
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Pierre Formenty
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Alexander N Freiberg
  - Department of Pathology and Galveston National Laboratory, University of Texas Medical Branch, Galveston, TX 77555, USA.
- Robert F Garry
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Nicole L Garza
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Stephen K Gire
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Jean-Paul Gonzalez
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Anthony Griffiths
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Christian T Happi
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Lisa E Hensley
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Andrew S Herbert
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Michael C Hevey
  - National Biodefense Analysis and Countermeasures Center, Fort Detrick, Frederick, MD 21702, USA.
- Thomas Hoenen
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Anna N Honko
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Georgy M Ignatyev
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Peter B Jahrling
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Joshua C Johnson
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Karl M Johnson
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Jason Kindrachuk
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Hans-Dieter Klenk
  - Institut für Virologie, Philipps-Universität Marburg, 35043 Marburg, Germany.
- Gary Kobinger
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Tadeusz J Kochel
  - National Biodefense Analysis and Countermeasures Center, Fort Detrick, Frederick, MD 21702, USA.
- Matthew G Lackemeyer
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Daniel F Lackner
  - National Biodefense Analysis and Countermeasures Center, Fort Detrick, Frederick, MD 21702, USA.
- Eric M Leroy
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Mark S Lever
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Elke Mühlberger
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Sergey V Netesov
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Gene G Olinger
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Sunday A Omilabu
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Gustavo Palacios
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Rekha G Panchal
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Daniel J Park
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Jean L Patterson
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Janusz T Paweska
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Clarence J Peters
  - Department of Pathology and Galveston National Laboratory, University of Texas Medical Branch, Galveston, TX 77555, USA.
- James Pettitt
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA.
- Louise Pitt
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Sheli R Radoshitzky
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Elena I Ryabchikova
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Erica Ollmann Saphire
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Pardis C Sabeti
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Rachel Sealfon
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Sophie J Smither
  - FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA.
- Nancy J Sullivan
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Robert Swanepoel
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Ayato Takada
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Jonathan S Towner
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Guido van der Groen
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Viktor E Volchkov
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Valentina A Volchkova
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Victoria Wahl-Jensen
  - National Biodefense Analysis and Countermeasures Center, Fort Detrick, Frederick, MD 21702, USA.
- Travis K Warren
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Kelly L Warfield
  - Information Engineering Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
- Manfred Weidmann
  - United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD 21702, USA.
- Stuart T Nichol
  - Viral Special Pathogens Branch, Division of High-Consequence Pathogens Pathology, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA.
|
29
|
Zhao S. Assessment of the impact of using a reference transcriptome in mapping short RNA-Seq reads. PLoS One 2014; 9:e101374. [PMID: 24992027 PMCID: PMC4081564 DOI: 10.1371/journal.pone.0101374] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 03/05/2014] [Accepted: 06/06/2014] [Indexed: 11/28/2022]
Abstract
RNA-Seq has become increasingly popular in transcriptome profiling. The major challenge in RNA-Seq data analysis is the accurate mapping of junction reads to their genomic origins. To detect splice sites in short reads, many RNA-Seq aligners use a reference transcriptome to inform the placement of junction reads. However, no systematic evaluation has been performed to assess or quantify the benefits of incorporating a reference transcriptome in mapping RNA-Seq reads. In this paper, we have studied the impact of a reference transcriptome on mapping RNA-Seq reads, especially junction reads. The same dataset was analysed with and without the RefGene transcriptome. A Perl script was then developed to analyse and compare the mapping results. It was found that about 50–55% of junction reads can be mapped to the same genomic regions regardless of whether the RefGene model is used. More than one-third of reads fail to be mapped without the help of a reference transcriptome. For “Alternatively” mapped reads, i.e., those reads mapped differently with and without the RefGene model, the mappings without the RefGene model are usually worse than the corresponding alignments with the RefGene model. Junction reads that span more than two exons are less likely to be aligned correctly without the assistance of a reference transcriptome. As sequencing technology evolves, read lengths are becoming longer. Longer reads are more likely to span multiple exons, and thus the mapping of long junction reads is becoming more challenging without the assistance of a reference transcriptome. Therefore, the advantages of using a reference transcriptome in mapping demonstrated in this study are becoming more evident for longer reads. In addition, the effect of the completeness of the reference transcriptome on the mapping of RNA-Seq reads is discussed.
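The comparison described in this abstract was carried out with a Perl script that is not reproduced here. The sketch below is a separate, minimal Python illustration of the same idea, with invented reads and hypothetical function names. It assumes each alignment is summarized as a (chromosome, position, CIGAR) tuple and relies on the SAM convention that an `N` operation in the CIGAR string marks a skipped reference region such as an intron.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Alignment = Tuple[str, int, str]  # (chromosome, 1-based position, CIGAR)


def is_junction(cigar: str) -> bool:
    """A read is treated as a junction read if its CIGAR skips reference bases."""
    return "N" in cigar


def exons_spanned(cigar: str) -> int:
    """Each N operation separates two exon segments of the read."""
    return cigar.count("N") + 1


def compare_mappings(
    with_model: Dict[str, Optional[Alignment]],
    without_model: Dict[str, Optional[Alignment]],
) -> Counter:
    """Classify junction reads (as mapped with the gene model) by their fate without it."""
    tally: Counter = Counter()
    for read_id, aln in with_model.items():
        if aln is None or not is_junction(aln[2]):
            continue  # only junction reads mapped with the model are compared
        other = without_model.get(read_id)
        if other is None:
            tally["unmapped_without_model"] += 1
        elif other == aln:
            tally["identical"] += 1
        else:
            tally["alternative"] += 1
    return tally


# Invented alignments for four reads.
with_model = {
    "r1": ("chr1", 100, "40M500N35M"),
    "r2": ("chr1", 900, "20M300N30M200N25M"),
    "r3": ("chr2", 50, "10M1000N65M"),
    "r4": ("chr2", 400, "75M"),  # not a junction read
}
without_model = {
    "r1": ("chr1", 100, "40M500N35M"),
    "r2": ("chr1", 900, "20M300N55M"),  # mapped, but differently
    "r3": None,                          # failed to align
    "r4": ("chr2", 400, "75M"),
}

tally = compare_mappings(with_model, without_model)
print(dict(tally))
print("r2 spans", exons_spanned(with_model["r2"][2]), "exons with the model")
```

Here r2 spans three exons when aligned with the gene model but only two without it, the kind of read the abstract reports as hardest to place correctly without a reference transcriptome.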
Affiliation(s)
- Shanrong Zhao
  - Systems Pharmacology and Biomarkers, Janssen Research & Development, LLC, San Diego, California, United States of America
|