1
Merino Martinez R, Müller H, Negru S, Ormenisan A, Arroyo Mühr LS, Zhang X, Trier Møller F, Clements MS, Kozlakidis Z, Pimenoff VN, Wilkowski B, Boeckhout M, Öhman H, Chong S, Holzinger A, Lehtinen M, van Veen EB, Bała P, Widschwendter M, Dowling J, Törnroos J, Snyder MP, Dillner J. Human exposome assessment platform. Environ Epidemiol 2021; 5:e182. [PMID: 34909561 PMCID: PMC8663864 DOI: 10.1097/ee9.0000000000000182] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/02/2021] [Accepted: 11/14/2021] [Indexed: 11/26/2022]
Abstract
The Human Exposome Assessment Platform (HEAP) is a research resource for the integrated and efficient management and analysis of human exposome data. The project will provide the complete workflow for obtaining exposome actionable knowledge from population-based cohorts. HEAP is a state-of-the-science service composed of computational resources from partner institutions, accessed through a software framework that provides the world's fastest Hadoop platform for data warehousing and applied artificial intelligence (AI). The software will provide a decision support system for researchers and policymakers. All the data managed and processed by HEAP, together with the analysis pipelines, will be available for future research. In addition, the platform enables adding new data and analysis pipelines. HEAP's final product can be deployed in multiple instances to create a network of shareable and reusable knowledge on the impact of exposures on public health.
Affiliation(s)
- Frederik Trier Møller
  - Infectious Disease Epidemiology and Prevention, Statens Serum Institut, Copenhagen, Denmark
- Zisis Kozlakidis
  - International Agency for Research on Cancer, World Health Organization, Lyon, France
- Ville N. Pimenoff
  - Karolinska Institutet, Stockholm, Sweden
  - Faculty of Medicine, University of Oulu, Oulu, Finland
  - Tampere University, Tampere, Finland
- Hanna Öhman
  - Faculty of Medicine, University of Oulu, Oulu, Finland
  - Biobank Borealis of Northern Finland, Oulu University Hospital, Oulu, Finland
- Steven Chong
  - Danish National Biobank, Statens Serum Institut, Copenhagen, Denmark
- Matti Lehtinen
  - Karolinska Institutet, Stockholm, Sweden
  - Tampere University, Tampere, Finland
- Martin Widschwendter
  - Research Institute for Biomedical Aging Research, Universität Innsbruck, Innsbruck, Austria
2
Fuentes-Trillo A, Monzó C, Manzano I, Santiso-Bellón C, Andrade JDSRD, Gozalbo-Rovira R, García-García AB, Rodríguez-Díaz J, Chaves FJ. Benchmarking different approaches for Norovirus genome assembly in metagenome samples. BMC Genomics 2021; 22:849. [PMID: 34819031 PMCID: PMC8611953 DOI: 10.1186/s12864-021-08067-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/09/2021] [Accepted: 10/10/2021] [Indexed: 12/22/2022]
Abstract
BACKGROUND: Genome assembly of viruses with high mutation rates, such as Norovirus and other RNA viruses, or from metagenome samples, poses a challenge for the scientific community due to the coexistence of several viral quasispecies and strains. Furthermore, there is no standard method for obtaining whole-genome sequences in non-related patients. After polyA RNA isolation and sequencing in eight patients with acute gastroenteritis, we evaluated two de Bruijn graph assemblers (SPAdes and MEGAHIT), combined with four different and common pre-assembly strategies, and compared those yielding whole genome Norovirus contigs. RESULTS: Reference-genome guided strategies with both host and target virus did not present any advantages compared to the assembly of non-filtered data in the case of SPAdes, and in the case of MEGAHIT, only host genome filtering presented improvements. MEGAHIT performed better than SPAdes in most samples, reaching complete genome sequences in most of them for all the strategies employed. Read binning with CD-HIT improved assembly when paired with different analysis strategies, and more notably in the case of SPAdes. CONCLUSIONS: Not all metagenome assemblies are equal and the choice in the workflow depends on the species studied and the prior steps to analysis. We may need different approaches even for samples treated equally due to the presence of high intra-host variability. We tested and compared different workflows for the accurate assembly of Norovirus genomes and established their assembly capacities for this purpose.
Affiliation(s)
- Azahara Fuentes-Trillo
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital- INCLIVA, Valencia, Spain
- Carolina Monzó
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital- INCLIVA, Valencia, Spain
- Iris Manzano
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital- INCLIVA, Valencia, Spain
- Ana-Bárbara García-García
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital- INCLIVA, Valencia, Spain
  - Spanish Biomedical Research Network in Diabetes and Associated Metabolic Disorders (CIBERDEM), Madrid, Spain
- Jesús Rodríguez-Díaz
  - Department of Microbiology, School of Medicine, University of Valencia, Valencia, Spain
- Felipe Javier Chaves
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital- INCLIVA, Valencia, Spain
  - Spanish Biomedical Research Network in Diabetes and Associated Metabolic Disorders (CIBERDEM), Madrid, Spain
  - Sequencing Multiplex S.L., Valencia, Spain
3
Maarala AI, Arasalo O, Valenzuela D, Mäkinen V, Heljanko K. Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment. PLoS One 2021; 16:e0255260. [PMID: 34343181 PMCID: PMC8330939 DOI: 10.1371/journal.pone.0255260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2021] [Accepted: 07/12/2021] [Indexed: 11/19/2022]
Abstract
Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. For performing rapid analytics with the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we will focus on the efficient construction of a compressed index for pan-genomes. Compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. 
Compressing n = 13,375,031 (488 GB) GenBank database to BLAST index resulted in CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) to the compressed index of human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB mixed bacterial sequences (n = 599) were blasted to the compressed index of 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB mixed sequences (n = 4,167) were blasted to the compressed index of 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
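To make the reported compression ratios concrete, the short sketch below converts a ratio into an approximate index size. It assumes that CR means uncompressed input size divided by index size, which the abstract does not define; the function name `index_size_gb` is hypothetical and is not part of DHPGIndex.

```python
def index_size_gb(original_gb: float, compression_ratio: float) -> float:
    """Approximate index size, assuming CR = original size / index size.

    This definition of CR is an assumption; the abstract does not spell it out.
    """
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return original_gb / compression_ratio


# Figures quoted in the abstract: 488 GB GenBank database, 62:1 CR.
genbank_index_gb = index_size_gb(488, 62)
print(f"GenBank BLAST index: about {genbank_index_gb:.1f} GB")
```

Under that assumption, the 488 GB GenBank database would occupy roughly 8 GB once indexed.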
Affiliation(s)
- Ossi Arasalo
  - Department of Computer Science, Aalto University, Espoo, Finland
- Daniel Valenzuela
  - Department of Computer Science, University of Helsinki, Espoo, Finland
- Veli Mäkinen
  - Department of Computer Science, University of Helsinki, Espoo, Finland
  - Helsinki Institute for Information Technology, Espoo, Finland
- Keijo Heljanko
  - Department of Computer Science, University of Helsinki, Espoo, Finland
  - Helsinki Institute for Information Technology, Espoo, Finland
4
Posada-Céspedes S, Seifert D, Topolsky I, Jablonski KP, Metzner KJ, Beerenwinkel N. V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data. Bioinformatics 2021; 37:1673-1680. [PMID: 33471068 PMCID: PMC8289377 DOI: 10.1093/bioinformatics/btab015] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 06/11/2020] [Revised: 12/09/2020] [Accepted: 01/08/2021] [Indexed: 12/30/2022]
Abstract
Motivation: High-throughput sequencing technologies are used increasingly not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations. Results: To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. Availability and implementation: V-pipe is freely available at https://github.com/cbg-ethz/V-pipe. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Susana Posada-Céspedes
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- David Seifert
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Ivan Topolsky
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Kim Philipp Jablonski
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Karin J Metzner
  - Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, 8091, Switzerland; Institute of Medical Virology, University of Zurich, Zurich, 8091, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
5
Krissaane I, De Niz C, Gutiérrez-Sacristán A, Korodi G, Ede N, Kumar R, Lyons J, Manrai A, Patel C, Kohane I, Avillach P. Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services. J Am Med Inform Assoc 2020; 27:1425-1430. [PMID: 32719837 PMCID: PMC7534581 DOI: 10.1093/jamia/ocaa068] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/23/2020] [Revised: 03/20/2020] [Accepted: 04/17/2020] [Indexed: 01/14/2023]
Abstract
Objective: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies. Methods: We provide a straightforward and innovative methodology to optimize cloud configuration in order to conduct genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680) for analysis and exploration of genomic variant datasets. Results: Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics. Conclusions: We present a timely piece for one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?
Affiliation(s)
- Paul Avillach
  - Corresponding Author: Paul Avillach, Department of Biomedical Informatics, Harvard Medical School, Harvard University, Boston 02115, MA, USA
6
Transcription of human papillomavirus oncogenes in head and neck squamous cell carcinomas. Vaccine 2020; 38:4066-4070. [PMID: 32362526 DOI: 10.1016/j.vaccine.2020.04.049] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 12/16/2019] [Revised: 04/03/2020] [Accepted: 04/20/2020] [Indexed: 12/24/2022]
Abstract
Some head and neck cancers are caused by human papillomavirus (HPV). As HPV vaccination can prevent infection, an estimation of which HPV types have an active viral oncogene transcription in what proportion of tumors might allow estimation of the proportion of head and neck cancers preventable by HPV vaccination. We used all RNA sequencing data from primary tumors of head and neck squamous cell carcinomas from the Cancer Genome Atlas (n = 500 patients). We analysed 3.7 terabytes of sequencing data with the bioinformatics pipeline ViraPipe. Paired-end reads were quality filtered using the original code and aligned to known HPV sequences. HPV transcripts were found in 113/500 specimens, with transcription of both the E6 and E7 viral oncogenes in 90 specimens. HPV16 had E6/E7 transcription in 67 cases, HPV33 in 14 cases, HPV18 in 6 cases and HPV35 in 5 cases. HPV oncogene transcription was most common in tumors from tonsils (34/40, 85%), followed by palate (4/5, 80%), base of tongue (10/20, 50%), oropharynx (4/10, 40%), and gum (4/11, 36%). Comparison to the cancer incidence statistics in the USA indicates that vaccine-preventable HPV16/18/33 oncogene transcription would be found in about 8.3% of female and 20.2% of male patients with head and neck cancers in the USA. Transcription of the HPV oncogenes is present in a large proportion of head and neck cancers in the TCGA database. If these cancers are caused by HPV, prevention of HPV16/18/33 infections would prevent ~49 300 annual head and neck cancer cases in the USA alone.
7
Pérez-Losada M, Arenas M, Galán JC, Bracho MA, Hillung J, García-González N, González-Candelas F. High-throughput sequencing (HTS) for the analysis of viral populations. Infection, Genetics and Evolution 2020; 80:104208. [PMID: 32001386 DOI: 10.1016/j.meegid.2020.104208] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 09/30/2019] [Revised: 01/21/2020] [Accepted: 01/24/2020] [Indexed: 12/12/2022]
Abstract
The development of High-Throughput Sequencing (HTS) technologies is having a major impact on the genomic analysis of viral populations. Current HTS platforms can capture nucleic acid variation across millions of genes for both selected amplicons and full viral genomes. HTS has already facilitated the discovery of new viruses, hinted at new taxonomic classifications and provided a deeper and broader understanding of their diversity, population and genetic structure. Hence, HTS has already replaced standard Sanger sequencing in basic and applied research fields, but the next step is its implementation as a routine technology for the analysis of viruses in clinical settings. The most likely application of this implementation will be the analysis of viral genomics, because the huge population sizes, high mutation rates and very fast replacement of viral populations have demonstrated the limited information obtained with Sanger technology. In this review, we describe new technologies and provide guidelines for the high-throughput sequencing and genetic and evolutionary analyses of viral populations and metaviromes, including software applications. With the development of new HTS technologies, new and refurbished molecular and bioinformatic tools are also constantly being developed to process and integrate HTS data. These allow assembling viral genomes and inferring viral population diversity and dynamics. Finally, we also present several applications of these approaches to the analysis of viral clinical samples including transmission clusters and outbreak characterization.
Affiliation(s)
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão 4485-661, Portugal
- Miguel Arenas
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain; Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
- Juan Carlos Galán
  - Microbiology Service, Hospital Ramón y Cajal, Madrid, Spain; CIBER in Epidemiology and Public Health, Spain
- Mª Alma Bracho
  - CIBER in Epidemiology and Public Health, Spain; Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain
- Julia Hillung
  - Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain
- Neris García-González
  - Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain
- Fernando González-Candelas
  - CIBER in Epidemiology and Public Health, Spain; Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain
8
Maabar M, Davison AJ, Vučak M, Thorburn F, Murcia PR, Gunson R, Palmarini M, Hughes J. DisCVR: Rapid viral diagnosis from high-throughput sequencing data. Virus Evol 2019; 5:vez033. [PMID: 31528358 PMCID: PMC6735924 DOI: 10.1093/ve/vez033] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
High-throughput sequencing (HTS) enables most pathogens in a clinical sample to be detected from a single analysis, thereby providing novel opportunities for diagnosis, surveillance, and epidemiology. However, this powerful technology is difficult to apply in diagnostic laboratories because of its computational and bioinformatic demands. We have developed DisCVR, which detects known human viruses in clinical samples by matching sample k-mers (twenty-two nucleotide sequences) to k-mers from taxonomically labeled viral genomes. DisCVR was validated using published HTS data for eighty-nine clinical samples from adults with upper respiratory tract infections. These samples had been tested for viruses metagenomically and also by real-time polymerase chain reaction assay, which is the standard diagnostic method. DisCVR detected human viruses with high sensitivity (79%) and specificity (100%), and was able to detect mixed infections. Moreover, it produced results comparable to those in a published metagenomic analysis of 177 blood samples from patients in Nigeria. DisCVR has been designed as a user-friendly tool for detecting human viruses from HTS data using computers with limited RAM and processing power, and includes a graphical user interface to help users interpret and validate the output. It is written in Java and is publicly available from http://bioinformatics.cvr.ac.uk/discvr.php.
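The k-mer matching idea described in this abstract can be illustrated with a minimal sketch. This is not DisCVR's code (DisCVR is written in Java), and every name below (`kmers`, `build_index`, `count_hits`) is hypothetical. The sketch assumes exact matching of 22-nucleotide k-mers on the forward strand only and skips k-mers containing ambiguous bases; whether DisCVR makes the same choices is not stated in the abstract.

```python
from collections import Counter

K = 22  # k-mer length stated in the abstract


def kmers(seq, k=K):
    """Yield each k-length substring of seq, skipping any with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def build_index(references, k=K):
    """Map each reference k-mer to the set of virus labels that contain it."""
    index = {}
    for label, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(label)
    return index


def count_hits(reads, index, k=K):
    """Count, per virus label, the distinct sample k-mers found in the index."""
    sample_kmers = {kmer for read in reads for kmer in kmers(read, k)}
    hits = Counter()
    for kmer in sample_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    return hits


# Toy data with a short k so the example stays readable (sequences are made up).
references = {"virusA": "ACGTACGTTTGA", "virusB": "GGGCCCAAATTT"}
reads = ["ACGTACGT", "CCCAAA"]
print(count_hits(reads, build_index(references, k=4), k=4))
```

A real tool would also need to handle reverse complements, k-mers shared with the host genome, and a threshold for calling a virus present; those steps are left out here.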
Affiliation(s)
- Maha Maabar
  - MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
- Andrew J Davison
  - MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
- Matej Vučak
  - MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
- Fiona Thorburn
  - Microbiology Department, Glasgow Royal Infirmary, Glasgow G4 0SF, UK
- Pablo R Murcia
  - MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
- Rory Gunson
  - West of Scotland Specialist Virology Centre, Glasgow Royal Infirmary, Glasgow G4 0SF, UK
- Massimo Palmarini
  - MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
- Joseph Hughes
  - MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK