1
Harsono IW, Ariani Y, Benyamin B, Fadilah F, Pujianto DA, Hafifah CN. IDeRare: a lightweight and extensible open-source phenotype and exome analysis pipeline for germline rare disease diagnosis. JAMIA Open 2024; 7:ooae052. [PMID: 38883202 PMCID: PMC11179852 DOI: 10.1093/jamiaopen/ooae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 05/20/2024] [Accepted: 05/27/2024] [Indexed: 06/18/2024]
Abstract
Objectives: Diagnosing rare diseases is an arduous and challenging process in clinical settings, resulting in the late discovery of novel variants and referral loops. To help clinicians, we built the IDeRare pipeline to accelerate phenotype-genotype analysis for patients with suspected rare diseases. Materials and Methods: The IDeRare pipeline is separated into phenotype and genotype parts. The phenotype part uses our handmade Python library, while the genotype part uses command-line (bash) and Python scripts to combine bioinformatics executables and Docker images. Results: We describe various implementations of the IDeRare phenotype and genotype parts with real-world clinical and exome data, accelerating the terminology conversion process and giving insight into the diagnostic pathway, from disease linkage analysis through exome analysis to HTML-based reporting for clinicians. Conclusion: IDeRare is freely available under the BSD-3 license and obtainable via GitHub. The portable IDeRare pipeline can be easily implemented by semi-technical users and extended by advanced users.
Affiliation(s)
- Ivan William Harsono
  - Doctoral Program in Biomedical Sciences, Faculty of Medicine, Universitas Indonesia, Jakarta 10430, Indonesia
- Yulia Ariani
  - Department of Medical Biology, Faculty of Medicine, Universitas Indonesia, Jakarta 10430, Indonesia
- Beben Benyamin
  - Australian Centre for Precision Health, University of South Australia, Adelaide 5000, Australia
  - UniSA Allied Health and Human Performance, University of South Australia, Adelaide 5000, Australia
  - South Australian Health and Medical Research Institute (SAHMRI), University of South Australia, Adelaide 5000, Australia
- Fadilah Fadilah
  - Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta 10430, Indonesia
  - Bioinformatics Core Facilities - IMERI, Faculty of Medicine, Universitas Indonesia, Jakarta 10430, Indonesia
- Dwi Ari Pujianto
  - Department of Medical Biology, Faculty of Medicine, Universitas Indonesia, Jakarta 10430, Indonesia
- Cut Nurul Hafifah
  - Department of Child Health, Dr Cipto Mangunkusumo Hospital, Faculty of Medicine, University of Indonesia, Jakarta 10430, Indonesia
2
Anilkumar Sithara A, Maripuri D, Moorthy K, Amirtha Ganesh S, Philip P, Banerjee S, Sudhakar M, Raman K. iCOMIC: a graphical interface-driven bioinformatics pipeline for analyzing cancer omics data. NAR Genom Bioinform 2022; 4:lqac053. [PMID: 35899080 PMCID: PMC9310080 DOI: 10.1093/nargab/lqac053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2021] [Revised: 06/17/2022] [Accepted: 07/04/2022] [Indexed: 11/13/2022]
Abstract
Despite the tremendous increase in omics data generated by modern sequencing technologies, their analysis can be tricky and often requires substantial expertise in bioinformatics. To address this concern, we have developed a user-friendly pipeline to analyze (cancer) genomic data that takes in raw sequencing data (FASTQ format) as input and outputs insightful statistics. Our iCOMIC toolkit pipeline featuring many independent workflows is embedded in the popular Snakemake workflow management system. It can analyze whole-genome and transcriptome data and is characterized by a user-friendly GUI that offers several advantages, including minimal execution steps and eliminating the need for complex command-line arguments. Notably, we have integrated algorithms developed in-house to predict pathogenicity among cancer-causing mutations and differentiate between tumor suppressor genes and oncogenes from somatic mutation data. We benchmarked our tool against Genome In A Bottle benchmark dataset (NA12878) and got the highest F1 score of 0.971 and 0.988 for indels and SNPs, respectively, using the BWA MEM—GATK HC DNA-Seq pipeline. Similarly, we achieved a correlation coefficient of r = 0.85 using the HISAT2-StringTie-ballgown and STAR-StringTie-ballgown RNA-Seq pipelines on the human monocyte dataset (SRP082682). Overall, our tool enables easy analyses of omics datasets, significantly ameliorating complex data analysis pipelines.
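The F1 scores quoted in this abstract are the harmonic mean of precision and recall over a benchmarked call set. A minimal sketch of that calculation follows; the function name and the true-positive, false-positive, and false-negative counts are hypothetical and are not taken from the iCOMIC benchmark.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the harmonic mean of precision and recall for a variant call set.

    tp: calls that match the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    if tp == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_f1 = f1_score(tp=988, fp=12, fn=12)
print(round(example_f1, 3))
```

In practice the counts would come from comparing a pipeline's VCF against a truth set such as the Genome in a Bottle calls for NA12878, typically with a dedicated comparison tool rather than hand-rolled matching.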
Affiliation(s)
- Anjana Anilkumar Sithara
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Devi Priyanka Maripuri
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Keerthika Moorthy
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Sai Sruthi Amirtha Ganesh
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Philge Philip
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Shayantan Banerjee
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Malvika Sudhakar
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
- Karthik Raman
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Centre for Integrative Biology and Systems mEdicine, IIT Madras, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, India
3
Ahmed Z, Renart EG, Mishra D, Zeeshan S. JWES: a new pipeline for whole genome/exome sequence data processing, management, and gene-variant discovery, annotation, prediction, and genotyping. FEBS Open Bio 2021; 11:2441-2452. [PMID: 34370400 PMCID: PMC8409305 DOI: 10.1002/2211-5463.13261] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/02/2021] [Revised: 07/18/2021] [Accepted: 08/02/2021] [Indexed: 01/07/2023]
Abstract
Whole genome and exome sequencing (WGS/WES) are the most popular next‐generation sequencing (NGS) methodologies and are at present often used to detect rare and common genetic variants of clinical significance. We emphasize that automated sequence data processing, management, and visualization should be an indispensable component of modern WGS and WES data analysis for sequence assembly, variant detection (SNPs, SVs), imputation, and resolution of haplotypes. In this manuscript, we present a newly developed findable, accessible, interoperable, and reusable (FAIR) bioinformatics‐genomics pipeline Java based Whole Genome/Exome Sequence Data Processing Pipeline (JWES) for efficient variant discovery and interpretation, and big data modeling and visualization. JWES is a cross‐platform, user‐friendly, product line application, that entails three modules: (a) data processing, (b) storage, and (c) visualization. The data processing module performs a series of different tasks for variant calling, the data storage module efficiently manages high‐volume gene‐variant data, and the data visualization module supports variant data interpretation with Circos graphs. The performance of JWES was tested and validated in‐house with different experiments, using Microsoft Windows, macOS Big Sur, and UNIX operating systems. JWES is an open‐source and freely available pipeline, allowing scientists to take full advantage of all the computing resources available, without requiring much computer science knowledge. We have successfully applied JWES for processing, management, and gene‐variant discovery, annotation, prediction, and genotyping of WGS and WES data to analyze variable complex disorders. In summary, we report the performance of JWES with some reproducible case studies, using open access and in‐house generated, high‐quality datasets.
Affiliation(s)
- Zeeshan Ahmed
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
  - Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, New Brunswick, NJ, USA
- Eduard Gibert Renart
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Deepshikha Mishra
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
  - Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
4
Ahmed Z, Renart EG, Zeeshan S. Genomics pipelines to investigate susceptibility in whole genome and exome sequenced data for variant discovery, annotation, prediction and genotyping. PeerJ 2021; 9:e11724. [PMID: 34395068 PMCID: PMC8320519 DOI: 10.7717/peerj.11724] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/29/2021] [Accepted: 06/14/2021] [Indexed: 12/12/2022]
Abstract
Over the last few decades, genomics has been leading toward an audacious future and has been changing our views about conducting biomedical research, studying diseases, and understanding diversity in our society across the human species. Whole genome and exome sequencing (WGS/WES) are two of the most popular next-generation sequencing (NGS) methodologies currently used to detect genetic variations of clinical significance. Investigating WGS/WES data for variant discovery and genotyping is based on the nexus of different data analytic applications. Although several bioinformatics applications have been developed, and many of those are freely available and published, timely finding and interpreting genetic variants are still challenging tasks among diagnostic laboratories and clinicians. In this study, we are interested in understanding, evaluating, and reporting the current state of solutions available to process NGS data of variable lengths and types for the identification of variants, alleles, and haplotypes. Within this scope, we consulted high-quality peer-reviewed literature published in the last 10 years. We focused on the standalone and networked bioinformatics applications proposed to efficiently process WGS and WES data and support downstream analysis for gene-variant discovery, annotation, prediction, and interpretation. We discuss our findings in this manuscript, which include but are not limited to the set of operations, workflow, data handling, involved tools, technologies and algorithms, and limitations of the assessed applications.
Affiliation(s)
- Zeeshan Ahmed
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
  - Department of Medicine, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Eduard Gibert Renart
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
  - Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
5
Nieroda L, Maas L, Thiebes S, Lang U, Sunyaev A, Achter V, Peifer M. iRODS metadata management for a cancer genome analysis workflow. BMC Bioinformatics 2019; 20:29. [PMID: 30646845 PMCID: PMC6334444 DOI: 10.1186/s12859-018-2576-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/31/2018] [Accepted: 12/10/2018] [Indexed: 12/13/2022]
Abstract
BACKGROUND The massive amounts of data from next generation sequencing (NGS) methods pose various challenges with respect to data security, storage and metadata management. While there is a broad range of data analysis pipelines, these challenges remain largely unaddressed to date. RESULTS We describe the integration of the open-source metadata management system iRODS (Integrated Rule-Oriented Data System) with a cancer genome analysis pipeline in a high performance computing environment. The system allows for customized metadata attributes as well as fine-grained protection rules and is augmented by a user-friendly front-end for metadata input. This results in a robust, efficient end-to-end workflow under consideration of data security, central storage and unified metadata information. CONCLUSIONS Integrating iRODS with an NGS data analysis pipeline is a suitable method for addressing the challenges of data security, storage and metadata management in NGS environments.
Affiliation(s)
- Lech Nieroda
  - Regional Computing Center (RRZK), University of Cologne, Cologne, 50931 Germany
- Lukas Maas
  - Department of Translational Genomics, Center of Integrated Oncology Cologne-Bonn, Medical Faculty, University of Cologne, Cologne, 50931 Germany
- Scott Thiebes
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, 76133 Germany
- Ulrich Lang
  - Regional Computing Center (RRZK), University of Cologne, Cologne, 50931 Germany
- Ali Sunyaev
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, 76133 Germany
- Viktor Achter
  - Regional Computing Center (RRZK), University of Cologne, Cologne, 50931 Germany
- Martin Peifer
  - Department of Translational Genomics, Center of Integrated Oncology Cologne-Bonn, Medical Faculty, University of Cologne, Cologne, 50931 Germany
  - Center for Molecular Medicine Cologne (CMMC), University of Cologne, Cologne, 50931 Germany
6
Cox KH, Oliveira LMB, Plummer L, Corbin B, Gardella T, Balasubramanian R, Crowley WF. Modeling mutant/wild-type interactions to ascertain pathogenicity of PROKR2 missense variants in patients with isolated GnRH deficiency. Hum Mol Genet 2019; 27:338-350. [PMID: 29161432 DOI: 10.1093/hmg/ddx404] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 10/14/2017] [Accepted: 11/10/2017] [Indexed: 12/30/2022]
Abstract
A major challenge in human genetics is the validation of pathogenicity of heterozygous missense variants. This problem is well-illustrated by PROKR2 variants associated with Isolated GnRH Deficiency (IGD). Homozygous, loss-of-function variants in PROKR2 were initially implicated in autosomal recessive IGD; however, most IGD-associated PROKR2 variants are heterozygous. Moreover, while IGD patient cohorts are enriched for PROKR2 missense variants, similar rare variants are also found in normal individuals. To elucidate the pathogenic mechanisms distinguishing IGD-associated PROKR2 variants from rare variants in controls, we assessed 59 variants using three approaches: (i) in silico prediction, (ii) traditional in vitro functional assays across three signaling pathways with mutant-alone transfections, and (iii) modified in vitro assays with mutant and wild-type expression constructs co-transfected to model in vivo heterozygosity. We found that neither in silico analyses nor traditional in vitro assessments of mutants transfected alone could distinguish IGD variants from control variants. However, in vitro co-transfections revealed that 15/34 IGD variants caused loss-of-function (LoF), including 3 novel dominant-negatives, while only 4/25 control variants caused LoF. Surprisingly, 19 IGD-associated variants were benign or exhibited LoF that could be rescued by WT co-transfection. Overall, variants that were LoF in ≥ 2 signaling assays under co-transfection conditions were more likely to be disease-associated than benign or 'rescuable' variants. Our findings suggest that in vitro modeling of WT/Mutant interactions increases the resolution for identifying causal variants, uncovers novel dominant negative mutations, and provides new insights into the pathogenic mechanisms underlying heterozygous PROKR2 variants.
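The abstract's summary observation, that variants showing LoF in at least two signaling assays under co-transfection were more likely to be disease-associated, can be expressed as a simple counting rule. The sketch below is illustrative only: the assay labels, function names, and example data are hypothetical, and the rule flags a variant for closer attention rather than classifying it as pathogenic.

```python
from typing import Mapping


def count_lof_assays(assay_is_lof: Mapping[str, bool]) -> int:
    """Count the co-transfection assays in which the variant showed loss of function."""
    return sum(1 for is_lof in assay_is_lof.values() if is_lof)


def meets_lof_threshold(assay_is_lof: Mapping[str, bool], min_assays: int = 2) -> bool:
    """Return True when LoF was seen in at least `min_assays` signaling assays.

    The default of 2 mirrors the ">= 2 signaling assays" wording of the abstract.
    """
    return count_lof_assays(assay_is_lof) >= min_assays


# Hypothetical results for one variant across three signaling assays.
example_variant = {"assay_1": True, "assay_2": True, "assay_3": False}
print(meets_lof_threshold(example_variant))
```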
Affiliation(s)
- Kimberly H Cox
  - Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Luciana M B Oliveira
  - Department of Bioregulation, Institute of Health Sciences, Federal University of Bahia, Salvador, Brazil
- Lacey Plummer
  - Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Braden Corbin
  - Endocrine Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Thomas Gardella
  - Endocrine Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Ravikumar Balasubramanian
  - Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- William F Crowley
  - Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
7
Musacchia F, Ciolfi A, Mutarelli M, Bruselles A, Castello R, Pinelli M, Basu S, Banfi S, Casari G, Tartaglia M, Nigro V. VarGenius executes cohort-level DNA-seq variant calling and annotation and allows to manage the resulting data through a PostgreSQL database. BMC Bioinformatics 2018; 19:477. [PMID: 30541431 PMCID: PMC6291943 DOI: 10.1186/s12859-018-2532-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 06/12/2018] [Accepted: 11/21/2018] [Indexed: 02/06/2023]
Abstract
BACKGROUND Targeted resequencing has become the most used and cost-effective approach for identifying causative mutations of Mendelian diseases both for diagnostics and research purposes. Due to very rapid technological progress, NGS laboratories are expanding their capabilities to address the increasing number of analyses. Several open source tools are available to build a generic variant calling pipeline, but a tool able to simultaneously execute multiple analyses, organize, and categorize the samples is still missing. RESULTS Here we describe VarGenius, a Linux-based command-line software able to execute customizable pipelines for the analysis of multiple targeted resequencing data using parallel computing. VarGenius provides a database to store the output of the analysis (calling quality statistics, variant annotations, internal allelic variant frequencies) and sample information (personal data, genotypes, phenotypes). VarGenius can also perform the "joint analysis" of hundreds of samples with a single command, drastically reducing the time for the configuration and execution of the analysis. VarGenius executes the standard pipeline of the Genome Analysis Tool-Kit (GATK) best practices (GBP) for germinal variant calling, annotates the variants using Annovar, and generates a user-friendly output displaying the results through a web page. VarGenius has been tested on a parallel computing cluster with 52 machines with 120GB of RAM each. Under this configuration, a 50 M whole exome sequencing (WES) analysis for a family was executed in about 7 h (trio or quartet); a joint analysis of 30 WES in about 24 h and the parallel analysis of 34 single samples from a 1 M panel in about 2 h. CONCLUSIONS We developed VarGenius, a "master" tool that faces the increasing demand of heterogeneous NGS analyses and allows maximum flexibility for downstream analyses. It paves the way to a different kind of analysis, centered on cohorts rather than on singletons. Patient and variant information are stored in the database and any output file can be accessed programmatically. VarGenius can be used for routine analyses by biomedical researchers with basic Linux skills, providing additional flexibility for computational biologists to develop their own algorithms for the comparison and analysis of data. The software is freely available at: https://github.com/frankMusacchia/VarGenius.
Affiliation(s)
- F. Musacchia
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- A. Ciolfi
  - Genetics and Rare Diseases Research Division, Bambino Gesù Children’s Hospital, Istituto di Ricovero e Cura a Carattere Scientifico, Rome, Italy
- M. Mutarelli
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- A. Bruselles
  - Department of Oncology and Molecular Medicine, Istituto Superiore di Sanità, Rome, Italy
- R. Castello
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- M. Pinelli
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- S. Basu
  - Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- S. Banfi
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
  - Università degli studi della Campania “Luigi Vanvitelli”, Caserta, Italy
- G. Casari
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- M. Tartaglia
  - Genetics and Rare Diseases Research Division, Bambino Gesù Children’s Hospital, Istituto di Ricovero e Cura a Carattere Scientifico, Rome, Italy
- V. Nigro
  - Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
  - Università degli studi della Campania “Luigi Vanvitelli”, Caserta, Italy
8
Meena N, Mathur P, Medicherla K, Suravajhala P. A Bioinformatics Pipeline for Whole Exome Sequencing: Overview of the Processing and Steps from Raw Data to Downstream Analysis. Bio Protoc 2018. [DOI: 10.21769/bioprotoc.2805] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/02/2022]
9
MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinformatics 2017; 18:49. [PMID: 28107819 PMCID: PMC5248509 DOI: 10.1186/s12859-016-1454-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 05/20/2016] [Accepted: 12/24/2016] [Indexed: 12/28/2022]
Abstract
Background: Next-generation genome sequencing techniques have become affordable for massive sequencing efforts devoted to clinical characterization of human diseases. However, the cost of providing cloud-based data analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the analysis tools used in order to reduce the overall computational processing time and, concomitantly, the processing cost. Furthermore, it is important to capitalize on recent developments in the cloud computing market, which has witnessed more providers competing in terms of products and prices. Results: In this paper, we present a new package called MC-GenomeKey (Multi-Cloud GenomeKey) that efficiently executes the variant analysis workflow for detecting and annotating mutations using cloud resources from different commercial cloud providers. Our package supports Amazon, Google, and Azure clouds, as well as any other cloud platform based on OpenStack. Our package allows different scenarios of execution with different levels of sophistication, up to the one where a workflow can be executed using a cluster whose nodes come from different clouds. MC-GenomeKey also supports scenarios to exploit the spot instance model of Amazon in combination with the use of other cloud platforms to provide significant cost reduction. To the best of our knowledge, this is the first solution that optimizes the execution of the workflow using computational resources from different cloud providers. Conclusions: MC-GenomeKey provides an efficient multicloud-based solution to detect and annotate mutations. The package can run on different commercial cloud platforms, which enables the user to seize the best offers. The package also provides a reliable means to make use of the low-cost spot instance model of Amazon, as it provides an efficient solution to the sudden termination of spot machines as a result of a sudden price increase. The package has a web interface and is available for free for academic use.
10
Hintzsche J, Kim J, Yadav V, Amato C, Robinson SE, Seelenfreund E, Shellman Y, Wisell J, Applegate A, McCarter M, Box N, Tentler J, De S, Robinson WA, Tan AC. IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples. J Am Med Inform Assoc 2016; 23:721-30. [PMID: 27026619 DOI: 10.1093/jamia/ocw022] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 08/31/2015] [Accepted: 02/01/2016] [Indexed: 11/14/2022]
Abstract
OBJECTIVE Currently, there is a disconnect between finding a patient's relevant molecular profile and predicting actionable therapeutics. Here we develop and implement the Integrating Molecular Profiles with Actionable Therapeutics (IMPACT) analysis pipeline, linking variants detected from whole-exome sequencing (WES) to actionable therapeutics. METHODS AND MATERIALS The IMPACT pipeline contains 4 analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against deleterious variants, and analyzing tumor heterogeneity. We tested the IMPACT pipeline on whole-exome sequencing data in The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples with known EGFR mutations. We also used IMPACT to analyze melanoma patient tumor samples before treatment, after BRAF-inhibitor treatment, and after BRAF- and MEK-inhibitor treatment. RESULTS IMPACT correctly identified known EGFR mutations in the TCGA lung adenocarcinoma samples. IMPACT linked these EGFR mutations to the appropriate Food and Drug Administration (FDA)-approved EGFR inhibitors. For the melanoma patient samples, we identified NRAS p.Q61K as an acquired resistance mutation to BRAF-inhibitor treatment. We also identified CDKN2A deletion as a novel acquired resistance mutation to BRAFi/MEKi inhibition. The IMPACT analysis pipeline links these somatic variants to actionable therapeutics. We observed the clonal dynamic in the tumor samples after various treatments. We showed that IMPACT not only helped in successful prioritization of clinically relevant variants but also linked these variations to possible targeted therapies. CONCLUSION IMPACT provides a new bioinformatics strategy to delineate candidate somatic variants and actionable therapies. This approach can be applied to other patient tumor samples to discover effective drug targets for personalized medicine. IMPACT is publicly available at http://tanlab.ucdenver.edu/IMPACT.
Affiliation(s)
- Jennifer Hintzsche
  - Division of Medical Oncology, Department of Medicine, School of Medicine
- Jihye Kim
  - Division of Medical Oncology, Department of Medicine, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Vinod Yadav
  - Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, School of Medicine
- Carol Amato
  - Division of Medical Oncology, Department of Medicine, School of Medicine
- Steven E Robinson
  - Division of Medical Oncology, Department of Medicine, School of Medicine
- Eric Seelenfreund
  - Division of Medical Oncology, Department of Medicine, School of Medicine
- Yiqun Shellman
  - Department of Dermatology, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Joshua Wisell
  - Department of Pathology, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Allison Applegate
  - Division of Medical Oncology, Department of Medicine, School of Medicine
- Martin McCarter
  - Department of Surgery, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Neil Box
  - Department of Dermatology, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- John Tentler
  - Division of Medical Oncology, Department of Medicine, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Subhajyoti De
  - Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, School of Medicine Department of Biostatistics and Informatics, Colorado School of Public Health University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- William A Robinson
  - Division of Medical Oncology, Department of Medicine, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Aik Choon Tan
  - Division of Medical Oncology, Department of Medicine, School of Medicine Department of Biostatistics and Informatics, Colorado School of Public Health University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
11
|
Pandey RV, Pabinger S, Kriegner A, Weinhäusel A. MutAid: Sanger and NGS Based Integrated Pipeline for Mutation Identification, Validation and Annotation in Human Molecular Genetics. PLoS One 2016; 11:e0147697. [PMID: 26840129 PMCID: PMC4739551 DOI: 10.1371/journal.pone.0147697] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/17/2015] [Accepted: 01/07/2016] [Indexed: 12/20/2022] Open
Abstract
Traditional Sanger sequencing as well as Next-Generation Sequencing have been used for the identification of disease causing mutations in human molecular research. The majority of currently available tools are developed for research and explorative purposes and often do not provide a complete, efficient, one-stop solution. As the focus of currently developed tools is mainly on NGS data analysis, no integrative solution for the analysis of Sanger data is provided and consequently a one-stop solution to analyze reads from both sequencing platforms is not available. We have therefore developed a new pipeline called MutAid to analyze and interpret raw sequencing data produced by Sanger or several NGS sequencing platforms. It performs format conversion, base calling, quality trimming, filtering, read mapping, variant calling, variant annotation and analysis of Sanger and NGS data under a single platform. It is capable of analyzing reads from multiple patients in a single run to create a list of potential disease causing base substitutions as well as insertions and deletions. MutAid has been developed for expert and non-expert users and supports four sequencing platforms including Sanger, Illumina, 454 and Ion Torrent. Furthermore, for NGS data analysis, five read mappers including BWA, TMAP, Bowtie, Bowtie2 and GSNAP and four variant callers including GATK-HaplotypeCaller, SAMTOOLS, Freebayes and VarScan2 pipelines are supported. MutAid is freely available at https://sourceforge.net/projects/mutaid.
Affiliation(s)
- Ram Vinay Pandey
  - AIT Austrian Institute of Technology, Health and Environment Department, Molecular Diagnostics, Vienna, Austria
  - Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, A-1210, Vienna, Austria
  - * E-mail:
- Stephan Pabinger
  - AIT Austrian Institute of Technology, Health and Environment Department, Molecular Diagnostics, Vienna, Austria
- Albert Kriegner
  - AIT Austrian Institute of Technology, Health and Environment Department, Molecular Diagnostics, Vienna, Austria
- Andreas Weinhäusel
  - AIT Austrian Institute of Technology, Health and Environment Department, Molecular Diagnostics, Vienna, Austria
12
Pandey RV, Pabinger S, Kriegner A, Weinhäusel A. ClinQC: a tool for quality control and cleaning of Sanger and NGS data in clinical research. BMC Bioinformatics 2016; 17:56. [PMID: 26830926 PMCID: PMC4735967 DOI: 10.1186/s12859-016-0915-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 10/06/2015] [Accepted: 01/28/2016] [Indexed: 01/07/2023] Open
Abstract
Background Traditional Sanger sequencing has been the gold standard method for single-gene genetic testing in the clinic, but it is cumbersome and expensive for testing several genes in heterogeneous diseases such as cancer. Next Generation Sequencing technologies, which produce data at unprecedented speed in a cost-effective manner, have overcome this limitation of Sanger sequencing. Therefore, for efficient and affordable genetic testing, Next Generation Sequencing has been used as a complementary method alongside Sanger sequencing for identifying and confirming disease-causing mutations in clinical research. However, identifying potential disease-causing mutations with high sensitivity and specificity requires high-quality sequencing data, and integrated software tools are lacking that can analyze Sanger and NGS data together, eliminate platform-specific sequencing errors and low-quality reads, and support the analysis of data sets from several samples or patients in a single run. Results We have developed ClinQC, a flexible and user-friendly pipeline for format conversion, quality control, trimming and filtering of raw sequencing data generated from Sanger sequencing and three NGS sequencing platforms: Illumina, 454 and Ion Torrent. First, ClinQC converts input read files from their native formats to a common FASTQ format and removes adapters and PCR primers. Next, it splits bar-coded samples, filters duplicates, contamination and low-quality sequences, and generates a QC report. ClinQC outputs high-quality reads in FASTQ format with Sanger quality encoding, which can be used directly in downstream analysis. It can analyze data from hundreds of samples or patients in a single run and generate unified output files for both Sanger and NGS sequencing data. Our tool is expected to be very useful for quality control and format conversion of Sanger and NGS data to facilitate improved downstream analysis and mutation screening. Conclusions ClinQC is a powerful and easy-to-handle pipeline for quality control and trimming in clinical research. ClinQC is written in Python with multiprocessing capability, runs on all major operating systems and is available at https://sourceforge.net/projects/clinqc. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0915-y) contains supplementary material, which is available to authorized users.
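The two operations the abstract names, re-encoding qualities to Sanger (Phred+33) encoding and trimming low-quality sequence, can be illustrated with a minimal sketch. This is not ClinQC's code: the function names, the Phred+64 source offset (used by older Illumina files) and the threshold of 20 are assumptions chosen for illustration.

```python
def to_sanger_quality(quality: str, source_offset: int = 64) -> str:
    """Re-encode a FASTQ quality string to Sanger (Phred+33) encoding.

    source_offset=64 is an assumed input encoding (older Illumina files).
    """
    scores = [ord(ch) - source_offset for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("quality string is not valid for the given offset")
    return "".join(chr(score + 33) for score in scores)


def trim_low_quality_tail(sequence: str, quality: str, min_phred: int = 20):
    """Drop 3' bases until the last remaining base has Phred >= min_phred.

    Expects the quality string to be Phred+33 encoded already.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - 33 < min_phred:
        end -= 1
    return sequence[:end], quality[:end]
```

Real trimmers typically use sliding windows or running-sum algorithms rather than this simple tail scan; the scan is used here only because it is easy to verify by hand.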
Affiliation(s)
- Ram Vinay Pandey
  - Health & Environment Department, Molecular Diagnostics, AIT Austrian Institute of Technology GmbH, Vienna, Austria.
  - Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, A-1210, Vienna, Austria.
- Stephan Pabinger
  - Health & Environment Department, Molecular Diagnostics, AIT Austrian Institute of Technology GmbH, Vienna, Austria.
- Albert Kriegner
  - Health & Environment Department, Molecular Diagnostics, AIT Austrian Institute of Technology GmbH, Vienna, Austria.
- Andreas Weinhäusel
  - Health & Environment Department, Molecular Diagnostics, AIT Austrian Institute of Technology GmbH, Vienna, Austria.
13
Souilmi Y, Lancaster AK, Jung JY, Rizzo E, Hawkins JB, Powles R, Amzazi S, Ghazal H, Tonellato PJ, Wall DP. Scalable and cost-effective NGS genotyping in the cloud. BMC Med Genomics 2015; 8:64. [PMID: 26470712 PMCID: PMC4608296 DOI: 10.1186/s12920-015-0134-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 04/16/2015] [Accepted: 09/11/2015] [Indexed: 12/20/2022] Open
Abstract
Background While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the 10’s of dollars. Results We take a step towards addressing this challenge, by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Conclusions Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration. Electronic supplementary material The online version of this article (doi:10.1186/s12920-015-0134-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yassine Souilmi
  - Department of Biomedical Informatics, Harvard Medical School 10 Shattuck Street, Boston, MA, 02115, USA.
  - Department of Biology, Mohamed Vth University, 4 Ibn Battouta Avenue, B.P: 1014RP, Rabat, Morocco.
- Alex K Lancaster
  - Department of Biomedical Informatics, Harvard Medical School 10 Shattuck Street, Boston, MA, 02115, USA.
  - Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, 02215, USA.
- Jae-Yoon Jung
  - Department of Biomedical Informatics, Harvard Medical School 10 Shattuck Street, Boston, MA, 02115, USA.
- Ettore Rizzo
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, via Ferrata 1, Pavia, 27100, Italy.
- Jared B Hawkins
  - Department of Biomedical Informatics, Harvard Medical School 10 Shattuck Street, Boston, MA, 02115, USA.
- Ryan Powles
  - Department of Biomedical Informatics, Harvard Medical School 10 Shattuck Street, Boston, MA, 02115, USA.
- Saaïd Amzazi
  - Department of Biology, Mohamed Vth University, 4 Ibn Battouta Avenue, B.P: 1014RP, Rabat, Morocco.
- Hassan Ghazal
  - Department of Biology, Mohamed First University, Oujda, Nador, Morocco.
- Peter J Tonellato
  - Department of Biomedical Informatics, Harvard Medical School 10 Shattuck Street, Boston, MA, 02115, USA.
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02215, USA.
- Dennis P Wall
  - Department of Pediatrics and Psychiatry (by courtesy), Division of Systems Medicine & Program in Biomedical Informatics, Stanford University, Stanford, CA, 94305, USA.
14
Pranckevičiene E, Rančelis T, Pranculis A, Kučinskas V. Challenges in exome analysis by LifeScope and its alternative computational pipelines. BMC Res Notes 2015; 8:421. [PMID: 26346699 PMCID: PMC4562342 DOI: 10.1186/s13104-015-1385-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/24/2015] [Accepted: 08/24/2015] [Indexed: 12/22/2022] Open
Abstract
Background Every next generation sequencing (NGS) platform relies on proprietary and open source computational tools to analyze sequencing data. NGS tools for Illumina platforms are well documented which is not the case with AB SOLiD systems. We applied several computational and variant calling pipelines to analyse targeted exome sequencing data obtained using AB SOLiD 5500 system. Our investigated tools comprised proprietary LifeScope’s pipeline in combination with open source color-space competent mapping programs and a variant caller. We present instrumental details of the pipelines that were used and quantitative comparative analysis of variant lists generated by LifeScope’s pipeline versus open source tools. Results Sufficient coverage of targeted regions was achieved by all investigated pipelines. High variability was observed in identities of variants across the mapping programs. We observed less than 50 % concordance of variant lists produced by approaches based on different mapping algorithms. We summarized different approaches with regards to coverage (DP) and quality (QUAL) properties of the variants provided by GATK and found that LifeScope’s computational pipeline is superior. Fusion of information on mapping profiles (pileup) at genomic positions of variants in several different alignments proved to be a useful strategy to assess questionable singleton variants. Conclusions We quantitatively supported a conclusion that Lifescope’s pipeline is superior for processing sequencing data obtained by AB SOLiD 5500 system. Nevertheless the use of alternative pipelines is encouraged because aggregation of information from other mapping and variant calling approaches helps to resolve questionable calls and increases the confidence of the call. It was noted that a coverage threshold for variant to be considered for further analysis has to be chosen in data-driven way to prevent a loss of important information.
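A concordance figure such as the "less than 50 %" reported above depends on how concordance is defined, which the abstract does not state. The sketch below assumes one common choice, the Jaccard index over variant keys (shared calls divided by all distinct calls), and also lists the singleton calls that the authors suggest checking against pileups from other alignments. The function names and the `(chrom, pos, ref, alt)` key are hypothetical.

```python
def concordance(calls_a, calls_b) -> float:
    """Jaccard concordance of two variant lists (an assumed definition).

    Each call is a hashable key such as (chrom, pos, ref, alt).
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def singletons(calls_a, calls_b):
    """Variants reported by exactly one of the two pipelines."""
    return set(calls_a) ^ set(calls_b)
```

With three calls per pipeline and one shared call, the union holds five distinct variants, so the concordance is 1/5 = 0.2 and the remaining four calls are singletons.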
Affiliation(s)
- Erinija Pranckevičiene
  - Department of Human and Medical Genetics, Faculty of Medicine, Vilnius University, Santariskiu str. 2, LT-08661, Vilnius, Lithuania.
- Tautvydas Rančelis
  - Department of Human and Medical Genetics, Faculty of Medicine, Vilnius University, Santariskiu str. 2, LT-08661, Vilnius, Lithuania.
- Aidas Pranculis
  - Department of Human and Medical Genetics, Faculty of Medicine, Vilnius University, Santariskiu str. 2, LT-08661, Vilnius, Lithuania.
- Vaidutis Kučinskas
  - Department of Human and Medical Genetics, Faculty of Medicine, Vilnius University, Santariskiu str. 2, LT-08661, Vilnius, Lithuania.
15
Bao R, Hernandez K, Huang L, Kang W, Bartom E, Onel K, Volchenboum S, Andrade J. ExScalibur: A High-Performance Cloud-Enabled Suite for Whole Exome Germline and Somatic Mutation Identification. PLoS One 2015; 10:e0135800. [PMID: 26271043 PMCID: PMC4535852 DOI: 10.1371/journal.pone.0135800] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 04/02/2015] [Accepted: 07/27/2015] [Indexed: 12/30/2022] Open
Abstract
Whole exome sequencing has facilitated the discovery of causal genetic variants associated with human diseases at deep coverage and low cost. In particular, the detection of somatic mutations from tumor/normal pairs has provided insights into the cancer genome. Although there is an abundance of publicly-available software for the detection of germline and somatic variants, concordance is generally limited among variant callers and alignment algorithms. Successful integration of variants detected by multiple methods requires in-depth knowledge of the software, access to high-performance computing resources, and advanced programming techniques. We present ExScalibur, a set of fully automated, highly scalable and modulated pipelines for whole exome data analysis. The suite integrates multiple alignment and variant calling algorithms for the accurate detection of germline and somatic mutations with close to 99% sensitivity and specificity. ExScalibur implements streamlined execution of analytical modules, real-time monitoring of pipeline progress, robust handling of errors and intuitive documentation that allows for increased reproducibility and sharing of results and workflows. It runs on local computers, high-performance computing clusters and cloud environments. In addition, we provide a data analysis report utility to facilitate visualization of the results that offers interactive exploration of quality control files, read alignment and variant calls, assisting downstream customization of potential disease-causing mutations. ExScalibur is open-source and is also available as a public image on Amazon cloud.
Affiliation(s)
- Riyue Bao
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Kyle Hernandez
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Lei Huang
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Wenjun Kang
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Elizabeth Bartom
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Kenan Onel
  - Department of Pediatrics, The University of Chicago, Chicago, Illinois, United States of America
- Samuel Volchenboum
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
  - Department of Pediatrics, The University of Chicago, Chicago, Illinois, United States of America
  - Computation Institute, The University of Chicago, Chicago, Illinois, United States of America
  - * E-mail: (JA); (SV)
- Jorge Andrade
  - Center for Research Informatics, The University of Chicago, Chicago, Illinois, United States of America
  - * E-mail: (JA); (SV)
16
Varghese B, Patel I, Barker A. RBioCloud: A Light-Weight Framework for Bioconductor and R-based Jobs on the Cloud. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:871-878. [PMID: 26357328 DOI: 10.1109/tcbb.2014.2361327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Large-scale ad hoc analytics of genomic data is popular using the R programming language, supported by over 700 software packages provided by Bioconductor. More recently, analytical jobs have been benefitting from on-demand computing and storage, their scalability and their low maintenance cost, all of which are offered by the cloud. While biologists and bioinformaticists can take an analytical job and execute it on their personal workstations, it remains challenging to seamlessly execute the job on the cloud infrastructure without extensive knowledge of the cloud dashboard. This paper explores how analytical jobs can be executed on the cloud with minimum effort, and how both the resources and the data required by a job can be managed. An open-source light-weight framework for executing R scripts using Bioconductor packages, referred to as 'RBioCloud', is designed and developed. RBioCloud offers a set of simple command-line tools for managing the cloud resources, the data and the execution of the job. Three biological test cases validate the feasibility of RBioCloud. The framework is available from http://www.rbiocloud.com.
17
Leveraging the power of high performance computing for next generation sequencing data analysis: tricks and twists from a high throughput exome workflow. PLoS One 2015; 10:e0126321. [PMID: 25942438 PMCID: PMC4420499 DOI: 10.1371/journal.pone.0126321] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 11/07/2014] [Accepted: 03/31/2015] [Indexed: 12/26/2022] Open
Abstract
Next generation sequencing (NGS) has been a great success and is now a standard method of research in the life sciences. With this technology, dozens of whole genomes or hundreds of exomes can be sequenced in rather short time, producing huge amounts of data. Complex bioinformatics analyses are required to turn these data into scientific findings. In order to run these analyses fast, automated workflows implemented on high performance computers are state of the art. While providing sufficient compute power and storage to meet the NGS data challenge, high performance computing (HPC) systems require special care when utilized for high throughput processing. This is especially true if the HPC system is shared by different users. Here, stability, robustness and maintainability are as important for automated workflows as speed and throughput. To achieve all of these aims, dedicated solutions have to be developed. In this paper, we present the tricks and twists that we utilized in the implementation of our exome data processing workflow. It may serve as a guideline for other high throughput data analysis projects using a similar infrastructure. The code implementing our solutions is provided in the supporting information files.
18
19
Gao X, Xu J, Starmer J. Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses. BMC Res Notes 2015; 8:72. [PMID: 25889517 PMCID: PMC4376134 DOI: 10.1186/s13104-015-1027-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 07/09/2014] [Accepted: 02/23/2015] [Indexed: 12/26/2022] Open
Abstract
Background Whole-exome sequencing (WES) is a popular next-generation sequencing technology used by numerous laboratories with various levels of statistical and analytical expertise. Centralized databases, such as the Sequence Read Archive and the European Nucleotide Archive, allow data to be reanalyzed by independent labs to confirm results and derive additional insights. Access to new and shared data highlights the necessity for software that both lowers the statistical and analytical expertise required to generate results and promotes reproducible methodology among laboratories. Findings We have developed fastq2vcf, a pipeline that automates the genomic variant calling process using multiple callers. Fastq2vcf offers improved flexibility, efficiency, and reproducibility by seamlessly integrating several leading sequencing analysis tools. It outputs not only the annotated variant call set for each caller, but also the consensus variant call set shared by different callers. Furthermore, it can be customized and extended easily. Conclusions Our software tool automatically generates executable command lines for a variety of tools required for analyzing WES data. It is also highly configurable and provides users with complete control of the processing procedure, making it easy to submit and track jobs in both single workstation and parallelized computing environments. By using this pipeline, WES analysis can be easily reproduced.
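The "consensus variant call set shared by different callers" can be pictured as a set intersection over per-caller call sets. The sketch below is an illustration under assumptions, not fastq2vcf's implementation: the function name, the caller labels and the `(chrom, pos, ref, alt)` variant key are hypothetical, and the optional `min_callers` relaxation is an addition for illustration.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=None):
    """Return variants supported by at least min_callers callers.

    calls_by_caller maps a caller label to an iterable of variant keys,
    e.g. (chrom, pos, ref, alt). By default every caller must agree,
    which is a plain intersection of the call sets.
    """
    if min_callers is None:
        min_callers = len(calls_by_caller)
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(set(calls))  # count each caller at most once per variant
    return {variant for variant, n in support.items() if n >= min_callers}
```

Requiring all callers trades sensitivity for specificity; lowering `min_callers` keeps variants that a majority of callers report.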
Affiliation(s)
- Xiaoyi Gao
  - Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL, 60612, USA.
- Jianpeng Xu
  - Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL, 60612, USA.
- Joshua Starmer
  - Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC, 27599, USA.
  - Carolina Center for Genome Sciences, The University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
  - Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
20
Azam S, Rathore A, Shah TM, Telluri M, Amindala B, Ruperao P, Katta MAVSK, Varshney RK. An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data. PLoS One 2014; 9:e101754. [PMID: 25003610 PMCID: PMC4086967 DOI: 10.1371/journal.pone.0101754] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/06/2014] [Accepted: 06/11/2014] [Indexed: 12/30/2022] Open
Abstract
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which is daunting for biologists. Further, the SNP information generated may not be readily usable for downstream processes such as genotyping. Hence, a comprehensive pipeline called Integrated SNP Mining and Utilization (ISMU) has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface, for SNP discovery and utilization in developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for members of the plant genetics and breeding community with no computational expertise who want to discover SNPs and utilize them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge next generation sequencing datasets. It was developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as free standalone software.
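The abstract does not say how ISMU computes the polymorphism information content (PIC) value. As a reference point, the widely used definition attributed to Botstein and colleagues (1980) is PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i are the allele frequencies at the marker. The sketch below implements that formula only; the function name is hypothetical and nothing here is taken from the ISMU source.

```python
from itertools import combinations


def polymorphism_information_content(allele_freqs) -> float:
    """PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2.

    allele_freqs: frequencies of all alleles at one marker, summing to 1.
    """
    freqs = list(allele_freqs)
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    correction = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - correction
```

For a biallelic SNP the value peaks at 0.375 when both alleles have frequency 0.5, and it is 0 for a monomorphic site.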
Affiliation(s)
- Sarwar Azam
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- Abhishek Rathore
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- Trushar M. Shah
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- Mohan Telluri
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- BhanuPrakash Amindala
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- Pradeep Ruperao
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
  - School of Agriculture and Food Sciences, University of Queensland, Brisbane, Australia
- Mohan A. V. S. K. Katta
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- Rajeev K. Varshney
  - Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
  - * E-mail:
21
Lee IH, Lee K, Hsing M, Choe Y, Park JH, Kim SH, Bohn JM, Neu MB, Hwang KB, Green RC, Kohane IS, Kong SW. Prioritizing disease-linked variants, genes, and pathways with an interactive whole-genome analysis pipeline. Hum Mutat 2014; 35:537-47. [PMID: 24478219 DOI: 10.1002/humu.22520] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 10/04/2013] [Accepted: 01/23/2014] [Indexed: 01/02/2023]
Abstract
Whole-genome sequencing (WGS) studies are uncovering disease-associated variants in both rare and nonrare diseases. Utilizing next-generation sequencing for WGS requires a series of computational methods for alignment, variant detection, and annotation, and the accuracy and reproducibility of annotation results are essential for clinical implementation. However, annotating WGS with up-to-date genomic information is still challenging for biomedical researchers. Here, we present gNOME, one of the fastest and most scalable annotation, filtering, and analysis pipelines, to prioritize phenotype-associated variants while minimizing false-positive findings. The intuitive graphical user interface of gNOME facilitates the selection of phenotype-associated variants, and the result summaries are provided at variant, gene, and genome levels. Moreover, enrichment results for specific variants, genes, and gene sets, either between two groups or compared with the population-scale WGS datasets already integrated in the pipeline, can aid interpretation. We found a small number of discordant results between annotation software tools, in part due to different reporting strategies for variants with complex impacts. Using two published whole-exome datasets of uveal melanoma and bladder cancer, we demonstrated gNOME's accuracy of variant annotation and the enrichment of loss-of-function variants in known cancer pathways. The gNOME Web server and source code are freely available to the academic community (http://gnome.tchlab.org).
Affiliation(s)
- In-Hee Lee
  - Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, Department of Medicine, Boston Children's Hospital, Boston, Massachusetts, 02115
22
Dander A, Pabinger S, Sperk M, Fischer M, Stocker G, Trajanoski Z. SeqBench: integrated solution for the management and analysis of exome sequencing data. BMC Res Notes 2014; 7:43. [PMID: 24444368 PMCID: PMC3898724 DOI: 10.1186/1756-0500-7-43] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/10/2013] [Accepted: 01/14/2014] [Indexed: 11/21/2022] Open
Abstract
Background The rapid development of next generation sequencing technologies, including the recently introduced benchtop sequencers, has made sequencing affordable for smaller research institutions. A widely applied method to identify disease-causing mutations is exome sequencing, which has proved to be cost-effective and time-saving. Findings SeqBench, a web-based application, combines management and analysis of exome sequencing data into one solution. It provides a user-friendly data acquisition module to facilitate comprehensive and intuitive data handling. SeqBench provides direct access to the analysis pipeline SIMPLEX, which can be configured to run locally, on a cluster, or in the cloud. Identified genomic variants are presented along with several functional annotations and can be interpreted in a family context. Conclusions The web-based application SeqBench supports the management and analysis of exome sequencing data, is open-source and available at http://www.icbi.at/SeqBench.
Affiliation(s)
- Andreas Dander
  - Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria.
23
Karczewski KJ, Fernald GH, Martin AR, Snyder M, Tatonetti NP, Dudley JT. STORMSeq: an open-source, user-friendly pipeline for processing personal genomics data in the cloud. PLoS One 2014; 9:e84860. [PMID: 24454756 PMCID: PMC3893165 DOI: 10.1371/journal.pone.0084860] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 09/27/2013] [Accepted: 11/27/2013] [Indexed: 12/30/2022] Open
Abstract
The increasing public availability of personal complete genome sequencing data has ushered in an era of democratized genomics. However, read mapping and variant calling software is constantly improving and individuals with personal genomic data may prefer to customize and update their variant calls. Here, we describe STORMSeq (Scalable Tools for Open-Source Read Mapping), a graphical interface cloud computing solution that does not require a parallel computing environment or extensive technical experience. This customizable and modular system performs read mapping, read cleaning, and variant calling and annotation. At present, STORMSeq costs approximately $2 and 5–10 hours to process a full exome sequence and $30 and 3–8 days to process a whole genome sequence. We provide this open-access and open-source resource as a user-friendly interface in Amazon EC2.
Affiliation(s)
- Konrad J. Karczewski
- Biomedical Informatics Training Program, Stanford University School of Medicine, Stanford, California, United States of America
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- * E-mail: (KJK); (JTD)
- Guy Haskin Fernald
- Biomedical Informatics Training Program, Stanford University School of Medicine, Stanford, California, United States of America
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Alicia R. Martin
- Biomedical Informatics Training Program, Stanford University School of Medicine, Stanford, California, United States of America
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Michael Snyder
- Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Nicholas P. Tatonetti
- Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
- Joel T. Dudley
- Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York, United States of America
- * E-mail: (KJK); (JTD)
|
24
|
Dorff KC, Chambwe N, Zeno Z, Simi M, Shaknovich R, Campagne F. GobyWeb: simplified management and analysis of gene expression and DNA methylation sequencing data. PLoS One 2013; 8:e69666. [PMID: 23936070 PMCID: PMC3720652 DOI: 10.1371/journal.pone.0069666] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 03/05/2013] [Accepted: 06/11/2013] [Indexed: 01/04/2023] Open
Abstract
We present GobyWeb, a web-based system that facilitates the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced bisulfite sequencing and whole genome methyl-seq), or the detection of pathogens in sequenced data. In contrast to previous analysis pipelines developed for analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. We conducted performance evaluations of the software and found it to either outperform or have similar performance to analysis programs developed for specialized analyses of HTS data. We found that most biologists who took a one-hour GobyWeb training session were readily able to analyze RNA-Seq data with state of the art analysis tools. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use. GobyWeb plugins are distributed in source code and licensed under the open source LGPL3 license to facilitate code inspection, reuse and independent extensions http://github.com/CampagneLaboratory/gobyweb2-plugins.
Affiliation(s)
- Kevin C. Dorff
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, The Weill Cornell Medical College, New York, New York, United States of America
- Nyasha Chambwe
- Department of Physiology and Biophysics, The Weill Cornell Medical College, New York, New York, United States of America
- Tri-Institutional Training Program in Computational Biology and Medicine, The Weill Cornell Medical College, New York, New York, United States of America
- Zachary Zeno
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, The Weill Cornell Medical College, New York, New York, United States of America
- Manuele Simi
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, The Weill Cornell Medical College, New York, New York, United States of America
- Rita Shaknovich
- Department of Pathology and Department of Medicine, The Weill Cornell Medical College, New York, New York, United States of America
- Fabien Campagne
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, The Weill Cornell Medical College, New York, New York, United States of America
- Department of Physiology and Biophysics, The Weill Cornell Medical College, New York, New York, United States of America
|
25
|
D'Antonio M, D'Onorio De Meo P, Paoletti D, Elmi B, Pallocca M, Sanna N, Picardi E, Pesole G, Castrignanò T. WEP: a high-performance analysis pipeline for whole-exome data. BMC Bioinformatics 2013; 14 Suppl 7:S11. [PMID: 23815231 PMCID: PMC3633005 DOI: 10.1186/1471-2105-14-s7-s11] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 11/24/2022] Open
Abstract
Background: The advent of massively parallel sequencing technologies (Next Generation Sequencing, NGS) profoundly modified the landscape of human genetics. In particular, Whole Exome Sequencing (WES) is the NGS branch that focuses on the exonic regions of eukaryotic genomes; exomes are ideal for helping us understand high-penetrance allelic variation and its relationship to phenotype. A complete WES analysis involves several steps which need to be suitably designed and arranged into an efficient pipeline. Managing an NGS analysis pipeline and the huge amount of data it produces requires non-trivial IT skills and computational power. Results: Our web resource WEP (Whole-Exome sequencing Pipeline web tool) performs a complete WES pipeline and provides easy access through an interface to intermediate and final results. The WEP pipeline is composed of several steps: 1) verification of input integrity and quality checks, read trimming and filtering; 2) gapped alignment; 3) BAM conversion, sorting and indexing; 4) duplicates removal; 5) alignment optimization around insertion/deletion (indel) positions; 6) recalibration of quality scores; 7) single nucleotide and deletion/insertion polymorphism (SNP and DIP) variant calling; 8) variant annotation; 9) result storage into custom databases to allow cross-linking and intersections, statistics and much more. In order to overcome the challenge of managing large amounts of data and maximize the biological information extracted from them, our tool restricts the number of final results by filtering data with customizable thresholds, facilitating the identification of functionally significant variants. Default threshold values are also provided at the analysis computation completion, tuned with the most common literature work published in recent years. Conclusions: Through our tool a user can perform the whole analysis without knowing the underlying hardware and software architecture, dealing with both paired and single end data. The interface provides easy and intuitive access for data submission and a user-friendly web interface for annotated variant visualization. Users without IT expertise can access, through WEP, the most up-to-date and tested WES algorithms, tuned to maximize the quality of called variants while minimizing artifacts and false positives. The web tool is available at the following web address: http://www.caspur.it/wep
Affiliation(s)
- Mattia D'Antonio
- Dipartimento di Bioscienze, Biotecnologie e Scienze Farmacologiche, Università degli Studi di Bari, Bari, Italy
|
26
|
Hong H, Zhang W, Shen J, Su Z, Ning B, Han T, Perkins R, Shi L, Tong W. Critical role of bioinformatics in translating huge amounts of next-generation sequencing data into personalized medicine. SCIENCE CHINA-LIFE SCIENCES 2013; 56:110-8. [PMID: 23393026 DOI: 10.1007/s11427-013-4439-7] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 10/08/2012] [Accepted: 11/29/2012] [Indexed: 01/12/2023]
Abstract
Realizing personalized medicine requires integrating diverse data types with bioinformatics. The most vital data are genomic information for individuals that are from advanced next-generation sequencing (NGS) technologies at present. The technologies continue to advance in terms of both decreasing cost and sequencing speed with concomitant increase in the amount and complexity of the data. The prodigious data together with the requisite computational pipelines for data analysis and interpretation are stressors to IT infrastructure and the scientists conducting the work alike. Bioinformatics is increasingly becoming the rate-limiting step with numerous challenges to be overcome for translating NGS data for personalized medicine. We review some key bioinformatics tasks, issues, and challenges in contexts of IT requirements, data quality, analysis tools and pipelines, and validation of biomarkers.
Affiliation(s)
- Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA.
|
27
|
Impacts of massively parallel sequencing for genetic diagnosis of neuromuscular disorders. Acta Neuropathol 2013; 125:173-85. [PMID: 23224362 DOI: 10.1007/s00401-012-1072-7] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 10/08/2012] [Revised: 11/27/2012] [Accepted: 11/28/2012] [Indexed: 12/11/2022]
Abstract
Neuromuscular disorders (NMD) such as neuropathy or myopathy are rare and often severe inherited disorders, affecting muscle and/or nerves with neonatal, childhood or adulthood onset, with considerable burden for the patients, their families and public health systems. Genetic and clinical heterogeneity, unspecific clinical features, unidentified genes and the implication of large and/or several genes requiring complementary methods are the main drawbacks in routine molecular diagnosis, leading to increased turnaround time and delay in the molecular validation of the diagnosis. The application of massively parallel sequencing, also called next generation sequencing, as a routine diagnostic strategy could lead to rapid screening and fast identification of mutations in rare genetic disorders like NMD. This review aims to summarize and discuss recent advances in the genetic diagnosis of neuromuscular disorders, and more generally monogenic diseases, fostered by massively parallel sequencing. We recall the challenges and benefits of obtaining an accurate genetic diagnosis, introduce the massively parallel sequencing technology and its novel applications in diagnosis of patients, prenatal diagnosis and carrier detection, and discuss the limitations and necessary improvements. Massively parallel sequencing synergizes with clinical and pathological investigations into an integrated diagnostic approach. Clinicians and pathologists are crucial in patient selection and interpretation of data, and persons trained in data management and analysis need to be integrated into the diagnostic pipeline. Massively parallel sequencing for mutation identification is expected to greatly improve diagnosis, genetic counseling and patient management.
|
28
|
Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2013; 15:256-78. [PMID: 23341494 PMCID: PMC3956068 DOI: 10.1093/bib/bbs086] [Citation(s) in RCA: 335] [Impact Index Per Article: 30.5] [Indexed: 01/11/2023] Open
Abstract
Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.
Affiliation(s)
- Stephan Pabinger
- Division for Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria. Tel.: +43-512-9003-71401; Fax: +43-512-9003-73100
|
29
|
Charoentong P, Angelova M, Efremova M, Gallasch R, Hackl H, Galon J, Trajanoski Z. Bioinformatics for cancer immunology and immunotherapy. Cancer Immunol Immunother 2012; 61:1885-903. [PMID: 22986455 PMCID: PMC3493665 DOI: 10.1007/s00262-012-1354-x] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 07/27/2012] [Accepted: 09/04/2012] [Indexed: 01/24/2023]
Abstract
Recent mechanistic insights obtained from preclinical studies and the approval of the first immunotherapies have motivated an increasing number of academic investigators and pharmaceutical/biotech companies to further elucidate the role of immunity in tumor pathogenesis and to reconsider the role of immunotherapy. Additionally, technological advances (e.g., next-generation sequencing) are providing unprecedented opportunities to draw a comprehensive picture of the tumor genomics landscape and ultimately enable individualized treatment. However, the increasing complexity of the generated data and the plethora of bioinformatics methods and tools pose considerable challenges to both tumor immunologists and clinical oncologists. In this review, we describe current concepts and future challenges for the management and analysis of data for cancer immunology and immunotherapy. We first highlight publicly available databases with specific focus on cancer immunology including databases for somatic mutations and epitope databases. We then give an overview of the bioinformatics methods for the analysis of next-generation sequencing data (whole-genome and exome sequencing), epitope prediction tools as well as methods for integrative data analysis and network modeling. Mathematical models are powerful tools that can predict and explain important patterns in the genetic and clinical progression of cancer. Therefore, a survey of mathematical models for tumor evolution and tumor-immune cell interaction is included. Finally, we discuss future challenges for individualized immunotherapy and suggest how combined computational/experimental approaches can lead to new insights into the molecular mechanisms of cancer, improved diagnosis and prognosis of the disease, and pinpoint novel therapeutic targets.
Affiliation(s)
- Pornpimol Charoentong
- Biocenter, Division of Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria
- Mihaela Angelova
- Biocenter, Division of Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria
- Mirjana Efremova
- Biocenter, Division of Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria
- Ralf Gallasch
- Biocenter, Division of Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria
- Hubert Hackl
- Biocenter, Division of Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria
- Jerome Galon
- INSERM U872, Integrative Cancer Immunology Laboratory, Paris, France
- Zlatko Trajanoski
- Biocenter, Division of Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria
|