51
Schönherr S, Forer L, Weißensteiner H, Kronenberg F, Specht G, Kloss-Brandstätter A. Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds. BMC Bioinformatics 2012; 13:200. PMID: 22888776; PMCID: PMC3532373; DOI: 10.1186/1471-2105-13-200.
Abstract
Background: The MapReduce framework enables scalable processing and analysis of large datasets by distributing the computational load across connected computer nodes, referred to as a cluster. In bioinformatics, MapReduce has already been adopted for various scenarios, such as mapping next-generation sequencing data to a reference genome, finding SNPs from short-read data, and matching strings in genotype files. Nevertheless, tasks such as installing and maintaining MapReduce on a cluster, importing data into its distributed file system, and executing MapReduce programs require advanced knowledge of computer science and can thus prevent scientists from using currently available and useful software.

Results: Here we present Cloudgene, a freely available platform that improves the usability of MapReduce programs in bioinformatics by providing a graphical user interface for execution, data import and export, and reproducible workflows on in-house (private cloud) and rented (public cloud) clusters. The aim of Cloudgene is to provide a standardized graphical execution environment for current and future MapReduce programs, all of which can be integrated through its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all times and data-transfer times are minimized.

Conclusions: Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding computational overhead to existing programs. The platform lets developers focus on the actual implementation task and hides the complexity of MapReduce from scientists. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g., Cloud BioLinux, RStudio) in public clouds. Currently, five bioinformatics programs using MapReduce and two systems are integrated and have been successfully deployed. Cloudgene is freely available at http://cloudgene.uibk.ac.at.
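The map, shuffle and reduce phases the abstract alludes to can be illustrated with a toy example. The sketch below is not Cloudgene or Hadoop code; it emulates the three phases in plain Python for a hypothetical SNP-screening task over invented aligned-read data (positions, bases and function names are all illustrative).

```python
from collections import defaultdict

# Toy aligned reads: (position, base) pairs, as a mapper might receive them.
reads = [
    ("chr1:100", "A"), ("chr1:100", "A"), ("chr1:100", "G"),
    ("chr1:200", "T"), ("chr1:200", "T"),
]

def map_phase(records):
    # Mapper: emit a ((position, base), 1) pair for every observation.
    for pos, base in records:
        yield (pos, base), 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each (position, base) key.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(reads)))

# A site is a candidate SNP if more than one base is observed at a position.
bases_at = defaultdict(set)
for pos, base in counts:
    bases_at[pos].add(base)
candidate_snps = [pos for pos, bases in bases_at.items() if len(bases) > 1]
```

In a real MapReduce deployment the shuffle is performed by the framework across cluster nodes; the point here is only the shape of the computation.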
Affiliation(s)
- Sebastian Schönherr
- Division of Genetic Epidemiology; Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
52
Wommack KE, Bhavsar J, Polson SW, Chen J, Dumas M, Srinivasiah S, Furman M, Jamindar S, Nasko DJ. VIROME: a standard operating procedure for analysis of viral metagenome sequences. Stand Genomic Sci 2012; 6:427-39. PMID: 23407591; PMCID: PMC3558967; DOI: 10.4056/sigs.2945050.
Abstract
One consistent finding among studies using shotgun metagenomics to analyze whole viral communities is that most viral sequences show no significant homology to known sequences. Thus, bioinformatic analyses based on sequence collections such as GenBank nr, which largely comprise sequences from known organisms, tend to ignore the majority of sequences in most shotgun viral metagenome libraries. Here we describe a bioinformatic pipeline, the Viral Informatics Resource for Metagenome Exploration (VIROME), that emphasizes classification of viral metagenome sequences (predicted open reading frames) based on homology-search results against both known and environmental sequences. Functional and taxonomic information is derived from five annotated sequence databases linked to the UniRef100 database. Environmental classifications are obtained from hits against a custom database, MetaGenomes On-Line, which contains 49 million predicted environmental peptides. Each predicted viral metagenomic ORF run through the VIROME pipeline is placed into one of seven ORF classes; thus, every sequence receives a meaningful annotation. Additionally, the pipeline includes quality-control measures to remove contaminating and poor-quality sequences, and it assesses the potential amount of cellular DNA contamination in a viral metagenome library by screening for rRNA genes. Access to the VIROME pipeline and analysis results is provided through a web-application interface dynamically linked to a relational back-end database. The interface is designed to give users flexibility in retrieving sequences (reads, ORFs, predicted peptides) and search results for focused secondary analyses.
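The core idea, that every ORF receives a class even without a database hit, can be sketched as a decision function. This is a simplified, hypothetical scheme, not VIROME's actual seven-class logic; the class names and arguments are invented for illustration.

```python
def classify_orf(known_hit: bool, env_hit: bool, is_rrna: bool = False) -> str:
    """Assign a predicted ORF to a coarse class from its search results.

    Hypothetical scheme inspired by the VIROME principle that every
    sequence receives an annotation; the real pipeline uses seven classes.
    """
    if is_rrna:
        # rRNA hits flag possible cellular (non-viral) DNA contamination.
        return "possible cellular contamination"
    if known_hit and env_hit:
        return "known and environmental hit"
    if known_hit:
        return "known-only hit"
    if env_hit:
        # Homologous only to uncharacterized environmental peptides.
        return "environmental-only hit"
    # No significant homology anywhere: still a meaningful label.
    return "ORFan (no significant homology)"
```

The key design point mirrored here is that the fall-through case is itself an informative class rather than a discarded sequence.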
Affiliation(s)
- K Eric Wommack
- Delaware Biotechnology Institute, University of Delaware, Newark, DE 19711
53
Rodrigues MR, Magalhães WCS, Machado M, Tarazona-Santos E. A graph-based approach for designing extensible pipelines. BMC Bioinformatics 2012; 13:163. PMID: 22788675; PMCID: PMC3496580; DOI: 10.1186/1471-2105-13-163.
Abstract
Background: In bioinformatics, it is important to build extensible, low-maintenance systems that can deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines hardcodes the execution steps into programs or scripts. This approach causes problems as a pipeline grows, because incorporating new tools is often error-prone and time-consuming. Current approaches to pipeline development, such as workflow management systems, focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism in pipeline composition is necessary when each execution requires a different combination of steps.

Results: We propose a graph-based approach for implementing extensible, low-maintenance pipelines that is suitable for applications with multiple functionalities requiring different combinations of steps in each execution. Here, pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the edges, their inputs and outputs are the nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline-system algorithm. We demonstrate the applicability of our approach by implementing a format-conversion pipeline for population genetics and genetic epidemiology, but the approach is also helpful in other fields where multiple software tools are needed for comprehensive analyses, such as gene expression and proteomics. The project code, documentation and Java executables are available under an open-source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows.

Conclusions: Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also supports extensible, low-maintenance pipelines and contributes to openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited to applications with sequential execution steps and combined functionalities. In the format-conversion application, automatically combining conversion tools increased both the number of conversions available to the user and the extensibility of the system to future file formats.
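The graph model described in the abstract (tools as edges, formats as nodes, pipelines as paths) reduces pipeline composition to path search. The sketch below is an illustrative reconstruction, not the paper's Java implementation; the format names and converter tools are invented, and a breadth-first search stands in for the authors' algorithm.

```python
from collections import deque

# Hypothetical converters: each tool is an edge from an input format
# (node) to an output format (node); a pipeline is a path of edges.
converters = {
    ("vcf", "ped"): "vcf2ped",
    ("ped", "bed"): "ped2bed",
    ("vcf", "fasta"): "vcf2fasta",
}

def compose_pipeline(src, dst):
    """Breadth-first search for the shortest chain of converters src -> dst."""
    adjacency = {}
    for (a, b), tool in converters.items():
        adjacency.setdefault(a, []).append((b, tool))
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        fmt, path = queue.popleft()
        if fmt == dst:
            return path          # ordered list of tools to run
        for nxt, tool in adjacency.get(fmt, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [tool]))
    return None                  # no pipeline exists for this conversion

# compose_pipeline("vcf", "bed") -> ["vcf2ped", "ped2bed"]
```

Adding a new tool is just adding one edge to the dictionary; every path through it becomes available without hardcoding any new pipeline, which is the extensibility argument the paper makes.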
Affiliation(s)
- Maíra R Rodrigues
- Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil.
54
Goll J, Thiagarajan M, Abubucker S, Huttenhower C, Yooseph S, Methé BA. A case study for large-scale human microbiome analysis using JCVI's metagenomics reports (METAREP). PLoS One 2012; 7:e29044. PMID: 22719821; PMCID: PMC3374610; DOI: 10.1371/journal.pone.0029044.
Abstract
As metagenomic studies continue to increase in number, sequence volume and complexity, the scalability of biological analysis frameworks has become a rate-limiting factor for meaningful data interpretation. To address this issue, we developed JCVI Metagenomics Reports (METAREP), an open-source tool to query, browse, and compare extremely large volumes of metagenomic annotations. Here we present improvements to this software, including dynamic weighting of taxonomic and functional annotations, support for distributed searches, advanced clustering routines, and integration of additional annotation input formats. The utility of these improvements is demonstrated through the application of multiple comparative analysis strategies to shotgun metagenomic data produced by the National Institutes of Health Roadmap for Biomedical Research Human Microbiome Project (HMP) (http://nihroadmap.nih.gov). Specifically, the scalability of the dynamic weighting feature is established by applying it to over 400 million weighted gene annotations derived from 14 billion short reads, as predicted by the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline. Further, METAREP's capacity to identify and simultaneously compare taxonomic and functional annotations, including biological pathway and individual enzyme abundances across hundreds of community samples, is demonstrated through scenarios describing how these data can be mined to answer biological questions about the human microbiome. These strategies give users a reference for conducting similar large-scale metagenomic analyses with METAREP on their own sequence data, and here they reveal insights into the nature and extent of variation in taxonomic and functional profiles across body habitats and individuals. Over one thousand HMP WGS datasets and the latest open-source code are available at http://www.jcvi.org/hmp-metarep.
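Weighted annotation, as opposed to raw hit counting, can be sketched in a few lines. This is an illustrative reconstruction only: the taxa, pathways and weight values below are invented, and METAREP's actual weighting scheme is not reproduced here.

```python
from collections import Counter

# Hypothetical per-read annotations, each carrying a relative weight
# (e.g. reflecting hit quality or abundance); names are invented.
annotations = [
    {"taxon": "Bacteroides", "pathway": "glycolysis", "weight": 0.8},
    {"taxon": "Bacteroides", "pathway": "glycolysis", "weight": 0.5},
    {"taxon": "E. coli",     "pathway": "TCA cycle",  "weight": 1.0},
]

def weighted_profile(records, key):
    """Sum annotation weights per category instead of counting raw hits."""
    profile = Counter()
    for rec in records:
        profile[rec[key]] += rec["weight"]
    return profile

taxa = weighted_profile(annotations, "taxon")        # taxonomic profile
pathways = weighted_profile(annotations, "pathway")  # functional profile
```

The same aggregation runs over either the taxonomic or the functional field, which is the sense in which one weighting mechanism supports both kinds of comparative profile.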
Affiliation(s)
- Johannes Goll
- The J. Craig Venter Institute, Rockville, Maryland, United States of America
- Sahar Abubucker
- The Genome Institute, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Curtis Huttenhower
- Harvard School of Public Health, Boston, Massachusetts, United States of America
- Shibu Yooseph
- The J. Craig Venter Institute, San Diego, California, United States of America
- Barbara A. Methé
- The J. Craig Venter Institute, Rockville, Maryland, United States of America
55
Lord E, Leclercq M, Boc A, Diallo AB, Makarenkov V. Armadillo 1.1: an original workflow platform for designing and conducting phylogenetic analysis and simulations. PLoS One 2012; 7:e29903. PMID: 22253821; PMCID: PMC3256230; DOI: 10.1371/journal.pone.0029903.
Abstract
In this paper we introduce Armadillo v1.1, a novel workflow platform dedicated to designing and conducting phylogenetic studies, including comprehensive simulations. A number of important phylogenetic and general bioinformatics tools are included in the first software release. As Armadillo is an open-source project, scientists can develop their own modules as well as integrate existing computer applications. Using our workflow platform, complex phylogenetic tasks can be modeled and presented in a single workflow without any prior knowledge of programming techniques. The first version of Armadillo was successfully used by professors of bioinformatics at Université du Québec à Montréal during graduate computational biology courses taught in 2010-11. The program and its source code are freely available at http://www.bioinfo.uqam.ca/armadillo.
Affiliation(s)
- Etienne Lord
- Département d'informatique, Université du Québec à Montréal, Montréal, Canada
- Mickael Leclercq
- Département d'informatique, Université du Québec à Montréal, Montréal, Canada
- Alix Boc
- Département de sciences biologiques, Université de Montréal, Montréal, Canada
- Vladimir Makarenkov
- Département d'informatique, Université du Québec à Montréal, Montréal, Canada
56
Abstract
Annotation of prokaryotic sequences can be separated into structural and functional annotation. Structural annotation depends on algorithmic interrogation of experimental evidence to discover the physical characteristics of a gene, in an effort to construct accurate gene models so that understanding the function or evolution of genes among organisms is not impeded. Functional annotation depends on sequence similarity to other known genes or proteins to assess the function of a gene. Combining structural and functional annotation across genomes in a comparative manner yields more accurate annotation as well as an advanced understanding of genome evolution. As the availability of bacterial sequences increases and annotation methods improve, the value of comparative annotation will increase.
Affiliation(s)
- Nicholas Beckloff
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
57
Hu B, Xie G, Lo CC, Starkenburg SR, Chain PSG. Pathogen comparative genomics in the next-generation sequencing era: genome alignments, pangenomics and metagenomics. Brief Funct Genomics 2011; 10:322-33. DOI: 10.1093/bfgp/elr042.
58
Scholz MB, Lo CC, Chain PSG. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr Opin Biotechnol 2011; 23:9-15. PMID: 22154470; DOI: 10.1016/j.copbio.2011.11.013.
Abstract
Recent technological advances in next-generation sequencing have brought the field closer to the goal of reconstructing all genomes within a community by delivering high-throughput sequencing at much lower cost. While these technologies have allowed a massive increase in available raw sequence data, a number of new informatics challenges and difficulties must be addressed to improve the current state of metagenomics and fulfill its promise.
Affiliation(s)
- Matthew B Scholz
- Genome Science Group, Los Alamos National Laboratory, Los Alamos, NM 87545, United States
59
Riley DR, Angiuoli SV, Crabtree J, Dunning Hotopp JC, Tettelin H. Using Sybil for interactive comparative genomics of microbes on the web. Bioinformatics 2011; 28:160-6. PMID: 22121156; PMCID: PMC3259440; DOI: 10.1093/bioinformatics/btr652.
Abstract
Motivation: Analysis of multiple genomes requires sophisticated tools that provide search, visualization, interactivity and data export. Comparative genomics datasets tend to be large and complex, making development of these tools difficult. In addition to scalability, comparative genomics tools must also provide user-friendly interfaces so that research scientists can explore complex data with minimal technical expertise. Results: We describe a new version of the Sybil software package and its application to the important human pathogen Streptococcus pneumoniae. The new software provides a feature-rich set of comparative genomics tools for inspecting multiple genome structures, mining orthologous gene families and identifying potential vaccine candidates. Availability: The S. pneumoniae resource is online at http://strepneumo-sybil.igs.umaryland.edu. The software, database and website are available for download as a portable virtual machine and from http://sourceforge.net/projects/sybil. Contact: driley@som.umaryland.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David R Riley
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
60
Genome sequences and characterization of the related Gordonia phages GTE5 and GRU1 and their use as potential biocontrol agents. Appl Environ Microbiol 2011; 78:42-7. PMID: 22038604; DOI: 10.1128/aem.05584-11.
Abstract
Activated sludge plants frequently suffer from the operational problem of stable foam formation on aerobic reactor surfaces, which can be difficult to prevent. Many foams are stabilized by mycolic acid-containing Actinobacteria, the mycolata. In situ biocontrol of foaming using phages is an attractive strategy. We describe two polyvalent phages, GTE5 and GRU1, isolated from activated sludge and targeting Gordonia terrae and Gordonia rubripertincta, respectively. Phage GRU1 also propagates on Nocardia nova. Both phages belong to the family Siphoviridae and have similar-sized icosahedral heads that encapsulate double-stranded DNA genomes (∼65 kb). Their genome sequences are similar to each other but markedly different from those of other sequenced phages, and both are arranged in a modular fashion. These phages can reduce or eliminate foam formation by their host cells under laboratory conditions.
61
Angiuoli SV, White JR, Matalka M, White O, Fricke WF. Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS One 2011; 6:e26624. PMID: 22028928; PMCID: PMC3197577; DOI: 10.1371/journal.pone.0026624.
Abstract
Background: The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck": uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.

Results: We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to mid-size facilities equipped with 454 and Illumina platforms, except for WGS metagenomics, where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used to guarantee transparency, reproducibility and portability across operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in the computational requirements, runtimes and costs of different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses used multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.

Conclusions: Although bioinformatics requirements for microbial genomics depend on dataset characteristics and analysis protocols, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) engaged in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.
Affiliation(s)
- Samuel V. Angiuoli
- Institute for Genome Sciences (IGS), University of Maryland Baltimore, Baltimore, Maryland, United States of America
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- James R. White
- Institute for Genome Sciences (IGS), University of Maryland Baltimore, Baltimore, Maryland, United States of America
- Malcolm Matalka
- Institute for Genome Sciences (IGS), University of Maryland Baltimore, Baltimore, Maryland, United States of America
- Owen White
- Institute for Genome Sciences (IGS), University of Maryland Baltimore, Baltimore, Maryland, United States of America
- W. Florian Fricke
- Institute for Genome Sciences (IGS), University of Maryland Baltimore, Baltimore, Maryland, United States of America
62
Small but sufficient: the Rhodococcus phage RRH1 has the smallest known Siphoviridae genome at 14.2 kilobases. J Virol 2011; 86:358-63. PMID: 22013058; DOI: 10.1128/jvi.05460-11.
Abstract
Bacteriophages are considered the most abundant biological entities on the planet. The Siphoviridae are the most commonly encountered tailed phages and contain double-stranded DNA with an average genome size of ∼50 kb. This paper describes the isolation, from four different activated sludge plants, of the phage RRH1, which is polyvalent, lysing five Rhodococcus species. It has a capsid diameter of only ∼43 nm. Whole-genome sequencing of RRH1 revealed a novel circularly permuted DNA sequence (14,270 bp) carrying 20 putative open reading frames. The genome has a modular arrangement, as reported for most Siphoviridae phages, but appears to encode only structural proteins and to carry a single lysis gene. All genes are transcribed in the same direction. RRH1 has the smallest genome of any functional Siphoviridae phage described to date. We demonstrate that lytic phage can be recovered by transforming naked DNA into the host bacterium, making RRH1 a potentially useful model for studying gene function in phages.
63
Ficklin SP, Sanderson LA, Cheng CH, Staton ME, Lee T, Cho IH, Jung S, Bett KE, Main D. Tripal: a construction toolkit for online genome databases. Database (Oxford) 2011; 2011:bar044. PMID: 21959868; PMCID: PMC3263599; DOI: 10.1093/database/bar044.
Abstract
As the availability, affordability and scale of genomics and genetics research increase, so does the need to provide online access to the resulting data and analyses. Many investigators and research communities want a tailored online database; however, managing the information technology infrastructure needed to create one can be an unwanted distraction from primary research, or prohibitively expensive. Tripal simplifies site development by merging the power of Drupal, a popular web content management system, with that of Chado, a community-derived database schema for storing genomic, genetic and other related biological data. Tripal provides an interface that extends the content management features of Drupal to the data housed in Chado. Furthermore, Tripal provides a web-based Chado installer, genomic data loaders, and web-based editing of data for organisms, genomic features, biological libraries, controlled vocabularies and stock collections. Also available are Tripal extensions that support loading and visualization of NCBI BLAST, InterPro, Kyoto Encyclopedia of Genes and Genomes, and Gene Ontology analyses, as well as an extension that integrates Tripal with GBrowse, a popular GMOD tool. An application programming interface allows site developers to create custom extensions, and the look and feel of the site is fully customizable through Drupal-based PHP template files. Non-biological content and user management are handled through Drupal. Tripal is open-source and freely available at http://tripal.sourceforge.net.
Affiliation(s)
- Stephen P Ficklin
- Department of Horticulture and Landscape Architecture, Washington State University, Pullman, WA 99164, USA
64
Mishima H, Sasaki K, Tanaka M, Tatebe O, Yoshiura KI. Agile parallel bioinformatics workflow management using Pwrake. BMC Res Notes 2011; 4:331. PMID: 21899774; PMCID: PMC3180464; DOI: 10.1186/1756-0500-4-331.
Abstract
Background: In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to meet the demand for workflow management, but such systems tend to be too heavyweight for actual bioinformatics practice. In scientific workflow management, quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and environments, are often the priorities. These features have a greater affinity with the agile software development method, which iterates development phases through trial and error. Here, we show the application of the scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, whose flexibility has been demonstrated in the astronomy domain. We therefore hypothesized that Pwrake also has advantages for actual bioinformatics workflows.

Findings: We implemented Pwrake workflows to process next-generation sequencing data using the Genome Analysis Toolkit (GATK) and Dindel. The GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that, in practice, scientific workflow development iterates over two phases: workflow definition and parameter adjustment. We introduced separate workflow definitions to help focus on each phase, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate the modularity of the GATK and Dindel workflows.

Conclusions: Pwrake enables agile management of scientific workflows in the bioinformatics domain. Its internal domain-specific language, built on Ruby, gives rakefiles the flexibility needed for writing scientific workflows. Furthermore, the readability and maintainability of rakefiles may facilitate sharing workflows within the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows.
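The make/Rake model underlying Pwrake, declaring tasks with prerequisites and letting the runner order them, can be sketched in a few lines. This is not Pwrake or Rake code (those are Ruby); it is a minimal Python analogue with invented task names, showing only the dependency-resolution idea.

```python
# Minimal make-style task runner: tasks declare prerequisites, and the
# runner executes prerequisites first, exactly once each.
tasks = {}

def task(name, deps=()):
    """Decorator registering a task with its prerequisite task names."""
    def register(fn):
        tasks[name] = (deps, fn)
        return fn
    return register

def run(name, done=None):
    done = set() if done is None else done
    if name in done:
        return done
    deps, fn = tasks[name]
    for dep in deps:          # resolve prerequisites before the task itself
        run(dep, done)
    fn()
    done.add(name)
    return done

log = []                      # records execution order for inspection

@task("align")
def align():
    log.append("align")

@task("call_variants", deps=("align",))
def call_variants():
    log.append("call_variants")

run("call_variants")          # "align" runs first, then "call_variants"
```

Pwrake adds parallel execution of independent prerequisites on top of this model; the sketch above only shows the sequential dependency semantics.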
Affiliation(s)
- Hiroyuki Mishima
- Department of Human Genetics, Nagasaki University Graduate School of Biomedical Sciences, 1-12-4 Sakamoto, Nagasaki, Nagasaki, Japan.
65
Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, Riley DR, Arze C, White JR, White O, Fricke WF. CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 2011; 12:356. PMID: 21878105; PMCID: PMC3228541; DOI: 10.1186/1471-2105-12-356.
Abstract
Background: Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines that are distributed pre-packaged with pre-configured software. Results: We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole-genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition, CloVR supports the use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms. Conclusions: The CloVR VM and associated architecture lower the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high-throughput data processing.
Affiliation(s)
- Samuel V Angiuoli
- Institute for Genome Sciences (IGS), University of Maryland School of Medicine, Baltimore, Maryland, USA.
|
66
|
DeMaere MZ, Lauro FM, Thomas T, Yau S, Cavicchioli R. Simple high-throughput annotation pipeline (SHAP). Bioinformatics 2011; 27:2431-2. [PMID: 21775307 DOI: 10.1093/bioinformatics/btr411] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
SUMMARY SHAP (simple high-throughput annotation pipeline) is a lightweight and scalable sequence annotation pipeline capable of supporting research efforts that generate or utilize large volumes of DNA sequence data. The software provides Grid capable analysis, relational storage and Web-based full-text searching of annotation results. Implemented in Java, SHAP recognizes the limited resources of many smaller research groups. AVAILABILITY Source code is freely available under GPLv3 at https://sourceforge.net/projects/shap. CONTACT matt.demaere@unsw.edu.au; r.cavicchioli@unsw.edu.au.
Affiliation(s)
- Matthew Z DeMaere
- School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, NSW 2052, Australia.
|
67
|
Galens K, Orvis J, Daugherty S, Creasy HH, Angiuoli S, White O, Wortman J, Mahurkar A, Giglio MG. The IGS Standard Operating Procedure for Automated Prokaryotic Annotation. Stand Genomic Sci 2011; 4:244-51. [PMID: 21677861 PMCID: PMC3111993 DOI: 10.4056/sigs.1223234] [Citation(s) in RCA: 110] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
The Institute for Genome Sciences (IGS) has developed a prokaryotic annotation pipeline that is used for coding gene/RNA prediction and functional annotation of Bacteria and Archaea. The fully automated pipeline accepts one or many genomic sequences as input and produces output in a variety of standard formats. Functional annotation is primarily based on similarity searches and motif finding combined with a hierarchical rule based annotation system. The output annotations can also be loaded into a relational database and accessed through visualization tools.
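The hierarchical rule-based approach described in this abstract can be illustrated with a minimal sketch: evidence from similarity searches and motif finders is checked in order of trustworthiness, and the highest-ranked evidence that clears its confidence threshold determines the functional assignment. All names here (`Evidence`, `RULES`, the source labels and score cutoffs) are hypothetical, not the IGS pipeline's actual code or thresholds.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str    # e.g. "blast" (similarity search), "hmm_motif" (motif finding)
    hit: str       # functional assignment suggested by this evidence
    score: float   # tool-specific confidence score

# Rule hierarchy: (evidence source, minimum score), checked from
# most to least trusted -- the "hierarchical rule based" idea.
RULES = [
    ("characterized_match", 90.0),
    ("blast", 70.0),
    ("hmm_motif", 50.0),
]

def annotate(evidence):
    """Return the product name from the best-ranked qualifying evidence,
    falling back to 'hypothetical protein' when nothing qualifies."""
    for source, min_score in RULES:
        for ev in evidence:
            if ev.source == source and ev.score >= min_score:
                return ev.hit
    return "hypothetical protein"
```

A gene with both a strong BLAST hit and a weaker motif hit would be annotated from the BLAST evidence, since that rule ranks higher; a gene with no qualifying evidence gets the conventional fallback name.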
Affiliation(s)
- Kevin Galens
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
|
68
|
Cieślik M, Mura C. A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines. BMC Bioinformatics 2011; 12:61. [PMID: 21352538 PMCID: PMC3051902 DOI: 10.1186/1471-2105-12-61] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2010] [Accepted: 02/25/2011] [Indexed: 11/13/2022] Open
Abstract
Background Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. Results To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). Conclusions PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy can also be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing.
PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
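The dataflow idea this abstract describes can be sketched in a few lines: user-written functions are chained into a pipeline whose stages are evaluated as map operations over a pool of workers, with input items grouped into batches whose size tunes the parallelism/memory trade-off. This is an illustrative sketch of the concept only; the names (`run_pipeline`, `batched`, the example stages) are hypothetical and not PaPy's actual API.

```python
from multiprocessing import Pool

def clean(seq):
    """Stage 1: normalize a raw sequence string."""
    return seq.strip().upper()

def gc_content(seq):
    """Stage 2: compute the GC fraction of a cleaned sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def batched(items, size):
    """Yield successive batches; batch size tunes the trade-off
    between parallelism and memory consumption."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_pipeline(stages, items, batch_size=2, workers=2):
    """Apply each stage as a map over pooled workers, batch by batch --
    a chain of stages acting as nested map functions over the input."""
    with Pool(workers) as pool:
        results = []
        for batch in batched(items, batch_size):
            data = batch
            for stage in stages:
                data = pool.map(stage, data)  # one map per pipeline stage
            results.extend(data)
    return results

if __name__ == "__main__":
    reads = [" acgt ", "ggcc", "atat "]
    print(run_pipeline([clean, gc_content], reads))  # [0.5, 1.0, 0.0]
```

A real dataflow framework generalizes this linear chain to a directed acyclic graph of components and supports remote as well as local worker pools, but the core pattern, functions coupled by data-pipes and evaluated lazily in batches, is the same.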
Affiliation(s)
- Marcin Cieślik
- Department of Chemistry, University of Virginia, Charlottesville, VA 22904-4319, USA
|
69
|
Chain PS, Xie G, Starkenburg SR, Scholz MB, Beckloff N, Lo CC, Davenport KW, Reitenga KG, Daligault HE, Detter JC, Freitas TA, Gleasner CD, Green LD, Han CS, McMurry KK, Meincke LJ, Shen X, Zeytun A. Genomics for Key Players in the N Cycle. Methods Enzymol 2011; 496:289-318. [DOI: 10.1016/b978-0-12-386489-5.00012-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
|
70
|
Bacterial population genomics and infectious disease diagnostics. Trends Biotechnol 2010; 28:611-8. [PMID: 20961641 DOI: 10.1016/j.tibtech.2010.09.001] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2010] [Revised: 09/03/2010] [Accepted: 09/07/2010] [Indexed: 01/14/2023]
Abstract
New sequencing technologies have made the production of bacterial genome sequences increasingly easy, and it can be confidently forecasted that vast genomic databases will be generated in the next few years. Here, we detail how collections of bacterial genomes from a particular species (population genomics libraries) have already been used to improve the design of several diagnostic assays for bacterial pathogens. Genome sequencing itself is also becoming more commonly used for epidemiological, forensic and clinical investigations. There is an opportunity for the further development of bioinformatic tools to bring even further value to bacterial diagnostic genomics.
|
71
|
Abstract
In the path towards personalized medicine, the integrative bioinformatics infrastructure is a critical enabling resource. Until large-scale reference data became available, the attributes of the computational infrastructure were postulated by many, but have mostly remained unverified. Now that large-scale initiatives such as The Cancer Genome Atlas (TCGA) are in full swing, the opportunity is at hand to find out what analytical approaches and computational architectures are really effective. A recent report did just that: first a software development environment was assembled as part of an informatics research program, and only then was the analysis of TCGA's glioblastoma multiforme multi-omic data pursued at the multi-omic scale. The results of this complex analysis are the focus of the report highlighted here. However, what is reported in the analysis is also the validating corollary for an infrastructure development effort guided by the iterative identification of sound design criteria for the architecture of the integrative computational infrastructure. The work is at least as valuable as the data analysis results themselves: computational ecosystems with their own high-level abstractions rather than rigid pipelines with prescriptive recipes appear to be the critical feature of an effective infrastructure. Only then can analytical workflows benefit from experimentation just like any other component of the biomedical research program.
Affiliation(s)
- Jonas S Almeida
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA.
|
72
|
Ovaska K, Laakso M, Haapa-Paananen S, Louhimo R, Chen P, Aittomäki V, Valo E, Núñez-Fontarnau J, Rantanen V, Karinen S, Nousiainen K, Lahesmaa-Korpinen AM, Miettinen M, Saarinen L, Kohonen P, Wu J, Westermarck J, Hautaniemi S. Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme. Genome Med 2010; 2:65. [PMID: 20822536 PMCID: PMC3092116 DOI: 10.1186/gm186] [Citation(s) in RCA: 125] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2010] [Revised: 07/16/2010] [Accepted: 09/07/2010] [Indexed: 01/17/2023] Open
Abstract
Background Coordinated efforts to collect large-scale data sets provide a basis for systems level understanding of complex diseases. In order to translate these fragmented and heterogeneous data sets into knowledge and medical benefits, advanced computational methods for data analysis, integration and visualization are needed. Methods We introduce a novel data integration framework, Anduril, for translating fragmented large-scale data into testable predictions. The Anduril framework allows rapid integration of heterogeneous data with state-of-the-art computational methods and existing knowledge in bio-databases. Anduril automatically generates thorough summary reports and a website that shows the most relevant features of each gene at a glance, allows sorting of data based on different parameters, and provides direct links to more detailed data on genes, transcripts or genomic regions. Anduril is open-source; all methods and documentation are freely available. Results We have integrated multidimensional molecular and clinical data from 338 subjects having glioblastoma multiforme, one of the deadliest and most poorly understood cancers, using Anduril. The central objective of our approach is to identify genetic loci and genes that have significant survival effect. Our results suggest several novel genetic alterations linked to glioblastoma multiforme progression and, more specifically, reveal Moesin as a novel glioblastoma multiforme-associated gene that has a strong survival effect and whose depletion in vitro significantly inhibited cell proliferation. All analysis results are available as a comprehensive website. Conclusions Our results demonstrate that integrated analysis and visualization of multidimensional and heterogeneous data by Anduril enables drawing conclusions on functional consequences of large-scale molecular data. 
Many of the identified genetic loci and genes having a significant survival effect have not been reported earlier in the context of glioblastoma multiforme. Thus, in addition to generally applicable novel methodology, our results provide several glioblastoma multiforme candidate genes for further studies. Anduril is available at http://csbi.ltdk.helsinki.fi/anduril/ and the glioblastoma multiforme analysis results are available at http://csbi.ltdk.helsinki.fi/anduril/tcga-gbm/
Affiliation(s)
- Kristian Ovaska
- Computational Systems Biology Laboratory, Institute of Biomedicine and Genome-Scale Biology Research Program, University of Helsinki, Haartmaninkatu 8, Helsinki, FIN-00014, Finland.
|