52. Marchese Robinson RL, Cronin MTD, Richarz AN, Rallo R. An ISA-TAB-Nano based data collection framework to support data-driven modelling of nanotoxicology. Beilstein Journal of Nanotechnology 2015; 6:1978-99. [PMID: 26665069; PMCID: PMC4660926; DOI: 10.3762/bjnano.6.202]
Abstract
Analysis of trends in nanotoxicology data, and the development of data-driven models for nanotoxicity, is facilitated by reporting data in a standardised electronic format. ISA-TAB-Nano has been proposed as such a format. However, in order to build useful datasets according to this format, a variety of issues have to be addressed, including questions regarding exactly which (meta)data to report and how to report them. The current article discusses some of the challenges associated with the use of ISA-TAB-Nano and presents a set of resources designed to facilitate the manual creation of ISA-TAB-Nano datasets from the nanotoxicology literature. These resources were developed within the context of the NanoPUZZLES EU project and include data collection templates, corresponding business rules that extend the generic ISA-TAB-Nano specification, as well as Python code to facilitate parsing and integration of these datasets within other nanoinformatics resources. The use of these resources is illustrated by a "Toy Dataset" presented in the Supporting Information. The strengths and weaknesses of the resources are discussed along with possible future developments.
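The article's Python parsing utilities are project-specific, but since ISA-TAB(-Nano) files are plain tab-separated text, the basic reading step can be sketched with the standard library alone. The column labels and sample values below are hypothetical illustrations, not taken from the NanoPUZZLES resources.

```python
import csv
import io

# Hypothetical fragment of an ISA-TAB-style study table: the first row
# holds the column headers, each following row one sample record.
STUDY_TSV = (
    "Sample Name\tCharacteristics[material]\tCharacteristics[size]\n"
    "nano-1\tTiO2\t21 nm\n"
    "nano-2\tZnO\t35 nm\n"
)

def read_study_table(text):
    """Parse a tab-separated study table into a list of per-sample dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

records = read_study_table(STUDY_TSV)
print(records[0]["Sample Name"])            # nano-1
print(records[1]["Characteristics[size]"])  # 35 nm
```

Real ISA-TAB investigations add multi-file structure and ontology annotations on top of this, which is where format-specific business rules come in.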
Affiliation(s)
- Richard L Marchese Robinson, School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool, L3 3AF, United Kingdom
- Mark T D Cronin, School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool, L3 3AF, United Kingdom
- Andrea-Nicole Richarz, School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool, L3 3AF, United Kingdom
- Robert Rallo, Departament d'Enginyeria Informatica i Matematiques, Universitat Rovira i Virgili, Av. Paisos Catalans 26, 43007 Tarragona, Catalunya, Spain
53. Golosova O, Henderson R, Vaskin Y, Gabrielian A, Grekhov G, Nagarajan V, Oler AJ, Quiñones M, Hurt D, Fursov M, Huyen Y. Unipro UGENE NGS pipelines and components for variant calling, RNA-seq and ChIP-seq data analyses. PeerJ 2014; 2:e644. [PMID: 25392756; PMCID: PMC4226638; DOI: 10.7717/peerj.644]
Abstract
The advent of Next Generation Sequencing (NGS) technologies has opened new possibilities for researchers. However, the more biology becomes a data-intensive field, the more biologists have to learn how to process and analyze NGS data with complex computational tools. Even with the availability of common pipeline specifications, it is often a time-consuming and cumbersome task for a bench scientist to install and configure the pipeline tools. We believe that a unified, desktop and biologist-friendly front end to NGS data analysis tools will substantially improve productivity in this field. Here we present NGS pipelines "Variant Calling with SAMtools", "Tuxedo Pipeline for RNA-seq Data Analysis" and "Cistrome Pipeline for ChIP-seq Data Analysis" integrated into the Unipro UGENE desktop toolkit. We describe the available UGENE infrastructure that helps researchers run these pipelines on different datasets, store and investigate the results and re-run the pipelines with the same parameters. These pipeline tools are included in the UGENE NGS package. Individual blocks of these pipelines are also available for expert users to create their own advanced workflows.
Affiliation(s)
- Olga Golosova, Unipro Center for Information Technologies, Novosibirsk, Russia
- Ross Henderson, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
- Yuriy Vaskin, Unipro Center for Information Technologies, Novosibirsk, Russia
- Andrei Gabrielian, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
- German Grekhov, Unipro Center for Information Technologies, Novosibirsk, Russia
- Vijayaraj Nagarajan, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
- Andrew J Oler, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
- Mariam Quiñones, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
- Darrell Hurt, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
- Mikhail Fursov, Unipro Center for Information Technologies, Novosibirsk, Russia
- Yentram Huyen, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA
54. Tsiliki G, Karacapilidis N, Christodoulou S, Tzagarakis M. Collaborative mining and interpretation of large-scale data for biomedical research insights. PLoS One 2014; 9:e108600. [PMID: 25268270; PMCID: PMC4182494; DOI: 10.1371/journal.pone.0108600]
Abstract
Biomedical research is becoming increasingly interdisciplinary and collaborative in nature. Researchers need to collaborate and make decisions efficiently and effectively by meaningfully assembling, mining and analyzing the large volumes of complex, multi-faceted data residing in different sources. In line with research showing that, despite recent advances in data mining and computational analysis, humans can easily detect patterns that computer algorithms may have difficulty finding, this paper reports on the practical use of an innovative web-based collaboration support platform in a biomedical research context. Arguing that dealing with data-intensive and cognitively complex settings is not a technical problem alone, the proposed platform adopts a hybrid approach that builds on the synergy between machine and human intelligence to facilitate the underlying sense-making and decision-making processes. User experience shows that the platform enables quicker and more informed decisions by displaying aggregated information according to users' needs, while also exploiting the associated human intelligence.
Affiliation(s)
- Georgia Tsiliki, School of Chemical Engineering, National Technical University of Athens, Athens, Greece
- Nikos Karacapilidis, University of Patras and Computer Technology Institute & Press ‘Diophantus’, Patras, Greece
- Spyros Christodoulou, University of Patras and Computer Technology Institute & Press ‘Diophantus’, Patras, Greece
- Manolis Tzagarakis, University of Patras and Computer Technology Institute & Press ‘Diophantus’, Patras, Greece
55. Hettne KM, Dharuri H, Zhao J, Wolstencroft K, Belhajjame K, Soiland-Reyes S, Mina E, Thompson M, Cruickshank D, Verdes-Montenegro L, Garrido J, de Roure D, Corcho O, Klyne G, van Schouwen R, ‘t Hoen PAC, Bechhofer S, Goble C, Roos M. Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomed Semantics 2014; 5:41. [PMID: 25276335; PMCID: PMC4177597; DOI: 10.1186/2041-1480-5-41]
Abstract
BACKGROUND: One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide the necessary metadata for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows.
RESULTS: We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined best practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as "which particular data was input to a particular workflow to test a particular hypothesis?" and "which particular conclusions were drawn from a particular workflow?".
CONCLUSIONS: Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well.
AVAILABILITY: The Research Object is available at http://www.myexperiment.org/packs/428. The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro.
Affiliation(s)
- Kristina M Hettne, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Harish Dharuri, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Jun Zhao, Department of Zoology, University of Oxford, Oxford, UK
- Katherine Wolstencroft, School of Computer Science, University of Manchester, Manchester, UK; Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
- Khalid Belhajjame, School of Computer Science, University of Manchester, Manchester, UK
- Eleni Mina, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Mark Thompson, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- David de Roure, Department of Zoology, University of Oxford, Oxford, UK
- Oscar Corcho, Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
- Graham Klyne, Department of Zoology, University of Oxford, Oxford, UK
- Reinout van Schouwen, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Peter A C ‘t Hoen, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Sean Bechhofer, School of Computer Science, University of Manchester, Manchester, UK
- Carole Goble, School of Computer Science, University of Manchester, Manchester, UK
- Marco Roos, Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
56. Beisken S, Earll M, Baxter C, Portwood D, Ament Z, Kende A, Hodgman C, Seymour G, Smith R, Fraser P, Seymour M, Salek RM, Steinbeck C. Metabolic differences in ripening of Solanum lycopersicum 'Ailsa Craig' and three monogenic mutants. Sci Data 2014; 1:140029. [PMID: 25977786; PMCID: PMC4322568; DOI: 10.1038/sdata.2014.29]
Abstract
Application of mass spectrometry enables the detection of metabolic differences between groups of related organisms. Differences in the metabolic fingerprints of wild-type Solanum lycopersicum and three monogenic mutants of tomato, ripening inhibitor (rin), non-ripening (nor) and Colourless non-ripening (Cnr), are captured with regard to ripening behaviour. A high-resolution tandem mass spectrometry system coupled to liquid chromatography produced a time series of the ripening behaviour at discrete intervals, with a focus on changes post-anthesis. Internal standards and quality controls were used to ensure system stability. The raw data of the samples and reference compounds, including study protocols, have been deposited in the open metabolomics database MetaboLights via the ISA-Tab metadata annotation tooling to enable efficient re-use of the datasets, such as in metabolomics cross-study comparisons or data fusion exercises.
Affiliation(s)
- Stephan Beisken, European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 2HA, UK
- Mark Earll, Syngenta Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- Charles Baxter, Syngenta Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- David Portwood, Syngenta Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- Zsuzsanna Ament, Syngenta Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- Aniko Kende, Syngenta Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- Charlie Hodgman, Centre for Plant Integrative Biology, University of Nottingham, Loughborough, Leicestershire LE12 5RD, UK
- Graham Seymour, Centre for Plant Integrative Biology, University of Nottingham, Loughborough, Leicestershire LE12 5RD, UK
- Rebecca Smith, Centre for Plant Integrative Biology, University of Nottingham, Loughborough, Leicestershire LE12 5RD, UK
- Paul Fraser, School of Biological Sciences, Royal Holloway, University of London, Egham Hill, Egham, Surrey TW20 0EX, UK
- Mark Seymour, Syngenta Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, UK
- Reza M Salek, European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 2HA, UK
- Christoph Steinbeck, European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 2HA, UK
57. Tsiliki G, Kossida S, Friesen N, Rüping S, Tzagarakis M, Karacapilidis N. A Data Mining Based Approach for Collaborative Analysis of Biomedical Data. Int J Artif Intell Tools 2014. [DOI: 10.1142/s0218213014600100]
Abstract
Biomedical research is becoming increasingly multidisciplinary and collaborative in nature. At the same time, it has recently seen a vast growth in publicly and instantly available information. As the available resources become more specialized, there is a growing need for multidisciplinary collaborations between biomedical researchers to address complex research questions. We present an application of a data mining algorithm to genomic data in a collaborative decision-making support environment, as a typical example of how multidisciplinary researchers can collaborate in analyzing and interpreting biomedical data. Through the proposed approach, researchers can easily decide which data repositories should be considered, analyze the algorithmic results, discuss the weaknesses of the patterns identified, and set up new iterations of the data mining algorithm by defining other descriptive attributes or integrating other relevant data. Evaluation results show that the proposed approach helps users set their research objectives and better understand the data and methodologies used in their research.
Affiliation(s)
- Georgia Tsiliki, Bioinformatics and Medical Informatics Team, Biomedical Research Foundation, Academy of Athens, 4 Soranou Ephessiou, 115 27, Greece
- Sophia Kossida, Bioinformatics and Medical Informatics Team, Biomedical Research Foundation, Academy of Athens, 4 Soranou Ephessiou, 115 27, Greece
- Natalja Friesen, Knowledge Discovery Group, Fraunhofer Institute IAIS, Sankt Augustin, Germany
- Stefan Rüping, Knowledge Discovery Group, Fraunhofer Institute IAIS, Sankt Augustin, Germany
- Manolis Tzagarakis, University of Patras and Computer Technology Institute & Press “Diophantus”, Rio Patras, Greece
- Nikos Karacapilidis, University of Patras and Computer Technology Institute & Press “Diophantus”, Rio Patras, Greece
58. Costa RS, Veríssimo A, Vinga S. KiMoSys: a web-based repository of experimental data for KInetic MOdels of biological SYStems. BMC Systems Biology 2014; 8:85. [PMID: 25115331; PMCID: PMC4236735; DOI: 10.1186/s12918-014-0085-3]
Abstract
BACKGROUND: The kinetic modeling of biological systems mainly comprises three steps that proceed iteratively: model building, simulation and analysis. In the first step, it is usually required to set initial metabolite concentrations and to assign kinetic rate laws, along with estimating parameter values through optimization using kinetic data when these are not known. Although the rapid development of high-throughput methods has generated much omics data, experimentalists present only a summary of the obtained results for publication, and the experimental data files are usually not submitted to any public repository, or are simply not available at all. In order to automate as much as possible the steps of building kinetic models, there is a growing requirement in the systems biology community for easily exchanging data in combination with models, which is the main motivation behind the development of KiMoSys.
DESCRIPTION: KiMoSys is a user-friendly platform that includes a public data repository of published experimental data, containing concentration data of metabolites and enzymes as well as flux data. It was designed to ensure data management, storage and sharing for the wider systems biology community. This community repository offers a web-based interface and upload facility to turn available data into publicly accessible, centralized and structured-format data files. Moreover, it compiles and integrates available kinetic models associated with the data. KiMoSys also integrates tools to facilitate the construction of kinetic models of large-scale metabolic networks, especially when systems biologists perform computational research.
CONCLUSIONS: KiMoSys is a web-based system that integrates a public data repository and its associated model(s) with computational tools, providing the systems biology community with a novel application that facilitates data storage and sharing, thus supporting the construction of ODE-based kinetic models and collaborative research projects. The web application, implemented using the Ruby on Rails framework, is freely available at http://kimosys.org, along with its full documentation.
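KiMoSys itself stores data and models rather than code, but the kind of ODE-based kinetic model the abstract refers to can be illustrated with a minimal, self-contained sketch: a single Michaelis-Menten reaction integrated with explicit Euler steps. The rate constants and step size are arbitrary illustration values, not taken from any repository entry.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate v = Vmax * S / (Km + S)."""
    return vmax * s / (km + s)

def simulate(s0, vmax, km, dt=0.01, t_end=10.0):
    """Integrate dS/dt = -v with explicit Euler steps; return the final S."""
    s = s0
    for _ in range(int(t_end / dt)):
        s -= michaelis_menten_rate(s, vmax, km) * dt
        s = max(s, 0.0)  # a concentration cannot go negative
    return s

final = simulate(s0=5.0, vmax=1.0, km=0.5)
print(0.0 <= final < 5.0)  # True: substrate is consumed but stays non-negative
```

Production kinetic modelling would use a stiff ODE solver and rate laws with fitted parameters; the point here is only the model-building step the abstract describes.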
Affiliation(s)
- Rafael S Costa, Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento (INESC-ID), R Alves Redol 9, Lisboa, 1000-029, Portugal; Center for Intelligent Systems, LAETA, IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Lisboa, 1049-001, Portugal
- André Veríssimo, Center for Intelligent Systems, LAETA, IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Lisboa, 1049-001, Portugal
- Susana Vinga, Center for Intelligent Systems, LAETA, IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Lisboa, 1049-001, Portugal
59. WGS Analysis and Interpretation in Clinical and Public Health Microbiology Laboratories: What Are the Requirements and How Do Existing Tools Compare? Pathogens 2014; 3:437-58. [PMID: 25437808; PMCID: PMC4243455; DOI: 10.3390/pathogens3020437]
Abstract
Recent advances in DNA sequencing technologies have the potential to transform the field of clinical and public health microbiology, and in the last few years numerous case studies have demonstrated successful applications in this context. Among other considerations, a lack of user-friendly data analysis and interpretation tools has been frequently cited as a major barrier to routine use of these techniques. Here we consider the requirements of microbiology laboratories for the analysis, clinical interpretation and management of bacterial whole-genome sequence (WGS) data. Then we discuss relevant, existing WGS analysis tools. We highlight many essential and useful features that are represented among existing tools, but find that no single tool fulfils all of the necessary requirements. We conclude that to fully realise the potential of WGS analyses for clinical and public health microbiology laboratories of all scales, we will need to develop tools specifically with the needs of these laboratories in mind.
60. The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration. J Cheminform 2014. [PMCID: PMC4036106; DOI: 10.1186/1758-2946-6-24]
Abstract
Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support data-driven scientific discovery in complex and explorative bioinformatics applications.
Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology-preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is the Fast Learning Self Organizing Map (FLSOM), an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds.
Conclusions: The number and variety of available tools, together with its extensibility, have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool that can be adopted for the explorative analysis of biological datasets.
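The core of BioDICE is FLSOM, a variant of the self-organizing map; the FLSOM-specific improvements are not described in the abstract, but the classic SOM update rule (pull the best-matching unit and its grid neighbours toward each input vector) can be sketched independently of the plugin. The grid size, learning rate and neighbourhood width below are arbitrary choices, not BioDICE defaults.

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, rows=3, cols=3, epochs=50, lr=0.5, sigma=1.0, seed=0):
    """Train a tiny SOM grid; returns weight vectors keyed by (row, col)."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = {(r, c): [rng.random() for _ in range(dim)]
         for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for x in data:
            bmu = min(w, key=lambda u: dist2(w[u], x))  # best-matching unit
            for u in w:
                # neighbourhood influence falls off with grid distance from the BMU
                g = math.exp(-dist2(u, bmu) / (2 * sigma ** 2))
                w[u] = [wi + lr * g * (xi - wi) for wi, xi in zip(w[u], x)]
    return w

data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
weights = train_som(data)
print(len(weights))  # 9 units on the 3x3 grid
```

A full implementation would also decay the learning rate and neighbourhood width over time and project new samples onto the trained 2D map for visualization.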
61. McDonagh JL, Nath N, De Ferrari L, van Mourik T, Mitchell JBO. Uniting cheminformatics and chemical theory to predict the intrinsic aqueous solubility of crystalline druglike molecules. J Chem Inf Model 2014; 54:844-56. [PMID: 24564264; PMCID: PMC3965570; DOI: 10.1021/ci4005805]
Abstract
We present four models of solution free-energy prediction for druglike molecules utilizing cheminformatics descriptors and theoretically calculated thermodynamic values. We make predictions of solution free energy using physics-based theory alone and using machine learning/quantitative structure–property relationship (QSPR) models. We also develop machine learning models where the theoretical energies and cheminformatics descriptors are used as combined input. These models are used to predict solvation free energy. While direct theoretical calculation does not give accurate results in this approach, machine learning is able to give predictions with a root mean squared error (RMSE) of ∼1.1 log S units in a 10-fold cross-validation for our Drug-Like-Solubility-100 (DLS-100) dataset of 100 druglike molecules. We find that a model built using energy terms from our theoretical methodology as descriptors is marginally less predictive than one built on Chemistry Development Kit (CDK) descriptors. Combining both sets of descriptors allows a further but very modest improvement in the predictions. However, in some cases, this is a statistically significant enhancement. These results suggest that there is little complementarity between the chemical information provided by these two sets of descriptors, despite their different sources and methods of calculation. Our machine learning models are also able to predict the well-known Solubility Challenge dataset with an RMSE value of 0.9–1.0 log S units.
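The RMSE figures quoted above come from 10-fold cross-validation, a protocol that is easy to state concretely. The sketch below computes a cross-validated RMSE for a deliberately trivial mean-only baseline on synthetic data; it illustrates the evaluation metric only and has no connection to the authors' models or to the DLS-100 dataset.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def k_fold_rmse(y, n_folds=10, seed=0):
    """Average RMSE of a mean-only baseline predictor over k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for k in range(n_folds):
        test = set(folds[k])
        train = [i for i in idx if i not in test]
        mean = sum(y[i] for i in train) / len(train)  # "fit" on the training fold
        scores.append(rmse([y[i] for i in folds[k]], [mean] * len(folds[k])))
    return sum(scores) / n_folds

rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(100)]
score = k_fold_rmse(y)
print(0.5 < score < 1.5)  # True: close to the sample standard deviation
```

In a real QSPR study the mean predictor would be replaced by a model fitted on descriptors, with the fold split applied identically.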
Affiliation(s)
- James L McDonagh, Biomedical Sciences Research Complex and EaStCHEM, School of Chemistry, Purdie Building, University of St. Andrews, North Haugh, St. Andrews, Scotland, KY16 9ST, United Kingdom
62. SemEnAl: Using Semantics for Accelerating Environmental Analytical Model Discovery. Big Data Analytics 2014. [DOI: 10.1007/978-3-319-13820-6_8]
63. Dharuri H, Henneman P, Demirkan A, van Klinken JB, Mook-Kanamori DO, Wang-Sattler R, Gieger C, Adamski J, Hettne K, Roos M, Suhre K, Van Duijn CM, van Dijk KW, 't Hoen PAC. Automated workflow-based exploitation of pathway databases provides new insights into genetic associations of metabolite profiles. BMC Genomics 2013; 14:865. [PMID: 24320595; PMCID: PMC3879060; DOI: 10.1186/1471-2164-14-865]
Abstract
BACKGROUND: Genome-wide association studies (GWAS) have identified many common single nucleotide polymorphisms (SNPs) that associate with clinical phenotypes, but these SNPs usually explain just a small part of the heritability and have relatively modest effect sizes. In contrast, SNPs that associate with metabolite levels generally explain a higher percentage of the genetic variation and demonstrate larger effect sizes. Still, the discovery of SNPs associated with metabolite levels is challenging, since testing all metabolites measured in typical metabolomics studies against all SNPs comes with a severe multiple testing penalty. We have developed an automated workflow approach that utilizes prior knowledge of biochemical pathways present in databases like KEGG and BioCyc to generate a smaller SNP set relevant to the metabolite. This paper explores the opportunities and challenges in the analysis of GWAS of metabolomic phenotypes and provides novel insights into the genetic basis of metabolic variation through the re-analysis of published GWAS datasets.
RESULTS: Re-analysis of the published GWAS dataset from Illig et al. (Nature Genetics, 2010) using a pathway-based workflow (http://www.myexperiment.org/packs/319.html) confirmed previously identified hits and identified a new locus of human metabolic individuality, associating aldehyde dehydrogenase family 1 member L1 (ALDH1L1) with serine/glycine ratios in blood. Replication in an independent GWAS dataset of phospholipids (Demirkan et al., PLoS Genetics, 2012) identified two novel loci supported by additional literature evidence: GPAM (glycerol-3-phosphate acyltransferase) and CBS (cystathionine beta-synthase). In addition, the workflow approach provided novel insight into the affected pathways and the relevance of some of these gene-metabolite pairs in disease development and progression.
CONCLUSIONS: We demonstrate the utility of automated exploitation of background knowledge present in pathway databases for the analysis of GWAS datasets of metabolomic phenotypes. We report novel loci and potential biochemical mechanisms that contribute to our understanding of the genetic basis of metabolic variation and its relationship to disease development and progression.
Affiliation(s)
- Peter A C 't Hoen, Center for Human and Clinical Genetics, Leiden University Medical Center, S4-P, PO Box 9600, 2300 RC Leiden, Netherlands
64.
Affiliation(s)
- Geir Kjetil Sandve, Department of Informatics, University of Oslo, Blindern, Oslo, Norway; Centre for Cancer Biomedicine, University of Oslo, Blindern, Oslo, Norway
- Anton Nekrutenko, Department of Biochemistry and Molecular Biology and The Huck Institutes for the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- James Taylor, Department of Biology and Department of Mathematics and Computer Science, Emory University, Atlanta, Georgia, United States of America
- Eivind Hovig, Department of Informatics, University of Oslo, Blindern, Oslo, Norway; Department of Tumor Biology, Institute for Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, Oslo, Norway; Institute for Medical Informatics, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, Oslo, Norway
65. Beisken S, Meinl T, Wiswedel B, de Figueiredo LF, Berthold M, Steinbeck C. KNIME-CDK: Workflow-driven cheminformatics. BMC Bioinformatics 2013; 14:257. [PMID: 24103053; PMCID: PMC3765822; DOI: 10.1186/1471-2105-14-257]
Abstract
Background Cheminformaticians have to routinely process and analyse libraries of small molecules. Among other things, this includes the standardisation of molecules, calculation of various descriptors, visualisation of molecular structures, and downstream analysis. For this purpose, scientific workflow platforms such as the Konstanz Information Miner can be used if provided with the right plug-in. A workflow-based cheminformatics tool provides the advantages of ease of use and interoperability between complementary cheminformatics packages within the same framework, hence facilitating the analysis process. Results KNIME-CDK comprises functions for molecule conversion to/from common formats, generation of signatures, fingerprints, and molecular properties. It is based on the Chemistry Development Toolkit and uses the Chemical Markup Language for persistence. A comparison with the cheminformatics plug-in RDKit shows that KNIME-CDK supports a similar range of chemical classes and adds new functionality to the framework. We describe the design and integration of the plug-in, and demonstrate the usage of the nodes on ChEBI, a library of small molecules of biological interest. Conclusions KNIME-CDK is an open-source plug-in for the Konstanz Information Miner, a free workflow platform. KNIME-CDK is built on top of the open-source Chemistry Development Toolkit and allows for efficient cross-vendor structural cheminformatics. Its ease of use and modularity enable researchers to automate routine tasks and data analysis, bringing complementary cheminformatics functionality to the workflow environment.
Affiliation(s)
- Stephan Beisken
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
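As a toy illustration of the fingerprint comparison mentioned in the KNIME-CDK abstract above: binary molecular fingerprints are commonly compared with the Tanimoto coefficient. This is a minimal pure-Python sketch, not KNIME-CDK's actual API (the plug-in is Java/CDK-based), and the bit positions are made up.

```python
# Toy fingerprint similarity: fingerprints are represented as sets of
# "on" bit positions; Tanimoto = |intersection| / |union|.
# This is an illustrative sketch, not KNIME-CDK or CDK code.

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical on-bit positions for two small molecules.
mol_a = {1, 4, 9, 17, 23}
mol_b = {1, 4, 9, 23, 42}

print(round(tanimoto(mol_a, mol_b), 2))  # 4 shared bits / 6 total = 0.67
```

A real KNIME-CDK workflow would compute such fingerprints from chemical structures via dedicated nodes; the arithmetic of the comparison is the same.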
66
Kouskoumvekaki I, Shublaq N, Brunak S. Facilitating the use of large-scale biological data and tools in the era of translational bioinformatics. Brief Bioinform 2013; 15:942-52. [PMID: 23908249] [DOI: 10.1093/bib/bbt055]
Abstract
As both the amount of generated biological data and the available processing power increase, computational experimentation is no longer exclusive to bioinformaticians but is spreading across all biomedical domains. For bioinformatics to realize its translational potential, domain experts need access to user-friendly solutions to navigate, integrate and extract information out of biological databases, as well as to combine tools and data resources in bioinformatics workflows. In this review, we present services that assist biomedical scientists in incorporating bioinformatics tools into their research. We review recent applications of Cytoscape, BioGPS and DAVID for data visualization, integration and functional enrichment. Moreover, we illustrate the use of Taverna, Kepler, GenePattern, and Galaxy as open-access workbenches for bioinformatics workflows. Finally, we mention services that facilitate the integration of biomedical ontologies and bioinformatics tools in computational workflows.
67
Ison J, Kalas M, Jonassen I, Bolser D, Uludag M, McWilliam H, Malone J, Lopez R, Pettifer S, Rice P. EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics 2013; 29:1325-32. [PMID: 23479348] [PMCID: PMC3654706] [DOI: 10.1093/bioinformatics/btt113]
Abstract
MOTIVATION Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required. RESULTS EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations. AVAILABILITY The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl. CONTACT jison@ebi.ac.uk.
Affiliation(s)
- Jon Ison
- EMBL European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK.
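The EDAM abstract above notes that the ontology is distributed in OBO format, among others. As a hedged sketch of what consuming such a file involves, the snippet below parses OBO-style `[Term]` stanzas; the stanza text is a hypothetical example in generic OBO syntax, not copied from the real EDAM.obo.

```python
# Minimal parser for OBO-format [Term] stanzas, the flat-file syntax
# EDAM is distributed in. The sample stanzas below are invented for
# illustration and do not reproduce real EDAM identifiers.

obo_text = """\
[Term]
id: operation:0001
name: Sequence alignment

[Term]
id: data:0002
name: Sequence record
"""

def parse_obo_terms(text: str) -> dict:
    """Return {term id: term name} for each [Term] stanza."""
    terms, current = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
        elif line.startswith("id: "):
            current["id"] = line[4:]
        elif line.startswith("name: "):
            terms[current["id"]] = line[6:]
    return terms

print(parse_obo_terms(obo_text))
```

In practice one would use a dedicated ontology library rather than a hand-rolled parser, but the stanza structure is simple enough that the sketch conveys the format.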
68
McWilliam H, Li W, Uludag M, Squizzato S, Park YM, Buso N, Cowley AP, Lopez R. Analysis Tool Web Services from the EMBL-EBI. Nucleic Acids Res 2013; 41:W597-600. [PMID: 23671338] [PMCID: PMC3692137] [DOI: 10.1093/nar/gkt376]
Abstract
Since 2004 the European Bioinformatics Institute (EMBL-EBI) has provided access to a wide range of databases and analysis tools via Web Services interfaces. This comprises services to search across the databases available from the EMBL-EBI and to explore the network of cross-references present in the data (e.g. EB-eye), services to retrieve entry data in various data formats and to access the data in specific fields (e.g. dbfetch), and analysis tool services, for example, sequence similarity search (e.g. FASTA and NCBI BLAST), multiple sequence alignment (e.g. Clustal Omega and MUSCLE), pairwise sequence alignment and protein functional analysis (e.g. InterProScan and Phobius). The REST/SOAP Web Services (http://www.ebi.ac.uk/Tools/webservices/) interfaces to these databases and tools allow their integration into other tools, applications, web sites, pipeline processes and analytical workflows. To get users started using the Web Services, sample clients are provided covering a range of programming languages and popular Web Service tool kits, and a brief guide to Web Services technologies, including a set of tutorials, is available for those wishing to learn more and develop their own clients. Users of the Web Services are informed of improvements and updates via a range of methods.
Affiliation(s)
- Hamish McWilliam
- EMBL Outstation-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD Cambridge, UK
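The abstract above mentions dbfetch among the EMBL-EBI REST services. As a sketch, the snippet below builds a dbfetch request URL from the service's documented query parameters (db, id, format, style); the network request itself is left out so the example stays offline, and the accession used is just an example.

```python
# Sketch of addressing the EMBL-EBI dbfetch REST service.
# Only the URL is constructed here; fetching is left to the caller.
from urllib.parse import urlencode

def dbfetch_url(db: str, entry_id: str, fmt: str = "fasta") -> str:
    """Build a dbfetch query URL for one database entry."""
    base = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"
    params = {"db": db, "id": entry_id, "format": fmt, "style": "raw"}
    return base + "?" + urlencode(params)

url = dbfetch_url("uniprotkb", "P05067")
print(url)
# To actually retrieve the record, one could pass `url` to
# urllib.request.urlopen(url).read().
```

The same pattern (base endpoint plus encoded query parameters) applies to the other REST services listed in the entry, each with its own parameter set.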
69
Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P, Bhagat J, Belhajjame K, Bacall F, Hardisty A, Nieva de la Hidalga A, Balcazar Vargas MP, Sufi S, Goble C. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 2013; 41:W557-61. [PMID: 23640334] [PMCID: PMC3692062] [DOI: 10.1093/nar/gkt328]
Abstract
The Taverna workflow tool suite (http://www.taverna.org.uk) is designed to combine distributed Web Services and/or local tools into complex analysis pipelines. These pipelines can be executed on local desktop machines or through larger infrastructure (such as supercomputers, Grids or cloud environments), using the Taverna Server. In bioinformatics, Taverna workflows are typically used in the areas of high-throughput omics analyses (for example, proteomics or transcriptomics), or for evidence gathering methods involving text mining or data mining. Through Taverna, scientists have access to several thousand different tools and resources that are freely available from a large range of life science institutions. Once constructed, the workflows are reusable, executable bioinformatics protocols that can be shared, reused and repurposed. A repository of public workflows is available at http://www.myexperiment.org. This article provides an update to the Taverna tool suite, highlighting new features and developments in the workbench and the Taverna Server.
70
Pérez M, Berlanga R, Sanz I, Aramburu MJ. BioUSeR: a semantic-based tool for retrieving Life Science web resources driven by text-rich user requirements. J Biomed Semantics 2013; 4:12. [PMID: 23635042] [PMCID: PMC3698192] [DOI: 10.1186/2041-1480-4-12]
Abstract
Background Open metadata registries are a fundamental tool for researchers in the Life Sciences trying to locate resources. While most current registries assume that resources are annotated with well-structured metadata, evidence shows that most resource annotations simply consist of informal free text. This reality must be taken into account in order to develop effective techniques for resource discovery in the Life Sciences. Results BioUSeR is a semantic-based tool aimed at retrieving Life Sciences resources described in free text. The retrieval process is driven by the user requirements, which consist of a target task and a set of facets of interest, both expressed in free text. BioUSeR is able to effectively exploit the available textual descriptions to find relevant resources by using semantic-aware techniques. Conclusions BioUSeR overcomes the limitations of current registries thanks to: (i) rich specification of user information needs, (ii) use of semantics to manage textual descriptions, and (iii) retrieval and ranking of resources based on user requirements.
Affiliation(s)
- María Pérez
- Department of Computer Science and Engineering, Universitat Jaume I, Castellón, Spain.
71
Wollbrett J, Larmande P, de Lamotte F, Ruiz M. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases. BMC Bioinformatics 2013; 14:126. [PMID: 23586394] [PMCID: PMC3680174] [DOI: 10.1186/1471-2105-14-126]
Abstract
Background In recent years, a large amount of “-omics” data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. Results We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. Conclusions BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic.
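To make concrete the kind of SPARQL query that BioSemantic's RDF views enable over relational plant databases, here is a hedged sketch. The prefix, class, and predicate names (`ex:Gene`, `ex:name`, `ex:locatedOn`) are hypothetical placeholders, not BioSemantic's actual vocabulary, and the query is built as a plain string rather than executed.

```python
# Illustrative generation of a SPARQL query over a hypothetical RDF
# view of a plant genomics database. Vocabulary names are invented;
# BioSemantic generates such queries automatically from annotated
# relational schemas.

def gene_query(gene_name: str) -> str:
    """Return a SPARQL query selecting a gene and its chromosome by name."""
    return f"""\
PREFIX ex: <http://example.org/plant#>
SELECT ?gene ?chromosome WHERE {{
  ?gene a ex:Gene ;
        ex:name "{gene_name}" ;
        ex:locatedOn ?chromosome .
}}"""

query = gene_query("Os01g0100100")
print(query)
```

In BioSemantic's workflow such queries are wrapped into Semantic Web Services, so callers never write SPARQL by hand; the sketch only shows what the generated query text looks like.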
72
Vaughan LK, Srinivasasainagendra V. Where in the genome are we? A cautionary tale of database use in genomics research. Front Genet 2013; 4:38. [PMID: 23519237] [PMCID: PMC3604632] [DOI: 10.3389/fgene.2013.00038]
Abstract
With the advent of high-throughput genomic technologies, the volume of available data is now staggering. In addition, databases that provide resources to annotate, translate, and connect biological data have grown exponentially in content and use. The availability of such data emphasizes the importance of bioinformatics and computational biology in genomics research and has led to the development of thousands of tools to integrate and utilize these resources. When utilizing such resources, however, researchers often overlook the principles of reproducible research. In this manuscript we provide selected case studies illustrating issues that may arise while working with genes and genetic polymorphisms. These case studies illustrate potential sources of error that can be introduced if the practices of reproducible research are not employed and non-concurrent databases are used. We also show examples of a lack of transparency regarding these databases in popular bioinformatics tools. These examples highlight that resources are constantly evolving, and in order to produce reproducible results, researchers should be aware of and connected to the correct release of the data, particularly when implementing computational tools.
Affiliation(s)
- Laura K Vaughan
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
73
74
Aranguren ME, Fernández-Breis JT, Mungall C, Antezana E, González AR, Wilkinson MD. OPPL-Galaxy, a Galaxy tool for enhancing ontology exploitation as part of bioinformatics workflows. J Biomed Semantics 2013; 4:2. [PMID: 23286517] [PMCID: PMC3643862] [DOI: 10.1186/2041-1480-4-2]
Abstract
BACKGROUND Biomedical ontologies are key elements for building up the Life Sciences Semantic Web. Reusing and building biomedical ontologies requires flexible and versatile tools to manipulate them efficiently, in particular for enriching their axiomatic content. The Ontology Pre Processor Language (OPPL) is an OWL-based language for automating the changes to be performed in an ontology. OPPL augments the ontologists' toolbox by providing a more efficient, and less error-prone, mechanism for enriching a biomedical ontology than that obtained by a manual treatment. RESULTS We present OPPL-Galaxy, a wrapper for using OPPL within Galaxy. The functionality delivered by OPPL (i.e. automated ontology manipulation) can be combined with the tools and workflows devised within the Galaxy framework, resulting in an enhancement of OPPL. Use cases are provided in order to demonstrate OPPL-Galaxy's capability for enriching, modifying and querying biomedical ontologies. CONCLUSIONS Coupling OPPL-Galaxy with other bioinformatics tools of the Galaxy framework results in a system that is more than the sum of its parts. OPPL-Galaxy opens a new dimension of analyses and exploitation of biomedical ontologies, including automated reasoning, paving the way towards advanced biological data analyses.
Affiliation(s)
- Mikel Egaña Aranguren
- Ontology Engineering Group, School of Computer Science, Technical University of Madrid (UPM), Boadilla del Monte, 28660, Spain
- Biological Informatics Group, Centre for Plant Biotechnology and Genomics (CBGP), Technical University of Madrid (UPM), Pozuelo de Alarcón, 28223, Spain
- Chris Mungall
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, US
- Erick Antezana
- Department of Biology, Norwegian University of Science and Technology (NTNU), Høgskoleringen 5, Trondheim, N-7491, Norway
- Alejandro Rodríguez González
- Biological Informatics Group, Centre for Plant Biotechnology and Genomics (CBGP), Technical University of Madrid (UPM), Pozuelo de Alarcón, 28223, Spain
- Mark D Wilkinson
- Biological Informatics Group, Centre for Plant Biotechnology and Genomics (CBGP), Technical University of Madrid (UPM), Pozuelo de Alarcón, 28223, Spain
75
Jimenez RC, Corpas M. Bioinformatics workflows and web services in systems biology made easy for experimentalists. Methods Mol Biol 2013; 1021:299-310. [PMID: 23715992] [DOI: 10.1007/978-1-62703-450-0_16]
Abstract
Workflows are useful for performing data analysis and integration in systems biology. Workflow management systems can help users create workflows without any previous knowledge of programming or web services. However, the computational skills required to build such workflows are usually above the level most biological experimentalists are comfortable with. In this chapter we introduce workflow management systems that reuse existing workflows instead of creating them, making it easier for experimentalists to perform computational tasks.
Affiliation(s)
- Rafael C Jimenez
- EMBL Outstation-European Bioinformatics Institute, Cambridge, UK
76
Bird CL, Willoughby C, Frey JG. Laboratory notebooks in the digital era: the role of ELNs in record keeping for chemistry and other sciences. Chem Soc Rev 2013; 42:8157-75. [DOI: 10.1039/c3cs60122f]
77
Jiménez RC, Vizcaíno JA. Proteomics data exchange and storage: the need for common standards and public repositories. Methods Mol Biol 2013; 1007:317-333. [PMID: 23666733] [DOI: 10.1007/978-1-62703-392-3_14]
Abstract
Both the existence of data standards and public databases or repositories have been key factors behind the development of the existing "omics" approaches. In this book chapter we first review the main existing mass spectrometry (MS)-based proteomics resources: PRIDE, PeptideAtlas, GPMDB, and Tranche. Second, we report on the current status of the different proteomics data standards developed by the Proteomics Standards Initiative (PSI): the formats mzML, mzIdentML, mzQuantML, TraML, and PSI-MI XML are then reviewed. Finally, we present an easy way to query and access MS proteomics data in the PRIDE database, as a representative of the existing repositories, using the workflow management system (WMS) tool Taverna. Two different publicly available workflows are explained and described.
78
Beck T, Free RC, Thorisson GA, Brookes AJ. Semantically enabling a genome-wide association study database. J Biomed Semantics 2012; 3:9. [PMID: 23244533] [PMCID: PMC3579732] [DOI: 10.1186/2041-1480-3-9]
Abstract
Background The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central – a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data. Results A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications. Conclusions We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
Affiliation(s)
- Tim Beck
- Department of Genetics, University of Leicester, University Road, Leicester, UK.
79
Application of an integrative computational framework in trancriptomic data of atherosclerotic mice suggests numerous molecular players. Adv Bioinformatics 2012; 2012:453513. [PMID: 23193398] [PMCID: PMC3502768] [DOI: 10.1155/2012/453513]
Abstract
Atherosclerosis is a multifactorial disease involving a lot of genes and proteins recruited throughout its manifestation. The present study aims to exploit bioinformatic tools in order to analyze microarray data of atherosclerotic aortic lesions of ApoE knockout mice, a model widely used in atherosclerosis research. In particular, a dynamic analysis was performed among young and aged animals, resulting in a list of 852 significantly altered genes. Pathway analysis indicated alterations in critical cellular processes related to cell communication and signal transduction, immune response, lipid transport, and metabolism. Cluster analysis partitioned the significantly differentiated genes in three major clusters of similar expression profile. Promoter analysis applied to functional related groups of the same cluster revealed shared putative cis-elements potentially contributing to a common regulatory mechanism. Finally, by reverse engineering the functional relevance of differentially expressed genes with specific cellular pathways, putative genes acting as hubs, were identified, linking functionally disparate cellular processes in the context of traditional molecular description.
80
Rodrigues MR, Magalhães WCS, Machado M, Tarazona-Santos E. A graph-based approach for designing extensible pipelines. BMC Bioinformatics 2012; 13:163. [PMID: 22788675] [PMCID: PMC3496580] [DOI: 10.1186/1471-2105-13-163]
Abstract
Background In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline is expanding because the incorporation of new tools is often error prone and time consuming. Current approaches to pipeline development such as workflow management systems focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism in the pipeline composition is necessary when each execution requires a different combination of steps. Results We propose a graph-based approach to implement extensible and low-maintenance pipelines that is suitable for pipeline applications with multiple functionalities that require different combinations of steps in each execution. Here pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where the use of multiple software tools is necessary to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms. Conclusions Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
Affiliation(s)
- Maíra R Rodrigues
- Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil.
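The graph model in the entry above (formats as nodes, conversion tools as edges, pipelines as paths) can be sketched in a few lines. This is an illustrative breadth-first search over an invented format graph, not the paper's Java implementation; the tool and format names are hypothetical.

```python
# Sketch of composing a conversion pipeline as a path through a
# directed format graph: nodes are file formats, edges are tools.
# Tool and format names are made up for illustration.
from collections import deque

# (source_format, target_format) -> hypothetical tool name
TOOLS = {
    ("vcf", "ped"): "vcf2ped",
    ("ped", "bed"): "ped2bed",
    ("vcf", "csv"): "vcf2csv",
}

def build_pipeline(src: str, dst: str):
    """BFS over the format graph; returns the ordered list of tools
    to run, or None if no conversion path exists."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        fmt, path = queue.popleft()
        if fmt == dst:
            return path
        for (a, b), tool in TOOLS.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, path + [tool]))
    return None

print(build_pipeline("vcf", "bed"))  # ['vcf2ped', 'ped2bed']
```

Because the search composes pipelines on demand, adding a new tool is just adding an edge to the table, which is the extensibility property the paper emphasises.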
81
Abstract
INTRODUCTION The development and use of web tools in chemistry has accumulated more than 15 years of history already. Powered by the advances in the Internet technologies, the current generation of web systems are starting to expand into areas, traditional for desktop applications. The web platforms integrate data storage, cheminformatics and data analysis tools. The ease of use and the collaborative potential of the web is compelling, despite the challenges. AREAS COVERED The topic of this review is a set of recently published web tools that facilitate predictive toxicology model building. The focus is on software platforms, offering web access to chemical structure-based methods, although some of the frameworks could also provide bioinformatics or hybrid data analysis functionalities. A number of historical and current developments are cited. In order to provide comparable assessment, the following characteristics are considered: support for workflows, descriptor calculations, visualization, modeling algorithms, data management and data sharing capabilities, availability of GUI or programmatic access and implementation details. EXPERT OPINION The success of the Web is largely due to its highly decentralized, yet sufficiently interoperable model for information access. The expected future convergence between cheminformatics and bioinformatics databases provides new challenges toward management and analysis of large data sets. The web tools in predictive toxicology will likely continue to evolve toward the right mix of flexibility, performance, scalability, interoperability, sets of unique features offered, friendly user interfaces, programmatic access for advanced users, platform independence, results reproducibility, curation and crowdsourcing utilities, collaborative sharing and secure access.
82
Abouelhoda M, Issa SA, Ghanem M. Tavaxy: integrating Taverna and Galaxy workflows with cloud computing support. BMC Bioinformatics 2012; 13:77. [PMID: 22559942] [PMCID: PMC3583125] [DOI: 10.1186/1471-2105-13-77]
Abstract
BACKGROUND Over the past decade, the workflow system paradigm has evolved into an efficient and user-friendly approach for developing complex bioinformatics applications. Two popular workflow systems that have gained acceptance in the bioinformatics community are Taverna and Galaxy. Each system has a large user base and supports an ever-growing repository of application workflows. However, workflows developed for one system cannot easily be imported into and executed on the other. This lack of interoperability stems from differences in the two systems' models of computation, workflow languages, and architectures; it limits the sharing of workflows between the user communities and leads to duplicated development effort. RESULTS In this paper, we present Tavaxy, a stand-alone system for creating and executing workflows based on an extensible set of reusable workflow patterns. Tavaxy offers a set of new features that simplify and enhance the development of sequence analysis applications: it allows the integration of existing Taverna and Galaxy workflows in a single environment, and it supports the use of cloud computing capabilities. The integration of existing Taverna and Galaxy workflows is supported seamlessly at both the run-time and design-time levels, based on the concepts of hierarchical workflows and workflow patterns. The use of cloud computing in Tavaxy is flexible: users can either instantiate the whole system on the cloud or delegate the execution of certain sub-workflows to the cloud infrastructure. CONCLUSIONS Tavaxy shortens the workflow development cycle by introducing workflow patterns that simplify workflow creation. It enables the re-use and integration of existing (sub-)workflows from Taverna and Galaxy, and allows the creation of hybrid workflows. Its additional features exploit recent advances in high-performance cloud computing to cope with increasing data sizes and analysis complexity. The system can be accessed either through a cloud-enabled web interface or downloaded and installed to run within the user's local environment. All resources related to Tavaxy are available at http://www.tavaxy.org.
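The "reusable workflow patterns" idea at the heart of Tavaxy can be sketched in a few lines. The pattern names and toy tools below are illustrative inventions, not Tavaxy's actual API: a workflow is composed from patterns such as sequence (pipe one step into the next) and parallel split (fan the same input out to several branches).

```python
# Hypothetical sketch of composition via workflow patterns, in the
# spirit of Tavaxy. None of these names are Tavaxy's real interface.

def sequence(*steps):
    """Pattern: run steps one after another, piping each output on."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run

def parallel_split(*branches):
    """Pattern: feed the same input to several independent branches."""
    def run(data):
        return [branch(data) for branch in branches]
    return run

# Toy "tools" standing in for Taverna/Galaxy sub-workflows.
reverse_complement = sequence(
    lambda seq: seq[::-1],                                   # reverse
    lambda seq: seq.translate(str.maketrans("ACGT", "TGCA")),  # complement
)

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

# Compose the patterns into a small hybrid analysis.
analysis = parallel_split(reverse_complement, gc_content)
result = analysis("ATGC")   # one input, two branch outputs
```

Because each pattern returns an ordinary callable, patterns nest freely, which is the property that lets hierarchical (sub-)workflows from different systems be combined.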
|
83
|
Rybiński M, Lula M, Banasik P, Lasota S, Gambin A. Tav4SB: integrating tools for analysis of kinetic models of biological systems. BMC SYSTEMS BIOLOGY 2012; 6:25. [PMID: 22480273 PMCID: PMC3495710 DOI: 10.1186/1752-0509-6-25] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/15/2011] [Accepted: 04/05/2012] [Indexed: 11/25/2022]
Abstract
Background Progress in the modeling of biological systems relies strongly on the availability of specialized computer-aided tools. To that end, the Taverna Workbench eases the integration of software tools for life science research and provides a common workflow-based framework for computational experiments in biology. Results The Taverna services for Systems Biology (Tav4SB) project provides a set of new Web service operations that extend the functionality of the Taverna Workbench in the domain of systems biology. Tav4SB operations allow the user to perform numerical simulation or model checking of, respectively, the deterministic or stochastic semantics of biological models. On top of this functionality, Tav4SB enables the construction of high-level experiments; as an illustration of the possibilities offered by our project, we apply multi-parameter sensitivity analysis. A flexible plotting operation is also provided to visualize the results of model analysis. Tav4SB operations are executed in a simple grid environment that integrates heterogeneous software such as Mathematica, PRISM and the SBML ODE Solver. The user guide, contact information, full documentation of the available Web service operations, workflows and other additional resources can be found on the Tav4SB project's Web page: http://bioputer.mimuw.edu.pl/tav4sb/. Conclusions The Tav4SB Web service provides a set of integrated tools in a domain for which Web-based applications are still not as widely available as in other areas of computational biology. Moreover, it extends the dedicated hardware base for the computationally expensive task of simulating cellular models. Finally, it promotes the standardization of models and experiments, as well as the accessibility and usability of remote services.
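The "numerical simulation of deterministic semantics" that Tav4SB exposes as a Web service operation amounts to integrating a model's ODEs. The generic forward-Euler sketch below, for a toy decay reaction A -> B with rate k, only illustrates that idea; the function names, the integrator choice and the model are assumptions, not Tav4SB code.

```python
# Forward-Euler integration of dy/dt = rates(y) for a toy kinetic
# model; an illustration of deterministic simulation, not Tav4SB code.

def simulate(rates, y0, dt=0.001, t_end=1.0):
    """Integrate dy/dt = rates(y) from y0 to t_end with forward Euler."""
    y = list(y0)
    t = 0.0
    while t < t_end:
        dy = rates(y)
        y = [yi + dt * dyi for yi, dyi in zip(y, dy)]
        t += dt
    return y

k = 1.0

def decay(y):
    # A -> B with rate k: d[A]/dt = -k[A], d[B]/dt = +k[A]
    return [-k * y[0], k * y[0]]

a, b = simulate(decay, [1.0, 0.0])
# Euler conserves total mass here: a + b stays (numerically) at 1,
# and a approaches exp(-1) ~ 0.37 at t = 1.
```

A real service would accept the model in SBML and dispatch to a solver such as the SBML ODE Solver mentioned in the abstract; the hand-written rate function above merely stands in for that step.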
Affiliation(s)
- Mikołaj Rybiński
- Institute of Informatics, University of Warsaw, ul. Banacha 2, 02-097, Warsaw, Poland.
|
84
|
Parr CS, Guralnick R, Cellinese N, Page RD. Evolutionary informatics: unifying knowledge about the diversity of life. Trends Ecol Evol 2012; 27:94-103. [PMID: 22154516 DOI: 10.1016/j.tree.2011.11.001] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2011] [Revised: 10/31/2011] [Accepted: 11/01/2011] [Indexed: 01/23/2023]
|
85
|
Michener WK, Jones MB. Ecoinformatics: supporting ecology as a data-intensive science. Trends Ecol Evol 2012; 27:85-93. [PMID: 22240191 DOI: 10.1016/j.tree.2011.11.016] [Citation(s) in RCA: 146] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2011] [Revised: 11/29/2011] [Accepted: 11/29/2011] [Indexed: 11/30/2022]
Abstract
Ecology is evolving rapidly and increasingly changing into a more open, accountable, interdisciplinary, collaborative and data-intensive science. Discovering, integrating and analyzing massive amounts of heterogeneous data are central to ecology as researchers address complex questions at scales from the gene to the biosphere. Ecoinformatics offers tools and approaches for managing ecological data and transforming the data into information and knowledge. Here, we review the state-of-the-art and recent advances in ecoinformatics that can benefit ecologists and environmental scientists as they tackle increasingly challenging questions that require voluminous amounts of data across disciplines and scales of space and time. We also highlight the challenges and opportunities that remain.
Affiliation(s)
- William K Michener
- University Libraries, University of New Mexico, Albuquerque, NM 87131, USA.
|
86
|
Hallinan J. Data mining for microbiologists. J Microbiol Methods 2012. [DOI: 10.1016/b978-0-08-099387-4.00002-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/12/2023]
|
87
|
Willighagen EL, Jeliazkova N, Hardy B, Grafström RC, Spjuth O. Computational toxicology using the OpenTox application programming interface and Bioclipse. BMC Res Notes 2011; 4:487. [PMID: 22075173 PMCID: PMC3264531 DOI: 10.1186/1756-0500-4-487] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2011] [Accepted: 11/10/2011] [Indexed: 11/10/2022] Open
Abstract
Background Toxicity is a complex phenomenon involving potential adverse effects on a range of biological functions. Predicting toxicity involves using a combination of experimental data (endpoints) and computational methods to generate a set of predictive models. Such models rely strongly on the ability to integrate information from many sources. The required integration of biological and chemical information sources requires, however, a common language to express our knowledge ontologically, and interoperating services to build reliable predictive toxicology applications. Findings This article describes progress in extending the integrative bio- and cheminformatics platform Bioclipse to interoperate with OpenTox, a semantic web framework that supports open data exchange and toxicology model building. The Bioclipse workbench environment enables functionality from OpenTox web services and easy access to OpenTox resources for evaluating the toxicity properties of query molecules. Relevant cases and interfaces based on ten neurotoxins are described to demonstrate the capabilities provided to the user. The integration takes advantage of semantic web technologies, thereby providing an open and simplifying communication standard. Additionally, the use of ontologies ensures proper interoperation and reliable integration of toxicity information from both experimental and computational sources. Conclusions A novel computational toxicity assessment platform was generated by integrating two open science platforms related to toxicology: Bioclipse, which combines a rich scriptable and graphical workbench environment for the integration of diverse sets of information sources, and OpenTox, a platform for interoperable toxicology data and computational services. The combination provides improved reliability and operability for handling large data sets through the use of the Open Standards from the OpenTox Application Programming Interface. This enables simultaneous access to a variety of distributed predictive toxicology databases, as well as algorithm and model resources, with the Bioclipse workbench handling the technical layers.
Affiliation(s)
- Egon L Willighagen
- Department of Pharmaceutical Bioinformatics, Uppsala University, Uppsala, Sweden.
|
88
|
Romano P, Giugno R, Pulvirenti A. Tools and collaborative environments for bioinformatics research. Brief Bioinform 2011; 12:549-61. [PMID: 21984743 PMCID: PMC3220874 DOI: 10.1093/bib/bbr055] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Open
Abstract
Advanced research requires intensive interaction among a multitude of actors, often possessing different expertise and usually working at a distance from each other. The field of collaborative research aims to establish suitable models and technologies to properly support these interactions. In this article, we first present the reasons for Bioinformatics' interest in this context, suggesting some research domains that could benefit from collaborative research. We then review the principles and some of the most relevant applications of social networking, with special attention to networks supporting scientific collaboration, and highlight some critical issues, such as the identification of users and the standardization of formats. We then introduce some systems for collaborative document creation, including wiki systems and tools for ontology development, and review some of the most interesting biological wikis. We also review the principles of Collaborative Development Environments for software and show some examples in Bioinformatics. Finally, we present the principles and some examples of Learning Management Systems. In conclusion, we outline some of the goals to be achieved in the short term for the exploitation of these technologies.
Affiliation(s)
- Paolo Romano
- Bioinformatics, National Cancer Research Institute (IST), Genoa, Italy.
|
89
|
Splendiani A, Gündel M, Austyn JM, Cavalieri D, Scognamiglio C, Brandizi M. Knowledge sharing and collaboration in translational research, and the DC-THERA Directory. Brief Bioinform 2011; 12:562-75. [PMID: 21969471 PMCID: PMC3220873 DOI: 10.1093/bib/bbr051] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
Biomedical research relies increasingly on large collections of data sets and knowledge whose generation, representation and analysis often require large collaborative and interdisciplinary efforts. This dimension of ‘big data’ research calls for the development of computational tools to manage such a vast amount of data, as well as tools that can improve communication and access to information from collaborating researchers and from the wider community. Whenever research projects have a defined temporal scope, an additional issue of data management arises, namely how the knowledge generated within the project can be made available beyond its boundaries and life-time. DC-THERA is a European ‘Network of Excellence’ (NoE) that spawned a very large collaborative and interdisciplinary research community, focusing on the development of novel immunotherapies derived from fundamental research in dendritic cell immunobiology. In this article we introduce the DC-THERA Directory, which is an information system designed to support knowledge management for this research community and beyond. We present how the use of metadata and Semantic Web technologies can effectively help to organize the knowledge generated by modern collaborative research, how these technologies can enable effective data management solutions during and beyond the project lifecycle, and how resources such as the DC-THERA Directory fit into the larger context of e-science.
|
90
|
Mishima H, Sasaki K, Tanaka M, Tatebe O, Yoshiura KI. Agile parallel bioinformatics workflow management using Pwrake. BMC Res Notes 2011; 4:331. [PMID: 21899774 PMCID: PMC3180464 DOI: 10.1186/1756-0500-4-331] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2011] [Accepted: 09/08/2011] [Indexed: 12/20/2022] Open
Abstract
Background In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to fulfil the demand for workflow management. However, such systems tend to be too heavyweight for actual bioinformatics practice. We observe that the quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment, are often prioritized in scientific workflow management. These features have a greater affinity with the agile software development method, which iterates through development phases after trial and error. Here, we show the application of the scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, whose flexibility has been demonstrated in the astronomy domain. We therefore hypothesized that Pwrake would also have advantages for actual bioinformatics workflows. Findings We implemented Pwrake workflows to process next-generation sequencing data using the Genome Analysis Toolkit (GATK) and Dindel. The GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that, in practice, scientific workflow development iterates over two phases: the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two developmental phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate the modularity of the GATK and Dindel workflows. Conclusions Pwrake enables agile management of scientific workflows in the bioinformatics domain. Its internal domain-specific language design, built on Ruby, gives rakefiles the flexibility needed for writing scientific workflows. Furthermore, the readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows.
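Real Pwrake workflows are Ruby rakefiles; the Python toy below only mirrors the underlying Rake model the abstract describes (tasks declaring prerequisites, with independent prerequisites eligible for parallel execution) and invents all task names.

```python
# Hypothetical sketch of the Rake-style dependency model that Pwrake
# parallelises. Not Pwrake code; all task names are invented.

from concurrent.futures import ThreadPoolExecutor

tasks = {}  # task name -> (prerequisites, action)

def task(name, prereqs, action):
    tasks[name] = (prereqs, action)

def invoke(name, done=None):
    """Run `name` after recursively running all its prerequisites."""
    done = done if done is not None else set()
    if name in done:
        return
    prereqs, action = tasks[name]
    # Independent prerequisites could run concurrently, as Pwrake does.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda p: invoke(p, done), prereqs))
    action()
    done.add(name)

log = []
task("align", [], lambda: log.append("align"))
task("call_variants", ["align"], lambda: log.append("call_variants"))
task("report", ["call_variants"], lambda: log.append("report"))
invoke("report")   # runs align, then call_variants, then report
```

The point of the rakefile design is that the same declarations serve both the "workflow definition" and the "parameter adjustment" phases: changing an action's parameters leaves the dependency graph untouched.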
Affiliation(s)
- Hiroyuki Mishima
- Department of Human Genetics, Nagasaki University Graduate School of Biomedical Sciences, 1-12-4 Sakamoto, Nagasaki, Nagasaki, Japan.
|
91
|
Jagla B, Wiswedel B, Coppée JY. Extending KNIME for next-generation sequencing data analysis. Bioinformatics 2011; 27:2907-9. [PMID: 21873641 DOI: 10.1093/bioinformatics/btr478] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
SUMMARY KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis and exploration platform. We present here new functionality and workflows that open the door to performing next-generation sequencing analysis using the KNIME framework. AVAILABILITY All sources and compiled code are available via the KNIME update mechanism. Example workflows and descriptions are available through http://tech.knime.org/community/next-generation-sequencing. CONTACT bernd.jagla@pasteur.fr SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bernd Jagla
- Departement Génomes et Génétique, Institut Pasteur, Plate-forme Transcriptome et Epigénome, 25 Rue du Docteur Roux, F-75015 Paris, France.
|
92
|
Affiliation(s)
- Jason R Swedlow
- Wellcome Trust Centre for Gene Regulation and Expression, College of Life Sciences, University of Dundee, Dundee, Scotland, UK.
|
93
|
|
94
|
Lushbough CM, Jennewein DM, Brendel VP. The BioExtract Server: a web-based bioinformatic workflow platform. Nucleic Acids Res 2011; 39:W528-32. [PMID: 21546552 PMCID: PMC3125737 DOI: 10.1093/nar/gkr286] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
The BioExtract Server (bioextract.org) is an open, web-based system designed to aid researchers in the analysis of genomic data by providing a platform for the creation of bioinformatic workflows. Scientific workflows are created within the system by recording tasks performed by the user. These tasks may include querying multiple, distributed data sources, saving query results as searchable data extracts, and executing local and web-accessible analytic tools. The series of recorded tasks can then be saved as a reproducible, sharable workflow available for subsequent execution with the original or modified inputs and parameter settings. Integrated data resources include interfaces to the National Center for Biotechnology Information (NCBI) nucleotide and protein databases, the European Molecular Biology Laboratory (EMBL-Bank) non-redundant nucleotide database, the Universal Protein Resource (UniProt), and the UniProt Reference Clusters (UniRef) database. The system offers access to numerous preinstalled, curated analytic tools and also provides researchers with the option of selecting computational tools from a large list of web services including the European Molecular Biology Open Software Suite (EMBOSS), BioMoby, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). The system further allows users to integrate local command line tools residing on their own computers through a client-side Java applet.
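The record-and-replay model described above (tasks recorded as the user performs them, then saved as a reproducible workflow that can be re-executed with modified inputs) can be sketched as follows; the class and step names are hypothetical, not BioExtract's actual interface.

```python
# Toy sketch of BioExtract-style record-and-replay workflows.
# Illustration only; names do not reflect BioExtract's real API.

class RecordedWorkflow:
    def __init__(self):
        self.steps = []          # recorded (name, function) pairs

    def record(self, name, fn):
        """Record a task (e.g. a query or an analytic tool run)."""
        self.steps.append((name, fn))
        return self

    def replay(self, data):
        """Re-execute the recorded tasks on (possibly new) input."""
        for _, fn in self.steps:
            data = fn(data)
        return data

wf = RecordedWorkflow()
# Stand-ins for "query a data source" and "filter the extract".
wf.record("normalise", lambda ids: [i.upper() for i in ids])
wf.record("keep_mrna", lambda ids: [i for i in ids if i.startswith("NM_")])

first = wf.replay(["nm_0001", "xr_0002"])
again = wf.replay(["nm_0003"])   # same saved workflow, new input
```

Separating the recorded step list from any particular input is what makes the saved workflow shareable and re-runnable with modified parameters.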
Affiliation(s)
- Carol M Lushbough
- Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA.
|
95
|
Webb AJ, Thorisson GA, Brookes AJ. An informatics project and online "Knowledge Centre" supporting modern genotype-to-phenotype research. Hum Mutat 2011; 32:543-50. [PMID: 21438073 DOI: 10.1002/humu.21469] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2011] [Accepted: 01/28/2011] [Indexed: 11/06/2022]
Abstract
Explosive growth in the generation of genotype-to-phenotype (G2P) data necessitates a concerted effort to tackle the logistical and informatics challenges this presents. The GEN2PHEN Project represents one such effort, with a broad strategy of uniting disparate G2P resources into a hybrid centralized-federated network. This is achieved through a holistic strategy focussed on three overlapping areas: data input standards and pipelines through which to submit and collect data (data in); federated, independent, extendable, yet interoperable database platforms on which to store and curate widely diverse datasets (data storage); and data formats and mechanisms with which to exchange, combine, and extract data (data exchange and output). To fully leverage this data network, we have constructed the "G2P Knowledge Centre" (http://www.gen2phen.org). This central platform provides holistic searching of the G2P data domain allied with facilities for data annotation and user feedback, access to extensive G2P and informatics resources, and tools for constructing online working communities centered on the G2P domain. Through the efforts of GEN2PHEN, and through combining data with broader community-derived knowledge, the Knowledge Centre opens up exciting possibilities for organizing, integrating, sharing, and interpreting new waves of G2P data in a collaborative fashion.
Affiliation(s)
- Adam J Webb
- Department of Genetics, University of Leicester, University Road, Leicester, United Kingdom.
|
96
|
Strijkers R, Cushing R, Vasyunin D, de Laat C, Belloum AS, Meijer R. Toward Executable Scientific Publications. Procedia Computer Science 2011. [DOI: 10.1016/j.procs.2011.04.074] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
97
|
Vroling B, Sanders M, Baakman C, Borrmann A, Verhoeven S, Klomp J, Oliveira L, de Vlieg J, Vriend G. GPCRDB: information system for G protein-coupled receptors. Nucleic Acids Res 2011; 39:D309-19. [PMID: 21045054 PMCID: PMC3013641 DOI: 10.1093/nar/gkq1009] [Citation(s) in RCA: 115] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2010] [Accepted: 10/07/2010] [Indexed: 11/14/2022] Open
Abstract
The GPCRDB is a Molecular Class-Specific Information System (MCSIS) that collects, combines, validates and disseminates large amounts of heterogeneous data on G protein-coupled receptors (GPCRs). The GPCRDB contains experimental data on sequences, ligand-binding constants, mutations and oligomers, as well as many different types of computationally derived data such as multiple sequence alignments and homology models. The GPCRDB provides access to the data via a number of different access methods. It offers visualization and analysis tools, and a number of query systems. The data is updated automatically on a monthly basis. The GPCRDB can be found online at http://www.gpcr.org/7tm/.
Affiliation(s)
- Bas Vroling
- CMBI, NCMLS, Radboud University Nijmegen Medical Centre, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, Department of Molecular Design and Informatics, MSD, Molenstraat 110, 5340 BH, Oss, The Netherlands and Department of Biophysics, Escola Paulista de Medicina, Federal University of São Paulo, São Paulo 04023-062, Brazil
|
98
|
|
99
|
Möller S, Krabbenhöft HN, Tille A, Paleino D, Williams A, Wolstencroft K, Goble C, Holland R, Belhachemi D, Plessy C. Community-driven computational biology with Debian Linux. BMC Bioinformatics 2010; 11 Suppl 12:S5. [PMID: 21210984 PMCID: PMC3040531 DOI: 10.1186/1471-2105-11-s12-s5] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Background The Open Source movement and its technologies are popular in the bioinformatics community because they provide freely available tools and resources for research. In order to feed the steady demand for updates on software and associated data, a service infrastructure is required for sharing and providing these tools to heterogeneous computing environments. Results The Debian Med initiative provides ready and coherent software packages for medical informatics and bioinformatics. These packages can be used together in Taverna workflows via the UseCase plugin to manage execution on local or remote machines. If such packages are available in cloud computing environments, the underlying hardware and the analysis pipelines can be shared along with the software. Conclusions Debian Med closes the gap between developers and users. It provides a simple method for offering new releases of software and data resources, thus provisioning a local infrastructure for computational biology. For geographically distributed teams it can ensure they are working on the same versions of tools, in the same conditions. This contributes to the world-wide networking of researchers.
Affiliation(s)
- Steffen Möller
- University Clinics of Schleswig-Holstein, Department of Dermatology, formerly University of Lübeck, Institute for Neuro- and Bioinformatics, Ratzeburger Allee 160, 23530 Lübeck, Germany.
|
100
|
Wilkinson MD, McCarthy L, Vandervalk B, Withers D, Kawas E, Samadian S. SADI, SHARE, and the in silico scientific method. BMC Bioinformatics 2010; 11 Suppl 12:S7. [PMID: 21210986 PMCID: PMC3040533 DOI: 10.1186/1471-2105-11-s12-s7] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Background The emergence and uptake of Semantic Web technologies by the Life Sciences provides exciting opportunities for exploring novel ways to conduct in silico science. Web Service Workflows are already becoming first-class objects in “the new way”, and serve as explicit, shareable, referenceable representations of how an experiment was done. In turn, Semantic Web Service projects aim to facilitate workflow construction by biological domain-experts such that workflows can be edited, re-purposed, and re-published by non-informaticians. However the aspects of the scientific method relating to explicit discourse, disagreement, and hypothesis generation have remained relatively impervious to new technologies. Results Here we present SADI and SHARE - a novel Semantic Web Service framework, and a reference implementation of its client libraries. Together, SADI and SHARE allow the semi- or fully-automatic discovery and pipelining of Semantic Web Services in response to ad hoc user queries. Conclusions The semantic behaviours exhibited by SADI and SHARE extend the functionalities provided by Description Logic Reasoners such that novel assertions can be automatically added to a data-set without logical reasoning, but rather by analytical or annotative services. This behaviour might be applied to achieve the “semantification” of those aspects of the in silico scientific method that are not yet supported by Semantic Web technologies. We support this suggestion using an example in the clinical research space.
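The automatic discovery and pipelining that SADI and SHARE perform can be caricatured as a search over services annotated with semantic input and output types: given what you have and what you want, a chain of services is found and executed. The registry, type names and services below are invented for illustration and do not reflect SADI's real interfaces, which are built on RDF and OWL.

```python
# Toy illustration of type-driven service chaining in the spirit of
# SADI/SHARE. All types and services here are invented.

registry = [
    # (consumes, produces, service)
    ("GeneID", "ProteinID", lambda g: f"prot-of-{g}"),
    ("ProteinID", "Structure", lambda p: f"structure-of-{p}"),
]

def discover_chain(have, want):
    """Breadth-first search for a service chain from `have` to `want`."""
    frontier = [(have, [])]
    seen = {have}
    while frontier:
        current, chain = frontier.pop(0)
        if current == want:
            return chain
        for consumes, produces, service in registry:
            if consumes == current and produces not in seen:
                seen.add(produces)
                frontier.append((produces, chain + [service]))
    return None   # no chain of registered services connects the types

def run_chain(chain, value):
    for service in chain:
        value = service(value)
    return value

chain = discover_chain("GeneID", "Structure")
answer = run_chain(chain, "BRCA1")
```

In SADI proper, the "types" are OWL class descriptions and the matching is done by a reasoner rather than string equality, but the overall shape (discover a chain from annotations, then pipeline the services) is the same.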
Affiliation(s)
- Mark D Wilkinson
- Heart + Lung Institute at St. Paul's Hospital, University of British Columbia, Vancouver, BC, Canada.
|