1
Walsh LH, Coakley M, Walsh AM, O'Toole PW, Cotter PD. Bioinformatic approaches for studying the microbiome of fermented food. Crit Rev Microbiol 2023; 49:693-725. [PMID: 36287644 DOI: 10.1080/1040841x.2022.2132850] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 05/11/2022] [Revised: 08/11/2022] [Accepted: 09/28/2022] [Indexed: 11/03/2022]
Abstract
High-throughput DNA sequencing-based approaches continue to revolutionise our understanding of microbial ecosystems, including those associated with fermented foods. Metagenomic and metatranscriptomic approaches are state-of-the-art biological profiling methods and are employed to investigate a wide variety of characteristics of microbial communities, such as taxonomic membership, gene content and the range and level at which these genes are expressed. Individual groups and consortia of researchers are utilising these approaches to produce increasingly large and complex datasets, representing vast populations of microorganisms. There is a corresponding requirement for the development and application of appropriate bioinformatic tools and pipelines to interpret this data. This review critically analyses the tools and pipelines that have been used or that could be applied to the analysis of metagenomic and metatranscriptomic data from fermented foods. In addition, we critically analyse a number of studies of fermented foods in which these tools have previously been applied, to highlight the insights that these approaches can provide.
Affiliation(s)
- Liam H Walsh
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
  - School of Microbiology, University College Cork, Ireland
- Mairéad Coakley
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- Aaron M Walsh
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- Paul W O'Toole
  - School of Microbiology, University College Cork, Ireland
  - APC Microbiome Ireland, University College Cork, Ireland
- Paul D Cotter
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
  - APC Microbiome Ireland, University College Cork, Ireland
  - VistaMilk SFI Research Centre, Teagasc, Moorepark, Fermoy, Cork, Ireland
2
Martens M, Stierum R, Schymanski EL, Evelo CT, Aalizadeh R, Aladjov H, Arturi K, Audouze K, Babica P, Berka K, Bessems J, Blaha L, Bolton EE, Cases M, Damalas DE, Dave K, Dilger M, Exner T, Geerke DP, Grafström R, Gray A, Hancock JM, Hollert H, Jeliazkova N, Jennen D, Jourdan F, Kahlem P, Klanova J, Kleinjans J, Kondic T, Kone B, Lynch I, Maran U, Martinez Cuesta S, Ménager H, Neumann S, Nymark P, Oberacher H, Ramirez N, Remy S, Rocca-Serra P, Salek RM, Sallach B, Sansone SA, Sanz F, Sarimveis H, Sarntivijai S, Schulze T, Slobodnik J, Spjuth O, Tedds J, Thomaidis N, Weber RJ, van Westen GJ, Wheelock CE, Williams AJ, Witters H, Zdrazil B, Županič A, Willighagen EL. ELIXIR and Toxicology: a community in development. F1000Res 2023; 10:ELIXIR-1129. [PMID: 37842337 PMCID: PMC10568213 DOI: 10.12688/f1000research.74502.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/28/2023] [Indexed: 10/17/2023] Open
Abstract
Toxicology has been an active research field for many decades, with academic, industrial and government involvement. Modern omics and computational approaches are changing the field, from merely disease-specific observational models into target-specific predictive models. Traditionally, toxicology has strong links with other fields such as biology, chemistry, pharmacology and medicine. With the rise of synthetic and new engineered materials, alongside ongoing prioritisation needs in chemical risk assessment for existing chemicals, early predictive evaluations are becoming of utmost importance to both scientific and regulatory purposes. ELIXIR is an intergovernmental organisation that brings together life science resources from across Europe. To coordinate the linkage of various life science efforts around modern predictive toxicology, the establishment of a new ELIXIR Community is seen as instrumental. In the past few years, joint efforts, building on incidental overlap, have been piloted in the context of ELIXIR. For example, the EU-ToxRisk, diXa, HeCaToS, transQST, and the nanotoxicology community have worked with the ELIXIR TeSS, Bioschemas, and Compute Platforms and activities. In 2018, a core group of interested parties wrote a proposal, outlining a sketch of what this new ELIXIR Toxicology Community would look like. A recent workshop (held September 30th to October 1st, 2020) extended this into an ELIXIR Toxicology roadmap and a shortlist of limited investment-high gain collaborations to give body to this new community. This Whitepaper outlines the results of these efforts and defines our vision of the ELIXIR Toxicology Community and how it complements other ELIXIR activities.
Affiliation(s)
- Marvin Martens
  - Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, 6229 ER, The Netherlands
- Rob Stierum
  - Risk Analysis for Products In Development (RAPID), Netherlands Organisation for applied scientific research TNO, Utrecht, 3584 CB, The Netherlands
- Emma L. Schymanski
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, 4367, Luxembourg
- Chris T. Evelo
  - Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, 6229 ER, The Netherlands
  - Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, 6229 EN, The Netherlands
- Reza Aalizadeh
  - Laboratory of Analytical Chemistry, Department of Chemistry, National and Kapodistrian University of Athens, Athens, 15771, Greece
- Hristo Aladjov
  - Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, Sofia, 1113, Bulgaria
- Kasia Arturi
  - Department Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, 8600, Switzerland
- Pavel Babica
  - RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- Karel Berka
  - Department of Physical Chemistry, Palacky University Olomouc, Olomouc, 77146, Czech Republic
- Ludek Blaha
  - RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- Evan E. Bolton
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Dimitrios E. Damalas
  - Laboratory of Analytical Chemistry, Department of Chemistry, National and Kapodistrian University of Athens, Athens, 15771, Greece
- Kirtan Dave
  - School of Science, GSFC University, Gujarat, 391750, India
- Marco Dilger
  - Forschungs- und Beratungsinstitut Gefahrstoffe (FoBiG) GmbH, Freiburg im Breisgau, 79106, Germany
- Daan P. Geerke
  - AIMMS Division of Molecular Toxicology, Vrije Universiteit, Amsterdam, 1081 HZ, The Netherlands
- Roland Grafström
  - Department of Toxicology, Misvik Biology, Turku, 20520, Finland
  - Institute of Environmental Medicine, Karolinska Institute, Stockholm, 17177, Sweden
- Alasdair Gray
  - Department of Computer Science, Heriot-Watt University, Edinburgh, UK
- Henner Hollert
  - Department Evolutionary Ecology & Environmental Toxicology (E3T), Goethe-University, Frankfurt, D-60438, Germany
- Danyel Jennen
  - Department of Toxicogenomics, Maastricht University, Maastricht, 6200 MD, The Netherlands
- Fabien Jourdan
  - MetaboHUB, French metabolomics infrastructure in Metabolomics and Fluxomics, Toulouse, France
  - Toxalim (Research Centre in Food Toxicology), Université de Toulouse, Toulouse, France
- Pascal Kahlem
  - Scientific Network Management SL, Barcelona, 08015, Spain
- Jana Klanova
  - RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- Jos Kleinjans
  - Department of Toxicogenomics, Maastricht University, Maastricht, 6200 MD, The Netherlands
- Todor Kondic
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, 4367, Luxembourg
- Boï Kone
  - Faculty of Pharmacy, Malaria Research and Training Center, Bamako, BP:1805, Mali
- Iseult Lynch
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, UK, Birmingham, B15 2TT, UK
- Uko Maran
  - Institute of Chemistry, University of Tartu, Tartu, 50411, Estonia
- Hervé Ménager
  - Institut Français de Bioinformatique, Evry, F-91000, France
  - Bioinformatics and Biostatistics Hub, Institut Pasteur, Paris, F-75015, France
- Steffen Neumann
  - Research group Bioinformatics and Scientific Data, Leibniz Institute of Plant Biochemistry, Halle, 06120, Germany
- Penny Nymark
  - Institute of Environmental Medicine, Karolinska Institute, Stockholm, 17177, Sweden
- Herbert Oberacher
  - Institute of Legal Medicine and Core Facility Metabolomics, Medical University of Innsbruck, Innsbruck, A-6020, Austria
- Noelia Ramirez
  - Institut d'Investigacio Sanitaria Pere Virgili-Universitat Rovira i Virgili, Tarragona, 43007, Spain
- Philippe Rocca-Serra
  - Data Readiness Group, Department of Engineering Science, University of Oxford, Oxford, UK
- Reza M. Salek
  - International Agency for Research on Cancer, World Health Organisation, Lyon, 69372, France
- Brett Sallach
  - Department of Environment and Geography, University of York, UK, York, YO10 5NG, UK
- Ferran Sanz
  - Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Pompeu Fabra University, Barcelona, 08003, Spain
- Tobias Schulze
  - Helmholtz Centre for Environmental Research - UFZ, Leipzig, 04318, Germany
- Ola Spjuth
  - Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, SE-75124, Sweden
- Jonathan Tedds
  - ELIXIR Hub, Wellcome Genome Campus, Cambridge, CB10 1SD, UK
- Nikolaos Thomaidis
  - Laboratory of Analytical Chemistry, Department of Chemistry, National and Kapodistrian University of Athens, Athens, 15771, Greece
- Ralf J.M. Weber
  - School of Biosciences, University of Birmingham, UK, Birmingham, B15 2TT, UK
- Gerard J.P. van Westen
  - Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research, Leiden, 2333 CC, The Netherlands
- Craig E. Wheelock
  - Department of Respiratory Medicine and Allergy, Karolinska University Hospital, Stockholm SE-141-86, Sweden
  - Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm, 17177, Sweden
- Antony J. Williams
  - Center for Computational Toxicology and Exposure, United States Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Barbara Zdrazil
  - Department of Pharmaceutical Sciences, University of Vienna, Vienna, 1090, Austria
- Anže Županič
  - Department Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, 1000, Slovenia
- Egon L. Willighagen
  - Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, 6229 ER, The Netherlands
3
Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. [PMID: 34556866 DOI: 10.1038/s41592-021-01254-9] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 10/20/2020] [Accepted: 07/29/2021] [Indexed: 02/08/2023]
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Affiliation(s)
- Jonathan Göke
  - Genome Institute of Singapore, Singapore, Singapore.
4
Yuen D, Cabansay L, Duncan A, Luu G, Hogue G, Overbeck C, Perez N, Shands W, Steinberg D, Reid C, Olunwa N, Hansen R, Sheets E, O’Farrell A, Cullion K, O’Connor B, Paten B, Stein L. The Dockstore: enhancing a community platform for sharing reproducible and accessible computational protocols. Nucleic Acids Res 2021; 49:W624-W632. [PMID: 33978761 PMCID: PMC8218198 DOI: 10.1093/nar/gkab346] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/02/2021] [Revised: 04/01/2021] [Accepted: 04/26/2021] [Indexed: 11/24/2022] Open
Abstract
Dockstore (https://dockstore.org/) is an open source platform for publishing, sharing, and finding bioinformatics tools and workflows. The platform has facilitated large-scale biomedical research collaborations by using cloud technologies to increase the Findability, Accessibility, Interoperability and Reusability (FAIR) of computational resources, thereby promoting the reproducibility of complex bioinformatics analyses. Dockstore supports a variety of source repositories, analysis frameworks, and language technologies to provide a seamless publishing platform for authors to create a centralized catalogue of scientific software. The ready-to-use packaging of hundreds of tools and workflows, combined with the implementation of interoperability standards, enables users to launch analyses across multiple environments. Dockstore is widely used: more than twenty-five high-profile organizations share analysis collections through the platform in a variety of workflow languages, including the Broad Institute's GATK best practice and COVID-19 workflows (WDL), nf-core workflows (Nextflow), the Intergalactic Workflow Commission tools (Galaxy), and workflows from Seven Bridges (CWL), to highlight just a few. Here we describe the improvements made over the last four years, including the expansion of system integrations supporting authors, the addition of collaboration features and analysis platform integrations supporting users, and other enhancements that improve the overall scientific reproducibility of Dockstore content.
Affiliation(s)
- Denis Yuen
  - Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Louise Cabansay
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Andrew Duncan
  - Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Gary Luu
  - Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Gregory Hogue
  - Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Charles Overbeck
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Natalie Perez
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Walt Shands
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- David Steinberg
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Chaz Reid
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Nneka Olunwa
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Richard Hansen
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Elizabeth Sheets
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Ash O’Farrell
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Kim Cullion
  - Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Lincoln Stein
  - Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
5
Bai J, Bandla C, Guo J, Alvarez RV, Bai M, Vizcaíno JA, Moreno P, Grüning B, Sallou O, Perez-Riverol Y. BioContainers Registry: Searching Bioinformatics and Proteomics Tools, Packages, and Containers. J Proteome Res 2021; 20:2056-2061. [PMID: 33625229 PMCID: PMC7611561 DOI: 10.1021/acs.jproteome.0c00904] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 02/06/2023]
Abstract
BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize software containers including the metadata, versions, licenses, and software dependencies. BioContainers supports multiple packaging and container technologies such as Conda, Docker, and Singularity. BioContainers provides over 9000 bioinformatics tools, including more than 200 proteomics and mass spectrometry tools. Here we introduce the BioContainers Registry and RESTful API to make containerized bioinformatics tools more findable, accessible, interoperable, and reusable (FAIR). The BioContainers Registry provides a fast and convenient way to find and retrieve bioinformatics tool packages and containers. By doing so, it will increase the use of bioinformatics packages and containers while promoting replicability and reproducibility in research.
Affiliation(s)
- Jingwen Bai
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Chakradhar Bandla
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Jiaxin Guo
  - College of Bioinformation, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Roberto Vera Alvarez
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Mingze Bai
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing, 400065, China
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Pablo Moreno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Björn Grüning
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, 79110, Germany
- Olivier Sallou
  - Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA/INRIA) - GenOuest Platform, Université de Rennes, Rennes, France
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
6
Perez-Riverol Y, Moreno P. Scalable Data Analysis in Proteomics and Metabolomics Using BioContainers and Workflows Engines. Proteomics 2020; 20:e1900147. [PMID: 31657527 PMCID: PMC7613303 DOI: 10.1002/pmic.201900147] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/12/2019] [Revised: 09/30/2019] [Indexed: 12/29/2022]
Abstract
Recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming increasingly complex and convoluted, involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics during recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are designed as single-tiered software applications in which the analytics tasks cannot be distributed, limiting the scalability and reproducibility of the data analysis. In this paper the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis, are summarized. The combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis is discussed. Finally, a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments, Galaxy and Nextflow, is introduced to the proteomics and metabolomics communities.
Affiliation(s)
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Pablo Moreno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
7
Föll MC, Moritz L, Wollmann T, Stillger MN, Vockert N, Werner M, Bronsert P, Rohr K, Grüning BA, Schilling O. Accessible and reproducible mass spectrometry imaging data analysis in Galaxy. Gigascience 2019; 8:giz143. [PMID: 31816088 PMCID: PMC6901077 DOI: 10.1093/gigascience/giz143] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/05/2019] [Revised: 09/10/2019] [Accepted: 11/10/2019] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: Mass spectrometry imaging is increasingly used in biological and translational research because it has the ability to determine the spatial distribution of hundreds of analytes in a sample. Being at the interface of proteomics/metabolomics and imaging, the acquired datasets are large and complex and often analyzed with proprietary software or in-house scripts, which hinders reproducibility. Open source software solutions that enable reproducible data analysis often require programming skills and are therefore not accessible to many mass spectrometry imaging (MSI) researchers. FINDINGS: We have integrated 18 dedicated mass spectrometry imaging tools into the Galaxy framework to allow accessible, reproducible, and transparent data analysis. Our tools are based on Cardinal, MALDIquant, and scikit-image and enable all major MSI analysis steps such as quality control, visualization, preprocessing, statistical analysis, and image co-registration. Furthermore, we created hands-on training material for use cases in proteomics and metabolomics. To demonstrate the utility of our tools, we re-analyzed a publicly available N-linked glycan imaging dataset. By providing the entire analysis history online, we highlight how the Galaxy framework fosters transparent and reproducible research. CONCLUSION: The Galaxy framework has emerged as a powerful analysis platform for the analysis of MSI data with ease of use and access, together with high levels of reproducibility and transparency.
Affiliation(s)
- Melanie Christine Föll
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Schänzlestraße 1, 79104 Freiburg, Germany
- Lennart Moritz
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
- Thomas Wollmann
  - Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Maren Nicole Stillger
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Schänzlestraße 1, 79104 Freiburg, Germany
  - Institute of Molecular Medicine and Cell Research, Faculty of Medicine, University of Freiburg, Stefan-Meier-Straße 17, 79104 Freiburg, Germany
- Niklas Vockert
  - Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Martin Werner
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Medicine - University of Freiburg, Breisacher Straße 153, 79110 Freiburg, Germany
  - Tumorbank Comprehensive Cancer Center Freiburg, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Hugstetter Straße 55, 79106 Freiburg, Germany
- Peter Bronsert
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Medicine - University of Freiburg, Breisacher Straße 153, 79110 Freiburg, Germany
  - Tumorbank Comprehensive Cancer Center Freiburg, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Hugstetter Straße 55, 79106 Freiburg, Germany
- Karl Rohr
  - Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Björn Andreas Grüning
  - Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
- Oliver Schilling
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Medicine - University of Freiburg, Breisacher Straße 153, 79110 Freiburg, Germany
  - German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Hugstetter Straße 55, 79106 Freiburg, Germany
8
Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience 2019; 8:giz095. [PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 12/04/2018] [Revised: 05/23/2019] [Accepted: 07/17/2019] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS: Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS: The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Affiliation(s)
- Farah Zaib Khan
  - The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
  - Common Workflow Language Project
- Richard O Sinnott
  - The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Andrew Lonie
  - The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
9
Grüning BA, Lampa S, Vaudel M, Blankenberg D. Software engineering for scientific big data analysis. Gigascience 2019; 8:giz054. [PMID: 31121028 PMCID: PMC6532757 DOI: 10.1093/gigascience/giz054] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/16/2018] [Revised: 01/20/2019] [Accepted: 04/18/2019] [Indexed: 11/14/2022] Open
Abstract
The increasing complexity of data and analysis methods has created an environment where scientists, who may not have formal training, are finding themselves playing the impromptu role of software engineer. While several resources are available for introducing scientists to the basics of programming, researchers have been left with little guidance on approaches needed to advance to the next level for the development of robust, large-scale data analysis tools that are amenable to integration into workflow management systems, tools, and frameworks. The integration into such workflow systems necessitates additional requirements on computational tools, such as adherence to standard conventions for robustness, data input, output, logging, and flow control. Here we provide a set of 10 guidelines to steer the creation of command-line computational tools that are usable, reliable, extensible, and in line with standards of modern coding practices.
Affiliation(s)
- Björn A Grüning
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
- Center for Biological Systems Analysis (ZBSA), University of Freiburg, Habsburgerstr. 49, D-79104 Freiburg, Germany
| | - Samuel Lampa
- Pharmaceutical Bioinformatics group, Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
- Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Svante Arrhenius vag 16C, 106 91, Solna, Sweden
| | - Marc Vaudel
- K.G. Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Postboks 7804, 5020, Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Postboks 7804, 5020, Bergen, Norway
| | - Daniel Blankenberg
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Avenue / NE50, Cleveland, OH, USA
| |
|
10
|
Sélem-Mojica N, Aguilar C, Gutiérrez-García K, Martínez-Guerrero CE, Barona-Gómez F. EvoMining reveals the origin and fate of natural product biosynthetic enzymes. Microb Genom 2019; 5. [PMID: 30946645 PMCID: PMC6939163 DOI: 10.1099/mgen.0.000260] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Indexed: 12/22/2022] Open
Abstract
Natural products (NPs), or specialized metabolites, are important for medicine and agriculture alike, and for the fitness of the organisms that produce them. NP genome-mining aims at extracting biosynthetic information from the genomes of microbes presumed to produce these compounds. Typically, canonical enzyme sequences from known biosynthetic systems are identified after sequence similarity searches. Despite this being an efficient process, the likelihood of identifying truly novel systems by this approach is low. To overcome this limitation, we previously introduced EvoMining, a genome-mining approach that incorporates evolutionary principles. Here, we release and use our latest EvoMining version, which includes novel visualization features and customizable databases, to analyse 42 central metabolic enzyme families (EFs) conserved throughout Actinobacteria, Cyanobacteria, Pseudomonas and Archaea. We found that expansion-and-recruitment profiles of these 42 families are lineage specific, opening the metabolic space related to ‘shell’ enzymes. These enzymes, which have been overlooked, are EFs with orthologues present in most of the genomes of a taxonomic group, but not in all. As a case study of canonical shell enzymes, we characterized the expansion and recruitment of glutamate dehydrogenase and acetolactate synthase into scytonemin biosynthesis, and into other central metabolic pathways driving Archaea and Bacteria adaptive evolution. By defining the origin and fate of enzymes, EvoMining complements traditional genome-mining approaches as an unbiased strategy and opens the door to gaining insights into the evolution of NP biosynthesis. We anticipate that EvoMining will be broadly used for evolutionary studies, and for generating predictions of unprecedented chemical scaffolds and new antibiotics. This article contains data hosted by Microreact.
Affiliation(s)
- Nelly Sélem-Mojica
- Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
| | - César Aguilar
- Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
| | | | - Christian E Martínez-Guerrero
- Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México. Present address: Nuclear-Mitochondrial Interaction and Paleogenomics Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
| | - Francisco Barona-Gómez
- Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
| |
|
11
|
Korhonen PK, Hall RS, Young ND, Gasser RB. Common workflow language (CWL)-based software pipeline for de novo genome assembly from long- and short-read data. Gigascience 2019; 8:giz014. [PMID: 30821816 PMCID: PMC6451199 DOI: 10.1093/gigascience/giz014] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/01/2018] [Revised: 11/03/2018] [Accepted: 01/25/2019] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Here, we created an automated pipeline for the de novo assembly of genomes from Pacific Biosciences long-read and Illumina short-read data using common workflow language (CWL). To evaluate the performance of this pipeline, we assembled the nuclear genomes of the eukaryotes Caenorhabditis elegans (∼100 Mb), Drosophila melanogaster (∼138 Mb), and Plasmodium falciparum (∼23 Mb) directly from publicly accessible nucleotide sequence datasets and assessed the quality of the assemblies against curated reference genomes. FINDINGS We showed a dependency of the accuracy of assembly on sequencing technology and GC content and repeatedly achieved assemblies that meet the high standards set by the National Human Genome Research Institute, being applicable to gene prediction and subsequent genomic analyses. CONCLUSIONS This CWL pipeline overcomes current challenges of achieving repeatability and reproducibility of assembly results and offers a platform for the re-use of the workflow and the integration of diverse datasets. This workflow is publicly available via GitHub (https://github.com/vetscience/Assemblosis) and is currently applicable to the assembly of haploid and diploid genomes of eukaryotes.
Affiliation(s)
- Pasi K Korhonen
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Ross S Hall
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Neil D Young
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| |
|
12
|
Rowe WPM, Carrieri AP, Alcon-Giner C, Caim S, Shaw A, Sim K, Kroll JS, Hall LJ, Pyzer-Knapp EO, Winn MD. Streaming histogram sketching for rapid microbiome analytics. Microbiome 2019; 7:40. [PMID: 30878035 PMCID: PMC6420756 DOI: 10.1186/s40168-019-0653-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 09/25/2018] [Accepted: 03/01/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. RESULTS We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed 'histosketch' that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a 'real life' example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. CONCLUSIONS Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk).
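To make the quantities in this abstract concrete, the following sketch computes a k-mer spectrum for each sample and the exact weighted Jaccard similarity between two spectra. It is an illustration only and is not HULK's algorithm: histosketching approximates this similarity from a small fixed-size sketch of a streaming spectrum, whereas the code below keeps the full spectrum in memory. The function names and the default `k=7` are assumptions made for the example.

```python
from collections import Counter


def kmer_spectrum(reads, k=7):
    """Count every k-mer across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def weighted_jaccard(a, b):
    """Weighted Jaccard similarity: sum of min counts over sum of max counts."""
    keys = set(a) | set(b)
    # Counter returns 0 for a missing k-mer, so absent k-mers need no special case.
    num = sum(min(a[key], b[key]) for key in keys)
    den = sum(max(a[key], b[key]) for key in keys)
    # Two empty spectra are treated as identical (an arbitrary choice for this sketch).
    return num / den if den else 1.0
```

For example, `weighted_jaccard(kmer_spectrum(["ACGTACGT"], k=4), kmer_spectrum(["ACGTA"], k=4))` shares two of its k-mer counts out of five in the union, giving 0.4. A sketching method trades this exactness for a representation small enough to index and compare at scale.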
Affiliation(s)
- Will PM Rowe
- Scientific Computing Department, STFC Daresbury Laboratory, Warrington, UK
| | | | | | - Shabhonam Caim
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
| | - Alex Shaw
- Department of Medicine, Section of Paediatrics, Imperial College London, London, UK
| | - Kathleen Sim
- Department of Medicine, Section of Paediatrics, Imperial College London, London, UK
| | - J. Simon Kroll
- Department of Medicine, Section of Paediatrics, Imperial College London, London, UK
| | - Lindsay J. Hall
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
| | | | - Martyn D. Winn
- Scientific Computing Department, STFC Daresbury Laboratory, Warrington, UK
| |
|