1. Niehues A, de Visser C, Hagenbeek FA, Kulkarni P, Pool R, Karu N, Kindt ASD, Singh G, Vermeiren RRJM, Boomsma DI, van Dongen J, ’t Hoen PAC, van Gool AJ. A multi-omics data analysis workflow packaged as a FAIR Digital Object. Gigascience 2024; 13:giad115. [PMID: 38217405 PMCID: PMC10787363 DOI: 10.1093/gigascience/giad115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 11/14/2023] [Accepted: 12/10/2023] [Indexed: 01/15/2024] Open
Abstract
BACKGROUND Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object. FINDINGS We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub. CONCLUSIONS Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
Affiliation(s)
- Anna Niehues
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
  - Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Casper de Visser
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Fiona A Hagenbeek
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Purva Kulkarni
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
  - Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
  - Department of Human Genetics, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- René Pool
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Naama Karu
  - Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
- Alida S D Kindt
  - Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
- Gurnoor Singh
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Robert R J M Vermeiren
  - Department of Child and Adolescent Psychiatry, LUMC-Curium, Leiden University Medical Center, 2342 AK Oegstgeest, The Netherlands
- Dorret I Boomsma
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
- Jenny van Dongen
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
- Peter A C ’t Hoen
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Alain J van Gool
  - Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands

2. Rodeiro J, Vidaña-Vila E, Navarro J, Mallol R. CloMet: A Novel Open-Source and Modular Software Platform That Connects Established Metabolomics Repositories and Data Analysis Resources. J Proteome Res 2023; 22:2540-2547. [PMID: 37428859 PMCID: PMC10857572 DOI: 10.1021/acs.jproteome.2c00602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Indexed: 07/12/2023]
Abstract
The field of metabolomics has witnessed the development of hundreds of computational tools, but only a few have become cornerstones of this field. While MetaboLights and Metabolomics Workbench are two well-established data repositories for metabolomics data sets, Workflows4Metabolomics and MetaboAnalyst are two well-established web-based data analysis platforms for metabolomics. Yet, the raw data stored in the aforementioned repositories lack standardization in terms of the file system format used to store the associated acquisition files. Consequently, it is not straightforward to reuse available data sets as input data in the above-mentioned data analysis resources, especially for non-expert users. This paper presents CloMet, a novel open-source modular software platform that contributes to standardization, reusability, and reproducibility in the metabolomics field. CloMet, which is available through a Docker file, converts raw and NMR-based metabolomics data from MetaboLights and Metabolomics Workbench to a file format that can be used directly either in MetaboAnalyst or in Workflows4Metabolomics. We validated both CloMet and the output data using data sets from these repositories. Overall, CloMet fills the gap between well-established data repositories and web-based statistical platforms and contributes to the consolidation of a data-driven perspective of the metabolomics field by leveraging and connecting existing data and resources.
Affiliation(s)
- Jordi Rodeiro
  - Human Environment Research, La Salle - Universitat Ramon Llull, 08022 Barcelona, Spain
- Ester Vidaña-Vila
  - Human Environment Research, La Salle - Universitat Ramon Llull, 08022 Barcelona, Spain
- Joan Navarro
  - Research Group on Smart Society, La Salle - Universitat Ramon Llull, 08022 Barcelona, Spain
- Roger Mallol
  - Human Environment Research, La Salle - Universitat Ramon Llull, 08022 Barcelona, Spain

3. Chen J, Basting PJ, Han S, Garfinkel DJ, Bergman CM. Reproducible evaluation of transposable element detectors with McClintock 2 guides accurate inference of Ty insertion patterns in yeast. Mob DNA 2023; 14:8. [PMID: 37452430 PMCID: PMC10347736 DOI: 10.1186/s13100-023-00296-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/21/2023] [Accepted: 06/09/2023] [Indexed: 07/18/2023] Open
Abstract
BACKGROUND Many computational methods have been developed to detect non-reference transposable element (TE) insertions using short-read whole genome sequencing data. The diversity and complexity of such methods often present challenges to new users seeking to reproducibly install, execute, or evaluate multiple TE insertion detectors. RESULTS We previously developed the McClintock meta-pipeline to facilitate the installation, execution, and evaluation of six first-generation short-read TE detectors. Here, we report a completely re-implemented version of McClintock written in Python using Snakemake and Conda that improves its installation, error handling, speed, stability, and extensibility. McClintock 2 now includes 12 short-read TE detectors, auxiliary pre-processing and analysis modules, interactive HTML reports, and a simulation framework to reproducibly evaluate the accuracy of component TE detectors. When applied to the model microbial eukaryote Saccharomyces cerevisiae, we find substantial variation in the ability of McClintock 2 components to identify the precise locations of non-reference TE insertions, with RelocaTE2 showing the highest recall and precision in simulated data. We find that RelocaTE2, TEMP, TEMP2 and TEBreak provide consistent estimates of ~50 non-reference TE insertions per strain and that Ty2 has the highest number of non-reference TE insertions in a species-wide panel of ~1000 yeast genomes. Finally, we show that best-in-class predictors for yeast applied to resequencing data have sufficient resolution to reveal a dyad pattern of integration in nucleosome-bound regions upstream of yeast tRNA genes for Ty1, Ty2, and Ty4, allowing us to extend knowledge about fine-scale target preferences revealed previously for experimentally-induced Ty1 insertions to spontaneous insertions for other copia-superfamily retrotransposons in yeast.
CONCLUSION McClintock ( https://github.com/bergmanlab/mcclintock/ ) provides a user-friendly pipeline for the identification of TEs in short-read WGS data using multiple TE detectors, which should benefit researchers studying TE insertion variation in a wide range of different organisms. Application of the improved McClintock system to simulated and empirical yeast genome data reveals best-in-class methods and novel biological insights for one of the most widely-studied model eukaryotes and provides a paradigm for evaluating and selecting non-reference TE detectors in other species.
Affiliation(s)
- Jingxuan Chen
  - Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- Shunhua Han
  - Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- David J. Garfinkel
  - Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
- Casey M. Bergman
  - Institute of Bioinformatics, University of Georgia, Athens, GA, USA
  - Department of Genetics, University of Georgia, Athens, GA, USA

4. Sonrel A, Luetge A, Soneson C, Mallona I, Germain PL, Knyazev S, Gilis J, Gerber R, Seurinck R, Paul D, Sonder E, Crowell HL, Fanaswala I, Al-Ajami A, Heidari E, Schmeing S, Milosavljevic S, Saeys Y, Mangul S, Robinson MD. Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability. Genome Biol 2023; 24:119. [PMID: 37198712 DOI: 10.1186/s13059-023-02962-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/27/2022] [Accepted: 05/06/2023] [Indexed: 05/19/2023] Open
Abstract
Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods, but with a focus here on computational methods, benchmarking is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Affiliation(s)
- Anthony Sonrel
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Almut Luetge
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Charlotte Soneson
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- Izaskun Mallona
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Pierre-Luc Germain
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
- Sergey Knyazev
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
- Jeroen Gilis
  - Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Reto Gerber
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Ruth Seurinck
  - Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
- Dominique Paul
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Emanuel Sonder
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
- Helena L Crowell
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Imran Fanaswala
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Ahmad Al-Ajami
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Elyas Heidari
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Stephan Schmeing
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Stefan Milosavljevic
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Yvan Saeys
  - Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
- Mark D Robinson
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland

5. Player RA, Aguinaldo AM, Merritt BB, Maszkiewicz LN, Adeyemo OE, Forsyth ER, Verratti KJ, Chee BW, Grady SL, Bradburne CE. The META tool optimizes metagenomic analyses across sequencing platforms and classifiers. Frontiers in Bioinformatics 2023; 2:969247. [PMID: 36685333 PMCID: PMC9852826 DOI: 10.3389/fbinf.2022.969247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2022] [Accepted: 12/14/2022] [Indexed: 01/09/2023] Open
Abstract
A major challenge in the field of metagenomics is the selection of the correct combination of sequencing platform and downstream metagenomic analysis algorithm, or "classifier". Here, we present the Metagenomic Evaluation Tool Analyzer (META), which produces simulated data and facilitates platform and algorithm selection for any given metagenomic use case. META-generated in silico read data are modular, scalable, and reflect user-defined community profiles, while the downstream analysis is done using a variety of metagenomic classifiers. Reported results include information on resource utilization, time-to-answer, and performance. Real-world data can also be analyzed using selected classifiers and results benchmarked against simulations. To test the utility of the META software, simulated data was compared to real-world viral and bacterial metagenomic samples run on four different sequencers and analyzed using 12 metagenomic classifiers. Lastly, we introduce "META Score": a unified, quantitative value which rates an analytic classifier's ability to both identify and count taxa in a representative sample.
Affiliation(s)
- Robert A. Player
  - Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Brian B. Merritt
  - Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Lisa N. Maszkiewicz
  - Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Ellen R. Forsyth
  - Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Brant W. Chee
  - Division of General Internal Medicine, Johns Hopkins School of Medicine, Baltimore, MD, United States
  - Armstrong Institute for Patient Safety and Quality, Johns Hopkins School of Medicine, Baltimore, MD, United States
- Sarah L. Grady
  - Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Christopher E. Bradburne
  - Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
  - McKusick-Nathans Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, United States
  - Correspondence: Christopher E. Bradburne

6. Mendes CI, Vila-Cerqueira P, Motro Y, Moran-Gilad J, Carriço JA, Ramirez M. LMAS: evaluating metagenomic short de novo assembly methods through defined communities. Gigascience 2022; 12:6963325. [PMID: 36576131 PMCID: PMC9795473 DOI: 10.1093/gigascience/giac122] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/04/2022] [Revised: 09/26/2022] [Accepted: 11/16/2022] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND The de novo assembly of raw sequence data is key in metagenomic analysis. It allows recovering draft genomes from a pool of mixed raw reads, yielding longer sequences that offer contextual information and provide a more complete picture of the microbial community. FINDINGS To better compare de novo assemblers for metagenomic analysis, LMAS (Last Metagenomic Assembler Standing) was developed as a flexible platform allowing users to evaluate assembler performance given known standard communities. Overall, in our test datasets, k-mer De Bruijn graph assemblers outperformed the alternative approaches but came with a greater computational cost. Furthermore, assemblers branded as metagenomic specific did not consistently outperform other genomic assemblers in metagenomic samples. Some assemblers still in use, such as ABySS, MetaHipmer2, minia, and VelvetOptimiser, perform relatively poorly and should be used with caution when assembling complex samples. Meaningful strain resolution at the single-nucleotide polymorphism level was not achieved, even by the best assemblers tested. CONCLUSIONS The choice of a de novo assembler depends on the computational resources available, the replicon of interest, and the major goals of the analysis. No single assembler appeared an ideal choice for short-read metagenomic prokaryote replicon assembly, each showing specific strengths. The choice of metagenomic assembler should be guided by user requirements and characteristics of the sample of interest, and LMAS provides an interactive evaluation platform for this purpose. LMAS is open source, and the workflow and its documentation are available at https://github.com/B-UMMI/LMAS and https://lmas.readthedocs.io/, respectively.
Affiliation(s)
- Catarina Inês Mendes
  - Correspondence address: Catarina I. Mendes, Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Pedro Vila-Cerqueira
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Yair Motro
  - Faculty of Health Sciences, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel
- Jacob Moran-Gilad
  - Faculty of Health Sciences, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel
- João André Carriço
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Mário Ramirez
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal

7. König P, Beier S, Mascher M, Stein N, Lange M, Scholz U. DivBrowse-interactive visualization and exploratory data analysis of variant call matrices. Gigascience 2022; 12:giad025. [PMID: 37083938 PMCID: PMC10120423 DOI: 10.1093/gigascience/giad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Revised: 01/23/2023] [Accepted: 03/23/2023] [Indexed: 04/22/2023] Open
Abstract
BACKGROUND The sequencing of whole genomes is becoming increasingly affordable. In this context, large-scale sequencing projects are generating ever larger datasets of species-specific genomic diversity. As a consequence, more and more genomic data need to be made easily accessible and analyzable to the scientific community. FINDINGS We present DivBrowse, a web application for interactive visualization and exploratory analysis of genomic diversity data stored in Variant Call Format (VCF) files of any size. By seamlessly combining BLAST as an entry point together with interactive data analysis features such as principal component analysis in one graphical user interface, DivBrowse provides a novel and unique set of exploratory data analysis capabilities for genomic biodiversity datasets. The capability to integrate DivBrowse into existing web applications supports interoperability between different web applications. Built-in interactive computation of principal component analysis allows users to perform ad hoc analysis of the population structure based on specific genetic elements such as genes and exons. Data interoperability is supported by the ability to export genomic diversity data in VCF and General Feature Format 3 files. CONCLUSION DivBrowse offers a novel approach for interactive visualization and analysis of genomic diversity data and optionally also gene annotation data by including features like interactive calculation of variant frequencies and principal component analysis. The use of established standard file formats for data input supports interoperability and seamless deployment of application instances based on the data output of established bioinformatics pipelines.
Affiliation(s)
- Patrick König
  - Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany
- Sebastian Beier
  - Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany
  - Institute of Bio- and Geosciences, IBG-4, Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
- Martin Mascher
  - Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany
- Nils Stein
  - Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany
  - Center for Integrated Breeding Research, Georg-August University, 37075 Göttingen, Germany
- Matthias Lange
  - Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany
- Uwe Scholz
  - Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany

8. Hou Q, Waury K, Gogishvili D, Feenstra KA. Ten quick tips for sequence-based prediction of protein properties using machine learning. PLoS Comput Biol 2022; 18:e1010669. [PMID: 36454728 PMCID: PMC9714715 DOI: 10.1371/journal.pcbi.1010669] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 12/03/2022] Open
Abstract
The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
Affiliation(s)
- Qingzhen Hou
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong, P. R. China
  - National Institute of Health Data Science of China, Shandong University, Shandong, P. R. China
- Katharina Waury
  - Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
- Dea Gogishvili
  - Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
- K. Anton Feenstra
  - Department of Computer Science, Bioinformatics Group, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands

9. Kadri S, Sboner A, Sigaras A, Roy S. Containers in Bioinformatics: Applications, Practical Considerations, and Best Practices in Molecular Pathology. J Mol Diagn 2022; 24:442-454. [PMID: 35189355 DOI: 10.1016/j.jmoldx.2022.01.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/29/2021] [Revised: 11/15/2021] [Accepted: 01/21/2022] [Indexed: 12/19/2022] Open
Abstract
Systematic implementation of bioinformatics resources for next generation sequencing (NGS)-based clinical testing is an arduous undertaking. One of the key challenges involves developing an ecosystem of information technology infrastructure for enabling scalable and reproducible bioinformatics services that is resilient and secure for handling genetic and protected health information, often embedded in an existing non-bioinformatics-oriented infrastructure. Container technology provides an ideal and infrastructure-agnostic solution for molecular laboratories developing and using bioinformatics pipelines, whether on-premise or using the cloud. A container is a technology that provides a consistent computational environment and enables reproducibility, scalability, and security when developing NGS bioinformatics analysis pipelines. Containers can increase the bioinformatics team's productivity by automating and simplifying the maintenance of complex bioinformatics resources, as well as facilitate validation, version control, and documentation necessary for clinical laboratory regulatory compliance. Although there is increasing popularity in adopting containers for developing NGS bioinformatics pipelines, there is wide variability and inconsistency in the usage of containers that may result in suboptimal performance and potentially compromise the security and privacy of protected health information. In this article, the authors highlight the current state and provide best or recommended practices for building and using containers in NGS bioinformatics solutions in a clinical setting, with a focus on scalability, optimization, maintainability, and data security.
Affiliation(s)
- Sabah Kadri
  - Department of Bioinformatics, Ann & Robert H Lurie Children's Hospital, Chicago, Illinois
- Andrea Sboner
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, New York
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, New York
  - Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York
- Alexandros Sigaras
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, New York
  - Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, New York
- Somak Roy
  - Department of Molecular Pathology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio

10. van der Putten BCL, Mendes CI, Talbot BM, de Korne-Elenbaas J, Mamede R, Vila-Cerqueira P, Coelho LP, Gulvik CA, Katz LS, The ASM NGS Hackathon Participants. Software testing in microbial bioinformatics: a call to action. Microb Genom 2022; 8. [PMID: 35259087 PMCID: PMC9176277 DOI: 10.1099/mgen.0.000790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
Abstract
Computational algorithms have become an essential component of research, with great efforts by the scientific community to raise standards on development and distribution of code. Despite these efforts, sustainability and reproducibility are major issues since continued validation through software testing is still not a widely adopted practice. Here, we report seven recommendations that help researchers implement software testing in microbial bioinformatics. We have developed these recommendations based on our experience from a collaborative hackathon organised prior to the American Society for Microbiology Next Generation Sequencing (ASM NGS) 2020 conference. We also present a repository hosting examples and guidelines for testing, available from https://github.com/microbinfie-hackathon2020/CSIS.
Affiliation(s)
- Boas C L van der Putten
  - Department of Medical Microbiology, Amsterdam UMC, University of Amsterdam, the Netherlands
  - Department of Global Health, Amsterdam Institute for Global Health and Development, Amsterdam UMC, University of Amsterdam, the Netherlands
- C I Mendes
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
- Brooke M Talbot
  - Department of Biological and Biomedical Sciences, Emory University, Atlanta, GA, USA
- Jolinda de Korne-Elenbaas
  - Department of Medical Microbiology, Amsterdam UMC, University of Amsterdam, the Netherlands
  - Department of Infectious Diseases, Public Health Laboratory, Public Health Service of Amsterdam, the Netherlands
- Rafael Mamede
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
- Pedro Vila-Cerqueira
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, PR China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, PR China
- Christopher A Gulvik
  - Bacterial Special Pathogens Branch, Division of High-Consequence Pathogens and Pathology, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Lee S Katz
  - Center for Food Safety, University of Georgia, Griffin, GA, USA
  - Enteric Diseases Laboratory Branch, Division of Foodborne, Waterborne, and Environmental Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
11
|
Allain F, Roméjon J, La Rosa P, Jarlier F, Servant N, Hupé P. Geniac: Automatic Configuration GENerator and Installer for nextflow pipelines. Open Research Europe 2022; 1:76. [PMID: 37645091 PMCID: PMC10445886 DOI: 10.12688/openreseurope.13861.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/11/2022] [Indexed: 08/31/2023]
Abstract
With the advent of high-throughput biotechnological platforms and their ever-growing capacity, life science has turned into a digitized, computational and data-intensive discipline. As a consequence, standard analysis with a bioinformatics pipeline in the context of routine production has become a challenge: the data must be processed in real time and delivered to the end-users as fast as possible. The usage of workflow management systems along with packaging systems and containerization technologies offers an opportunity to tackle this challenge. While very powerful, they can be used and combined in many ways, which may differ from one developer to another. Therefore, promoting the homogeneity of the workflow implementation requires guidelines and protocols which detail how the source code of the bioinformatics pipeline should be written and organized to ensure its usability, maintainability, interoperability, sustainability, portability, reproducibility, scalability and efficiency. Capitalizing on Nextflow, Conda, Docker, Singularity and the nf-core initiative, we propose a set of best practices along the development life cycle of the bioinformatics pipeline and deployment for production operations which target different expert communities including i) the bioinformaticians and statisticians ii) the software engineers and iii) the data managers and core facility engineers. We implemented Geniac (Automatic Configuration GENerator and Installer for nextflow pipelines) which consists of a toolbox with three components: i) a technical documentation available at https://geniac.readthedocs.io to detail coding guidelines for the bioinformatics pipeline with Nextflow, ii) a command line interface with a linter to check that the code respects the guidelines, and iii) an add-on to generate configuration files, build the containers and deploy the pipeline.
The Geniac toolbox aims at the harmonization of development practices across developers and automation of the generation of configuration files and containers by parsing the source code of the Nextflow pipeline.
Affiliation(s)
- Fabrice Allain
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Julien Roméjon
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Philippe La Rosa
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Frédéric Jarlier
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Nicolas Servant
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Philippe Hupé
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
  - UMR144, CNRS, Paris, F-75005, France
12
|
Piccolo SR, Ence ZE, Anderson EC, Chang JT, Bild AH. Simplifying the development of portable, scalable, and reproducible workflows. eLife 2021; 10:71069. [PMID: 34643507 PMCID: PMC8514239 DOI: 10.7554/elife.71069] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/09/2021] [Accepted: 09/27/2021] [Indexed: 12/30/2022] Open
Abstract
Command-line software plays a critical role in biology research. However, processes for installing and executing software differ widely. The Common Workflow Language (CWL) is a community standard that addresses this problem. Using CWL, tool developers can formally describe a tool’s inputs, outputs, and other execution details. CWL documents can include instructions for executing tools inside software containers. Accordingly, CWL tools are portable—they can be executed on diverse computers—including personal workstations, high-performance clusters, or the cloud. CWL also supports workflows, which describe dependencies among tools and using outputs from one tool as inputs to others. To date, CWL has been used primarily for batch processing of large datasets, especially in genomics. But it can also be used for analytical steps of a study. This article explains key concepts about CWL and software containers and provides examples for using CWL in biology research. CWL documents are text-based, so they can be created manually, without computer programming. However, ensuring that these documents conform to the CWL specification may prevent some users from adopting it. To address this gap, we created ToolJig, a Web application that enables researchers to create CWL documents interactively. ToolJig validates information provided by the user to ensure it is complete and valid. After creating a CWL tool or workflow, the user can create ‘input-object’ files, which store values for a particular invocation of a tool or workflow. In addition, ToolJig provides examples of how to execute the tool or workflow via a workflow engine. ToolJig and our examples are available at https://github.com/srp33/ToolJig.
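To make the abstract's description concrete, the following is a minimal sketch of what a CWL tool description and its matching input-object file can look like, generated here from Python as JSON (CWL documents may be written in YAML or JSON). The wrapped command (`wc -l`), the container image, the input and output names, and the file names are all illustrative assumptions; none of them come from the article or from ToolJig.

```python
import json
from pathlib import Path


def build_tool():
    """Return a CWL CommandLineTool description as a plain dictionary.

    The tool wraps `wc -l` purely for illustration (an assumption, not the
    article's example): one File input, one output captured from stdout.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": ["wc", "-l"],
        # Hypothetical container image; a hint lets an engine run the tool in it.
        "hints": {"DockerRequirement": {"dockerPull": "debian:stable-slim"}},
        "inputs": {
            "text_file": {"type": "File", "inputBinding": {"position": 1}},
        },
        "stdout": "line_count.txt",
        "outputs": {"line_count": {"type": "stdout"}},
    }


def build_input_object(path):
    """Return an input object: the values for one invocation of the tool."""
    return {"text_file": {"class": "File", "path": path}}


def write_documents(directory):
    """Write the tool description and one input object into `directory`."""
    directory = Path(directory)
    tool_path = directory / "count_lines.cwl"        # hypothetical file name
    job_path = directory / "count_lines_job.json"    # hypothetical file name
    tool_path.write_text(json.dumps(build_tool(), indent=2))
    job_path.write_text(json.dumps(build_input_object("example.txt"), indent=2))
    return tool_path, job_path
```

The tool description stays fixed while the input object changes per run, which is the separation the abstract describes; the two files would then be handed to a CWL-compliant workflow engine for execution.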
Affiliation(s)
- Stephen R Piccolo
  - Department of Biology, Brigham Young University, Provo, United States
- Zachary E Ence
  - Department of Biology, Brigham Young University, Provo, United States
- Jeffrey T Chang
  - Department of Integrative Biology and Pharmacology, University of Texas Health Science Center at Houston, Houston, United States
- Andrea H Bild
  - Department of Medical Oncology and Therapeutics, City of Hope Comprehensive Cancer Institute, Monrovia, United States
13
|
Combining Multiple RNA-Seq Data Analysis Algorithms Using Machine Learning Improves Differential Isoform Expression Analysis. Methods Protoc 2021; 4:mps4040068. [PMID: 34698224 PMCID: PMC8544431 DOI: 10.3390/mps4040068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/23/2021] [Revised: 08/22/2021] [Accepted: 09/24/2021] [Indexed: 12/13/2022] Open
Abstract
RNA sequencing has become the standard technique for high resolution genome-wide monitoring of gene expression. As such, it often comprises the first step towards understanding complex molecular mechanisms driving various phenotypes, spanning organ development to disease genesis, monitoring and progression. An advantage of RNA sequencing is its ability to capture complex transcriptomic events such as alternative splicing which results in alternate isoform abundance. At the same time, this advantage remains algorithmically and computationally challenging, especially with the emergence of even higher resolution technologies such as single-cell RNA sequencing. Although several algorithms have been proposed for the effective detection of differential isoform expression from RNA-Seq data, no widely accepted gold standard has been established. This fact is further compounded by the significant differences in the output of different algorithms when applied to the same data. In addition, many of the proposed algorithms remain scarce and poorly maintained. Driven by these challenges, we developed a novel integrative approach that effectively combines the most widely used algorithms for differential transcript and isoform analysis using state-of-the-art machine learning techniques. We demonstrate its usability by applying it to simulated data based on several organisms, and using several performance metrics; we conclude that our strategy outperforms the application of the individual algorithms. Finally, our approach is implemented as an R Shiny application, with the underlying data analysis pipelines also available as Docker containers.
14
|
Paul-Gilloteaux P, Tosi S, Hériché JK, Gaignard A, Ménager H, Marée R, Baecker V, Klemm A, Kalaš M, Zhang C, Miura K, Colombelli J. Bioimage analysis workflows: community resources to navigate through a complex ecosystem. F1000Res 2021; 10:320. [PMID: 34136134 PMCID: PMC8182692 DOI: 10.12688/f1000research.52569.1] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 04/14/2021] [Indexed: 11/20/2022] Open
Abstract
Workflows are the keystone of bioimage analysis, and the NEUBIAS (Network of European BioImage AnalystS) community is trying to gather the actors of this field and organize the information around them. One of its most recent outputs is the opening of the F1000Research NEUBIAS gateway, whose main objective is to offer a channel of publication for bioimage analysis workflows and associated resources. In this paper we want to express some personal opinions and recommendations related to finding, handling and developing bioimage analysis workflows. The emergence of "big data" in bioimaging and resource-intensive analysis algorithms make local data storage and computing solutions a limiting factor. At the same time, the need for data sharing with collaborators and a general shift towards remote work have created new challenges and avenues for the execution and sharing of bioimage analysis workflows. These challenges are to reproducibly run workflows in remote environments, in particular when their components come from different software packages, but also to document them and link their parameters and results by following the FAIR principles (Findable, Accessible, Interoperable, Reusable) to foster open and reproducible science. In this opinion paper, we focus on giving some directions to the reader to tackle these challenges and navigate through this complex ecosystem, in order to find and use workflows, and to compare workflows addressing the same problem. We also discuss tools to run workflows in the cloud and on High Performance Computing resources, and suggest ways to make these workflows FAIR.
Affiliation(s)
- Perrine Paul-Gilloteaux
  - Université de Nantes, CNRS, INSERM, l’institut du thorax, Nantes, F-44000, France
  - Université de Nantes, CHU Nantes, Inserm, CNRS, SFR Santé, Inserm UMS 016, CNRS UMS 3556, Nantes, F-44000, France
- Sébastien Tosi
  - Institute for Research in Biomedicine, IRB Barcelona, Barcelona Institute of Science and Technology, BIST, Barcelona, Spain
- Jean-Karim Hériché
  - Cell Biology and Biophysics Unit, European Molecular Biology Laboratory, Heidelberg, 69117, Germany
- Alban Gaignard
  - Université de Nantes, CNRS, INSERM, l’institut du thorax, Nantes, F-44000, France
- Hervé Ménager
  - Hub de Bioinformatique et Biostatistique, Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, Paris, 75015, France
  - CNRS, UMS 3601, Institut Français de Bioinformatique, IFB-core, Evry, 91000, France
- Raphaël Marée
  - Montefiore Institute, University of Liège, Liège, Belgium
- Volker Baecker
  - Montpellier Ressources Imagerie, BioCampus Montpellier, CNRS, INSERM, University of Montpellier, Montpellier, F-34000, France
- Anna Klemm
  - BioImage Informatics Facility, SciLifeLab, Stockholm, Sweden
- Matúš Kalaš
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Chong Zhang
  - Department of Information and Communication Technologies, University Pompeu Fabra, Barcelona, Spain
- Kota Miura
  - Nikon Imaging Center, University of Heidelberg, Heidelberg, Germany
- Julien Colombelli
  - Institute for Research in Biomedicine, IRB Barcelona, Barcelona Institute of Science and Technology, BIST, Barcelona, Spain
15
|
Nüst D, Sochat V, Marwick B, Eglen SJ, Head T, Hirst T, Evans BD. Ten simple rules for writing Dockerfiles for reproducible data science. PLoS Comput Biol 2020; 16:e1008316. [PMID: 33170857 PMCID: PMC7654784 DOI: 10.1371/journal.pcbi.1008316] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 11/18/2022] Open
Abstract
Computational science has been greatly improved by the use of containers for packaging software and data dependencies. In a scholarly context, the main drivers for using these containers are transparency and support of reproducibility; in turn, a workflow's reproducibility can be greatly affected by the choices that are made with respect to building containers. In many cases, the build process for the container's image is created from instructions provided in a Dockerfile format. In support of this approach, we present a set of rules to help researchers write understandable Dockerfiles for typical data science workflows. By following the rules in this article, researchers can create containers suitable for sharing with fellow scientists, for including in scholarly communication such as education or scientific papers, and for effective and sustainable personal workflows.
Affiliation(s)
- Daniel Nüst
  - Institute for Geoinformatics, University of Münster, Münster, Germany
- Vanessa Sochat
  - Stanford Research Computing Center, Stanford University, Stanford, California, United States of America
- Ben Marwick
  - Department of Anthropology, University of Washington, Seattle, Washington, United States of America
- Stephen J. Eglen
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, Cambridgeshire, Great Britain
- Tim Head
  - Wild Tree Tech, Zurich, Switzerland
- Tony Hirst
  - Department of Computing and Communications, The Open University, Great Britain
- Benjamin D. Evans
  - School of Psychological Science, University of Bristol, Bristol, Great Britain
16
|
Föll MC, Moritz L, Wollmann T, Stillger MN, Vockert N, Werner M, Bronsert P, Rohr K, Grüning BA, Schilling O. Accessible and reproducible mass spectrometry imaging data analysis in Galaxy. Gigascience 2019; 8:giz143. [PMID: 31816088 PMCID: PMC6901077 DOI: 10.1093/gigascience/giz143] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/05/2019] [Revised: 09/10/2019] [Accepted: 11/10/2019] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Mass spectrometry imaging is increasingly used in biological and translational research because it has the ability to determine the spatial distribution of hundreds of analytes in a sample. Being at the interface of proteomics/metabolomics and imaging, the acquired datasets are large and complex and often analyzed with proprietary software or in-house scripts, which hinders reproducibility. Open source software solutions that enable reproducible data analysis often require programming skills and are therefore not accessible to many mass spectrometry imaging (MSI) researchers. FINDINGS We have integrated 18 dedicated mass spectrometry imaging tools into the Galaxy framework to allow accessible, reproducible, and transparent data analysis. Our tools are based on Cardinal, MALDIquant, and scikit-image and enable all major MSI analysis steps such as quality control, visualization, preprocessing, statistical analysis, and image co-registration. Furthermore, we created hands-on training material for use cases in proteomics and metabolomics. To demonstrate the utility of our tools, we re-analyzed a publicly available N-linked glycan imaging dataset. By providing the entire analysis history online, we highlight how the Galaxy framework fosters transparent and reproducible research. CONCLUSION The Galaxy framework has emerged as a powerful analysis platform for the analysis of MSI data with ease of use and access, together with high levels of reproducibility and transparency.
Affiliation(s)
- Melanie Christine Föll
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Schänzlestraße 1, 79104 Freiburg, Germany
- Lennart Moritz
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
- Thomas Wollmann
  - Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Maren Nicole Stillger
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Schänzlestraße 1, 79104 Freiburg, Germany
  - Institute of Molecular Medicine and Cell Research, Faculty of Medicine, University of Freiburg, Stefan-Meier-Straße 17, 79104 Freiburg, Germany
- Niklas Vockert
  - Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Martin Werner
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Medicine - University of Freiburg, Breisacher Straße 153, 79110 Freiburg, Germany
  - Tumorbank Comprehensive Cancer Center Freiburg, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Hugstetter Straße 55, 79106 Freiburg, Germany
- Peter Bronsert
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Medicine - University of Freiburg, Breisacher Straße 153, 79110 Freiburg, Germany
  - Tumorbank Comprehensive Cancer Center Freiburg, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Hugstetter Straße 55, 79106 Freiburg, Germany
- Karl Rohr
  - Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Björn Andreas Grüning
  - Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
- Oliver Schilling
  - Institute of Surgical Pathology, Medical Center – University of Freiburg, Breisacher Straße 115a, 79106 Freiburg, Germany
  - Faculty of Medicine - University of Freiburg, Breisacher Straße 153, 79110 Freiburg, Germany
  - German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Hugstetter Straße 55, 79106 Freiburg, Germany
17
|
Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience 2019; 8:giz095. [PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 12/04/2018] [Revised: 05/23/2019] [Accepted: 07/17/2019] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Affiliation(s)
- Farah Zaib Khan
  - The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
  - Common Workflow Language Project
- Richard O Sinnott
  - The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Andrew Lonie
  - The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
18
|
Georgeson P, Syme A, Sloggett C, Chung J, Dashnow H, Milton M, Lonsdale A, Powell D, Seemann T, Pope B. Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. Gigascience 2019; 8:giz109. [PMID: 31544213 PMCID: PMC6755254 DOI: 10.1093/gigascience/giz109] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/29/2019] [Revised: 07/16/2019] [Accepted: 08/13/2019] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND Bioinformatics software tools are often created ad hoc, frequently by people without extensive training in software development. In particular, for beginners, the barrier to entry in bioinformatics software development is high, especially if they want to adopt good programming practices. Even experienced developers do not always follow best practices. This results in the proliferation of poorer-quality bioinformatics software, leading to limited scalability and inefficient use of resources; lack of reproducibility, usability, adaptability, and interoperability; and erroneous or inaccurate results. FINDINGS We have developed Bionitio, a tool that automates the process of starting new bioinformatics software projects following recommended best practices. With a single command, the user can create a new well-structured project in 1 of 12 programming languages. The resulting software is functional, carrying out a prototypical bioinformatics task, and thus serves as both a working example and a template for building new tools. Key features include command-line argument parsing, error handling, progress logging, defined exit status values, a test suite, a version number, standardized building and packaging, user documentation, code documentation, a standard open source software license, software revision control, and containerization. CONCLUSIONS Bionitio serves as a learning aid for beginner-to-intermediate bioinformatics programmers and provides an excellent starting point for new projects. This helps developers adopt good programming practices from the beginning of a project and encourages high-quality tools to be developed more rapidly. This also benefits users because tools are more easily installed and consistent in their usage. Bionitio is released as open source software under the MIT License and is available at https://github.com/bionitio-team/bionitio.
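Several of the key features listed above (command-line argument parsing, error handling, progress logging, defined exit status values, a version number) can be illustrated in a few lines. The sketch below is not Bionitio's generated code; the task (counting FASTA records), the option names, the version string, and the exit status values are assumptions chosen only for illustration.

```python
import argparse
import logging
import sys

PROGRAM_VERSION = "0.1.0"   # hypothetical version number
EXIT_OK = 0
EXIT_FILE_ERROR = 1         # hypothetical exit status for unreadable input


def parse_args(argv):
    """Parse command-line arguments for the example tool."""
    parser = argparse.ArgumentParser(
        description="Count the records in one or more FASTA files")
    parser.add_argument("--version", action="version", version=PROGRAM_VERSION)
    parser.add_argument("--log", metavar="LOG_FILE",
                        help="write a progress log to LOG_FILE")
    parser.add_argument("fasta_files", nargs="+", metavar="FASTA_FILE")
    return parser.parse_args(argv)


def count_records(lines):
    """Count FASTA records: each record starts with a '>' header line."""
    return sum(1 for line in lines if line.startswith(">"))


def main(argv=None):
    """Run the tool and return an exit status instead of raising."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    if args.log:
        logging.basicConfig(filename=args.log, level=logging.INFO)
    for path in args.fasta_files:
        logging.info("processing %s", path)
        try:
            with open(path) as handle:
                n_records = count_records(handle)
        except OSError as error:
            # Report the failure and stop with a defined, documented status.
            logging.error("cannot read %s: %s", path, error)
            print(f"error: cannot read {path}: {error}", file=sys.stderr)
            return EXIT_FILE_ERROR
        print(f"{path}\t{n_records}")
    return EXIT_OK
```

As a script, the module would end with `sys.exit(main())`, so that the shell sees the defined status values; keeping `main` free of direct `sys.exit` calls also makes it straightforward to cover with a test suite.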
Affiliation(s)
- Peter Georgeson
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
  - Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Victorian Comprehensive Cancer Centre, 305 Grattan Street, Melbourne, Victoria, Australia 3000
- Anna Syme
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
  - Royal Botanic Gardens Victoria, Birdwood Avenue, Melbourne, Victoria, Australia 3004
- Clare Sloggett
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Jessica Chung
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Harriet Dashnow
  - Bioinformatics, Murdoch Children's Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Victoria, Australia 3052
  - School of BioSciences, The University of Melbourne, Royal Parade, Parkville, Victoria, Australia 3052
- Michael Milton
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
  - Melbourne Genomics Health Alliance, Walter and Eliza Hall Institute, 1G Royal Parade, Parkville, Victoria, Australia 3052
- Andrew Lonsdale
  - Bioinformatics, Murdoch Children's Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Victoria, Australia 3052
  - ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, Royal Parade, Parkville, Victoria, Australia 3052
- David Powell
  - Monash Bioinformatics Platform, Biomedicine Discovery Institute, Faculty of Medicine, Nursing and Health Sciences, 15 Innovation Walk, Monash University, Clayton, Victoria, Australia 3800
- Torsten Seemann
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
  - Department of Microbiology and Immunology, Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street Melbourne, Victoria, Australia 3000
- Bernard Pope
  - Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
  - Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Victorian Comprehensive Cancer Centre, 305 Grattan Street, Melbourne, Victoria, Australia 3000
  - Department of Medicine, Central Clinical School, Monash University, Clayton, Victoria, Australia 3800
19
|
Sélem-Mojica N, Aguilar C, Gutiérrez-García K, Martínez-Guerrero CE, Barona-Gómez F. EvoMining reveals the origin and fate of natural product biosynthetic enzymes. Microb Genom 2019; 5. [PMID: 30946645 PMCID: PMC6939163 DOI: 10.1099/mgen.0.000260] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Indexed: 12/22/2022] Open
Abstract
Natural products (NPs), or specialized metabolites, are important for medicine and agriculture alike, and for the fitness of the organisms that produce them. NP genome-mining aims at extracting biosynthetic information from the genomes of microbes presumed to produce these compounds. Typically, canonical enzyme sequences from known biosynthetic systems are identified after sequence similarity searches. Despite this being an efficient process, the likelihood of identifying truly novel systems by this approach is low. To overcome this limitation, we previously introduced EvoMining, a genome-mining approach that incorporates evolutionary principles. Here, we release and use our latest EvoMining version, which includes novel visualization features and customizable databases, to analyse 42 central metabolic enzyme families (EFs) conserved throughout Actinobacteria, Cyanobacteria, Pseudomonas and Archaea. We found that expansion-and-recruitment profiles of these 42 families are lineage specific, opening the metabolic space related to ‘shell’ enzymes. These enzymes, which have been overlooked, are EFs with orthologues present in most of the genomes of a taxonomic group, but not in all. As a case study of canonical shell enzymes, we characterized the expansion and recruitment of glutamate dehydrogenase and acetolactate synthase into scytonemin biosynthesis, and into other central metabolic pathways driving Archaea and Bacteria adaptive evolution. By defining the origin and fate of enzymes, EvoMining complements traditional genome-mining approaches as an unbiased strategy and opens the door to gaining insights into the evolution of NP biosynthesis. We anticipate that EvoMining will be broadly used for evolutionary studies, and for generating predictions of unprecedented chemical scaffolds and new antibiotics. This article contains data hosted by Microreact.
Affiliation(s)
- Nelly Sélem-Mojica
  - Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
- César Aguilar
  - Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
- Christian E Martínez-Guerrero
  - Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
  - Present address: Nuclear-Mitochondrial Interaction and Paleogenomics Laboratory, Langebio, Cinvestav-IPN, Irapuato, México
- Francisco Barona-Gómez
  - Evolution of Metabolic Diversity Laboratory, Langebio, Cinvestav-IPN, Irapuato, México