1
Sola F, Ayala D, Pulido M, Ayala R, López-Cerero L, Hernández I, Ruiz D. ginmappeR: an unified approach for integrating gene and protein identifiers across biological sequence databases. Bioinformatics Advances 2024; 4:vbae129. [PMID: 39262905 PMCID: PMC11387618 DOI: 10.1093/bioadv/vbae129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 08/09/2024] [Accepted: 08/27/2024] [Indexed: 09/13/2024]
Abstract
Summary: The proliferation of biological sequence data, due to developments in molecular biology techniques, has led to the creation of numerous open access databases on gene and protein sequencing. However, the lack of direct equivalence between identifiers across these databases hinders data integration. To address this challenge, we introduce ginmappeR, an integrated R package facilitating the translation of gene and protein identifiers between databases. By providing a unified interface, ginmappeR streamlines the integration of diverse data sources into biological workflows, enhancing efficiency and user experience. Availability and implementation: Bioconductor, https://bioconductor.org/packages/ginmappeR.
Affiliation(s)
- Fernando Sola
  - SCORE Lab, DEAL, University of Seville, ETSII, 41012 Seville, Spain
- Daniel Ayala
  - SCORE Lab, DEAL, University of Seville, ETSII, 41012 Seville, Spain
- Marina Pulido
  - Department of Microbiology, University of Seville, 41009 Seville, Spain
  - Institute of Biomedicine of Seville, Virgen Macarena University Hospital, CSIC, University of Seville, 41013 Seville, Spain
  - Centro de Investigación Biomédica en Red en Enfermedades Infecciosas (CIBERINFEC), 28029 Madrid, Spain
- Rafael Ayala
  - Molecular Cryo-Electron Microscopy Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa 904-0411, Japan
- Lorena López-Cerero
  - Department of Microbiology, University of Seville, 41009 Seville, Spain
  - Institute of Biomedicine of Seville, Virgen Macarena University Hospital, CSIC, University of Seville, 41013 Seville, Spain
  - Centro de Investigación Biomédica en Red en Enfermedades Infecciosas (CIBERINFEC), 28029 Madrid, Spain
- Inma Hernández
  - SCORE Lab, DEAL, University of Seville, ETSII, 41012 Seville, Spain
- David Ruiz
  - SCORE Lab, DEAL, University of Seville, ETSII, 41012 Seville, Spain
2
Thiriveedhi VK, Krishnaswamy D, Clunie D, Pieper S, Kikinis R, Fedorov A. Cloud-based large-scale curation of medical imaging data using AI segmentation. Research Square 2024:rs.3.rs-4351526. [PMID: 38746269 PMCID: PMC11092813 DOI: 10.21203/rs.3.rs-4351526/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Rapid advances in medical imaging Artificial Intelligence (AI) offer unprecedented opportunities for automatic analysis and extraction of data from large imaging collections. Computational demands of such modern AI tools may be difficult to satisfy with the capabilities available on premises. Cloud computing offers the promise of economical access and extreme scalability. Few studies examine the price/performance tradeoffs of using the cloud, in particular for medical image analysis tasks. We investigate the use of cloud-provisioned compute resources for AI-based curation of the National Lung Screening Trial (NLST) Computed Tomography (CT) images available from the National Cancer Institute (NCI) Imaging Data Commons (IDC). We evaluated NCI Cancer Research Data Commons (CRDC) Cloud Resources - Terra (FireCloud) and Seven Bridges-Cancer Genomics Cloud (SB-CGC) platforms - to perform automatic image segmentation with TotalSegmentator and pyradiomics feature extraction for a large cohort containing >126,000 CT volumes from >26,000 patients. Utilizing >21,000 Virtual Machines (VMs) over the course of the computation we completed analysis in under 9 hours, as compared to the estimated 522 days that would be needed on a single workstation. The total cost of utilizing the cloud for this analysis was $1,011.05. Our contributions include: 1) an evaluation of the numerous tradeoffs towards optimizing the use of cloud resources for large-scale image analysis; 2) CloudSegmentator, an open source reproducible implementation of the developed workflows, which can be reused and extended; 3) practical recommendations for utilizing the cloud for large-scale medical image computing tasks. We also share the results of the analysis: the total of 9,565,554 segmentations of the anatomic structures and the accompanying radiomics features in IDC as of release v18.
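The headline figures in this abstract can be turned into rough per-unit numbers. The sketch below uses only values quoted above; because the abstract gives bounds (">126,000" volumes, "under 9 hours"), the results are bounds as well, not figures reported by the authors. The function names are illustrative.

```python
# Back-of-the-envelope arithmetic from the numbers quoted in the abstract.
# All inputs are taken from the abstract; the derived values are estimates only.

WORKSTATION_DAYS = 522      # estimated time on a single workstation
CLOUD_HOURS = 9             # upper bound: analysis finished in "under 9 hours"
TOTAL_COST_USD = 1011.05    # total reported cloud cost
CT_VOLUMES = 126_000        # lower bound: ">126,000 CT volumes"


def speedup(serial_days: float, parallel_hours: float) -> float:
    """Ratio of serial wall-clock time to parallel wall-clock time."""
    return serial_days * 24 / parallel_hours


def cost_per_item(total_cost: float, n_items: int) -> float:
    """Average cost per processed item."""
    return total_cost / n_items


if __name__ == "__main__":
    # Under 9 hours of cloud time means the true speedup is at least this value.
    print(f"speedup of at least {speedup(WORKSTATION_DAYS, CLOUD_HOURS):.0f}x")
    # More than 126,000 volumes means the true per-volume cost is at most this value.
    print(f"cost per CT volume of at most ${cost_per_item(TOTAL_COST_USD, CT_VOLUMES):.4f}")
```

Read this way, the reported run corresponds to a wall-clock speedup of roughly three orders of magnitude at a cost of under one cent per CT volume.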
3
Spake L, Hassan A, Schaffnit SB, Alam N, Amoah AS, Badjie J, Cerami C, Crampin A, Dube A, Kaye MP, Kotch R, Liew F, McLean E, Munthali-Mkandawire S, Mwalwanda L, Petersen AC, Prentice AM, Zohora FT, Watts J, Sear R, Shenk MK, Sosis R, Shaver JH. A practical guide to cross-cultural and multi-sited data collection in the biological and behavioural sciences. Proc Biol Sci 2024; 291:20231422. [PMID: 38654647 PMCID: PMC11040250 DOI: 10.1098/rspb.2023.1422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 03/21/2024] [Indexed: 04/26/2024] Open
Abstract
Researchers in the biological and behavioural sciences are increasingly conducting collaborative, multi-sited projects to address how phenomena vary across ecologies. These types of projects, however, pose additional workflow challenges beyond those typically encountered in single-sited projects. Through specific attention to cross-cultural research projects, we highlight four key aspects of multi-sited projects that must be considered during the design phase to ensure success: (1) project and team management; (2) protocol and instrument development; (3) data management and documentation; and (4) equitable and collaborative practices. Our recommendations are supported by examples from our experiences collaborating on the Evolutionary Demography of Religion project, a mixed-methods project collecting data across five countries in collaboration with research partners in each host country. To existing discourse, we contribute new recommendations around team and project management, introduce practical recommendations for exploring the validity of instruments through qualitative techniques during piloting, highlight the importance of good documentation at all steps of the project, and demonstrate how data management workflows can be strengthened through open science practices. While this project was rooted in cross-cultural human behavioural ecology and evolutionary anthropology, lessons learned from this project are applicable to multi-sited research across the biological and behavioural sciences.
Affiliation(s)
- Laure Spake
  - Binghamton University (SUNY), Binghamton, NY, USA
- Anushé Hassan
  - London School of Hygiene and Tropical Medicine, London, UK
- Nurul Alam
  - International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), Dhaka, Bangladesh
- Abena S. Amoah
  - London School of Hygiene and Tropical Medicine, London, UK
  - Malawi Epidemiology and Intervention Research Unit, Lilongwe, Malawi
  - Leiden University Medical Center, Leiden, The Netherlands
- Jainaba Badjie
  - Medical Research Council Unit The Gambia at the London School of Hygiene and Tropical Medicine (MRCG@LSHTM), Fajara, The Gambia
- Carla Cerami
  - London School of Hygiene and Tropical Medicine, London, UK
  - Medical Research Council Unit The Gambia at the London School of Hygiene and Tropical Medicine (MRCG@LSHTM), Fajara, The Gambia
- Amelia Crampin
  - International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), Dhaka, Bangladesh
  - University of Glasgow, Glasgow, UK
- Albert Dube
  - Malawi Epidemiology and Intervention Research Unit, Lilongwe, Malawi
- Miranda P. Kaye
  - Pennsylvania State University, University Park, PA, USA
  - University of Chicago, Chicago, IL, USA
- Renee Kotch
  - Pennsylvania State University, University Park, PA, USA
- Frankie Liew
  - London School of Hygiene and Tropical Medicine, London, UK
- Estelle McLean
  - London School of Hygiene and Tropical Medicine, London, UK
  - Malawi Epidemiology and Intervention Research Unit, Lilongwe, Malawi
- Lusako Mwalwanda
  - Malawi Epidemiology and Intervention Research Unit, Lilongwe, Malawi
- Andrew M. Prentice
  - London School of Hygiene and Tropical Medicine, London, UK
  - Medical Research Council Unit The Gambia at the London School of Hygiene and Tropical Medicine (MRCG@LSHTM), Fajara, The Gambia
- Fatema tuz Zohora
  - International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), Dhaka, Bangladesh
- Joseph Watts
  - University of Chicago, Chicago, IL, USA
  - University of Canterbury, Christchurch, New Zealand
- Rebecca Sear
  - London School of Hygiene and Tropical Medicine, London, UK
- Mary K. Shenk
  - Pennsylvania State University, University Park, PA, USA
- John H. Shaver
  - University of Otago, Dunedin, New Zealand
  - Baylor University, Waco, TX, USA
4
Niehues A, de Visser C, Hagenbeek FA, Kulkarni P, Pool R, Karu N, Kindt ASD, Singh G, Vermeiren RRJM, Boomsma DI, van Dongen J, ’t Hoen PAC, van Gool AJ. A multi-omics data analysis workflow packaged as a FAIR Digital Object. Gigascience 2024; 13:giad115. [PMID: 38217405 PMCID: PMC10787363 DOI: 10.1093/gigascience/giad115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 11/14/2023] [Accepted: 12/10/2023] [Indexed: 01/15/2024] Open
Abstract
BACKGROUND Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object. FINDINGS We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub. CONCLUSIONS Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
Affiliation(s)
- Anna Niehues
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
  - Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Casper de Visser
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Fiona A Hagenbeek
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Purva Kulkarni
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
  - Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
  - Department of Human Genetics, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- René Pool
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Naama Karu
  - Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
- Alida S D Kindt
  - Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
- Gurnoor Singh
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Robert R J M Vermeiren
  - Department of Child and Adolescent Psychiatry, LUMC-Curium, Leiden University Medical Center, 2342 AK Oegstgeest, The Netherlands
- Dorret I Boomsma
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
- Jenny van Dongen
  - Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
  - Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
- Peter A C ’t Hoen
  - Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Alain J van Gool
  - Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
5
Ziemann M, Poulain P, Bora A. The five pillars of computational reproducibility: bioinformatics and beyond. Brief Bioinform 2023; 24:bbad375. [PMID: 37870287 PMCID: PMC10591307 DOI: 10.1093/bib/bbad375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 09/26/2023] [Accepted: 09/30/2023] [Indexed: 10/24/2023] Open
Abstract
Computational reproducibility is a simple premise in theory, but is difficult to achieve in practice. Building upon past efforts and proposals to maximize reproducibility and rigor in bioinformatics, we present a framework called the five pillars of reproducible computational research. These include (1) literate programming, (2) code version control and sharing, (3) compute environment control, (4) persistent data sharing and (5) documentation. These practices will ensure that computational research work can be reproduced quickly and easily, long into the future. This guide is designed for bioinformatics data analysts and bioinformaticians in training, but should be relevant to other domains of study.
Affiliation(s)
- Mark Ziemann
  - Deakin University, School of Life and Environmental Sciences, Geelong, Australia
  - Burnet Institute, Melbourne, Australia
- Pierre Poulain
  - Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
- Anusuiya Bora
  - Deakin University, School of Life and Environmental Sciences, Geelong, Australia
6
Carlucci M, Bareikis T, Koncevičius K, Gibas P, Kriščiūnas A, Petronis A, Oh G. Scikick: A sidekick for workflow clarity and reproducibility during extensive data analysis. PLoS One 2023; 18:e0289171. [PMID: 37498822 PMCID: PMC10374128 DOI: 10.1371/journal.pone.0289171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Accepted: 07/13/2023] [Indexed: 07/29/2023] Open
Abstract
Reproducibility is crucial for scientific progress, yet a clear research data analysis workflow is challenging to implement and maintain. As a result, a record of computational steps performed on the data to arrive at the key research findings is often missing. We developed Scikick, a tool that eases the configuration, execution, and presentation of scientific computational analyses. Scikick allows for workflow configurations with notebooks as the units of execution, defines a standard structure for the project, automatically tracks the defined interdependencies between the data analysis steps, and implements methods to compile all research results into a cohesive final report. Utilities provided by Scikick help turn the complicated management of transparent data analysis workflows into a standardized and feasible practice. Scikick version 0.2.1 code and documentation is available as supplementary material. The Scikick software is available on GitHub (https://github.com/matthewcarlucci/scikick) and is distributed with PyPi (https://pypi.org/project/scikick/) under a GPL-3 license.
Affiliation(s)
- Matthew Carlucci
  - The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Tadas Bareikis
  - The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Karolis Koncevičius
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Povilas Gibas
  - The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Algimantas Kriščiūnas
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Art Petronis
  - The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Gabriel Oh
  - The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
  - Stanford University School of Medicine, Stanford, California, United States of America
7
Shao R, Sim A, Wu K, Kim J. Leveraging History to Predict Infrequent Abnormal Transfers in Distributed Workflows. Sensors (Basel, Switzerland) 2023; 23:5485. [PMID: 37420657 DOI: 10.3390/s23125485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2023] [Revised: 05/29/2023] [Accepted: 06/06/2023] [Indexed: 07/09/2023]
Abstract
Scientific computing heavily relies on data shared by the community, especially in distributed data-intensive applications. This research focuses on predicting slow connections that create bottlenecks in distributed workflows. In this study, we analyze network traffic logs collected between January 2021 and August 2022 at the National Energy Research Scientific Computing Center (NERSC). Based on the observed patterns, we define a set of features primarily based on history for identifying low-performing data transfers. Typically, there are far fewer slow connections on well-maintained networks, which creates difficulty in learning to identify these abnormally slow connections from the normal ones. We devise several stratified sampling techniques to address the class-imbalance challenge and study how they affect the machine learning approaches. Our tests show that a relatively simple technique that undersamples the normal cases to balance the number of samples in two classes (normal and slow) is very effective for model training. This model predicts slow connections with an F1 score of 0.926.
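The balancing technique this abstract describes, undersampling the normal cases until both classes are the same size, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the toy labels, and the fixed seed are all assumptions. A small F1 helper is included because that is the metric the abstract reports.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes so every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Size of the rarest class, e.g. the "slow" transfers.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend((sample, label) for sample in rng.sample(items, target))
    rng.shuffle(balanced)  # avoid class-ordered training data
    return [s for s, _ in balanced], [y for _, y in balanced]


def f1_score(y_true, y_pred, positive):
    """F1 score (harmonic mean of precision and recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: 95 "normal" transfers and 5 "slow" ones (hypothetical labels).
    labels = ["normal"] * 95 + ["slow"] * 5
    samples = list(range(100))
    xs, ys = undersample(samples, labels)
    print(len(xs), ys.count("normal"), ys.count("slow"))
```

Undersampling discards most of the majority class, so in practice it is applied to the training split only and the model is evaluated on data with the original class ratio.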
Affiliation(s)
- Robin Shao
  - EECS, University of California at Berkeley, Berkeley, CA 94720, USA
- Alex Sim
  - Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Kesheng Wu
  - Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Jinoh Kim
  - Computer Science Department, Texas A&M University, Commerce, TX 75428, USA
8
Licata L, Via A, Turina P, Babbi G, Benevenuta S, Carta C, Casadio R, Cicconardi A, Facchiano A, Fariselli P, Giordano D, Isidori F, Marabotti A, Martelli PL, Pascarella S, Pinelli M, Pippucci T, Russo R, Savojardo C, Scafuri B, Valeriani L, Capriotti E. Resources and tools for rare disease variant interpretation. Front Mol Biosci 2023; 10:1169109. [PMID: 37234922 PMCID: PMC10206239 DOI: 10.3389/fmolb.2023.1169109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2023] [Accepted: 04/25/2023] [Indexed: 05/28/2023] Open
Abstract
Collectively, rare genetic disorders affect a substantial portion of the world's population. In most cases, those affected face difficulties in receiving a clinical diagnosis and genetic characterization. The understanding of the molecular mechanisms of these diseases and the development of therapeutic treatments for patients are also challenging. However, the application of recent advancements in genome sequencing/analysis technologies and computer-aided tools for predicting phenotype-genotype associations can bring significant benefits to this field. In this review, we highlight the most relevant online resources and computational tools for genome interpretation that can enhance the diagnosis, clinical management, and development of treatments for rare disorders. Our focus is on resources for interpreting single nucleotide variants. Additionally, we present use cases for interpreting genetic variants in clinical settings and review the limitations of these results and prediction tools. Finally, we have compiled a curated set of core resources and tools for analyzing rare disease genomes. Such resources and tools can be utilized to develop standardized protocols that will enhance the accuracy and effectiveness of rare disease diagnosis.
Affiliation(s)
- Luana Licata
  - Department of Biology, University of Rome Tor Vergata, Roma, Italy
- Allegra Via
  - Department of Biochemical Sciences “A. Rossi Fanelli”, University of Rome “La Sapienza”, Roma, Italy
- Paola Turina
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Giulia Babbi
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Claudio Carta
  - National Centre for Rare Diseases, Istituto Superiore di Sanità, Roma, Italy
- Rita Casadio
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Andrea Cicconardi
  - Department of Physics, University of Genova, Genova, Italy
  - Istituto Italiano di Tecnologia (IIT), Genova, Italy
- Angelo Facchiano
  - National Research Council, Institute of Food Science, Avellino, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, Torino, Italy
- Deborah Giordano
  - National Research Council, Institute of Food Science, Avellino, Italy
- Federica Isidori
  - Medical Genetics Unit, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Anna Marabotti
  - Department of Chemistry and Biology “A. Zambelli”, University of Salerno, Fisciano, SA, Italy
- Pier Luigi Martelli
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Stefano Pascarella
  - Department of Biochemical Sciences “A. Rossi Fanelli”, University of Rome “La Sapienza”, Roma, Italy
- Michele Pinelli
  - Department of Molecular Medicine and Medical Biotechnology, University of Naples Federico II, Napoli, Italy
- Tommaso Pippucci
  - Medical Genetics Unit, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Roberta Russo
  - Department of Molecular Medicine and Medical Biotechnology, University of Naples Federico II, Napoli, Italy
  - CEINGE Biotecnologie Avanzate Franco Salvatore, Napoli, Italy
- Castrense Savojardo
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Bernardina Scafuri
  - Department of Chemistry and Biology “A. Zambelli”, University of Salerno, Fisciano, SA, Italy
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
9
Thompson LR, Thielen P. Decoding dissolved information: environmental DNA sequencing at global scale to monitor a changing ocean. Curr Opin Biotechnol 2023; 81:102936. [PMID: 37060640 DOI: 10.1016/j.copbio.2023.102936] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 02/06/2023] [Revised: 03/03/2023] [Accepted: 03/13/2023] [Indexed: 04/17/2023]
Abstract
The use of environmental DNA (eDNA) technology for environmental monitoring is rapidly expanding, with applications for fisheries, coral reefs, harmful algal blooms, invasive and endangered species, and biodiversity monitoring. By enabling detection of species over space and time, eDNA fulfills a fundamental need of environmental surveys. Traditional surveys are expensive, require significant capital expenditure, and can be destructive; eDNA offers promise for cheaper, less invasive, and higher-resolution (i.e. genetic) assessments of environments and stocks. However, challenges in quantification, detection limits, biobanking capacity, reference databases, and data management and integration remain significant hurdles to efficient eDNA monitoring at global and decadal scale. Here, we consider the current state of eDNA technology and its suitability for the problems for which it is being used. We explore the current best practices, the logistical and social challenges that prevent eDNA from widespread adoption and benefit, and the emerging technologies that may address those challenges.
Affiliation(s)
- Luke R Thompson
  - Northern Gulf Institute, Mississippi State University, 2 Research Blvd, Starkville, MS 39759, USA
  - Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, 4301 Rickenbacker Cswy, Miami, FL 33149, USA
- Peter Thielen
  - Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA
10
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 PMCID: PMC10204111 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
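To make the "accuracy versus memory" trade-off of data sketching concrete, here is a minimal MinHash sketch in Python. MinHash is one common sketching technique and is used here as an illustrative choice; the abstract does not say which methods the review covers in detail. The helper names, the k-mer length, the use of `hashlib.blake2b` as the hash family, and the toy sequences are all assumptions.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence (len(sequence) >= k)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash64(item, seed):
    """Deterministic 64-bit hash; varying the seed gives a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """Keep only the minimum hash value per hash function: a fixed-size sketch of the set."""
    return [min(hash64(item, seed) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching sketch positions, which estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, which needs both full sets in memory."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Two toy sequences differing at a single position (illustrative data only).
    a = kmers("ACGTACGTTAGCCGATAGCTAGCTAGGATCCGATCGATCG", k=5)
    b = kmers("ACGTACGTTAGCCGATAGCTAGCTAGGATCCGTTCGATCG", k=5)
    print("exact:", round(exact_jaccard(a, b), 3))
    for num_hashes in (16, 256):
        est = estimate_jaccard(minhash_signature(a, num_hashes), minhash_signature(b, num_hashes))
        print(f"{num_hashes} hashes:", round(est, 3))
```

A longer signature costs more memory and hashing time but reduces the variance of the estimate, which is the kind of trade-off the abstract refers to.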
Affiliation(s)
- Bonnie Berger
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yun William Yu
  - Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
  - Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
11
Krinos AI, Cohen NR, Follows MJ, Alexander H. Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly. BMC Bioinformatics 2023; 24:74. [PMID: 36869298 PMCID: PMC9983209 DOI: 10.1186/s12859-022-05121-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Accepted: 12/21/2022] [Indexed: 03/05/2023] Open
Abstract
BACKGROUND Diverse communities of microbial eukaryotes in the global ocean provide a variety of essential ecosystem services, from primary production and carbon flow through trophic transfer to cooperation via symbioses. Increasingly, these communities are being understood through the lens of omics tools, which enable high-throughput processing of diverse communities. Metatranscriptomics offers an understanding of near real-time gene expression in microbial eukaryotic communities, providing a window into community metabolic activity. RESULTS Here we present a workflow for eukaryotic metatranscriptome assembly, and validate the ability of the pipeline to recapitulate real and manufactured eukaryotic community-level expression data. We also include an open-source tool for simulating environmental metatranscriptomes for testing and validation purposes. We reanalyze previously published metatranscriptomic datasets using our metatranscriptome analysis approach. CONCLUSION We determined that a multi-assembler approach improves eukaryotic metatranscriptome assembly based on recapitulated taxonomic and functional annotations from an in-silico mock community. The systematic validation of metatranscriptome assembly and annotation methods provided here is a necessary step to assess the fidelity of our community composition measurements and functional content assignments from eukaryotic metatranscriptomes.
Collapse
Affiliation(s)
- Arianna I Krinos
- MIT-WHOI Joint Program in Oceanography and Applied Ocean Science and Engineering, Cambridge and Woods Hole, MA, USA.
- Department of Biology, Woods Hole Oceanographic Institution, Woods Hole, MA, USA.
- Department of Earth, Atmospheric, and Planetary Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Natalie R Cohen
- Skidaway Institute of Oceanography, University of Georgia, Savannah, GA, USA
| | - Michael J Follows
- Department of Earth, Atmospheric, and Planetary Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Harriet Alexander
- Department of Biology, Woods Hole Oceanographic Institution, Woods Hole, MA, USA.
12
Djaffardjy M, Marchment G, Sebe C, Blanchet R, Bellajhame K, Gaignard A, Lemoine F, Cohen-Boulakia S. Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems. Comput Struct Biotechnol J 2023; 21:2075-2085. [PMID: 36968012 PMCID: PMC10030817 DOI: 10.1016/j.csbj.2023.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 03/03/2023] [Accepted: 03/03/2023] [Indexed: 03/09/2023]
Abstract
Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows.
13
Salazar VW, Shaban B, Quiroga MDM, Turnbull R, Tescari E, Rossetto Marcelino V, Verbruggen H, Lê Cao KA. Metaphor: a workflow for streamlined assembly and binning of metagenomes. Gigascience 2023; 12:giad055. [PMID: 37522759 PMCID: PMC10388702 DOI: 10.1093/gigascience/giad055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 06/05/2023] [Accepted: 07/04/2023] [Indexed: 08/01/2023]
Abstract
Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyze genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with its own set of customizable parameters that affect the final output of the workflow. At the core of these workflows are two processes: assembly, which combines the short input reads into longer, contiguous fragments (contigs), and binning, which clusters these contigs into individual genome bins. The limitations of assembly and binning algorithms also pose different challenges depending on the selected strategy to execute them. Both of these processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data and by combining multiple binning algorithms with a bin refinement step to achieve high-quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets and the impact of available assembly and binning strategies on the final results.
Affiliation(s)
- Vinícius W Salazar
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
| | - Babak Shaban
- Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
| | - Maria del Mar Quiroga
- Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
| | - Robert Turnbull
- Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
| | - Edoardo Tescari
- Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
| | - Vanessa Rossetto Marcelino
- Department of Molecular and Translational Sciences, Monash University, Clayton, VIC 3168, Victoria, Australia
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC 3168, Victoria, Australia
- School of BioSciences, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
- Department of Microbiology and Immunology, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Parkville, VIC 3052, Victoria, Australia
| | - Heroen Verbruggen
- School of BioSciences, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
| | - Kim-Anh Lê Cao
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
14
Roach MJ, Pierce-Ward NT, Suchecki R, Mallawaarachchi V, Papudeshi B, Handley SA, Brown CT, Watson-Haigh NS, Edwards RA. Ten simple rules and a template for creating workflows-as-applications. PLoS Comput Biol 2022; 18:e1010705. [PMID: 36520686 PMCID: PMC9754251 DOI: 10.1371/journal.pcbi.1010705] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 12/23/2022]
Affiliation(s)
- Michael J. Roach
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, South Australia, Australia
| | - N. Tessa Pierce-Ward
- Department of Population Health and Reproduction, University of California, Davis, California, United States of America
| | | | - Vijini Mallawaarachchi
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, South Australia, Australia
| | - Bhavya Papudeshi
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, South Australia, Australia
| | - Scott A. Handley
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, Missouri, United States of America
| | - C. Titus Brown
- Department of Population Health and Reproduction, University of California, Davis, California, United States of America
| | | | - Robert A. Edwards
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, South Australia, Australia
15
Kimble M, Allers S, Campbell K, Chen C, Jackson LM, King BL, Silverbrand S, York G, Beard K. medna-metadata: an open-source data management system for tracking environmental DNA samples and metadata. Bioinformatics 2022; 38:4589-4597. [PMID: 35960154 PMCID: PMC9524998 DOI: 10.1093/bioinformatics/btac556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Revised: 07/23/2022] [Accepted: 08/09/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Environmental DNA (eDNA), as a rapidly expanding research field, stands to benefit from shared resources including sampling protocols, study designs, discovered sequences, and taxonomic assignments to sequences. High-quality community shareable eDNA resources rely heavily on comprehensive metadata documentation that captures the complex workflows covering field sampling, molecular biology lab work, and bioinformatic analyses. There are limited sources that provide documentation of database development on comprehensive metadata for eDNA and these workflows and no open-source software. RESULTS We present medna-metadata, an open-source, modular system that aligns with Findable, Accessible, Interoperable, and Reusable guiding principles that support scholarly data reuse and the database and application development of a standardized metadata collection structure that encapsulates critical aspects of field data collection, wet lab processing, and bioinformatic analysis. Medna-metadata is showcased with metabarcoding data from the Gulf of Maine (Polinski et al., 2019). AVAILABILITY AND IMPLEMENTATION The source code of the medna-metadata web application is hosted on GitHub (https://github.com/Maine-eDNA/medna-metadata). Medna-metadata is a docker-compose installable package. Documentation can be found at https://medna-metadata.readthedocs.io/en/latest/?badge=latest. The application is implemented in Python, PostgreSQL and PostGIS, RabbitMQ, and NGINX, with all major browsers supported. A demo can be found at https://demo.metadata.maine-edna.org/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- M Kimble
- School of Computing and Information Science, University of Maine, Orono, ME 04469, USA
| | - S Allers
- Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME 04469, USA
| | - K Campbell
- School of Computing and Information Science, University of Maine, Orono, ME 04469, USA
| | - C Chen
- School of Computing and Information Science, University of Maine, Orono, ME 04469, USA
| | - L M Jackson
- Advanced Research Computing, Security and Information Management, University of Maine, Orono, ME 04469, USA
- Maine EPSCoR, University of Maine, Orono, ME 04469, USA
| | - B L King
- Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME 04469, USA
| | - S Silverbrand
- School of Marine Sciences, University of Maine, Orono, ME 04469, USA
| | - G York
- Environmental DNA Laboratory, Coordinated Operating Research Entities, University of Maine, Orono, ME 04469, USA
| | - K Beard
- School of Computing and Information Science, University of Maine, Orono, ME 04469, USA
16
Thompson LR, Anderson SR, Den Uyl PA, Patin NV, Lim SJ, Sanderson G, Goodwin KD. Tourmaline: A containerized workflow for rapid and iterable amplicon sequence analysis using QIIME 2 and Snakemake. Gigascience 2022; 11:6651346. [PMID: 35902092 PMCID: PMC9334028 DOI: 10.1093/gigascience/giac066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/11/2021] [Revised: 02/28/2022] [Accepted: 06/15/2022] [Indexed: 12/21/2022]
Abstract
Background Amplicon sequencing (metabarcoding) is a common method to survey diversity of environmental communities whereby a single genetic locus is amplified and sequenced from the DNA of whole or partial organisms, organismal traces (e.g., skin, mucus, feces), or microbes in an environmental sample. Several software packages exist for analyzing amplicon data, among which QIIME 2 has emerged as a popular option because of its broad functionality, plugin architecture, provenance tracking, and interactive visualizations. However, each new analysis requires the user to keep track of input and output file names, parameters, and commands; this lack of automation and standardization is inefficient and creates barriers to meta-analysis and sharing of results. Findings We developed Tourmaline, a Python-based workflow that implements QIIME 2 and is built using the Snakemake workflow management system. Starting from a configuration file that defines parameters and input files—a reference database, a sample metadata file, and a manifest or archive of FASTQ sequences—it uses QIIME 2 to run either the DADA2 or Deblur denoising algorithm; assigns taxonomy to the resulting representative sequences; performs analyses of taxonomic, alpha, and beta diversity; and generates an HTML report summarizing and linking to the output files. Features include support for multiple cores, automatic determination of trimming parameters using quality scores, representative sequence filtering (taxonomy, length, abundance, prevalence, or ID), support for multiple taxonomic classification and sequence alignment methods, outlier detection, and automated initialization of a new analysis using previous settings. The workflow runs natively on Linux and macOS or via a Docker container. 
We ran Tourmaline on a 16S ribosomal RNA amplicon data set from Lake Erie surface water, showing its utility for parameter optimization and the ability to easily view interactive visualizations through the HTML report, QIIME 2 viewer, and R- and Python-based Jupyter notebooks. Conclusion Automated workflows like Tourmaline enable rapid analysis of environmental amplicon data, decreasing the time from data generation to actionable results. Tourmaline is available for download at github.com/aomlomics/tourmaline.
Affiliation(s)
- Luke R Thompson
- Northern Gulf Institute, Mississippi State University, Mississippi State, MS 39762, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, FL 33149, USA
| | - Sean R Anderson
- Northern Gulf Institute, Mississippi State University, Mississippi State, MS 39762, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, FL 33149, USA
| | - Paul A Den Uyl
- Cooperative Institute for Great Lakes Research, University of Michigan, Ann Arbor, MI 48108, USA
| | - Nastassia V Patin
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, FL 33149, USA
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL 33149, USA
| | - Shen Jean Lim
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, FL 33149, USA
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL 33149, USA
| | - Grant Sanderson
- Marine Science Department, University of Hawaii, Hilo, HI 96720, USA
| | - Kelly D Goodwin
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, FL 33149, USA
17
Wieczór M, Genna V, Aranda J, Badia RM, Gelpí JL, Gapsys V, de Groot BL, Lindahl E, Municoy M, Hospital A, Orozco M. Pre-exascale HPC approaches for molecular dynamics simulations. Covid-19 research: A use case. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL MOLECULAR SCIENCE 2022; 13:e1622. [PMID: 35935573 PMCID: PMC9347456 DOI: 10.1002/wcms.1622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Revised: 04/25/2022] [Accepted: 04/28/2022] [Indexed: 06/15/2023]
Abstract
Exascale computing has been a dream for ages and is close to becoming a reality that will impact how molecular simulations are performed, as well as the quantity and quality of the information derived from them. We review how the biomolecular simulations field is anticipating these new architectures, with emphasis on recent work from groups in the BioExcel Center of Excellence for High Performance Computing. We exemplify the power of these simulation strategies with the work done by the HPC simulation community to fight the Covid-19 pandemic. This article is categorized under: Data Science > Computer Algorithms and Programming; Data Science > Databases and Expert Systems; Molecular and Statistical Mechanics > Molecular Dynamics and Monte-Carlo Methods.
Affiliation(s)
- Miłosz Wieczór
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Department of Physical Chemistry, Gdansk University of Technology, Gdańsk, Poland
| | - Vito Genna
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Juan Aranda
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | | | - Josep Lluís Gelpí
- Barcelona Supercomputing Center, Barcelona, Spain
- Department of Biochemistry and Biomedicine, University of Barcelona, Barcelona, Spain
| | - Vytautas Gapsys
- Max Planck Institute for Multidisciplinary Sciences, Computational Biomolecular Dynamics Group, Goettingen, Germany
| | - Bert L. de Groot
- Max Planck Institute for Multidisciplinary Sciences, Computational Biomolecular Dynamics Group, Goettingen, Germany
| | - Erik Lindahl
- Department of Applied Physics, Swedish e-Science Research Center, KTH Royal Institute of Technology, Stockholm, Sweden
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
| | | | - Adam Hospital
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Modesto Orozco
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Department of Biochemistry and Biomedicine, University of Barcelona, Barcelona, Spain
18
19
Allain F, Roméjon J, La Rosa P, Jarlier F, Servant N, Hupé P. Geniac: Automatic Configuration GENerator and Installer for nextflow pipelines. OPEN RESEARCH EUROPE 2022; 1:76. [PMID: 37645091 PMCID: PMC10445886 DOI: 10.12688/openreseurope.13861.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/11/2022] [Indexed: 08/31/2023]
Abstract
With the advent of high-throughput biotechnological platforms and their ever-growing capacity, life science has turned into a digitized, computational and data-intensive discipline. As a consequence, standard analysis with a bioinformatics pipeline in routine production has become a challenge, because the data must be processed in real time and delivered to end users as fast as possible. The use of workflow management systems along with packaging systems and containerization technologies offers an opportunity to tackle this challenge. While very powerful, they can be used and combined in many ways, which may differ from one developer to another. Therefore, promoting the homogeneity of the workflow implementation requires guidelines and protocols which detail how the source code of the bioinformatics pipeline should be written and organized to ensure its usability, maintainability, interoperability, sustainability, portability, reproducibility, scalability and efficiency. Capitalizing on Nextflow, Conda, Docker, Singularity and the nf-core initiative, we propose a set of best practices along the development life cycle of the bioinformatics pipeline and deployment for production operations which target different expert communities including i) the bioinformaticians and statisticians, ii) the software engineers and iii) the data managers and core facility engineers. We implemented Geniac (Automatic Configuration GENerator and Installer for nextflow pipelines) which consists of a toolbox with three components: i) a technical documentation available at https://geniac.readthedocs.io to detail coding guidelines for the bioinformatics pipeline with Nextflow, ii) a command line interface with a linter to check that the code respects the guidelines, and iii) an add-on to generate configuration files, build the containers and deploy the pipeline.
The Geniac toolbox aims at the harmonization of development practices across developers and automation of the generation of configuration files and containers by parsing the source code of the Nextflow pipeline.
Affiliation(s)
- Fabrice Allain
- Mines Paris Tech, Fontainebleau, F-77305, France
- Institut Curie, Paris, F-75005, France
- U900, Inserm, Paris, F-75005, France
- PSL Research University, Paris, F-75005, France
| | - Julien Roméjon
- Mines Paris Tech, Fontainebleau, F-77305, France
- Institut Curie, Paris, F-75005, France
- U900, Inserm, Paris, F-75005, France
- PSL Research University, Paris, F-75005, France
| | - Philippe La Rosa
- Mines Paris Tech, Fontainebleau, F-77305, France
- Institut Curie, Paris, F-75005, France
- U900, Inserm, Paris, F-75005, France
- PSL Research University, Paris, F-75005, France
| | - Frédéric Jarlier
- Mines Paris Tech, Fontainebleau, F-77305, France
- Institut Curie, Paris, F-75005, France
- U900, Inserm, Paris, F-75005, France
- PSL Research University, Paris, F-75005, France
| | - Nicolas Servant
- Mines Paris Tech, Fontainebleau, F-77305, France
- Institut Curie, Paris, F-75005, France
- U900, Inserm, Paris, F-75005, France
- PSL Research University, Paris, F-75005, France
| | - Philippe Hupé
- Mines Paris Tech, Fontainebleau, F-77305, France
- Institut Curie, Paris, F-75005, France
- U900, Inserm, Paris, F-75005, France
- PSL Research University, Paris, F-75005, France
- UMR144, CNRS, Paris, F-75005, France
20
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform 2022; 23:6514404. [PMID: 35076693 PMCID: PMC8921630 DOI: 10.1093/bib/bbab563] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 09/09/2021] [Revised: 12/03/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022]
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.
Affiliation(s)
- Venket Raghavan
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany (corresponding author)
| | - Louis Kraft
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany (corresponding author)
21
Schatz MC, Philippakis AA, Afgan E, Banks E, Carey VJ, Carroll RJ, Culotti A, Ellrott K, Goecks J, Grossman RL, Hall IM, Hansen KD, Lawson J, Leek JT, Luria AO, Mosher S, Morgan M, Nekrutenko A, O’Connor BD, Osborn K, Paten B, Patterson C, Tan FJ, Taylor CO, Vessio J, Waldron L, Wang T, Wuichet K. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. CELL GENOMICS 2022; 2:100085. [PMID: 35199087 PMCID: PMC8863334 DOI: 10.1016/j.xgen.2021.100085] [Citation(s) in RCA: 51] [Impact Index Per Article: 25.5] [Indexed: 12/14/2022]
Abstract
The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.
Affiliation(s)
- Michael C. Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | | | - Enis Afgan
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Eric Banks
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Robert J. Carroll
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Alessandro Culotti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Translational Data Science, University of Chicago, Chicago, IL, USA
| | - Kyle Ellrott
- Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
| | - Jeremy Goecks
- Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
| | - Robert L. Grossman
- Center for Translational Data Science, University of Chicago, Chicago, IL, USA
| | - Ira M. Hall
- Yale School of Medicine, Yale University, New Haven, CT, USA
| | - Kasper D. Hansen
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Jeffrey T. Leek
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Stephen Mosher
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Martin Morgan
- Department of Biostatistics and Bioinformatics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
| | - Anton Nekrutenko
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, PA, USA
| | | | - Kevin Osborn
- UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
| | | | - Frederick J. Tan
- Department of Embryology, Carnegie Institution, Baltimore, MD, USA
| | - Casey Overby Taylor
- Departments of Medicine and Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Vessio
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Levi Waldron
- Department of Epidemiology and Biostatistics, City University of New York Graduate School of Public Health and Health Policy, New York, NY, USA
| | - Ting Wang
- Department of Genetics, Washington University of St. Louis, St. Louis, MO, USA
| | - Kristin Wuichet
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
22
Mohammadi MM, Bavi O. DNA sequencing: an overview of solid-state and biological nanopore-based methods. Biophys Rev 2021; 14:99-110. [PMID: 34840616 PMCID: PMC8609259 DOI: 10.1007/s12551-021-00857-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/31/2021] [Accepted: 10/14/2021] [Indexed: 12/23/2022]
Abstract
The field of sequencing has been a topic of significant interest since its emergence and has become increasingly important over time. Impressive achievements have been obtained in this field, especially in relation to DNA and RNA sequencing. Since the first achievements by Sanger and colleagues in the 1950s, many sequencing techniques have been developed, while others have disappeared. DNA sequencing has undergone three generations of major evolution. Each generation has its own specifications, which are described briefly. Among these generations, nanopore sequencing has its own exciting characteristics, which are given more attention here. Among the pioneering technologies used by third-generation techniques, nanopores, either biological or solid-state, have been extensively studied both experimentally and theoretically. All sequencing technologies have their own advantages and disadvantages, and nanopores are no exception. Research aimed at overcoming these obstacles is also outlined. In this review, biological and solid-state nanopores are elaborated on, and their applications are also discussed briefly.
Affiliation(s)
- Mohammad M Mohammadi
- Department of Mechanical and Aerospace Engineering, Shiraz University of Technology, Shiraz, 71557-13876 Iran
| | - Omid Bavi
- Department of Mechanical and Aerospace Engineering, Shiraz University of Technology, Shiraz, 71557-13876 Iran
23
Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. [PMID: 34556866 DOI: 10.1038/s41592-021-01254-9] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 10/20/2020] [Accepted: 07/29/2021] [Indexed: 02/08/2023]
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Affiliation(s)
| | | | - Jonathan Göke
- Genome Institute of Singapore, Singapore, Singapore.
|
24
|
Dorado G, Gálvez S, Rosales TE, Vásquez VF, Hernández P. Analyzing Modern Biomolecules: The Revolution of Nucleic-Acid Sequencing - Review. Biomolecules 2021; 11:1111. [PMID: 34439777 PMCID: PMC8393538 DOI: 10.3390/biom11081111] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/16/2021] [Revised: 07/12/2021] [Accepted: 07/23/2021] [Indexed: 02/06/2023] Open
Abstract
Recent developments have revolutionized the study of biomolecules. Among them are molecular markers and the amplification and sequencing of nucleic acids. Sequencing is classified into three generations. The first allows small DNA fragments to be sequenced. The second increases throughput while reducing turnaround time and cost, and is therefore more convenient for sequencing full genomes and transcriptomes. The third generation is currently pushing the technology to its limits: it can sequence single molecules without prior amplification, which was previously impossible. It also represents a new revolution, allowing researchers to sequence RNA directly, without prior reverse transcription. These technologies are having a significant impact on areas such as medicine, agronomy, ecology and biotechnology. Additionally, the study of biomolecules is revealing interesting evolutionary information, including what makes us human and phenomena such as non-coding RNA expansion. All this is redefining the concepts of gene and transcript. Basic analyses and applications are now facilitated by new genome-editing tools, such as CRISPR. All these developments in general, and nucleic-acid sequencing in particular, are opening an exciting new era of biomolecule analyses and applications, including personalized medicine and the diagnosis and prevention of diseases in humans and other animals.
Affiliation(s)
- Gabriel Dorado
- Dep. Bioquímica y Biología Molecular, Campus Rabanales C6-1-E17, Campus de Excelencia Internacional Agroalimentario (ceiA3), Universidad de Córdoba, 14071 Córdoba, Spain
| | - Sergio Gálvez
- Dep. Lenguajes y Ciencias de la Computación, Boulevard Louis Pasteur 35, Universidad de Málaga, 29071 Málaga, Spain;
| | - Teresa E. Rosales
- Laboratorio de Arqueobiología, Avda. Universitaria s/n, Universidad Nacional de Trujillo, 13011 Trujillo, Peru;
| | - Víctor F. Vásquez
- Centro de Investigaciones Arqueobiológicas y Paleoecológicas Andinas Arqueobios, Martínez de Companón 430-Bajo 100, Urbanización San Andres, 13088 Trujillo, Peru;
| | - Pilar Hernández
- Instituto de Agricultura Sostenible (IAS), Consejo Superior de Investigaciones Científicas (CSIC), Alameda del Obispo s/n, 14080 Córdoba, Spain;
|
25
|
Melendrez MC, Shaw S, Brown CT, Goodner BW, Kvaal C. Editorial: Curriculum Applications in Microbiology: Bioinformatics in the Classroom. Front Microbiol 2021; 12:705233. [PMID: 34276638 PMCID: PMC8281245 DOI: 10.3389/fmicb.2021.705233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2021] [Accepted: 06/07/2021] [Indexed: 11/18/2022] Open
Affiliation(s)
| | - Sophie Shaw
- Centre for Genome Enabled Biology and Medicine, University of Aberdeen, Aberdeen, United Kingdom
| | - C Titus Brown
- Department of Population Health and Reproduction, University of California, Davis, Davis, CA, United States
| | | | - Christopher Kvaal
- Department of Biology, St. Cloud State University, St. Cloud, MN, United States
|
26
|
Sahneh F, Balk MA, Kisley M, Chan CK, Fox M, Nord B, Lyons E, Swetnam T, Huppenkothen D, Sutherland W, Walls RL, Quinn DP, Tarin T, LeBauer D, Ribes D, Birnie DP, Lushbough C, Carr E, Nearing G, Fischer J, Tyle K, Carrasco L, Lang M, Rose PW, Rushforth RR, Roy S, Matheson T, Lee T, Brown CT, Teal TK, Papeș M, Kobourov S, Merchant N. Ten simple rules to cultivate transdisciplinary collaboration in data science. PLoS Comput Biol 2021; 17:e1008879. [PMID: 33983959 PMCID: PMC8118297 DOI: 10.1371/journal.pcbi.1008879] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/02/2022] Open
Affiliation(s)
- Faryad Sahneh
- Data Science Institute, University of Arizona, Tucson, Arizona, United States of America
- Computer Science Department, University of Arizona, Tucson, Arizona, United States of America
- * E-mail:
| | - Meghan A. Balk
- BIO5 Institute, University of Arizona, Tucson, Arizona, United States of America
- National Museum of Natural History, Department of Paleontology, Washington, District of Columbia, United States of America
| | - Marina Kisley
- Computer Science Department, University of Arizona, Tucson, Arizona, United States of America
| | - Chi-kwan Chan
- Data Science Institute, University of Arizona, Tucson, Arizona, United States of America
- Steward Observatory and Department of Astronomy, University of Arizona, Tucson, Arizona, United States of America
| | - Mercury Fox
- Data Science Institute, University of Arizona, Tucson, Arizona, United States of America
- CODATA Center of Excellence in Data for Society, Washington, District of Columbia, United States of America
- School of Information, University of Arizona, Tucson, Arizona, United States of America
- Native Nations Institute, University of Arizona, Tucson, Arizona, United States of America
- Center for Digital Society and Data Studies, University of Arizona, Tucson, Arizona, United States of America
| | - Brian Nord
- Fermi National Accelerator Laboratory, Batavia, Illinois, United States of America
- Kavli Institute for Cosmological Physics, University of Chicago, Chicago, Illinois, United States of America
- Department of Astronomy and Astrophysics, University of Chicago, Illinois, United States of America
| | - Eric Lyons
- BIO5 Institute, University of Arizona, Tucson, Arizona, United States of America
- School of Plant Sciences, University of Arizona, Tucson, Arizona, United States of America
- CyVerse, University of Arizona, Tucson, Arizona, United States of America
| | - Tyson Swetnam
- BIO5 Institute, University of Arizona, Tucson, Arizona, United States of America
| | - Daniela Huppenkothen
- DIRAC Institute, Department of Astronomy, University of Washington, Seattle, Washington, United States of America
- eScience Institute, University of Washington, Seattle, Washington, United States of America
| | - Will Sutherland
- Department of Human Centered Design and Engineering, University of Washington, Seattle, Washington, United States of America
| | - Ramona L. Walls
- BIO5 Institute, University of Arizona, Tucson, Arizona, United States of America
| | - Daven P. Quinn
- Department of Geoscience, University of Wisconsin–Madison, Madison, Wisconsin, United States of America
| | - Tonantzin Tarin
- Instituto de Ecología, Universidad Nacional Autónoma de México, Mexico City, Mexico
| | - David LeBauer
- College of Agriculture and Life Sciences, University of Arizona, Tucson, Arizona, United States of America
| | - David Ribes
- Department of Human Centered Design and Engineering, University of Washington, Seattle, Washington, United States of America
| | - Dunbar P. Birnie
- Department of Materials Science and Engineering, Rutgers University, Piscataway, New Jersey, United States of America
| | - Carol Lushbough
- Biomedical Engineering Department, University of South Dakota, Sioux Falls, South Dakota, United States of America
- BioSNTR, Brookings, South Dakota, United States of America
| | - Eric Carr
- National Institute for Mathematical and Biological Synthesis, University of Tennessee, Knoxville, Tennessee, United States of America
| | - Grey Nearing
- Google Research, Mountain View, California, United States of America
| | - Jeremy Fischer
- Pervasive Technology Institute, Indiana University Bloomington, Bloomington, Indiana, United States of America
- JetStream Cloud, Indiana University Bloomington, Bloomington, Indiana, United States of America
| | - Kevin Tyle
- Atmospheric & Environmental Sciences, University at Albany, Albany, New York, United States of America
| | - Luis Carrasco
- National Institute for Mathematical and Biological Synthesis, University of Tennessee, Knoxville, Tennessee, United States of America
| | - Meagan Lang
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Peter W. Rose
- San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, United States of America
| | - Richard R. Rushforth
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, Arizona, United States of America
| | - Samapriya Roy
- Planet Labs, San Francisco, California, United States of America
| | - Thomas Matheson
- NSF’s National Optical-Infrared Astronomy Research Laboratory, Tucson, Arizona, United States of America
| | - Tina Lee
- CyVerse, University of Arizona, Tucson, Arizona, United States of America
| | - C. Titus Brown
- Department of Population Health and Reproduction, University of California, Davis, Davis, California, United States of America
| | - Tracy K. Teal
- Dryad, Durham, North Carolina, United States of America
| | - Monica Papeș
- National Institute for Mathematical and Biological Synthesis, University of Tennessee, Knoxville, Tennessee, United States of America
- Ecology & Evolutionary Biology, University of Tennessee, Knoxville, Tennessee, United States of America
| | - Stephen Kobourov
- Computer Science Department, University of Arizona, Tucson, Arizona, United States of America
| | - Nirav Merchant
- Data Science Institute, University of Arizona, Tucson, Arizona, United States of America
- CyVerse, University of Arizona, Tucson, Arizona, United States of America
|
27
|
Abstract
A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.
|
28
|
Jackson M, Kavoussanakis K, Wallace EWJ. Using prototyping to choose a bioinformatics workflow management system. PLoS Comput Biol 2021; 17:e1008622. [PMID: 33630841 PMCID: PMC7906312 DOI: 10.1371/journal.pcbi.1008622] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 01/22/2023] Open
Abstract
Workflow management systems represent, manage, and execute multistep computational analyses and offer many benefits to bioinformaticians. They provide a common language for describing analysis workflows, contributing to reproducibility and to building libraries of reusable components. They can support both incremental build and re-entrancy—the ability to selectively re-execute parts of a workflow in the presence of additional inputs or changes in configuration and to resume execution from where a workflow previously stopped. Many workflow management systems enhance portability by supporting the use of containers, high-performance computing (HPC) systems, and clouds. Most importantly, workflow management systems allow bioinformaticians to delegate how their workflows are run to the workflow management system and its developers. This frees the bioinformaticians to focus on what these workflows should do, on their data analyses, and on their science. RiboViz is a package to extract biological insight from ribosome profiling data to help advance understanding of protein synthesis. At the heart of RiboViz is an analysis workflow, implemented in a Python script. To conform to best practices for scientific computing which recommend the use of build tools to automate workflows and to reuse code instead of rewriting it, the authors reimplemented this workflow within a workflow management system. To select a workflow management system, a rapid survey of available systems was undertaken, and candidates were shortlisted: Snakemake, cwltool, Toil, and Nextflow. Each candidate was evaluated by quickly prototyping a subset of the RiboViz workflow, and Nextflow was chosen. The selection process took 10 person-days, a small cost for the assurance that Nextflow satisfied the authors’ requirements. The use of prototyping can offer a low-cost way of making a more informed selection of software to use within projects, rather than relying solely upon reviews and recommendations by others.
Data analysis involves many steps, as data are wrangled, processed, and analysed using a succession of unrelated software packages. Running the right steps, in the right order, and putting the right outputs in the right places, is a major source of frustration. Workflow management systems require that each data analysis step be “wrapped” in a structured way, describing its inputs, parameters, and outputs. By writing these wrappers, the scientist can focus on the meaning of each step, and how they fit together, which is the interesting part. The system uses these wrappers to decide what steps to run and how to run these and takes charge of running the steps, including reporting on errors. This makes it much easier to repeatedly run the analysis and to run it transparently upon different computers. To select a workflow management system, we surveyed available tools and chose 4 in which we developed prototype implementations to evaluate their suitability for our project. We conclude that many similar multistep data analysis workflows can be rewritten in a workflow management system, and we advocate prototyping as a low-cost (both time and effort) way of making an informed selection of software for use within a research project.
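The idea of wrapping each step with declared inputs and outputs, and letting the system decide what to run, can be illustrated with a toy sketch. This is not how Nextflow, Snakemake, cwltool, or Toil are implemented, and it is not taken from RiboViz; the names `Step`, `run_workflow`, and `build_demo`, the file names, and the timestamp-based staleness rule are all assumptions made for illustration.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """Hypothetical wrapper: one analysis step with declared inputs and outputs."""
    name: str
    inputs: List[str]
    outputs: List[str]
    action: Callable[[], None]

    def is_stale(self) -> bool:
        # A step must run if any output is missing or older than any input.
        if not all(os.path.exists(p) for p in self.outputs):
            return True
        newest_input = max((os.path.getmtime(p) for p in self.inputs), default=0.0)
        oldest_output = min(os.path.getmtime(p) for p in self.outputs)
        return newest_input > oldest_output


def run_workflow(steps: List[Step]) -> List[str]:
    """Run steps in the listed order, skipping those whose outputs are up to date."""
    executed = []
    for step in steps:
        if step.is_stale():
            step.action()
            executed.append(step.name)
    return executed


def build_demo(workdir: str) -> List[Step]:
    """Two toy steps over files in workdir (file names are illustrative only)."""
    raw = os.path.join(workdir, "raw.txt")
    trimmed = os.path.join(workdir, "trimmed.txt")
    counts = os.path.join(workdir, "counts.txt")
    with open(raw, "w") as fh:
        fh.write("  ACGT  \n  GGCA \n")

    def trim() -> None:
        # Strip surrounding whitespace from every line of the raw file.
        with open(raw) as src, open(trimmed, "w") as dst:
            dst.writelines(line.strip() + "\n" for line in src)

    def count() -> None:
        # Write the number of lines in the trimmed file.
        with open(trimmed) as src, open(counts, "w") as dst:
            dst.write(str(sum(1 for _ in src)) + "\n")

    return [
        Step("trim", [raw], [trimmed], trim),
        Step("count", [trimmed], [counts], count),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        steps = build_demo(workdir)
        print(run_workflow(steps))  # first run: every step executes
        print(run_workflow(steps))  # second run: nothing is stale
        os.remove(os.path.join(workdir, "counts.txt"))
        print(run_workflow(steps))  # only the step with a missing output reruns
```

Because each step only declares what it reads and writes, the runner can skip finished work and resume after a missing output without the analysis code knowing about it. Real systems add much more on top of this, such as dependency graphs, containers, and HPC or cloud execution.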
Affiliation(s)
- Michael Jackson
- EPCC, The University of Edinburgh, Edinburgh, United Kingdom
- * E-mail: (MJ); (EWJW)
| | | | - Edward W. J. Wallace
- Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
- * E-mail: (MJ); (EWJW)
|