Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Goble CA, Bhagat J, Aleksejevs S, Cruickshank D, Michaelides D, Newman D, Borkum M, Bechhofer S, Roos M, Li P, De Roure D. myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res 2010;38:W677-82. [PMID: 20501605 PMCID: PMC2896080 DOI: 10.1093/nar/gkq429] [Citation(s) in RCA: 214] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open

Cited by Other Article(s)

Du X, Dastmalchi F, Diller MA, Brochhausen M, Garrett TJ, Hogan WR, Lemas DJ. An Automated Workflow Composition System for Liquid Chromatography-Mass Spectrometry Metabolomics Data Processing. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2023;34:2857-2863. [PMID: 37874901 DOI: 10.1021/jasms.3c00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/26/2023]

Djaffardjy M, Marchment G, Sebe C, Blanchet R, Bellajhame K, Gaignard A, Lemoine F, Cohen-Boulakia S. Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems. Comput Struct Biotechnol J 2023;21:2075-2085. [PMID: 36968012 PMCID: PMC10030817 DOI: 10.1016/j.csbj.2023.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Revised: 03/03/2023] [Accepted: 03/03/2023] [Indexed: 03/09/2023] Open

Diao J, Zhou Z, Xue X, Zhao D, Chen S. Bioinformatic workflow fragment discovery leveraging the social-aware knowledge graph. Front Genet 2022;13:941996. [PMID: 36092917 PMCID: PMC9459048 DOI: 10.3389/fgene.2022.941996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2022] [Accepted: 06/29/2022] [Indexed: 11/13/2022] Open

Du X, Aristizabal-Henao JJ, Garrett TJ, Brochhausen M, Hogan WR, Lemas DJ. A Checklist for Reproducible Computational Analysis in Clinical Metabolomics Research. Metabolites 2022;12:87. [PMID: 35050209 PMCID: PMC8779534 DOI: 10.3390/metabo12010087] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2021] [Revised: 12/25/2021] [Accepted: 01/10/2022] [Indexed: 12/15/2022] Open

Tangaro MA, Mandreoli P, Chiara M, Donvito G, Antonacci M, Parisi A, Bianco A, Romano A, Bianchi DM, Cangelosi D, Uva P, Molineris I, Nosi V, Calogero RA, Alessandri L, Pedrini E, Mordenti M, Bonetti E, Sangiorgi L, Pesole G, Zambelli F. Laniakea@ReCaS: exploring the potential of customisable Galaxy on-demand instances as a cloud-based service. BMC Bioinformatics 2021;22:544. [PMID: 34749633 PMCID: PMC8574934 DOI: 10.1186/s12859-021-04401-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2021] [Accepted: 09/24/2021] [Indexed: 11/16/2022] Open

Abstract

BACKGROUND

Improving the availability and usability of data and analytical tools is a critical precondition for further advancing modern biological and biomedical research. For instance, one of the many ramifications of the COVID-19 global pandemic has been to make even more evident the importance of having bioinformatics tools and data readily actionable by researchers through convenient access points and supported by adequate IT infrastructures. One of the most successful efforts in improving the availability and usability of bioinformatics tools and data is represented by the Galaxy workflow manager and its thriving community. In 2020 we introduced Laniakea, a software platform conceived to streamline the configuration and deployment of "on-demand" Galaxy instances over the cloud. By facilitating the set-up and configuration of Galaxy web servers, Laniakea provides researchers with a powerful and highly customisable platform for executing complex bioinformatics analyses. The system can be accessed through a dedicated and user-friendly web interface that allows the Galaxy web server's initial configuration and deployment.

RESULTS

"Laniakea@ReCaS", the first instance of a Laniakea-based service, is managed by ELIXIR-IT and was officially launched in February 2020, after about one year of development and testing that involved several users. Researchers can request access to Laniakea@ReCaS through an open-ended call for use-cases. Ten project proposals have been accepted since then, totalling 18 Galaxy on-demand virtual servers that employ ~ 100 CPUs, ~ 250 GB of RAM and ~ 5 TB of storage and serve several different communities and purposes. Herein, we present eight use cases demonstrating the versatility of the platform.

CONCLUSIONS

During this first year of activity, the Laniakea-based service emerged as a flexible platform that facilitated the rapid development of bioinformatics tools, the efficient delivery of training activities, and the provision of public bioinformatics services in different settings, including food safety and clinical research. Laniakea@ReCaS provides a proof of concept of how enabling access to appropriate, reliable IT resources and ready-to-use bioinformatics tools can considerably streamline researchers' work.

Affiliation(s)

Marco Antonio Tangaro Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
Pietro Mandreoli Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
Matteo Chiara Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
Giacinto Donvito National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
Marica Antonacci National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
Antonio Parisi Istituto Zooprofilattico Sperimentale Della Puglia e Della Basilicata, Via Manfredonia 20, 71121, Foggia, Italy
Angelica Bianco Istituto Zooprofilattico Sperimentale Della Puglia e Della Basilicata, Via Manfredonia 20, 71121, Foggia, Italy
Angelo Romano National Reference Laboratory for Coagulase-Positive Staphylococci Including Staphylococcus Aureus, Istituto Zooprofilattico Sperimentale del Piemonte, Liguria e Valle d'Aosta, Via Bologna 148, 10154, Turin, Italy
Daniela Manila Bianchi National Reference Laboratory for Coagulase-Positive Staphylococci Including Staphylococcus Aureus, Istituto Zooprofilattico Sperimentale del Piemonte, Liguria e Valle d'Aosta, Via Bologna 148, 10154, Turin, Italy
Davide Cangelosi Clinical Bioinformatics Unit, Scientific Direction, IRCCS Istituto Giannina Gaslini, Via Gerolamo Gaslini 5, 16147, Genova, Italy
Paolo Uva Clinical Bioinformatics Unit, Scientific Direction, IRCCS Istituto Giannina Gaslini, Via Gerolamo Gaslini 5, 16147, Genova, Italy Italian Institute of Technology, Via Morego 30, 16163, Genova, Italy
Ivan Molineris Department of Life Science and System Biology, University of Turin, Via Accademia Albertina, 13-1023, Turin, Italy
Vladimir Nosi Department of Computer Science, University of Turin, Via Pessinetto 12, 10049, Turin, Italy
Raffaele A Calogero Department of Molecular Biotechnology and Health Sciences, Via Nizza 52, 10126, Turin, Italy
Luca Alessandri Department of Molecular Biotechnology and Health Sciences, Via Nizza 52, 10126, Turin, Italy
Elena Pedrini Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
Marina Mordenti Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
Emanuele Bonetti Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy Department of Experimental Oncology, European Institute of Oncology, Via Adamello 16, 20139, Milan, Italy
Luca Sangiorgi Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
Graziano Pesole Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy. Department of Biosciences, Biotechnologies and Biopharmaceutics, University of Bari, Via Orabona 4, 70126, Bari, Italy.
Federico Zambelli Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy. Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy.

Brown AW, Aslibekyan S, Bier D, Da Silva RF, Hoover A, Klurfeld DM, Loken E, Mayo-Wilson E, Menachemi N, Pavela G, Quinn PD, Schoeller D, Tekwe C, Valdez D, Vorland CJ, Whigham LD, Allison DB. Toward more rigorous and informative nutritional epidemiology: The rational space between dismissal and defense of the status quo. Crit Rev Food Sci Nutr 2021;63:3150-3167. [PMID: 34678079 PMCID: PMC9023609 DOI: 10.1080/10408398.2021.1985427] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]

Lamprecht AL, Palmblad M, Ison J, Schwämmle V, Al Manir MS, Altintas I, Baker CJO, Ben Hadj Amor A, Capella-Gutierrez S, Charonyktakis P, Crusoe MR, Gil Y, Goble C, Griffin TJ, Groth P, Ienasescu H, Jagtap P, Kalaš M, Kasalica V, Khanteymoori A, Kuhn T, Mei H, Ménager H, Möller S, Richardson RA, Robert V, Soiland-Reyes S, Stevens R, Szaniszlo S, Verberne S, Verhoeven A, Wolstencroft K. Perspectives on automated composition of workflows in the life sciences. F1000Res 2021;10:897. [PMID: 34804501 PMCID: PMC8573700 DOI: 10.12688/f1000research.54159.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 08/27/2021] [Indexed: 12/29/2022] Open

Abstract

Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. 
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.

Affiliation(s)

Anna-Lena Lamprecht Utrecht University, 3584 CS Utrecht, The Netherlands
Magnus Palmblad Leiden University Medical Center, 2333 ZA, Leiden, The Netherlands
Jon Ison French Institute of Bioinformatics, 91057 Évry, France
Veit Schwämmle University of Southern Denmark, 5230 Odense M, Denmark
Mohammad Sadnan Al Manir University of Virginia, Charlottesville, VA, 22903, USA
Ilkay Altintas University of California San Diego, La Jolla, CA, 92093, USA
Christopher J. O. Baker University of New Brunswick, Saint John, E2L 4L5, Canada IPSNP Computing Inc., Saint John, E2L 4S6, Canada
Ammar Ben Hadj Amor Westerdijk Institute, 3584 CT, Utrecht, The Netherlands
Salvador Capella-Gutierrez Barcelona Supercomputing Center (BSC), 08034, Barcelona, Spain
Paulos Charonyktakis Gnosis Data Analysis PC, GR-700 13 Heraklion, Greece
Michael R. Crusoe VU Amsterdam, 1081 HV Amsterdam, The Netherlands
Yolanda Gil University of Southern California, Marina Del Rey, CA, 90292, USA
Carole Goble Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
Timothy J. Griffin Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, 55455, USA
Paul Groth University of Amsterdam, 1090 GH Amsterdam, The Netherlands
Hans Ienasescu Technical University of Denmark, 2800 Kongens Lyngby, Denmark
Pratik Jagtap Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, 55455, USA
Matúš Kalaš University of Bergen, 5020 Bergen, Norway
Vedran Kasalica Utrecht University, 3584 CS Utrecht, The Netherlands
Alireza Khanteymoori Bioinformatics Group, University of Freiburg, 79110 Freiburg, Germany
Tobias Kuhn VU Amsterdam, 1081 HV Amsterdam, The Netherlands
Hailiang Mei Sequencing Analysis Support Core, Leiden University Medical Center, 2333 ZC Leiden, The Netherlands
Hervé Ménager Institut Pasteur, 75015 Paris, France
Steffen Möller IBIMA, Rostock University Medical Center, 18057 Rostock, Germany
Robin A. Richardson Netherlands eScience Center, 1098 XG Amsterdam, The Netherlands
Vincent Robert Westerdijk Institute, 3584 CT, Utrecht, The Netherlands
Stian Soiland-Reyes Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK Informatics Institute, University of Amsterdam, 1090 GH Amsterdam, The Netherlands
Robert Stevens Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
Szoke Szaniszlo Westerdijk Institute, 3584 CT, Utrecht, The Netherlands
Suzan Verberne Leiden Institute of Advanced Computer Science, Leiden University, 2333 BE Leiden, The Netherlands
Aswin Verhoeven Leiden University Medical Center, 2333 ZA, Leiden, The Netherlands
Katherine Wolstencroft Leiden Institute of Advanced Computer Science, Leiden University, 2333 BE Leiden, The Netherlands

Samota EK, Davey RP. Knowledge and Attitudes Among Life Scientists Toward Reproducibility Within Journal Articles: A Research Survey. Front Res Metr Anal 2021;6:678554. [PMID: 34268467 PMCID: PMC8276979 DOI: 10.3389/frma.2021.678554] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2021] [Accepted: 05/18/2021] [Indexed: 12/22/2022] Open

Abstract

We constructed a survey to understand how authors and scientists view the issues around reproducibility, focusing on interactive elements such as interactive figures embedded within online publications, as a solution for enabling the reproducibility of experiments. We report the views of 251 researchers, comprising authors who have published in eLIFE Sciences, and those who work at the Norwich Biosciences Institutes (NBI). The survey also outlines to what extent researchers are occupied with reproducing experiments themselves. Currently, there is an increasing range of tools that attempt to address the production of reproducible research by making code, data, and analyses available to the community for reuse. We wanted to collect information about attitudes around the consumer end of the spectrum, where life scientists interact with research outputs to interpret scientific results. Static plots and figures within articles are a central part of this interpretation, and therefore we asked respondents to consider various features for an interactive figure within a research article that would allow them to better understand and reproduce a published analysis. The majority (91%) of respondents reported that when authors describe their research methodology (methods and analyses) in detail, published research can become more reproducible. The respondents believe that having interactive figures in published papers is a beneficial element to themselves, the papers they read as well as to their readers. Whilst interactive figures are one potential solution for consuming the results of research more effectively to enable reproducibility, we also review the equally pressing technical and cultural demands on researchers that need to be addressed to achieve greater success in reproducibility in the life sciences.

Porubsky V, Smith L, Sauro HM. Publishing reproducible dynamic kinetic models. Brief Bioinform 2021;22:bbaa152. [PMID: 32793969 PMCID: PMC8138891 DOI: 10.1093/bib/bbaa152] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2020] [Revised: 05/19/2020] [Accepted: 06/17/2020] [Indexed: 11/14/2022] Open

Choi K, Karr JR, Sauro HM. Status and Challenges of Reproducibility in Computational Systems and Synthetic Biology. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11525-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022] Open

Bartley BA, Beal J, Karr JR, Strychalski EA. Organizing genome engineering for the gigabase scale. Nat Commun 2020;11:689. [PMID: 32019919 PMCID: PMC7000699 DOI: 10.1038/s41467-020-14314-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2019] [Accepted: 12/18/2019] [Indexed: 12/11/2022] Open

Harjes J, Link A, Weibulat T, Triebel D, Rambold G. FAIR digital objects in environmental and life sciences should comprise workflow operation design data and method information for repeatability of study setups and reproducibility of results. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020;2020:5894776. [PMID: 32815545 PMCID: PMC7439577 DOI: 10.1093/database/baaa059] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/19/2020] [Revised: 07/01/2020] [Accepted: 07/07/2020] [Indexed: 12/23/2022]

Abstract

Repeatability of study setups and reproducibility of research results by underlying data are major requirements in science. Until now, abstract models for describing the structural logic of studies in environmental sciences are lacking and tools for data management are insufficient. Mandatory for repeatability and reproducibility is the use of sophisticated data management solutions going beyond data file sharing. Particularly, it implies maintenance of coherent data along workflows. Design data concern elements from elementary domains of operations being transformation, measurement and transaction. Operation design elements and method information are specified for each consecutive workflow segment from field to laboratory campaigns. The strict linkage of operation design element values, operation values and objects is essential. For enabling coherence of corresponding objects along consecutive workflow segments, the assignment of unique identifiers and the specification of their relations are mandatory. The abstract model presented here addresses these aspects, and the software DiversityDescriptions (DWB-DD) facilitates the management of thusly connected digital data objects and structures. DWB-DD allows for an individual specification of operation design elements and their linking to objects. Two workflow design use cases, one for DNA barcoding and another for cultivation of fungal isolates, are given. To publish those structured data, standard schema mapping and XML-provision of digital objects are essential. Schemas useful for this mapping include the Ecological Markup Language, the Schema for Meta-omics Data of Collection Objects and the Standard for Structured Descriptive Data. Data pipelines with DWB-DD include the mapping and conversion between schemas and functions for data publishing and archiving according to the Open Archival Information System standard. 
The setting allows for repeatability of study setups, reproducibility of study results and for supporting work groups to structure and maintain their data from the beginning of a study. The theory of ‘FAIR++’ digital objects is introduced.

Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience 2019;8:giz095. [PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2018] [Revised: 05/23/2019] [Accepted: 07/17/2019] [Indexed: 01/22/2023] Open

Abstract

BACKGROUND

The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.

RESULTS

Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.

CONCLUSIONS

The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.

Andrio P, Hospital A, Conejero J, Jordá L, Del Pino M, Codo L, Soiland-Reyes S, Goble C, Lezzi D, Badia RM, Orozco M, Gelpi JL. BioExcel Building Blocks, a software library for interoperable biomolecular simulation workflows. Sci Data 2019;6:169. [PMID: 31506435 PMCID: PMC6736963 DOI: 10.1038/s41597-019-0177-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2019] [Accepted: 08/16/2019] [Indexed: 12/26/2022] Open

Malandrino D, Manno I, Negro A, Petta A, Serra L, Cantarella C, Scarano V. Social support for collaboration and group awareness in life science research teams. SOURCE CODE FOR BIOLOGY AND MEDICINE 2019;14:4. [PMID: 31320922 PMCID: PMC6615102 DOI: 10.1186/s13029-019-0074-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/15/2015] [Accepted: 07/01/2019] [Indexed: 11/10/2022]

Karim MR, Michel A, Zappa A, Baranov P, Sahay R, Rebholz-Schuhmann D. Improving data workflow systems with cloud services and use of open data for bioinformatics research. Brief Bioinform 2019;19:1035-1050. [PMID: 28419324 PMCID: PMC6169675 DOI: 10.1093/bib/bbx039] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2016] [Indexed: 11/22/2022] Open

Enabling precision medicine via standard communication of HTS provenance, analysis, and results. PLoS Biol 2018;16:e3000099. [PMID: 30596645 PMCID: PMC6338479 DOI: 10.1371/journal.pbio.3000099] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 01/18/2019] [Indexed: 11/30/2022] Open

Abstract

A personalized approach based on a patient's or pathogen’s unique genomic sequence is the foundation of precision medicine. Genomic findings must be robust and reproducible, and experimental data capture should adhere to findable, accessible, interoperable, and reusable (FAIR) guiding principles. Moreover, effective precision medicine requires standardized reporting that extends beyond wet-lab procedures to computational methods. The BioCompute framework (https://w3id.org/biocompute/1.3.0) enables standardized reporting of genomic sequence data provenance, including provenance domain, usability domain, execution domain, verification kit, and error domain. This framework facilitates communication and promotes interoperability. Bioinformatics computation instances that employ the BioCompute framework are easily relayed, repeated if needed, and compared by scientists, regulators, test developers, and clinicians. Easing the burden of performing the aforementioned tasks greatly extends the range of practical application. Large clinical trials, precision medicine, and regulatory submissions require a set of agreed upon standards that ensures efficient communication and documentation of genomic analyses. The BioCompute paradigm and the resulting BioCompute Objects (BCOs) offer that standard and are freely accessible as a GitHub organization (https://github.com/biocompute-objects) following the “Open-Stand.org principles for collaborative open standards development.” With high-throughput sequencing (HTS) studies communicated using a BCO, regulatory agencies (e.g., Food and Drug Administration [FDA]), diagnostic test developers, researchers, and clinicians can expand collaboration to drive innovation in precision medicine, potentially decreasing the time and cost associated with next-generation sequencing workflow exchange, reporting, and regulatory reviews.

This Community Page article presents a communication standard for the provenance of high-throughput sequencing data; a BioCompute Object (BCO) can serve as a history of what was computed, be used as part of a validation process, or provide clarity and transparency of an experimental process to collaborators.

Mondelli ML, Magalhães T, Loss G, Wilde M, Foster I, Mattoso M, Katz D, Barbosa H, de Vasconcelos ATR, Ocaña K, Gadelha LMR. BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments. PeerJ 2018;6:e5551. [PMID: 30186700 PMCID: PMC6119457 DOI: 10.7717/peerj.5551] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2018] [Accepted: 08/07/2018] [Indexed: 11/20/2022] Open

Misra BB, Langefeld CD, Olivier M, Cox LA. Integrated Omics: Tools, Advances, and Future Approaches. J Mol Endocrinol 2018;62:JME-18-0055. [PMID: 30006342 DOI: 10.1530/jme-18-0055] [Citation(s) in RCA: 220] [Impact Index Per Article: 36.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/24/2018] [Revised: 07/02/2018] [Accepted: 07/12/2018] [Indexed: 12/13/2022]

Naldi A, Hernandez C, Levy N, Stoll G, Monteiro PT, Chaouiya C, Helikar T, Zinovyev A, Calzone L, Cohen-Boulakia S, Thieffry D, Paulevé L. The CoLoMoTo Interactive Notebook: Accessible and Reproducible Computational Analyses for Qualitative Biological Networks. Front Physiol 2018;9:680. [PMID: 29971009 PMCID: PMC6018415 DOI: 10.3389/fphys.2018.00680] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2018] [Accepted: 05/15/2018] [Indexed: 01/07/2023] Open

Affiliation(s)

Aurélien Naldi Computational Systems Biology Team, Institut de Biologie de l'École Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, Institut National de la Santé et de la Recherche Médicale U1024, École Normale Supérieure, PSL Université, Paris, France
Céline Hernandez Computational Systems Biology Team, Institut de Biologie de l'École Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, Institut National de la Santé et de la Recherche Médicale U1024, École Normale Supérieure, PSL Université, Paris, France
Nicolas Levy Laboratoire de Recherche en Informatique UMR8623, Université Paris-Sud, Centre National de la Recherche Scientifique, Université Paris-Saclay, Orsay, France École Normale Supérieure de Lyon, Lyon, France
Gautier Stoll Université Paris Descartes/Paris V, Sorbonne Paris Cité, Paris, France Équipe 11 Labellisée Ligue Nationale Contre le Cancer, Centre de Recherche des Cordeliers, Paris, France Institut National de la Santé et de la Recherche Médicale, U1138, Paris, France Université Pierre et Marie Curie, Paris, France Metabolomics and Cell Biology Platforms, Gustave Roussy Cancer, Villejuif, France
Pedro T. Monteiro INESC-ID/Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal
Claudine Chaouiya Instituto Gulbenkian de Ciência, Oeiras, Portugal
Tomáš Helikar Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, United States
Andrei Zinovyev Institut Curie, PSL Research University, Paris, France Institut National de la Santé et de la Recherche Médicale, U900, Paris, France MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France Lobachevsky University, Nizhni Novgorod, Russia
Laurence Calzone Institut Curie, PSL Research University, Paris, France Institut National de la Santé et de la Recherche Médicale, U900, Paris, France MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
Sarah Cohen-Boulakia Laboratoire de Recherche en Informatique UMR8623, Université Paris-Sud, Centre National de la Recherche Scientifique, Université Paris-Saclay, Orsay, France
Denis Thieffry Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, Institut National de la Santé et de la Recherche Médicale U1024, École Normale Supérieure, PSL Université, Paris, France
Loïc Paulevé Laboratoire de Recherche en Informatique UMR8623, Université Paris-Sud, Centre National de la Recherche Scientifique, Université Paris-Saclay, Orsay, France

Halioui A, Valtchev P, Diallo AB. Bioinformatic workflow extraction from scientific texts based on word sense disambiguation and relation extraction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018;15:1979-1990. [PMID: 29994265 DOI: 10.1109/tcbb.2018.2847336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]

Taghiyar MJ, Rosner J, Grewal D, Grande BM, Aniba R, Grewal J, Boutros PC, Morin RD, Bashashati A, Shah SP. Kronos: a workflow assembler for genome analytics and informatics. Gigascience 2018;6:1-10. [PMID: 28655203 PMCID: PMC5569921 DOI: 10.1093/gigascience/gix042] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2017] [Accepted: 06/07/2017] [Indexed: 11/25/2022] Open

Affiliation(s)

M Jafar Taghiyar Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
Jamie Rosner Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada
Diljot Grewal Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
Bruno M Grande Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, V5A 1S6 Burnaby, BC, Canada
Radhouane Aniba Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
Jasleen Grewal Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, V5A 1S6 Burnaby, BC, Canada
Paul C Boutros Ontario Institute for Cancer Research (OICR), 661 University Avenue, M5G 0A3 Toronto, ON, Canada; Department of Medical Biophysics, University of Toronto, 101 College Street, M5G 1L7 Toronto, ON, Canada
Ryan D Morin Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, V5A 1S6 Burnaby, BC, Canada
Ali Bashashati Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
Sohrab P Shah Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada

Thanki AS, Soranzo N, Haerty W, Davey RP. GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline. Gigascience 2018;7:1-10. [PMID: 29425291 PMCID: PMC5863215 DOI: 10.1093/gigascience/giy005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2017] [Revised: 07/31/2017] [Accepted: 01/18/2018] [Indexed: 11/13/2022] Open

Abstract

Background

Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of gene families has often been associated with morphological, physiological, and environmental adaptations. The study of homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancestral gene duplication events as well as identifying genes that have diverged from a common ancestor under positive selection. There are various tools available, such as MSOAR, OrthoMCL, and HomoloGene, to identify gene families and visualize syntenic information between species, providing an overview of the evolution of syntenic regions at the family level. Unfortunately, none of them provide information about structural changes within genes, such as the conservation of ancestral exon boundaries among multiple genomes. The Ensembl GeneTrees computational pipeline generates gene trees based on coding sequences, provides details about exon conservation, and is used in the Ensembl Compara project to discover gene families.

Findings

A certain amount of expertise is required to configure and run the Ensembl Compara GeneTrees pipeline via command line. Therefore, we converted this pipeline into a Galaxy workflow, called GeneSeqToFamily, and provided additional functionality. This workflow uses existing tools from the Galaxy ToolShed, as well as providing additional wrappers and tools that are required to run the workflow.

Conclusions

GeneSeqToFamily represents the Ensembl GeneTrees pipeline as a set of interconnected Galaxy tools, so they can be run interactively within Galaxy's user-friendly workflow environment while still providing the flexibility to tailor the analysis by changing configurations and tools if necessary. Additional tools allow users to subsequently visualize the gene families produced by the workflow, using the Aequatus.js interactive tool, which has been developed as part of the Aequatus software project.

Pfeuffer J, Sachsenberg T, Alka O, Walzer M, Fillbrunn A, Nilse L, Schilling O, Reinert K, Kohlbacher O. OpenMS – A platform for reproducible analysis of mass spectrometry data. J Biotechnol 2017;261:142-148. [DOI: 10.1016/j.jbiotec.2017.05.016] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2017] [Revised: 05/17/2017] [Accepted: 05/22/2017] [Indexed: 10/19/2022]

Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Trop Med Health 2017;45:33. [PMID: 29093641 PMCID: PMC5658975 DOI: 10.1186/s41182-017-0073-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2017] [Accepted: 10/12/2017] [Indexed: 01/05/2023] Open

Project Data Management Planning. ECOL INFORM 2017. [DOI: 10.1007/978-3-319-59928-1_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]

Wollmann T, Erfle H, Eils R, Rohr K, Gunkel M. Workflows for microscopy image analysis and cellular phenotyping. J Biotechnol 2017;261:70-75. [PMID: 28757289 DOI: 10.1016/j.jbiotec.2017.07.019] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2017] [Revised: 07/18/2017] [Accepted: 07/21/2017] [Indexed: 10/19/2022]

Mısırlı G, Madsen C, Murieta IS, Bultelle M, Flanagan K, Pocock M, Hallinan J, McLaughlin JA, Clark‐Casey J, Lyne M, Micklem G, Stan G, Kitney R, Wipat A. Constructing synthetic biology workflows in the cloud. ENGINEERING BIOLOGY 2017. [DOI: 10.1049/enb.2017.0001] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open

Grüning BA, Fallmann J, Yusuf D, Will S, Erxleben A, Eggenhofer F, Houwaart T, Batut B, Videm P, Bagnacani A, Wolfien M, Lott SC, Hoogstrate Y, Hess WR, Wolkenhauer O, Hoffmann S, Akalin A, Ohler U, Stadler PF, Backofen R. The RNA workbench: best practices for RNA and high-throughput sequencing bioinformatics in Galaxy. Nucleic Acids Res 2017;45:W560-W566. [PMID: 28582575 PMCID: PMC5570170 DOI: 10.1093/nar/gkx409] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2017] [Revised: 04/13/2017] [Accepted: 05/31/2017] [Indexed: 01/23/2023] Open

Affiliation(s)

Björn A. Grüning Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany Center for Biological Systems Analysis (ZBSA), University of Freiburg, Habsburgerstr. 49, D-79104 Freiburg, Germany
Jörg Fallmann Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany
Dilmurat Yusuf Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, D-13125, Berlin, Germany
Sebastian Will Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria
Anika Erxleben Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
Florian Eggenhofer Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
Torsten Houwaart Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
Bérénice Batut Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
Pavankumar Videm Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
Andrea Bagnacani Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstr. 69, D-18051 Rostock, Germany
Markus Wolfien Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstr. 69, D-18051 Rostock, Germany
Steffen C. Lott Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schänzlestr. 1, D-79104 Freiburg, Germany
Youri Hoogstrate Department of Urology, Erasmus University Medical Center, Wytemaweg 80, 3015 CN Rotterdam, Netherlands
Wolfgang R. Hess Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schänzlestr. 1, D-79104 Freiburg, Germany
Olaf Wolkenhauer Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstr. 69, D-18051 Rostock, Germany
Steve Hoffmann Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany
Altuna Akalin Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, D-13125, Berlin, Germany
Uwe Ohler Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, D-13125, Berlin, Germany Departments of Biology and Computer Science, Humboldt University, Unter den Linden 6, D-10099 Berlin
Peter F. Stadler Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, D-04103 Leipzig, Germany Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
Rolf Backofen Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany Center for Biological Systems Analysis (ZBSA), University of Freiburg, Habsburgerstr. 49, D-79104 Freiburg, Germany BIOSS Centre for Biological Signaling Studies, University of Freiburg, Schänzlestr. 18, D-79104 Freiburg, Germany

Creation of Reusable Bioinformatics Workflows for Reproducible Analysis of LC-MS Proteomics Data. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/978-1-4939-7119-0_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]

Urdidiales‐Nieto D, Navas‐Delgado I, Aldana‐Montes JF. Biological Web Service Repositories Review. Mol Inform 2017;36:1600035. [PMID: 27783459 PMCID: PMC5434852 DOI: 10.1002/minf.201600035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2016] [Accepted: 09/27/2016] [Indexed: 12/26/2022]

Hoogstrate Y, Zhang C, Senf A, Bijlard J, Hiltemann S, van Enckevort D, Repo S, Heringa J, Jenster G, J A Fijneman R, Boiten JW, A Meijer G, Stubbs A, Rambla J, Spalding D, Abeln S. Integration of EGA secure data access into Galaxy. F1000Res 2017;5. [PMID: 28232859 PMCID: PMC5302147 DOI: 10.12688/f1000research.10221.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 11/30/2016] [Indexed: 12/31/2022] Open

Abstract

High-throughput molecular profiling techniques are routinely generating vast amounts of data for translational medicine studies. Secure, access-controlled systems are needed to manage, store, transfer and distribute these data due to their personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access to and management of long-term archival of bio-molecular data. Each data provider is responsible for ensuring a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure (http://www.ctmm-trait.nl) as an example. Here we present the first outcomes of this project, a framework to enable the download of EGA data to a Galaxy server in a secure way. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that can download data securely from EGA into a Galaxy server, where they can subsequently be further processed. This tool will allow a user within the browser to run an entire analysis containing sensitive data from EGA, and to make this analysis available for other researchers in a reproducible manner, as shown with a proof of concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.

From the evaluation of existing solutions to an all-inclusive package for biobanks. HEALTH AND TECHNOLOGY 2017;7:89-95. [PMID: 28344915 PMCID: PMC5346419 DOI: 10.1007/s12553-016-0175-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2016] [Accepted: 12/19/2016] [Indexed: 11/26/2022]

Abstract

The domain of biobanking has gone through many stages and as a result there is a wide range of commercial and open source software solutions available. The utilization of these software tools requires different levels of domain and technical skills for installation, configuration and ultimate use of these biobank software tools. To compound this complexity, the biobanking community is required to work together in order to share knowledge and jointly build solutions to underpin the research infrastructure. We have evaluated the available tools, described them in a catalogue (BiobankApps) and made a selection of tools available to biobanks in a reference toolbox (BIBBOX) that are use-case driven. In the BiobankApps tool catalogue, both commercial and open source software solutions related to the biobanking domain are included, classified and evaluated. The evaluation covers: 1) “user review” by an authenticated user, 2) domain expert: quick analysis by BBMRI members, and 3) domain expert: detailed analysis and test installation with real world data. The evaluation is paired with a survey across the more “advanced” (from a technology perspective) biobanks to investigate what tools are currently used and summarises known benefits/drawbacks of the respective packages. In the second step we recommend tools for specific use cases, and install, configure and connect these in the BIBBOX framework. This service also builds on the existing work in the United Kingdom in seeking to establish the motivations for different stakeholders to become involved and therefore assisting in prioritising the use-cases based on the level of need and support within the research community. All tools associated with a use-case are available as BIBBOX applications (technically this is achieved by Docker containers), which are integrated in the BIBBOX framework with central identification and user management.
In future work we plan to share the acquired knowledge with other networks, develop an Application Programmable Interface (API) for the exchange of metadata with other tool catalogues and work on an ontology for the evaluation of biobank software.

Exploring Protein-Protein Interactions as Drug Targets for Anti-cancer Therapy with In Silico Workflows. Methods Mol Biol 2017;1647:221-236. [PMID: 28809006 DOI: 10.1007/978-1-4939-7201-2_15] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]

White Paper on Research Data Service Discoverability. PUBLICATIONS 2016. [DOI: 10.3390/publications5010001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open

Simonyan V, Goecks J, Mazumder R. Biocompute Objects-A Step towards Evaluation and Validation of Biomedical Scientific Computations. PDA J Pharm Sci Technol 2016;71:136-146. [PMID: 27974626 DOI: 10.5731/pdajpst.2016.006734] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]

Abstract

The unpredictability of actual physical, chemical, and biological experiments due to the multitude of environmental and procedural factors is well documented. What is systematically overlooked, however, is that computational biology algorithms are also affected by a multiplicity of parameters and have no lesser volatility. The complexities of computation protocols and interpretation of outcomes are only a part of the challenge: There are also virtually no standardized and industry-accepted metadata schemas for reporting the computational objects that record the parameters used for computations together with the results of computations. Thus, it is often impossible to reproduce the results of a previously performed computation due to missing information on parameters, versions, arguments, conditions, and procedures of application launch. In this article we describe the concept of biocompute objects developed specifically to satisfy regulatory research needs for evaluation, validation, and verification of bioinformatics pipelines. We envision generalized versions of biocompute objects called biocompute templates that support a single class of analyses but can be adapted to meet unique needs. To make these templates widely usable, we outline a simple but powerful cross-platform implementation. We also discuss the reasoning and potential usability for such a concept within the larger scientific community through the creation of a biocompute object database initially consisting of records relevant to the U.S. Food and Drug Administration. A biocompute object database record will be similar to a GenBank record in form; the difference being that instead of describing a sequence, the biocompute record will include information related to parameters, dependencies, usage, and other information related to a specific computational instance.
This mechanism will extend similar efforts and also serve as a collaborative ground to ensure interoperability between different platforms, industries, scientists, regulators, and other stakeholders interested in biocomputing.

Guler AT, Waaijer CJ, Mohammed Y, Palmblad M. Automating bibliometric analyses using Taverna scientific workflows: A tutorial on integrating Web Services. J Informetr 2016. [DOI: 10.1016/j.joi.2016.05.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]

Barnett CB, Aoki-Kinoshita KF, Naidoo KJ. The Glycome Analytics Platform: an integrative framework for glycobioinformatics. Bioinformatics 2016;32:3005-11. [PMID: 27288496 DOI: 10.1093/bioinformatics/btw341] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2015] [Accepted: 05/26/2016] [Indexed: 11/13/2022] Open

Abstract

MOTIVATION

Complex carbohydrates play a central role in cellular communication and in disease development. O- and N-glycans, which are post-translationally attached to proteins and lipids, are sugar chains that are rooted tree structures. Independent efforts to develop computational tools for analyzing complex carbohydrate structures have been designed to exploit specific databases requiring unique formatting and limited transferability. Attempts have been made at integrating these resources, yet it remains difficult to communicate and share data across several online resources. A disadvantage of the lack of coordination between development efforts is the inability of the user community to create reproducible analyses (workflows). The latter results in the more serious unreliability of glycomics metadata.

RESULTS

In this paper, we realize the significance of connecting multiple online glycan resources that can be used to design reproducible experiments for obtaining, generating and analyzing cell glycomes. To address this, a suite of tools and utilities has been integrated into the analytic functionality of the Galaxy bioinformatics platform to provide a Glycome Analytics Platform (GAP). Using this platform, users can design in silico workflows to manipulate various formats of glycan sequences and analyze glycomes through access to web data and services. We illustrate the central functionality and features of the GAP by way of example; we analyze and compare the features of the N-glycan glycome of monocytic cells sourced from two separate data depositions. This paper highlights the use of reproducible research methods for glycomics analysis and the GAP presents an opportunity for integrating tools in glycobioinformatics.

AVAILABILITY AND IMPLEMENTATION

This software is open-source and available online at https://bitbucket.org/scientificomputing/glycome-analytics-platform

CONTACTS

chris.barnett@uct.ac.za or kevin.naidoo@uct.ac.za

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Digles D, Zdrazil B, Neefs JM, Van Vlijmen H, Herhaus C, Caracoti A, Brea J, Roibás B, Loza MI, Queralt-Rosinach N, Furlong LI, Gaulton A, Bartek L, Senger S, Chichester C, Engkvist O, Evelo CT, Franklin NI, Marren D, Ecker GF, Jacoby E. Open PHACTS computational protocols for in silico target validation of cellular phenotypic screens: knowing the knowns. MEDCHEMCOMM 2016;7:1237-1244. [PMID: 27774140 PMCID: PMC5063042 DOI: 10.1039/c6md00065g] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/01/2016] [Accepted: 05/10/2016] [Indexed: 01/09/2023]

Affiliation(s)

D Digles Department of Pharmaceutical Chemistry, University of Vienna, Pharmacoinformatics Research Group, Althanstraße 14, 1090 Wien, Austria
B Zdrazil Department of Pharmaceutical Chemistry, University of Vienna, Pharmacoinformatics Research Group, Althanstraße 14, 1090 Wien, Austria
J-M Neefs Janssen Research & Development, Turnhoutseweg 30, B-2340 Beerse, Belgium
H Van Vlijmen Janssen Research & Development, Turnhoutseweg 30, B-2340 Beerse, Belgium
C Herhaus Merck KGaA, Merck Serono R&D, Computational Chemistry, Frankfurter Straße 250, 64293 Darmstadt, Germany
A Caracoti BIOVIA, a Dassault Systèmes brand, 334 Cambridge Science Park, Cambridge CB4 0WN, UK
J Brea Grupo BioFarma-USEF, Departamento de Farmacología, Facultad de Farmacia, Campus Universitario Sur s/n, 15782 Santiago de Compostela, Spain
B Roibás Grupo BioFarma-USEF, Departamento de Farmacología, Facultad de Farmacia, Campus Universitario Sur s/n, 15782 Santiago de Compostela, Spain
M I Loza Grupo BioFarma-USEF, Departamento de Farmacología, Facultad de Farmacia, Campus Universitario Sur s/n, 15782 Santiago de Compostela, Spain
N Queralt-Rosinach Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
L I Furlong Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
A Gaulton European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
L Bartek GlaxoSmithKline, Medicines Research Centre, Stevenage SG1 2NY, UK
S Senger GlaxoSmithKline, Medicines Research Centre, Stevenage SG1 2NY, UK
C Chichester Swiss Institute of Bioinformatics, CALIPHO Group, CMU Rue Michel-Servet 1, 1211 Geneva 4, Switzerland; Nestlé Institute of Health Sciences SA, EPFL Innovation Park, Bâtiment H, 1015 Lausanne, Switzerland
O Engkvist Chemistry Innovation Centre, Discovery Sciences, AstraZeneca R&D Gothenburg, SE-431 83 Mölndal, Sweden
C T Evelo Department of Bioinformatics - BiGCaT, P.O. Box 616, UNS50 Box19, NL-6200MD Maastricht, The Netherlands
N I Franklin Open Innovation Drug Discovery, Discovery Chemistry Eli Lilly and Company, Lilly Corporate Center, DC 1920, Indianapolis, IN 46285, USA
D Marren Eli Lilly and Company Ltd., Lilly Research Centre, Erl Wood Manor, Sunninghill Road, Windlesham, Surrey GU20 6PH, England, UK
G F Ecker Department of Pharmaceutical Chemistry, University of Vienna, Pharmacoinformatics Research Group, Althanstraße 14, 1090 Wien, Austria
E Jacoby Janssen Research & Development, Turnhoutseweg 30, B-2340 Beerse, Belgium

de la Garza L, Veit J, Szolek A, Röttig M, Aiche S, Gesing S, Reinert K, Kohlbacher O. From the desktop to the grid: scalable bioinformatics via workflow conversion. BMC Bioinformatics 2016;17:127. [PMID: 26968893 PMCID: PMC4788856 DOI: 10.1186/s12859-016-0978-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2015] [Accepted: 03/03/2016] [Indexed: 01/04/2023] Open

Abstract

Background

Reproducibility is one of the tenets of the scientific method. Scientific experiments often comprise complex data flows, selection of adequate parameters, and analysis and visualization of intermediate and end results. Breaking down the complexity of such experiments into the joint collaboration of small, repeatable, well defined tasks, each with well defined inputs, parameters, and outputs, offers the immediate benefit of identifying bottlenecks and pinpointing sections that could benefit from parallelization, among others. Workflows rest upon the notion of splitting complex work into the joint effort of several manageable tasks.

There are several engines that give users the ability to design and execute workflows. Each engine was created to address certain problems of a specific community, therefore each one has its advantages and shortcomings. Furthermore, not all features of all workflow engines are royalty-free, an aspect that could potentially drive away members of the scientific community.

Results

We have developed a set of tools that enables the scientific community to benefit from workflow interoperability. We developed a platform-free structured representation of parameters, inputs, outputs of command-line tools in so-called Common Tool Descriptor documents. We have also overcome the shortcomings and combined the features of two royalty-free workflow engines with a substantial user community: the Konstanz Information Miner, an engine which we see as a formidable workflow editor, and the Grid and User Support Environment, a web-based framework able to interact with several high-performance computing resources. We have thus created a free and highly accessible way to design workflows on a desktop computer and execute them on high-performance computing resources.

Conclusions

Our work will not only reduce time spent on designing scientific workflows, but also make executing workflows on remote high-performance computing resources more accessible to technically inexperienced users. We strongly believe that our efforts not only decrease the turnaround time to obtain scientific results but also have a positive impact on reproducibility, thus elevating the quality of obtained scientific results.

Guler AT, Waaijer CJF, Palmblad M. Scientific workflows for bibliometrics. Scientometrics 2016;107:385-398. [PMID: 27122644 PMCID: PMC4833826 DOI: 10.1007/s11192-016-1885-6] [Citation(s) in RCA: 112] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2015] [Indexed: 11/26/2022]

Tosta FE, Braganholo V, Murta L, Mattoso M. Improving workflow design by mining reusable tasks. JOURNAL OF THE BRAZILIAN COMPUTER SOCIETY 2015. [DOI: 10.1186/s13173-015-0035-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]

Sfakianaki P, Koumakis L, Sfakianakis S, Iatraki G, Zacharioudakis G, Graf N, Marias K, Tsiknakis M. Semantic biomedical resource discovery: a Natural Language Processing framework. BMC Med Inform Decis Mak 2015;15:77. [PMID: 26423616 PMCID: PMC4591066 DOI: 10.1186/s12911-015-0200-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2015] [Accepted: 09/21/2015] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

A plethora of publicly available biomedical resources currently exists and is growing at a fast rate. In parallel, specialized repositories are being developed that index numerous clinical and biomedical tools. The main drawback of such repositories is the difficulty of locating appropriate resources for a clinical or biomedical decision task, especially for users who are not Information Technology experts. Moreover, although NLP research in the clinical domain has been active since the 1960s, progress in the development of NLP applications has been slow and lags behind progress in the general NLP domain. The aim of the present study is to investigate the use of semantics for annotating biomedical resources with domain-specific ontologies, and to exploit Natural Language Processing methods to empower users who are not Information Technology experts to search efficiently for biomedical resources using natural language.

METHODS

A Natural Language Processing engine which can "translate" free text into targeted queries, automatically transforming a clinical research question into a request description that contains only terms of ontologies, has been implemented. The implementation is based on information extraction techniques for text in natural language, guided by integrated ontologies. Furthermore, knowledge from robust text mining methods has been incorporated to map descriptions into suitable domain ontologies in order to ensure that the biomedical resource descriptions are domain oriented and to enhance the accuracy of service discovery. The framework is freely available as a web application at http://calchas.ics.forth.gr/.

RESULTS

For our experiments, a range of clinical questions was established based on descriptions of clinical trials from the ClinicalTrials.gov registry as well as recommendations from clinicians. Domain experts manually identified the available tools in a tools repository which are suitable for addressing the clinical questions at hand, either individually or as a set of tools forming a computational pipeline. The results were compared with those obtained from an automated discovery of candidate biomedical tools. For the evaluation of the results, precision and recall measurements were used. Our results indicate that the proposed framework has high precision and low recall, implying that the system returns substantially more relevant results than irrelevant ones.
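The evaluation described above compares an expert-selected tool set with an automatically discovered one. A minimal sketch of how precision and recall are computed in that setting follows; the function name and the tool identifiers are illustrative assumptions, not data from the study.

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved items that are relevant.
    # Recall: fraction of relevant items that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: experts marked four tools as suitable,
# the automated discovery returned two of them and nothing else.
expert_tools = {"tool_a", "tool_b", "tool_c", "tool_d"}
system_tools = {"tool_a", "tool_b"}
precision, recall = precision_recall(system_tools, expert_tools)
```

In this made-up case every returned tool is relevant (precision 1.0), yet half of the suitable tools are missed (recall 0.5), which is the high-precision, low-recall pattern the abstract reports.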

CONCLUSIONS

Adequate biomedical ontologies are already available, existing NLP tools are sufficient, and biomedical annotation systems are of adequate quality for the implementation of a biomedical resource discovery framework based on the semantic annotation of resources and the use of NLP techniques. The results of the present study demonstrate the clinical utility of the proposed framework, which aims to bridge the gap between clinical questions in natural language and efficient, dynamic discovery of biomedical resources.

Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O. BioImg.org: A Catalog of Virtual Machine Images for the Life Sciences. Bioinform Biol Insights 2015;9:125-8. [PMID: 26401099 PMCID: PMC4567039 DOI: 10.4137/bbi.s28636] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2015] [Revised: 06/29/2015] [Accepted: 07/05/2015] [Indexed: 12/14/2022] Open

A Digital Repository and Execution Platform for Interactive Scholarly Publications in Neuroscience. Neuroinformatics 2015;14:23-40. [PMID: 26306864 DOI: 10.1007/s12021-015-9276-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]

Cock PJA, Chilton JM, Grüning B, Johnson JE, Soranzo N. NCBI BLAST+ integrated into Galaxy. Gigascience 2015;4:39. [PMID: 26336600 PMCID: PMC4557756 DOI: 10.1186/s13742-015-0080-7] [Citation(s) in RCA: 148] [Impact Index Per Article: 16.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2014] [Accepted: 08/18/2015] [Indexed: 01/29/2023] Open

Velloso H, Vialle RA, Ortega JM. BOWS (bioinformatics open web services) to centralize bioinformatics tools in web services. BMC Res Notes 2015;8:206. [PMID: 26032494 PMCID: PMC4467627 DOI: 10.1186/s13104-015-1190-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2014] [Accepted: 05/20/2015] [Indexed: 11/13/2022] Open

Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BIOMED RESEARCH INTERNATIONAL 2015;2015:904541. [PMID: 26125026 PMCID: PMC4466500 DOI: 10.1155/2015/904541] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/31/2014] [Revised: 04/01/2015] [Accepted: 04/01/2015] [Indexed: 02/07/2023]

Drug discovery FAQs: workflows for answering multidomain drug discovery questions. Drug Discov Today 2015;20:399-405. [DOI: 10.1016/j.drudis.2014.11.006] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2014] [Revised: 10/22/2014] [Accepted: 11/13/2014] [Indexed: 12/26/2022]

Lord E, Diallo AB, Makarenkov V. Classification of bioinformatics workflows using weighted versions of partitioning and hierarchical clustering algorithms. BMC Bioinformatics 2015;16:68. [PMID: 25887434 PMCID: PMC4354763 DOI: 10.1186/s12859-015-0508-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2014] [Accepted: 02/20/2015] [Indexed: 11/10/2022] Open

Abstract

Background

Workflows, or computational pipelines, consisting of collections of multiple linked tasks are becoming more and more popular in many scientific fields, including computational biology. For example, simulation studies, which are now a must for statistical validation of new bioinformatics methods and software, are frequently carried out using the available workflow platforms. Workflows are typically organized to minimize the total execution time and to maximize the efficiency of the included operations. Clustering algorithms can be applied either for regrouping similar workflows for their simultaneous execution on a server, or for dispatching some lengthy workflows to different servers, or for classifying the available workflows with a view to performing a specific keyword search.

Results

In this study, we consider four different workflow encoding and clustering schemes which are representative of bioinformatics projects. Some of them allow for clustering workflows with similar topological features, while the others regroup workflows according to their specific attributes (e.g. associated keywords) or execution time. The four types of workflow encoding examined in this study were compared using the weighted versions of the k-means and k-medoids partitioning algorithms. The Calinski-Harabasz, Silhouette and logSS clustering indices were considered. Hierarchical classification methods, including the UPGMA, Neighbor Joining, Fitch and Kitsch algorithms, were also applied to classify bioinformatics workflows. Moreover, a novel pairwise measure of clustering solution stability, which can be computed in situations when a series of independent program runs is carried out, was introduced.

Conclusions

Our findings, based on the analysis of 220 real-life bioinformatics workflows, suggest that the weighted clustering models based on keyword information or task execution times provide the most appropriate clustering solutions. Using datasets generated by the Armadillo and Taverna scientific workflow management systems, we found that the weighted cosine distance in association with the k-medoids partitioning algorithm and the presence-absence workflow encoding provided the highest values of the Rand index among all compared clustering strategies. The introduced clustering stability indices, PS and PSG, can be effectively used to identify elements with low clustering support.
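Two of the quantities named above can be sketched briefly. The code below shows one common way to define a weighted cosine distance between presence-absence vectors, and the standard Rand index for comparing two partitions. The exact weighting scheme used in the study is not given in the abstract, so the formula and the example vectors here are assumptions for illustration.

```python
import math
from itertools import combinations


def weighted_cosine_distance(x, y, w):
    # One minus the cosine similarity, with each coordinate scaled by w.
    dot = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    norm_x = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    norm_y = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # treat an empty workflow as maximally distant
    return 1.0 - dot / (norm_x * norm_y)


def rand_index(labels_a, labels_b):
    # Fraction of element pairs on which the two partitions agree
    # (both place the pair together, or both place it apart).
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical presence-absence encodings: 1 if a task occurs in the workflow.
wf1 = [1, 1, 0, 1]
wf2 = [1, 1, 0, 0]
weights = [1.0, 1.0, 1.0, 1.0]  # assumed uniform task weights
distance = weighted_cosine_distance(wf1, wf2, weights)
```

A distance matrix built this way could be passed to any k-medoids implementation, and the resulting labels compared with a reference partition through `rand_index`.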

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0508-1) contains supplementary material, which is available to authorized users.
