1
Du X, Dastmalchi F, Diller MA, Brochhausen M, Garrett TJ, Hogan WR, Lemas DJ. An Automated Workflow Composition System for Liquid Chromatography-Mass Spectrometry Metabolomics Data Processing. Journal of the American Society for Mass Spectrometry 2023; 34:2857-2863. [PMID: 37874901 DOI: 10.1021/jasms.3c00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2023]
Abstract
Liquid chromatography-mass spectrometry (LC-MS) metabolomics studies produce high-dimensional data that must be processed by a complex network of informatics tools to generate analysis-ready data sets. As the first computational step in metabolomics, data processing increasingly challenges researchers, who must develop customized computational workflows suitable for LC-MS metabolomics analysis. Ontology-based automated workflow composition (AWC) systems provide a feasible approach for developing computational workflows that consume high-dimensional molecular data. We used the Automated Pipeline Explorer (APE) to create an AWC for LC-MS metabolomics data processing across three use cases. Our results show that APE predicted 145 data processing workflows across the three use cases. We identified six traditional workflows and six novel workflows. Through manual review, we found that one-third of the novel workflows were executable, meaning that the data processing function could be completed without an error. When selecting the top six workflows from each use case, the computationally viable rate of our predicted workflows reached 45%. Collectively, our study demonstrates the feasibility of developing an AWC system for LC-MS metabolomics data processing.
Affiliation(s)
- Xinsong Du
  - Division of General Internal Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, United States
  - Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, United States
- Farhad Dastmalchi
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Matthew A Diller
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Mathias Brochhausen
  - Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205, United States
- Timothy J Garrett
  - Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- William R Hogan
  - Data Science Institute, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States
- Dominick J Lemas
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
  - Department of Obstetrics and Gynecology, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
  - Center for Perinatal Outcomes Research, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
2
Djaffardjy M, Marchment G, Sebe C, Blanchet R, Bellajhame K, Gaignard A, Lemoine F, Cohen-Boulakia S. Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems. Comput Struct Biotechnol J 2023; 21:2075-2085. [PMID: 36968012 PMCID: PMC10030817 DOI: 10.1016/j.csbj.2023.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 03/03/2023] [Accepted: 03/03/2023] [Indexed: 03/09/2023] Open
Abstract
Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows.
3
Diao J, Zhou Z, Xue X, Zhao D, Chen S. Bioinformatic workflow fragment discovery leveraging the social-aware knowledge graph. Front Genet 2022; 13:941996. [PMID: 36092917 PMCID: PMC9459048 DOI: 10.3389/fgene.2022.941996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Accepted: 06/29/2022] [Indexed: 11/13/2022] Open
Abstract
Constructing a novel bioinformatic workflow by reusing and repurposing fragments across workflows is regarded as an error-avoiding and effort-saving strategy. Traditional techniques have been proposed to discover scientific workflow fragments leveraging their profiles and historical usages of their activities (or services). However, social relations of workflows, including relations between services and their developers, have not been explored extensively. In fact, current techniques mostly describe invoking relations between services, and they can hardly reveal implicit relations between services. To address this challenge, we propose a social-aware scientific workflow knowledge graph (S2KG) to capture common types of entities and various types of relations by analyzing relevant information about bioinformatic workflows and their developers recorded in repositories. Using attributes of entities such as credit and creation time, the union impact of several positive and negative links in S2KG is identified, to evaluate the feasibility of workflow fragment construction. To facilitate the discovery of single services, a service invoking network is extracted from S2KG, and service communities are constructed accordingly. A bioinformatic workflow fragment discovery mechanism based on Yen’s method is developed to discover appropriate fragments with respect to certain user’s requirements. Extensive experiments are conducted, where bioinformatic workflows publicly accessible at the myExperiment repository are adopted. Evaluation results show that our technique performs better than the state-of-the-art techniques in terms of precision, recall, and F1.
Affiliation(s)
- Jin Diao
  - School of Information Engineering, China University of Geosciences (Beijing), Beijing, China
- Zhangbing Zhou
  - School of Information Engineering, China University of Geosciences (Beijing), Beijing, China
  - Computer Science Department, TELECOM SudParis, Evry, France
  - *Correspondence: Zhangbing Zhou,
- Xiao Xue
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Deng Zhao
  - School of Information Engineering, China University of Geosciences (Beijing), Beijing, China
4
Du X, Aristizabal-Henao JJ, Garrett TJ, Brochhausen M, Hogan WR, Lemas DJ. A Checklist for Reproducible Computational Analysis in Clinical Metabolomics Research. Metabolites 2022; 12:87. [PMID: 35050209 PMCID: PMC8779534 DOI: 10.3390/metabo12010087] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 12/10/2021] [Revised: 12/25/2021] [Accepted: 01/10/2022] [Indexed: 12/15/2022] Open
Abstract
Clinical metabolomics emerged as a novel approach for biomarker discovery with the translational potential to guide next-generation therapeutics and precision health interventions. However, reproducibility in clinical research employing metabolomics data is challenging. Checklists are a helpful tool for promoting reproducible research. Existing checklists that promote reproducible metabolomics research primarily focused on metadata and may not be sufficient to ensure reproducible metabolomics data processing. This paper provides a checklist including actions that need to be taken by researchers to make computational steps reproducible for clinical metabolomics studies. We developed an eight-item checklist that includes criteria related to reusable data sharing and reproducible computational workflow development. We also provided recommended tools and resources to complete each item, as well as a GitHub project template to guide the process. The checklist is concise and easy to follow. Studies that follow this checklist and use recommended resources may facilitate other researchers to reproduce metabolomics results easily and efficiently.
Affiliation(s)
- Xinsong Du
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA; (X.D.); (W.R.H.)
- Timothy J. Garrett
  - Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, FL 32610, USA;
- Mathias Brochhausen
  - Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA;
- William R. Hogan
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA; (X.D.); (W.R.H.)
- Dominick J. Lemas
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA; (X.D.); (W.R.H.)
5
Tangaro MA, Mandreoli P, Chiara M, Donvito G, Antonacci M, Parisi A, Bianco A, Romano A, Bianchi DM, Cangelosi D, Uva P, Molineris I, Nosi V, Calogero RA, Alessandri L, Pedrini E, Mordenti M, Bonetti E, Sangiorgi L, Pesole G, Zambelli F. Laniakea@ReCaS: exploring the potential of customisable Galaxy on-demand instances as a cloud-based service. BMC Bioinformatics 2021; 22:544. [PMID: 34749633 PMCID: PMC8574934 DOI: 10.1186/s12859-021-04401-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/12/2021] [Accepted: 09/24/2021] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Improving the availability and usability of data and analytical tools is a critical precondition for further advancing modern biological and biomedical research. For instance, one of the many ramifications of the COVID-19 global pandemic has been to make even more evident the importance of having bioinformatics tools and data readily actionable by researchers through convenient access points and supported by adequate IT infrastructures. One of the most successful efforts in improving the availability and usability of bioinformatics tools and data is represented by the Galaxy workflow manager and its thriving community. In 2020 we introduced Laniakea, a software platform conceived to streamline the configuration and deployment of "on-demand" Galaxy instances over the cloud. By facilitating the set-up and configuration of Galaxy web servers, Laniakea provides researchers with a powerful and highly customisable platform for executing complex bioinformatics analyses. The system can be accessed through a dedicated and user-friendly web interface that allows the Galaxy web server's initial configuration and deployment.
RESULTS "Laniakea@ReCaS", the first instance of a Laniakea-based service, is managed by ELIXIR-IT and was officially launched in February 2020, after about one year of development and testing that involved several users. Researchers can request access to Laniakea@ReCaS through an open-ended call for use-cases. Ten project proposals have been accepted since then, totalling 18 Galaxy on-demand virtual servers that employ ~100 CPUs, ~250 GB of RAM and ~5 TB of storage and serve several different communities and purposes. Herein, we present eight use cases demonstrating the versatility of the platform.
CONCLUSIONS During this first year of activity, the Laniakea-based service emerged as a flexible platform that facilitated the rapid development of bioinformatics tools, the efficient delivery of training activities, and the provision of public bioinformatics services in different settings, including food safety and clinical research. Laniakea@ReCaS provides a proof of concept of how enabling access to appropriate, reliable IT resources and ready-to-use bioinformatics tools can considerably streamline researchers' work.
Affiliation(s)
- Marco Antonio Tangaro
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
- Pietro Mandreoli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
- Matteo Chiara
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
- Giacinto Donvito
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
- Marica Antonacci
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
- Antonio Parisi
  - Istituto Zooprofilattico Sperimentale Della Puglia e Della Basilicata, Via Manfredonia 20, 71121, Foggia, Italy
- Angelica Bianco
  - Istituto Zooprofilattico Sperimentale Della Puglia e Della Basilicata, Via Manfredonia 20, 71121, Foggia, Italy
- Angelo Romano
  - National Reference Laboratory for Coagulase-Positive Staphylococci Including Staphylococcus Aureus, Istituto Zooprofilattico Sperimentale del Piemonte, Liguria e Valle d'Aosta, Via Bologna 148, 10154, Turin, Italy
- Daniela Manila Bianchi
  - National Reference Laboratory for Coagulase-Positive Staphylococci Including Staphylococcus Aureus, Istituto Zooprofilattico Sperimentale del Piemonte, Liguria e Valle d'Aosta, Via Bologna 148, 10154, Turin, Italy
- Davide Cangelosi
  - Clinical Bioinformatics Unit, Scientific Direction, IRCCS Istituto Giannina Gaslini, Via Gerolamo Gaslini 5, 16147, Genova, Italy
- Paolo Uva
  - Clinical Bioinformatics Unit, Scientific Direction, IRCCS Istituto Giannina Gaslini, Via Gerolamo Gaslini 5, 16147, Genova, Italy
  - Italian Institute of Technology, Via Morego 30, 16163, Genova, Italy
- Ivan Molineris
  - Department of Life Science and System Biology, University of Turin, Via Accademia Albertina, 13-1023, Turin, Italy
- Vladimir Nosi
  - Department of Computer Science, University of Turin, Via Pessinetto 12, 10049, Turin, Italy
- Raffaele A Calogero
  - Department of Molecular Biotechnology and Health Sciences, Via Nizza 52, 10126, Turin, Italy
- Luca Alessandri
  - Department of Molecular Biotechnology and Health Sciences, Via Nizza 52, 10126, Turin, Italy
- Elena Pedrini
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
- Marina Mordenti
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
- Emanuele Bonetti
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
  - Department of Experimental Oncology, European Institute of Oncology, Via Adamello 16, 20139, Milan, Italy
- Luca Sangiorgi
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
- Graziano Pesole
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, Biotechnologies and Biopharmaceutics, University of Bari, Via Orabona 4, 70126, Bari, Italy
- Federico Zambelli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
6
Brown AW, Aslibekyan S, Bier D, da Silva RF, Hoover A, Klurfeld DM, Loken E, Mayo-Wilson E, Menachemi N, Pavela G, Quinn PD, Schoeller D, Tekwe C, Valdez D, Vorland CJ, Whigham LD, Allison DB. Toward more rigorous and informative nutritional epidemiology: The rational space between dismissal and defense of the status quo. Crit Rev Food Sci Nutr 2021; 63:3150-3167. [PMID: 34678079 PMCID: PMC9023609 DOI: 10.1080/10408398.2021.1985427] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 12/26/2022]
Abstract
To date, nutritional epidemiology has relied heavily on relatively weak methods including simple observational designs and substandard measurements. Despite low internal validity and other sources of bias, claims of causality are made commonly in this literature. Nutritional epidemiology investigations can be improved through greater scientific rigor and adherence to scientific reporting commensurate with research methods used. Some commentators advocate jettisoning nutritional epidemiology entirely, perhaps believing improvements are impossible. Still others support only normative refinements. But neither abolition nor minor tweaks are appropriate. Nutritional epidemiology, in its present state, offers utility, yet also needs marked, reformational renovation. Changing the status quo will require ongoing, unflinching scrutiny of research questions, practices, and reporting, and a willingness to admit that "good enough" is no longer good enough. As such, a workshop entitled "Toward more rigorous and informative nutritional epidemiology: the rational space between dismissal and defense of the status quo" was held from July 15 to August 14, 2020. This virtual symposium focused on: (1) Stronger Designs, (2) Stronger Measurement, (3) Stronger Analyses, and (4) Stronger Execution and Reporting. Participants from several leading academic institutions explored existing, evolving, and new better practices, tools, and techniques to collaboratively advance specific recommendations for strengthening nutritional epidemiology.
Affiliation(s)
- Andrew W. Brown
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
- Dennis Bier
  - Baylor College of Medicine, Houston, Texas, USA
- Adam Hoover
  - Clemson University, Clemson, South Carolina, USA
- David M. Klurfeld
  - United States Department of Agriculture, Agricultural Research Service, Beltsville, Maryland, USA
- Eric Loken
  - University of Connecticut, Storrs, Connecticut, USA
- Evan Mayo-Wilson
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
- Nir Menachemi
  - Indiana University Fairbanks School of Public Health at IUPUI, Indianapolis, Indiana, USA
- Greg Pavela
  - University of Alabama at Birmingham, Birmingham, Alabama, USA
- Patrick D. Quinn
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
- Dale Schoeller
  - University of Wisconsin-Madison Biotechnology Center, Madison, Wisconsin, USA
- Carmen Tekwe
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
- Danny Valdez
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
- Colby J. Vorland
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
- Leah D. Whigham
  - University of Texas Health Science Center School of Public Health, El Paso, Texas, USA
- David B. Allison
  - Indiana University School of Public Health-Bloomington, Bloomington, Indiana, USA
7
Lamprecht AL, Palmblad M, Ison J, Schwämmle V, Al Manir MS, Altintas I, Baker CJO, Ben Hadj Amor A, Capella-Gutierrez S, Charonyktakis P, Crusoe MR, Gil Y, Goble C, Griffin TJ, Groth P, Ienasescu H, Jagtap P, Kalaš M, Kasalica V, Khanteymoori A, Kuhn T, Mei H, Ménager H, Möller S, Richardson RA, Robert V, Soiland-Reyes S, Stevens R, Szaniszlo S, Verberne S, Verhoeven A, Wolstencroft K. Perspectives on automated composition of workflows in the life sciences. F1000Res 2021; 10:897. [PMID: 34804501 PMCID: PMC8573700 DOI: 10.12688/f1000research.54159.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 08/27/2021] [Indexed: 12/29/2022] Open
Abstract
Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. 
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.
Affiliation(s)
- Magnus Palmblad
  - Leiden University Medical Center, 2333 ZA, Leiden, The Netherlands
- Jon Ison
  - French Institute of Bioinformatics, 91057 Évry, France
- Ilkay Altintas
  - University of California San Diego, La Jolla, CA, 92093, USA
- Christopher J. O. Baker
  - University of New Brunswick, Saint John, E2L 4L5, Canada
  - IPSNP Computing Inc., Saint John, E2L 4S6, Canada
- Yolanda Gil
  - University of Southern California, Marina Del Rey, CA, 90292, USA
- Carole Goble
  - Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Timothy J. Griffin
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, 55455, USA
- Paul Groth
  - University of Amsterdam, 1090 GH Amsterdam, The Netherlands
- Hans Ienasescu
  - Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Pratik Jagtap
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, 55455, USA
- Tobias Kuhn
  - VU Amsterdam, 1081 HV Amsterdam, The Netherlands
- Hailiang Mei
  - Sequencing Analysis Support Core, Leiden University Medical Center, 2333 ZC Leiden, The Netherlands
- Steffen Möller
  - IBIMA, Rostock University Medical Center, 18057 Rostock, Germany
- Stian Soiland-Reyes
  - Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
  - Informatics Institute, University of Amsterdam, 1090 GH Amsterdam, The Netherlands
- Robert Stevens
  - Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Suzan Verberne
  - Leiden Institute of Advanced Computer Science, Leiden University, 2333 BE Leiden, The Netherlands
- Aswin Verhoeven
  - Leiden University Medical Center, 2333 ZA, Leiden, The Netherlands
- Katherine Wolstencroft
  - Leiden Institute of Advanced Computer Science, Leiden University, 2333 BE Leiden, The Netherlands
8
Samota EK, Davey RP. Knowledge and Attitudes Among Life Scientists Toward Reproducibility Within Journal Articles: A Research Survey. Front Res Metr Anal 2021; 6:678554. [PMID: 34268467 PMCID: PMC8276979 DOI: 10.3389/frma.2021.678554] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/09/2021] [Accepted: 05/18/2021] [Indexed: 12/22/2022] Open
Abstract
We constructed a survey to understand how authors and scientists view the issues around reproducibility, focusing on interactive elements such as interactive figures embedded within online publications, as a solution for enabling the reproducibility of experiments. We report the views of 251 researchers, comprising authors who have published in eLIFE Sciences, and those who work at the Norwich Biosciences Institutes (NBI). The survey also outlines to what extent researchers are occupied with reproducing experiments themselves. Currently, there is an increasing range of tools that attempt to address the production of reproducible research by making code, data, and analyses available to the community for reuse. We wanted to collect information about attitudes around the consumer end of the spectrum, where life scientists interact with research outputs to interpret scientific results. Static plots and figures within articles are a central part of this interpretation, and therefore we asked respondents to consider various features for an interactive figure within a research article that would allow them to better understand and reproduce a published analysis. The majority (91%) of respondents reported that when authors describe their research methodology (methods and analyses) in detail, published research can become more reproducible. The respondents believe that having interactive figures in published papers is a beneficial element to themselves, the papers they read as well as to their readers. Whilst interactive figures are one potential solution for consuming the results of research more effectively to enable reproducibility, we also review the equally pressing technical and cultural demands on researchers that need to be addressed to achieve greater success in reproducibility in the life sciences.
Affiliation(s)
- Evanthia Kaimaklioti Samota
  - Earlham Institute, Norwich, United Kingdom
  - School of Biological Sciences, University of East Anglia, Norwich, United Kingdom
9
Porubsky V, Smith L, Sauro HM. Publishing reproducible dynamic kinetic models. Brief Bioinform 2021; 22:bbaa152. [PMID: 32793969 PMCID: PMC8138891 DOI: 10.1093/bib/bbaa152] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/07/2020] [Revised: 05/19/2020] [Accepted: 06/17/2020] [Indexed: 11/14/2022] Open
Abstract
Publishing repeatable and reproducible computational models is a crucial aspect of the scientific method in computational biology and one that is often forgotten in the rush to publish. The pressures of academic life and the lack of any reward system at institutions, granting agencies and journals means that publishing reproducible science is often either non-existent or, at best, presented in the form of an incomplete description. In the article, we will focus on repeatability and reproducibility in the systems biology field where a great many published models cannot be reproduced and in many cases even repeated. This review describes the current landscape of software tooling, model repositories, model standards and best practices for publishing repeatable and reproducible kinetic models. The review also discusses possible future remedies including working more closely with journals to help reviewers and editors ensure that published kinetic models are at minimum, repeatable. Contact: hsauro@uw.edu.
Affiliation(s)
- Veronica Porubsky
  - Department of Bioengineering, University of Washington, Seattle, 98105, USA
- Lucian Smith
  - Department of Bioengineering, University of Washington, Seattle, 98105, USA
- Herbert M Sauro
  - Department of Bioengineering, University of Washington, Seattle, 98105, USA
10
Choi K, Karr JR, Sauro HM. Status and Challenges of Reproducibility in Computational Systems and Synthetic Biology. Systems Medicine 2021. [DOI: 10.1016/b978-0-12-801238-3.11525-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022] Open
11
Bartley BA, Beal J, Karr JR, Strychalski EA. Organizing genome engineering for the gigabase scale. Nat Commun 2020; 11:689. [PMID: 32019919 PMCID: PMC7000699 DOI: 10.1038/s41467-020-14314-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/24/2019] [Accepted: 12/18/2019] [Indexed: 12/11/2022] Open
Abstract
Genome-scale engineering holds great potential to impact science, industry, medicine, and society, and recent improvements in DNA synthesis have enabled the manipulation of megabase genomes. However, coordinating and integrating the workflows and large teams necessary for gigabase genome engineering remains a considerable challenge. We examine this issue and recommend a path forward by: 1) adopting and extending existing representations for designs, assembly plans, samples, data, and workflows; 2) developing new technologies for data curation and quality control; 3) conducting fundamental research on genome-scale modeling and design; and 4) developing new legal and contractual infrastructure to facilitate collaboration.
Affiliation(s)
- Jacob Beal
- Raytheon BBN Technologies, Cambridge, MA, 02138, USA
- Jonathan R Karr
- Icahn Institute and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10128, USA
|
12
|
Harjes J, Link A, Weibulat T, Triebel D, Rambold G. FAIR digital objects in environmental and life sciences should comprise workflow operation design data and method information for repeatability of study setups and reproducibility of results. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:5894776. [PMID: 32815545 PMCID: PMC7439577 DOI: 10.1093/database/baaa059] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/19/2020] [Revised: 07/01/2020] [Accepted: 07/07/2020] [Indexed: 12/23/2022]
Abstract
Repeatability of study setups and reproducibility of research results by underlying data are major requirements in science. Until now, abstract models for describing the structural logic of studies in environmental sciences are lacking and tools for data management are insufficient. Mandatory for repeatability and reproducibility is the use of sophisticated data management solutions going beyond data file sharing. Particularly, it implies maintenance of coherent data along workflows. Design data concern elements from elementary domains of operations being transformation, measurement and transaction. Operation design elements and method information are specified for each consecutive workflow segment from field to laboratory campaigns. The strict linkage of operation design element values, operation values and objects is essential. For enabling coherence of corresponding objects along consecutive workflow segments, the assignment of unique identifiers and the specification of their relations are mandatory. The abstract model presented here addresses these aspects, and the software DiversityDescriptions (DWB-DD) facilitates the management of thusly connected digital data objects and structures. DWB-DD allows for an individual specification of operation design elements and their linking to objects. Two workflow design use cases, one for DNA barcoding and another for cultivation of fungal isolates, are given. To publish those structured data, standard schema mapping and XML-provision of digital objects are essential. Schemas useful for this mapping include the Ecological Markup Language, the Schema for Meta-omics Data of Collection Objects and the Standard for Structured Descriptive Data. Data pipelines with DWB-DD include the mapping and conversion between schemas and functions for data publishing and archiving according to the Open Archival Information System standard. 
The setting allows for repeatability of study setups, reproducibility of study results and for supporting work groups to structure and maintain their data from the beginning of a study. The theory of ‘FAIR++’ digital objects is introduced.
Affiliation(s)
- Janno Harjes
- University of Bayreuth, Universitätsstraße 30, 95440 Bayreuth, Germany
- Anton Link
- Staatliche Naturwissenschaftliche Sammlungen Bayerns, Menzinger Straße 67, 80638 München, Germany
- Tanja Weibulat
- Staatliche Naturwissenschaftliche Sammlungen Bayerns, Menzinger Straße 67, 80638 München, Germany; German Federation for Biological Data e. V., Campus Ring 1, 28759 Bremen, Germany
- Dagmar Triebel
- Staatliche Naturwissenschaftliche Sammlungen Bayerns, Menzinger Straße 67, 80638 München, Germany; German Federation for Biological Data e. V., Campus Ring 1, 28759 Bremen, Germany
- Gerhard Rambold
- University of Bayreuth, Universitätsstraße 30, 95440 Bayreuth, Germany
|
13
|
Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience 2019; 8:giz095. [PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 12/04/2018] [Revised: 05/23/2019] [Accepted: 07/17/2019] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Affiliation(s)
- Farah Zaib Khan
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Common Workflow Language Project
- Richard O Sinnott
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Andrew Lonie
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
|
14
|
Andrio P, Hospital A, Conejero J, Jordá L, Del Pino M, Codo L, Soiland-Reyes S, Goble C, Lezzi D, Badia RM, Orozco M, Gelpi JL. BioExcel Building Blocks, a software library for interoperable biomolecular simulation workflows. Sci Data 2019; 6:169. [PMID: 31506435 PMCID: PMC6736963 DOI: 10.1038/s41597-019-0177-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 03/25/2019] [Accepted: 08/16/2019] [Indexed: 12/26/2022] Open
Abstract
In recent years, improvements in software and hardware performance have made biomolecular simulations a mature tool for the study of biological processes. Simulation length and the size and complexity of the analyzed systems make simulations both complementary and compatible with other bioinformatics disciplines. However, the characteristics of the software packages used for simulation have prevented the adoption of technologies accepted in other bioinformatics fields, such as automated deployment systems, workflow orchestration, or the use of software containers. We present here a comprehensive exercise to bring biomolecular simulations to the “bioinformatics way of working”. The exercise has led to the development of the BioExcel Building Blocks (BioBB) library. BioBBs are built as Python wrappers to provide an interoperable architecture. BioBBs have been integrated into a chain of common software management tools to generate data ontologies, documentation, installation packages, software containers and ways of integration with workflow managers, which make them usable in most computational environments.
Affiliation(s)
- Pau Andrio
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Adam Hospital
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology (BIST), Baldiri Reixac 10, Barcelona, 08028, Spain
- Javier Conejero
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Luis Jordá
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Marc Del Pino
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Laia Codo
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Stian Soiland-Reyes
- School of Computer Science, The University of Manchester, Manchester, United Kingdom
- Carole Goble
- School of Computer Science, The University of Manchester, Manchester, United Kingdom
- Daniele Lezzi
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Rosa M Badia
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain
- Modesto Orozco
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology (BIST), Baldiri Reixac 10, Barcelona, 08028, Spain; Department Biochemistry and Molecular Biomedicine, University of Barcelona, Barcelona, Spain
- Josep Ll Gelpi
- Barcelona Supercomputing Center (BSC), Jordi Girona 29, 08034, Barcelona, Spain; Department Biochemistry and Molecular Biomedicine, University of Barcelona, Barcelona, Spain
|
15
|
Malandrino D, Manno I, Negro A, Petta A, Serra L, Cantarella C, Scarano V. Social support for collaboration and group awareness in life science research teams. SOURCE CODE FOR BIOLOGY AND MEDICINE 2019; 14:4. [PMID: 31320922 PMCID: PMC6615102 DOI: 10.1186/s13029-019-0074-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2015] [Accepted: 07/01/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND Next-generation sequencing (NGS) technologies have reshaped the landscape of '-omics' research areas. They produce a plethora of information requiring specific knowledge in sample preparation, analysis and characterization. Additionally, expertise and competencies are required when using bioinformatics tools and methods for efficient analysis, interpretation, and visualization of data. These skills are rarely covered in a single laboratory. More often the samples are isolated and purified in a first laboratory, sequencing is performed by a private company or a specialized lab, while the produced data are analyzed by a third group of researchers. In this scenario, support, communication, and information sharing among researchers are key to building common knowledge and meeting the project objectives. RESULTS We present ElGalaxy, a system designed and developed to support collaboration and information sharing among researchers. Specifically, we integrated collaborative functionalities within an application usually adopted by Life Science researchers. ElGalaxy, therefore, is the result of the integration of Galaxy, i.e., a Workflow Management System, with Elgg, i.e., a Social Network Engine. CONCLUSIONS ElGalaxy enables scientists who work on the same experiment to collaborate and share information, to discuss methods, and to evaluate the results of individual steps, as well as of entire activities, performed during their experiments. ElGalaxy also allows greater team awareness, especially when experiments are carried out by researchers who belong to different and distributed research centers.
Affiliation(s)
- Delfina Malandrino
- Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, Fisciano (SA), Italy
- Ilaria Manno
- Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, Fisciano (SA), Italy
- Alberto Negro
- Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, Fisciano (SA), Italy
- Andrea Petta
- Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, Fisciano (SA), Italy
- Luigi Serra
- Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, Fisciano (SA), Italy
- Concita Cantarella
- Consiglio per la Ricerca in Agricoltura e l’Analisi dell’Economia Agraria, Pontecagnano (SA), Salerno, Italy
- Vittorio Scarano
- Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, Fisciano (SA), Italy
|
16
|
Karim MR, Michel A, Zappa A, Baranov P, Sahay R, Rebholz-Schuhmann D. Improving data workflow systems with cloud services and use of open data for bioinformatics research. Brief Bioinform 2019; 19:1035-1050. [PMID: 28419324 PMCID: PMC6169675 DOI: 10.1093/bib/bbx039] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 10/21/2016] [Indexed: 11/22/2022] Open
Abstract
Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high-throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.
Affiliation(s)
- Md Rezaul Karim
- Semantics in eHealth and Life Sciences (SeLS), Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
- Audrey Michel
- School of Biochemistry and Cell Biology, University College Cork, Ireland
- Achille Zappa
- Insight Centre for Data Analytics, National University of Ireland Galway, Dangan, Galway, Ireland
- Pavel Baranov
- School of Biochemistry and Cell Biology, University College Cork, Ireland
- Ratnesh Sahay
- Semantics in eHealth and Life Sciences (SeLS), Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
|
17
|
Enabling precision medicine via standard communication of HTS provenance, analysis, and results. PLoS Biol 2018; 16:e3000099. [PMID: 30596645 PMCID: PMC6338479 DOI: 10.1371/journal.pbio.3000099] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Revised: 01/18/2019] [Indexed: 11/30/2022] Open
Abstract
A personalized approach based on a patient's or pathogen’s unique genomic sequence is the foundation of precision medicine. Genomic findings must be robust and reproducible, and experimental data capture should adhere to findable, accessible, interoperable, and reusable (FAIR) guiding principles. Moreover, effective precision medicine requires standardized reporting that extends beyond wet-lab procedures to computational methods. The BioCompute framework (https://w3id.org/biocompute/1.3.0) enables standardized reporting of genomic sequence data provenance, including provenance domain, usability domain, execution domain, verification kit, and error domain. This framework facilitates communication and promotes interoperability. Bioinformatics computation instances that employ the BioCompute framework are easily relayed, repeated if needed, and compared by scientists, regulators, test developers, and clinicians. Easing the burden of performing the aforementioned tasks greatly extends the range of practical application. Large clinical trials, precision medicine, and regulatory submissions require a set of agreed upon standards that ensures efficient communication and documentation of genomic analyses. The BioCompute paradigm and the resulting BioCompute Objects (BCOs) offer that standard and are freely accessible as a GitHub organization (https://github.com/biocompute-objects) following the “Open-Stand.org principles for collaborative open standards development.” With high-throughput sequencing (HTS) studies communicated using a BCO, regulatory agencies (e.g., Food and Drug Administration [FDA]), diagnostic test developers, researchers, and clinicians can expand collaboration to drive innovation in precision medicine, potentially decreasing the time and cost associated with next-generation sequencing workflow exchange, reporting, and regulatory reviews. 
This Community Page article presents a communication standard for the provenance of high-throughput sequencing data; a BioCompute Object (BCO) can serve as a history of what was computed, be used as part of a validation process, or provide clarity and transparency of an experimental process to collaborators.
|
18
|
Mondelli ML, Magalhães T, Loss G, Wilde M, Foster I, Mattoso M, Katz D, Barbosa H, de Vasconcelos ATR, Ocaña K, Gadelha LMR. BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments. PeerJ 2018; 6:e5551. [PMID: 30186700 PMCID: PMC6119457 DOI: 10.7717/peerj.5551] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/15/2018] [Accepted: 08/07/2018] [Indexed: 11/20/2022] Open
Abstract
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the case studies' execution time by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
Affiliation(s)
- Maria Luiza Mondelli
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
- Thiago Magalhães
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
- Guilherme Loss
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
- Michael Wilde
- Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
- Ian Foster
- Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
- Marta Mattoso
- Computer and Systems Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
- Daniel Katz
- National Center for Supercomputing Applications, University of Illinois, Urbana, IL, USA
- Helio Barbosa
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil; Federal University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
- Kary Ocaña
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
- Luiz M R Gadelha
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
|
19
|
Misra BB, Langefeld CD, Olivier M, Cox LA. Integrated Omics: Tools, Advances, and Future Approaches. J Mol Endocrinol 2018; 62:JME-18-0055. [PMID: 30006342 DOI: 10.1530/jme-18-0055] [Citation(s) in RCA: 220] [Impact Index Per Article: 36.7] [Received: 02/24/2018] [Revised: 07/02/2018] [Accepted: 07/12/2018] [Indexed: 12/13/2022]
Abstract
With the rapid adoption of high-throughput omic approaches to analyze biological samples such as genomics, transcriptomics, proteomics, and metabolomics, each analysis can generate tera- to peta-byte sized data files on a daily basis. These data file sizes, together with differences in nomenclature among these data types, make the integration of these multi-dimensional omics data into biologically meaningful context challenging. Variously named as integrated omics, multi-omics, poly-omics, trans-omics, pan-omics, or shortened to just 'omics', the challenges include differences in data cleaning, normalization, biomolecule identification, data dimensionality reduction, biological contextualization, statistical validation, data storage and handling, sharing, and data archiving. The ultimate goal is towards the holistic realization of a 'systems biology' understanding of the biological question in hand. Commonly used approaches in these efforts are currently limited by the 3 i's - integration, interpretation, and insights. Post integration, these very large datasets aim to yield unprecedented views of cellular systems at exquisite resolution for transformative insights into processes, events, and diseases through various computational and informatics frameworks. With the continued reduction in costs and processing time for sample analyses, and increasing types of omics datasets generated such as glycomics, lipidomics, microbiomics, and phenomics, an increasing number of scientists in this interdisciplinary domain of bioinformatics face these challenges. We discuss recent approaches, existing tools, and potential caveats in the integration of omics datasets for development of standardized analytical pipelines that could be adopted by the global omics research community.
Affiliation(s)
- Biswapriya B Misra
- B Misra, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
- Carl D Langefeld
- C Langefeld, Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, United States
- Michael Olivier
- M Olivier, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
- Laura A Cox
- L Cox, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
|
20
|
Naldi A, Hernandez C, Levy N, Stoll G, Monteiro PT, Chaouiya C, Helikar T, Zinovyev A, Calzone L, Cohen-Boulakia S, Thieffry D, Paulevé L. The CoLoMoTo Interactive Notebook: Accessible and Reproducible Computational Analyses for Qualitative Biological Networks. Front Physiol 2018; 9:680. [PMID: 29971009 PMCID: PMC6018415 DOI: 10.3389/fphys.2018.00680] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 04/05/2018] [Accepted: 05/15/2018] [Indexed: 01/07/2023] Open
Abstract
Analysing models of biological networks typically relies on workflows in which different software tools with sensitive parameters are chained together, many times with additional manual steps. The accessibility and reproducibility of such workflows is challenging, as publications often overlook analysis details, and because some of these tools may be difficult to install, and/or have a steep learning curve. The CoLoMoTo Interactive Notebook provides a unified environment to edit, execute, share, and reproduce analyses of qualitative models of biological networks. This framework combines the power of different technologies to ensure repeatability and to reduce users' learning curve of these technologies. The framework is distributed as a Docker image with the tools ready to be run without any installation step besides Docker, and is available on Linux, macOS, and Microsoft Windows. The embedded computational workflows are edited with a Jupyter web interface, enabling the inclusion of textual annotations, along with the explicit code to execute, as well as the visualization of the results. The resulting notebook files can then be shared and re-executed in the same environment. To date, the CoLoMoTo Interactive Notebook provides access to the software tools GINsim, BioLQM, Pint, MaBoSS, and Cell Collective, for the modeling and analysis of Boolean and multi-valued networks. More tools will be included in the future. We developed a Python interface for each of these tools to offer a seamless integration in the Jupyter web interface and ease the chaining of complementary analyses.
Affiliation(s)
- Aurélien Naldi
- Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, Institut National de la Santé et de la Recherche Médicale U1024, École Normale Supérieure, PSL Université, Paris, France
- Céline Hernandez
- Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, Institut National de la Santé et de la Recherche Médicale U1024, École Normale Supérieure, PSL Université, Paris, France
- Nicolas Levy
- Laboratoire de Recherche en Informatique UMR8623, Université Paris-Sud, Centre National de la Recherche Scientifique, Université Paris-Saclay, Orsay, France
- École Normale Supérieure de Lyon, Lyon, France
- Gautier Stoll
- Université Paris Descartes/Paris V, Sorbonne Paris Cité, Paris, France
- Équipe 11 Labellisée Ligue Nationale Contre le Cancer, Centre de Recherche des Cordeliers, Paris, France
- Institut National de la Santé et de la Recherche Médicale, U1138, Paris, France
- Université Pierre et Marie Curie, Paris, France
- Metabolomics and Cell Biology Platforms, Gustave Roussy Cancer, Villejuif, France
- Pedro T. Monteiro
- INESC-ID/Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal
- Tomáš Helikar
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, United States
- Andrei Zinovyev
- Institut Curie, PSL Research University, Paris, France
- Institut National de la Santé et de la Recherche Médicale, U900, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- Lobachevsky University, Nizhni Novgorod, Russia
- Laurence Calzone
- Institut Curie, PSL Research University, Paris, France
- Institut National de la Santé et de la Recherche Médicale, U900, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- Sarah Cohen-Boulakia
- Laboratoire de Recherche en Informatique UMR8623, Université Paris-Sud, Centre National de la Recherche Scientifique, Université Paris-Saclay, Orsay, France
- Denis Thieffry
- Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, Institut National de la Santé et de la Recherche Médicale U1024, École Normale Supérieure, PSL Université, Paris, France
- Loïc Paulevé
- Laboratoire de Recherche en Informatique UMR8623, Université Paris-Sud, Centre National de la Recherche Scientifique, Université Paris-Saclay, Orsay, France
|
21
|
Halioui A, Valtchev P, Diallo AB. Bioinformatic workflow extraction from scientific texts based on word sense disambiguation and relation extraction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1979-1990. [PMID: 29994265 DOI: 10.1109/tcbb.2018.2847336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
This paper introduces a method for automatic workflow extraction from texts using Process-Oriented Case-Based Reasoning (POCBR). While current workflow management systems mostly implement complicated graphical tasks based on advanced distributed solutions (e.g. cloud computing and grid computation), workflow knowledge acquisition from texts using case-based reasoning offers more expressive and semantic case representations. We propose, in this context, an ontology-based workflow extraction framework to acquire processual knowledge from texts. Our methodology extends classic NLP techniques to extract and disambiguate tasks and relations in texts. Using a graph-based representation of workflows and a domain ontology, our extraction process uses a context-aware approach to recognize workflow components: data and control flows. We applied our framework in a technical domain in bioinformatics, namely phylogenetic analyses. An evaluation based on workflow semantic similarities on a gold standard indicates that our approach provides promising results in the process extraction domain. Both the data and the implementation of our framework are available at http://labo.bioinfo.uqam.ca/tgrowler.
|
22
|
Taghiyar MJ, Rosner J, Grewal D, Grande BM, Aniba R, Grewal J, Boutros PC, Morin RD, Bashashati A, Shah SP. Kronos: a workflow assembler for genome analytics and informatics. Gigascience 2018; 6:1-10. [PMID: 28655203 PMCID: PMC5569921 DOI: 10.1093/gigascience/gix042] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/07/2017] [Accepted: 06/07/2017] [Indexed: 11/25/2022] Open
Abstract
Background: The field of next-generation sequencing informatics has matured to a point where algorithmic advances in sequence alignment and individual feature detection methods have stabilized. Practical and robust implementation of complex analytical workflows (where such tools are structured into “best practices” for automated analysis of next-generation sequencing datasets) still requires significant programming investment and expertise. Results: We present Kronos, a software platform for facilitating the development and execution of modular, auditable, and distributable bioinformatics workflows. Kronos obviates the need for explicit coding of workflows by compiling a text configuration file into executable Python applications. Making analysis modules would still require programming. The framework of each workflow includes a run manager to execute the encoded workflows locally (or on a cluster or cloud), parallelize tasks, and log all runtime events. The resulting workflows are highly modular and configurable by construction, facilitating flexible and extensible meta-applications that can be modified easily through configuration file editing. The workflows are fully encoded for ease of distribution and can be instantiated on external systems, a step toward reproducible research and comparative analyses. We introduce a framework for building Kronos components that function as shareable, modular nodes in Kronos workflows. Conclusions: The Kronos platform provides a standard framework for developers to implement custom tools, reuse existing tools, and contribute to the community at large. Kronos is shipped with both Docker and Amazon Web Services Machine Images. It is free, open source, and available through the Python Package Index and at https://github.com/jtaghiyar/kronos.
Affiliation(s)
- M Jafar Taghiyar
  - Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
- Jamie Rosner
  - Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada
- Diljot Grewal
  - Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
- Bruno M Grande
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, V5A 1S6 Burnaby, BC, Canada
- Radhouane Aniba
  - Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
- Jasleen Grewal
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, V5A 1S6 Burnaby, BC, Canada
- Paul C Boutros
  - Ontario Institute for Cancer Research (OICR), 661 University Avenue, M5G 0A3 Toronto, ON, Canada; Department of Medical Biophysics, University of Toronto, 101 College Street, M5G 1L7 Toronto, ON, Canada
- Ryan D Morin
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, V5A 1S6 Burnaby, BC, Canada
- Ali Bashashati
  - Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
- Sohrab P Shah
  - Department of Molecular Oncology, British Columbia Cancer Agency, 675 West 10th Ave, V5Z 1L3 Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, University of British Columbia, 2211 Wesbrook Mall, V6T 2B5 Vancouver, BC, Canada
|
23
|
Thanki AS, Soranzo N, Haerty W, Davey RP. GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline. Gigascience 2018; 7:1-10. [PMID: 29425291 PMCID: PMC5863215 DOI: 10.1093/gigascience/giy005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/30/2017] [Revised: 07/31/2017] [Accepted: 01/18/2018] [Indexed: 11/13/2022] Open
Abstract
Background: Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of gene families has often been associated with morphological, physiological, and environmental adaptations. The study of homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancestral gene duplication events as well as identifying genes that have diverged from a common ancestor under positive selection. There are various tools available, such as MSOAR, OrthoMCL, and HomoloGene, to identify gene families and visualize syntenic information between species, providing an overview of the evolution of syntenic regions at the family level. Unfortunately, none of them provide information about structural changes within genes, such as the conservation of ancestral exon boundaries among multiple genomes. The Ensembl GeneTrees computational pipeline generates gene trees based on coding sequences, provides details about exon conservation, and is used in the Ensembl Compara project to discover gene families. Findings: A certain amount of expertise is required to configure and run the Ensembl Compara GeneTrees pipeline via the command line. Therefore, we converted this pipeline into a Galaxy workflow, called GeneSeqToFamily, and provided additional functionality. This workflow uses existing tools from the Galaxy ToolShed, as well as providing additional wrappers and tools that are required to run the workflow. Conclusions: GeneSeqToFamily represents the Ensembl GeneTrees pipeline as a set of interconnected Galaxy tools, so they can be run interactively within Galaxy's user-friendly workflow environment while still providing the flexibility to tailor the analysis by changing configurations and tools if necessary. Additional tools allow users to subsequently visualize the gene families produced by the workflow, using the Aequatus.js interactive tool, which has been developed as part of the Aequatus software project.
Affiliation(s)
- Anil S Thanki
  - Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
- Nicola Soranzo
  - Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
- Wilfried Haerty
  - Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
- Robert P Davey
  - Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
|
24
|
Pfeuffer J, Sachsenberg T, Alka O, Walzer M, Fillbrunn A, Nilse L, Schilling O, Reinert K, Kohlbacher O. OpenMS – A platform for reproducible analysis of mass spectrometry data. J Biotechnol 2017; 261:142-148. [DOI: 10.1016/j.jbiotec.2017.05.016] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 02/22/2017] [Revised: 05/17/2017] [Accepted: 05/22/2017] [Indexed: 10/19/2022]
|
25
|
Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Trop Med Health 2017; 45:33. [PMID: 29093641 PMCID: PMC5658975 DOI: 10.1186/s41182-017-0073-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/09/2017] [Accepted: 10/12/2017] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: Tropical medicine appeared as a distinct sub-discipline in the late nineteenth century, during a period of rapid European colonial expansion in Africa and Asia. After a dramatic drop after World War II, research on tropical diseases has received more attention and research funding in the twenty-first century. METHODS: We used Apache Taverna to integrate Europe PMC and MapAffil web services, containing the spatiotemporal analysis workflow from a list of PubMed queries to a list of publication years and author affiliations geoparsed to latitudes and longitudes. The results could then be visualized in the Quantum Geographic Information System (QGIS). RESULTS: Our workflows automatically matched 253,277 affiliations to geographical coordinates for the first authors of 379,728 papers on tropical diseases in a single execution. The bibliometric analyses show how research output in tropical diseases follows major historical shifts in the twentieth century and renewed interest in and funding for tropical disease research in the twenty-first century. They show the effects of disease outbreaks, WHO eradication programs, vaccine developments, wars, refugee migrations, and peace treaties. CONCLUSIONS: Literature search and geoparsing web services can be combined in scientific workflows performing complete spatiotemporal bibliometric analyses of research in tropical medicine. The workflows and datasets are freely available and can be used to reproduce or refine the analyses and test specific hypotheses or look into particular diseases or geographic regions. This work exceeds all previously published bibliometric analyses on tropical diseases in both scale and spatiotemporal range.
Affiliation(s)
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, the Netherlands
- Vetle I. Torvik
  - School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
|
26
|
|
27
|
Wollmann T, Erfle H, Eils R, Rohr K, Gunkel M. Workflows for microscopy image analysis and cellular phenotyping. J Biotechnol 2017; 261:70-75. [PMID: 28757289 DOI: 10.1016/j.jbiotec.2017.07.019] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/24/2017] [Revised: 07/18/2017] [Accepted: 07/21/2017] [Indexed: 10/19/2022]
Abstract
In large scale biological experiments, like high-throughput or high-content cellular screening, the amount and the complexity of images to be analyzed are steadily increasing. To handle and process these images, well defined image processing and analysis steps need to be performed by applying dedicated workflows. Multiple software tools have emerged with the aim to facilitate creation of such workflows by integrating existing methods, tools, and routines, and by adapting them to different applications and questions, as well as making them reusable and interchangeable. In this review, we describe workflow systems for the integration of microscopy image analysis techniques with focus on KNIME and Galaxy.
Affiliation(s)
- Thomas Wollmann
  - Dept. Bioinformatics and Functional Genomics, Biomedical Computer Vision Group, University of Heidelberg, BioQuant, IPMB, and DKFZ Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Holger Erfle
  - High-Content Analysis of the Cell (HiCell) and ViroQuant-CellNetworks RNAi Screening Facility, BioQuant, University of Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Roland Eils
  - Dept. Bioinformatics and Functional Genomics, Biomedical Computer Vision Group, University of Heidelberg, BioQuant, IPMB, and DKFZ Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Karl Rohr
  - Dept. Bioinformatics and Functional Genomics, Biomedical Computer Vision Group, University of Heidelberg, BioQuant, IPMB, and DKFZ Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Manuel Gunkel
  - High-Content Analysis of the Cell (HiCell) and ViroQuant-CellNetworks RNAi Screening Facility, BioQuant, University of Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
|
28
|
Mısırlı G, Madsen C, Murieta IS, Bultelle M, Flanagan K, Pocock M, Hallinan J, McLaughlin JA, Clark‐Casey J, Lyne M, Micklem G, Stan G, Kitney R, Wipat A. Constructing synthetic biology workflows in the cloud. ENGINEERING BIOLOGY 2017. [DOI: 10.1049/enb.2017.0001] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022] Open
Affiliation(s)
- Göksel Mısırlı
  - School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
- Curtis Madsen
  - Electrical & Computer Engineering Department, Boston University, Boston, USA
- Keith Flanagan
  - School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
- Justin Clark‐Casey
  - Department of Genetics, Cambridge Systems Biology Centre, University of Cambridge, Cambridge, UK
- Mike Lyne
  - Department of Genetics, Cambridge Systems Biology Centre, University of Cambridge, Cambridge, UK
- Gos Micklem
  - Department of Genetics, Cambridge Systems Biology Centre, University of Cambridge, Cambridge, UK
- Guy‐Bart Stan
  - Department of Bioengineering, Imperial College London, London, UK
- Richard Kitney
  - Department of Bioengineering, Imperial College London, London, UK
- Anil Wipat
  - School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
|
29
|
Grüning BA, Fallmann J, Yusuf D, Will S, Erxleben A, Eggenhofer F, Houwaart T, Batut B, Videm P, Bagnacani A, Wolfien M, Lott SC, Hoogstrate Y, Hess WR, Wolkenhauer O, Hoffmann S, Akalin A, Ohler U, Stadler PF, Backofen R. The RNA workbench: best practices for RNA and high-throughput sequencing bioinformatics in Galaxy. Nucleic Acids Res 2017; 45:W560-W566. [PMID: 28582575 PMCID: PMC5570170 DOI: 10.1093/nar/gkx409] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 03/02/2017] [Revised: 04/13/2017] [Accepted: 05/31/2017] [Indexed: 01/23/2023] Open
Abstract
RNA-based regulation has become a major research topic in molecular biology. The analysis of epigenetic and expression data is therefore incomplete if RNA-based regulation is not taken into account. Thus, it is increasingly important but not yet standard to combine RNA-centric data and analysis tools with other types of experimental data such as RNA-seq or ChIP-seq. Here, we present the RNA workbench, a comprehensive set of analysis tools and consolidated workflows that enable the researcher to combine these two worlds. Based on the Galaxy framework, the workbench guarantees simple access, easy extension, flexible adaptation to personal and security needs, and sophisticated analyses that are independent of command-line knowledge. Currently, it includes more than 50 bioinformatics tools that are dedicated to different research areas of RNA biology including RNA structure analysis, RNA alignment, RNA annotation, RNA-protein interaction, ribosome profiling, RNA-seq analysis and RNA target prediction. The workbench is developed and maintained by experts in RNA bioinformatics and the Galaxy framework. Together with the growing community evolving around this workbench, we are committed to keeping the workbench up-to-date for future standards and needs, providing researchers with a reliable and robust framework for RNA data analysis. Availability: The RNA workbench is available at https://github.com/bgruening/galaxy-rna-workbench.
Affiliation(s)
- Björn A. Grüning
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
  - Center for Biological Systems Analysis (ZBSA), University of Freiburg, Habsburgerstr. 49, D-79104 Freiburg, Germany
- Jörg Fallmann
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany
- Dilmurat Yusuf
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, D-13125, Berlin, Germany
- Sebastian Will
  - Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria
- Anika Erxleben
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
- Florian Eggenhofer
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
- Torsten Houwaart
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
- Bérénice Batut
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
- Pavankumar Videm
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
- Andrea Bagnacani
  - Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstr. 69, D-18051 Rostock, Germany
- Markus Wolfien
  - Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstr. 69, D-18051 Rostock, Germany
- Steffen C. Lott
  - Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schänzlestr. 1, D-79104 Freiburg, Germany
- Youri Hoogstrate
  - Department of Urology, Erasmus University Medical Center, Wytemaweg 80, 3015 CN Rotterdam, Netherlands
- Wolfgang R. Hess
  - Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schänzlestr. 1, D-79104 Freiburg, Germany
- Olaf Wolkenhauer
  - Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstr. 69, D-18051 Rostock, Germany
- Steve Hoffmann
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany
- Altuna Akalin
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, D-13125, Berlin, Germany
- Uwe Ohler
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, D-13125, Berlin, Germany
  - Departments of Biology and Computer Science, Humboldt University, Unter den Linden 6, D-10099 Berlin
- Peter F. Stadler
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany
  - Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria
  - Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, D-04103 Leipzig, Germany
  - Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
- Rolf Backofen
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany
  - Center for Biological Systems Analysis (ZBSA), University of Freiburg, Habsburgerstr. 49, D-79104 Freiburg, Germany
  - BIOSS Centre for Biological Signaling Studies, University of Freiburg, Schänzlestr. 18, D-79104 Freiburg, Germany
|
30
|
|
31
|
Urdidiales‐Nieto D, Navas‐Delgado I, Aldana‐Montes JF. Biological Web Service Repositories Review. Mol Inform 2017; 36:1600035. [PMID: 27783459 PMCID: PMC5434852 DOI: 10.1002/minf.201600035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/07/2016] [Accepted: 09/27/2016] [Indexed: 12/26/2022]
Abstract
Web services play a key role in bioinformatics enabling the integration of database access and analysis of algorithms. However, Web service repositories do not usually publish information on the changes made to their registered Web services. Dynamism is directly related to the changes in the repositories (services registered or unregistered) and at service level (annotation changes). Thus, users, software clients or workflow based approaches lack enough relevant information to decide when they should review or re-execute a Web service or workflow to get updated or improved results. The dynamism of the repository could be a measure for workflow developers to re-check service availability and annotation changes in the services of interest to them. This paper presents a review on the most well-known Web service repositories in the life sciences including an analysis of their dynamism. Freshness is introduced in this paper, and has been used as the measure for the dynamism of these repositories.
Affiliation(s)
- David Urdidiales‐Nieto
  - Department of Computer Languages and Computing Science, Higher Technical School of Computer Science Engineering, University of Malaga, Malaga 29071, Spain
- Ismael Navas‐Delgado
  - Department of Computer Languages and Computing Science, Higher Technical School of Computer Science Engineering, University of Malaga, Malaga 29071, Spain
- José F. Aldana‐Montes
  - Department of Computer Languages and Computing Science, Higher Technical School of Computer Science Engineering, University of Malaga, Malaga 29071, Spain
|
32
|
Hoogstrate Y, Zhang C, Senf A, Bijlard J, Hiltemann S, van Enckevort D, Repo S, Heringa J, Jenster G, J A Fijneman R, Boiten JW, A Meijer G, Stubbs A, Rambla J, Spalding D, Abeln S. Integration of EGA secure data access into Galaxy. F1000Res 2017; 5. [PMID: 28232859 PMCID: PMC5302147 DOI: 10.12688/f1000research.10221.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 11/30/2016] [Indexed: 12/31/2022] Open
Abstract
High-throughput molecular profiling techniques are routinely generating vast amounts of data for translational medicine studies. Secure access controlled systems are needed to manage, store, transfer and distribute these data due to their personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access and management to long-term archival of bio-molecular data. Each data provider is responsible for ensuring a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure (http://www.ctmm-trait.nl) as an example. Here we present the first outcomes of this project, a framework to enable the download of EGA data to a Galaxy server in a secure way. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that can download data securely from EGA into a Galaxy server, where it can subsequently be further processed. This tool will allow a user within the browser to run an entire analysis containing sensitive data from EGA, and to make this analysis available for other researchers in a reproducible manner, as shown with a proof of concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.
Affiliation(s)
- Youri Hoogstrate
  - Department of Bioinformatics, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Chao Zhang
  - Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
- Alexander Senf
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Saskia Hiltemann
  - Department of Bioinformatics, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Jaap Heringa
  - Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
- Guido Jenster
  - Department of Urology, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Gerrit A Meijer
  - Diagnostic Oncology, Netherlands Cancer Institute, Amsterdam, Netherlands
- Andrew Stubbs
  - Department of Bioinformatics, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Jordi Rambla
  - Centre for Genomic Regulation, Parc de Recerca Biomédica de Barcelona, Barcelona, Spain
- Dylan Spalding
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Sanne Abeln
  - Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
|
33
|
From the evaluation of existing solutions to an all-inclusive package for biobanks. HEALTH AND TECHNOLOGY 2017; 7:89-95. [PMID: 28344915 PMCID: PMC5346419 DOI: 10.1007/s12553-016-0175-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/04/2016] [Accepted: 12/19/2016] [Indexed: 11/26/2022]
Abstract
The domain of biobanking has gone through many stages and as a result there is a wide range of commercial and open source software solutions available. The utilization of these software tools requires different levels of domain and technical skills for installation, configuration and ultimate use of these biobank software tools. To compound this complexity, the biobanking community is required to work together in order to share knowledge and jointly build solutions to underpin the research infrastructure. We have evaluated the available tools, described them in a catalogue (BiobankApps) and made a selection of tools available to biobanks in a reference toolbox (BIBBOX) that are use-case driven. In the BiobankApps tool catalogue, both commercial and open source software solutions related to the biobanking domain are included, classified and evaluated. The evaluation covers: 1) “user review” by an authenticated user; 2) domain expert: quick analysis by BBMRI members; and 3) domain expert: detailed analysis and test installation with real world data. The evaluation is paired with a survey across the more “advanced” (from a technology perspective) biobanks to investigate what tools are currently used and summarises known benefits/drawbacks of the respective packages. In the second step we recommend tools for specific use cases, and install, configure and connect these in the BIBBOX framework. This service also builds on the existing work in the United Kingdom in seeking to establish the motivations for different stakeholders to become involved and therefore assisting in prioritising the use-cases based on the level of need and support within the research community. All tools associated with a use-case are available as BIBBOX applications (technically this is achieved by docker containers), which are integrated in the BIBBOX framework with central identification and user management.
In future work we plan to share the acquired knowledge with other networks, develop an Application Programmable Interface (API) for the exchange of metadata with other tool catalogues and work on an ontology for the evaluation of biobank software.
|
34
|
Exploring Protein-Protein Interactions as Drug Targets for Anti-cancer Therapy with In Silico Workflows. Methods Mol Biol 2017; 1647:221-236. [PMID: 28809006 DOI: 10.1007/978-1-4939-7201-2_15] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 12/01/2022]
Abstract
We describe a computational protocol to aid the design of small molecule and peptide drugs that target protein-protein interactions, particularly for anti-cancer therapy. To achieve this goal, we explore multiple strategies, including finding binding hot spots, incorporating chemical similarity and bioactivity data, and sampling similar binding sites from homologous protein complexes. We demonstrate how to combine existing interdisciplinary resources with examples of semi-automated workflows. Finally, we discuss several major problems, including the occurrence of drug-resistant mutations, drug promiscuity, and the design of dual-effect inhibitors.
|
35
|
|
36
|
Simonyan V, Goecks J, Mazumder R. Biocompute Objects-A Step towards Evaluation and Validation of Biomedical Scientific Computations. PDA J Pharm Sci Technol 2016; 71:136-146. [PMID: 27974626 DOI: 10.5731/pdajpst.2016.006734] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 11/05/2022]
Abstract
The unpredictability of actual physical, chemical, and biological experiments due to the multitude of environmental and procedural factors is well documented. What is systematically overlooked, however, is that computational biology algorithms are also affected by a multiplicity of parameters and have no lesser volatility. The complexities of computation protocols and interpretation of outcomes are only a part of the challenge: there are also virtually no standardized and industry-accepted metadata schemas for reporting the computational objects that record the parameters used for computations together with the results of computations. Thus, it is often impossible to reproduce the results of a previously performed computation due to missing information on parameters, versions, arguments, conditions, and procedures of application launch. In this article we describe the concept of biocompute objects developed specifically to satisfy regulatory research needs for evaluation, validation, and verification of bioinformatics pipelines. We envision generalized versions of biocompute objects called biocompute templates that support a single class of analyses but can be adapted to meet unique needs. To make these templates widely usable, we outline a simple but powerful cross-platform implementation. We also discuss the reasoning and potential usability of such a concept within the larger scientific community through the creation of a biocompute object database initially consisting of records relevant to the U.S. Food and Drug Administration. A biocompute object database record will be similar to a GenBank record in form; the difference being that instead of describing a sequence, the biocompute record will include information related to parameters, dependencies, usage, and other information related to a specific computational instance. This mechanism will extend similar efforts and also serve as a collaborative ground to ensure interoperability between different platforms, industries, scientists, regulators, and other stakeholders interested in biocomputing.
Affiliation(s)
- Vahan Simonyan
  - Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, MD, USA
- Jeremy Goecks
  - Computational Biology Institute, George Washington University, Ashburn, VA, USA
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, DC, USA
|
37
|
Guler AT, Waaijer CJ, Mohammed Y, Palmblad M. Automating bibliometric analyses using Taverna scientific workflows: A tutorial on integrating Web Services. J Informetr 2016. [DOI: 10.1016/j.joi.2016.05.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
|
38
|
Barnett CB, Aoki-Kinoshita KF, Naidoo KJ. The Glycome Analytics Platform: an integrative framework for glycobioinformatics. Bioinformatics 2016; 32:3005-11. [PMID: 27288496 DOI: 10.1093/bioinformatics/btw341] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/21/2015] [Accepted: 05/26/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Complex carbohydrates play a central role in cellular communication and in disease development. O- and N-glycans, which are post-translationally attached to proteins and lipids, are sugar chains that are rooted tree structures. Independent efforts to develop computational tools for analyzing complex carbohydrate structures have been designed to exploit specific databases requiring unique formatting and limited transferability. Attempts have been made at integrating these resources, yet it remains difficult to communicate and share data across several online resources. A disadvantage of the lack of coordination between development efforts is the inability of the user community to create reproducible analyses (workflows). The latter results in the more serious unreliability of glycomics metadata. RESULTS In this paper, we recognize the significance of connecting multiple online glycan resources that can be used to design reproducible experiments for obtaining, generating and analyzing cell glycomes. To address this, a suite of tools and utilities has been integrated into the analytic functionality of the Galaxy bioinformatics platform to provide a Glycome Analytics Platform (GAP). Using this platform, users can design in silico workflows to manipulate various formats of glycan sequences and analyze glycomes through access to web data and services. We illustrate the central functionality and features of the GAP by way of example; we analyze and compare the features of the N-glycan glycome of monocytic cells sourced from two separate data depositions. This paper highlights the use of reproducible research methods for glycomics analysis and the GAP presents an opportunity for integrating tools in glycobioinformatics.
AVAILABILITY AND IMPLEMENTATION This software is open-source and available online at https://bitbucket.org/scientificomputing/glycome-analytics-platform CONTACTS chris.barnett@uct.ac.za or kevin.naidoo@uct.ac.za SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christopher B Barnett
- Scientific Computing Research Unit and Department of Chemistry, University of Cape Town, Rondebosch 7701, South Africa
- Kiyoko F Aoki-Kinoshita
- Department of Bioinformatics, Faculty of Engineering, Soka University, Hachioji, Tokyo 192-8577, Japan
- Kevin J Naidoo
- Scientific Computing Research Unit and Department of Chemistry, University of Cape Town, Rondebosch 7701, South Africa
|
39
|
Digles D, Zdrazil B, Neefs JM, Van Vlijmen H, Herhaus C, Caracoti A, Brea J, Roibás B, Loza MI, Queralt-Rosinach N, Furlong LI, Gaulton A, Bartek L, Senger S, Chichester C, Engkvist O, Evelo CT, Franklin NI, Marren D, Ecker GF, Jacoby E. Open PHACTS computational protocols for in silico target validation of cellular phenotypic screens: knowing the knowns. MedChemComm 2016; 7:1237-1244. [PMID: 27774140 PMCID: PMC5063042 DOI: 10.1039/c6md00065g] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 02/01/2016] [Accepted: 05/10/2016] [Indexed: 01/09/2023]
Abstract
Phenotypic screening is in a renaissance phase and is expected by many academic and industry leaders to accelerate the discovery of new drugs for new biology. Given that phenotypic screening is by definition target agnostic, the emphasis of in silico and in vitro follow-up work is on the exploration of possible molecular mechanisms and efficacy targets underlying the biological processes interrogated by the phenotypic screening experiments. Herein, we present six exemplar computational protocols for the interpretation of cellular phenotypic screens based on the integration of compound, target, pathway, and disease data established by the IMI Open PHACTS project. The protocols annotate phenotypic hit lists and allow follow-up experiments and mechanistic conclusions. The annotations included are from ChEMBL, ChEBI, GO, WikiPathways and DisGeNET. Also provided are protocols which select, from the IUPHAR/BPS Guide to PHARMACOLOGY interaction file, selective compounds to probe potential targets, and a correlation robot which systematically aims to identify an overlap of active compounds in both the phenotypic assay and any kinase assay. The protocols are applied to a phenotypic pre-lamin A/C splicing assay selected from the ChEMBL database to illustrate the process. The computational protocols make use of the Open PHACTS API and data and are built within the Pipeline Pilot and KNIME workflow tools.
Affiliation(s)
- D Digles
- Department of Pharmaceutical Chemistry, University of Vienna, Pharmacoinformatics Research Group, Althanstraße 14, 1090 Wien, Austria
- B Zdrazil
- Department of Pharmaceutical Chemistry, University of Vienna, Pharmacoinformatics Research Group, Althanstraße 14, 1090 Wien, Austria
- J-M Neefs
- Janssen Research & Development, Turnhoutseweg 30, B-2340 Beerse, Belgium
- H Van Vlijmen
- Janssen Research & Development, Turnhoutseweg 30, B-2340 Beerse, Belgium
- C Herhaus
- Merck KGaA, Merck Serono R&D, Computational Chemistry, Frankfurter Straße 250, 64293 Darmstadt, Germany
- A Caracoti
- BIOVIA, a Dassault Systèmes brand, 334 Cambridge Science Park, Cambridge CB4 0WN, UK
- J Brea
- Grupo BioFarma-USEF, Departamento de Farmacología, Facultad de Farmacia, Campus Universitario Sur s/n, 15782 Santiago de Compostela, Spain
- B Roibás
- Grupo BioFarma-USEF, Departamento de Farmacología, Facultad de Farmacia, Campus Universitario Sur s/n, 15782 Santiago de Compostela, Spain
- M I Loza
- Grupo BioFarma-USEF, Departamento de Farmacología, Facultad de Farmacia, Campus Universitario Sur s/n, 15782 Santiago de Compostela, Spain
- N Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- L I Furlong
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- A Gaulton
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- L Bartek
- GlaxoSmithKline, Medicines Research Centre, Stevenage SG1 2NY, UK
- S Senger
- GlaxoSmithKline, Medicines Research Centre, Stevenage SG1 2NY, UK
- C Chichester
- Swiss Institute of Bioinformatics, CALIPHO Group, CMU Rue Michel-Servet 1, 1211 Geneva 4, Switzerland; Nestlé Institute of Health Sciences SA, EPFL Innovation Park, Bâtiment H, 1015 Lausanne, Switzerland
- O Engkvist
- Chemistry Innovation Centre, Discovery Sciences, AstraZeneca R&D Gothenburg, SE-431 83 Mölndal, Sweden
- C T Evelo
- Department of Bioinformatics - BiGCaT, P.O. Box 616, UNS50 Box19, NL-6200MD Maastricht, The Netherlands
- N I Franklin
- Open Innovation Drug Discovery, Discovery Chemistry Eli Lilly and Company, Lilly Corporate Center, DC 1920, Indianapolis, IN 46285, USA
- D Marren
- Eli Lilly and Company Ltd., Lilly Research Centre, Erl Wood Manor, Sunninghill Road, Windlesham, Surrey GU20 6PH, England, UK
- G F Ecker
- Department of Pharmaceutical Chemistry, University of Vienna, Pharmacoinformatics Research Group, Althanstraße 14, 1090 Wien, Austria
- E Jacoby
- Janssen Research & Development, Turnhoutseweg 30, B-2340 Beerse, Belgium
|
40
|
de la Garza L, Veit J, Szolek A, Röttig M, Aiche S, Gesing S, Reinert K, Kohlbacher O. From the desktop to the grid: scalable bioinformatics via workflow conversion. BMC Bioinformatics 2016; 17:127. [PMID: 26968893 PMCID: PMC4788856 DOI: 10.1186/s12859-016-0978-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 11/12/2015] [Accepted: 03/03/2016] [Indexed: 01/04/2023] Open
Abstract
Background Reproducibility is one of the tenets of the scientific method. Scientific experiments often comprise complex data flows, selection of adequate parameters, and analysis and visualization of intermediate and end results. Breaking down the complexity of such experiments into the joint collaboration of small, repeatable, well defined tasks, each with well defined inputs, parameters, and outputs, offers the immediate benefit of identifying bottlenecks and pinpointing sections which could benefit from parallelization, among others. Workflows rest upon the notion of splitting complex work into the joint effort of several manageable tasks. There are several engines that give users the ability to design and execute workflows. Each engine was created to address certain problems of a specific community, therefore each one has its advantages and shortcomings. Furthermore, not all features of all workflow engines are royalty-free, an aspect that could potentially drive away members of the scientific community. Results We have developed a set of tools that enables the scientific community to benefit from workflow interoperability. We developed a platform-free structured representation of parameters, inputs, and outputs of command-line tools in so-called Common Tool Descriptor documents. We have also overcome the shortcomings and combined the features of two royalty-free workflow engines with a substantial user community: the Konstanz Information Miner, an engine which we see as a formidable workflow editor, and the Grid and User Support Environment, a web-based framework able to interact with several high-performance computing resources. We have thus created a free and highly accessible way to design workflows on a desktop computer and execute them on high-performance computing resources.
Conclusions Our work will not only reduce time spent on designing scientific workflows, but also make executing workflows on remote high-performance computing resources more accessible to technically inexperienced users. We strongly believe that our efforts not only decrease the turnaround time to obtain scientific results but also have a positive impact on reproducibility, thus elevating the quality of obtained scientific results.
Affiliation(s)
- Luis de la Garza
- Center for Bioinformatics and Dept. of Computer Science, University of Tübingen, Sand 14, Tübingen, 72070, Germany
- Johannes Veit
- Center for Bioinformatics and Dept. of Computer Science, University of Tübingen, Sand 14, Tübingen, 72070, Germany
- Andras Szolek
- Center for Bioinformatics and Dept. of Computer Science, University of Tübingen, Sand 14, Tübingen, 72070, Germany
- Marc Röttig
- Center for Bioinformatics and Dept. of Computer Science, University of Tübingen, Sand 14, Tübingen, 72070, Germany
- Stephan Aiche
- Algorithmic Bioinformatics, Computer Science Institute, Freie Universität Berlin, Takustr. 9, Berlin, 14195, Germany
- Sandra Gesing
- College of Engineering, University of Notre Dame, 257 Fitzpatrick Hall, Notre Dame, 46556, IN, United States
- Knut Reinert
- Algorithmic Bioinformatics, Computer Science Institute, Freie Universität Berlin, Takustr. 9, Berlin, 14195, Germany
- Oliver Kohlbacher
- Center for Bioinformatics and Dept. of Computer Science, University of Tübingen, Sand 14, Tübingen, 72070, Germany
|
41
|
Abstract
Scientific workflows organize the assembly of specialized software into an overall data flow and are particularly well suited for multi-step analyses using different types of software tools. They are also favorable in terms of reusability, as previously designed workflows could be made publicly available through the myExperiment community and then used in other workflows. We here illustrate how scientific workflows and the Taverna workbench in particular can be used in bibliometrics. We discuss the specific capabilities of Taverna that make this software a powerful tool in this field, such as automated data import via Web services, data extraction from XML by XPaths, and statistical analysis and visualization with R. The support of the latter is particularly relevant, as it allows integration of a number of recently developed R packages specifically for bibliometrics. Examples are used to illustrate the possibilities of Taverna in the fields of bibliometrics and scientometrics.
Affiliation(s)
- Arzu Tugce Guler
- Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The Netherlands
- Cathelijn J. F. Waaijer
- Faculty of Social and Behavioural Sciences, Centre for Science and Technology Studies, Leiden University, Leiden, The Netherlands
- Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The Netherlands
|
42
|
Tosta FE, Braganholo V, Murta L, Mattoso M. Improving workflow design by mining reusable tasks. Journal of the Brazilian Computer Society 2015. [DOI: 10.1186/s13173-015-0035-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
|
43
|
Sfakianaki P, Koumakis L, Sfakianakis S, Iatraki G, Zacharioudakis G, Graf N, Marias K, Tsiknakis M. Semantic biomedical resource discovery: a Natural Language Processing framework. BMC Med Inform Decis Mak 2015; 15:77. [PMID: 26423616 PMCID: PMC4591066 DOI: 10.1186/s12911-015-0200-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 03/09/2015] [Accepted: 09/21/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A plethora of publicly available biomedical resources currently exist and are constantly increasing at a fast rate. In parallel, specialized repositories are being developed, indexing numerous clinical and biomedical tools. The main drawback of such repositories is the difficulty in locating appropriate resources for a clinical or biomedical decision task, especially for non-Information Technology expert users. In parallel, although NLP research in the clinical domain has been active since the 1960s, progress in the development of NLP applications has been slow and lags behind progress in the general NLP domain. The aim of the present study is to investigate the use of semantics for biomedical resources annotation with domain specific ontologies and exploit Natural Language Processing methods in empowering the non-Information Technology expert users to efficiently search for biomedical resources using natural language. METHODS A Natural Language Processing engine which can "translate" free text into targeted queries, automatically transforming a clinical research question into a request description that contains only terms of ontologies, has been implemented. The implementation is based on information extraction techniques for text in natural language, guided by integrated ontologies. Furthermore, knowledge from robust text mining methods has been incorporated to map descriptions into suitable domain ontologies in order to ensure that the biomedical resources descriptions are domain oriented and enhance the accuracy of services discovery. The framework is freely available as a web application at ( http://calchas.ics.forth.gr/ ). RESULTS For our experiments, a range of clinical questions were established based on descriptions of clinical trials from the ClinicalTrials.gov registry as well as recommendations from clinicians.
Domain experts manually identified the available tools in a tools repository which are suitable for addressing the clinical questions at hand, either individually or as a set of tools forming a computational pipeline. The results were compared with those obtained from an automated discovery of candidate biomedical tools. For the evaluation of the results, precision and recall measurements were used. Our results indicate that the proposed framework has a high precision and low recall, implying that the system returns essentially more relevant results than irrelevant. CONCLUSIONS Adequate biomedical ontologies, sufficient NLP tools, and biomedical annotation systems of sufficient quality are already available for the implementation of a biomedical resources discovery framework, based on the semantic annotation of resources and the use of NLP techniques. The results of the present study demonstrate the clinical utility of the application of the proposed framework which aims to bridge the gap between clinical questions in natural language and efficient dynamic biomedical resources discovery.
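As a minimal illustration of the evaluation described in the abstract above, precision and recall can be computed by comparing the set of tools selected by domain experts with the set returned by automated discovery. The tool names and the function `precision_recall` below are hypothetical and are not taken from the study; the numbers only show how high precision can coexist with low recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against an expert-chosen set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # tools that were both returned and judged suitable
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: experts judged four tools suitable, discovery returned two.
expert_tools = {"tool_a", "tool_b", "tool_c", "tool_d"}
discovered_tools = {"tool_a", "tool_b"}

precision, recall = precision_recall(discovered_tools, expert_tools)
print(precision, recall)  # 1.0 0.5
```

In this made-up case every returned tool is relevant (precision 1.0), but half of the suitable tools are missed (recall 0.5).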
Affiliation(s)
- Pepi Sfakianaki
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Lefteris Koumakis
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Stelios Sfakianakis
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Galatia Iatraki
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Giorgos Zacharioudakis
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Norbert Graf
- Paediatric Haematology and Oncology, Saarland University Hospital, Homburg, Germany
- Kostas Marias
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Manolis Tsiknakis
- Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete Greece
- Department of Informatics Engineering, Technological Educational Institute, Heraklion, Crete Greece
|
44
|
Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O. BioImg.org: A Catalog of Virtual Machine Images for the Life Sciences. Bioinform Biol Insights 2015; 9:125-8. [PMID: 26401099 PMCID: PMC4567039 DOI: 10.4137/bbi.s28636] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 04/28/2015] [Revised: 06/29/2015] [Accepted: 07/05/2015] [Indexed: 12/14/2022] Open
Abstract
Virtualization is becoming increasingly important in bioscience, enabling assembly and provisioning of complete computer setups, including operating system, data, software, and services packaged as virtual machine images (VMIs). We present an open catalog of VMIs for the life sciences, where scientists can share information about images and optionally upload them to a server equipped with a large file system and fast Internet connection. Other scientists can then search for and download images that can be run on the local computer or in a cloud computing environment, providing easy access to bioinformatics environments. We also describe applications where VMIs aid life science research, including distributing tools and data, supporting reproducible analysis, and facilitating education. BioImg.org is freely available at: https://bioimg.org.
Affiliation(s)
- Martin Dahlö
- SNIC-UPPMAX, Department of Information Technology, Uppsala University, Uppsala, Sweden; Science for Life Laboratory, Uppsala University, Uppsala, Sweden; Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
- Frédéric Haziza
- SNIC-UPPMAX, Department of Information Technology, Uppsala University, Uppsala, Sweden
- Erik Bongcam-Rudloff
- SLU-Global Bioinformatics Centre, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Ola Spjuth
- SNIC-UPPMAX, Department of Information Technology, Uppsala University, Uppsala, Sweden; Science for Life Laboratory, Uppsala University, Uppsala, Sweden; Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
|
45
|
A Digital Repository and Execution Platform for Interactive Scholarly Publications in Neuroscience. Neuroinformatics 2015; 14:23-40. [PMID: 26306864 DOI: 10.1007/s12021-015-9276-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Abstract
The CARMEN Virtual Laboratory (VL) is a cloud-based platform which allows neuroscientists to store, share, develop, execute, reproduce and publicise their work. This paper describes new functionality in the CARMEN VL: an interactive publications repository. This new facility allows users to link data and software to publications. This enables other users to examine data and software associated with the publication and execute the associated software within the VL using the same data as the authors used in the publication. The cloud-based architecture and SaaS (Software as a Service) framework allows vast data sets to be uploaded and analysed using software services. Thus, this new interactive publications facility allows others to build on research results through reuse. This aligns with recent developments by funding agencies, institutions, and publishers with a move to open access research. Open access provides reproducibility and verification of research resources and results. Publications and their associated data and software will be assured of long-term preservation and curation in the repository. Further, analysing research data and the evaluations described in publications frequently requires a number of execution stages many of which are iterative. The VL provides a scientific workflow environment to combine software services into a processing tree. These workflows can also be associated with publications and executed by users. The VL also provides a secure environment where users can decide the access rights for each resource to ensure copyright and privacy restrictions are met.
|
46
|
Cock PJA, Chilton JM, Grüning B, Johnson JE, Soranzo N. NCBI BLAST+ integrated into Galaxy. Gigascience 2015; 4:39. [PMID: 26336600 PMCID: PMC4557756 DOI: 10.1186/s13742-015-0080-7] [Citation(s) in RCA: 148] [Impact Index Per Article: 16.4] [Received: 12/31/2014] [Accepted: 08/18/2015] [Indexed: 01/29/2023] Open
Abstract
Background The NCBI BLAST suite has become ubiquitous in modern molecular biology and is used for small tasks such as checking capillary sequencing results of single PCR products, genome annotation or even larger scale pan-genome analyses. For early adopters of the Galaxy web-based biomedical data analysis platform, integrating BLAST into Galaxy was a natural step for sequence comparison workflows. Findings The command line NCBI BLAST+ tool suite was wrapped for use within Galaxy. Appropriate datatypes were defined as needed. The integration of the BLAST+ tool suite into Galaxy has the goal of making common BLAST tasks easy and advanced tasks possible. Conclusions This project is an informal international collaborative effort, and is deployed and used on Galaxy servers worldwide. Several examples of applications are described here.
Affiliation(s)
- Peter J A Cock
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA Scotland UK
- John M Chilton
- Minnesota Supercomputing Institute, University of Minnesota, 599 Walter Library, 117 Pleasant St. SE, 55455 Minneapolis, MN USA
- Björn Grüning
- Department of Computer Science, Albert-Ludwigs-University of Freiburg, Georges-Köhler-Allee 106, Freiburg, 79110 Germany
- James E Johnson
- Minnesota Supercomputing Institute, University of Minnesota, 599 Walter Library, 117 Pleasant St. SE, 55455 Minneapolis, MN USA
|
47
|
Velloso H, Vialle RA, Ortega JM. BOWS (bioinformatics open web services) to centralize bioinformatics tools in web services. BMC Res Notes 2015; 8:206. [PMID: 26032494 PMCID: PMC4467627 DOI: 10.1186/s13104-015-1190-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/27/2014] [Accepted: 05/20/2015] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Bioinformaticians face a range of difficulties in getting locally-installed tools running and producing results; they would greatly benefit from a system that could centralize most of the tools, using an easy interface for input and output. Web services, due to their universal nature and widely known interface, constitute a very good option to achieve this goal. RESULTS Bioinformatics open web services (BOWS) is a system based on generic web services produced to allow programmatic access to applications running on high-performance computing (HPC) clusters. BOWS mediates access to registered tools by providing front-end and back-end web services. Programmers can install applications in HPC clusters in any programming language and use the back-end service to check for new jobs and their parameters, and then to send the results to BOWS. Programs running on simple computers consume the BOWS front-end service to submit new processes and read results. BOWS compiles Java clients, which encapsulate the front-end web service requests, and automatically creates a web page that lists the registered applications and clients. CONCLUSIONS Bioinformatics open web services registered applications can be accessed from virtually any programming language through web services, or using standard Java clients. The back-end can run in HPC clusters, allowing bioinformaticians to remotely run high-processing demand applications directly from their machines.
Affiliation(s)
- Henrique Velloso
- Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, MG, Brazil
- Ricardo A Vialle
- Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, MG, Brazil
- J Miguel Ortega
- Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, MG, Brazil
|
48
|
Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BioMed Research International 2015; 2015:904541. [PMID: 26125026 PMCID: PMC4466500 DOI: 10.1155/2015/904541] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 12/31/2014] [Revised: 04/01/2015] [Accepted: 04/01/2015] [Indexed: 02/07/2023]
Abstract
Sequencing the human genome began in 1994, and 10 years of work were necessary in order to provide a nearly complete sequence. Nowadays, NGS technologies allow sequencing of a whole human genome in a few days. This deluge of data challenges scientists in many ways, as they are faced with data management issues and analysis and visualization drawbacks due to the limitations of current bioinformatics tools. In this paper, we describe how the NGS Big Data revolution changes the way of managing and analysing data. We present how biologists are confronted with an abundance of methods, tools, and data formats. To overcome these problems, we focus on Big Data Information Technology innovations from web and business intelligence. We underline the interest of NoSQL databases, which are much more efficient than relational databases. Since Big Data leads to the loss of interactivity with data during analysis due to high processing time, we describe solutions from Business Intelligence that allow one to regain interactivity whatever the volume of data is. We illustrate this point with a focus on the Amadea platform. Finally, we discuss visualization challenges posed by Big Data and present the latest innovations with JavaScript graphic libraries.
|
49
|
Drug discovery FAQs: workflows for answering multidomain drug discovery questions. Drug Discov Today 2015; 20:399-405. [DOI: 10.1016/j.drudis.2014.11.006] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 09/23/2014] [Revised: 10/22/2014] [Accepted: 11/13/2014] [Indexed: 12/26/2022]
|
50
|
Lord E, Diallo AB, Makarenkov V. Classification of bioinformatics workflows using weighted versions of partitioning and hierarchical clustering algorithms. BMC Bioinformatics 2015; 16:68. [PMID: 25887434 PMCID: PMC4354763 DOI: 10.1186/s12859-015-0508-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 10/01/2014] [Accepted: 02/20/2015] [Indexed: 11/10/2022] Open
Abstract
Background Workflows, or computational pipelines, consisting of collections of multiple linked tasks are becoming more and more popular in many scientific fields, including computational biology. For example, simulation studies, which are now a must for statistical validation of new bioinformatics methods and software, are frequently carried out using the available workflow platforms. Workflows are typically organized to minimize the total execution time and to maximize the efficiency of the included operations. Clustering algorithms can be applied either for regrouping similar workflows for their simultaneous execution on a server, or for dispatching some lengthy workflows to different servers, or for classifying the available workflows with a view to performing a specific keyword search. Results In this study, we consider four different workflow encoding and clustering schemes which are representative for bioinformatics projects. Some of them allow for clustering workflows with similar topological features, while the others regroup workflows according to their specific attributes (e.g. associated keywords) or execution time. The four types of workflow encoding examined in this study were compared using the weighted versions of k-means and k-medoids partitioning algorithms. The Calinski-Harabasz, Silhouette and logSS clustering indices were considered. Hierarchical classification methods, including the UPGMA, Neighbor Joining, Fitch and Kitsch algorithms, were also applied to classify bioinformatics workflows. Moreover, a novel pairwise measure of clustering solution stability, which can be computed in situations when a series of independent program runs is carried out, was introduced. Conclusions Our findings based on the analysis of 220 real-life bioinformatics workflows suggest that the weighted clustering models based on keywords information or tasks execution times provide the most appropriate clustering solutions. 
Using datasets generated by the Armadillo and Taverna scientific workflow management systems, we found that the weighted cosine distance in association with the k-medoids partitioning algorithm and the presence-absence workflow encoding provided the highest values of the Rand index among all compared clustering strategies. The introduced clustering stability indices, PS and PSG, can be effectively used to identify elements with a low clustering support. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0508-1) contains supplementary material, which is available to authorized users.
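To make the combination named in the abstract more concrete, the sketch below encodes two workflows as presence-absence vectors over a task vocabulary and compares them with a weighted cosine distance. It is an illustration under stated assumptions, not the authors' implementation: the task names, the uniform weights, and the particular formula used (one minus the weighted cosine similarity) are assumptions, since the abstract does not give the exact definition.

```python
import math


def presence_absence(workflow_tasks, vocabulary):
    """Encode a workflow as a 0/1 vector: 1 if the task occurs in the workflow."""
    return [1 if task in workflow_tasks else 0 for task in vocabulary]


def weighted_cosine_distance(x, y, weights):
    """One common form (assumed here): 1 - sum(w*x*y) / (||x||_w * ||y||_w)."""
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # treat an empty workflow as maximally distant
    return 1.0 - dot / (norm_x * norm_y)


# Hypothetical task vocabulary and workflows.
vocabulary = ["align", "filter", "tree", "plot"]
wf1 = presence_absence({"align", "filter", "tree"}, vocabulary)  # [1, 1, 1, 0]
wf2 = presence_absence({"align", "tree", "plot"}, vocabulary)    # [1, 0, 1, 1]
weights = [1.0, 1.0, 1.0, 1.0]  # uniform weights, chosen only for illustration

d = weighted_cosine_distance(wf1, wf2, weights)
print(round(d, 3))  # 0.333
```

A pairwise distance matrix built this way could then be passed to a k-medoids implementation, which only needs distances between items rather than coordinates.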
Affiliation(s)
- Etienne Lord
- Département d'informatique, Université du Québec à Montréal, C.P. 8888 succ. Centre-Ville, Montreal, QC, H3C 3P8, Canada; Département de sciences biologiques, Université à Montréal, C.P. 6128 succ. Centre-Ville, Montreal, QC, H3C 3J7, Canada
- Abdoulaye Baniré Diallo
- Département d'informatique, Université du Québec à Montréal, C.P. 8888 succ. Centre-Ville, Montreal, QC, H3C 3P8, Canada
- Vladimir Makarenkov
- Département d'informatique, Université du Québec à Montréal, C.P. 8888 succ. Centre-Ville, Montreal, QC, H3C 3P8, Canada
|