1
Green R, Qu X, Liu J, Yu T. BTR: a bioinformatics tool recommendation system. Bioinformatics 2024; 40:btae275. [PMID: 38662583 PMCID: PMC11091741 DOI: 10.1093/bioinformatics/btae275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 02/29/2024] [Accepted: 04/24/2024] [Indexed: 05/15/2024]
Abstract
Motivation: The rapid expansion of Bioinformatics research has led to a proliferation of computational tools for scientific analysis pipelines. However, constructing these pipelines is a demanding task, requiring extensive domain knowledge and careful consideration. As the Bioinformatics landscape evolves, researchers, both novice and expert, may feel overwhelmed in unfamiliar fields, potentially leading to the selection of unsuitable tools during workflow development. Results: In this article, we introduce the Bioinformatics Tool Recommendation system (BTR), a deep learning model designed to recommend suitable tools for a given workflow-in-progress. BTR leverages recent advances in graph neural network technology, representing the workflow as a graph to capture essential context. Natural language processing techniques enhance tool recommendations by analyzing associated tool descriptions. Experiments demonstrate that BTR outperforms the existing Galaxy tool recommendation system, showcasing its potential to streamline scientific workflow construction. Availability and implementation: The Python source code is available at https://github.com/ryangreenj/bioinformatics_tool_recommendation.
Affiliation(s)
- Ryan Green
- Department of Computer Science, University of Cincinnati, Cincinnati 45219, United States
- Xufeng Qu
- Department of Biostatistics, Virginia Commonwealth University, Richmond 23284, United States
- Jinze Liu
- Department of Biostatistics, Virginia Commonwealth University, Richmond 23284, United States
- Tingting Yu
- School of Computing, University of Connecticut, Storrs 06269, United States
2
Du X, Dastmalchi F, Diller MA, Brochhausen M, Garrett TJ, Hogan WR, Lemas DJ. An Automated Workflow Composition System for Liquid Chromatography-Mass Spectrometry Metabolomics Data Processing. Journal of the American Society for Mass Spectrometry 2023; 34:2857-2863. [PMID: 37874901 DOI: 10.1021/jasms.3c00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2023]
Abstract
Liquid chromatography-mass spectrometry (LC-MS) metabolomics studies produce high-dimensional data that must be processed by a complex network of informatics tools to generate analysis-ready data sets. As the first computational step in metabolomics, data processing is increasingly becoming a challenge for researchers to develop customized computational workflows that are applicable for LC-MS metabolomics analysis. Ontology-based automated workflow composition (AWC) systems provide a feasible approach for developing computational workflows that consume high-dimensional molecular data. We used the Automated Pipeline Explorer (APE) to create an AWC for LC-MS metabolomics data processing across three use cases. Our results show that APE predicted 145 data processing workflows across all the three use cases. We identified six traditional workflows and six novel workflows. Through manual review, we found that one-third of novel workflows were executable whereby the data processing function could be completed without obtaining an error. When selecting the top six workflows from each use case, the computational viable rate of our predicted workflows reached 45%. Collectively, our study demonstrates the feasibility of developing an AWC system for LC-MS metabolomics data processing.
Affiliation(s)
- Xinsong Du
- Division of General Internal Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, United States
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, United States
- Farhad Dastmalchi
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Matthew A Diller
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Mathias Brochhausen
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205, United States
- Timothy J Garrett
- Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- William R Hogan
- Data Science Institute, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States
- Dominick J Lemas
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Department of Obstetrics and Gynecology, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Center for Perinatal Outcomes Research, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
3
Scheider S, de Jong T. A conceptual model for automating spatial network analysis. Transactions in GIS 2022; 26:421-458. [PMID: 35874873 PMCID: PMC9298018 DOI: 10.1111/tgis.12855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2021] [Revised: 09/07/2021] [Accepted: 09/14/2021] [Indexed: 06/15/2023]
Abstract
Spatial network analysis is a collection of methods for measuring accessibility potentials as well as for analyzing flows over transport networks. Though it has been part of the practice of geographic information systems for a long time, designing network analytical workflows still requires a considerable amount of expertise. In principle, artificial intelligence methods for workflow synthesis could be used to automate this task. This would improve the (re)usability of analytic resources. However, though underlying graph algorithms are well understood, we still lack a conceptual model that captures the required methodological know-how. The reason is that in practice this know-how goes beyond graph theory to a significant extent. In this article we suggest interpreting spatial networks in terms of quantified relations between spatial objects, where both the objects themselves and their relations can be quantified in an extensive or an intensive manner. Using this model, it becomes possible to effectively organize data sources and network functions towards common analytical goals for answering questions. We tested our model on 12 analytical tasks, and evaluated automatically synthesized workflows with network experts. Results show that standard data models are insufficient for answering questions, and that our model adds information crucial for understanding spatial network functionality.
Affiliation(s)
- Simon Scheider
- Department of Human Geography and Spatial Planning, Utrecht University, Utrecht, the Netherlands
- Tom de Jong
- Department of Logistics, Stellenbosch University, Stellenbosch, South Africa
4
Lamprecht AL, Palmblad M, Ison J, Schwämmle V, Al Manir MS, Altintas I, Baker CJO, Ben Hadj Amor A, Capella-Gutierrez S, Charonyktakis P, Crusoe MR, Gil Y, Goble C, Griffin TJ, Groth P, Ienasescu H, Jagtap P, Kalaš M, Kasalica V, Khanteymoori A, Kuhn T, Mei H, Ménager H, Möller S, Richardson RA, Robert V, Soiland-Reyes S, Stevens R, Szaniszlo S, Verberne S, Verhoeven A, Wolstencroft K. Perspectives on automated composition of workflows in the life sciences. F1000Res 2021; 10:897. [PMID: 34804501 PMCID: PMC8573700 DOI: 10.12688/f1000research.54159.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 08/27/2021] [Indexed: 12/29/2022]
Abstract
Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. 
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.
Affiliation(s)
- Magnus Palmblad
- Leiden University Medical Center, 2333 ZA, Leiden, The Netherlands
- Jon Ison
- French Institute of Bioinformatics, 91057 Évry, France
- Ilkay Altintas
- University of California San Diego, La Jolla, CA, 92093, USA
- Christopher J. O. Baker
- University of New Brunswick, Saint John, E2L 4L5, Canada
- IPSNP Computing Inc., Saint John, E2L 4S6, Canada
- Yolanda Gil
- University of Southern California, Marina Del Rey, CA, 90292, USA
- Carole Goble
- Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Timothy J. Griffin
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, 55455, USA
- Paul Groth
- University of Amsterdam, 1090 GH Amsterdam, The Netherlands
- Hans Ienasescu
- Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Pratik Jagtap
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, 55455, USA
- Tobias Kuhn
- VU Amsterdam, 1081 HV Amsterdam, The Netherlands
- Hailiang Mei
- Sequencing Analysis Support Core, Leiden University Medical Center, 2333 ZC Leiden, The Netherlands
- Steffen Möller
- IBIMA, Rostock University Medical Center, 18057 Rostock, Germany
- Stian Soiland-Reyes
- Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Informatics Institute, University of Amsterdam, 1090 GH Amsterdam, The Netherlands
- Robert Stevens
- Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Suzan Verberne
- Leiden Institute of Advanced Computer Science, Leiden University, 2333 BE Leiden, The Netherlands
- Aswin Verhoeven
- Leiden University Medical Center, 2333 ZA, Leiden, The Netherlands
- Katherine Wolstencroft
- Leiden Institute of Advanced Computer Science, Leiden University, 2333 BE Leiden, The Netherlands
5
Kasalica V, Schwämmle V, Palmblad M, Ison J, Lamprecht AL. APE in the Wild: Automated Exploration of Proteomics Workflows in the bio.tools Registry. J Proteome Res 2021; 20:2157-2165. [PMID: 33720735 PMCID: PMC8041394 DOI: 10.1021/acs.jproteome.0c00983] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
The bio.tools registry is a main catalogue of computational tools in the life sciences. More than 17 000 tools have been registered by the international bioinformatics community. The bio.tools metadata schema includes semantic annotations of tool functions, that is, formal descriptions of tools' data types, formats, and operations with terms from the EDAM bioinformatics ontology. Such annotations enable the automated composition of tools into multistep pipelines or workflows. In this Technical Note, we revisit a previous case study on the automated composition of proteomics workflows. We use the same four workflow scenarios but instead of using a small set of tools with carefully handcrafted annotations, we explore workflows directly on bio.tools. We use the Automated Pipeline Explorer (APE), a reimplementation and extension of the workflow composition method previously used. Moving "into the wild" opens up an unprecedented wealth of tools and a huge number of alternative workflows. Automated composition tools can be used to explore this space of possibilities systematically. Inevitably, the mixed quality of semantic annotations in bio.tools leads to unintended or erroneous tool combinations. However, our results also show that additional control mechanisms (tool filters, configuration options, and workflow constraints) can effectively guide the exploration toward smaller sets of more meaningful workflows.
Affiliation(s)
- Vedran Kasalica
- Department of Information and Computing Sciences, Utrecht University, Utrecht 3584 CC, The Netherlands
- Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense 5230, Denmark
- Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
- Jon Ison
- Institut Français de Bioinformatique, CNRS, Crémieux F-91000, France
- Anna-Lena Lamprecht
- Department of Information and Computing Sciences, Utrecht University, Utrecht 3584 CC, The Netherlands
6
Kruiger JF, Kasalica V, Meerlo R, Lamprecht A, Nyamsuren E, Scheider S. Loose programming of GIS workflows with geo-analytical concepts. Transactions in GIS 2021; 25:424-449. [PMID: 33776542 PMCID: PMC7983927 DOI: 10.1111/tgis.12692] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 05/08/2023]
Abstract
Loose programming enables analysts to program with concepts instead of procedural code. Data transformations are left underspecified, leaving out procedural details and exploiting knowledge about the applicability of functions to data types. To synthesize workflows of high quality for a geo-analytical task, the semantic type system needs to reflect knowledge of geographic information systems (GIS) at a level that is deep enough to capture geo-analytical concepts and intentions, yet shallow enough to generalize over GIS implementations. Recently, core concepts of spatial information and related geo-analytical concepts were proposed as a way to add the required abstraction level to current geodata models. The core concept data types (CCD) ontology is a semantic type system that can be used to constrain GIS functions for workflow synthesis. However, to date, it is unknown what gain in precision and workflow quality can be expected. In this article we synthesize workflows by annotating GIS tools with these types, specifying a range of common analytical tasks taken from an urban livability scenario. We measure the quality of automatically synthesized workflows against a benchmark generated from common data types. Results show that CCD concepts significantly improve the precision of workflow synthesis.
Affiliation(s)
- Johannes F. Kruiger
- Department of Human Geography and Spatial Planning, Utrecht University, Utrecht, the Netherlands
- Vedran Kasalica
- Department of Information and Computing Sciences, Utrecht University, Utrecht, the Netherlands
- Rogier Meerlo
- Department of Human Geography and Spatial Planning, Utrecht University, Utrecht, the Netherlands
- Anna-Lena Lamprecht
- Department of Information and Computing Sciences, Utrecht University, Utrecht, the Netherlands
- Enkhbold Nyamsuren
- Department of Human Geography and Spatial Planning, Utrecht University, Utrecht, the Netherlands
- Simon Scheider
- Department of Human Geography and Spatial Planning, Utrecht University, Utrecht, the Netherlands