1
DIMet: an open-source tool for differential analysis of targeted isotope-labeled metabolomics data. Bioinformatics 2024; 40:btae282. [PMID: 38656970 PMCID: PMC11109473 DOI: 10.1093/bioinformatics/btae282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 04/04/2024] [Accepted: 04/23/2024] [Indexed: 04/26/2024] Open
Abstract
MOTIVATION Many diseases, such as cancer, are characterized by an alteration of cellular metabolism allowing cells to adapt to changes in the microenvironment. Stable isotope-resolved metabolomics (SIRM) and downstream data analyses are widely used techniques for unraveling cells' metabolic activity to understand the altered functioning of metabolic pathways in the diseased state. While a number of bioinformatic solutions exist for the differential analysis of SIRM data, there is currently no available resource providing a comprehensive toolbox. RESULTS In this work, we present DIMet, a one-stop comprehensive tool for differential analysis of targeted tracer data. DIMet accepts metabolite total abundances, isotopologue contributions, and isotopic mean enrichment, and supports differential comparison (pairwise and multi-group), time-series analyses, and labeling profile comparison. Moreover, it integrates transcriptomics and targeted metabolomics data through network-based metabolograms. We illustrate the use of DIMet in real SIRM datasets obtained from Glioblastoma P3 cell-line samples. DIMet is open-source and readily available for routine downstream analysis of isotope-labeled targeted metabolomics data, as it can be used either from the command line interface or as a complete toolkit in the public Galaxy Europe and Workflow4Metabolomics web platforms. AVAILABILITY AND IMPLEMENTATION DIMet is freely available at https://github.com/cbib/DIMet, and through https://usegalaxy.eu and https://workflow4metabolomics.usegalaxy.fr. All the datasets are available at Zenodo https://zenodo.org/records/10925786.
2
Mobilisation and analyses of publicly available SARS-CoV-2 data for pandemic responses. Microb Genom 2024; 10:001188. [PMID: 38358325 PMCID: PMC10926692 DOI: 10.1099/mgen.0.001188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 01/14/2024] [Indexed: 02/16/2024] Open
Abstract
The COVID-19 pandemic has seen large-scale pathogen genomic sequencing efforts, becoming part of the toolbox for surveillance and epidemic research. This resulted in an unprecedented level of data sharing to open repositories, which has actively supported the identification of SARS-CoV-2 structure, molecular interactions, mutations and variants, and facilitated vaccine development and drug reuse studies and design. The European COVID-19 Data Platform was launched to support this data sharing, and has resulted in the deposition of several million SARS-CoV-2 raw reads. In this paper we describe (1) open data sharing, (2) tools for submission, analysis, visualisation and data claiming (e.g. ORCiD), (3) the systematic analysis of these datasets, at scale via the SARS-CoV-2 Data Hubs as well as (4) lessons learnt. This paper describes a component of the Platform, the SARS-CoV-2 Data Hubs, which enable the extension and set up of infrastructure that we intend to use more widely in the future for pathogen surveillance and pandemic preparedness.
3
"Be sustainable": EOSC-Life recommendations for implementation of FAIR principles in life science data handling. EMBO J 2023; 42:e115008. [PMID: 37964598 PMCID: PMC10690449 DOI: 10.15252/embj.2023115008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 09/12/2023] [Accepted: 09/18/2023] [Indexed: 11/16/2023] Open
Abstract
The main goals and challenges for the life science communities in the Open Science framework are to increase reuse and sustainability of data resources, software tools, and workflows, especially in large-scale data-driven research and computational analyses. Here, we present key findings, procedures, effective measures and recommendations for generating and establishing sustainable life science resources based on the collaborative, cross-disciplinary work done within the EOSC-Life (European Open Science Cloud for Life Sciences) consortium. Bringing together 13 European life science research infrastructures, it has laid the foundation for an open, digital space to support biological and medical research. Using lessons learned from 27 selected projects, we describe the organisational, technical, financial and legal/ethical challenges that represent the main barriers to sustainability in the life sciences. We show how EOSC-Life provides a model for sustainable data management according to FAIR (findability, accessibility, interoperability, and reusability) principles, including solutions for sensitive- and industry-related resources, by means of cross-disciplinary training and best practices sharing. Finally, we illustrate how data harmonisation and collaborative work facilitate interoperability of tools, data, solutions and lead to a better understanding of concepts, semantics and functionalities in the life sciences.
4
Transformer-based tool recommendation system in Galaxy. BMC Bioinformatics 2023; 24:446. [PMID: 38012574 PMCID: PMC10680333 DOI: 10.1186/s12859-023-05573-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Accepted: 11/17/2023] [Indexed: 11/29/2023] Open
Abstract
BACKGROUND Galaxy is a web-based open-source platform for scientific analyses. Researchers use thousands of high-quality tools and workflows for their respective analyses in Galaxy. A tool recommender system predicts a collection of tools that can be used to extend an analysis. In this work, a tool recommender system is developed by training a transformer on workflows available on Galaxy Europe, and its performance is compared to other neural networks such as recurrent, convolutional and dense neural networks. RESULTS The transformer neural network achieves two times faster convergence, has significantly lower model usage (model reconstruction and prediction) time and shows a better generalisation that goes beyond training workflows than the older tool recommender system created using RNN in Galaxy. In addition, the transformer also outperforms CNN and DNN on several key indicators. It achieves a faster convergence time, lower model usage time, and higher quality tool recommendations than CNN. Compared to DNN, it converges faster to a higher precision@k metric (approximately 0.98 by transformer compared to approximately 0.9 by DNN) and shows higher quality tool recommendations. CONCLUSION Our work shows a novel usage of transformers to recommend tools for extending scientific workflows. A more robust tool recommendation model, created using a transformer, having significantly lower usage time than RNN and CNN, higher precision@k than DNN, and higher quality tool recommendations than all three neural networks, will benefit researchers in creating scientifically significant workflows and exploratory data analysis in Galaxy. Additionally, the ability to train faster than all three neural networks imparts more scalability for training on larger datasets consisting of millions of tool sequences. Open-source scripts to create the recommendation model are available under MIT licence at https://github.com/anuprulez/galaxy_tool_recommendation_transformers.
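The precision@k metric quoted in this abstract can be illustrated with a minimal sketch. It assumes the common definition (the fraction of the top-k recommended items that are actually relevant); the exact variant used in the paper may differ, and the function name and tool names below are hypothetical and not taken from the authors' code.

```python
def precision_at_k(ranked_tools, relevant_tools, k):
    """Fraction of the top-k ranked tools that appear in the relevant set.

    Assumed definition: hits among the first k recommendations, divided by k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_tools)
    top_k = ranked_tools[:k]
    hits = sum(1 for tool in top_k if tool in relevant)
    return hits / k


# Hypothetical ranking produced by a recommender for one workflow step.
ranked = ["bwa_mem", "samtools_sort", "fastqc", "multiqc"]
# Hypothetical set of tools that really followed this step in known workflows.
relevant = {"samtools_sort", "multiqc"}

print(precision_at_k(ranked, relevant, 2))
print(precision_at_k(ranked, relevant, 4))
```

Averaging this value over many held-out workflow steps gives a single score of the kind reported for the compared models.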
5
PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2023; 116:974-988. [PMID: 37818860 DOI: 10.1111/tpj.16474] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/02/2023] [Accepted: 09/08/2023] [Indexed: 10/13/2023]
Abstract
In modern reproducible, hypothesis-driven plant research, scientists are increasingly relying on research data management (RDM) services and infrastructures to streamline the processes of collecting, processing, sharing, and archiving research data. FAIR (i.e., findable, accessible, interoperable, and reusable) research data play a pivotal role in enabling the integration of interdisciplinary knowledge and facilitating the comparison and synthesis of a wide range of analytical findings. The PLANTdataHUB offers a solution that realizes RDM of scientific (meta)data as evolving collections of files in a directory - yielding FAIR digital objects called ARCs - with tools that enable scientists to plan, communicate, collaborate, publish, and reuse data on the same platform while gaining continuous quality control insights. The centralized platform is scalable from personal use to global communities and provides advanced federation capabilities for institutions that prefer to host their own satellite instances. This approach borrows many concepts from software development and adapts them to fit the challenges of the field of modern plant science undergoing digital transformation. The PLANTdataHUB supports researchers in each stage of a scientific project with adaptable continuous quality control insights, from the early planning phase to data publication. The central live instance of PLANTdataHUB is accessible at (https://git.nfdi4plants.org), and it will continue to evolve as a community-driven and dynamic resource that serves the needs of contemporary plant science.
6
Activator-blocker model of transcriptional regulation by pioneer-like factors. Nat Commun 2023; 14:5677. [PMID: 37709752 PMCID: PMC10502082 DOI: 10.1038/s41467-023-41507-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 09/06/2023] [Indexed: 09/16/2023] Open
Abstract
Zygotic genome activation (ZGA) in the development of flies, fish, frogs and mammals depends on pioneer-like transcription factors (TFs). Those TFs create open chromatin regions, promote histone acetylation on enhancers, and activate transcription. Here, we use the panel of single, double and triple mutants for zebrafish genome activators Pou5f3, Sox19b and Nanog, multi-omics and mathematical modeling to investigate the combinatorial mechanisms of genome activation. We show that Pou5f3 and Nanog act differently on synergistic and antagonistic enhancer types. Pou5f3 and Nanog both bind as pioneer-like TFs on synergistic enhancers, promote histone acetylation and activate transcription. Antagonistic enhancers are activated by binding of one of these factors. The other TF binds as non-pioneer-like TF, competes with the activator and blocks all its effects, partially or completely. This activator-blocker mechanism mutually restricts widespread transcriptional activation by Pou5f3 and Nanog and prevents premature expression of late developmental regulators in the early embryo.
7
Integrative meta-omics in Galaxy and beyond. ENVIRONMENTAL MICROBIOME 2023; 18:56. [PMID: 37420292 DOI: 10.1186/s40793-023-00514-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Accepted: 07/05/2023] [Indexed: 07/09/2023]
Abstract
BACKGROUND 'Omics methods have empowered scientists to tackle the complexity of microbial communities on a scale not attainable before. Individually, omics analyses can provide great insight; while combined as "meta-omics", they enhance the understanding of which organisms occupy specific metabolic niches, how they interact, and how they utilize environmental nutrients. Here we present three integrative meta-omics workflows, developed in Galaxy, for enhanced analysis and integration of metagenomics, metatranscriptomics, and metaproteomics, combined with our newly developed web-application, ViMO (Visualizer for Meta-Omics) to analyse metabolisms in complex microbial communities. RESULTS In this study, we applied the workflows on a highly efficient cellulose-degrading minimal consortium enriched from a biogas reactor to analyse the key roles of uncultured microorganisms in complex biomass degradation processes. Metagenomic analysis recovered metagenome-assembled genomes (MAGs) for several constituent populations including Hungateiclostridium thermocellum, Thermoclostridium stercorarium and multiple heterogenic strains affiliated to Coprothermobacter proteolyticus. The metagenomics workflow was developed as two modules, one standard, and one optimized for improving the MAG quality in complex samples by implementing a combination of single- and co-assembly, and dereplication after binning. The exploration of the active pathways within the recovered MAGs can be visualized in ViMO, which also provides an overview of the MAG taxonomy and quality (contamination and completeness), and information about carbohydrate-active enzymes (CAZymes), as well as KEGG annotations and pathways, with counts and abundances at both mRNA and protein level. To achieve this, the metatranscriptomic reads and metaproteomic mass-spectrometry spectra are mapped onto predicted genes from the metagenome to analyse the functional potential of MAGs, as well as the actual expressed proteins and functions of the microbiome, all visualized in ViMO. CONCLUSION Our three workflows for integrative meta-omics in combination with ViMO present a progression in the analysis of 'omics data, particularly within Galaxy, but also beyond. The optimized metagenomics workflow allows for detailed reconstruction of a microbial community consisting of MAGs with high quality, and thus improves analyses of the metabolism of the microbiome, using the metatranscriptomics and metaproteomics workflows.
8
A comparative gene expression matrix in Apoe-deficient mice identifies unique and atherosclerotic disease stage-specific gene regulation patterns in monocytes and macrophages. Atherosclerosis 2023; 371:1-13. [PMID: 36940535 DOI: 10.1016/j.atherosclerosis.2023.03.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Revised: 03/02/2023] [Accepted: 03/08/2023] [Indexed: 03/23/2023]
Abstract
BACKGROUND AND AIMS Atherosclerosis is a systemic and chronic inflammatory disease propagated by monocytes and macrophages. Yet, our knowledge of how the transcriptome of these cells evolves in time and space is limited. We aimed at characterizing gene expression changes in site-specific macrophages and in circulating monocytes during the course of atherosclerosis. METHODS We utilized apolipoprotein E-deficient mice undergoing one- and six-month high cholesterol diet to model early and advanced atherosclerosis. Aortic macrophages, peritoneal macrophages, and circulating monocytes from each mouse were subjected to bulk RNA-sequencing (RNA-seq). We constructed a comparative directory that profiles lesion- and disease stage-specific transcriptomic regulation of the three cell types in atherosclerosis. Lastly, the regulation of one gene, Gpnmb, whose expression positively correlated with atheroma growth, was validated using single-cell RNA-seq (scRNA-seq) of atheroma plaque from mice and humans. RESULTS The convergence of gene regulation between the three investigated cell types was surprisingly low. Overall, 3245 differentially expressed genes were involved in the biological modulation of aortic macrophages, among which fewer than 1% were commonly regulated by the remote monocytes/macrophages. Aortic macrophages regulated gene expression most actively during atheroma initiation. Through complementary interrogation of murine and human scRNA-seq datasets, we showcased the practicality of our directory, using the selected gene, Gpnmb, whose expression in aortic macrophages, and a subset of foamy macrophages in particular, strongly correlated with disease advancement during atherosclerosis initiation and progression. CONCLUSIONS Our study provides a unique toolset to explore gene regulation of macrophage-related biological processes in and outside the atheromatous plaque at early and advanced disease stages.
9
The Planemo toolkit for developing, deploying, and executing scientific data analyses in Galaxy and beyond. Genome Res 2023; 33:261-268. [PMID: 36828587 PMCID: PMC10069471 DOI: 10.1101/gr.276963.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2022] [Accepted: 01/11/2023] [Indexed: 02/26/2023]
Abstract
There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For more than a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. To streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users. Here we outline Planemo's implementation and describe its broad range of functionality for designing, testing, and executing Galaxy tools, workflows, and training material. In addition, we discuss the philosophy underlying Galaxy tool and workflow development, and how Planemo encourages the use of development best practices, such as test-driven development, by its users, including those who are not professional software developers.
10
An accessible infrastructure for artificial intelligence using a Docker-based JupyterLab in Galaxy. Gigascience 2022; 12:giad028. [PMID: 37099385 PMCID: PMC10132306 DOI: 10.1093/gigascience/giad028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Revised: 01/23/2023] [Accepted: 04/11/2023] [Indexed: 04/27/2023] Open
Abstract
BACKGROUND Artificial intelligence (AI) programs that train on large datasets require powerful compute infrastructure consisting of several CPU cores and GPUs. JupyterLab provides an excellent framework for developing AI programs, but it needs to be hosted on such an infrastructure to enable faster training of AI programs using parallel computing. FINDINGS An open-source, docker-based, and GPU-enabled JupyterLab infrastructure is developed that runs on the public compute infrastructure of Galaxy Europe consisting of thousands of CPU cores, many GPUs, and several petabytes of storage to rapidly prototype and develop end-to-end AI projects. Using a JupyterLab notebook, long-running AI model training programs can also be executed remotely to create trained models, represented in open neural network exchange (ONNX) format, and other output datasets in Galaxy. Other features include Git integration for version control, the option of creating and executing pipelines of notebooks, and multiple dashboards and packages for monitoring compute resources and visualization, respectively. CONCLUSIONS These features make JupyterLab in Galaxy Europe highly suitable for creating and managing AI projects. A recent scientific publication that predicts infected regions in COVID-19 computed tomography scan images is reproduced using various features of JupyterLab on Galaxy Europe. In addition, ColabFold, a faster implementation of AlphaFold2, is accessed in JupyterLab to predict the 3-dimensional structure of protein sequences. JupyterLab is accessible in 2 ways-one as an interactive Galaxy tool and the other by running the underlying Docker container. In both ways, long-running training can be executed on Galaxy's compute infrastructure. Scripts to create the Docker container are available under MIT license at https://github.com/usegalaxy-eu/gpu-jupyterlab-docker.
11
Training Infrastructure as a Service. Gigascience 2022; 12:giad048. [PMID: 37395629 PMCID: PMC10316688 DOI: 10.1093/gigascience/giad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 05/31/2023] [Accepted: 06/08/2023] [Indexed: 07/04/2023] Open
Abstract
BACKGROUND Hands-on training, whether in bioinformatics or other domains, often requires significant technical resources and knowledge to set up and run. Instructors must have access to powerful compute infrastructure that can support resource-intensive jobs running efficiently. Often this is achieved using a private server where there is no contention for the queue. However, this places a significant prerequisite knowledge or labor barrier for instructors, who must spend time coordinating deployment and management of compute resources. Furthermore, with the increase of virtual and hybrid teaching, where learners are located in separate physical locations, it is difficult to track student progress as efficiently as during in-person courses. FINDINGS Originally developed by Galaxy Europe and the Gallantries project, together with the Galaxy community, we have created Training Infrastructure-as-a-Service (TIaaS), aimed at providing user-friendly training infrastructure to the global training community. TIaaS provides dedicated training resources for Galaxy-based courses and events. Event organizers register their course, after which trainees are transparently placed in a private queue on the compute infrastructure, which ensures jobs complete quickly, even when the main queue is experiencing high wait times. A built-in dashboard allows instructors to monitor student progress. CONCLUSIONS TIaaS provides a significant improvement for instructors and learners, as well as infrastructure administrators. The instructor dashboard makes remote events not only possible but also easy. Students experience continuity of learning, as all training happens on Galaxy, which they can continue to use after the event. In the past 60 months, 504 training events with over 24,000 learners have used this infrastructure for Galaxy training.
12
The antileukemic activity of decitabine upon PML/RARA-negative AML blasts is supported by all-trans retinoic acid: in vitro and in vivo evidence for cooperation. Blood Cancer J 2022; 12:122. [PMID: 35995769 PMCID: PMC9395383 DOI: 10.1038/s41408-022-00715-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2022] [Revised: 07/03/2022] [Accepted: 07/29/2022] [Indexed: 12/02/2022] Open
Abstract
The prognosis of AML patients with adverse genetics, such as a complex, monosomal karyotype and TP53 lesions, is still dismal even with standard chemotherapy. DNA-hypomethylating agent monotherapy induces an encouraging response rate in these patients. When combined with decitabine (DAC), all-trans retinoic acid (ATRA) resulted in an improved response rate and longer overall survival in a randomized phase II trial (DECIDER; NCT00867672). The molecular mechanisms governing this in vivo synergism are unclear. We now demonstrate cooperative antileukemic effects of DAC and ATRA on AML cell lines U937 and MOLM-13. By RNA-sequencing, derepression of >1200 commonly regulated transcripts following the dual treatment was observed. Overall chromatin accessibility (interrogated by ATAC-seq) and, in particular, accessibility at motifs of retinoic acid response elements were affected by both single-agent DAC and ATRA, and enhanced by the dual treatment. Cooperativity regarding transcriptional induction and chromatin remodeling was demonstrated by interrogating the HIC1, CYP26A1, GBP4, and LYZ genes, and in vivo gene derepression by expression studies on peripheral blood blasts from AML patients receiving DAC + ATRA. The two drugs also cooperated in derepression of transposable elements, more effectively in U937 (mutated TP53) than MOLM-13 (intact TP53), resulting in a “viral mimicry” response. In conclusion, we demonstrate that in vitro and in vivo, the antileukemic and gene-derepressive epigenetic activity of DAC is enhanced by ATRA.
13
Abstract
BACKGROUND Chromatin loops are an essential factor in the structural organization of the genome; however, their detection in Hi-C interaction matrices is a challenging and compute-intensive task. The approach presented here, integrated into the HiCExplorer software, shows a chromatin loop detection algorithm that applies a strict candidate selection based on continuous negative binomial distributions and performs a Wilcoxon rank-sum test to detect enriched Hi-C interactions. RESULTS HiCExplorer's loop detection has a high detection rate and accuracy. It is the fastest available CPU implementation and utilizes all threads offered by modern multicore platforms. CONCLUSIONS HiCExplorer's method to detect loops by using a continuous negative binomial function combined with the donut approach from HiCCUPS leads to reliable and fast computation of loops. All the loop-calling algorithms investigated provide differing results, which intersect by ~50% at most. The tested in situ Hi-C data contain a large amount of noise; achieving better agreement between loop calling algorithms will require cleaner Hi-C data and therefore future improvements to the experimental methods that generate the data.
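The Wilcoxon rank-sum step mentioned in this abstract can be illustrated with a minimal, self-contained sketch. This is not HiCExplorer's implementation: it assumes a two-sided test with a normal approximation, midranks for ties and no tie correction of the variance, and the function name and the interaction counts are hypothetical.

```python
import math


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive midranks; the variance is not tie-corrected,
    which is a simplifying assumption of this sketch.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the values, remembering which sample each came from (0 = a, 1 = b).
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = midrank
        i = j + 1
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p


# Hypothetical counts: pixels around a loop candidate vs. its surrounding background.
peak_region = [30, 28, 35, 32, 31, 29, 33, 34]
background = [10, 12, 9, 11, 13, 8, 14, 7]
z, p = rank_sum_test(peak_region, background)
print(z, p)
```

A candidate would be kept only when the peak neighbourhood is ranked significantly higher than the background; the significance threshold is a user choice and is not specified here.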
14
Selection Analysis Identifies Clusters of Unusual Mutational Changes in Omicron Lineage BA.1 That Likely Impact Spike Function. Mol Biol Evol 2022; 39:msac061. [PMID: 35325204 PMCID: PMC9037384 DOI: 10.1093/molbev/msac061] [Citation(s) in RCA: 52] [Impact Index Per Article: 26.0] [Indexed: 11/13/2022] Open
Abstract
Among the 30 nonsynonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (1) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (2) interactions of Spike with ACE2 receptors, and (3) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any virus within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and, in combination with other mutations, adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron over all previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.
15
Galaxy: A Decade of Realising CWFR Concepts. DATA INTELLIGENCE 2022. [DOI: 10.1162/dint_a_00136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022] Open
Abstract
Despite recent encouragement to follow the FAIR principles, the day-to-day research practices have not changed substantially. Due to new developments and the increasing pressure to apply best practices, initiatives to improve the efficiency and reproducibility of scientific workflows are becoming more prevalent. In this article, we discuss the importance of well-annotated tools and the specific requirements to ensure reproducible research with FAIR outputs. We detail how Galaxy, an open-source workflow management system with a web-based interface, has implemented the concepts that are put forward by the Canonical Workflow Framework for Research (CWFR), whilst minimising changes to the practices of scientific communities. Although we showcase concrete applications from two different domains, this approach is generalisable to any domain and particularly useful in interdisciplinary research and science-based applications.
16
Non-coding RNAs underlying the pathophysiological links between type 2 diabetes and pancreatic cancer: A systematic review. J Diabetes Investig 2022; 13:405-428. [PMID: 34859606 PMCID: PMC8902405 DOI: 10.1111/jdi.13727] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/02/2021] [Revised: 11/11/2021] [Accepted: 11/30/2021] [Indexed: 12/21/2022] Open
Abstract
Type 2 diabetes is known as a risk factor for pancreatic cancer (PC). Various genetic and environmental factors cause both these global chronic diseases. The mechanisms that define their relationships are complex and poorly understood. Recent studies have implicated that metabolic abnormalities, including hyperglycemia and hyperinsulinemia, could lead to cell damage responses, cell transformation, and increased cancer risk. Hence, these kinds of abnormalities following molecular events could be essential to develop our understanding of this complicated link. Among different molecular events, focusing on shared signaling pathways including metabolic (PI3K/Akt/mTOR) and mitogenic (MAPK) pathways in addition to regulatory mechanisms of gene expression such as those involved in non-coding RNAs (miRNAs, circRNAs, and lncRNAs) could be considered as powerful tools to describe this association. A better understanding of the molecular mechanisms involved in the development of type 2 diabetes and pancreatic cancer would help us to find a new research area for developing therapeutic and preventive strategies. For this purpose, in this review, we focused on the shared molecular events resulting in type 2 diabetes and pancreatic cancer. First, a comprehensive literature review was performed to determine similar molecular pathways and non-coding RNAs; then, the final results were discussed in more detail.
17
Selection analysis identifies unusual clustered mutational changes in Omicron lineage BA.1 that likely impact Spike function. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2022:2022.01.14.476382. [PMID: 35075456 PMCID: PMC8786225 DOI: 10.1101/2022.01.14.476382] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Indexed: 11/25/2022]
Abstract
Among the 30 non-synonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (i) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (ii) interactions of Spike with ACE2 receptors, and (iii) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any genomes within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron over all previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.
|
18
|
Expanding the Galaxy's reference data. BIOINFORMATICS ADVANCES 2022; 2:vbac030. [PMID: 35669346 PMCID: PMC9155181 DOI: 10.1093/bioadv/vbac030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Revised: 04/01/2022] [Accepted: 04/26/2022] [Indexed: 01/27/2023]
Abstract
Summary Properly and effectively managing reference datasets is an important task for many bioinformatics analyses. Refgenie is a reference asset management system that allows users to easily organize, retrieve and share such datasets. Here, we describe the integration of refgenie into the Galaxy platform. Server administrators are able to configure Galaxy to make use of reference datasets made available on a refgenie instance. In addition, a Galaxy Data Manager tool has been developed to provide a graphical interface to refgenie's remote reference retrieval functionality. A large collection of reference datasets has also been made available using the CVMFS (CernVM File System) repository from GalaxyProject.org, with mirrors across the USA, Canada, Europe and Australia, enabling easy use outside of Galaxy. Availability and implementation The ability of Galaxy to use refgenie assets was added to the core Galaxy framework in version 22.01, which is available from https://github.com/galaxyproject/galaxy under the Academic Free License version 3.0. The refgenie Data Manager tool can be installed via the Galaxy ToolShed, with source code managed at https://github.com/BlankenbergLab/galaxy-tools-blankenberg/tree/main/data_managers/data_manager_refgenie_pull and released using an MIT license. Access to existing data is also available through CVMFS, with instructions at https://galaxyproject.org/admin/reference-data-repo/. No new data were generated or analyzed in support of this research.
|
19
|
|
20
|
Scool: a new data storage format for single-cell Hi-C data. Bioinformatics 2021; 37:2053-2054. [PMID: 33135074 PMCID: PMC8337000 DOI: 10.1093/bioinformatics/btaa924] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/22/2020] [Revised: 10/07/2020] [Accepted: 10/20/2020] [Indexed: 12/30/2022] Open
Abstract
Motivation Single-cell Hi-C research currently lacks an efficient, easy-to-use and shareable data storage format. Recent studies have used a variety of sub-optimal solutions: publishing raw data only, text-based interaction matrices, or reusing established Hi-C storage formats for single interaction matrices. These approaches are storage and pre-processing intensive, require considerable manual labour and are often error-prone. Results The single-cell cooler file format (scool) provides an efficient, user-friendly and storage-saving approach for single-cell Hi-C data. It is a flavour of the established cooler format and guarantees stable API support. Availability and implementation The single-cell cooler format is part of the cooler file format as of API version 0.8.9. It is available via pip, conda and GitHub: https://github.com/mirnylab/cooler. Supplementary information Supplementary data are available at Bioinformatics online.
|
21
|
A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive. Bioinformatics 2021; 37:3983-3985. [PMID: 34096994 PMCID: PMC8344586 DOI: 10.1093/bioinformatics/btab421] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/25/2021] [Revised: 04/23/2021] [Accepted: 06/05/2021] [Indexed: 11/13/2022] Open
Abstract
SUMMARY Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrapper for the tool allows users with little or no bioinformatic knowledge to do bulk sequencing read submissions. The tool is also packaged in a Docker container to ease deployment. AVAILABILITY The CLI ENA upload tool is available at github.com/usegalaxy-eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); the Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and https://github.com/galaxyproject/tools-iuc/tree/master/tools/ena_upload (development); and the ENA upload Galaxy container at github.com/ELIXIR-Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785).
|
22
|
Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine. PLoS Comput Biol 2021; 17:e1009014. [PMID: 34061826 PMCID: PMC8213174 DOI: 10.1371/journal.pcbi.1009014] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/17/2021] [Revised: 06/18/2021] [Accepted: 04/27/2021] [Indexed: 11/25/2022] Open
Abstract
Supervised machine learning is an essential but difficult to use approach in biomedical data analysis. The Galaxy-ML toolkit (https://galaxyproject.org/community/machine-learning/) makes supervised machine learning more accessible to biomedical scientists by enabling them to perform end-to-end reproducible machine learning analyses at large scale using only a web browser. Galaxy-ML extends Galaxy (https://galaxyproject.org), a biomedical computational workbench used by tens of thousands of scientists across the world, with a suite of tools for all aspects of supervised machine learning.
|
23
|
|
24
|
|
25
|
Neuartige Emulsionen mit siliziumorganischen Copolymeren als Emulgatoren / New Types of Emulsions Containing Organo Modified Silicone Copolymers as Emulsifiers. TENSIDE SURFACT DET 2021. [DOI: 10.1515/tsd-1992-290203] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022]
|
26
|
Rheologische Eigenschaften von Dimersäurebetainlösungen / Rheological Properties of Dimer Acid Betaine Solutions. TENSIDE SURFACT DET 2021. [DOI: 10.1515/tsd-1994-310214] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/15/2022]
|
27
|
Phasenverhalten siliziumorganischer Kammpolymerer in wäßriger Lösung / Phase Behaviour of Silicone Surfactants with a Comblike Structure in Aqueous Solution. TENSIDE SURFACT DET 2021. [DOI: 10.1515/tsd-1994-310212] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 11/15/2022]
|
28
|
|
29
|
Robust and efficient single-cell Hi-C clustering with approximate k-nearest neighbor graphs. Bioinformatics 2021; 37:4006-4013. [PMID: 34021764 PMCID: PMC9502147 DOI: 10.1093/bioinformatics/btab394] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/14/2020] [Revised: 05/03/2021] [Accepted: 05/19/2021] [Indexed: 11/23/2022] Open
Abstract
Motivation Hi-C technology provides insights into the 3D organization of the chromatin, and the single-cell Hi-C method enables researchers to gain knowledge about the chromatin state at the level of individual cells. Single-cell Hi-C interaction matrices are high dimensional and very sparse. To cluster thousands of single-cell Hi-C interaction matrices, they are flattened and compiled into one matrix. Depending on the resolution, this matrix can have a few million or even billions of features; therefore, computations can be memory intensive. We present a single-cell Hi-C clustering approach using an approximate nearest neighbors method based on locality-sensitive hashing to reduce the dimensions and the computational resources. Results The presented method can process a 10 kb single-cell Hi-C dataset with 2600 cells and needs 40 GB of memory, while competing approaches are not computable even with 1 TB of memory. It can be shown that the differentiation of the cells by their chromatin folding properties and, therefore, the quality of the clustering of single-cell Hi-C data is advantageous compared to competing algorithms. Availability and implementation The presented clustering algorithm is part of the scHiCExplorer, is available on GitHub https://github.com/joachimwolff/scHiCExplorer, and as a conda package via the bioconda channel. The approximate nearest neighbors implementation is available via https://github.com/joachimwolff/sparse-neighbors-search and as a conda package via the bioconda channel. Supplementary information Supplementary data are available at Bioinformatics online.
|
30
|
|
31
|
A rigorous evaluation of optimal peptide targets for MS-based clinical diagnostics of Coronavirus Disease 2019 (COVID-19). Clin Proteomics 2021; 18:15. [PMID: 33971807 PMCID: PMC8107781 DOI: 10.1186/s12014-021-09321-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/01/2021] [Accepted: 05/01/2021] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND The Coronavirus Disease 2019 (COVID-19) global pandemic has had a profound, lasting impact on the world's population. A key aspect to providing care for those with COVID-19 and checking its further spread is early and accurate diagnosis of infection, which has been generally done via methods for amplifying and detecting viral RNA molecules. Detection and quantitation of peptides using targeted mass spectrometry-based strategies has been proposed as an alternative diagnostic tool due to direct detection of molecular indicators from non-invasively collected samples as well as the potential for high-throughput analysis in a clinical setting; many studies have revealed the presence of viral peptides within easily accessed patient samples. However, evidence suggests that some viral peptides could serve as better indicators of COVID-19 infection status than others, due to potential misidentification of peptides derived from human host proteins, poor spectral quality, high limits of detection etc. METHODS In this study we have compiled a list of 639 peptides identified from Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) samples, including from in vitro and clinical sources. These datasets were rigorously analyzed using automated, Galaxy-based workflows containing tools such as PepQuery, BLAST-P, and the Multi-omic Visualization Platform as well as the open-source tools MetaTryp and Proteomics Data Viewer (PDV). RESULTS Using PepQuery for confirming peptide spectrum matches, we were able to narrow down the 639-peptide possibilities to 87 peptides that were most robustly detected and specific to the SARS-CoV-2 virus. The specificity of these sequences to coronavirus taxa was confirmed using Unipept and BLAST-P. Through stringent p-value cutoff combined with manual verification of peptide spectrum match quality, 4 peptides derived from the nucleocapsid phosphoprotein and membrane protein were found to be most robustly detected across all cell culture and clinical samples, including those collected non-invasively. CONCLUSION We propose that these peptides would be of the most value for clinical proteomics applications seeking to detect COVID-19 from patient samples. We also contend that samples harvested from the upper respiratory tract and oral cavity have the highest potential for diagnosis of SARS-CoV-2 infection from easily collected patient samples using mass spectrometry-based proteomics assays.
|
32
|
Abstract
The Coronavirus Disease 2019 (COVID-19) outbreaks have caused universities all across the globe to close their campuses and forced them to initiate online teaching. This article reviews the pedagogical foundations for developing effective distance education practices, starting from the assumption that promoting autonomous thinking is an essential element to guarantee full citizenship in a democracy and for moral decision-making in situations of rapid change, which has become a pressing need in the context of a pandemic. In addition, the main obstacles related to this new context are identified, and solutions are proposed according to the existing bibliography in learning sciences.
|
33
|
pyGenomeTracks: reproducible plots for multivariate genomic datasets. Bioinformatics 2021; 37:422-423. [PMID: 32745185 PMCID: PMC8058774 DOI: 10.1093/bioinformatics/btaa692] [Citation(s) in RCA: 165] [Impact Index Per Article: 55.0] [Received: 04/30/2020] [Revised: 07/20/2020] [Accepted: 07/27/2020] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION Generating publication-ready plots to display multiple genomic tracks can pose a serious challenge. Making desirable and accurate figures requires considerable effort. This is usually done by hand or using vector graphics software. RESULTS pyGenomeTracks (PGT) is a modular plotting tool that easily combines multiple tracks. It enables a reproducible and standardized generation of highly customizable and publication-ready images. AVAILABILITY AND IMPLEMENTATION PGT is available through a graphical interface on https://usegalaxy.eu and through the command line. It is provided on conda via the bioconda channel and on pip, and it is openly developed on GitHub: https://github.com/deeptools/pyGenomeTracks. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
34
|
Abstract
BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize software containers including the metadata, versions, licenses, and software dependencies. BioContainers supports multiple packaging and container technologies such as Conda, Docker, and Singularity. BioContainers provides over 9000 bioinformatics tools, including more than 200 proteomics and mass spectrometry tools. Here we introduce the BioContainers Registry and RESTful API to make containerized bioinformatics tools more findable, accessible, interoperable, and reusable (FAIR). The BioContainers Registry provides a fast and convenient way to find and retrieve bioinformatics tool packages and containers. By doing so, it will increase the use of bioinformatics packages and containers while promoting replicability and reproducibility in research.
|
35
|
Freely accessible ready to use global infrastructure for SARS-CoV-2 monitoring. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2021:2021.03.25.437046. [PMID: 33791701 PMCID: PMC8010728 DOI: 10.1101/2021.03.25.437046] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/30/2022]
Abstract
The COVID-19 pandemic is the first global health crisis to occur in the age of big genomic data. Although data generation capacity is well established and sufficiently standardized, analytical capacity is not. To establish analytical capacity it is necessary to pull together global computational resources and deliver the best open source tools and analysis workflows within a ready-to-use, universally accessible resource. Such a resource should not be controlled by a single research group, institution, or country. Instead it should be maintained by a community of users and developers who ensure that the system remains operational and populated with current tools. A community is also essential for facilitating the types of discourse needed to establish best analytical practices. Bringing together public computational research infrastructure from the USA, Europe, and Australia, we developed a distributed data analysis platform that accomplishes these goals. It is immediately accessible to anyone in the world and is designed for the analysis of rapidly growing collections of deep sequencing datasets. We demonstrate its utility by detecting allelic variants in high-quality existing SARS-CoV-2 sequencing datasets and by continuous reanalysis of COG-UK data. All workflows, data, and documentation are available at https://covid19.galaxyproject.org.
|
36
|
Tool recommender system in Galaxy using deep learning. Gigascience 2021; 10:6065533. [PMID: 33404053 PMCID: PMC7786169 DOI: 10.1093/gigascience/giaa152] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/22/2020] [Revised: 08/02/2020] [Accepted: 11/26/2020] [Indexed: 11/28/2022] Open
Abstract
Background Galaxy is a web-based and open-source scientific data-processing platform. Researchers compose pipelines in Galaxy to analyse scientific data. These pipelines, also known as workflows, can be complex and difficult to create from thousands of tools, especially for researchers new to Galaxy. To help researchers with creating workflows, a system is developed to recommend tools that can facilitate further data analysis. Findings A model is developed to recommend tools using a deep learning approach by analysing workflows composed by researchers on the European Galaxy server. The higher-order dependencies in workflows, represented as directed acyclic graphs, are learned by training a gated recurrent units neural network, a variant of a recurrent neural network. In the neural network training, the weights of tools used are derived from their usage frequencies over time and the sequences of tools are uniformly sampled from training data. Hyperparameters of the neural network are optimized using Bayesian optimization. Mean accuracy of 98% in recommending tools is achieved for the top-1 metric. Conclusions The model is accessed by a Galaxy API to provide researchers with recommended tools in an interactive manner using multiple user interface integrations on the European Galaxy server. High-quality and highly used tools are shown at the top of the recommendations. The scripts and data to create the recommendation system are available under MIT license at https://github.com/anuprulez/galaxy_tool_recommendation.
|
37
|
A single-cell RNA-sequencing training and analysis suite using the Galaxy framework. Gigascience 2020; 9:5931798. [PMID: 33079170 PMCID: PMC7574357 DOI: 10.1093/gigascience/giaa102] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/22/2020] [Revised: 08/30/2020] [Indexed: 11/25/2022] Open
Abstract
Background The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically driven methods needed to process and understand these ever-growing datasets. Results Here we outline several Galaxy workflows and learning resources for single-cell RNA-sequencing, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows, and trainings that not only enable users to perform 1-click 10x preprocessing but also empower them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The downstream analysis supports a range of high-quality interoperable suites separated into common stages of analysis: inspection, filtering, normalization, confounder removal, and clustering. The teaching resources cover concepts from computer science to cell biology. Access to all resources is provided at the singlecell.usegalaxy.eu portal. Conclusions The reproducible and training-oriented Galaxy framework provides a sustainable high-performance computing environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy community provide a means for users to learn, publish, and teach single-cell RNA-sequencing analysis.
|
38
|
NanoGalaxy: Nanopore long-read sequencing data analysis in Galaxy. Gigascience 2020; 9:giaa105. [PMID: 33068114 PMCID: PMC7568507 DOI: 10.1093/gigascience/giaa105] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/17/2020] [Revised: 08/10/2020] [Accepted: 09/16/2020] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Long-read sequencing can be applied to generate very long contigs and even completely assembled genomes at relatively low cost and with minimal sample preparation. As a result, long-read sequencing platforms are becoming more popular. In this respect, the Oxford Nanopore Technologies-based long-read sequencing "nanopore" platform is becoming a widely used tool with a broad range of applications and end-users. However, the need to explore and manipulate the complex data generated by long-read sequencing platforms necessitates accompanying specialized bioinformatics platforms and tools to process the long-read data correctly. Importantly, such tools should additionally help democratize bioinformatics analysis by enabling easy access and ease-of-use solutions for researchers. RESULTS The Galaxy platform provides a user-friendly interface to computational command line-based tools, handles the software dependencies, and provides refined workflows. The users do not have to possess programming experience or extended computer skills. The interface enables researchers to perform powerful bioinformatics analysis, including the assembly and analysis of short- or long-read sequence data. The newly developed "NanoGalaxy" is a Galaxy-based toolkit for analysing long-read sequencing data, which is suitable for diverse applications, including de novo genome assembly from genomic, metagenomic, and plasmid sequence reads. CONCLUSIONS A range of best-practice tools and workflows for long-read sequence genome assembly has been integrated into a NanoGalaxy platform to facilitate easy access and use of bioinformatics tools for researchers. NanoGalaxy is freely available at the European Galaxy server https://nanopore.usegalaxy.eu with supporting self-learning training material available at https://training.galaxyproject.org.
|
39
|
No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathog 2020; 16:e1008643. [PMID: 32790776 PMCID: PMC7425854 DOI: 10.1371/journal.ppat.1008643] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 11/19/2022] Open
Abstract
The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.
|
40
|
Ultrastructural, transcriptional, and functional differences between human reticulated and non-reticulated platelets. J Thromb Haemost 2020; 18:2034-2046. [PMID: 32428354 DOI: 10.1111/jth.14895] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 03/05/2020] [Revised: 04/22/2020] [Accepted: 05/06/2020] [Indexed: 11/25/2022]
Abstract
BACKGROUND Reticulated platelets (RP) are the youngest circulating platelets in blood. An increased amount of this subpopulation is associated with higher cardiovascular risk and mortality. OBJECTIVES It is unknown to what extent intrinsic properties of RP contribute to their hyperreactive features. This study is the first to provide a multifactorial approach based on ultrastructural, transcriptional, and functional analysis of RP compared to non-RP sorted by flow cytometry. METHODS Reticulated platelets and non-RP were sorted after platelet staining with SYTO 13. Employing transmission electron microscopy, 1089 micrographs were analyzed for platelet size, amounts of intracellular structures, and anatomical surrogates indicating activation. Long and small RNA-sequencing (RNA-seq) were performed for analyzing differential gene expression. Functional analysis of P-selectin, an upregulated mRNA in RP, was performed in healthy subjects and patients on P2Y12 inhibitors. RESULTS Electron micrographs uncovered distinct ultrastructural differences in RP versus non-RP. Cross sections were 1.9-fold larger in RP (P < .0001). Amounts of α-granules, dense granules, open canalicular system openings, and mitochondria were increased in RP, which persisted after adjustment for platelet size. Long RNA-seq showed 1212 upregulated transcripts that are predominantly associated with platelet shape change, aggregation, and activation; 1264 mRNAs were downregulated in RP. Small RNA-seq did not reveal any differentially expressed transcripts. Functional analysis displayed higher P-selectin expression in RP as compared to non-RP upon ADP- or TRAP-stimulation. CONCLUSIONS Our results demonstrate that altered intrinsic structural and molecular properties contribute to the hyperreactivity of RP. These properties and an increased amount of RP may account for the association with cardiovascular risk.
|
41
|
Ewastools: Infinium Human Methylation BeadChip pipeline for population epigenetics integrated into Galaxy. Gigascience 2020; 9:5836679. [PMID: 32401319 PMCID: PMC7219210 DOI: 10.1093/gigascience/giaa049] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/14/2019] [Revised: 02/24/2020] [Accepted: 04/21/2020] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Infinium Human Methylation BeadChip is an array platform for complex evaluation of DNA methylation at an individual CpG locus in the human genome based on Illumina's bead technology and is one of the most common techniques used in epigenome-wide association studies. Finding associations between epigenetic variation and phenotype is a significant challenge in biomedical research. The newest version, HumanMethylationEPIC, quantifies the DNA methylation level of 850,000 CpG sites, while the previous versions, HumanMethylation450 and HumanMethylation27, measured >450,000 and 27,000 loci, respectively. Although a number of bioinformatics tools have been developed to analyse this assay, they require some programming skills and experience in order to be usable. RESULTS We have developed a pipeline for the Galaxy platform, aimed at users without programming experience, for DNA methylation analysis using the Infinium Human Methylation BeadChip. Our tool is integrated into Galaxy (http://galaxyproject.org), a web-based platform. This allows users to analyse data from the Infinium Human Methylation BeadChip in the easiest possible way. CONCLUSIONS The pipeline provides a group of integrated analytical methods wrapped into an easy-to-use interface. Our tool is available from the Galaxy ToolShed, a GitHub repository, and also as a Docker image. The aim of this project is to make Infinium Human Methylation BeadChip analysis more flexible and accessible to everyone.
|
42
|
GraphClust2: Annotation and discovery of structured RNAs with scalable and accessible integrative clustering. Gigascience 2019; 8:giz150. [PMID: 31808801 PMCID: PMC6897289 DOI: 10.1093/gigascience/giz150] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/15/2019] [Revised: 08/23/2019] [Accepted: 11/20/2019] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND RNA plays essential roles in all known forms of life. Clustering RNA sequences with common sequence and structure is an essential step towards studying RNA function. With the advent of high-throughput sequencing techniques, experimental and genomic data are expanding to complement the predictive methods. However, the existing methods do not effectively utilize and cope with the immense amount of data becoming available. RESULTS Hundreds of thousands of non-coding RNAs have been detected; however, their annotation is lagging behind. Here we present GraphClust2, a comprehensive approach for scalable clustering of RNAs based on sequence and structural similarities. GraphClust2 bridges the gap between high-throughput sequencing and structural RNA analysis and provides an integrative solution by incorporating diverse experimental and genomic data in an accessible manner via the Galaxy framework. GraphClust2 can efficiently cluster and annotate large datasets of RNAs and supports structure-probing data. We demonstrate that the annotation performance of clustering functional RNAs can be considerably improved. Furthermore, an off-the-shelf procedure is introduced for identifying locally conserved structure candidates in long RNAs. We suggest the presence and the sparseness of phylogenetically conserved local structures for a collection of long non-coding RNAs. CONCLUSIONS By clustering data from 2 cross-linking immunoprecipitation experiments, we demonstrate the benefits of GraphClust2 for motif discovery in the presence of biological and methodological biases. Finally, we uncover prominent targets of double-stranded RNA binding protein Roquin-1, such as BCOR's 3' untranslated region that contains multiple binding stem-loops that are evolutionarily conserved.
|
43
|
Abstract
The German Network for Bioinformatics Infrastructure (de.NBI) is a national and academic infrastructure funded by the German Federal Ministry of Education and Research (BMBF). The de.NBI provides (i) service, (ii) training, and (iii) cloud computing to users in life sciences research and biomedicine in Germany and Europe and (iv) fosters the cooperation of the German bioinformatics community with international network structures. The de.NBI members also run the German node (ELIXIR-DE) within the European ELIXIR infrastructure. The de.NBI / ELIXIR-DE training platform, also known as special interest group 3 (SIG 3) ‘Training & Education’, coordinates the bioinformatics training of de.NBI and the German ELIXIR node. The network provides a high-quality, coherent, timely, and impactful training program across its eight service centers. Life scientists learn how to handle and analyze biological big data more effectively by applying tools, standards and compute services provided by de.NBI. Since 2015, more than 300 training courses have been carried out with about 6,000 participants, and these courses received recommendation rates of almost 90% (status as of July 2020). In addition to face-to-face training courses, online training was introduced on the de.NBI website in 2016 and guidelines for the preparation of e-learning material were established in 2018. In 2016, ELIXIR-DE joined the ELIXIR training platform. Here, the de.NBI / ELIXIR-DE training platform collaborates with ELIXIR in training activities, in advertising training courses via TeSS, and in discussions on the exchange of data for training events, which is essential for quality assessment on both the technical and administrative levels. The de.NBI training program has trained thousands of scientists from Germany and beyond in many different areas of bioinformatics.
|
44
|
Practical Computational Reproducibility in the Life Sciences. Cell Syst 2018; 6:631-635. [PMID: 29953862 DOI: 10.1016/j.cels.2018.03.014] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 01/11/2018] [Revised: 02/24/2018] [Accepted: 03/23/2018] [Indexed: 10/28/2022]
Abstract
Many areas of research suffer from poor reproducibility, particularly in computationally intensive domains where results rely on a series of complex methodological decisions that are not well captured by traditional publication approaches. Various guidelines have emerged for achieving reproducibility, but implementation of these practices remains difficult due to the challenge of assembling software tools plus associated libraries, connecting tools together into pipelines, and specifying parameters. Here, we discuss a suite of cutting-edge technologies that make computational reproducibility not just possible, but practical in both time and effort. This suite combines three well-tested components (a system for building highly portable packages of bioinformatics software, containerization and virtualization technologies for isolating reusable execution environments for these packages, and workflow systems that automatically orchestrate the composition of these packages for entire pipelines) to achieve an unprecedented level of computational reproducibility. We also provide a practical implementation and five recommendations to help set a typical researcher on the path to performing data analyses reproducibly.
|
45
|
Community-Driven Data Analysis Training for Biology. Cell Syst 2018; 6:752-758.e1. [PMID: 29953864 DOI: 10.1016/j.cels.2018.05.012] [Citation(s) in RCA: 97] [Impact Index Per Article: 19.4] [Received: 11/28/2017] [Revised: 03/10/2018] [Accepted: 05/18/2018] [Indexed: 01/12/2023]
Abstract
The primary problem with the explosion of biomedical datasets is not the data, not computational resources, and not the required storage space, but the general lack of trained and skilled researchers to manipulate and analyze these data. Eliminating this problem requires development of comprehensive educational resources. Here we present a community-driven framework that enables modern, interactive teaching of data analytics in life sciences and facilitates the development of training materials. The key feature of our system is that it is not a static but a continuously improved collection of tutorials. By coupling tutorials with a web-based analysis framework, biomedical researchers can learn by performing computation themselves through a web browser without the need to install software or search for example datasets. Our ultimate goal is to expand the breadth of training materials to include fundamental statistical and data science topics and to precipitate a complete re-engineering of undergraduate and graduate curricula in life sciences. This project is accessible at https://training.galaxyproject.org.
|
46
|
The bio.tools registry of software tools and data resources for the life sciences. Genome Biol 2019; 20:164. [PMID: 31405382 PMCID: PMC6691543 DOI: 10.1186/s13059-019-1772-6] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/12/2019] [Accepted: 07/22/2019] [Indexed: 11/28/2022] Open
Abstract
Bioinformaticians and biologists rely increasingly upon workflows for the flexible utilization of the many life science tools that are needed to optimally convert data into knowledge. We outline a pan-European enterprise to provide a catalogue (https://bio.tools) of tools and databases that can be used in these workflows. bio.tools not only lists where to find resources, but also provides a wide variety of practical information.
|
47
|
The RNA workbench 2.0: next generation RNA data analysis. Nucleic Acids Res 2019; 47:W511-W515. [PMID: 31073612 PMCID: PMC6602469 DOI: 10.1093/nar/gkz353] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/08/2019] [Revised: 04/11/2019] [Accepted: 04/29/2019] [Indexed: 12/30/2022] Open
Abstract
RNA has become one of the major research topics in molecular biology. As a central player in key processes regulating gene expression, RNA is the focus of many efforts to decipher the pathways that govern the transition of genetic information to a fully functional cell. As more and more researchers join this endeavour, there is a rapidly growing demand for comprehensive collections of tools that cover the diverse layers of RNA-related research. However, increasing amounts of data from diverse types of experiments, addressing different aspects of biological questions, need to be consolidated and integrated into a single framework. Only then is it possible to connect findings from, e.g., RNA-Seq experiments with methods for, e.g., target prediction. To address these needs, we present the RNA Workbench 2.0, an updated online resource for RNA-related analysis. With the RNA Workbench we created a comprehensive set of analysis tools and workflows that enables researchers to analyze their data without the need for sophisticated command-line skills. This update takes the established framework to the next level, providing not only a containerized infrastructure for analysis, but also a ready-to-use platform for hands-on training, analysis, data exploration, and visualization. The new framework is available at https://rna.usegalaxy.eu, and login is free and open to all users. The containerized version can be found at https://github.com/bgruening/galaxy-rna-workbench.
|
48
|
Abstract
Chorismate and isochorismate constitute branch-point intermediates in the biosynthesis of many aromatic metabolites in microorganisms and plants. To obtain unnatural compounds, we modified the route to menaquinone in Escherichia coli. We propose a model for the binding of isochorismate to the active site of MenD ((1R,2S,5S,6S)-2-succinyl-5-enolpyruvyl-6-hydroxycyclohex-3-ene-1-carboxylate (SEPHCHC) synthase) that explains the outcome of the native reaction with α-ketoglutarate. We have rationally designed variants of MenD for the conversion of several isochorismate analogues. The double-variant Asn117Arg-Leu478Thr preferentially converts (5S,6S)-5,6-dihydroxycyclohexa-1,3-diene-1-carboxylate (2,3-trans-CHD), the hydrolysis product of isochorismate, with a >70-fold higher ratio than that for the wild type. The single-variant Arg107Ile uses (5S,6S)-6-amino-5-hydroxycyclohexa-1,3-diene-1-carboxylate (2,3-trans-CHA) as substrate with >6-fold conversion compared to wild-type MenD. The novel compounds have been made accessible in vivo (up to 5.3 g L⁻¹). Unexpectedly, although the identified residues such as Arg107 are highly conserved (>94%), some of the designed variations can be found in wild-type SEPHCHC synthases from other bacteria (Arg107Lys, 0.3%). This raises the question of the possible natural occurrence of as yet unexplored branches of the shikimate pathway.
|
49
|
Biomolecular Reaction and Interaction Dynamics Global Environment (BRIDGE). Bioinformatics 2019; 35:3508-3509. [DOI: 10.1093/bioinformatics/btz107] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/08/2018] [Revised: 01/14/2019] [Accepted: 02/11/2019] [Indexed: 11/12/2022] Open
Abstract
Motivation
The pathway from genomics through proteomics and onto a molecular description of biochemical processes makes the discovery of drugs and biomaterials possible. A research framework common to genomics and proteomics is needed to conduct biomolecular simulations that will connect biological data to the dynamic molecular mechanisms of enzymes and proteins. Novice biomolecular modelers are faced with the daunting task of complex setups and a myriad of possible choices preventing their use of molecular simulations and their ability to conduct reliable and reproducible computations that can be shared with collaborators and verified for procedural accuracy.
Results
We present the foundations of Biomolecular Reaction and Interaction Dynamics Global Environment (BRIDGE) developed on the Galaxy platform that makes possible fundamental molecular dynamics of proteins through workflows and pipelines via commonly used packages, such as NAMD, GROMACS and CHARMM. BRIDGE can be used to set up and simulate biological macromolecules, perform conformational analysis from trajectory data and conduct data analytics of large scale protein motions using statistical rigor. We illustrate the basic BRIDGE simulation and analytics capabilities on a previously reported CBH1 protein simulation.
Availability and implementation
Publicly available at https://github.com/scientificomputing/BRIDGE and https://usegalaxy.eu
Supplementary information
Supplementary data are available at Bioinformatics online.
|
50
|
Pou5f3, SoxB1, and Nanog remodel chromatin on high nucleosome affinity regions at zygotic genome activation. Genome Res 2019; 29:383-395. [PMID: 30674556 PMCID: PMC6396415 DOI: 10.1101/gr.240572.118] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 06/11/2018] [Accepted: 01/16/2019] [Indexed: 12/16/2022]
Abstract
The zebrafish embryo is mostly transcriptionally quiescent during the first 10 cell cycles, until the main wave of zygotic genome activation (ZGA) occurs, accompanied by fast chromatin remodeling. At ZGA, homologs of the mammalian stem cell transcription factors (TFs) Pou5f3, Nanog, and Sox19b bind to thousands of developmental enhancers to initiate transcription. So far, how these TFs influence chromatin dynamics at ZGA has remained unresolved. To address this question, we analyzed nucleosome positions in wild-type and maternal-zygotic (MZ) mutants for pou5f3 and nanog by MNase-seq. We show that Nanog, Sox19b, and Pou5f3 bind to the high nucleosome affinity regions (HNARs). HNARs span over 600 bp and feature high in vivo and predicted in vitro nucleosome occupancy and a high predicted propeller twist DNA shape value. We suggest a two-step nucleosome destabilization-depletion model, in which the same intrinsic DNA properties of HNARs promote both high nucleosome occupancy and differential binding of TFs. In the first step, already before ZGA, Pou5f3 and Nanog destabilize nucleosomes at HNAR centers genome-wide. In the second step, post-ZGA, Nanog, Pou5f3, and SoxB1 act synergistically to maintain an open chromatin state on a subset of HNARs. Nanog binds to the HNAR center, whereas Pou5f3 stabilizes the flanks. The HNAR model will provide a useful tool for genome regulatory studies in a variety of biological systems.
|