1
Sastry AV, Yuan Y, Poudel S, Rychel K, Yoo R, Lamoureux CR, Li G, Burrows JT, Chauhan S, Haiman ZB, Al Bulushi T, Seif Y, Palsson BO, Zielinski DC. iModulonMiner and PyModulon: Software for unsupervised mining of gene expression compendia. PLoS Comput Biol 2024; 20:e1012546. [PMID: 39441835 PMCID: PMC11534266 DOI: 10.1371/journal.pcbi.1012546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 11/04/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024]
Abstract
Public gene expression databases are a rapidly expanding resource of organism responses to diverse perturbations, presenting both an opportunity and a challenge for bioinformatics workflows to extract actionable knowledge of transcription regulatory network function. Here, we introduce a five-step computational pipeline, called iModulonMiner, to compile, process, curate, analyze, and characterize the totality of RNA-seq data for a given organism or cell type. This workflow is centered around the data-driven computation of co-regulated gene sets using Independent Component Analysis, called iModulons, which have been shown to have broad applications. As a demonstration, we applied this workflow to generate the iModulon structure of Bacillus subtilis using all high-quality, publicly-available RNA-seq data. Using this structure, we predicted regulatory interactions for multiple transcription factors, identified groups of co-expressed genes that are putatively regulated by undiscovered transcription factors, and predicted properties of a recently discovered single-subunit phage RNA polymerase. We also present a Python package, PyModulon, with functions to characterize, visualize, and explore computed iModulons. The pipeline, available at https://github.com/SBRG/iModulonMiner, can be readily applied to diverse organisms to gain a rapid understanding of their transcriptional regulatory network structure and condition-specific activity.
Affiliation(s)
- Anand V. Sastry
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Yuan Yuan
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Saugat Poudel
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Kevin Rychel
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Reo Yoo
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Cameron R. Lamoureux
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Gaoyuan Li
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Joshua T. Burrows
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Siddharth Chauhan
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Zachary B. Haiman
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Tahani Al Bulushi
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Yara Seif
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Bernhard O. Palsson
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, California, United States of America
  - Department of Pediatrics, University of California, San Diego, La Jolla, California, United States of America
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens, Lyngby, Denmark
- Daniel C. Zielinski
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
2
Zhang Y, Bharadhwaj VS, Kodamullil AT, Herrmann C. A network of transcriptomic signatures identifies novel comorbidity mechanisms between schizophrenia and somatic disorders. DISCOVER MENTAL HEALTH 2024; 4:11. [PMID: 38573526 PMCID: PMC10994898 DOI: 10.1007/s44192-024-00063-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 03/28/2024] [Indexed: 04/05/2024]
Abstract
The clinical burden of mental illness, in particular schizophrenia and bipolar disorder, is driven by frequent chronic courses and increased mortality, as well as the risk for comorbid conditions such as cardiovascular disease and type 2 diabetes. Evidence suggests an overlap of molecular pathways between psychotic disorders and somatic comorbidities. In this study, we developed a computational framework to perform comorbidity modeling via an improved integrative unsupervised machine learning approach based on multi-rank non-negative matrix factorization (mrNMF). Using this procedure, we extracted molecular signatures potentially explaining shared comorbidity mechanisms. For this, 27 case-control microarray transcriptomic datasets across multiple tissues were collected, covering three main categories of conditions: psychotic disorders, cardiovascular diseases and type 2 diabetes. We addressed the parameter-selection limitation of standard NMF by introducing multi-rank ensembled NMF to identify signatures at various hierarchical levels simultaneously. Analysis of comorbidity signature pairs identified several potential mechanisms involving activation of the inflammatory response, which additionally interconnects angiogenesis, oxidative response and GABAergic neuro-action. Overall, we proposed a general cross-cohort computational workflow for investigating comorbid patterns across multiple symptoms, applied it to a real-data comorbidity study on schizophrenia, and further discussed the potential for future application of the approach.
Affiliation(s)
- Youcheng Zhang
  - Institute of Pharmacy and Molecular Biotechnology (IPMB) & BioQuant, Universität Heidelberg, 69120, Heidelberg, Germany
- Vinay S Bharadhwaj
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53757, Sankt Augustin, Germany
- Alpha T Kodamullil
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53757, Sankt Augustin, Germany
- Carl Herrmann
  - Institute of Pharmacy and Molecular Biotechnology (IPMB) & BioQuant, Universität Heidelberg, 69120, Heidelberg, Germany
3
Fouché A, Chadoutaud L, Delattre O, Zinovyev A. Transmorph: a unifying computational framework for modular single-cell RNA-seq data integration. NAR Genom Bioinform 2023; 5:lqad069. [PMID: 37448589 PMCID: PMC10336778 DOI: 10.1093/nargab/lqad069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Revised: 06/02/2023] [Accepted: 07/10/2023] [Indexed: 07/15/2023]
Abstract
Data integration of single-cell RNA-seq (scRNA-seq) data describes the task of embedding datasets gathered from different sources or experiments into a common representation so that cells with similar types or states are embedded close to one another independently from their dataset of origin. Data integration is a crucial step in most scRNA-seq data analysis pipelines involving multiple batches. It improves data visualization, batch effect reduction, clustering, label transfer, and cell type inference. Many data integration tools have been proposed during the last decade, but a surge in the number of these methods has made it difficult to pick one for a given use case. Furthermore, these tools are provided as rigid pieces of software, making it hard to adapt them to various specific scenarios. In order to address both of these issues at once, we introduce the transmorph framework. It allows the user to engineer powerful data integration pipelines and is supported by a rich software ecosystem. We demonstrate the usefulness of transmorph by solving a variety of practical challenges on scRNA-seq datasets, including joint dataset embedding, gene space integration, and transfer of cycle phase annotations. transmorph is provided as an open-source Python package.
Affiliation(s)
- Aziz Fouché
  - To whom correspondence should be addressed. Tel: +33 156246989
- Loïc Chadoutaud
  - Institut Curie, PSL Research University, 75005 Paris, France
  - INSERM, 75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75005 Paris, France
- Olivier Delattre
  - INSERM U830, Equipe Labellisée LNCC, SIREDO Oncology Centre, Institut Curie, 75005 Paris, France
- Andrei Zinovyev
  - Correspondence may also be addressed to Andrei Zinovyev. Tel: +33 156246989
4
Anglada-Girotto M, Miravet-Verde S, Serrano L, Head SA. robustica: customizable robust independent component analysis. BMC Bioinformatics 2022; 23:519. [PMID: 36471244 PMCID: PMC9721028 DOI: 10.1186/s12859-022-05043-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 11/08/2022] [Indexed: 12/12/2022]
Abstract
BACKGROUND: Independent Component Analysis (ICA) allows the dissection of omic datasets into modules that help to interpret global molecular signatures. The inherent randomness of this algorithm can be overcome by clustering many iterations of ICA together to obtain robust components. Existing algorithms for robust ICA are dependent on the choice of clustering method and on computing a potentially biased and large Pearson distance matrix.
RESULTS: We present robustica, a Python-based package to compute robust independent components with a fully customizable clustering algorithm and distance metric. Here, we exploited its customizability to revisit and optimize robust ICA systematically. Of the 6 popular clustering algorithms considered, DBSCAN performed the best at clustering independent components across ICA iterations. To enable using Euclidean distances, we created a subroutine that infers and corrects the components' signs across ICA iterations. Our subroutine increased the resolution, robustness, and computational efficiency of the algorithm. Finally, we show the applicability of robustica by dissecting over 500 tumor samples from low-grade glioma (LGG) patients, where we define two new gene expression modules with key modulators of tumor progression upon IDH1 and TP53 mutagenesis.
CONCLUSION: robustica brings precise, efficient, and customizable robust ICA into the Python toolbox. Through its customizability, we explored how different clustering algorithms and distance metrics can further optimize robust ICA. Then, we showcased how robustica can be used to discover gene modules associated with combinations of features of biological interest. Taken together, given the broad applicability of ICA for omic data analysis, we envision robustica will facilitate the seamless computation and integration of robust independent components in large pipelines.
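The sign correction mentioned in this abstract matters because ICA components are only defined up to sign: the same component may come out as v in one run and as -v in another, which makes the two copies look distant under a Euclidean metric. The sketch below illustrates the idea with one simple heuristic, assumed here purely for illustration and not taken from robustica itself: flip each component so that its largest-magnitude weight is positive. The function name `align_sign` and the toy vectors are hypothetical.

```python
import math


def align_sign(component):
    """Return a copy flipped so that the largest-magnitude weight is positive.

    Illustrative heuristic only; robustica's own subroutine may differ.
    """
    extreme = max(component, key=abs)
    return [-w for w in component] if extreme < 0 else list(component)


# The same hypothetical component as recovered by two ICA runs, with opposite signs.
run_1 = [0.1, -2.0, 0.3]
run_2 = [-0.1, 2.0, -0.3]

# Before alignment the two copies look far apart in Euclidean space.
distance_raw = math.dist(run_1, run_2)

# After alignment they coincide, so a Euclidean-based clustering can group them.
distance_aligned = math.dist(align_sign(run_1), align_sign(run_2))  # 0.0
```

With the signs aligned, components from many runs could then be passed to a clustering algorithm of choice (the abstract reports DBSCAN performing best among those tested).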
Affiliation(s)
- Miquel Anglada-Girotto
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Samuel Miravet-Verde
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Luis Serrano
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - ICREA, Pg. Lluís Companys 23, 08010, Barcelona, Spain
- Sarah A Head
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
5
Xu Z, Escalera S, Pavão A, Richard M, Tu WW, Yao Q, Zhao H, Guyon I. Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. PATTERNS 2022; 3:100543. [PMID: 35845844 PMCID: PMC9278500 DOI: 10.1016/j.patter.2022.100543] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/25/2022] [Revised: 03/21/2022] [Accepted: 06/03/2022] [Indexed: 11/29/2022]
Abstract
Obtaining a standardized benchmark of computational methods is a major issue in data-science communities. Dedicated frameworks enabling fair benchmarking in a unified environment are yet to be developed. Here, we introduce Codabench, a meta-benchmark platform that is open sourced and community driven for benchmarking algorithms or software agents versus datasets or tasks. A public instance of Codabench is open to everyone free of charge and allows benchmark organizers to fairly compare submissions under the same setting (software, hardware, data, algorithms), with custom protocols and data formats. Codabench has unique features facilitating easy organization of flexible and reproducible benchmarks, such as the possibility of reusing templates of benchmarks and supplying compute resources on demand. Codabench has been used internally and externally on various applications, attracting more than 130 users and 2,500 submissions. As illustrative use cases, we introduce four diverse benchmarks covering graph machine learning, cancer heterogeneity, clinical diagnosis, and reinforcement learning.
- Codabench facilitates flexible, easy, and reproducible benchmarking
- Organizers can customize benchmark design and submission format
- Organizers may host their own platform instance or use the public instance
- Four use cases in diverse domains are introduced to demonstrate the key features
In almost all communities working on data science, researchers face increasingly severe issues of reproducibility and fair comparison. Researchers work with their own versions of the hardware/software environment, code, and data, and consequently published results are hardly comparable. We introduce Codabench, a meta-benchmark platform that is capable of flexible and easy benchmarking and supports reproducibility. Codabench is an important step toward benchmarking and reproducible research. It has been used in various communities including graph machine learning, cancer heterogeneity, clinical diagnosis, and reinforcement learning. Codabench is ready to support emerging research areas, e.g., artificial intelligence (AI) for science and data-centric AI.
Affiliation(s)
- Zhen Xu
  - 4Paradigm, Beijing 100085, China
  - Corresponding author
- Sergio Escalera
  - Computer Vision Center, Universitat de Barcelona, 08007 Barcelona, Spain
- Adrien Pavão
  - LISN/CNRS/INRIA, University Paris-Saclay, 91190 Gif-sur-Yvette, France
- Magali Richard
  - University Grenoble Alpes, CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, 38000 Grenoble, France
- Isabelle Guyon
  - LISN/CNRS/INRIA, University Paris-Saclay, 91190 Gif-sur-Yvette, France
  - ChaLearn, Berkeley, CA, USA
  - Corresponding author
6
Captier N, Merlevede J, Molkenov A, Seisenova A, Zhubanchaliyev A, Nazarov PV, Barillot E, Kairov U, Zinovyev A. BIODICA: a computational environment for Independent Component Analysis of omics data. Bioinformatics 2022; 38:2963-2964. [PMID: 35561190 DOI: 10.1093/bioinformatics/btac204] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/03/2022] [Revised: 03/29/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022]
Abstract
SUMMARY: We developed BIODICA, an integrated computational environment for application of independent component analysis (ICA) to bulk and single-cell molecular profiles, interpretation of the results in terms of biological functions and correlation with metadata. The computational core is the novel Python package stabilized-ica, which provides an interface to several ICA algorithms, a stabilization procedure, meta-analysis and component interpretation tools. BIODICA is equipped with a user-friendly graphical user interface, allowing inexperienced users to perform ICA-based omics data analysis. The results are provided in interactive ways, thus facilitating communication with biology experts.
AVAILABILITY AND IMPLEMENTATION: BIODICA is implemented in Java, Python and JavaScript. The source code is freely available on GitHub under the MIT and the GNU LGPL licenses. BIODICA is supported on all major operating systems. URL: https://sysbio-curie.github.io/biodica-environment/.
Affiliation(s)
- Nicolas Captier
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
  - Laboratoire d'Imagerie Translationnelle en Oncologie, Institut Curie, INSERM U1288, PSL Research University, 91400 Orsay, France
- Jane Merlevede
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
- Askhat Molkenov
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Ainur Seisenova
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Altynbek Zhubanchaliyev
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Petr V Nazarov
  - Multiomics Data Science Research Group, Department of Cancer Research & Bioinformatics Platform, Luxembourg Institute of Health, L-1445 Strassen, Luxembourg
- Emmanuel Barillot
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
- Ulykbek Kairov
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Andrei Zinovyev
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
7
Ashenova A, Daniyarov A, Molkenov A, Sharip A, Zinovyev A, Kairov U. Meta-Analysis of Esophageal Cancer Transcriptomes Using Independent Component Analysis. Front Genet 2021; 12:683632. [PMID: 34795689 PMCID: PMC8594933 DOI: 10.3389/fgene.2021.683632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2021] [Accepted: 10/05/2021] [Indexed: 11/17/2022]
Abstract
Independent Component Analysis (ICA) is a matrix factorization method for data dimension reduction. ICA has been widely applied for the analysis of transcriptomic data for blind separation of biological, environmental, and technical factors affecting gene expression. The study aimed to analyze the publicly available esophageal cancer (EC) data using ICA for identification and comprehensive analysis of reproducible signaling pathways and molecular signatures involved in this cancer type. In this study, four independent esophageal cancer transcriptomic datasets from the GEO database were used. The bioinformatics tool "BiODICA-Independent Component Analysis of Big Omics Data" was applied to compute independent components (ICs). Gene Set Enrichment Analysis (GSEA) and ToppGene uncovered the most significantly enriched pathways. Construction and visualization of gene networks and graphs were performed using Cytoscape and the HPRD database. The correlation graph between decompositions into 30 ICs was built with absolute correlation values exceeding 0.3. Clusters of components (pseudocliques) were observed in the structure of the correlation graph. The top 1,000 most contributing genes of each IC in the pseudocliques were mapped to the PPI network to construct associated signaling pathways. Some cliques were composed of densely interconnected nodes and included components common to most cancer types (such as cell cycle and extracellular matrix signals), while others were specific to EC. The results of this investigation may reveal potential biomarkers of esophageal carcinogenesis and functional subsystems dysregulated in the tumor cells, and may be helpful in predicting the early development of a tumor.
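The correlation graph described in this abstract can be sketched in a few lines: correlate every component of one decomposition with every component of another, and keep an edge wherever the absolute correlation exceeds 0.3. The example below is a minimal pure-Python illustration under assumptions of my own: components are given as gene-weight vectors over the same ordered set of shared genes, Pearson correlation is used, and the names `pearson`, `correlation_edges`, and the toy decompositions are hypothetical. It is not the BiODICA implementation.

```python
from itertools import product
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_edges(comps_a, comps_b, threshold=0.3):
    """Return (name_a, name_b, r) for every component pair with |r| > threshold."""
    edges = []
    for (name_a, a), (name_b, b) in product(comps_a.items(), comps_b.items()):
        r = pearson(a, b)
        if abs(r) > threshold:
            edges.append((name_a, name_b, r))
    return edges


# Toy gene-weight vectors over four shared genes (hypothetical data).
decomposition_a = {"A_IC1": [1.0, 2.0, 3.0, 4.0], "A_IC2": [1.0, -1.0, -1.0, 1.0]}
decomposition_b = {"B_IC1": [-2.0, -4.0, -6.0, -8.0], "B_IC2": [0.5, -0.5, -0.5, 0.5]}

edges = correlation_edges(decomposition_a, decomposition_b)
for name_a, name_b, r in edges:
    print(f"{name_a} -- {name_b}: r = {r:+.2f}")
```

Taking the absolute value keeps anti-correlated pairs, which is useful because the sign of an independent component is arbitrary.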
Affiliation(s)
- Ainur Ashenova
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
  - Department of Biology, School of Sciences and Humanities, Nazarbayev University, Nur-Sultan, Kazakhstan
- Asset Daniyarov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Aigul Sharip
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Andrei Zinovyev
  - Institut Curie, PSL Research University, INSERM U900, Paris, France
  - Laboratory of Advanced Methods for High-dimensional Data Analysis, Lobachevsky University, Nizhny Novgorod, Russia
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
8
Sastry AV, Hu A, Heckmann D, Poudel S, Kavvas E, Palsson BO. Independent component analysis recovers consistent regulatory signals from disparate datasets. PLoS Comput Biol 2021; 17:e1008647. [PMID: 33529205 PMCID: PMC7888660 DOI: 10.1371/journal.pcbi.1008647] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/10/2020] [Revised: 02/17/2021] [Accepted: 12/18/2020] [Indexed: 01/03/2023]
Abstract
The availability of bacterial transcriptomes has dramatically increased in recent years. This data deluge could result in detailed inference of underlying regulatory networks, but the diversity of experimental platforms and protocols introduces critical biases that could hinder scalable analysis of existing data. Here, we show that the underlying structure of the E. coli transcriptome, as determined by Independent Component Analysis (ICA), is conserved across multiple independent datasets, including both RNA-seq and microarray datasets. We subsequently combined five transcriptomics datasets into a large compendium containing over 800 expression profiles and discovered that its underlying ICA-based structure was still comparable to that of the individual datasets. With this understanding, we expanded our analysis to over 3,000 E. coli expression profiles and predicted three high-impact regulons that respond to oxidative stress, anaerobiosis, and antibiotic treatment. ICA thus enables deep analysis of disparate data to uncover new insights that were not visible in the individual datasets.
Cells adapt to diverse environments by regulating gene expression. Genome-wide measurements of gene expression levels have exponentially increased in recent years, but successful integration and analysis of these datasets are limited. Recently, we showed that independent component analysis (ICA), a signal deconvolution algorithm, can separate a large bacterial gene expression dataset into groups of co-regulated genes. This previous study focused on data generated by a standardized pipeline and did not address whether ICA extracts the same quantitative co-expression signals across expression profiling platforms. In this study, we show that ICA finds similar co-regulation patterns underlying multiple gene expression datasets and can be used as a tool to integrate and interpret diverse datasets. Using a dataset containing over 3,000 expression profiles, we predicted three new regulons and characterized their activities. Since large, standardized expression datasets only exist for a few bacterial strains, these results broaden the possible applications of this tool to better understand transcriptional regulation across a wide range of microbes.
Affiliation(s)
- Anand V. Sastry
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Alyssa Hu
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- David Heckmann
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Saugat Poudel
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Erol Kavvas
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Bernhard O. Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, Denmark
9
Rychel K, Decker K, Sastry AV, Phaneuf PV, Poudel S, Palsson BO. iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning. Nucleic Acids Res 2021; 49:D112-D120. [PMID: 33045728 PMCID: PMC7778901 DOI: 10.1093/nar/gkaa810] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 08/14/2020] [Revised: 09/10/2020] [Accepted: 09/15/2020] [Indexed: 12/15/2022]
Abstract
Independent component analysis (ICA) of bacterial transcriptomes has emerged as a powerful tool for obtaining co-regulated, independently-modulated gene sets (iModulons), inferring their activities across a range of conditions, and enabling their association to known genetic regulators. By grouping and analyzing genes based on observations from big data alone, iModulons can provide a novel perspective into how the composition of the transcriptome adapts to environmental conditions. Here, we present iModulonDB (imodulondb.org), a knowledgebase of prokaryotic transcriptional regulation computed from high-quality transcriptomic datasets using ICA. Users select an organism from the home page and then search or browse the curated iModulons that make up its transcriptome. Each iModulon and gene has its own interactive dashboard, featuring plots and tables with clickable, hoverable, and downloadable features. This site enhances research by presenting scientists of all backgrounds with co-expressed gene sets and their activity levels, which lead to improved understanding of regulator-gene relationships, discovery of transcription factors, and the elucidation of unexpected relationships between conditions and genetic regulatory activity. The current release of iModulonDB covers three organisms (Escherichia coli, Staphylococcus aureus and Bacillus subtilis) with 204 iModulons, and can be expanded to cover many additional organisms.
Affiliation(s)
- Kevin Rychel
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Katherine Decker
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Anand V Sastry
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Patrick V Phaneuf
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA 92093, USA
- Saugat Poudel
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Bernhard O Palsson
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
10
Machine learning uncovers independently regulated modules in the Bacillus subtilis transcriptome. Nat Commun 2020; 11:6338. [PMID: 33311500 PMCID: PMC7732839 DOI: 10.1038/s41467-020-20153-9] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 04/28/2020] [Accepted: 10/29/2020] [Indexed: 12/24/2022]
Abstract
The transcriptional regulatory network (TRN) of Bacillus subtilis coordinates cellular functions of fundamental interest, including metabolism, biofilm formation, and sporulation. Here, we use unsupervised machine learning to modularize the transcriptome and quantitatively describe regulatory activity under diverse conditions, creating an unbiased summary of gene expression. We obtain 83 independently modulated gene sets that explain most of the variance in expression and demonstrate that 76% of them represent the effects of known regulators. The TRN structure and its condition-dependent activity uncover putative or recently discovered roles for at least five regulons, such as a relationship between histidine utilization and quorum sensing. The TRN also facilitates quantification of population-level sporulation states. As this TRN covers the majority of the transcriptome and concisely characterizes the global expression state, it could inform research on nearly every aspect of transcriptional regulation in B. subtilis. The systems-level regulatory structure underlying gene expression in bacteria can be inferred using machine learning algorithms. Here we show this structure for Bacillus subtilis, present five hypotheses gleaned from it, and analyse the process of sporulation from its perspective.
11
Transcriptional Programs Define Intratumoral Heterogeneity of Ewing Sarcoma at Single-Cell Resolution. Cell Rep 2020; 30:1767-1779.e6. [DOI: 10.1016/j.celrep.2020.01.049] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 04/19/2019] [Revised: 10/07/2019] [Accepted: 01/15/2020] [Indexed: 12/16/2022]
12
Peng L, Liu F, Yang J, Liu X, Meng Y, Deng X, Peng C, Tian G, Zhou L. Probing lncRNA-Protein Interactions: Data Repositories, Models, and Algorithms. Front Genet 2020; 10:1346. [PMID: 32082358 PMCID: PMC7005249 DOI: 10.3389/fgene.2019.01346] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 09/27/2019] [Accepted: 12/09/2019] [Indexed: 12/31/2022]
Abstract
Identifying lncRNA-protein interactions (LPIs) is vital to understanding various key biological processes. Wet-lab experiments have identified a few LPIs, but experimental methods are costly and time-consuming. Therefore, computational methods are increasingly exploited to capture LPI candidates. We introduced relevant data repositories and focused on two types of LPI prediction models: network-based methods and machine learning-based methods. Machine learning-based methods comprise matrix factorization-based techniques and ensemble learning-based techniques. To assess the performance of computational methods, we compared a subset of LPI prediction models under Leave-One-Out cross-validation (LOOCV) and fivefold cross-validation. The results show that SFPEL-LPI achieved the best AUC. Although computational models have efficiently unraveled some LPI candidates, many limitations remain. We discussed future directions to further boost LPI predictive performance.
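The evaluation protocol named in this abstract (LOOCV and fivefold cross-validation, scored by AUC) can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration: the synthetic one-dimensional data, the toy nearest-centroid scorer, and the function names (`auc`, `k_fold_indices`, `cross_validated_auc`) are hypothetical and unrelated to SFPEL-LPI or any model compared in the review.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum identity (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(xs, ys):
    """Toy model (assumption): the mean feature value of each class."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def score(model, x):
    """Higher when x is closer to the positive centroid than to the negative one."""
    pos_c, neg_c = model
    return (x - neg_c) ** 2 - (x - pos_c) ** 2


def cross_validated_auc(xs, ys, k):
    """Score every sample with a model trained without its fold, then pool for AUC."""
    scores = [0.0] * len(xs)
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_centroids(train_x, train_y)
        for i in fold:
            scores[i] = score(model, xs[i])
    return auc(ys, scores)


# Synthetic data (assumption): 40 positives around 2.0, 40 negatives around 0.0.
rng = random.Random(1)
ys = [1] * 40 + [0] * 40
xs = [rng.gauss(2.0, 1.0) for _ in range(40)] + [rng.gauss(0.0, 1.0) for _ in range(40)]

auc_5fold = cross_validated_auc(xs, ys, k=5)        # fivefold cross-validation
auc_loocv = cross_validated_auc(xs, ys, k=len(xs))  # LOOCV is k-fold with k = n
```

In this sketch LOOCV is simply the limiting case of k-fold with one sample per fold; held-out scores are pooled across folds before a single AUC is computed, which is one common convention rather than the only one.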
Affiliation(s)
- Lihong Peng
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Fuxing Liu
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Jialiang Yang
  - Department of Sciences, Genesis (Beijing) Co. Ltd., Beijing, China
- Xiaojun Liu
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Yajie Meng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiaojun Deng
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Cheng Peng
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Geng Tian
  - Department of Sciences, Genesis (Beijing) Co. Ltd., Beijing, China
- Liqian Zhou
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
|
13
|
Nazarov PV, Wienecke-Baldacchino AK, Zinovyev A, Czerwińska U, Muller A, Nashan D, Dittmar G, Azuaje F, Kreis S. Deconvolution of transcriptomes and miRNomes by independent component analysis provides insights into biological processes and clinical outcomes of melanoma patients. BMC Med Genomics 2019; 12:132. [PMID: 31533822 PMCID: PMC6751789 DOI: 10.1186/s12920-019-0578-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 03/20/2019] [Accepted: 09/05/2019] [Indexed: 01/21/2023]
Abstract
BACKGROUND The amount of publicly available cancer-related "omics" data is constantly growing and can potentially be used to gain insights into the tumour biology of new cancer patients, their diagnosis and suitable treatment options. However, the integration of different datasets is not straightforward and requires specialized approaches to deal with heterogeneity at technical and biological levels. METHODS Here we present a method that can overcome technical biases, predict clinically relevant outcomes and identify tumour-related biological processes in patients using previously collected large discovery datasets. The approach is based on independent component analysis (ICA), an unsupervised method of signal deconvolution. We developed parallel consensus ICA that robustly decomposes transcriptomics datasets into expression profiles with minimal mutual dependency. RESULTS By applying the method to a small cohort of primary melanoma and control samples combined with a large discovery melanoma dataset, we demonstrate that our method distinguishes cell-type specific signals from technical biases and allows prediction of clinically relevant patient characteristics. We showed the potential of the method to predict cancer subtypes and estimate the activity of key tumour-related processes such as immune response, angiogenesis and cell proliferation. An ICA-based risk score was proposed and its connection to patient survival was validated with an independent cohort of patients. Additionally, through integration of components identified for mRNA and miRNA data, the proposed method helped deduce biological functions of miRNAs, which would otherwise not be possible. CONCLUSIONS We present a method that can be used to map new transcriptomic data from cancer patient samples onto large discovery datasets. The method corrects technical biases, helps characterize the activity of biological processes or cell types in the new samples and provides a prognosis of patient survival.
Affiliation(s)
- Petr V. Nazarov
  - Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
- Anke K. Wienecke-Baldacchino
  - Life Sciences Research Unit (LSRU), University of Luxembourg, L-4367 Belvaux, Luxembourg
  - Epidemiology and Microbial Genomics Unit, Department of Microbiology, Laboratoire National de Santé, Dudelange, Luxembourg
- Andrei Zinovyev
  - INSERM, U900, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, F-75006 Paris, France
- Urszula Czerwińska
  - INSERM, U900, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, F-75006 Paris, France
  - Centre de Recherches Interdisciplinaires, Université Paris Descartes, Paris, France
- Arnaud Muller
  - Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
- Gunnar Dittmar
  - Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
- Francisco Azuaje
  - Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
- Stephanie Kreis
  - Life Sciences Research Unit (LSRU), University of Luxembourg, L-4367 Belvaux, Luxembourg
|
14
|
Sompairac N, Nazarov PV, Czerwinska U, Cantini L, Biton A, Molkenov A, Zhumadilov Z, Barillot E, Radvanyi F, Gorban A, Kairov U, Zinovyev A. Independent Component Analysis for Unraveling the Complexity of Cancer Omics Datasets. Int J Mol Sci 2019; 20:E4414. [PMID: 31500324 PMCID: PMC6771121 DOI: 10.3390/ijms20184414] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 08/03/2019] [Revised: 09/02/2019] [Accepted: 09/04/2019] [Indexed: 12/13/2022]
Abstract
Independent component analysis (ICA) is a matrix factorization approach where the signals captured by each individual matrix factor are optimized to become as mutually independent as possible. Initially suggested for solving blind source separation problems in various fields, ICA was shown to be successful in analyzing functional magnetic resonance imaging (fMRI) and other types of biomedical data. In the last twenty years, ICA has become part of the standard machine learning toolbox, together with other matrix factorization methods such as principal component analysis (PCA) and non-negative matrix factorization (NMF). Here, we review a number of recent works where ICA was shown to be a useful tool for unraveling the complexity of cancer biology from the analysis of different types of omics data, mainly collected for tumor samples. Such works highlight the use of ICA in dimensionality reduction, deconvolution, data pre-processing, meta-analysis, and others applied to different data types (transcriptome, methylome, proteome, single-cell data). We particularly focus on the technical aspects of ICA application in omics studies such as using different protocols, determining the optimal number of components, assessing and improving reproducibility of the ICA results, and comparison with other popular matrix factorization techniques. We discuss the emerging ICA applications to the integrative analysis of multi-level omics datasets and introduce a conceptual view on ICA as a tool for defining functional subsystems of a complex biological system and their interactions under various conditions. Our review is accompanied by a Jupyter notebook which illustrates the discussed concepts and provides a practical tool for applying ICA to the analysis of cancer omics datasets.
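The core idea in this abstract, recovering signals that are as mutually independent as possible from their mixtures, can be sketched for the smallest possible case of two signals. The sketch below is a toy under stated assumptions, not the notebook that accompanies the review and not a production algorithm such as FastICA: it whitens the two mixed signals and then grid-searches the rotation that maximizes non-Gaussianity, measured here as the sum of squared excess kurtoses. The synthetic sources, the mixing weights, and all function names are hypothetical.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def whiten(x1, x2):
    """Center two signals, then decorrelate them and scale each to unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    a = sum(v * v for v in x1) / n
    c = sum(v * v for v in x2) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    # Closed-form principal-axis angle of the 2x2 covariance [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    p1 = [cs * u + sn * v for u, v in zip(x1, x2)]
    p2 = [-sn * u + cs * v for u, v in zip(x1, x2)]
    s1 = math.sqrt(sum(v * v for v in p1) / n)
    s2 = math.sqrt(sum(v * v for v in p2) / n)
    return [v / s1 for v in p1], [v / s2 for v in p2]


def excess_kurtosis(y):
    """Excess kurtosis of a zero-mean, unit-variance signal (0 for a Gaussian)."""
    return sum(v ** 4 for v in y) / len(y) - 3.0


def rotate(z1, z2, phi):
    """Apply a planar rotation by angle phi to a pair of signals."""
    cs, sn = math.cos(phi), math.sin(phi)
    y1 = [cs * u + sn * v for u, v in zip(z1, z2)]
    y2 = [-sn * u + cs * v for u, v in zip(z1, z2)]
    return y1, y2


def ica_two_signals(x1, x2, steps=90):
    """Whiten, then pick the rotation in [0, pi/2) with maximal non-Gaussianity."""
    z1, z2 = whiten(x1, x2)
    best_phi, best_score = 0.0, -1.0
    for k in range(steps):
        phi = k * (math.pi / 2) / steps
        y1, y2 = rotate(z1, z2, phi)
        contrast = excess_kurtosis(y1) ** 2 + excess_kurtosis(y2) ** 2
        if contrast > best_score:
            best_phi, best_score = phi, contrast
    return rotate(z1, z2, best_phi)


# Synthetic sources (assumption): one uniform, one Laplace-like, linearly mixed.
rng = random.Random(0)
n = 2000
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
s2 = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(n)]
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 1.0 * b for a, b in zip(s1, s2)]

y1, y2 = ica_two_signals(x1, x2)
# Components come back in arbitrary order and sign, so match by absolute correlation.
match_s1 = max(abs(pearson(s1, y1)), abs(pearson(s1, y2)))
match_s2 = max(abs(pearson(s2, y1)), abs(pearson(s2, y2)))
```

Whitening alone only removes correlation; the rotation step is what separates ICA from PCA in this sketch, since PCA would stop at the decorrelated axes. The recovered components are defined only up to order and sign, which is why the comparison with the true sources uses absolute correlations.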
Affiliation(s)
- Nicolas Sompairac
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
  - Centre de Recherches Interdisciplinaires, Université Paris Descartes, 75004 Paris, France.
- Petr V Nazarov
  - Multiomics Data Science Research Group, Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg.
- Urszula Czerwinska
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
- Laura Cantini
  - Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, CNRS UMR8197, INSERM U1024, Ecole Normale Supérieure, PSL Research University, 75005 Paris, France.
- Anne Biton
  - Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI, USR 3756 Institut Pasteur et CNRS), 75015 Paris, France.
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
- Zhaxybay Zhumadilov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
  - University Medical Center, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
- Emmanuel Barillot
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
- Francois Radvanyi
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - CNRS, UMR 144, 75248 Paris, France.
- Alexander Gorban
  - Center for Mathematical Modeling, University of Leicester, Leicester LE1 7RH, UK.
  - Lobachevsky University, 603022 Nizhny Novgorod, Russia.
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
- Andrei Zinovyev
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
|
15
|
Molecular Inverse Comorbidity between Alzheimer's Disease and Lung Cancer: New Insights from Matrix Factorization. Int J Mol Sci 2019; 20:ijms20133114. [PMID: 31247897 PMCID: PMC6650839 DOI: 10.3390/ijms20133114] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/20/2019] [Revised: 06/13/2019] [Accepted: 06/18/2019] [Indexed: 12/23/2022]
Abstract
Matrix factorization (MF) is an established paradigm for large-scale biological data analysis with tremendous potential in computational biology. Here, we challenge MF in depicting the molecular bases of epidemiologically described disease–disease (DD) relationships. As a use case, we focus on the inverse comorbidity association between Alzheimer’s disease (AD) and lung cancer (LC), described as a lower than expected probability of developing LC in AD patients. To this day, the molecular mechanisms underlying DD relationships remain poorly explained and their better characterization might offer unprecedented clinical opportunities. To this end, we extend our previously designed MF-based framework for the molecular characterization of DD relationships. Considering AD–LC inverse comorbidity as a case study, we highlight multiple molecular mechanisms, among which we confirm the involvement of processes related to the immune system and mitochondrial metabolism. We then distinguish mechanisms specific to LC from those shared with other cancers through a pan-cancer analysis. Additionally, new candidate molecular players, such as estrogen receptor (ER), cadherin 1 (CDH1) and histone deacetylase (HDAC), are pinpointed as factors that might underlie the inverse relationship, opening the way to new investigations. Finally, some lung cancer subtype-specific factors are also detected, suggesting the existence of heterogeneity across patients in the context of inverse comorbidity.
|