1
Karkar S, Sharma A, Herrmann C, Blum Y, Richard M. DECOMICS, a shiny application for unsupervised cell type deconvolution and biological interpretation of bulk omic data. Bioinformatics Advances 2024; 4:vbae136. [PMID: 39411450 PMCID: PMC11479579 DOI: 10.1093/bioadv/vbae136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 07/18/2024] [Accepted: 09/18/2024] [Indexed: 10/19/2024]
Abstract
Summary: Unsupervised deconvolution algorithms are often used to estimate cell composition from bulk tissue samples. However, applying cell-type deconvolution and interpreting the results remain a challenge, even more so without prior training in bioinformatics. Here, we propose a tool for estimating and identifying cell type composition from bulk transcriptomes or methylomes. DECOMICS is a shiny web application dedicated to unsupervised deconvolution approaches of bulk omic data. It provides (i) a variety of existing algorithms to perform deconvolution on the gene expression or methylation-level matrix, (ii) an enrichment analysis module to aid biological interpretation of the deconvolved components, and (iii) some visualization tools. Input data can be downloaded in CSV format and preprocessed in the web application (normalization, transformation, and feature selection). The results of the deconvolution, enrichment, and visualization processes can be downloaded. Availability and implementation: DECOMICS is an R-shiny web application that can be launched (i) directly from a local R session using the R package available here: https://gitlab.in2p3.fr/Magali.Richard/decomics (either by installing it locally or via a virtual machine and a Docker image that we provide); or (ii) in the Biosphere-IFB Clouds Federation for Life Science, a multi-cloud environment scalable for high-performance computing: https://biosphere.france-bioinformatique.fr/catalogue/appliance/193/.
Affiliation(s)
- Slim Karkar
  - IBGC, UMR 5095, University of Bordeaux, CNRS, Bordeaux Bioinformatic Center, Bordeaux 33077, France
- Ashwini Sharma
  - Health Data Science Unit, Medical Faculty Heidelberg and BioQuant, Heidelberg 69120, Germany
- Carl Herrmann
  - Health Data Science Unit, Medical Faculty Heidelberg and BioQuant, Heidelberg 69120, Germany
- Yuna Blum
  - IGDR (Institut de Genetique et Developpement de Rennes), UMR 6290, ERL U1305, Equipe Labellisée Ligue Nationale contre le Cancer, University of Rennes, CNRS, INSERM, Rennes 35000, France
- Magali Richard
  - TIMC, UMR 5525, Université Grenoble Alpes, CNRS, Grenoble F-38700, France
2
Chepeleva M, Kaoma T, Zinovyev A, Toth R, Nazarov PV. consICA: an R package for robust reference-free deconvolution of multi-omics data. Bioinformatics Advances 2024; 4:vbae102. [PMID: 39027644 PMCID: PMC11257712 DOI: 10.1093/bioadv/vbae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 06/25/2024] [Accepted: 07/12/2024] [Indexed: 07/20/2024]
Abstract
Motivation: Deciphering molecular signals from omics data helps in understanding cellular processes and disease progression. Effective algorithms for extracting these signals are essential, with a strong emphasis on robustness and reproducibility. Results: The R/Bioconductor package consICA implements consensus independent component analysis (ICA), a data-driven deconvolution method to decompose heterogeneous omics data and extract features suitable for patient stratification and multimodal data integration. The method separates biologically relevant molecular signals from technical effects and provides information about the cellular composition and biological processes. Built-in annotation, survival analysis, and report generation provide useful tools for the interpretation of extracted signals. The implementation of parallel computing in the package ensures efficient analysis using modern multicore systems. The package offers a reproducible and efficient data-driven solution for the analysis of complex molecular profiles, with significant implications for cancer research. Availability and implementation: The package is implemented in R and available under the MIT license at Bioconductor (https://bioconductor.org/packages/consICA) or at GitHub (https://github.com/biomod-lih/consICA).
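The "consensus" idea in this abstract can be illustrated with a minimal sketch. ICA components are only defined up to order and sign, so combining several runs requires matching and sign-aligning components before averaging. The Python below is an illustration under that assumption only: it is not the consICA API (the package is written in R), the function names are hypothetical, and the package's actual consensus procedure may differ.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (non-constant inputs assumed)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def consensus_components(runs):
    """Average matched components across runs (hypothetical helper).

    runs: list of runs; each run is a list of components; each component
    is a list of per-feature weights. Components are matched to the first
    run by highest absolute correlation and sign-flipped when the
    correlation is negative, then averaged.
    """
    consensus = []
    for ref in runs[0]:
        total = list(ref)
        for run in runs[1:]:
            best = max(run, key=lambda comp: abs(pearson(ref, comp)))
            sign = 1.0 if pearson(ref, best) >= 0 else -1.0
            total = [t + sign * v for t, v in zip(total, best)]
        consensus.append([t / len(runs) for t in total])
    return consensus

# Toy input: the second run returns the same two components in a
# different order, one of them with flipped sign.
runs = [
    [[1, 0, 0, 2], [0, 1, 3, 0]],
    [[0, -1, -3, 0], [1, 0, 0, 2]],
]
result = consensus_components(runs)
```

Matching against a single reference run keeps the sketch short; a more robust design would cluster components from all runs jointly so that no single run is privileged.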
Affiliation(s)
- Maryna Chepeleva
  - Multiomics Data Science Research Group, Department of Cancer Research, Luxembourg Institute of Health, Strassen L-1445, Luxembourg
  - Faculty of Science, Technology and Medicine, University of Luxembourg, Esch-sur-Alzette L-4365, Luxembourg
- Tony Kaoma
  - Bioinformatics and AI Unit, Department of Medical Informatics, Luxembourg Institute of Health, Strassen L-1445, Luxembourg
- Reka Toth
  - Multiomics Data Science Research Group, Department of Cancer Research, Luxembourg Institute of Health, Strassen L-1445, Luxembourg
  - Bioinformatics and AI Unit, Department of Medical Informatics, Luxembourg Institute of Health, Strassen L-1445, Luxembourg
- Petr V Nazarov
  - Multiomics Data Science Research Group, Department of Cancer Research, Luxembourg Institute of Health, Strassen L-1445, Luxembourg
  - Bioinformatics and AI Unit, Department of Medical Informatics, Luxembourg Institute of Health, Strassen L-1445, Luxembourg
3
Li M, Guo H, Wang K, Kang C, Yin Y, Zhang H. AVBAE-MODFR: A novel deep learning framework of embedding and feature selection on multi-omics data for pan-cancer classification. Comput Biol Med 2024; 177:108614. [PMID: 38796884 DOI: 10.1016/j.compbiomed.2024.108614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 02/27/2024] [Accepted: 05/11/2024] [Indexed: 05/29/2024]
Abstract
Integration analysis of cancer multi-omics data for pan-cancer classification has the potential for clinical applications in various aspects such as tumor diagnosis, analyzing clinically significant features, and providing precision medicine. In these applications, the embedding and feature selection on high-dimensional multi-omics data is clinically necessary. Recently, deep learning algorithms have become the most promising methods for cancer multi-omics integration analysis, owing to their powerful capability of capturing nonlinear relationships. Developing effective deep learning architectures for cancer multi-omics embedding and feature selection remains a challenge for researchers in view of high dimensionality and heterogeneity. In this paper, we propose a novel two-phase deep learning model named AVBAE-MODFR for pan-cancer classification. AVBAE-MODFR achieves embedding by a multi2multi autoencoder based on the adversarial variational Bayes method and further performs feature selection utilizing a dual-net-based feature ranking method. AVBAE-MODFR utilizes AVBAE to pre-train the network parameters, which improves the classification performance and enhances feature ranking stability in MODFR. Firstly, AVBAE learns high-quality representation among multiple omics features for unsupervised pan-cancer classification. We design an efficient discriminator architecture to distinguish the latent distributions for updating forward variational parameters. Secondly, we propose MODFR to simultaneously evaluate multi-omics feature importance for feature selection by training a designed multi2one selector network, where the efficient evaluation approach based on the average gradient of random mask subsets can avoid bias caused by input feature drift. We conduct experiments on the TCGA pan-cancer dataset and compare AVBAE-MODFR with four state-of-the-art (SOTA) methods for each phase. The results show the superiority of AVBAE-MODFR over SOTA methods.
Affiliation(s)
- Minghe Li
  - National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Huike Guo
  - National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Keao Wang
  - National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Chuanze Kang
  - National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Yanbin Yin
  - Department of Food Science and Technology, University of Nebraska - Lincoln, NE, USA
- Han Zhang
  - National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China.
4
Chakraborty S, Sharma G, Karmakar S, Banerjee S. Multi-OMICS approaches in cancer biology: New era in cancer therapy. Biochim Biophys Acta Mol Basis Dis 2024; 1870:167120. [PMID: 38484941 DOI: 10.1016/j.bbadis.2024.167120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 03/06/2024] [Accepted: 03/06/2024] [Indexed: 04/01/2024]
Abstract
Innovative multi-omics frameworks integrate diverse datasets from the same patients to enhance our understanding of the molecular and clinical aspects of cancers. Advanced omics and multi-view clustering algorithms present unprecedented opportunities for classifying cancers into subtypes, refining survival predictions and treatment outcomes, and unravelling key pathophysiological processes across various molecular layers. However, with the increasing availability of cost-effective high-throughput technologies (HTT) that generate vast amounts of data, analyzing single layers often falls short of establishing causal relations. Integrating multi-omics data spanning genomes, epigenomes, transcriptomes, proteomes, metabolomes, and microbiomes offers unique prospects to comprehend the underlying biology of complex diseases like cancer. This discussion explores algorithmic frameworks designed to uncover cancer subtypes, disease mechanisms, and methods for identifying pivotal genomic alterations. It also underscores the significance of multi-omics in tumor classifications, diagnostics, and prognostications. Despite its unparalleled advantages, the integration of multi-omics data has been slow to find its way into everyday clinics. A major hurdle is the uneven maturity of different omics approaches and the widening gap between the generation of large datasets and the capacity to process this data. Initiatives promoting the standardization of sample processing and analytical pipelines, as well as multidisciplinary training for experts in data analysis and interpretation, are crucial for translating theoretical findings into practical applications.
Affiliation(s)
- Sohini Chakraborty
  - Department of Biotechnology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Gaurav Sharma
  - Department of Biotechnology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Sricheta Karmakar
  - Department of Biotechnology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Satarupa Banerjee
  - Department of Biotechnology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India.
5
Thuilliez C, Moquin-Beaudry G, Khneisser P, Marques Da Costa ME, Karkar S, Boudhouche H, Drubay D, Audinot B, Geoerger B, Scoazec JY, Gaspar N, Marchais A. CellsFromSpace: a fast, accurate, and reference-free tool to deconvolve and annotate spatially distributed omics data. Bioinformatics Advances 2024; 4:vbae081. [PMID: 38915885 PMCID: PMC11194756 DOI: 10.1093/bioadv/vbae081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 05/02/2024] [Accepted: 05/29/2024] [Indexed: 06/26/2024]
Abstract
Motivation: Spatial transcriptomics enables the analysis of cell crosstalk in healthy and diseased organs by capturing the transcriptomic profiles of millions of cells within their spatial contexts. However, spatial transcriptomics approaches also raise new computational challenges for the multidimensional data analysis associated with spatial coordinates. Results: In this context, we introduce a novel analytical framework called CellsFromSpace based on independent component analysis (ICA), which allows users to analyze various commercially available technologies without relying on a single-cell reference dataset. The ICA approach deployed in CellsFromSpace decomposes spatial transcriptomics data into interpretable components associated with distinct cell types or activities. ICA also enables noise or artifact reduction and subset analysis of cell types of interest through component selection. We demonstrate the flexibility and performance of CellsFromSpace using real-world samples, showing ICA's ability to successfully identify spatially distributed cells as well as rare diffuse cells, and to quantitatively deconvolute datasets from the Visium, Slide-seq, MERSCOPE, and CosMX technologies. Comparative analysis with a current alternative reference-free deconvolution tool also highlights CellsFromSpace's speed, scalability and accuracy in processing complex, even multisample datasets. CellsFromSpace also offers a user-friendly graphical interface enabling non-bioinformaticians to annotate and interpret components based on spatial distribution and contributor genes, and perform full downstream analysis. Availability and implementation: CellsFromSpace (CFS) is distributed as an R package available from GitHub at https://github.com/gustaveroussy/CFS along with tutorials, examples, and detailed documentation.
Affiliation(s)
- Corentin Thuilliez
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
- Gaël Moquin-Beaudry
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
- Pierre Khneisser
  - Department of Medical Biology and Pathology, Gustave Roussy Cancer Campus, Villejuif 94805, France
- Maria Eugenia Marques Da Costa
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
  - Department of Pediatric and Adolescent Oncology, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif 94805, France
- Slim Karkar
  - University Bordeaux, CNRS, IBGC, UMR, Bordeaux 33077, France
  - Bordeaux Bioinformatic Center CBiB, University of Bordeaux, Bordeaux 33000, France
- Hanane Boudhouche
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
- Damien Drubay
  - Office of Biostatistics and Epidemiology, Gustave Roussy, Université Paris-Saclay, Villejuif 94805, France
  - Inserm, Université Paris-Saclay, CESP U1018, Oncostat, Labeled Ligue Contre le Cancer, Villejuif 94805, France
- Baptiste Audinot
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
- Birgit Geoerger
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
  - Department of Pediatric and Adolescent Oncology, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif 94805, France
- Jean-Yves Scoazec
  - Department of Medical Biology and Pathology, Gustave Roussy Cancer Campus, Villejuif 94805, France
- Nathalie Gaspar
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
  - Department of Pediatric and Adolescent Oncology, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif 94805, France
- Antonin Marchais
  - INSERM U1015, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif F-94805, France
  - Department of Pediatric and Adolescent Oncology, Gustave Roussy Cancer Campus, Université Paris-Saclay, Villejuif 94805, France
6
Doria-Belenguer S, Xenos A, Ceddia G, Malod-Dognin N, Pržulj N. The axes of biology: a novel axes-based network embedding paradigm to decipher the functional mechanisms of the cell. Bioinformatics Advances 2024; 4:vbae075. [PMID: 38827411 PMCID: PMC11142626 DOI: 10.1093/bioadv/vbae075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 04/15/2024] [Accepted: 05/22/2024] [Indexed: 06/04/2024]
Abstract
Summary: Common approaches for deciphering biological networks involve network embedding algorithms. These approaches strictly focus on clustering the genes' embedding vectors and interpreting such clusters to reveal the hidden information of the networks. However, the difficulty in interpreting the genes' clusters and the limitations of the functional annotations' resources hinder the identification of the cell's currently unknown functional mechanisms. We propose a new approach that shifts this functional exploration from the embedding vectors of genes in space to the axes of the space itself. Our methodology better disentangles biological information from the embedding space than the classic gene-centric approach. Moreover, it uncovers new data-driven functional interactions that are unregistered in the functional ontologies, but biologically coherent. Furthermore, we exploit these interactions to define new higher-level annotations that we term Axes-Specific Functional Annotations and validate them through literature curation. Finally, we leverage our methodology to discover evolutionary connections between cellular functions and the evolution of species. Availability and implementation: Data and source code can be accessed at https://gitlab.bsc.es/sdoria/axes-of-biology.git.
Affiliation(s)
- Gaia Ceddia
  - Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- Nataša Pržulj
  - Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
  - Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
  - ICREA, Barcelona 08010, Spain
7
Ferro dos Santos MR, Giuili E, De Koker A, Everaert C, De Preter K. Computational deconvolution of DNA methylation data from mixed DNA samples. Brief Bioinform 2024; 25:bbae234. [PMID: 38762790 PMCID: PMC11102637 DOI: 10.1093/bib/bbae234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Revised: 03/30/2024] [Accepted: 04/30/2024] [Indexed: 05/20/2024] Open
Abstract
In this review, we provide a comprehensive overview of the computational tools that have been published for the deconvolution of bulk DNA methylation (DNAm) data. Here, deconvolution refers to the estimation of the proportions of the cell types that constitute a mixed sample. The paper reviews 25 deconvolution methods (supervised, unsupervised or hybrid) developed between 2012 and 2023 and compares the strengths and limitations of each approach. Moreover, we describe the impact on deconvolution performance of the platform used to generate the methylation data (including microarrays and sequencing), the data pre-processing steps applied and the reference dataset used. Next to reference-based methods, we also examine methods that require only partial reference datasets or require no reference set at all. Finally, we provide guidelines for the use of specific methods depending on the DNA methylation data type and data availability.
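The definition of deconvolution given in this abstract can be made concrete with a small reference-based sketch: a bulk methylation profile is modelled as a proportion-weighted sum of cell-type reference profiles, and the proportions are estimated by least squares under non-negativity and sum-to-one constraints. This is an illustration only and not any of the reviewed methods; the function name, the toy beta values, and the clip-and-renormalise step (a simple heuristic rather than an exact projection onto the simplex) are all assumptions.

```python
def estimate_proportions(mixture, references, steps=2000, lr=0.1):
    """Estimate cell-type proportions for one bulk sample (hypothetical helper).

    mixture: methylation values of the bulk sample, one per CpG site.
    references: one reference profile per cell type, same CpG order.
    Gradient descent on the squared reconstruction error; after each step
    negative proportions are clipped to zero and the vector is rescaled
    to sum to one.
    """
    k = len(references)
    p = [1.0 / k] * k
    for _ in range(steps):
        # Reconstruction error at every CpG site for the current proportions.
        residual = [
            sum(p[j] * references[j][i] for j in range(k)) - m
            for i, m in enumerate(mixture)
        ]
        # Gradient of 0.5 * sum(residual^2) with respect to each proportion.
        grad = [
            sum(r * references[j][i] for i, r in enumerate(residual))
            for j in range(k)
        ]
        p = [max(0.0, pj - lr * g) for pj, g in zip(p, grad)]
        total = sum(p)
        p = [pj / total for pj in p] if total > 0 else [1.0 / k] * k
    return p

# Toy example: two cell types, four CpG sites, noise-free 30/70 mixture.
references = [
    [0.9, 0.1, 0.5, 0.2],
    [0.1, 0.8, 0.5, 0.7],
]
mixture = [0.3 * a + 0.7 * b for a, b in zip(*references)]
proportions = estimate_proportions(mixture, references)
```

On this noise-free toy input the estimate should land very close to 0.3 and 0.7. Real data call for a proper constrained solver (non-negative least squares or quadratic programming) and careful selection of informative CpG sites, which is where the reviewed methods differ.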
Affiliation(s)
- Maísa R Ferro dos Santos
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Edoardo Giuili
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Andries De Koker
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Celine Everaert
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Katleen De Preter
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
8
Liu C, Yu C, Song G, Fan X, Peng S, Zhang S, Zhou X, Zhang C, Geng X, Wang T, Cheng W, Zhu W. Comprehensive analysis of miRNA-mRNA regulatory pairs associated with colorectal cancer and the role in tumor immunity. BMC Genomics 2023; 24:724. [PMID: 38036953 PMCID: PMC10688136 DOI: 10.1186/s12864-023-09635-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Accepted: 08/29/2023] [Indexed: 12/02/2023] Open
Abstract
BACKGROUND: MicroRNAs (miRNAs), which act as post-transcriptional regulators of mRNAs via base-pairing with complementary sequences within mRNAs, are involved in the complex interaction between the immune system and tumors. In this research, we elucidated the expression profiles of miRNAs and their target mRNAs and their associations with the phenotypic hallmarks of colorectal cancer (CRC) by integrating transcriptomic, immunophenotype, methylation, mutation and survival data. RESULTS: We analyzed differential miRNA/mRNA expression profiles using the GEO, TCGA and GTEx databases and the correlation between miRNAs and targeted mRNAs using miRTarBase and TarBase. We then measured expression by qRT-PCR and validated the diagnostic value of miRNA-mRNA regulatory pairs by ROC, calibration curve and DCA. Phenotypic hallmarks of the regulatory pairs, including tumor-infiltrating lymphocytes, tumor microenvironment, tumor mutation burden, global methylation and gene mutation, were also described. The expression levels of miRNAs and target mRNAs were detected in 80 paired colon tissue samples. Ultimately, we selected two pivotal regulatory pairs (miR-139-5p/STC1 and miR-20a-5p/FGL2) and verified the diagnostic value of the complex model, which combines the 4 above-mentioned signatures, in 3 testing GEO datasets and an external validation cohort. CONCLUSIONS: We found that 2 miRNAs, by targeting 2 metastasis-related mRNAs, were correlated with tumor-infiltrating macrophages and with HRAS and BRAF gene mutation status. Our results established a diagnostic model containing 2 miRNAs and their respective targeted mRNAs to distinguish CRCs from normal controls and displayed their complex roles in CRC pathogenesis, especially tumor immunity.
Affiliation(s)
- Cheng Liu
  - Department of Gastroenterology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Chun Yu
  - Department of Gastroenterology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Guoxin Song
  - Department of Pathology, the First Affiliated Hospital of Nanjing Medical University, Nanjing, 210029, Jiangsu, China
- Xingchen Fan
  - Department of Oncology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Shuang Peng
  - Department of Oncology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Shiyu Zhang
  - Department of Oncology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Xin Zhou
  - Department of Oncology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Cheng Zhang
  - Department of Science and Technology, the First Affiliated Hospital of Nanjing Medical University, Nanjing, 210029, Jiangsu, China
- Xiangnan Geng
  - Department of Clinical Engineer, the First Affiliated Hospital of Nanjing Medical University, Nanjing, 210029, Jiangsu, China
- Tongshan Wang
  - Department of Oncology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
- Wenfang Cheng
  - Department of Gastroenterology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China.
- Wei Zhu
  - Department of Oncology, the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China.
9
Gong Z, Chen C, Chen C, Li C, Tian X, Gong Z, Lv X. RamanCMP: A Raman spectral classification acceleration method based on lightweight model and model compression techniques. Anal Chim Acta 2023; 1278:341758. [PMID: 37709483 DOI: 10.1016/j.aca.2023.341758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Revised: 08/02/2023] [Accepted: 08/27/2023] [Indexed: 09/16/2023]
Abstract
In recent years, Raman spectroscopy combined with deep learning techniques has been widely used in fields such as medicine, chemistry, and geology. However, there is still room for optimization of deep learning techniques and model compression algorithms for processing Raman spectral data. To further optimize deep learning models applied to Raman spectroscopy, this study uses time, accuracy, sensitivity, specificity and the number of floating point operations (FLOPs) as evaluation metrics to optimize the model, which is named RamanCompact (RamanCMP). The experimental data used in this research are selected from the RRUFF public dataset and consist of 723 Raman spectroscopy data samples from 10 different mineral categories. In this paper, 1D-EfficientNet adapted to the spectral data as well as 1D-DRSN are proposed to improve the model classification accuracy. To achieve better classification accuracy while optimizing the time parameters, three model compression methods are designed: knowledge distillation using the 1D-EfficientNet model as a teacher model to train convolutional neural networks (CNN), a channel conversion method to optimize the 1D-DRSN model, and the use of the 1D-DRSN model as a feature extractor in combination with a linear discriminant analysis (LDA) model for classification. Compared with the traditional LDA and CNN models, the accuracy of 1D-EfficientNet and 1D-DRSN is improved by more than 20%. The time of the distilled model is reduced by 9680.9 s compared with the teacher model 1D-EfficientNet, at the cost of 2.07% accuracy. The accuracy of the distilled model is improved by 20% compared to the CNN student model while keeping inference efficiency constant. The 1D-DRSN optimized with the channel conversion method saves 60% of the inference time of the original 1D-DRSN model. Feature extraction reduces the inference time of the 1D-DRSN model by 93% with 94.48% accuracy.
This study innovatively combines lightweight models and model compression algorithms to improve the classification speed of deep learning models in the field of Raman spectroscopy, forming a complete set of analysis methods and laying the foundation for future research.
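The knowledge distillation step named in this abstract can be sketched with the widely used temperature-softened loss. The abstract does not state which loss the authors used, so the formulation below (soft-target KL term scaled by the squared temperature plus a hard-label cross-entropy term) is an assumption, as are the function names and the default `temperature` and `alpha` values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Soft term: KL divergence from the teacher's softened distribution to
    the student's, scaled by temperature squared. Hard term: ordinary
    cross-entropy of the student against the true class index.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(soft_teacher, soft_student) if t > 0)
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard

# Toy logits for a three-class problem; class 0 is the true label.
teacher = [2.0, 1.0, 0.1]
loss_matching = distillation_loss(teacher, teacher, true_label=0)
loss_mismatch = distillation_loss([0.1, 1.0, 2.0], teacher, true_label=0)
```

A student that reproduces the teacher's logits pays only the hard-label term, so `loss_matching` is smaller than `loss_mismatch`. In practice this loss would be computed per batch inside a deep learning framework rather than in plain Python.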
Affiliation(s)
- Zengyun Gong
  - College of Software, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Chen Chen
  - College of Information Science and Engineering, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Cheng Chen
  - College of Software, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Chenxi Li
  - Oncological Department of Oral and Maxillofacial Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, China.
- Xuecong Tian
  - College of Information Science and Engineering, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Zhongcheng Gong
  - Oncological Department of Oral and Maxillofacial Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, China; Hospital of Stomatology Xinjiang Medical University, Urumqi, 830054, Xinjiang, China; Stomatological Research Institute of Xinjiang Uygur Autonomous Region, Urumqi, 830054, Xinjiang, China.
- Xiaoyi Lv
  - College of Software, Xinjiang University, Urumqi, 830046, Xinjiang, China; Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi, 830046, Xinjiang, China.
10
Fouché A, Chadoutaud L, Delattre O, Zinovyev A. Transmorph: a unifying computational framework for modular single-cell RNA-seq data integration. NAR Genom Bioinform 2023; 5:lqad069. [PMID: 37448589 PMCID: PMC10336778 DOI: 10.1093/nargab/lqad069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Revised: 06/02/2023] [Accepted: 07/10/2023] [Indexed: 07/15/2023] Open
Abstract
Data integration of single-cell RNA-seq (scRNA-seq) data describes the task of embedding datasets gathered from different sources or experiments into a common representation so that cells with similar types or states are embedded close to one another independently from their dataset of origin. Data integration is a crucial step in most scRNA-seq data analysis pipelines involving multiple batches. It improves data visualization, batch effect reduction, clustering, label transfer, and cell type inference. Many data integration tools have been proposed during the last decade, but a surge in the number of these methods has made it difficult to pick one for a given use case. Furthermore, these tools are provided as rigid pieces of software, making it hard to adapt them to various specific scenarios. In order to address both of these issues at once, we introduce the transmorph framework. It allows the user to engineer powerful data integration pipelines and is supported by a rich software ecosystem. We demonstrate the usefulness of transmorph by solving a variety of practical challenges on scRNA-seq datasets, including joint dataset embedding, gene space integration, and transfer of cycle phase annotations. transmorph is provided as an open source Python package.
Affiliation(s)
- Aziz Fouché
  - To whom correspondence should be addressed. Tel: +33 156246989
- Loïc Chadoutaud
  - Institut Curie, PSL Research University, 75005 Paris, France
  - INSERM, 75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75005 Paris, France
- Olivier Delattre
  - INSERM U830, Equipe Labellisée LNCC, SIREDO Oncology Centre, Institut Curie, 75005 Paris, France
- Andrei Zinovyev
  - Correspondence may also be addressed to Andrei Zinovyev. Tel: +33 156246989
11
|
Mirkes EM, Bac J, Fouché A, Stasenko SV, Zinovyev A, Gorban AN. Domain Adaptation Principal Component Analysis: Base Linear Method for Learning with Out-of-Distribution Data. ENTROPY (BASEL, SWITZERLAND) 2022; 25:33. [PMID: 36673174 PMCID: PMC9858254 DOI: 10.3390/e25010033] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/26/2022] [Revised: 12/18/2022] [Accepted: 12/21/2022] [Indexed: 06/17/2023]
Abstract
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points, and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications leading to reduced dataset representations, taking into account possible divergence between source and target domains.
Affiliation(s)
- Evgeny M. Mirkes
  - School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
- Jonathan Bac
  - Institut Curie, PSL Research University, 75005 Paris, France
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, 75012 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75005 Paris, France
- Aziz Fouché
  - Institut Curie, PSL Research University, 75005 Paris, France
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, 75012 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75005 Paris, France
- Sergey V. Stasenko
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603000 Nizhniy Novgorod, Russia
- Andrei Zinovyev
  - Institut Curie, PSL Research University, 75005 Paris, France
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, 75012 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75005 Paris, France
- Alexander N. Gorban
  - School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
12.
Anglada-Girotto M, Miravet-Verde S, Serrano L, Head SA. robustica: customizable robust independent component analysis. BMC Bioinformatics 2022; 23:519. [PMID: 36471244 PMCID: PMC9721028 DOI: 10.1186/s12859-022-05043-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 11/08/2022] [Indexed: 12/12/2022]
Abstract
BACKGROUND Independent Component Analysis (ICA) allows the dissection of omic datasets into modules that help to interpret global molecular signatures. The inherent randomness of this algorithm can be overcome by clustering many iterations of ICA together to obtain robust components. Existing algorithms for robust ICA are dependent on the choice of clustering method and on computing a potentially biased and large Pearson distance matrix. RESULTS We present robustica, a Python-based package to compute robust independent components with a fully customizable clustering algorithm and distance metric. Here, we exploited its customizability to revisit and optimize robust ICA systematically. Of the 6 popular clustering algorithms considered, DBSCAN performed the best at clustering independent components across ICA iterations. To enable using Euclidean distances, we created a subroutine that infers and corrects the components' signs across ICA iterations. Our subroutine increased the resolution, robustness, and computational efficiency of the algorithm. Finally, we show the applicability of robustica by dissecting over 500 tumor samples from low-grade glioma (LGG) patients, where we define two new gene expression modules with key modulators of tumor progression upon IDH1 and TP53 mutagenesis. CONCLUSION robustica brings precise, efficient, and customizable robust ICA into the Python toolbox. Through its customizability, we explored how different clustering algorithms and distance metrics can further optimize robust ICA. Then, we showcased how robustica can be used to discover gene modules associated with combinations of features of biological interest. Taken together, given the broad applicability of ICA for omic data analysis, we envision robustica will facilitate the seamless computation and integration of robust independent components in large pipelines.
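The sign ambiguity mentioned in the abstract can be illustrated with a small sketch. ICA recovers each component only up to sign, so two runs can return the same component as `s` and `-s`, which look far apart under Euclidean distance. The convention below (flip each component so that its largest-magnitude weight is positive) is an assumption chosen for illustration and is not taken from the robustica source code; the function name `correct_sign` and the toy vectors are hypothetical.

```python
from math import dist


def correct_sign(component):
    """Flip a component so that its largest-magnitude weight is positive."""
    pivot = max(component, key=abs)
    return [-w for w in component] if pivot < 0 else list(component)


# Two hypothetical ICA runs that recovered the same component with opposite signs.
run_a = [0.1, -2.0, 0.3, 0.05]
run_b = [-0.12, 1.9, -0.28, -0.04]

before = dist(run_a, run_b)
after = dist(correct_sign(run_a), correct_sign(run_b))
print(f"Euclidean distance before: {before:.3f}, after: {after:.3f}")
```

After the correction the two runs sit close together, so a distance-based clustering step can group them as one component.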
Affiliation(s)
- Miquel Anglada-Girotto
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Samuel Miravet-Verde
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Luis Serrano
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - ICREA, Pg. Lluís Companys 23, 08010, Barcelona, Spain
- Sarah A Head
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
13.
Singh KS, van der Hooft JJJ, van Wees SCM, Medema MH. Integrative omics approaches for biosynthetic pathway discovery in plants. Nat Prod Rep 2022; 39:1876-1896. [PMID: 35997060 PMCID: PMC9491492 DOI: 10.1039/d2np00032f] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 04/26/2022] [Indexed: 12/13/2022]
Abstract
Covering: up to 2022. With the emergence of large amounts of omics data, computational approaches for the identification of plant natural product biosynthetic pathways and their genetic regulation have become increasingly important. While genomes provide clues regarding functional associations between genes based on gene clustering, metabolome mining provides a foundational technology to chart natural product structural diversity in plants, and transcriptomics has been successfully used to identify new members of their biosynthetic pathways based on coexpression. Thus far, most approaches utilizing transcriptomics and metabolomics have been targeted towards specific pathways and use one type of omics data at a time. Recent technological advances now provide new opportunities for integration of multiple omics types and untargeted pathway discovery. Here, we review advances in plant biosynthetic pathway discovery using genomics, transcriptomics, and metabolomics, as well as recent efforts towards omics integration. We highlight how transcriptomics and metabolomics provide complementary information to link genes to metabolites, by associating temporal and spatial gene expression levels with metabolite abundance levels across samples, and by matching mass-spectral features to enzyme families. Furthermore, we suggest that elucidation of gene regulatory networks using time-series data may prove useful for efforts to unwire the complexities of biosynthetic pathway components based on regulatory interactions and events.
Affiliation(s)
- Kumar Saurabh Singh
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
  - Plant-Microbe Interactions, Institute of Environmental Biology, Utrecht University, The Netherlands
- Justin J J van der Hooft
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
  - Department of Biochemistry, University of Johannesburg, Auckland Park, Johannesburg 2006, South Africa
- Saskia C M van Wees
  - Plant-Microbe Interactions, Institute of Environmental Biology, Utrecht University, The Netherlands
- Marnix H Medema
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
14.
Hiort P, Hugo J, Zeinert J, Müller N, Kashyap S, Rajapakse JC, Azuaje F, Renard BY, Baum K. DrDimont: explainable drug response prediction from differential analysis of multi-omics networks. Bioinformatics 2022; 38:ii113-ii119. [PMID: 36124784 PMCID: PMC9486584 DOI: 10.1093/bioinformatics/btac477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/25/2022]
Abstract
MOTIVATION While it has been well established that drugs affect and help patients differently, personalized drug response predictions remain challenging. Solutions based on single omics measurements have been proposed, and networks provide means to incorporate molecular interactions into reasoning. However, how to integrate the wealth of information contained in multiple omics layers still poses a complex problem. RESULTS We present DrDimont, Drug response prediction from Differential analysis of multi-omics networks. It allows for comparative conclusions between two conditions and translates them into differential drug response predictions. DrDimont focuses on molecular interactions. It establishes condition-specific networks from correlation within an omics layer that are then reduced and combined into heterogeneous, multi-omics molecular networks. A novel semi-local, path-based integration step ensures integrative conclusions. Differential predictions are derived from comparing the condition-specific integrated networks. DrDimont's predictions are explainable, i.e. molecular differences that are the source of high differential drug scores can be retrieved. We predict differential drug response in breast cancer using transcriptomics, proteomics, phosphosite and metabolomics measurements and contrast estrogen receptor positive and receptor negative patients. DrDimont performs better than drug prediction based on differential protein expression or PageRank when evaluating it on ground truth data from cancer cell lines. We find proteomic and phosphosite layers to carry most information for distinguishing drug response. AVAILABILITY AND IMPLEMENTATION DrDimont is available on CRAN: https://cran.r-project.org/package=DrDimont. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pauline Hiort
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Julian Hugo
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Justus Zeinert
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Nataniel Müller
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Spoorthi Kashyap
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Jagath C Rajapakse
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore
- Bernhard Y Renard
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
15.
easyMF: A Web Platform for Matrix Factorization-Based Gene Discovery from Large-scale Transcriptome Data. Interdiscip Sci 2022; 14:746-758. [PMID: 35585280 DOI: 10.1007/s12539-022-00522-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/01/2021] [Revised: 04/06/2022] [Accepted: 04/07/2022] [Indexed: 01/22/2023]
Abstract
With the development of high-throughput experimental technologies, large-scale RNA sequencing (RNA-Seq) data have been and continue to be produced, but have led to challenges in extracting relevant biological knowledge hidden in the produced high-dimensional gene expression matrices. Here, we develop easyMF (https://github.com/cma2015/easyMF), a web platform that can facilitate functional gene discovery from large-scale transcriptome data using matrix factorization (MF) algorithms. Compared with existing MF-based software packages, easyMF exhibits several promising features, such as greater functionality, flexibility and ease of use. The easyMF platform is built on the Big-Data-supported Galaxy system with user-friendly graphical user interfaces, allowing users with little programming experience to streamline transcriptome analysis from raw reads to gene expression, carry out multiple-scenario MF analysis, and perform multiple-way MF-based gene discovery. easyMF is also powered by advanced packaging technology to enhance ease of use under different operating systems and computational environments. We illustrated the application of easyMF for seed gene discovery from temporal, spatial, and integrated RNA-Seq datasets of maize (Zea mays L.), resulting in the identification of 3,167 seed stage-specific, 1,849 seed compartment-specific, and 774 seed-specific genes, respectively. The present results also indicated that easyMF can prioritize seed-related genes with superior prediction performance over the state-of-the-art network-based gene prioritization system MaizeNet. As a modular, containerized and open-source platform, easyMF can be further customized to satisfy users' specific demands of functional gene discovery and deployed as a web service for broad applications.
16.
Captier N, Merlevede J, Molkenov A, Seisenova A, Zhubanchaliyev A, Nazarov PV, Barillot E, Kairov U, Zinovyev A. BIODICA: a computational environment for Independent Component Analysis of omics data. Bioinformatics 2022; 38:2963-2964. [PMID: 35561190 DOI: 10.1093/bioinformatics/btac204] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/03/2022] [Revised: 03/29/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022]
Abstract
SUMMARY We developed BIODICA, an integrated computational environment for application of independent component analysis (ICA) to bulk and single-cell molecular profiles, interpretation of the results in terms of biological functions and correlation with metadata. The computational core is the novel Python package stabilized-ica, which provides an interface to several ICA algorithms, a stabilization procedure, meta-analysis and component interpretation tools. BIODICA is equipped with a user-friendly graphical user interface, allowing inexperienced users to perform ICA-based omics data analysis. The results are provided in interactive ways, thus facilitating communication with biology experts. AVAILABILITY AND IMPLEMENTATION BIODICA is implemented in Java, Python and JavaScript. The source code is freely available on GitHub under the MIT and the GNU LGPL licenses. BIODICA is supported on all major operating systems. URL: https://sysbio-curie.github.io/biodica-environment/.
Affiliation(s)
- Nicolas Captier
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
  - Laboratoire d'Imagerie Translationnelle en Oncologie, Institut Curie, INSERM U1288, PSL Research University, 91400 Orsay, France
- Jane Merlevede
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
- Askhat Molkenov
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Ainur Seisenova
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Altynbek Zhubanchaliyev
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Petr V Nazarov
  - Multiomics Data Science Research Group, Department of Cancer Research & Bioinformatics Platform, Luxembourg Institute of Health, L-1445 Strassen, Luxembourg
- Emmanuel Barillot
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
- Ulykbek Kairov
  - National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
- Andrei Zinovyev
  - Institut National de la Santé et de la Recherche Médicale (INSERM), U900, F-75005 Paris, France
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006 Paris, France
17.
|
A deep clustering by multi-level feature fusion. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01557-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
18.
Moerkerke B, Seurinck R. Discussion on "distributional independent component analysis for diverse neuroimaging modalities" by Ben Wu, Subhadip Pal, Jian Kang, and Ying Guo. Biometrics 2021; 78:1118-1121. [PMID: 34780667 DOI: 10.1111/biom.13590] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/26/2021] [Revised: 07/15/2021] [Accepted: 07/22/2021] [Indexed: 12/12/2022]
Abstract
We are grateful for the opportunity to provide a discussion on this paper. We will first focus on the general context. Next, we will emphasize the novel key ideas proposed by the authors before formulating some open questions.
Affiliation(s)
- Ruth Seurinck
  - Data Mining and Modelling for Biomedicine, VIB-UGent Center for Inflammation Research, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
19.
Chauhan SM, Poudel S, Rychel K, Lamoureux C, Yoo R, Al Bulushi T, Yuan Y, Palsson BO, Sastry AV. Machine Learning Uncovers a Data-Driven Transcriptional Regulatory Network for the Crenarchaeal Thermoacidophile Sulfolobus acidocaldarius. Front Microbiol 2021; 12:753521. [PMID: 34777307 PMCID: PMC8578740 DOI: 10.3389/fmicb.2021.753521] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/05/2021] [Accepted: 09/30/2021] [Indexed: 01/24/2023]
Abstract
Dynamic cellular responses to environmental constraints are coordinated by the transcriptional regulatory network (TRN), which modulates gene expression. This network controls most fundamental cellular responses, including metabolism, motility, and stress responses. Here, we apply independent component analysis, an unsupervised machine learning approach, to 95 high-quality Sulfolobus acidocaldarius RNA-seq datasets and extract 45 independently modulated gene sets, or iModulons. Together, these iModulons contain 755 genes (32% of the genes identified on the genome) and explain over 70% of the variance in the expression compendium. We show that five modules represent the effects of known transcriptional regulators, and hypothesize that most of the remaining modules represent the effects of uncharacterized regulators. Further analysis of these gene sets results in: (1) the prediction of a DNA export system composed of five uncharacterized genes, (2) expansion of the LysM regulon, and (3) evidence for an as-yet-undiscovered global regulon. Our approach allows for a mechanistic, systems-level elucidation of an extremophile's responses to biological perturbations, which could inform research on gene-regulator interactions and facilitate regulator discovery in S. acidocaldarius. We also provide the first global TRN for S. acidocaldarius. Collectively, these results provide a roadmap toward regulatory network discovery in archaea.
Affiliation(s)
- Siddharth M. Chauhan
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Saugat Poudel
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Kevin Rychel
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Cameron Lamoureux
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Reo Yoo
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Tahani Al Bulushi
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Yuan Yuan
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
- Bernhard O. Palsson
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, Denmark
- Anand V. Sastry
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
20.
Ashenova A, Daniyarov A, Molkenov A, Sharip A, Zinovyev A, Kairov U. Meta-Analysis of Esophageal Cancer Transcriptomes Using Independent Component Analysis. Front Genet 2021; 12:683632. [PMID: 34795689 PMCID: PMC8594933 DOI: 10.3389/fgene.2021.683632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2021] [Accepted: 10/05/2021] [Indexed: 11/17/2022]
Abstract
Independent Component Analysis (ICA) is a matrix factorization method for data dimension reduction. ICA has been widely applied for the analysis of transcriptomic data for blind separation of biological, environmental, and technical factors affecting gene expression. The study aimed to analyze the publicly available esophageal cancer data using ICA for identification and comprehensive analysis of reproducible signaling pathways and molecular signatures involved in this cancer type. In this study, four independent esophageal cancer transcriptomic datasets from GEO databases were used. A bioinformatics tool, "BiODICA-Independent Component Analysis of Big Omics Data", was applied to compute independent components (ICs). Gene Set Enrichment Analysis (GSEA) and ToppGene uncovered the most significantly enriched pathways. Construction and visualization of gene networks and graphs were performed using Cytoscape and the HPRD database. The correlation graph between decompositions into 30 ICs was built with absolute correlation values exceeding 0.3. Clusters of components (pseudocliques) were observed in the structure of the correlation graph. The top 1,000 most contributing genes of each IC in the pseudocliques were mapped to the PPI network to construct associated signaling pathways. Some cliques were composed of densely interconnected nodes and included components common to most cancer types (such as cell cycle and extracellular matrix signals), while others were specific to esophageal cancer (EC). The results of this investigation may reveal potential biomarkers of esophageal carcinogenesis, functional subsystems dysregulated in the tumor cells, and be helpful in predicting the early development of a tumor.
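The correlation-graph step described in the abstract (linking components from different decompositions when their absolute correlation exceeds 0.3) can be sketched as follows. This is a minimal illustration, not the BIODICA implementation: the component names and weight vectors are made up, Pearson correlation is assumed as the correlation measure, and all components are assumed to be defined over the same ordered gene list.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_graph(components, threshold=0.3):
    """Return edges (name_i, name_j, r) for pairs with |r| above the threshold."""
    edges = []
    for (name_i, ci), (name_j, cj) in combinations(components.items(), 2):
        r = pearson(ci, cj)
        if abs(r) > threshold:
            edges.append((name_i, name_j, r))
    return edges


# Hypothetical gene-weight vectors for ICs from two datasets (same gene order).
components = {
    "dataset1_IC1": [2.0, 1.5, 0.1, -0.2, 0.0],
    "dataset2_IC4": [-1.8, -1.6, 0.0, 0.3, 0.1],  # same signal, opposite sign
    "dataset2_IC7": [0.5, -0.5, 1.0, -1.0, 0.0],  # unrelated signal
}

edges = correlation_graph(components)
for name_i, name_j, r in edges:
    print(f"{name_i} -- {name_j}: r = {r:.2f}")
```

Using the absolute value of the correlation matters here because ICA components are defined only up to sign, so a strongly negative correlation still indicates the same underlying signal.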
Affiliation(s)
- Ainur Ashenova
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
  - Department of Biology, School of Sciences and Humanities, Nazarbayev University, Nur-Sultan, Kazakhstan
- Asset Daniyarov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Aigul Sharip
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Andrei Zinovyev
  - Institut Curie, PSL Research University, INSERM U900, Paris, France
  - Laboratory of Advanced Methods for High-dimensional Data Analysis, Lobachevsky University, Nizhny Novgorod, Russia
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
21.
Gorban AN, Grechuk B, Mirkes EM, Stasenko SV, Tyukin IY. High-Dimensional Separability for One- and Few-Shot Learning. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1090. [PMID: 34441230 PMCID: PMC8392747 DOI: 10.3390/e23081090] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/28/2021] [Revised: 08/08/2021] [Accepted: 08/13/2021] [Indexed: 12/31/2022]
Abstract
This work is driven by a practical question: corrections of Artificial Intelligence (AI) errors. These corrections should be quick and non-iterative. To solve this problem without modification of a legacy AI system, we propose special 'external' devices, correctors. Elementary correctors consist of two parts, a classifier that separates the situations with high risk of error from the situations in which the legacy AI system works well and a new decision that should be recommended for situations with potential errors. Input signals for the correctors can be the inputs of the legacy AI system, its internal signals, and outputs. If the intrinsic dimensionality of data is high enough then the classifiers for correction of a small number of errors can be very simple. According to the blessing of dimensionality effects, even simple and robust Fisher's discriminants can be used for one-shot learning of AI correctors. Stochastic separation theorems provide the mathematical basis for this one-shot learning. However, as the number of correctors needed grows, the cluster structure of data becomes important and a new family of stochastic separation theorems is required. We reject the classical hypothesis of the regularity of the data distribution and assume that the data can have a rich fine-grained structure with many clusters and corresponding peaks in the probability density. New stochastic separation theorems for data with fine-grained structure are formulated and proved. On the basis of these theorems, the multi-correctors for granular data are proposed. The advantages of the multi-corrector technology were demonstrated by examples of correcting errors and learning new classes of objects by a deep convolutional neural network on the CIFAR-10 dataset. The key problems of the non-classical high-dimensional data analysis are reviewed together with the basic preprocessing steps including the correlation transformation, supervised Principal Component Analysis (PCA), semi-supervised PCA, transfer component analysis, and new domain adaptation PCA.
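The idea of an elementary corrector built from a single error example can be sketched in a few lines. The sketch assumes the inputs have already been centered and whitened, in which case Fisher's discriminant for one point reduces to an inner-product threshold: a new input `y` is flagged when `<x, y> > alpha * <x, x>`, where `x` is the recorded error case. The threshold `alpha = 0.8`, the function names, and the toy vectors are illustrative assumptions, not values from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def make_corrector(error_point, alpha=0.8):
    """Build a one-shot linear classifier from a single error example.

    Assumes inputs are already centered and whitened, so the Fisher
    discriminant reduces to an inner-product threshold.
    """
    threshold = alpha * dot(error_point, error_point)
    return lambda y: dot(error_point, y) > threshold


# Hypothetical preprocessed feature vectors.
error_case = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
normal_cases = [
    [1.0, 1.0, -1.0, 1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0, -1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0, -1.0, -1.0, -1.0],
]

is_risky = make_corrector(error_case)
print([is_risky(y) for y in normal_cases])           # none of these is flagged
print(is_risky([1.0, -1.0, 1.0, 1.0, -1.0, 0.9]))    # near-copy of the error case
```

No iterative training is involved: the classifier is fully determined by the one error example and the threshold, which is what makes the correction quick and non-iterative.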
Affiliation(s)
- Alexander N. Gorban
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhni Novgorod, Russia
- Bogdan Grechuk
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
- Evgeny M. Mirkes
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhni Novgorod, Russia
- Sergey V. Stasenko
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhni Novgorod, Russia
- Ivan Y. Tyukin
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhni Novgorod, Russia
  - Department of Geoscience and Petroleum, Norwegian University of Science and Technology, 7491 Trondheim, Norway
22.
Blessing of dimensionality at the edge and geometry of few-shot learning. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.01.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
23.
Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J 2021; 19:3735-3746. [PMID: 34285775 PMCID: PMC8258788 DOI: 10.1016/j.csbj.2021.06.030] [Citation(s) in RCA: 178] [Impact Index Per Article: 59.3] [Received: 03/30/2021] [Revised: 06/17/2021] [Accepted: 06/21/2021] [Indexed: 12/25/2022]
Abstract
Increased availability of high-throughput technologies has generated an ever-growing number of omics data that seek to portray many different but complementary biological layers including genomics, epigenomics, transcriptomics, proteomics, and metabolomics. New insights from these data have been obtained by machine learning algorithms that have produced diagnostic and classification biomarkers. Most biomarkers obtained to date, however, only include one omic measurement at a time and thus do not take full advantage of recent multi-omics experiments that now capture the entire complexity of biological systems. Multi-omics data integration strategies are needed to combine the complementary knowledge brought by each omics layer. We have summarized the most recent data integration methods/frameworks into five different integration strategies: early, mixed, intermediate, late and hierarchical. In this mini-review, we focus on challenges and existing multi-omics integration strategies by paying special attention to machine learning applications.
Affiliation(s)
- Milan Picard
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Marie-Pier Scott-Boyer
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Antoine Bodein
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Olivier Périn
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Arnaud Droit
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Corresponding author.
|
24
|
Zinovyev A. Adaptation through the lens of single-cell multi-omics data: Comment on "Dynamic and thermodynamic models of adaptation" by A.N. Gorban et al. Phys Life Rev 2021; 38:132-134. [PMID: 34088607 DOI: 10.1016/j.plrev.2021.05.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2021] [Accepted: 05/19/2021] [Indexed: 10/21/2022]
Affiliation(s)
- Andrei Zinovyev
- Institut Curie, PSL Research University, F-75005 Paris, France; INSERM, U900, F-75005 Paris, France; CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France; Laboratory of advanced methods for high-dimensional data analysis, Lobachevsky University, 603000 Nizhny Novgorod, Russia.
|
25
|
Kuksin M, Morel D, Aglave M, Danlos FX, Marabelle A, Zinovyev A, Gautheret D, Verlingue L. Applications of single-cell and bulk RNA sequencing in onco-immunology. Eur J Cancer 2021; 149:193-210. [PMID: 33866228 DOI: 10.1016/j.ejca.2021.03.005] [Citation(s) in RCA: 62] [Impact Index Per Article: 20.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 02/26/2021] [Accepted: 03/04/2021] [Indexed: 02/08/2023]
Abstract
The rising interest for precise characterization of the tumour immune contexture has recently brought forward the high potential of RNA sequencing (RNA-seq) in identifying molecular mechanisms engaged in the response to immunotherapy. In this review, we provide an overview of the major principles of single-cell and conventional (bulk) RNA-seq applied to onco-immunology. We describe standard preprocessing and statistical analyses of data obtained from such techniques and highlight some computational challenges relative to the sequencing of individual cells. We notably provide examples of gene expression analyses such as differential expression analysis, dimensionality reduction, clustering and enrichment analysis. Additionally, we used public data sets to exemplify how deconvolution algorithms can identify and quantify multiple immune subpopulations from either bulk or single-cell RNA-seq. We give examples of machine and deep learning models used to predict patient outcomes and treatment effect from high-dimensional data. Finally, we balance the strengths and weaknesses of single-cell and bulk RNA-seq regarding their applications in the clinic.
Affiliation(s)
- Maria Kuksin
- ENS de Lyon, 15 Parvis René Descartes, 69007, Lyon, France; Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Daphné Morel
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; Département de Radiothérapie, Gustave Roussy Cancer Campus, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM UMR1030, Molecular Radiotherapy and Therapeutic Innovations, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Marine Aglave
- INSERM US23, CNRS UMS 3655, Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Aurélien Marabelle
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM U1015, Gustave Roussy, Université Paris Saclay, France
- Andrei Zinovyev
- Institut Curie, PSL Research University, F-75005, Paris, France; INSERM, U900, F-75005, Paris, France; MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006, Paris, France; Laboratory of Advanced Methods for High-dimensional Data Analysis, Lobachevsky University, 603000, Nizhny Novgorod, Russia
- Daniel Gautheret
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France; IHU PRISM, Gustave Roussy Cancer Campus, Gustave Roussy, 114 Rue Edouard Vaillant, 94800, Villejuif, France; Université Paris-Saclay, France
- Loïc Verlingue
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM UMR1030, Molecular Radiotherapy and Therapeutic Innovations, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France; Institut Curie, PSL Research University, F-75005, Paris, France; Université Paris-Saclay, France.
|
26
|
Gorban AN, Tyukina TA, Pokidysheva LI, Smirnova EV. Dynamic and thermodynamic models of adaptation. Phys Life Rev 2021; 37:17-64. [PMID: 33765608 DOI: 10.1016/j.plrev.2021.03.001] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Accepted: 03/11/2021] [Indexed: 12/14/2022]
Abstract
The concept of biological adaptation was closely connected to some mathematical, engineering and physical ideas from the very beginning. Cannon in his "The wisdom of the body" (1932) systematically used the engineering vision of regulation. In 1938, Selye enriched this approach by the notion of adaptation energy. This term causes much debate when one takes it literally, as a physical quantity, i.e. a sort of energy. Selye did not use the language of mathematics systematically, but the formalization of his phenomenological theory in the spirit of thermodynamics was simple and led to verifiable predictions. In the 1980s, the dynamics of correlation and variance in systems under adaptation to a load of environmental factors were studied and the universal effect in ensembles of systems under a load of similar factors was discovered: in a crisis, as a rule, even before the onset of obvious symptoms of stress, the correlation increases together with variance (and volatility). Over 30 years, this effect has been supported by many observations of groups of humans, mice, trees, grassy plants, and on financial time series. In the last ten years, these results were supplemented by many new experiments, from gene networks in cardiology and oncology to dynamics of depression and clinical psychotherapy. Several systems of models were developed: the thermodynamic-like theory of adaptation of ensembles and several families of models of individual adaptation. Historically, the first group of models was based on Selye's concept of adaptation energy and used fitness estimates. Two other groups of models are based on the idea of hidden attractor bifurcation and on the advection-diffusion model for distribution of population in the space of physiological attributes. We explore this world of models and experiments, starting with classic works, with particular attention to the results of the last ten years and open questions.
Affiliation(s)
- A N Gorban
- Department of Mathematics, University of Leicester, Leicester, UK; Lobachevsky University, Nizhni Novgorod, Russia.
- T A Tyukina
- Department of Mathematics, University of Leicester, Leicester, UK.
- E V Smirnova
- Siberian Federal University, Krasnoyarsk, Russia.
|
27
|
Scherer M, Schmidt F, Lazareva O, Walter J, Baumbach J, Schulz MH, List M. Machine learning for deciphering cell heterogeneity and gene regulation. NATURE COMPUTATIONAL SCIENCE 2021; 1:183-191. [PMID: 38183187 DOI: 10.1038/s43588-021-00038-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Accepted: 02/08/2021] [Indexed: 12/14/2022]
Abstract
Epigenetics studies inheritable and reversible modifications of DNA that allow cells to control gene expression throughout their development and in response to environmental conditions. In computational epigenomics, machine learning is applied to study various epigenetic mechanisms genome wide. Its aim is to expand our understanding of cell differentiation, that is their specialization, in health and disease. Thus far, most efforts focus on understanding the functional encoding of the genome and on unraveling cell-type heterogeneity. Here, we provide an overview of state-of-the-art computational methods and their underlying statistical concepts, which range from matrix factorization and regularized linear regression to deep learning methods. We further show how the rise of single-cell technology leads to new computational challenges and creates opportunities to further our understanding of epigenetic regulation.
Affiliation(s)
- Michael Scherer
- Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
- Computational Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Graduate School of Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Olga Lazareva
- Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Jörn Walter
- Computational Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Jan Baumbach
- Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Computational BioMedicine Lab, Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Marcel H Schulz
- Institute of Cardiovascular Regeneration, University Hospital and Goethe University Frankfurt, Frankfurt, Germany
- Markus List
- Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.
|
28
|
Simoneau J, Gosselin R, Scott MS. Factorial study of the RNA-seq computational workflow identifies biases as technical gene signatures. NAR Genom Bioinform 2021; 2:lqaa043. [PMID: 33575596 PMCID: PMC7671328 DOI: 10.1093/nargab/lqaa043] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2020] [Revised: 05/15/2020] [Accepted: 06/05/2020] [Indexed: 12/12/2022] Open
Abstract
RNA-seq is a modular experimental and computational approach aimed at identifying and quantifying RNA molecules. The modularity of the RNA-seq technology enables adaptation of the protocol to develop new ways to explore RNA biology, but this modularity also brings forth the importance of methodological thoroughness. Liberty of approach comes with the responsibility of choices, and such choices must be informed. Here, we present an approach that identifies gene group-specific quantification biases in current RNA-seq software and references by processing datasets using diverse RNA-seq computational pipelines, and by decomposing these expression datasets with an independent component analysis matrix factorization method. By exploring the RNA-seq pipeline using this systemic approach, we identify genome annotations as a design choice that affects quantification results to the same extent as the choice of aligners and quantifiers. We also show that the different choices in RNA-seq methodology are not independent, identifying interactions between genome annotations and quantification software. Genes were mainly affected by differences in their sequence, by overlapping genes and genes with similar sequence. Our approach offers an explanation for the observed biases by identifying the common features used differently by the software and references, therefore providing leads for the betterment of RNA-seq methodology.
Affiliation(s)
- Joël Simoneau
- Department of Biochemistry and Functional Genomics, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada
- Ryan Gosselin
- Department of Chemical & Biotechnological Engineering, Faculty of Engineering, Université de Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada
- Michelle S Scott
- Department of Biochemistry and Functional Genomics, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada
|
29
|
Sastry AV, Hu A, Heckmann D, Poudel S, Kavvas E, Palsson BO. Independent component analysis recovers consistent regulatory signals from disparate datasets. PLoS Comput Biol 2021; 17:e1008647. [PMID: 33529205 PMCID: PMC7888660 DOI: 10.1371/journal.pcbi.1008647] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2020] [Revised: 02/17/2021] [Accepted: 12/18/2020] [Indexed: 01/03/2023] Open
Abstract
The availability of bacterial transcriptomes has dramatically increased in recent years. This data deluge could result in detailed inference of underlying regulatory networks, but the diversity of experimental platforms and protocols introduces critical biases that could hinder scalable analysis of existing data. Here, we show that the underlying structure of the E. coli transcriptome, as determined by Independent Component Analysis (ICA), is conserved across multiple independent datasets, including both RNA-seq and microarray datasets. We subsequently combined five transcriptomics datasets into a large compendium containing over 800 expression profiles and discovered that its underlying ICA-based structure was still comparable to that of the individual datasets. With this understanding, we expanded our analysis to over 3,000 E. coli expression profiles and predicted three high-impact regulons that respond to oxidative stress, anaerobiosis, and antibiotic treatment. ICA thus enables deep analysis of disparate data to uncover new insights that were not visible in the individual datasets.
Cells adapt to diverse environments by regulating gene expression. Genome-wide measurements of gene expression levels have exponentially increased in recent years, but successful integration and analysis of these datasets are limited. Recently, we showed that independent component analysis (ICA), a signal deconvolution algorithm, can separate a large bacterial gene expression dataset into groups of co-regulated genes. This previous study focused on data generated by a standardized pipeline and did not address whether ICA extracts the same quantitative co-expression signals across expression profiling platforms. In this study, we show that ICA finds similar co-regulation patterns underlying multiple gene expression datasets and can be used as a tool to integrate and interpret diverse datasets. Using a dataset containing over 3,000 expression profiles, we predicted three new regulons and characterized their activities. Since large, standardized expression datasets only exist for a few bacterial strains, these results broaden the possible applications of this tool to better understand transcriptional regulation across a wide range of microbes.
Affiliation(s)
- Anand V. Sastry
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Alyssa Hu
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- David Heckmann
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Saugat Poudel
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Erol Kavvas
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Bernhard O. Palsson
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, Denmark
|
30
|
Nazarov PV, Kreis S. Integrative approaches for analysis of mRNA and microRNA high-throughput data. Comput Struct Biotechnol J 2021; 19:1154-1162. [PMID: 33680358 PMCID: PMC7895676 DOI: 10.1016/j.csbj.2021.01.029] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2020] [Revised: 01/19/2021] [Accepted: 01/20/2021] [Indexed: 12/11/2022] Open
Abstract
- Review on tools and databases linking miRNA and its mRNA targetome.
- Databases show little overlap in miRNA targetome predictions suggesting strong contextual effects.
- Deconvolution and deep learning approaches are promising new approaches to improve miRNA targetome predictions.
Advanced sequencing technologies such as RNASeq provide the means for production of massive amounts of data, including transcriptome-wide expression levels of coding RNAs (mRNAs) and non-coding RNAs such as miRNAs, lncRNAs, piRNAs and many other RNA species. In silico analysis of datasets representing only one RNA species is well established, and a variety of tools and pipelines are available. However, attaining a more systematic view of how different players come together to regulate the expression of a gene or a group of genes requires a more intricate approach to data analysis. To fully understand complex transcriptional networks, datasets representing different RNA species need to be integrated. In this review, we will focus on miRNAs as key post-transcriptional regulators, summarizing current computational approaches for miRNA:target gene prediction as well as new data-driven methods to tackle the problem of comprehensively and accurately dissecting miRNome-targetome interactions.
Key Words
- CCA, canonical correlation analysis
- CDS, coding sequence
- CLASH, cross-linking, ligation and sequencing of hybrids
- CLIP, cross-linking immunoprecipitation
- CNN, convolutional neural network
- Data integration
- GO, gene ontology
- ICA, independent component analysis
- Matrix factorization
- NGS, next-generation sequencing
- NMF, non-negative matrix factorization
- PCA, principal component analysis
- RNASeq, high-throughput RNA sequencing
- TDMD, target RNA-directed miRNA degradation
- TF, transcription factors
- Target prediction
- Transcriptomics
- circRNA, circular RNA
- lncRNA, long non-coding RNA
- mRNA, messenger RNA
- miRNA, microRNA
- microRNA
Affiliation(s)
- Petr V Nazarov
- Multiomics Data Science Research Group, Department of Oncology & Quantitative Biology Unit, Luxembourg Institute of Health (LIH), Strassen L-1445, Luxembourg
- Stephanie Kreis
- Signal Transduction Group, Department of Life Sciences and Medicine, University of Luxembourg, Belvaux L-4367, Luxembourg
|
31
|
Rychel K, Decker K, Sastry AV, Phaneuf PV, Poudel S, Palsson BO. iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning. Nucleic Acids Res 2021; 49:D112-D120. [PMID: 33045728 PMCID: PMC7778901 DOI: 10.1093/nar/gkaa810] [Citation(s) in RCA: 59] [Impact Index Per Article: 19.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2020] [Revised: 09/10/2020] [Accepted: 09/15/2020] [Indexed: 12/15/2022] Open
Abstract
Independent component analysis (ICA) of bacterial transcriptomes has emerged as a powerful tool for obtaining co-regulated, independently-modulated gene sets (iModulons), inferring their activities across a range of conditions, and enabling their association to known genetic regulators. By grouping and analyzing genes based on observations from big data alone, iModulons can provide a novel perspective into how the composition of the transcriptome adapts to environmental conditions. Here, we present iModulonDB (imodulondb.org), a knowledgebase of prokaryotic transcriptional regulation computed from high-quality transcriptomic datasets using ICA. Users select an organism from the home page and then search or browse the curated iModulons that make up its transcriptome. Each iModulon and gene has its own interactive dashboard, featuring plots and tables with clickable, hoverable, and downloadable features. This site enhances research by presenting scientists of all backgrounds with co-expressed gene sets and their activity levels, which lead to improved understanding of regulator-gene relationships, discovery of transcription factors, and the elucidation of unexpected relationships between conditions and genetic regulatory activity. The current release of iModulonDB covers three organisms (Escherichia coli, Staphylococcus aureus and Bacillus subtilis) with 204 iModulons, and can be expanded to cover many additional organisms.
Affiliation(s)
- Kevin Rychel
- Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Katherine Decker
- Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Anand V Sastry
- Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Patrick V Phaneuf
- Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA 92093, USA
- Saugat Poudel
- Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Bernhard O Palsson
- Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
- Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
|
32
|
Abstract
Ewing sarcoma (EwS) is a highly aggressive pediatric bone cancer that is defined by a somatic fusion between the EWSR1 gene and an ETS family member, most frequently the FLI1 gene, leading to expression of a chimeric transcription factor EWSR1-FLI1. Otherwise, EwS is one of the most genetically stable cancers. The situation when the major cancer driver is well known looks like a unique opportunity for applying the systems biology approach in order to understand the EwS mechanisms as well as to uncover some general mechanistic principles of carcinogenesis. A number of studies have been performed revealing the direct and indirect effects of EWSR1-FLI1 on multiple aspects of cellular life. Nevertheless, the emerging picture of the oncogene action appears to be highly complex and systemic, with multiple reciprocal influences between the immediate consequences of the driver mutation and intracellular and intercellular molecular mechanisms, including regulation of transcription, epigenome, and tumoral microenvironment. In this chapter, we present an overview of existing molecular profiling resources available for EwS tumors and cell lines and provide an online comprehensive catalogue of publicly available omics and other datasets. We further highlight the systems biology studies of EwS, involving mathematical modeling of networks and integration of molecular data. We conclude that despite the seeming simplicity, a lot has yet to be understood on the systems-wide mechanisms connecting the driver mutation and the major cellular phenotypes of this pediatric cancer. Overall, this chapter can serve as a guide for a systems biology researcher to start working on EwS.
|
33
|
Machine learning uncovers independently regulated modules in the Bacillus subtilis transcriptome. Nat Commun 2020; 11:6338. [PMID: 33311500 PMCID: PMC7732839 DOI: 10.1038/s41467-020-20153-9] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Accepted: 10/29/2020] [Indexed: 12/24/2022] Open
Abstract
The transcriptional regulatory network (TRN) of Bacillus subtilis coordinates cellular functions of fundamental interest, including metabolism, biofilm formation, and sporulation. Here, we use unsupervised machine learning to modularize the transcriptome and quantitatively describe regulatory activity under diverse conditions, creating an unbiased summary of gene expression. We obtain 83 independently modulated gene sets that explain most of the variance in expression and demonstrate that 76% of them represent the effects of known regulators. The TRN structure and its condition-dependent activity uncover putative or recently discovered roles for at least five regulons, such as a relationship between histidine utilization and quorum sensing. The TRN also facilitates quantification of population-level sporulation states. As this TRN covers the majority of the transcriptome and concisely characterizes the global expression state, it could inform research on nearly every aspect of transcriptional regulation in B. subtilis. The systems-level regulatory structure underlying gene expression in bacteria can be inferred using machine learning algorithms. Here we show this structure for Bacillus subtilis, present five hypotheses gleaned from it, and analyse the process of sporulation from its perspective.
|
34
|
Scherer M, Nazarov PV, Toth R, Sahay S, Kaoma T, Maurer V, Vedeneev N, Plass C, Lengauer T, Walter J, Lutsik P. Reference-free deconvolution, visualization and interpretation of complex DNA methylation data using DecompPipeline, MeDeCom and FactorViz. Nat Protoc 2020; 15:3240-3263. [PMID: 32978601 DOI: 10.1038/s41596-020-0369-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2019] [Accepted: 05/29/2020] [Indexed: 12/13/2022]
Abstract
DNA methylation profiling offers unique insights into human development and diseases. Often the analysis of complex tissues and cell mixtures is the only feasible option to study methylation changes across large patient cohorts. Since DNA methylomes are highly cell type specific, deconvolution methods can be used to recover cell type-specific information in the form of latent methylation components (LMCs) from such 'bulk' samples. Reference-free deconvolution methods retrieve these components without the need for DNA methylation profiles of purified cell types. Currently no integrated and guided procedure is available for data preparation and subsequent interpretation of deconvolution results. Here, we describe a three-stage protocol for reference-free deconvolution of DNA methylation data comprising: (i) data preprocessing, confounder adjustment using independent component analysis (ICA) and feature selection using DecompPipeline, (ii) deconvolution with multiple parameters using MeDeCom, RefFreeCellMix or EDec and (iii) guided biological inference and validation of deconvolution results with the R/Shiny graphical user interface FactorViz. Our protocol simplifies the analysis and guides the initial interpretation of DNA methylation data derived from complex samples. The harmonized approach is particularly useful to dissect and evaluate cell heterogeneity in complex systems such as tumors. We apply the protocol to lung cancer methylomes from The Cancer Genome Atlas (TCGA) and show that our approach identifies the proportions of stromal cells and tumor-infiltrating immune cells, as well as associations of the detected components with clinical parameters. The protocol takes slightly >3 d to complete and requires basic R skills.
Affiliation(s)
- Michael Scherer
- Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany; Computational Biology, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Petr V Nazarov
- Quantitative Biology Unit, Luxembourg Institute of Health, Strassen, Luxembourg
- Reka Toth
- Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany; Division of Thoracic Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Shashwat Sahay
- Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany; Center for Digital Health, Berlin Institute of Health and Charité-Universitätsmedizin Berlin, Berlin, Germany
- Tony Kaoma
- Quantitative Biology Unit, Luxembourg Institute of Health, Strassen, Luxembourg
- Valentin Maurer
- Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Christoph Plass
- Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Thomas Lengauer
- Computational Biology, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Jörn Walter
- Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
- Pavlo Lutsik
- Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
|
35
|
Nicolle R, Blum Y, Duconseil P, Vanbrugghe C, Brandone N, Poizat F, Roques J, Bigonnet M, Gayet O, Rubis M, Elarouci N, Armenoult L, Ayadi M, de Reyniès A, Giovannini M, Grandval P, Garcia S, Canivet C, Cros J, Bournet B, Buscail L, Moutardier V, Gilabert M, Iovanna J, Dusetti N. Establishment of a pancreatic adenocarcinoma molecular gradient (PAMG) that predicts the clinical outcome of pancreatic cancer. EBioMedicine 2020; 57:102858. [PMID: 32629389 PMCID: PMC7334821 DOI: 10.1016/j.ebiom.2020.102858] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2020] [Revised: 06/09/2020] [Accepted: 06/11/2020] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: A significant gap in pancreatic ductal adenocarcinoma (PDAC) patient care is the lack of molecular parameters characterizing tumours and allowing a personalized treatment. METHODS: Patient-derived xenografts (PDX) were obtained from 76 consecutive PDAC and classified according to their histology into five groups. A PDAC molecular gradient (PAMG) was constructed from PDX transcriptomes recapitulating the five histological groups along a continuous gradient. The prognostic and predictive value of PAMG was evaluated in: i/ two independent series (n = 598) of resected tumours; ii/ 60 advanced tumours obtained by diagnostic EUS-guided biopsy needle flushing and iii/ on 28 biopsies from mFOLFIRINOX treated metastatic tumours. FINDINGS: A unique transcriptomic signature (PAMG) was generated with significant and independent prognostic value. PAMG significantly improves the characterization of PDAC heterogeneity compared to non-overlapping classifications as validated in 4 independent series of tumours (e.g. 308 consecutive resected PDAC, uHR=0.321 95% CI [0.207-0.5] and 60 locally-advanced or metastatic PDAC, uHR=0.308 95% CI [0.113-0.836]). The PAMG signature is also associated with progression under mFOLFIRINOX treatment (Pearson correlation to tumour response: -0.67, p-value < 0.001). INTERPRETATION: PAMG unifies all PDAC pre-existing classifications inducing a shift in the actual paradigm of binary classifications towards a better characterization in a gradient. FUNDING: Project funding was provided by INCa (Grant numbers 2018-078 and 2018-079, BACAP BCB INCa_6294), Canceropole PACA, DGOS (labellisation SIRIC), Amidex Foundation, Fondation de France, INSERM and Ligue Contre le Cancer.
Affiliation(s)
- Rémy Nicolle: Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Yuna Blum: Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Pauline Duconseil: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Hôpital Nord, Marseille, France
- Charles Vanbrugghe: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Hôpital Nord, Marseille, France
- Nicolas Brandone: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
- Flora Poizat: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Institut Paoli-Calmettes, Marseille, France
- Julie Roques: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
- Martin Bigonnet: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
- Odile Gayet: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
- Marion Rubis: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
- Nabila Elarouci: Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Lucile Armenoult: Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Mira Ayadi: Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Aurélien de Reyniès: Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Marc Giovannini: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Institut Paoli-Calmettes, Marseille, France
- Philippe Grandval: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Hôpital de la Timone, Marseille, France
- Stephane Garcia: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Hôpital Nord, Marseille, France
- Cindy Canivet: Department of Gastroenterology and Pancreatology, CHU - Rangueil and University of Toulouse, Toulouse, France
- Jérôme Cros: Department of Digestive Oncology, Beaujon Hospital, Paris 7 University, APHP, Clichy, France
- Barbara Bournet: Department of Gastroenterology and Pancreatology, CHU - Rangueil and University of Toulouse, Toulouse, France
- Louis Buscail: Department of Gastroenterology and Pancreatology, CHU - Rangueil and University of Toulouse, Toulouse, France
- Vincent Moutardier: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Hôpital Nord, Marseille, France
- Marine Gilabert: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France; Institut Paoli-Calmettes, Marseille, France
- Juan Iovanna: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
- Nelson Dusetti: Centre de Recherche en Cancérologie de Marseille, CRCM, Inserm, CNRS, Institut Paoli-Calmettes, Aix-Marseille Université, Marseille, France
36
Transcriptional Programs Define Intratumoral Heterogeneity of Ewing Sarcoma at Single-Cell Resolution. Cell Rep 2020; 30:1767-1779.e6. [DOI: 10.1016/j.celrep.2020.01.049] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 04/19/2019] [Revised: 10/07/2019] [Accepted: 01/15/2020] [Indexed: 12/16/2022]
37
Scala G, Federico A, Fortino V, Greco D, Majello B. Knowledge Generation with Rule Induction in Cancer Omics. Int J Mol Sci 2019; 21:E18. [PMID: 31861438 PMCID: PMC6981587 DOI: 10.3390/ijms21010018] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/17/2019] [Revised: 11/26/2019] [Accepted: 12/13/2019] [Indexed: 12/21/2022]
Abstract
The explosion of omics data availability in cancer research has boosted the knowledge of the molecular basis of cancer, although the strategies for its definitive resolution are still not well established. The complexity of cancer biology, given by the high heterogeneity of cancer cells, leads to the development of pharmacoresistance for many patients, hampering the efficacy of therapeutic approaches. Machine learning techniques have been implemented to extract knowledge from cancer omics data in order to address fundamental issues in cancer research, as well as the classification of clinically relevant sub-groups of patients and for the identification of biomarkers for disease risk and prognosis. Rule induction algorithms are a group of pattern discovery approaches that represents discovered relationships in the form of human readable associative rules. The application of such techniques to the modern plethora of collected cancer omics data can effectively boost our understanding of cancer-related mechanisms. In fact, the capability of these methods to extract a huge amount of human readable knowledge will eventually help to uncover unknown relationships between molecular attributes and the malignant phenotype. In this review, we describe applications and strategies for the usage of rule induction approaches in cancer omics data analysis. In particular, we explore the canonical applications and the future challenges and opportunities posed by multi-omics integration problems.
Affiliation(s)
- Giovanni Scala: Department of Biology, University of Naples Federico II, 80126 Naples, Italy
- Antonio Federico: Faculty of Medicine and Health Technology, Tampere University, 33014 Tampere, Finland
- Vittorio Fortino: Institute of Biomedicine, University of Eastern Finland, 70210 Kuopio, Finland
- Dario Greco: Faculty of Medicine and Health Technology, Tampere University, 33014 Tampere, Finland; Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Barbara Majello: Department of Biology, University of Naples Federico II, 80126 Naples, Italy
38
Di Giorgio E, Paluvai H, Picco R, Brancolini C. Genetic Programs Driving Oncogenic Transformation: Lessons from in Vitro Models. Int J Mol Sci 2019; 20:ijms20246283. [PMID: 31842516 PMCID: PMC6940909 DOI: 10.3390/ijms20246283] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/30/2019] [Revised: 12/10/2019] [Accepted: 12/11/2019] [Indexed: 12/11/2022]
Abstract
Cancer complexity relies on the intracellular pleiotropy of oncogenes/tumor suppressors and on the strong interplay between tumors and micro- and macro-environments. Here we followed a reductionist approach, analyzing the transcriptional adaptations induced by three oncogenes (RAS, MYC, and HDAC4) in an isogenic transformation process. Common pathways, rather than common genes, became dysregulated. Our analysis shows that, during the process of transformation, tumor cells cultured in vitro prime some signaling pathways suitable for coping with blood supply restriction, metabolic adaptations, and infiltration of immune cells, and for acquiring the morphological plasticity needed during the metastatic phase. Finally, we identified two signatures of genes commonly regulated by the three oncogenes that successfully predict the outcome of patients affected by different cancer types. These results emphasize that, in spite of the heterogeneous mutational burden among different cancers and even within the same tumor, some common hubs do exist. Their location, at the intersection of the various signaling pathways, makes a therapeutic approach exploitable.