1
Liu X, Duan C, Cai W, Shao X. Unmixing Autoencoder for Image Reconstruction from Hyperspectral Data. Anal Chem 2024. [PMID: 39690477 DOI: 10.1021/acs.analchem.4c02720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2024]
Abstract
Due to the complexity of samples and the limitations in spatial resolution, the spectra in hyperspectral imaging (HSI) are generally contributed to by multiple components, making univariate analysis ineffective. Although feature extraction methods have been applied, the chemical meaning of the compressed variables is difficult to interpret, limiting their further applications. An unmixing autoencoder (UAE) was developed in this work for the separation of the mixed spectra in HSI. The proposed model is composed of an encoder and a fully connected (FC) layer. The former is used to compress the input spectrum into several variables, and the latter is employed to reconstruct the spectrum. Combining reconstruction loss and sparse regularization, the weights and the spectral profiles of the components will be encoded in the compressed variables and the connection weights of FC, respectively. A simulated and three experimental HSI data sets were adopted to investigate the performance of the UAE model. The spectral components were successfully obtained, from which the handwriting under papers was revealed from the image of near-infrared (NIR) diffusive reflectance spectroscopy, and the images of lipids, proteins, and nucleic acids were reconstructed from the Raman and stimulated Raman scattering (SRS) images.
Affiliation(s)
- Xuyang Liu
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Chaoshu Duan
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Wensheng Cai
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Xueguang Shao
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
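The architecture described in the abstract of entry 1 (an encoder that compresses each spectrum into a few variables, followed by one fully connected layer that reconstructs it, trained with a reconstruction loss plus a sparsity term) can be sketched roughly as follows. This is only an illustration and not the authors' implementation: the single-layer ReLU encoder, the bias-free decoder, the L1 form of the sparse regularization, and the names `uae_loss`, `enc_w`, `enc_b`, `dec_w`, and `l1_weight` are all assumptions made for this sketch.

```python
import numpy as np

def uae_loss(spectra, enc_w, enc_b, dec_w, l1_weight=1e-3):
    """Loss of a toy unmixing autoencoder (hypothetical sketch).

    spectra: (n_pixels, n_channels) input spectra
    enc_w:   (n_channels, n_components) encoder weights (assumed single layer)
    enc_b:   (n_components,) encoder bias
    dec_w:   (n_components, n_channels) FC-layer weights; each row plays the
             role of one component's spectral profile
    """
    # Encoder: compress each spectrum into a few non-negative variables.
    # The ReLU is an assumption, chosen so the variables read as weights.
    abundances = np.maximum(spectra @ enc_w + enc_b, 0.0)
    # Decoder: one bias-free fully connected layer, i.e. a linear mixture.
    reconstruction = abundances @ dec_w
    # Reconstruction loss (mean squared error).
    mse = np.mean((spectra - reconstruction) ** 2)
    # Sparse regularization, assumed here to be an L1 penalty on the variables.
    sparsity = l1_weight * np.mean(np.abs(abundances))
    return mse + sparsity, abundances, reconstruction
```

Because the decoder is linear, its weight rows can be read directly as component spectra once training has converged, which is the interpretability argument the abstract makes. Training itself (gradient updates of `enc_w`, `enc_b`, and `dec_w`) is omitted here.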
2
Sun Y, Kong L, Huang J, Deng H, Bian X, Li X, Cui F, Dou L, Cao C, Zou Q, Zhang Z. A comprehensive survey of dimensionality reduction and clustering methods for single-cell and spatial transcriptomics data. Brief Funct Genomics 2024; 23:733-744. [PMID: 38860675 DOI: 10.1093/bfgp/elae023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Revised: 02/29/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024] Open
Abstract
In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.
Affiliation(s)
- Yidi Sun
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lingling Kong
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Jiayi Huang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Hongyan Deng
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Xinling Bian
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Xingfeng Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lijun Dou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, United States
- Chen Cao
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 210029, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
3
Liu X, Wang H, Gao J. scIALM: A method for sparse scRNA-seq expression matrix imputation using the Inexact Augmented Lagrange Multiplier with low error. Comput Struct Biotechnol J 2024; 23:549-558. [PMID: 38274995 PMCID: PMC10809077 DOI: 10.1016/j.csbj.2023.12.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 12/21/2023] [Accepted: 12/22/2023] [Indexed: 01/27/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a high-throughput sequencing technology that quantifies gene expression profiles of specific cell populations at the single-cell level, providing a foundation for studying cellular heterogeneity and patient pathological characteristics. It is effective for developmental, fertility, and disease studies. However, the cell-gene expression matrix of single-cell sequencing data is often sparse and contains numerous zero values. Some of the zero values derive from noise, where dropout noise has a large impact on downstream analysis. In this paper, we propose a method named scIALM for imputation recovery of sparse single-cell RNA data expression matrices, which employs the Inexact Augmented Lagrange Multiplier method to use sparse but clean (accurate) data to recover unknown entries in the matrix. We perform experimental analysis on four datasets, calling the expression matrix after Quality Control (QC) as the original matrix, and comparing the performance of scIALM with six other methods using mean squared error (MSE), mean absolute error (MAE), Pearson correlation coefficient (PCC), and cosine similarity (CS). Our results demonstrate that scIALM accurately recovers the original data of the matrix with an error of 10e-4, and the mean value of the four metrics reaches 4.5072 (MSE), 0.765 (MAE), 0.8701 (PCC), 0.8896 (CS). In addition, at 10%-50% random masking noise, scIALM is the least sensitive to the masking ratio. For downstream analysis, this study uses adjusted rand index (ARI) and normalized mutual information (NMI) to evaluate the clustering effect, and the results are improved on three datasets containing real cluster labels.
Affiliation(s)
- Xiaohong Liu
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Han Wang
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Jingyang Gao
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
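The four comparison metrics named in the abstract of entry 3 (MSE, MAE, PCC, and cosine similarity between the original and the imputed expression matrix) follow their standard definitions and could be computed as in the sketch below. The function name `imputation_metrics` is hypothetical, and flattening the whole matrix before comparison is an assumption; the paper may instead aggregate per cell or per gene.

```python
import numpy as np

def imputation_metrics(original, imputed):
    """Compare an imputed matrix against the original (hypothetical helper)."""
    # Flatten both matrices so every entry contributes equally (assumption).
    x = np.asarray(original, dtype=float).ravel()
    y = np.asarray(imputed, dtype=float).ravel()
    diff = x - y
    return {
        "MSE": float(np.mean(diff ** 2)),        # mean squared error
        "MAE": float(np.mean(np.abs(diff))),     # mean absolute error
        "PCC": float(np.corrcoef(x, y)[0, 1]),   # Pearson correlation coefficient
        # Cosine similarity: dot product over the product of Euclidean norms.
        "CS": float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))),
    }
```

Lower MSE and MAE and higher PCC and CS indicate a closer recovery. A masking experiment like the one the abstract mentions would zero out a random fraction of entries, impute them, and pass the unmasked matrix as `original`.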
4
Shi M, Li X. Addressing scalability and managing sparsity and dropout events in single-cell representation identification with ZIGACL. Brief Bioinform 2024; 26:bbae703. [PMID: 39775477 PMCID: PMC11705091 DOI: 10.1093/bib/bbae703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Revised: 11/06/2024] [Accepted: 12/23/2024] [Indexed: 01/11/2025] Open
Abstract
Despite significant advancements in single-cell representation learning, scalability and managing sparsity and dropout events continue to challenge the field as scRNA-seq datasets expand. While current computational tools struggle to maintain both efficiency and accuracy, the accurate connection of these dropout events to specific biological functions usually requires additional, complex experiments, often hampered by potential inaccuracies in cell-type annotation. To tackle these challenges, the Zero-Inflated Graph Attention Collaborative Learning (ZIGACL) method has been developed. This innovative approach combines a Zero-Inflated Negative Binomial model with a Graph Attention Network, leveraging mutual information from neighboring cells to enhance dimensionality reduction and apply dynamic adjustments to the learning process through a co-supervised deep graph clustering model. ZIGACL's integration of denoising and topological embedding significantly improves clustering accuracy and ensures similar cells are grouped closely in the latent space. Comparative analyses across nine real scRNA-seq datasets have shown that ZIGACL significantly enhances single-cell data analysis by offering superior clustering performance and improved stability in cell representations, effectively addressing scalability and managing sparsity and dropout events, thereby advancing our understanding of cellular heterogeneity.
Affiliation(s)
- Mingguang Shi
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui, China
- Xuefeng Li
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui, China
5
Zhang J, Ren H, Jiang Z, Chen Z, Yang Z, Matsubara Y, Sakurai Y. Strategic Multi-Omics Data Integration via Multi-Level Feature Contrasting and Matching. IEEE Trans Nanobioscience 2024; 23:579-590. [PMID: 39255078 DOI: 10.1109/tnb.2024.3456797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2024]
Abstract
The analysis and comprehension of multi-omics data has emerged as a prominent topic in the field of bioinformatics and data science. However, the sparsity characteristics and high dimensionality of omics data pose difficulties in terms of extracting meaningful information. Moreover, the heterogeneity inherent in multiple omics sources makes the effective integration of multi-omics data challenging. To tackle these challenges, we propose MFCC-SAtt, a multi-level feature contrast clustering model based on self-attention to extract informative features from multi-omics data. MFCC-SAtt treats each omics type as a distinct modality and employs autoencoders with self-attention for each modality to integrate and compress their respective features into a shared feature space. By utilizing a multi-level feature extraction framework along with incorporating a semantic information extractor, we mitigate optimization conflicts arising from different learning objectives. Additionally, MFCC-SAtt guides deep clustering based on multi-level features which further enhances the quality of output labels. By conducting extensive experiments on multi-omics data, we have validated the exceptional performance of MFCC-SAtt. For instance, in a pan-cancer clustering task, MFCC-SAtt achieved an accuracy of over 80.38%.
6
Cheng Y, Xu SM, Santucci K, Lindner G, Janitz M. Machine learning and related approaches in transcriptomics. Biochem Biophys Res Commun 2024; 724:150225. [PMID: 38852503 DOI: 10.1016/j.bbrc.2024.150225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2024] [Revised: 05/18/2024] [Accepted: 06/03/2024] [Indexed: 06/11/2024]
Abstract
Data acquisition for transcriptomic studies used to be the bottleneck in the transcriptomic analytical pipeline. However, recent developments in transcriptome profiling technologies have increased researchers' ability to obtain data, resulting in a shift in focus to data analysis. Incorporating machine learning to traditional analytical methods allows the possibility of handling larger volumes of complex data more efficiently. Many bioinformaticians, especially those unfamiliar with ML in the study of human transcriptomics and complex biological systems, face a significant barrier stemming from their limited awareness of the current landscape of ML utilisation in this field. To address this gap, this review endeavours to introduce those individuals to the general types of ML, followed by a comprehensive range of more specific techniques, demonstrated through examples of their incorporation into analytical pipelines for human transcriptome investigations. Important computational aspects such as data pre-processing, task formulation, results (performance of ML models), and validation methods are encompassed. In hope of better practical relevance, there is a strong focus on studies published within the last five years, almost exclusively examining human transcriptomes, with outcomes compared with standard non-ML tools.
Affiliation(s)
- Yuning Cheng
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Si-Mei Xu
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Kristina Santucci
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Grace Lindner
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Michael Janitz
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia.
7
Marghi Y, Gala R, Baftizadeh F, Sümbül U. Joint inference of discrete cell types and continuous type-specific variability in single-cell datasets with MMIDAS. Nat Comput Sci 2024; 4:706-722. [PMID: 39317764 DOI: 10.1038/s43588-024-00683-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 08/06/2024] [Indexed: 09/26/2024]
Abstract
Reproducible definition and identification of cell types is essential to enable investigations into their biological function and to understand their relevance in the context of development, disease and evolution. Current approaches model variability in data as continuous latent factors, followed by clustering as a separate step, or immediately apply clustering on the data. We show that such approaches can suffer from qualitative mistakes in identifying cell types robustly, particularly when the number of such cell types is in the hundreds or even thousands. Here we propose an unsupervised method, Mixture Model Inference with Discrete-coupled AutoencoderS (MMIDAS), which combines a generalized mixture model with a multi-armed deep neural network to jointly infer the discrete type and continuous type-specific variability. Using four recent datasets of brain cells spanning different technologies, species and conditions, we demonstrate that MMIDAS can identify reproducible cell types and infer cell type-dependent continuous variability in both unimodal and multimodal datasets.
Affiliation(s)
- Uygar Sümbül
- Allen Institute, Seattle, WA, USA.
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
8
Zhang J, Larschan E, Bigness J, Singh R. scNODE: generative model for temporal single cell transcriptomic data prediction. Bioinformatics 2024; 40:ii146-ii154. [PMID: 39230694 PMCID: PMC11373355 DOI: 10.1093/bioinformatics/btae393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
SUMMARY: Measurement of single-cell gene expression at different timepoints enables the study of cell development. However, due to the resource constraints and technical challenges associated with the single-cell experiments, researchers can only profile gene expression at discrete and sparsely sampled timepoints. This missing timepoint information impedes downstream cell developmental analyses. We propose scNODE, an end-to-end deep learning model that can predict in silico single-cell gene expression at unobserved timepoints. scNODE integrates a variational autoencoder with neural ordinary differential equations to predict gene expression using a continuous and nonlinear latent space. Importantly, we incorporate a dynamic regularization term to learn a latent space that is robust against distribution shifts when predicting single-cell gene expression at unobserved timepoints. Our evaluations on three real-world scRNA-seq datasets show that scNODE achieves higher predictive performance than state-of-the-art methods. We further demonstrate that scNODE's predictions help cell trajectory inference under the missing timepoint paradigm and the learned latent space is useful for in silico perturbation analysis of relevant genes along a developmental cell path. AVAILABILITY AND IMPLEMENTATION: The data and code are publicly available at https://github.com/rsinghlab/scNODE.
Affiliation(s)
- Jiaqi Zhang
- Department of Computer Science, Brown University, Providence, RI 02906, United States
- Erica Larschan
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI 02912, United States
- Jeremy Bigness
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
- Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI 02906, United States
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
9
Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. 
Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
Affiliation(s)
- Pritam Kundu
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Satyajit Beura
- Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Suman Mondal
- P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Kumar Das
- Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Amit Ghosh
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
10
Xu B, Braun R. Variational inference of single cell time series. bioRxiv 2024:2024.08.29.610389. [PMID: 39257806 PMCID: PMC11384007 DOI: 10.1101/2024.08.29.610389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2024]
Abstract
Time course single-cell RNA sequencing (scRNA-seq) enables researchers to probe genome-wide expression dynamics at the single cell scale. However, when gene expression is affected jointly by time and cellular identity, analyzing such data - including conducting cell type annotation and modeling cell type-dependent dynamics - becomes challenging. To address this problem, we propose SNOW (SiNgle cell flOW map), a deep learning algorithm to deconvolve single cell time series data into time-dependent and time-independent contributions. SNOW has a number of advantages. First, it enables cell type annotation based on the time-independent dimensions. Second, it yields a probabilistic model that can be used to discriminate between biological temporal variation and batch effects contaminating individual timepoints, and provides an approach to mitigate batch effects. Finally, it is capable of projecting cells forward and backward in time, yielding time series at the individual cell level. This enables gene expression dynamics to be studied without the need for clustering or pseudobulking, which can be error prone and result in information loss. We describe our probabilistic framework in detail and demonstrate SNOW using data from three distinct time course scRNA-seq studies. Our results show that SNOW is able to construct biologically meaningful latent spaces, remove batch effects, and generate realistic time-series at the single-cell level. By way of example, we illustrate how the latter may be used to enhance the detection of cell type-specific circadian gene expression rhythms, and may be readily extended to other time-series analyses.
Affiliation(s)
- Bingxian Xu
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- NSF-Simons National Institute for Theory and Mathematics in Biology, Chicago, IL 60611, USA
- Rosemary Braun
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- NSF-Simons National Institute for Theory and Mathematics in Biology, Chicago, IL 60611, USA
- Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60208, USA
- Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, USA
- Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL 60208, USA
- Santa Fe Institute, Santa Fe, NM 87501, USA
11
Zhou Y, Li H, Tse E, Sun H. Metal-detection based techniques and their applications in metallobiology. Chem Sci 2024; 15:10264-10280. [PMID: 38994399 PMCID: PMC11234822 DOI: 10.1039/d4sc00108g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 06/05/2024] [Indexed: 07/13/2024] Open
Abstract
Metals are essential for human health and play a crucial role in numerous biological processes and pathways. Gaining a deeper insight into these biological events will facilitate novel strategies for disease prevention, early detection, and personalized treatment. In recent years, there has been significant progress in the development of metal-detection based techniques from single cell metallome and proteome profiling to multiplex imaging, which greatly enhance our comprehension of the intricate roles played by metals in complex biological systems. This perspective summarizes the recent progress in advanced metal-detection based techniques and highlights successful applications in elucidating the roles of metals in biology and medicine. Technologies including machine learning that couple with single-cell analysis such as mass cytometry and their application in metallobiology, cancer biology and immunology are also emphasized. Finally, we provide insights into future prospects and challenges involved in metal-detection based techniques, with the aim of inspiring further methodological advancements and applications that are accessible to chemists, biologists, and clinicians.
Affiliation(s)
- Ying Zhou
- Department of Chemistry, CAS-HKU Joint Laboratory of Metallomics for Health and Environment, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, P. R. China
- Hongyan Li
- Department of Chemistry, CAS-HKU Joint Laboratory of Metallomics for Health and Environment, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, P. R. China
- Eric Tse
- Department of Medicine, LKS Faculty of Medicine, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, P. R. China
- Hongzhe Sun
- Department of Chemistry, CAS-HKU Joint Laboratory of Metallomics for Health and Environment, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, P. R. China
12
Liu B, Rosenhahn B, Illig T, DeLuca DS. A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data. PLoS Comput Biol 2024; 20:e1011198. [PMID: 38959284 PMCID: PMC11251626 DOI: 10.1371/journal.pcbi.1011198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 07/16/2024] [Accepted: 06/11/2024] [Indexed: 07/05/2024] Open
Abstract
Interpreting transcriptome data is an important yet challenging aspect of bioinformatic analysis. While gene set enrichment analysis is a standard tool for interpreting regulatory changes, we utilize deep learning techniques, specifically autoencoder architectures, to learn latent variables that drive transcriptome signals. We investigate whether simple, variational autoencoder (VAE), and beta-weighted VAE are capable of learning reduced representations of transcriptomes that retain critical biological information. We propose a novel VAE that utilizes priors from biological data to direct the network to learn a representation of the transcriptome that is based on understandable biological concepts. After benchmarking five different autoencoder architectures, we found that each succeeded in reducing the transcriptomes to 50 latent dimensions, which captured enough variation for accurate reconstruction. The simple, fully connected autoencoder, performs best across the benchmarks, but lacks the characteristic of having directly interpretable latent dimensions. The beta-weighted, prior-informed VAE implementation is able to solve the benchmarking tasks, and provide semantically accurate latent features equating to biological pathways. This study opens a new direction for differential pathway analysis in transcriptomics with increased transparency and interpretability.
Affiliation(s)
- Bin Liu
- Hannover Medical School, Biomedical Research in Endstage and Obstructive Lung Disease Hannover (BREATH), German Center for Lung Research, Hannover, Lower Saxony, Germany
- Bodo Rosenhahn
- Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Hannover, Lower Saxony, Germany
- Thomas Illig
- Hannover Medical School, Biomedical Research in Endstage and Obstructive Lung Disease Hannover (BREATH), German Center for Lung Research, Hannover, Lower Saxony, Germany
- Hannover Unified Biobank, Hannover Medical School, Hannover, Lower Saxony, Germany
- David S. DeLuca
- Hannover Medical School, Biomedical Research in Endstage and Obstructive Lung Disease Hannover (BREATH), German Center for Lung Research, Hannover, Lower Saxony, Germany
13
Wagle MM, Long S, Chen C, Liu C, Yang P. Interpretable deep learning in single-cell omics. Bioinformatics 2024; 40:btae374. [PMID: 38889275 PMCID: PMC11211213 DOI: 10.1093/bioinformatics/btae374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2024] [Revised: 05/11/2024] [Accepted: 06/12/2024] [Indexed: 06/20/2024] Open
Abstract
MOTIVATION: Single-cell omics technologies have enabled the quantification of molecular profiles in individual cells at an unparalleled resolution. Deep learning, a rapidly evolving sub-field of machine learning, has instilled a significant interest in single-cell omics research due to its remarkable success in analysing heterogeneous high-dimensional single-cell omics data. Nevertheless, the inherent multi-layer nonlinear architecture of deep learning models often makes them 'black boxes' as the reasoning behind predictions is often unknown and not transparent to the user. This has stimulated an increasing body of research for addressing the lack of interpretability in deep learning models, especially in single-cell omics data analyses, where the identification and understanding of molecular regulators are crucial for interpreting model predictions and directing downstream experimental validations. RESULTS: In this work, we introduce the basics of single-cell omics technologies and the concept of interpretable deep learning. This is followed by a review of the recent interpretable deep learning models applied to various single-cell omics research. Lastly, we highlight the current limitations and discuss potential future directions.
Affiliation(s)
- Manoj M Wagle
- Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Siqu Long
- Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Carissa Chen
- Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Chunlei Liu
- Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Pengyi Yang
- Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Camperdown, NSW 2006, Australia
| |
|
14
|
Cottrell S, Hozumi Y, Wei GW. K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis. Comput Biol Med 2024; 175:108497. [PMID: 38678944 PMCID: PMC11090715 DOI: 10.1016/j.compbiomed.2024.108497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 04/08/2024] [Accepted: 04/21/2024] [Indexed: 05/01/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements to UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements to UMAP, tSNE, and NMF, respectively, on clustering in the ARI metric.
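Illustrative sketch (not taken from the paper): the abstract describes a filtration obtained by increasing the number of neighbors in a kNN network. The Python below shows only that one idea, building a nested sequence of symmetrized kNN graphs and their ordinary graph Laplacians. The function names `knn_laplacian` and `knn_filtration` and the toy points are hypothetical, and the persistent Laplacian and the tPCA regularization themselves are not reproduced here.

```python
import math

def knn_laplacian(points, k):
    """Unnormalized graph Laplacian L = D - A of the symmetrized kNN graph."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank every other point by Euclidean distance to point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            # Symmetrize: keep an edge if either endpoint lists the other.
            adj[i][j] = adj[j][i] = 1
    return [[sum(adj[i]) if i == j else -adj[i][j] for j in range(n)]
            for i in range(n)]

def knn_filtration(points, ks):
    """One Laplacian per neighbor count; edges only accumulate as k grows."""
    return {k: knn_laplacian(points, k) for k in ks}

# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
laplacians = knn_filtration(points, ks=[1, 2, 3])

# The trace of L is the sum of degrees, i.e. twice the number of edges.
edge_counts = {k: sum(L[i][i] for i in range(len(points))) // 2
               for k, L in laplacians.items()}
```

Because the k nearest neighbors of a point are a prefix of its k+1 nearest neighbors, each graph in the sequence contains the previous one, which is the nesting a filtration requires.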
Affiliation(s)
- Sean Cottrell
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA.
| |
|
15
|
Chen H, Lu Y, Dai Z, Yang Y, Li Q, Rao Y. Comprehensive single-cell RNA-seq analysis using deep interpretable generative modeling guided by biological hierarchy knowledge. Brief Bioinform 2024; 25:bbae314. [PMID: 38960404 PMCID: PMC11221887 DOI: 10.1093/bib/bbae314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 12/13/2023] [Accepted: 06/20/2024] [Indexed: 07/05/2024] Open
Abstract
Recent advances in microfluidics and sequencing technologies allow researchers to explore cellular heterogeneity at single-cell resolution. In recent years, deep learning frameworks, such as generative models, have brought great changes to the analysis of transcriptomic data. Nevertheless, relying on the potential space of these generative models alone is insufficient to generate biological explanations. In addition, most of the previous work based on generative models is limited to shallow neural networks with one to three layers of latent variables, which may limit the capabilities of the models. Here, we propose a deep interpretable generative model called d-scIGM for single-cell data analysis. d-scIGM combines sawtooth connectivity techniques and residual networks, thereby constructing a deep generative framework. In addition, d-scIGM incorporates hierarchical prior knowledge of biological domains to enhance the interpretability of the model. We show that d-scIGM achieves excellent performance in a variety of fundamental tasks, including clustering, visualization, and pseudo-temporal inference. Through topic pathway studies, we found that d-scIGM-learned topics are better enriched for biologically meaningful pathways compared to the baseline models. Furthermore, the analysis of drug response data shows that d-scIGM can capture drug response patterns in large-scale experiments, which provides a promising way to elucidate the underlying biological mechanisms. Lastly, in the melanoma dataset, d-scIGM accurately identified different cell types and revealed multiple melanin-related driver genes and key pathways, which are critical for understanding disease mechanisms and drug development.
Affiliation(s)
- Hegang Chen
- School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou University Town, 510006, Guangzhou, China
| | - Yuyin Lu
- School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou University Town, 510006, Guangzhou, China
| | - Zhiming Dai
- School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou University Town, 510006, Guangzhou, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou University Town, 510006, Guangzhou, China
| | - Qing Li
- Department of Computing, The Hong Kong Polytechnic University, PQ806, Mong Man Wai Building, 999077, Hong Kong SAR
| | - Yanghui Rao
- School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou University Town, 510006, Guangzhou, China
| |
|
16
|
Kojima Y, Mii S, Hayashi S, Hirose H, Ishikawa M, Akiyama M, Enomoto A, Shimamura T. Single-cell colocalization analysis using a deep generative model. Cell Syst 2024; 15:180-192.e7. [PMID: 38387441 DOI: 10.1016/j.cels.2024.01.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2022] [Revised: 03/06/2023] [Accepted: 01/23/2024] [Indexed: 02/24/2024]
Abstract
Analyzing colocalization of single cells with heterogeneous molecular phenotypes is essential for understanding cell-cell interactions, and cellular responses to external stimuli and their biological functions in diseases and tissues. However, existing computational methodologies identified the colocalization patterns between predefined cell populations, which can obscure the molecular signatures arising from intercellular communication. Here, we introduce DeepCOLOR, a computational framework based on a deep generative model that recovers intercellular colocalization networks with single-cell resolution by the integration of single-cell and spatial transcriptomes. Along with colocalized population detection accuracy that is superior to existing methods in simulated dataset, DeepCOLOR identified plausible cell-cell interaction candidates between colocalized single cells and segregated cell populations defined by the colocalization relationships in mouse brain tissues, human squamous cell carcinoma samples, and human lung tissues infected with SARS-CoV-2. DeepCOLOR is applicable to studying cell-cell interactions behind various spatial niches. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Yasuhiro Kojima
- Laboratory of Computational Life Science, National Cancer Center Research Institute, Chuo-ku, Tokyo 104-0045, Japan; Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-0034, Japan; Division of Systems Biology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan.
| | - Shinji Mii
- Department of Pathology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan
| | - Shuto Hayashi
- Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-0034, Japan; Division of Systems Biology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan
| | - Haruka Hirose
- Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-0034, Japan; Division of Systems Biology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan
| | - Masato Ishikawa
- Institute for Life and Medical Sciences, Kyoto University, Kyoto, Kyoto 606-8507, Japan; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
| | - Masashi Akiyama
- Department of Dermatology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan
| | - Atsushi Enomoto
- Department of Pathology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan
| | - Teppei Shimamura
- Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-0034, Japan; Division of Systems Biology, Nagoya University Graduate School of Medicine, Nagoya, Aichi 466-8550, Japan.
| |
|
17
|
Liu X, Xing J, Fu H, Shao X, Cai W. Analyzing Molecular Dynamics Trajectories Thermodynamically through Artificial Intelligence. J Chem Theory Comput 2024; 20:665-676. [PMID: 38193858 DOI: 10.1021/acs.jctc.3c00975] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 01/10/2024]
Abstract
Molecular dynamics simulations produce trajectories that correspond to vast amounts of structure when exploring biochemical processes. Extracting valuable information, e.g., important intermediate states and collective variables (CVs) that describe the major movement modes, from molecular trajectories to understand the underlying mechanisms of biological processes presents a significant challenge. To achieve this goal, we introduce a deep learning approach, coined DIKI (deep identification of key intermediates), to determine low-dimensional CVs distinguishing key intermediate conformations without a-priori assumptions. DIKI dynamically plans the distribution of latent space and groups together similar conformations within the same cluster. Moreover, by incorporating two user-defined parameters, namely, coarse focus knob and fine focus knob, to help identify conformations with low free energy and differentiate the subtle distinctions among these conformations, resolution-tunable clustering was achieved. Furthermore, the integration of DIKI with a path-finding algorithm contributes to the identification of crucial intermediates along the lowest free-energy pathway. We postulate that DIKI is a robust and flexible tool that can find widespread applications in the analysis of complex biochemical processes.
Affiliation(s)
- Xuyang Liu
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
| | - Jingya Xing
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
| | - Haohao Fu
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
| | - Xueguang Shao
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
| | - Wensheng Cai
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
| |
|
18
|
Ravaee H, Manshaei MH, Safayani M, Sartakhti JS. Intelligent phenotype-detection and gene expression profile generation with generative adversarial networks. J Theor Biol 2024; 577:111636. [PMID: 37944593 DOI: 10.1016/j.jtbi.2023.111636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2023] [Revised: 08/11/2023] [Accepted: 10/05/2023] [Indexed: 11/12/2023]
Abstract
Gene expression analysis is valuable for cancer type classification and identifying diverse cancer phenotypes. The latest high-throughput RNA sequencing devices have enabled access to large volumes of gene expression data. However, we face several challenges, such as data security and privacy, when we develop machine learning-based classifiers for categorizing cancer types with these datasets. To address these issues, we propose IP3G (Intelligent Phenotype-detection and Gene expression profile Generation with Generative adversarial network), a model based on Generative Adversarial Networks. IP3G tackles two major problems: augmenting gene expression data and unsupervised phenotype discovery. By converting gene expression profiles into 2-Dimensional images and leveraging IP3G, we generate new profiles for specific phenotypes. IP3G learns disentangled representations of gene expression patterns and identifies phenotypes without labeled data. We improve the objective function of the GAN used in IP3G by employing the earth mover distance and a novel mutual information function. IP3G outperforms clustering methods like k-Means, DBSCAN, and GMM in unsupervised phenotype discovery, while also surpassing SVM and CNN classification accuracy by up to 6% through gene expression profile augmentation. The source code for the developed IP3G is accessible to the public on GitHub.
Affiliation(s)
- Hamid Ravaee
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran
| | - Mohammad Hossein Manshaei
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran.
| | - Mehran Safayani
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran
| |
|
19
|
Dong S, Liu Y, Gong Y, Dong X, Zeng X. scCAN: Clustering With Adaptive Neighbor-Based Imputation Method for Single-Cell RNA-Seq Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:95-105. [PMID: 38285569 DOI: 10.1109/tcbb.2023.3337231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to study cellular heterogeneity in different samples. However, due to technical deficiencies, dropout events often result in zero gene expression values in the gene expression matrix. In this paper, we propose a new imputation method called scCAN, based on adaptive neighborhood clustering, to estimate the zero values of dropouts. Our method continuously updates cell-cell similarity information by simultaneously learning similarity relationships, clustering structures, and imposing new rank constraints on the Laplacian matrix of the similarity matrix, improving the imputation of dropout zero values. To evaluate the performance of this method, we used four simulated and eight real scRNA-seq datasets for downstream analyses, including cell clustering, recovered gene expression, and reconstructed cell trajectories. Our method improves the performance of the downstream analysis and is better than other imputation methods.
|
20
|
Aragones DG, Palomino-Segura M, Sicilia J, Crainiciuc G, Ballesteros I, Sánchez-Cabo F, Hidalgo A, Calvo GF. Variable selection for nonlinear dimensionality reduction of biological datasets through bootstrapping of correlation networks. Comput Biol Med 2024; 168:107827. [PMID: 38086138 DOI: 10.1016/j.compbiomed.2023.107827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Revised: 11/15/2023] [Accepted: 12/04/2023] [Indexed: 01/10/2024]
Abstract
Identifying the most relevant variables or features in massive datasets for dimensionality reduction can lead to improved and more informative display, faster computation times, and more explainable models of complex systems. Despite significant advances and available algorithms, this task generally remains challenging, especially in unsupervised settings. In this work, we propose a method that constructs correlation networks using all intervening variables and then selects the most informative ones based on network bootstrapping. The method can be applied in both supervised and unsupervised scenarios. We demonstrate its functionality by applying Uniform Manifold Approximation and Projection for dimensionality reduction to several high-dimensional biological datasets, derived from 4D live imaging recordings of hundreds of morpho-kinetic variables, describing the dynamics of thousands of individual leukocytes at sites of prominent inflammation. We compare our method with other standard ones in the field, such as Principal Component Analysis and Elastic Net, showing that it outperforms them. The proposed method can be employed in a wide range of applications, encompassing data analysis and machine learning.
Affiliation(s)
- David G Aragones
- Department of Mathematics & MOLAB-Mathematical Oncology Laboratory, Universidad de Castilla-La Mancha, Ciudad Real, Spain
| | - Miguel Palomino-Segura
- Area of Cell and Developmental Biology, Centro Nacional de Investigaciones Cardiovasculares Carlos III, Madrid, Spain; Immunophysiology Research Group, Instituto Universitario de Investigación Biosanitaria de Extremadura (INUBE), Badajoz, Spain; Department of Physiology, Faculty of Sciences, University of Extremadura, Badajoz, Spain
| | - Jon Sicilia
- Area of Cell and Developmental Biology, Centro Nacional de Investigaciones Cardiovasculares Carlos III, Madrid, Spain
| | - Georgiana Crainiciuc
- Area of Cell and Developmental Biology, Centro Nacional de Investigaciones Cardiovasculares Carlos III, Madrid, Spain
| | - Iván Ballesteros
- Area of Cell and Developmental Biology, Centro Nacional de Investigaciones Cardiovasculares Carlos III, Madrid, Spain
| | - Fátima Sánchez-Cabo
- Bioinformatics Unit, Centro Nacional de Investigaciones Cardiovasculares Carlos III, Madrid, Spain
| | - Andrés Hidalgo
- Vascular Biology and Therapeutics Program and Department of Immunobiology, Yale University School of Medicine, New Haven, CT, USA
| | - Gabriel F Calvo
- Department of Mathematics & MOLAB-Mathematical Oncology Laboratory, Universidad de Castilla-La Mancha, Ciudad Real, Spain.
| |
|
21
|
Zhu B, Wang Y, Ku LT, van Dijk D, Zhang L, Hafler DA, Zhao H. scNAT: a deep learning method for integrating paired single-cell RNA and T cell receptor sequencing profiles. Genome Biol 2023; 24:292. [PMID: 38111007 PMCID: PMC10726524 DOI: 10.1186/s13059-023-03129-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 11/27/2023] [Indexed: 12/20/2023] Open
Abstract
Many deep learning-based methods have been proposed to handle complex single-cell data. Deep learning approaches may also prove useful to jointly analyze single-cell RNA sequencing (scRNA-seq) and single-cell T cell receptor sequencing (scTCR-seq) data for novel discoveries. We developed scNAT, a deep learning method that integrates paired scRNA-seq and scTCR-seq data to represent data in a unified latent space for downstream analysis. We demonstrate that scNAT is capable of removing batch effects, and identifying cell clusters and a T cell migration trajectory from blood to cerebrospinal fluid in multiple sclerosis.
Affiliation(s)
- Biqing Zhu
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06511, USA
- Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD, 20815, USA
| | - Yuge Wang
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, 06511, USA
| | - Li-Ting Ku
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, 06511, USA
| | - David van Dijk
- Department of Internal Medicine, Yale School of Medicine, New Haven, CT, 06511, USA
- Department of Computer Science, Yale University, New Haven, CT, 06511, USA
- Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD, 20815, USA
| | - Le Zhang
- Department of Neuroscience, School of Medicine, Yale University, New Haven, CT, 06511, USA
- Department of Immunobiology, School of Medicine, Yale University, New Haven, CT, 06511, USA
- Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD, 20815, USA
| | - David A Hafler
- Department of Neurology, School of Medicine, Yale University, New Haven, CT, 06511, USA
- Department of Immunobiology, School of Medicine, Yale University, New Haven, CT, 06511, USA
- Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD, 20815, USA
| | - Hongyu Zhao
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06511, USA.
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, 06511, USA.
| |
|
22
|
Yang Y, Wang K, Lu Z, Wang T, Wang X. Cytomulate: accurate and efficient simulation of CyTOF data. Genome Biol 2023; 24:262. [PMID: 37974276 PMCID: PMC10652542 DOI: 10.1186/s13059-023-03099-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Accepted: 10/24/2023] [Indexed: 11/19/2023] Open
Abstract
Recently, many analysis tools have been devised to offer insights into data generated via cytometry by time-of-flight (CyTOF). However, objective evaluations of these methods remain absent as most evaluations are conducted against real data where the ground truth is generally unknown. In this paper, we develop Cytomulate, a reproducible and accurate simulation algorithm of CyTOF data, which could serve as a foundation for future method development and evaluation. We demonstrate that Cytomulate can capture various characteristics of CyTOF data and is better at learning overall data distributions than single-cell RNA-seq-oriented methods such as scDesign2 and Splatter, and generative models like LAMBDA.
Affiliation(s)
- Yuqiu Yang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Kaiwen Wang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
| | - Zeyu Lu
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Tao Wang
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Xinlei Wang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA.
- Department of Mathematics, University of Texas at Arlington, Arlington, 76019, USA.
- Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, 76019, USA.
| |
|
23
|
Kim G, Chun H. Similarity-assisted variational autoencoder for nonlinear dimension reduction with application to single-cell RNA sequencing data. BMC Bioinformatics 2023; 24:432. [PMID: 37964243 PMCID: PMC10647110 DOI: 10.1186/s12859-023-05552-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 10/30/2023] [Indexed: 11/16/2023] Open
Abstract
BACKGROUND: Deep generative models naturally become nonlinear dimension reduction tools to visualize large-scale datasets such as single-cell RNA sequencing datasets for revealing latent grouping patterns or identifying outliers. The variational autoencoder (VAE) is a popular deep generative method equipped with encoder/decoder structures. The encoder and decoder are useful when a new sample is mapped to the latent space and a data point is generated from a point in a latent space. However, the VAE tends not to show grouping patterns clearly without additional annotation information. On the other hand, similarity-based dimension reduction methods such as t-SNE or UMAP present clear grouping patterns even though these methods do not have encoder/decoder structures. RESULTS: To bridge this gap, we propose a new approach that adopts similarity information in the VAE framework. In addition, for biological applications, we extend our approach to a conditional VAE to account for covariate effects in the dimension reduction step. In the simulation study and real single-cell RNA sequencing data analyses, our method shows great performance compared to existing state-of-the-art methods by producing clear grouping structures using an inferred encoder and decoder. Our method also successfully adjusts for covariate effects, resulting in more useful dimension reduction. CONCLUSIONS: Our method is able to produce clearer grouping patterns than those of other regularized VAE methods by utilizing similarity information encoded in the data via the highly celebrated UMAP loss function.
Affiliation(s)
- Gwangwoo Kim
- Graduate School of Data Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
| | - Hyonho Chun
- Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.
| |
|
24
|
Hassan AZ, Ward HN, Rahman M, Billmann M, Lee Y, Myers CL. Dimensionality reduction methods for extracting functional networks from large-scale CRISPR screens. Mol Syst Biol 2023; 19:e11657. [PMID: 37750448 PMCID: PMC10632734 DOI: 10.15252/msb.202311657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2023] [Revised: 08/28/2023] [Accepted: 09/05/2023] [Indexed: 09/27/2023] Open
Abstract
CRISPR-Cas9 screens facilitate the discovery of gene functional relationships and phenotype-specific dependencies. The Cancer Dependency Map (DepMap) is the largest compendium of whole-genome CRISPR screens aimed at identifying cancer-specific genetic dependencies across human cell lines. A mitochondria-associated bias has been previously reported to mask signals for genes involved in other functions, and thus, methods for normalizing this dominant signal to improve co-essentiality networks are of interest. In this study, we explore three unsupervised dimensionality reduction methods (autoencoders, robust principal component analysis, and classical principal component analysis (PCA)) for normalizing the DepMap to improve functional networks extracted from these data. We propose a novel "onion" normalization technique to combine several normalized data layers into a single network. Benchmarking analyses reveal that robust PCA combined with onion normalization outperforms existing methods for normalizing the DepMap. Our work demonstrates the value of removing low-dimensional signals from the DepMap before constructing functional gene networks and provides generalizable dimensionality reduction-based normalization tools.
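Illustrative sketch (not the authors' pipeline): the abstract's idea of removing a dominant low-dimensional signal before building networks can be shown with classical PCA alone. The NumPy code below subtracts the reconstruction from the leading principal components of a column-centered matrix. The function name `remove_top_components`, the toy matrix, and the choice of one removed component are assumptions; the autoencoder, robust PCA, and onion normalization steps are not reproduced.

```python
import numpy as np

def remove_top_components(X, n_remove):
    """Subtract the rank-n_remove PCA reconstruction from column-centered X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: the leading singular triplets carry the dominant signal.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    low_rank = (U[:, :n_remove] * s[:n_remove]) @ Vt[:n_remove]
    return Xc - low_rank

rng = np.random.default_rng(0)
# Hypothetical toy matrix (50 rows x 20 columns): one strong shared signal plus noise.
dominant = 5.0 * np.outer(rng.normal(size=50), rng.normal(size=20))
X = dominant + rng.normal(size=(50, 20))

residual = remove_top_components(X, n_remove=1)
# Correlations computed on the residual are no longer driven by the dominant signal.
corr_after = np.corrcoef(residual)
```

After the subtraction, the largest singular value of the residual equals the second-largest singular value of the centered input, so the strongest axis of variation has been removed and nothing else has changed.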
Affiliation(s)
- Arshia Zernab Hassan
  - Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
- Henry N Ward
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota – Twin Cities, Minneapolis, MN, USA
- Mahfuzur Rahman
  - Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
- Maximilian Billmann
  - Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
  - Institute of Human Genetics, University of Bonn, School of Medicine and University Hospital Bonn, Bonn, Germany
- Yoonkyu Lee
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota – Twin Cities, Minneapolis, MN, USA
- Chad L Myers
  - Department of Computer Science and Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, USA
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota – Twin Cities, Minneapolis, MN, USA

25. Huang D, Ye X, Zhang Y, Sakurai T. Collaborative analysis for drug discovery by federated learning on non-IID data. Methods 2023; 219:1-7. [PMID: 37689121 DOI: 10.1016/j.ymeth.2023.09.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/19/2023] [Revised: 08/23/2023] [Accepted: 09/05/2023] [Indexed: 09/11/2023] Open
Abstract
With the increasing availability of large-scale QSAR (Quantitative Structure-Activity Relationship) datasets, collaborative analysis has become a promising approach for drug discovery. Traditional centralized analysis which typically concentrates data on a central server for training faces challenges such as data privacy and security. Distributed analysis such as federated learning offers a solution by enabling collaborative model training without sharing raw data. However, it may fail when the training data in the local devices are non-independent and identically distributed (non-IID). In this paper, we propose a novel framework for collaborative drug discovery using federated learning on non-IID datasets. We address the difficulty of training on non-IID data by globally sharing a small subset of data among all institutions. Our framework allows multiple institutions to jointly train a robust predictive model while preserving the privacy of their individual data. We leverage the federated learning paradigm to distribute the model training process across local devices, eliminating the need for data exchange. The experimental results on 15 benchmark datasets demonstrate that the proposed method achieves competitive predictive accuracy to centralized analysis while respecting data privacy. Moreover, our framework offers benefits such as reduced data transmission and enhanced scalability, making it suitable for large-scale collaborative drug discovery efforts.
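The training scheme described in this abstract can be sketched as federated averaging in which every client augments its local data with a small globally shared subset. This is a minimal sketch under stated assumptions: the one-variable linear model, the sample-weighted averaging rule, the hyperparameters, and all function names are illustrative choices, not the authors' implementation.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Run a few epochs of gradient descent for y ~ w*x + b on one client's data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of training samples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def train(clients, shared, rounds=50):
    """Federated training in which each client adds the shared subset to its local data."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates, sizes = [], []
        for local in clients:
            data = local + shared  # local (non-IID) data plus the shared subset
            updates.append(local_update(global_weights, data))
            sizes.append(len(data))
        # Only model parameters are sent back to the server.
        global_weights = federated_average(updates, sizes)
    return global_weights


# Hypothetical non-IID split of y = 2x + 1: each client sees a different x range.
client_a = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
client_b = [(2.0, 5.0), (2.5, 6.0), (3.0, 7.0)]
shared = [(0.0, 1.0), (1.5, 4.0), (3.0, 7.0)]
w, b = train([client_a, client_b], shared)
```

The shared subset gives every client some coverage of the full input range, which is one way to keep local updates from drifting apart when the local distributions differ. Each client's own data stays on the client throughout.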
Affiliation(s)
- Dong Huang
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Ying Zhang
  - Beidahuang Industry Group General Hospital, Harbin, China
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan

26. Dutta S, Box AC, Li Y, Sardiu ME. Identifying dynamical persistent biomarker structures for rare events using modern integrative machine learning approach. Proteomics 2023; 23:e2200290. [PMID: 36852539 PMCID: PMC11503472 DOI: 10.1002/pmic.202200290] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/23/2022] [Revised: 01/30/2023] [Accepted: 02/17/2023] [Indexed: 03/01/2023]
Abstract
The evolution of omics and computational competency has accelerated discoveries of the underlying biological processes in an unprecedented way. High throughput methodologies, such as flow cytometry, can reveal deeper insights into cell processes, thereby allowing opportunities for scientific discoveries related to health and diseases. However, working with cytometry data often imposes complex computational challenges due to high-dimensionality, large size, and nonlinearity of the data structure. In addition, cytometry data frequently exhibit diverse patterns across biomarkers and suffer from substantial class imbalances which can further complicate the problem. The existing methods of cytometry data analysis either predict cell population or perform feature selection. Through this study, we propose a "wisdom of the crowd" approach to simultaneously predict rare cell populations and perform feature selection by integrating a pool of modern machine learning (ML) algorithms. Given that our approach integrates superior performing ML models across different normalization techniques based on entropy and rank, our method can detect diverse patterns existing across the model features. Furthermore, the method identifies a dynamic biomarker structure that divides the features into persistently selected, unselected, and fluctuating assemblies indicating the role of each biomarker in rare cell prediction, which can subsequently aid in studies of disease progression.
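The split of features into persistently selected, unselected, and fluctuating assemblies can be sketched as a simple count over the feature sets chosen by each model in the pool. This is a minimal sketch, assuming that "persistent" means selected by every model and "unselected" means selected by none. The thresholds, the function name, and the marker names are hypothetical and are not taken from the paper.

```python
def categorize_features(all_features, selections):
    """Group features by how consistently a pool of models selected them.

    all_features: iterable of feature names.
    selections: list of sets, one per model, holding the features it selected.
    """
    n_models = len(selections)
    counts = {f: sum(1 for s in selections if f in s) for f in all_features}
    return {
        # selected by every model in the pool
        "persistent": sorted(f for f, c in counts.items() if c == n_models),
        # selected by no model
        "unselected": sorted(f for f, c in counts.items() if c == 0),
        # selected by some models but not all
        "fluctuating": sorted(f for f, c in counts.items() if 0 < c < n_models),
    }


# Hypothetical marker panel and per-model selections.
markers = ["CD3", "CD4", "CD8", "CD19"]
per_model = [{"CD3", "CD4"}, {"CD3", "CD8"}, {"CD3", "CD4"}]
groups = categorize_features(markers, per_model)
```

A softer variant could replace the all-or-none rule with frequency cut-offs, which would matter when the pool contains many models.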
Affiliation(s)
- Sreejata Dutta
  - Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, Kansas, USA
- Andrew C. Box
  - Stowers Institute for Medical Research, Kansas City, Missouri, USA
- Yanming Li
  - Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, Kansas, USA
  - University of Kansas Cancer Center, Kansas City, Kansas, USA
- Mihaela E. Sardiu
  - Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, Kansas, USA
  - University of Kansas Cancer Center, Kansas City, Kansas, USA
  - Kansas Institute for Precision Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA

27. Chen S, Jiang W, Du Y, Yang M, Pan Y, Li H, Cui M. Single-cell analysis technologies for cancer research: from tumor-specific single cell discovery to cancer therapy. Front Genet 2023; 14:1276959. [PMID: 37900181 PMCID: PMC10602688 DOI: 10.3389/fgene.2023.1276959] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Accepted: 09/25/2023] [Indexed: 10/31/2023] Open
Abstract
Single-cell sequencing (SCS) technology is changing our understanding of cellular components, functions, and interactions across organisms, because of its inherent advantage of avoiding noise resulting from genotypic and phenotypic heterogeneity across numerous samples. By directly and individually measuring multiple molecular characteristics of thousands to millions of single cells, SCS technology can characterize multiple cell types and uncover the mechanisms of gene regulatory networks, the dynamics of transcription, and the functional state of proteomic profiling. In this context, we conducted systematic research on SCS techniques, including the fundamental concepts, procedural steps, and applications of scDNA, scRNA, scATAC, scCITE, and scSNARE methods, focusing on the unique clinical advantages of SCS, particularly in cancer therapy. We have explored challenging but critical areas such as circulating tumor cells (CTCs), lineage tracing, tumor heterogeneity, drug resistance, and tumor immunotherapy. Despite challenges in managing and analyzing the large amounts of data that result from SCS, this technique is expected to reveal new horizons in cancer research. This review aims to emphasize the key role of SCS in cancer research and promote the application of single-cell technologies to cancer therapy.
Affiliation(s)
- Siyuan Chen
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Hospital of Jilin University, Changchun, China
- Weibo Jiang
  - Department of Orthopaedic, The Second Hospital of Jilin University, Changchun, China
- Yanhui Du
  - Department of Orthopaedics, Jilin Province People’s Hospital, Changchun, China
- Manshi Yang
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Hospital of Jilin University, Changchun, China
- Yihan Pan
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Hospital of Jilin University, Changchun, China
- Huan Li
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Hospital of Jilin University, Changchun, China
- Mengying Cui
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Hospital of Jilin University, Changchun, China

28. Du J, Gu XR, Yu XX, Cao YJ, Hou J. Essential procedures of single-cell RNA sequencing in multiple myeloma and its translational value. Blood Science 2023; 5:221-236. [PMID: 37941914 PMCID: PMC10629747 DOI: 10.1097/bs9.0000000000000172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 09/18/2023] [Indexed: 11/10/2023] Open
Abstract
Multiple myeloma (MM) is a malignant neoplasm characterized by clonal proliferation of abnormal plasma cells. In many countries, it ranks as the second most prevalent malignant neoplasm of the hematopoietic system. Although treatment methods for MM have been continuously improved and the survival of patients has been dramatically prolonged, MM remains an incurable disease with a high probability of recurrence. As such, there are still many challenges to be addressed. One promising approach is single-cell RNA sequencing (scRNA-seq), which can elucidate the transcriptome heterogeneity of individual cells and reveal previously unknown cell types or states in complex tissues. In this review, we outlined the experimental workflow of scRNA-seq in MM, listed some commonly used scRNA-seq platforms and analytical tools. In addition, with the advent of scRNA-seq, many studies have made new progress in the key molecular mechanisms during MM clonal evolution, cell interactions and molecular regulation in the microenvironment, and drug resistance mechanisms in target therapy. We summarized the main findings and sequencing platforms for applying scRNA-seq to MM research and proposed broad directions for targeted therapies based on these findings.
Affiliation(s)
- Jun Du
  - Department of Hematology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China
- Xiao-Ran Gu
  - School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
- Xiao-Xiao Yu
  - School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
- Yang-Jia Cao
  - Department of Hematology, First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, Shaanxi 710000, China
- Jian Hou
  - Department of Hematology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China

29. Roper B, Mathews JC, Nadeem S, Park JH. Vis-SPLIT: Interactive Hierarchical Modeling for mRNA Expression Classification. IEEE Visualization Conference: VIS. IEEE Conference on Visualization 2023; 2023:106-110. [PMID: 38881685 PMCID: PMC11179685 DOI: 10.1109/vis54172.2023.00030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2024]
Abstract
We propose an interactive visual analytics tool, Vis-SPLIT, for partitioning a population of individuals into groups with similar gene signatures. Vis-SPLIT allows users to interactively explore a dataset and exploit visual separations to build a classification model for specific cancers. The visualization components reveal gene expression and correlation to assist specific partitioning decisions, while also providing overviews for the decision model and clustered genetic signatures. We demonstrate the effectiveness of our framework through a case study and evaluate its usability with domain experts. Our results show that Vis-SPLIT can classify patients based on their genetic signatures to effectively gain insights into RNA sequencing data, as compared to an existing classification system.

30. Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023] Open
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations. Single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, making analysis with traditional computational approaches difficult or infeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing single-cell research.
Affiliation(s)
- Nafiseh Erfanian
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- A Ali Heydari
  - Department of Applied Mathematics, University of California, Merced, CA, USA
  - Health Sciences Research Institute, University of California, Merced, CA, USA
- Adib Miraki Feriz
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- Pablo Iañez
  - Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
- Afshin Derakhshani
  - Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
- Mohsen Farahpour
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Seyyed Mohammad Razavi
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Saeed Nasseri
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Hossein Safarpour
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Amirhossein Sahebkar
  - Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran
  - Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran

31. Jones A, Townes FW, Li D, Engelhardt BE. Alignment of spatial genomics data using deep Gaussian processes. Nat Methods 2023; 20:1379-1387. [PMID: 37592182 PMCID: PMC10482692 DOI: 10.1038/s41592-023-01972-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 01/10/2022] [Accepted: 07/06/2023] [Indexed: 08/19/2023]
Abstract
Spatially resolved genomic technologies have allowed us to study the physical organization of cells and tissues, and promise an understanding of local interactions between cells. However, it remains difficult to precisely align spatial observations across slices, samples, scales, individuals and technologies. Here, we propose a probabilistic model that aligns spatially-resolved samples onto a known or unknown common coordinate system (CCS) with respect to phenotypic readouts (for example, gene expression). Our method, Gaussian Process Spatial Alignment (GPSA), consists of a two-layer Gaussian process: the first layer maps observed samples' spatial locations onto a CCS, and the second layer maps from the CCS to the observed readouts. Our approach enables complex downstream spatially aware analyses that are impossible or inaccurate with unaligned data, including an analysis of variance, creation of a dense three-dimensional (3D) atlas from sparse two-dimensional (2D) slices or association tests across data modalities.
Affiliation(s)
- Andrew Jones
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- F William Townes
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Didong Li
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
- Barbara E Engelhardt
  - Gladstone Institutes, San Francisco, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA

32. Uvarova YE, Demenkov PS, Kuzmicheva IN, Venzel AS, Mischenko EL, Ivanisenko TV, Efimov VM, Bannikova SV, Vasilieva AR, Ivanisenko VA, Peltek SE. Accurate noise-robust classification of Bacillus species from MALDI-TOF MS spectra using a denoising autoencoder. J Integr Bioinform 2023; 20:jib-2023-0017. [PMID: 37978847 PMCID: PMC10757077 DOI: 10.1515/jib-2023-0017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 07/10/2023] [Indexed: 11/19/2023] Open
Abstract
Bacillus strains are ubiquitous in the environment and are widely used in the microbiological industry as valuable enzyme sources, as well as in agriculture to stimulate plant growth. The Bacillus genus comprises several closely related groups of species. The rapid classification of these remains challenging using existing methods. Techniques based on MALDI-TOF MS data analysis hold significant promise for fast and precise microbial strain classification at both the genus and species levels. In previous work, we proposed a geometric approach to Bacillus strain classification based on mass spectra analysis via the centroid method (CM). One limitation of such methods is the noise in MS spectra. In this study, we used a denoising autoencoder (DAE) to improve bacteria classification accuracy under noisy MS spectra conditions. We employed a denoising autoencoder approach to convert noisy MS spectra into latent variables representing molecular patterns in the original MS data, and the Random Forest (RF) method to classify bacterial strains by latent variables. Comparison of the DAE-RF with the CM method using the artificially noisy test samples showed that DAE-RF offers higher noise robustness. Hence, the DAE-RF method could be utilized for noise-robust, fast, and neat classification of Bacillus species according to MALDI-TOF MS data.
Affiliation(s)
- Yulia E. Uvarova
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Pavel S. Demenkov
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Kurchatov Center for Genome Research, Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Novosibirsk State University, 630090 Novosibirsk, Russia
- Artur S. Venzel
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Novosibirsk State University, 630090 Novosibirsk, Russia
- Elena L. Mischenko
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Timofey V. Ivanisenko
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Kurchatov Center for Genome Research, Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Vadim M. Efimov
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Svetlana V. Bannikova
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Asya R. Vasilieva
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Vladimir A. Ivanisenko
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Kurchatov Center for Genome Research, Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Novosibirsk State University, 630090 Novosibirsk, Russia
- Sergey E. Peltek
  - Federal Research Center Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Kurchatov Center for Genome Research, Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia

33. Gunawan I, Vafaee F, Meijering E, Lock JG. An introduction to representation learning for single-cell data analysis. Cell Reports Methods 2023; 3:100547. [PMID: 37671013 PMCID: PMC10475795 DOI: 10.1016/j.crmeth.2023.100547] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 09/07/2023]
Abstract
Single-cell-resolved systems biology methods, including omics- and imaging-based measurement modalities, generate a wealth of high-dimensional data characterizing the heterogeneity of cell populations. Representation learning methods are routinely used to analyze these complex, high-dimensional data by projecting them into lower-dimensional embeddings. This facilitates the interpretation and interrogation of the structures, dynamics, and regulation of cell heterogeneity. Reflecting their central role in analyzing diverse single-cell data types, a myriad of representation learning methods exist, with new approaches continually emerging. Here, we contrast general features of representation learning methods spanning statistical, manifold learning, and neural network approaches. We consider key steps involved in representation learning with single-cell data, including data pre-processing, hyperparameter optimization, downstream analysis, and biological validation. Interdependencies and contingencies linking these steps are also highlighted. This overview is intended to guide researchers in the selection, application, and optimization of representation learning strategies for current and future single-cell research applications.
Affiliation(s)
- Ihuan Gunawan
  - School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
  - School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
- Fatemeh Vafaee
  - School of Biotechnology and Biomolecular Sciences, Faculty of Science, University of New South Wales, Sydney, NSW, Australia
  - UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
- Erik Meijering
  - School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
- John George Lock
  - School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
  - UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
  - Ingham Institute for Applied Medical Research, Liverpool, NSW, Australia

34. Kana O, Nault R, Filipovic D, Marri D, Zacharewski T, Bhattacharya S. Generative modeling of single-cell gene expression for dose-dependent chemical perturbations. Patterns (New York, N.Y.) 2023; 4:100817. [PMID: 37602218 PMCID: PMC10436058 DOI: 10.1016/j.patter.2023.100817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 12/07/2022] [Accepted: 07/14/2023] [Indexed: 08/22/2023]
Abstract
Single-cell sequencing reveals the heterogeneity of cellular response to chemical perturbations. However, testing all relevant combinations of cell types, chemicals, and doses is a daunting task. A deep generative learning formalism called variational autoencoders (VAEs) has been effective in predicting single-cell gene expression perturbations for single doses. Here, we introduce single-cell variational inference of dose-response (scVIDR), a VAE-based model that predicts both single-dose and multiple-dose cellular responses better than existing models. We show that scVIDR can predict dose-dependent gene expression across mouse hepatocytes, human blood cells, and cancer cell lines. We biologically interpret the latent space of scVIDR using a regression model and use scVIDR to order individual cells based on their sensitivity to chemical perturbation by assigning each cell a "pseudo-dose" value. We envision that scVIDR can help reduce the need for repeated animal testing across tissues, chemicals, and doses.
Affiliation(s)
- Omar Kana
  - Department of Pharmacology and Toxicology, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
- Rance Nault
  - Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- David Filipovic
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biomedical Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Daniel Marri
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biomedical Engineering, Michigan State University, East Lansing, MI 48824, USA
- Tim Zacharewski
  - Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Sudin Bhattacharya
  - Department of Pharmacology and Toxicology, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Integrative Toxicology, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biomedical Engineering, Michigan State University, East Lansing, MI 48824, USA

35.
Abstract
Advances in single-cell proteomics technologies have resulted in high-dimensional datasets comprising millions of cells that are capable of answering key questions about biology and disease. The advent of these technologies has prompted the development of computational tools to process and visualize the complex data. In this review, we outline the steps of single-cell and spatial proteomics analysis pipelines. In addition to describing available methods, we highlight benchmarking studies that have identified advantages and pitfalls of the currently available computational toolkits. As these technologies continue to advance, robust analysis tools should be developed in tandem to take full advantage of the potential biological insights provided by these data.
Affiliation(s)
- Sophia M Guldberg
  - Department of Otolaryngology-Head and Neck Surgery and Department of Microbiology and Immunology, University of California, San Francisco, California, USA
  - Biomedical Sciences Graduate Program, University of California, San Francisco, California, USA
  - Gladstone-UCSF Institute for Genomic Immunology, San Francisco, California, USA
- Trine Line Hauge Okholm
  - Department of Otolaryngology-Head and Neck Surgery and Department of Microbiology and Immunology, University of California, San Francisco, California, USA
  - Gladstone-UCSF Institute for Genomic Immunology, San Francisco, California, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, California, USA
- Elizabeth E McCarthy
  - Department of Otolaryngology-Head and Neck Surgery and Department of Microbiology and Immunology, University of California, San Francisco, California, USA
  - Biomedical Sciences Graduate Program, University of California, San Francisco, California, USA
  - Institute for Human Genetics; Division of Rheumatology, Department of Medicine; Medical Scientist Training Program; and Biological and Medical Informatics Graduate Program, University of California, San Francisco, California, USA
- Matthew H Spitzer
  - Department of Otolaryngology-Head and Neck Surgery and Department of Microbiology and Immunology, University of California, San Francisco, California, USA
  - Gladstone-UCSF Institute for Genomic Immunology, San Francisco, California, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, California, USA
  - Parker Institute for Cancer Immunotherapy, San Francisco, California, USA
  - Chan Zuckerberg Biohub, San Francisco, California, USA

36. Orrapin S, Thongkumkoon P, Udomruk S, Moonmuang S, Sutthitthasakul S, Yongpitakwattana P, Pruksakorn D, Chaiyawat P. Deciphering the Biology of Circulating Tumor Cells through Single-Cell RNA Sequencing: Implications for Precision Medicine in Cancer. Int J Mol Sci 2023; 24:12337. [PMID: 37569711 PMCID: PMC10418766 DOI: 10.3390/ijms241512337] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/29/2023] [Revised: 07/25/2023] [Accepted: 07/27/2023] [Indexed: 08/13/2023] Open
Abstract
Circulating tumor cells (CTCs) hold unique biological characteristics that directly involve them in hematogenous dissemination. Studying CTCs systematically is technically challenging due to their extreme rarity and heterogeneity and the lack of specific markers to specify metastasis-initiating CTCs. With cutting-edge technology, single-cell RNA sequencing (scRNA-seq) provides insights into the biology of metastatic processes driven by CTCs. Transcriptomics analysis of single CTCs can decipher tumor heterogeneity and phenotypic plasticity for exploring promising novel therapeutic targets. The integrated approach provides a perspective on the mechanisms underlying tumor development and interrogates CTCs interactions with other blood cell types, particularly those of the immune system. This review aims to comprehensively describe the current study on CTC transcriptomic analysis through scRNA-seq technology. We emphasize the workflow for scRNA-seq analysis of CTCs, including enrichment, single cell isolation, and bioinformatic tools applied for this purpose. Furthermore, we elucidated the translational knowledge from the transcriptomic profile of individual CTCs and the biology of cancer metastasis for developing effective therapeutics through targeting key pathways in CTCs.
Affiliation(s)
- Santhasiri Orrapin
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Patcharawadee Thongkumkoon
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Sasimol Udomruk
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
  - Musculoskeletal Science and Translational Research (MSTR) Center, Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Sutpirat Moonmuang
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Songphon Sutthitthasakul
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Petlada Yongpitakwattana
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Dumnoensun Pruksakorn
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
  - Musculoskeletal Science and Translational Research (MSTR) Center, Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
  - Department of Orthopedics, Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
- Parunya Chaiyawat
  - Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand
  - Musculoskeletal Science and Translational Research (MSTR) Center, Faculty of Medicine, Chiang Mai University, Muang, Chiang Mai 50200, Thailand

37. Ehiro T. Feature importance-based interpretation of UMAP-visualized polymer space. Mol Inform 2023; 42:e2300061. [PMID: 37212494 DOI: 10.1002/minf.202300061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/12/2023] [Revised: 05/11/2023] [Accepted: 05/19/2023] [Indexed: 05/23/2023]
Abstract
Dimensionality reduction (DR) techniques are used for various purposes such as exploratory data analysis. Principal component analysis (PCA) is one of the most popular linear DR techniques. Owing to its linear nature, PCA enables the determination of axes in a low-dimensional space and the calculation of corresponding loading vectors. However, PCA cannot necessarily extract important features of non-linearly distributed data. This study presents a technique aimed at aiding the interpretation of data reduced through non-linear DR methods. In the proposed method, data reduced by non-linear DR were clustered via a density-based clustering method. Thereafter, the obtained cluster labels were classified by random forest (RF) classifiers. Further, the feature importance (FI) of the RF classifiers and Spearman's rank correlation coefficients between the predicted probabilities of belonging to the obtained clusters and the original feature values were used to characterize the visualized dimensionally reduced data. The results revealed that the proposed method can provide interpretable FI-based images of the handwritten digits dataset. Moreover, the proposed method was also applied to the polymer dataset. The study found that incorporating signed FI was advantageous in achieving a meaningful interpretation. Furthermore, Gaussian process regression was utilized to produce intuitive FI-based heatmaps on a 2-dimensional space for greater ease of understanding. Additionally, to enhance the interpretability of the obtained clusters, a feature selection technique called Boruta was applied. The Boruta feature selection method worked effectively to interpret the obtained clusters with limited and commonly important features. Additionally, the study suggested that computing FI solely from substructure-based descriptors could further enhance the interpretability of the results. Finally, the automation of the proposed method was investigated, and through maximizing the target score based on the quality of both the DR and clustering, indicative results were automatically obtained for both the handwritten digits and polymer datasets.
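One step named in the abstract above, the Spearman rank correlation between a cluster's predicted probability and an original feature value, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `average_ranks` and `spearman` and the toy values are hypothetical, and only the Python standard library is assumed.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank variance is zero; correlation is undefined")
    return cov / (sx * sy)


# Hypothetical toy data: predicted probability of belonging to one cluster
# and the value of one original feature for the same four samples.
cluster_probability = [0.05, 0.20, 0.60, 0.95]
feature_value = [1.0, 4.0, 9.0, 16.0]
rho = spearman(cluster_probability, feature_value)  # monotone, so rho is close to +1
```

The sign of the coefficient is what makes a "signed" importance possible: a positive value suggests the feature increases with cluster membership, a negative value the opposite.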
Affiliation(s)
- Takuya Ehiro
  - Research Division of Polymer Functional Materials, Osaka Research Institute of Industrial Science and Technology, 2-7-1 Ayumino, Izumi, Osaka, 594-1157, Japan

38. Tanabe S, Muraki T, Yaginuma K, Kim S, Kano M. Greedy design space construction based on regression and latent space extraction for pharmaceutical development. Int J Pharm 2023; 642:123178. [PMID: 37364782 DOI: 10.1016/j.ijpharm.2023.123178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2023] [Revised: 05/23/2023] [Accepted: 06/22/2023] [Indexed: 06/28/2023]
Abstract
Implementation of the design space (DS) is a scientific concept for ensuring quality to be submitted as a part of the regulatory filing of a drug product for approval to market. An empirical approach is constructing the DS based on the regression model whose inputs are process parameters and material attributes over the different unit operations, i.e., a high-dimensional statistical model. While the high-dimensional model assures quality and process flexibility through a comprehensive process understanding, it has difficulty visualizing the feasible range of input parameters, i.e., DS. Therefore, this study proposes a greedy approach to constructing the extensive and flexible low-dimensional DS based on the high-dimensional statistical model and the observed internal representations that satisfies both comprehensive process understanding and the DS visualization capability. Introducing the observed correlation structure enabled the dimensionality reduction of the DS. The non-critical controllable parameters were fixed to the target values in visualizing the low-dimensional DS as a function of critical parameters. The expected variation of non-critical non-controllable parameters was considered the source of variation in prediction. The case study demonstrated the proposed approach's usefulness for developing the pharmaceutical manufacturing process.
Affiliation(s)
- Shuichi Tanabe
  - Formulation Technology Research Laboratories, Daiichi Sankyo Co., Ltd., 1-12-1 Shinomiya, 2540014 Hiratsuka, Japan
- Tatsuya Muraki
  - Department of Systems Science, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 6068501, Japan
- Keita Yaginuma
  - Formulation Technology Research Laboratories, Daiichi Sankyo Co., Ltd., 1-12-1 Shinomiya, 2540014 Hiratsuka, Japan
- Sanghong Kim
  - Department of Applied Physics and Chemical Engineering, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei 1840012, Japan
- Manabu Kano
  - Department of Systems Science, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 6068501, Japan

39. Pan W, Long F, Pan J. ScInfoVAE: interpretable dimensional reduction of single cell transcription data with variational autoencoders and extended mutual information regularization. BioData Min 2023; 16:17. [PMID: 37301826 DOI: 10.1186/s13040-023-00333-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Accepted: 06/05/2023] [Indexed: 06/12/2023] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) data can serve as a good indicator of cell-to-cell heterogeneity and can aid in the study of cell growth by identifying cell types. Recently, variational autoencoders (VAEs) have demonstrated their ability to learn robust feature representations for scRNA-seq. However, it has been observed that VAEs tend to ignore the latent variables when combined with a decoding distribution that is too flexible. In this paper, we introduce ScInfoVAE, a dimensional reduction method based on the mutual information variational autoencoder (InfoVAE), which can more effectively identify various cell types in scRNA-seq data of complex tissues. ScInfoVAE combines an InfoVAE deep model with a zero-inflated negative binomial distribution model and reconstructs the objective function to denoise scRNA-seq data and learn an efficient low-dimensional representation of it. We use ScInfoVAE to analyze the clustering performance of 15 real scRNA-seq datasets and demonstrate that our method provides high clustering performance. In addition, we use simulated data to investigate the interpretability of feature extraction, and visualization results show that the low-dimensional representation learned by ScInfoVAE retains the local and global neighborhood structure of the data well. Moreover, our model can significantly improve the quality of the variational posterior.
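The zero-inflated negative binomial (ZINB) distribution mentioned in the abstract above can be written down compactly. The sketch below uses a common mean/dispersion parameterization (mean `mu`, inverse dispersion `theta`, zero-inflation probability `pi`); this parameterization and the function names are assumptions made for illustration, and the abstract does not state which form ScInfoVAE uses.

```python
from math import exp, lgamma, log


def nb_logpmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0 (assumed form)."""
    return (
        lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
        + theta * log(theta / (theta + mu))
        + x * log(mu / (theta + mu))
    )


def zinb_logpmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB model.

    With probability pi (0 <= pi < 1) the count is a structural zero;
    otherwise it is drawn from the negative binomial above.
    """
    if x == 0:
        # A zero can come from the dropout component or from the NB itself.
        return log(pi + (1.0 - pi) * exp(nb_logpmf(0, mu, theta)))
    return log(1.0 - pi) + nb_logpmf(x, mu, theta)


# Hypothetical parameters for one gene in one cell.
log_p_zero = zinb_logpmf(0, mu=3.0, theta=2.0, pi=0.3)
log_p_five = zinb_logpmf(5, mu=3.0, theta=2.0, pi=0.3)
```

In a VAE setting, a negative log-likelihood of this kind would typically serve as the reconstruction term, with `mu`, `theta`, and `pi` produced by the decoder; that wiring is omitted here.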
Affiliation(s)
- Weiquan Pan
  - School of Mathematics and Statistics, Yulin Normal University, Yulin, China
- Faning Long
  - School of Computer Science and Engineering, Yulin Normal University, Yulin, China
- Jian Pan
  - School of Mathematics and Statistics, Yulin Normal University, Yulin, China

40. Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (New York, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Affiliation(s)
- Shixiong Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Jilin 130012, China
- Jiecong Lin
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Qiuzhen Lin
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China

41. Ding N, Zhang G, Zhang L, Shen Z, Yin L, Zhou S, Deng Y. Engineering an AI-based forward-reverse platform for the design of cross-ribosome binding sites of a transcription factor biosensor. Comput Struct Biotechnol J 2023; 21:2929-2939. [PMID: 38213883 PMCID: PMC10781712 DOI: 10.1016/j.csbj.2023.04.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 04/26/2023] [Accepted: 04/26/2023] [Indexed: 01/13/2024] Open
Abstract
A cross-ribosome binding site (cRBS) adjusts the dynamic range of transcription factor-based biosensors (TFBs) by controlling protein expression and folding. The rational design of a cRBS with a desired TFB dynamic range remains an important issue in TFB forward and reverse engineering. Here, we report a novel artificial intelligence (AI)-based forward-reverse engineering platform for TFB dynamic range prediction and de novo cRBS design with selected TFB dynamic ranges. The platform demonstrated superior performance in processing unbalanced minority-class datasets and was guided by sequence characteristics from trained cRBSs. The platform identified correlations between cRBSs and dynamic ranges to mimic bidirectional design between these factors based on a Wasserstein generative adversarial network (GAN) with a gradient penalty (GP) (WGAN-GP) and a balancing GAN with GP (BAGAN-GP). For forward and reverse engineering, the predictive accuracy was up to 98% and 82%, respectively. Collectively, we generated an AI-based method for the rational design of TFBs with desired dynamic ranges.
Affiliation(s)
- Nana Ding
  - National Engineering Research Center for Cereal Fermentation and Food Biomanufacturing, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, People’s Republic of China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, People’s Republic of China
  - Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, No. 1239 Siping Road, Shanghai 201210, People’s Republic of China
- Guangkun Zhang
  - Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, No. 1239 Siping Road, Shanghai 201210, People’s Republic of China
- LinPei Zhang
  - National Engineering Research Center for Cereal Fermentation and Food Biomanufacturing, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, People’s Republic of China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, People’s Republic of China
- Ziyun Shen
  - National Engineering Research Center for Cereal Fermentation and Food Biomanufacturing, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, People’s Republic of China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, People’s Republic of China
- Lianghong Yin
  - State Key Laboratory of Subtropical Silviculture, Zhejiang A&F University, Hangzhou 311300, People’s Republic of China
  - Zhejiang Provincial Key Laboratory of Resources Protection and Innovation of Traditional Chinese Medicine, Zhejiang A&F University, Hangzhou 311300, People’s Republic of China
- Shenghu Zhou
  - National Engineering Research Center for Cereal Fermentation and Food Biomanufacturing, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, People’s Republic of China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, People’s Republic of China
- Yu Deng
  - National Engineering Research Center for Cereal Fermentation and Food Biomanufacturing, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, People’s Republic of China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, People’s Republic of China

42. Xu Y, Zang Z, Xia J, Tan C, Geng Y, Li SZ. Structure-preserving visualization for single-cell RNA-Seq profiles using deep manifold transformation with batch-correction. Commun Biol 2023; 6:369. [PMID: 37016133 PMCID: PMC10073100 DOI: 10.1038/s42003-023-04662-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Accepted: 03/06/2023] [Indexed: 04/06/2023] Open
Abstract
Dimensionality reduction and visualization play an important role in biological data analysis, such as data interpretation of single-cell RNA sequences (scRNA-seq). It is desirable to have a visualization method that is not only applicable to various application scenarios, including cell clustering and trajectory inference, but also satisfies a variety of technical requirements, especially the ability to preserve the inherent structure of data and handle batch effects. However, no existing method can accommodate these requirements in a unified framework. In this paper, we propose a general visualization method, deep visualization (DV), that possesses the ability to preserve the inherent structure of data and handle batch effects and is applicable to a variety of datasets from different application domains and dataset scales. The method embeds a given dataset into a 2- or 3-dimensional visualization space, with either a Euclidean or a hyperbolic metric, depending on whether the task involves static (a single time point) or dynamic (a sequence of time points) scRNA-seq data, respectively. Specifically, DV learns a structure graph to describe the relationships between data samples and transforms the data into the visualization space while preserving the geometric structure of the data and correcting batch effects in an end-to-end manner. The experimental results on nine datasets in complex tissue from human patients or animal development demonstrate the competitiveness of DV in discovering complex cellular relations, uncovering temporal trajectories, and addressing complex batch factors. We also provide a preliminary attempt to pre-train a DV model for visualization of new incoming data.
Affiliation(s)
- Yongjie Xu
  - Zhejiang University, Hangzhou, 310058, China
  - AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China
- Zelin Zang
  - Zhejiang University, Hangzhou, 310058, China
  - AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China
- Jun Xia
  - Zhejiang University, Hangzhou, 310058, China
  - AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China
- Cheng Tan
  - Zhejiang University, Hangzhou, 310058, China
  - AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China
- Yulan Geng
  - AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China
- Stan Z Li
  - AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China

43. Wang K, Yang Y, Wu F, Song B, Wang X, Wang T. Comparative analysis of dimension reduction methods for cytometry by time-of-flight data. Nat Commun 2023; 14:1836. [PMID: 37005472 PMCID: PMC10067013 DOI: 10.1038/s41467-023-37478-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 03/20/2023] [Indexed: 04/04/2023] Open
Abstract
While experimental and informatic techniques around single cell sequencing (scRNA-seq) are advanced, research around mass cytometry (CyTOF) data analysis has severely lagged behind. CyTOF data are notably different from scRNA-seq data in many aspects. This calls for the evaluation and development of computational methods specific for CyTOF data. Dimension reduction (DR) is one of the critical steps of single cell data analysis. Here, we benchmark the performances of 21 DR methods on 110 real and 425 synthetic CyTOF samples. We find that less well-known methods like SAUCIE, SQuaD-MDS, and scvis are the overall best performers. In particular, SAUCIE and scvis are well balanced, SQuaD-MDS excels at structure preservation, whereas UMAP has great downstream analysis performance. We also find that t-SNE (along with SQuaD-MDS/t-SNE Hybrid) possesses the best local structure preservation. Nevertheless, there is a high level of complementarity between these tools, so the choice of method should depend on the underlying data structure and the analytical needs.
Affiliation(s)
- Kaiwen Wang
  - Department of Statistical Science, Southern Methodist University, Dallas, TX, 75275, USA
- Yuqiu Yang
  - Department of Statistical Science, Southern Methodist University, Dallas, TX, 75275, USA
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Fangjiang Wu
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Bing Song
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Xinlei Wang
  - Department of Statistical Science, Southern Methodist University, Dallas, TX, 75275, USA
  - Department of Mathematics, University of Texas at Arlington, Arlington, TX, 76019, USA
  - Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, 76019, USA
- Tao Wang
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
  - Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA

44. Zernab Hassan A, Ward HN, Rahman M, Billmann M, Lee Y, Myers CL. Dimensionality reduction methods for extracting functional networks from large-scale CRISPR screens. bioRxiv 2023:2023.02.22.529573. [PMID: 36993440 PMCID: PMC10054965 DOI: 10.1101/2023.02.22.529573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
CRISPR-Cas9 screens facilitate the discovery of gene functional relationships and phenotype-specific dependencies. The Cancer Dependency Map (DepMap) is the largest compendium of whole-genome CRISPR screens aimed at identifying cancer-specific genetic dependencies across human cell lines. A mitochondria-associated bias has been previously reported to mask signals for genes involved in other functions, and thus, methods for normalizing this dominant signal to improve co-essentiality networks are of interest. In this study, we explore three unsupervised dimensionality reduction methods - autoencoders, robust, and classical principal component analyses (PCA) - for normalizing the DepMap to improve functional networks extracted from these data. We propose a novel "onion" normalization technique to combine several normalized data layers into a single network. Benchmarking analyses reveal that robust PCA combined with onion normalization outperforms existing methods for normalizing the DepMap. Our work demonstrates the value of removing low-dimensional signals from the DepMap before constructing functional gene networks and provides generalizable dimensionality reduction-based normalization tools.
Affiliation(s)
- Arshia Zernab Hassan
  - Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Henry N Ward
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Mahfuzur Rahman
  - Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Maximilian Billmann
  - Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
  - Institute of Human Genetics, University of Bonn, School of Medicine and University Hospital Bonn, Bonn, Germany
- Yoonkyu Lee
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
- Chad L Myers
  - Department of Computer Science and Engineering, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota - Twin Cities, Minneapolis, Minnesota, USA

45. Jee DJ, Kong Y, Chun H. Deep Nonnegative Matrix Factorization Using a Variational Autoencoder With Application to Single-Cell RNA Sequencing Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:883-893. [PMID: 35511832 DOI: 10.1109/tcbb.2022.3172723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Single-cell RNA sequencing is used to analyze the gene expression data of individual cells, thereby adding to existing knowledge of biological phenomena. Accordingly, this technology is widely used in numerous biomedical studies. Recently, the variational autoencoder has emerged and has been adopted for the analysis of single-cell data owing to its high capacity to manage large-scale data. Many different variants of the variational autoencoder have been applied, and have yielded superior results. However, because it is nonlinear, the model does not provide parameters that can be used to explain the underlying biological patterns. In this paper, we propose an interpretable nonnegative matrix factorization method that decomposes parameters into those shared across cells and those that are cell-specific. Effective nonlinear dimension reduction was achieved via a variational autoencoder applied to the cell-specific parameters. In addition to achieving nonlinear dimension reduction, our model could estimate the cell-type-specific gene expression. To improve the estimation accuracy, we introduced log-regularization, which reflects the single-cell property. Overall, our approach displayed excellent performance in a simulation study and in real data analyses, while maintaining good biological interpretability.

46. Allesøe RL, Lundgaard AT, Hernández Medina R, Aguayo-Orozco A, Johansen J, Nissen JN, Brorsson C, Mazzoni G, Niu L, Biel JH, Brasas V, Webel H, Benros ME, Pedersen AG, Chmura PJ, Jacobsen UP, Mari A, Koivula R, Mahajan A, Vinuela A, Tajes JF, Sharma S, Haid M, Hong MG, Musholt PB, De Masi F, Vogt J, Pedersen HK, Gudmundsdottir V, Jones A, Kennedy G, Bell J, Thomas EL, Frost G, Thomsen H, Hansen E, Hansen TH, Vestergaard H, Muilwijk M, Blom MT, 't Hart LM, Pattou F, Raverdy V, Brage S, Kokkola T, Heggie A, McEvoy D, Mourby M, Kaye J, Hattersley A, McDonald T, Ridderstråle M, Walker M, Forgie I, Giordano GN, Pavo I, Ruetten H, Pedersen O, Hansen T, Dermitzakis E, Franks PW, Schwenk JM, Adamski J, McCarthy MI, Pearson E, Banasik K, Rasmussen S, Brunak S, Thomas CE, Haussler R, Beulens J, Rutters F, Nijpels G, van Oort S, Groeneveld L, Elders P, Giorgino T, Rodriquez M, Nice R, Perry M, Bianzano S, Graefe-Mody U, Hennige A, Grempler R, Baum P, Stærfeldt HH, Shah N, Teare H, Ehrhardt B, Tillner J, Dings C, Lehr T, Scherer N, Sihinevich I, Cabrelli L, Loftus H, Bizzotto R, Tura A, Dekkers K, van Leeuwen N, Groop L, Slieker R, Ramisch A, Jennison C, McVittie I, Frau F, Steckel-Hamann B, Adragni K, Thomas M, Pasdar NA, Fitipaldi H, Kurbasic A, Mutie P, Pomares-Millan H, Bonnefond A, Canouil M, Caiazzo R, Verkindt H, Holl R, Kuulasmaa T, Deshmukh H, Cederberg H, Laakso M, Vangipurapu J, Dale M, Thorand B, Nicolay C, Fritsche A, Hill A, Hudson M, Thorne C, Allin K, Arumugam M, Jonsson A, Engelbrechtsen L, Forman A, Dutta A, Sondertoft N, Fan Y, Gough S, Robertson N, McRobert N, Wesolowska-Andersen A, Brown A, Davtian D, Dawed A, Donnelly L, Palmer C, White M, Ferrer J, Whitcher B, Artati A, Prehn C, Adam J, Grallert H, Gupta R, Sackett PW, Nilsson B, Tsirigos K, Eriksen R, Jablonka B, Uhlen M, Gassenhuber J, Baltauss T, de Preville N, Klintenberg M, Abdalla M. Discovery of drug-omics associations in type 2 diabetes with generative deep-learning models. Nat Biotechnol 2023; 41:399-408. [PMID: 36593394 PMCID: PMC10017515 DOI: 10.1038/s41587-022-01520-x] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 03/15/2022] [Accepted: 09/20/2022] [Indexed: 01/03/2023]
Abstract
The application of multiple omics technologies in biomedical cohorts has the potential to reveal patient-level disease characteristics and individualized response to treatment. However, the scale and heterogeneous nature of multi-modal data make integration and inference a non-trivial task. We developed a deep-learning-based framework, multi-omics variational autoencoders (MOVE), to integrate such data and applied it to a cohort of 789 people with newly diagnosed type 2 diabetes with deep multi-omics phenotyping from the DIRECT consortium. Using in silico perturbations, we identified drug-omics associations across the multi-modal datasets for the 20 most prevalent drugs given to people with type 2 diabetes with substantially higher sensitivity than univariate statistical tests. From these, we identified, among others, novel associations between metformin and the gut microbiota as well as opposite molecular responses for the two statins, simvastatin and atorvastatin. We used the associations to quantify drug-drug similarities, assess the degree of polypharmacy and conclude that drug effects are distributed across the multi-omics modalities.
Affiliation(s)
- Rosa Lundbye Allesøe
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
  - Copenhagen Research Centre for Mental Health, Mental Health Centre Copenhagen, Copenhagen University Hospital, Copenhagen, Denmark
- Agnete Troen Lundgaard
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Ricardo Hernández Medina
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Alejandro Aguayo-Orozco
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Joachim Johansen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Jakob Nybo Nissen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Caroline Brorsson
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Gianluca Mazzoni
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Lili Niu
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Jorge Hernansanz Biel
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Valentas Brasas
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Henry Webel
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Michael Eriksen Benros
  - Copenhagen Research Centre for Mental Health, Mental Health Centre Copenhagen, Copenhagen University Hospital, Copenhagen, Denmark
  - Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Anders Gorm Pedersen
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Piotr Jaroslaw Chmura
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Ulrik Plesner Jacobsen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Andrea Mari
  - C.N.R. Institute of Neuroscience, Padova, Italy
- Robert Koivula
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Anubha Mahajan
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Ana Vinuela
- Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland.,Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle, UK
| | - Sapna Sharma
- Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Bavaria, Germany.,Institute of Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Bavaria, Germany.,Chair of Food Chemistry and Molecular and Sensory Science, Technical University of Munich, Freising, Germany
| | - Mark Haid
- Metabolomics and Proteomics Core, Helmholtz Zentrum Muenchen, German Research Center for Environmental Health, Neuherberg, Germany
| | - Mun-Gwan Hong
- Affinity Proteomics, Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Solna, Sweden
| | - Petra B Musholt
- Research and Development Global Development, Translational Medicine and Clinical Pharmacology, Sanofi-Aventis Deutschland, Frankfurt, Germany
| | - Federico De Masi
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.,Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Josef Vogt
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Helle Krogh Pedersen
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark.,Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Valborg Gudmundsdottir
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.,Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Angus Jones
- University of Exeter Medical School, Exeter, UK
| | - Gwen Kennedy
- The Immunoassay Biomarker Core Laboratory, School of Medicine, University of Dundee, Dundee, UK
| | - Jimmy Bell
- Research Centre for Optimal Health, Department of Life Sciences, University of Westminster, London, UK
| | - E Louise Thomas
- Research Centre for Optimal Health, Department of Life Sciences, University of Westminster, London, UK
| | - Gary Frost
- Section for Nutrition Research, Faculty of Medicine, Imperial College London, London, UK
| | - Henrik Thomsen
- Department of Radiology, Copenhagen University Hospital Herlev-Gentofte, Herlev, Denmark
| | - Elizaveta Hansen
- Department of Radiology, Copenhagen University Hospital Herlev-Gentofte, Herlev, Denmark
| | - Tue Haldor Hansen
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Henrik Vestergaard
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Mirthe Muilwijk
- Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
| | - Marieke T Blom
- Department of General Practice, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
| | - Leen M 't Hart
- Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.,Department of Biomedical Data Science, Section Molecular Epidemiology, Leiden University Medical Center, Leiden, the Netherlands.,Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, the Netherlands
| | - Francois Pattou
- Inserm, Univ Lille, CHU Lille, Lille Pasteur Institute, EGID, Lille, France
| | - Violeta Raverdy
- Inserm, Univ Lille, CHU Lille, Lille Pasteur Institute, EGID, Lille, France
| | - Soren Brage
- MRC Epidemiology Unit, University of Cambridge School of Clinical Medicine, Cambridge, UK
| | - Tarja Kokkola
- Department of Medicine, University of Eastern Finland, Kuopio, Finland
| | - Alison Heggie
- Institute of Cellular Medicine, Newcastle University, Newcastle, UK
| | - Donna McEvoy
- Diabetes Research Network, Royal Victoria Infirmary, Newcastle, UK
| | - Miranda Mourby
- Centre for Health, Law and Emerging Technologies (HeLEX), Faculty of Law, University of Oxford, Oxford, UK
| | - Jane Kaye
- Centre for Health, Law and Emerging Technologies (HeLEX), Faculty of Law, University of Oxford, Oxford, UK
| | - Martin Ridderstråle
- Lund University Diabetes Centre, Department of Clinical Sciences, Lund University, Malmö, Sweden
| | - Mark Walker
- Translational and Clinical Research Institute, Faculty of Medical Sciences, Newcastle University, Newcastle, UK
| | - Ian Forgie
- Division of Population Health & Genomics, School of Medicine, University of Dundee, Dundee, UK
| | - Giuseppe N Giordano
- Genetic and Molecular Epidemiology Unit, Lund University Diabetes Centre, Department of Clinical Sciences, CRC, Lund University, SUS, Malmö, Sweden
| | - Imre Pavo
- Eli Lilly Regional Operations, Vienna, Austria
| | - Hartmut Ruetten
- Research and Development Global Development, Translational Medicine and Clinical Pharmacology, Sanofi-Aventis Deutschland, Frankfurt, Germany
| | - Oluf Pedersen
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Torben Hansen
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Emmanouil Dermitzakis
- Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
| | - Paul W Franks
- Lund University Diabetes Centre, Department of Clinical Sciences, Lund University, Malmö, Sweden.,Harvard T.H. Chan School of Public Health, Boston, MA, USA.,OCDEM, Radcliffe Department of Medicine, University of Oxford, Oxford, UK
| | - Jochen M Schwenk
- Affinity Proteomics, Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Solna, Sweden
| | - Jerzy Adamski
- Institute of Experimental Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.,Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.,Institute of Biochemistry, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Mark I McCarthy
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK.,Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, UK.,Genentech, South San Francisco, CA, USA
| | - Ewan Pearson
- Division of Population Health & Genomics, School of Medicine, University of Dundee, Dundee, UK
| | - Karina Banasik
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.,Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Simon Rasmussen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.,Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
|
47
|
Dong X, Chowdhury S, Victor U, Li X, Qian L. Semi-Supervised Deep Learning for Cell Type Identification From Single-Cell Transcriptomic Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1492-1505. [PMID: 35536811 DOI: 10.1109/tcbb.2022.3173587] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Cell type identification from single-cell transcriptomic data is a common goal of single-cell RNA sequencing (scRNAseq) data analysis. Deep neural networks have been employed to identify cell types from scRNAseq data with high performance. However, this requires a large number of individual cells with accurate and unbiased type annotations to train the identification models. Unfortunately, labeling the scRNAseq data is cumbersome and time-consuming as it involves manual inspection of marker genes. To overcome this challenge, we propose a semi-supervised learning model, "SemiRNet", which uses unlabeled scRNAseq cells and a limited number of labeled scRNAseq cells to implement cell identification. The proposed model is based on recurrent convolutional neural networks (RCNN) and includes a shared network, a supervised network and an unsupervised network. The proposed model is evaluated on two large-scale single-cell transcriptomic datasets. It is observed that the proposed model is able to achieve encouraging performance by learning on a very limited number of labeled scRNAseq cells together with a large number of unlabeled scRNAseq cells.
|
48
|
Tian T, Zhong C, Lin X, Wei Z, Hakonarson H. Complex hierarchical structures in single-cell genomics data unveiled by deep hyperbolic manifold learning. Genome Res 2023; 33:232-246. [PMID: 36849204 PMCID: PMC10069463 DOI: 10.1101/gr.277068.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Accepted: 01/24/2023] [Indexed: 03/01/2023]
Abstract
With the advances in single-cell sequencing techniques, numerous analytical methods have been developed for delineating cell development. However, most are based on Euclidean space, which would distort the complex hierarchical structure of cell differentiation. Recently, methods acting on hyperbolic space have been proposed to visualize hierarchical structures in single-cell RNA-seq (scRNA-seq) data and have been proven to be superior to methods acting on Euclidean space. However, these methods have fundamental limitations and are not optimized for the highly sparse single-cell count data. To address these limitations, we propose scDHMap, a model-based deep learning approach to visualize the complex hierarchical structures of scRNA-seq data in low-dimensional hyperbolic space. The evaluations on extensive simulation and real experiments show that scDHMap outperforms existing dimensionality-reduction methods in various common analytical tasks as needed for scRNA-seq data, including revealing trajectory branches, batch correction, and denoising the count matrix with high dropout rates. In addition, we extend scDHMap to visualize single-cell ATAC-seq data.
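As a rough illustration of why hyperbolic space suits hierarchical data, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not state which hyperbolic model scDHMap uses, so the choice of the Poincaré ball, the function name `poincare_distance`, and the sample points are assumptions made for illustration only.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Illustrative only: the Poincare ball is assumed here, not taken from the paper.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2)
    # The denominator shrinks toward zero as either point approaches the boundary.
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

# Hypothetical points: two "leaves" near the boundary on different branches.
leaf_a = np.array([0.9, 0.0])
leaf_b = np.array([0.0, 0.9])

hyperbolic = poincare_distance(leaf_a, leaf_b)
euclidean = float(np.linalg.norm(leaf_a - leaf_b))
print(f"hyperbolic={hyperbolic:.3f}, euclidean={euclidean:.3f}")
```

Points near the boundary are far apart in hyperbolic terms even when their Euclidean coordinates are close, which leaves room for the exponentially many leaves of a tree-like differentiation hierarchy.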
Affiliation(s)
- Tian Tian
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Cheng Zhong
- Department of Computer Science, Ying Wu College of Computing, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
| | - Xiang Lin
- Department of Computer Science, Ying Wu College of Computing, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
| | - Zhi Wei
- Department of Computer Science, Ying Wu College of Computing, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
| | - Hakon Hakonarson
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA.,Division of Human Genetics, Department of Pediatrics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
|
49
|
Pandey D, Onkara PP. Improved downstream functional analysis of single-cell RNA-sequence data using DGAN. Sci Rep 2023; 13:1618. [PMID: 36709340 PMCID: PMC9884242 DOI: 10.1038/s41598-023-28952-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/10/2022] [Accepted: 01/27/2023] [Indexed: 01/29/2023] Open
Abstract
The dramatic increase in the number of single-cell RNA-sequence (scRNA-seq) investigations reflects the new capabilities of next-generation sequencing technologies, which facilitate the accurate measurement of tens of thousands of RNA expression levels at cellular resolution. Nevertheless, missing values from RNA amplification persist and remain a significant computational challenge, as these data omissions induce further noise in the cellular data and ultimately impede downstream functional analysis of scRNA-seq data. Consequently, it becomes imperative to develop robust and efficient scRNA-seq data imputation methods for improved downstream functional analysis. To address this, we have designed an imputation framework named deep generative autoencoder network (DGAN). In essence, DGAN is an evolved variational autoencoder designed to robustly impute data dropouts in scRNA-seq data manifested as a sparse gene expression matrix. DGAN accounts for the count distribution as well as data sparsity using a Gaussian model, whereby cell dependencies are exploited to detect and exclude outlier cells via imputation. When tested on five publicly available scRNA-seq datasets, DGAN outperformed every baseline method it was compared with in downstream functional analysis, including cell data visualization, clustering, classification and differential expression analysis. DGAN is implemented in Python and is accessible at https://github.com/dikshap11/DGAN.
Affiliation(s)
- Diksha Pandey
- Department of Biotechnology, National Institute of Technology, Warangal, India
| | - Perumal P Onkara
- Department of Biotechnology, National Institute of Technology, Warangal, India
|
50
|
Xu X, Li X. Structure-preserved dimension reduction using joint triplets sampling for multi-batch integration of single-cell transcriptomic data. Brief Bioinform 2023; 24:6982727. [PMID: 36627114 DOI: 10.1093/bib/bbac608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 11/18/2022] [Accepted: 12/09/2022] [Indexed: 01/12/2023] Open
Abstract
Dimension reduction (DR) plays an important role in single-cell RNA sequencing (scRNA-seq) analysis, supporting data interpretation, visualization and other downstream analyses. A desirable DR method should be applicable to various application scenarios, including identifying cell types, preserving the inherent structure of data and handling batch effects. However, most existing DR methods fail to accommodate these requirements simultaneously, especially the removal of batch effects. In this paper, we develop a novel structure-preserved dimension reduction (SPDR) method using intra- and inter-batch triplet sampling. The constructed triplets jointly consider each anchor's inter-batch mutual nearest neighbors, intra-batch k-nearest neighbors and cells randomly selected from the whole data, which captures higher-order structure information while accounting for the batch information of the data. We then minimize a robust loss function over the chosen triplets to obtain a structure-preserved and batch-corrected low-dimensional representation. Comprehensive evaluations show that SPDR outperforms other competing DR methods, such as INSCT, IVIS, Trimap, Scanorama, scVI and UMAP, in removing batch effects, preserving biological variation, facilitating visualization and improving clustering accuracy. In addition, the two-dimensional (2D) embedding of SPDR presents a clear and authentic expression pattern, and can guide researchers in determining how many cell types should be identified. Furthermore, SPDR is robust to complex data characteristics (such as down-sampling, duplicates and outliers) and to varying hyperparameter settings. We believe that SPDR will be a valuable tool for characterizing complex cellular heterogeneity.
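The idea of minimizing a loss over (anchor, positive, negative) triplets can be sketched as follows. The abstract does not specify SPDR's robust loss, so this example substitutes a standard margin-based hinge loss on squared Euclidean distances. The function name `triplet_hinge_loss`, the `margin` parameter, and the toy embedding are hypothetical, and the triplet sampling step (mutual nearest neighbors, k-nearest neighbors, random cells) is not implemented; the triplets are taken as given.

```python
import numpy as np

def triplet_hinge_loss(embedding, triplets, margin=1.0):
    """Mean hinge loss over (anchor, positive, negative) index triplets.

    Assumption: a plain margin-based triplet loss, not the robust loss used by SPDR.
    embedding: (n_cells, n_dims) array of low-dimensional coordinates.
    triplets:  (n_triplets, 3) integer array of row indices into `embedding`.
    """
    anchors = embedding[triplets[:, 0]]
    positives = embedding[triplets[:, 1]]
    negatives = embedding[triplets[:, 2]]
    d_pos = np.sum((anchors - positives) ** 2, axis=1)  # anchor to similar cell
    d_neg = np.sum((anchors - negatives) ** 2, axis=1)  # anchor to dissimilar cell
    # A triplet contributes zero once the negative is farther than the positive by `margin`.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy 2D embedding: cells 0 and 1 are close together, cell 2 is far away.
embedding = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]])
satisfied = np.array([[0, 1, 2]])  # positive is the nearby cell
violated = np.array([[0, 2, 1]])   # positive is the distant cell

print(triplet_hinge_loss(embedding, satisfied))
print(triplet_hinge_loss(embedding, violated))
```

In a full method the embedding would be updated by gradient descent on a loss of this kind, pulling each anchor toward its sampled neighbors and away from its sampled non-neighbors.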
Affiliation(s)
- Xinyi Xu
- School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, 100081, China
| | - Xiangjie Li
- Changping Laboratory, Beijing, 102206, China
|