1
Jiang H, Wang MN, Huang YA, Huang Y. Graph-Regularized Non-Negative Matrix Factorization for Single-Cell Clustering in scRNA-Seq Data. IEEE J Biomed Health Inform 2024; 28:4986-4994. [PMID: 38787664 DOI: 10.1109/jbhi.2024.3400050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) has brought fresh perspectives on intricate biological processes, revealing the nuances and divergences among distinct cells. Accurate single-cell analysis is a crucial prerequisite for in-depth investigation into the underlying mechanisms of heterogeneity. Owing to various sources of technical noise, such as dropout values, scRNA-seq data remain challenging to interpret. In this work, we propose an unsupervised learning framework for scRNA-seq data analysis (Sc-GNNMF). Based on the non-negativity and sparsity of scRNA-seq data, we employ a graph-regularized non-negative matrix factorization (GNNMF) algorithm, which estimates cell-cell sparse similarity and gene-gene sparse similarity through Laplacian kernels and p-nearest neighbor graphs (p-NNG). By assuming intrinsic geometric local invariance, we use weighted p-nearest known neighbors (p-NKN) to optimize the scRNA-seq data. The optimized scRNA-seq data then participate in the matrix decomposition process, promoting the closeness of cells of similar types in the cell-gene data space and determining a more suitable embedding space for clustering. Experiments on 11 real scRNA-seq datasets show that Sc-GNNMF outperforms other methods while maintaining satisfactory compatibility and robustness. Furthermore, Sc-GNNMF yields excellent results in clustering tasks, extraction of useful gene markers, and pseudo-temporal analysis.
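The abstract combines two ingredients: a sparse cell-cell similarity graph (Laplacian kernel restricted to p nearest neighbours) and a graph-regularized NMF. The sketch below illustrates that general idea with the standard multiplicative updates for graph-regularized NMF; it is not the Sc-GNNMF implementation. The function names `laplacian_pnn_graph` and `gnmf`, the L1-based kernel with a mean-distance bandwidth, the weight `lam`, and the toy Poisson count matrix are all illustrative assumptions, and the gene-gene graph and the p-NKN preprocessing step are omitted.

```python
import numpy as np

def laplacian_pnn_graph(X, p=5):
    """Cell-cell similarity: Laplacian kernel kept only between p-nearest neighbours."""
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise L1 distances
    sigma = dist.mean()                       # bandwidth heuristic (assumption)
    K = np.exp(-dist / sigma)                 # Laplacian kernel values
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :p]      # indices of the p nearest cells
    mask = np.zeros(K.shape, dtype=bool)
    mask[np.repeat(np.arange(len(X)), p), nn.ravel()] = True
    mask |= mask.T                            # symmetrise the neighbour relation
    return np.where(mask, K, 0.0)

def gnmf(X, W, k, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Factor X (cells x genes) as H @ G with a graph penalty lam * tr(H^T L H)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    H = rng.random((n, k))                    # non-negative cell embedding
    G = rng.random((k, m))                    # non-negative gene loadings
    D = np.diag(W.sum(axis=1))                # degree matrix
    L = D - W                                 # graph Laplacian
    history = []
    for _ in range(n_iter):
        G *= (H.T @ X) / (H.T @ H @ G + eps)
        H *= (X @ G.T + lam * W @ H) / (H @ G @ G.T + lam * D @ H + eps)
        obj = np.linalg.norm(X - H @ G) ** 2 + lam * np.trace(H.T @ L @ H)
        history.append(obj)
    return H, G, history

# Toy data standing in for a small count matrix (30 cells, 20 genes).
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(30, 20)).astype(float)
W = laplacian_pnn_graph(X, p=5)
H, G, history = gnmf(X, W, k=3)
```

The rows of `H` can then be passed to any clustering routine; the graph term pulls neighbouring cells toward similar embeddings, which is the role the abstract assigns to the regularizer.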
2
Qiu Y, Yang L, Jiang H, Zou Q. scTPC: a novel semisupervised deep clustering model for scRNA-seq data. Bioinformatics 2024; 40:btae293. [PMID: 38684178 PMCID: PMC11091743 DOI: 10.1093/bioinformatics/btae293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/14/2024] [Accepted: 04/26/2024] [Indexed: 05/02/2024] Open
Abstract
MOTIVATION Continuous advancements in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to further explore the study of cell heterogeneity, trajectory inference, identification of rare cell types, and neurology. Accurate scRNA-seq data clustering is crucial in single-cell sequencing data analysis. However, the high dimensionality, sparsity, and presence of "false" zero values in the data can pose challenges to clustering. Furthermore, current unsupervised clustering algorithms have not effectively leveraged prior biological knowledge, making cell clustering even more challenging. RESULTS This study investigates a semisupervised clustering model called scTPC, which integrates the triplet constraint, pairwise constraint, and cross-entropy constraint based on deep learning. Specifically, the model begins by pretraining a denoising autoencoder based on a zero-inflated negative binomial distribution. Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partial labeled cells. Finally, to address imbalanced cell-type datasets, a weighted cross-entropy loss is introduced to optimize the model. A series of experimental results on 10 real scRNA-seq datasets and five simulated datasets demonstrate that scTPC achieves accurate clustering with a well-designed framework. AVAILABILITY AND IMPLEMENTATION scTPC is a Python-based algorithm, and the code is available from https://github.com/LF-Yang/Code or https://zenodo.org/records/10951780.
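The three loss terms named in the abstract can be illustrated with generic textbook forms. The sketch below is plain Python and makes several assumptions that the abstract does not confirm: a squared-distance triplet margin loss, a contrastive-style pairwise loss for must-link and cannot-link pairs, and inverse-frequency class weights for the cross-entropy. The function names and the `margin` default are hypothetical; the actual scTPC formulations live in the authors' repository.

```python
import math
from collections import Counter

def sq_dist(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Push the anchor closer to the positive than to the negative by a margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def pairwise_loss(z_i, z_j, must_link, margin=1.0):
    """Pull must-link pairs together; push cannot-link pairs at least margin apart."""
    d = math.sqrt(sq_dist(z_i, z_j))
    return d ** 2 if must_link else max(0.0, margin - d) ** 2

def class_weights(labels):
    """Inverse-frequency weights so that rare cell types count more (assumption)."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {k: n / (c * v) for k, v in counts.items()}

def weighted_cross_entropy(probs, labels):
    """Mean of -w[y] * log p[y] over the labelled cells."""
    w = class_weights(labels)
    total = sum(w[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return -total / len(labels)
```

In a deep clustering setup, terms like these would be summed with the autoencoder reconstruction loss and minimised over the labelled subset of cells; the relative weighting of the terms is a tuning choice not specified here.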
Affiliation(s)
- Yushan Qiu
  - School of Mathematical Sciences, Shenzhen University, Shenzhen, Guangdong 518000, China
- Lingfei Yang
  - School of Mathematical Sciences, Shenzhen University, Shenzhen, Guangdong 518000, China
- Hao Jiang
  - School of Mathematics, Renmin University of China, Haidian District, Beijing 100872, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610056, China
3
Wang H, Liu Z, Ma X. Learning Consistency and Specificity of Cells From Single-Cell Multi-Omic Data. IEEE J Biomed Health Inform 2024; 28:3134-3145. [PMID: 38709615 DOI: 10.1109/jbhi.2024.3370868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Advances in single-cell technologies make it possible to measure epigenomic and transcriptomic profiles concomitantly at the cell level, providing opportunities to explore the underlying biological mechanisms. Although significant effort has been devoted to them, the integrative analysis of single-cell multi-omic data remains challenging because of the heterogeneity, complicated coupling, and interpretability of the data. To handle these issues, we propose a novel self-representation Learning-based Multi-omics data Integrative Clustering algorithm (sLMIC) for the integration of single-cell epigenomic profiles (DNA methylation or scATAC-seq) and transcriptomic profiles (scRNA-seq), in which the consistent and specific features of cells are explicitly extracted to facilitate cell clustering. Specifically, sLMIC constructs a graph for each type of single-cell data, thereby transforming omics data into multi-layer networks, which effectively removes heterogeneity of omic data. Then, sLMIC employs low-rank and exclusivity constraints to separate the self-representation of cells into two parts, i.e., the shared and specific features, which explicitly characterize the consistency and diversity of omic data, providing an effective strategy to model the structure of cell types. Feature extraction and cell clustering are jointly formulated as an overall objective function, where latent features of data are obtained under the guidance of cell clustering. Extensive experimental results on 13 single-cell multi-omics datasets from diverse organisms and tissues indicate that sLMIC noticeably outperforms advanced algorithms on various measures.
4
Wang H, Zhang W, Ma X. Contrastive and adversarial regularized multi-level representation learning for incomplete multi-view clustering. Neural Netw 2024; 172:106102. [PMID: 38219677 DOI: 10.1016/j.neunet.2024.106102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2023] [Revised: 11/20/2023] [Accepted: 01/04/2024] [Indexed: 01/16/2024]
Abstract
Incomplete multi-view clustering is a significant task in machine learning, given that complex systems in nature and society cannot be fully observed; it provides an opportunity to exploit the structure and functions of underlying systems. Current algorithms are criticized for failing either to balance data restoration and clustering or to capture the consistency of the representation of various views. To address these problems, a novel Multi-level Representation Learning Contrastive and Adversarial Learning (aka MRL_CAL) for incomplete multi-view clustering is proposed, in which data restoration, consistent representation, and clustering are jointly learned by exploiting features in various subspaces. Specifically, MRL_CAL employs v auto-encoder to obtain a low-level specific-view representation of instances, which restores data by estimating the distribution of the original incomplete data with adversarial learning. Then, MRL_CAL extracts a high-level representation of instances, in which the consistency of various views and labels of clusters is incorporated with contrastive learning. In this case, MRL_CAL simultaneously learns multi-level features of instances in various subspaces, which not only overcomes the confliction of representations but also improves the quality of features. Finally, MRL_CAL transforms incomplete multi-view clustering into an overall objective, where features are learned under the guidance of clustering. Extensive experimental results indicate that MRL_CAL outperforms state-of-the-art algorithms in terms of various measurements, implying that the proposed method is promising for incomplete multi-view clustering.
Affiliation(s)
- Haiyue Wang
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, 710071, China
- Wensheng Zhang
  - School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, 510006, China
- Xiaoke Ma
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, 710071, China
5
Yu L, Liu C, Yang JYH, Yang P. Ensemble deep learning of embeddings for clustering multimodal single-cell omics data. Bioinformatics 2023; 39:btad382. [PMID: 37314966 PMCID: PMC10287920 DOI: 10.1093/bioinformatics/btad382] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/13/2023] [Revised: 04/16/2023] [Accepted: 06/12/2023] [Indexed: 06/16/2023] Open
Abstract
MOTIVATION Recent advances in multimodal single-cell omics technologies enable multiple modalities of molecular attributes, such as gene expression, chromatin accessibility, and protein abundance, to be profiled simultaneously at a global level in individual cells. While the increasing availability of multiple data modalities is expected to provide a more accurate clustering and characterization of cells, the development of computational methods that are capable of extracting information embedded across data modalities is still in its infancy. RESULTS We propose SnapCCESS for clustering cells by integrating data modalities in multimodal single-cell omics data using an unsupervised ensemble deep learning framework. By creating snapshots of embeddings of multimodality using variational autoencoders, SnapCCESS can be coupled with various clustering algorithms for generating consensus clustering of cells. We applied SnapCCESS with several clustering algorithms to various datasets generated from popular multimodal single-cell omics technologies. Our results demonstrate that SnapCCESS is effective and more efficient than conventional ensemble deep learning-based clustering methods and outperforms other state-of-the-art multimodal embedding generation methods in integrating data modalities for clustering cells. The improved clustering of cells from SnapCCESS will pave the way for more accurate characterization of cell identity and types, an essential step for various downstream analyses of multimodal single-cell omics data. AVAILABILITY AND IMPLEMENTATION SnapCCESS is implemented as a Python package and is freely available from https://github.com/PYangLab/SnapCCESS under the open-source license of GPL-3. The data used in this study are publicly available (see section 'Data availability').
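The abstract describes clustering each embedding snapshot and then forming a consensus, without stating how the consensus is computed. One common generic approach is evidence accumulation: count how often two cells share a cluster across runs, then link pairs that co-occur in more than half of them. The sketch below shows that approach in plain Python; it is an assumption for illustration, not the SnapCCESS procedure, and the names `co_association` and `consensus_clusters` and the `threshold` default are hypothetical.

```python
def co_association(labelings):
    """Fraction of clusterings in which each pair of cells shares a cluster."""
    n = len(labelings[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / len(labelings)
    return C

def consensus_clusters(labelings, threshold=0.5):
    """Merge cells whose co-association exceeds the threshold (union-find)."""
    C = co_association(labelings)
    n = len(C)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] > threshold:
                parent[find(j)] = find(i)

    ids = {}  # map each root to a consecutive cluster id
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

# Three clusterings of five cells, e.g. one per embedding snapshot.
runs = [[0, 0, 1, 1, 2], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1]]
consensus = consensus_clusters(runs)
```

Cluster labels need not agree across runs, since only co-membership is counted; this is what lets snapshots clustered independently be combined.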
Affiliation(s)
- Lijia Yu
  - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Chunlei Liu
  - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Jean Yee Hwa Yang
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D4H), Hong Kong Science Park, Hong Kong SAR, China
- Pengyi Yang
  - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D4H), Hong Kong Science Park, Hong Kong SAR, China
6
Tang Z, Zhang T, Yang B, Su J, Song Q. spaCI: deciphering spatial cellular communications through adaptive graph model. Brief Bioinform 2023; 24:bbac563. [PMID: 36545790 PMCID: PMC9851335 DOI: 10.1093/bib/bbac563] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 09/05/2022] [Revised: 10/26/2022] [Accepted: 11/18/2022] [Indexed: 12/24/2022] Open
Abstract
Cell-cell communications are vital for biological signalling and play important roles in complex diseases. Recent advances in single-cell spatial transcriptomics (SCST) technologies allow the spatial cell communication landscape to be examined and hold promise for disentangling the complex ligand-receptor (L-R) interactions across cells. However, because of frequent dropout events and noisy signals in SCST data, accurately inferring cellular communications is challenging, and effective, tailored methods are lacking. Herein, to decipher cell-to-cell communications from SCST profiles, we propose a novel adaptive graph model with attention mechanisms named spaCI. spaCI incorporates both the spatial locations and the gene expression profiles of cells to identify the active L-R signalling axis across neighbouring cells. In benchmarks against currently available methods, spaCI shows superior performance on both simulated data and real SCST datasets. Furthermore, spaCI is able to identify the upstream transcriptional factors mediating the active L-R interactions. For biological insights, we have applied spaCI to the seqFISH+ data of mouse cortex and the NanoString CosMx Spatial Molecular Imager (SMI) data of non-small cell lung cancer samples. spaCI reveals hidden L-R interactions in the sparse seqFISH+ data and identifies inconspicuous L-R interactions, including THBS1-ITGB1 between fibroblasts and tumours, in the NanoString CosMx SMI data. spaCI further reveals that SMAD3 plays an important role in regulating the crosstalk between fibroblasts and tumours, which contributes to the prognosis of lung cancer patients. Collectively, spaCI addresses the challenges in interrogating SCST data for gaining insights into the underlying cellular communications, thereby facilitating the discovery of disease mechanisms, effective biomarkers and therapeutic targets.
Affiliation(s)
- Ziyang Tang
  - Department of Computer and Information Technology, Purdue University, Indiana, USA
- Tonglin Zhang
  - Department of Statistics, Purdue University, Indiana, USA
- Baijian Yang
  - Department of Computer and Information Technology, Purdue University, Indiana, USA
- Jing Su
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indiana, USA
- Qianqian Song
  - Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Atrium Health Wake Forest Baptist, Winston Salem, NC, USA
  - Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC, USA
7
Chuwdhury GS, Ng IOL, Ho DWH. scAnalyzeR: A Comprehensive Software Package With Graphical User Interface for Single-Cell RNA Sequencing Analysis and its Application on Liver Cancer. Technol Cancer Res Treat 2022; 21:15330338221142729. [PMID: 36476060 PMCID: PMC9742707 DOI: 10.1177/15330338221142729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022] Open
Abstract
Introduction: The application of single-cell RNA sequencing to delineate tissue heterogeneity and complexity has become increasingly popular. Given its tremendous resolution and high-dimensional capacity for in-depth investigation, single-cell RNA sequencing offers unprecedented research power. Although some popular software packages are available for single-cell RNA sequencing data analysis and visualization, using them remains a major challenge, as they provide only a command-line interface and require a significant level of bioinformatics skill. Methods: We have developed scAnalyzeR, a single-cell RNA sequencing analysis pipeline with an interactive and user-friendly graphical interface for analyzing and visualizing single-cell RNA sequencing data. It accepts single-cell RNA sequencing data from various technology platforms and different model organisms (human and mouse) and allows flexibility in input file format. It provides functionalities for data preprocessing, quality control, basic summary statistics, dimension reduction, unsupervised clustering, differential gene expression, gene set enrichment analysis, correlation analysis, pseudotime cell trajectory inference, and various visualization plots. It also provides default parameters for easy usage and allows a wide range of flexibility and optimization by accepting user-defined options. It has been developed as a docker image that can be run in any docker-supported environment, including Linux, Mac, and Windows, without installing any dependencies. Results: We compared the performance of scAnalyzeR with 2 other graphical tools that are popular for analyzing single-cell RNA sequencing data. The comparison was based on the comprehensiveness of functionalities, ease of usage and flexibility, and execution time. In general, scAnalyzeR outperformed the other tested counterparts in various aspects, demonstrating its superior overall performance. To illustrate the usefulness of scAnalyzeR in cancer research, we analyzed an in-house liver cancer single-cell RNA sequencing dataset. Liver cancer tumor cells were revealed to have multiple subpopulations with distinctive gene expression signatures. Conclusion: scAnalyzeR has comprehensive functionalities and demonstrated usability. We anticipate that more functionalities will be added in future development.
Affiliation(s)
- GS Chuwdhury
  - Department of Pathology and State Key Laboratory of Liver Research, The University of Hong Kong, Hong Kong
- Irene Oi-Lin Ng
  - Department of Pathology and State Key Laboratory of Liver Research, The University of Hong Kong, Hong Kong
- Daniel Wai-Hung Ho
  - Department of Pathology and State Key Laboratory of Liver Research, The University of Hong Kong, Hong Kong