1. Jiang H, Wang MN, Huang YA, Huang Y. Graph-Regularized Non-Negative Matrix Factorization for Single-Cell Clustering in scRNA-Seq Data. IEEE J Biomed Health Inform 2024; 28:4986-4994. [PMID: 38787664 DOI: 10.1109/jbhi.2024.3400050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) has brought fresh perspectives on intricate biological processes, revealing the nuances and divergences among distinct cells. Accurate single-cell analysis is a crucial prerequisite for in-depth investigation into the underlying mechanisms of heterogeneity. Because of various sources of technical noise, such as dropout values, scRNA-seq data remain challenging to interpret. In this work, we propose an unsupervised learning framework for scRNA-seq data analysis, called Sc-GNNMF. Based on the non-negativity and sparsity of scRNA-seq data, we employ a graph-regularized non-negative matrix factorization (GNNMF) algorithm, which involves estimating cell-cell sparse similarity and gene-gene sparse similarity through Laplacian kernels and p-nearest neighbor graphs (p-NNG). By assuming intrinsic geometric local invariance, we use weighted p-nearest known neighbors (p-NKN) to optimize the scRNA-seq data. The optimized scRNA-seq data then participate in the matrix decomposition process, promoting the closeness of cells of similar types in the cell-gene data space and determining a more suitable embedding space for clustering. Experiments on 11 real scRNA-seq datasets show that Sc-GNNMF outperforms the compared methods while maintaining satisfactory compatibility and robustness. Furthermore, Sc-GNNMF yields excellent results in clustering tasks, extracting useful gene markers, and pseudo-temporal analysis.
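The graph-regularized factorization named in the abstract above can be illustrated with a minimal sketch. It uses the standard multiplicative updates for graph-regularized NMF on a k-nearest-neighbour cell graph; it is not the Sc-GNNMF implementation, and it leaves out the Laplacian kernels, the gene-gene graph, and the p-NKN preprocessing. The function names, the binary kNN weights, and the parameters `k`, `lam`, and `n_iter` are assumptions made for illustration only.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric binary k-nearest-neighbour adjacency over the rows of X."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                      # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]               # k closest rows for each row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                         # symmetrise

def gnmf(X, rank, W, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Approximately minimise ||X - U V^T||_F^2 + lam * tr(V^T L V), U, V >= 0.

    X is a non-negative genes x cells matrix, W a cells x cells adjacency,
    and L = D - W is the graph Laplacian of W.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, rank))
    V = rng.random((n, rank))
    D = np.diag(W.sum(axis=1))
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V

# Toy usage on random counts (hypothetical data, 50 genes x 30 cells).
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(50, 30)).astype(float)
W = knn_graph(X.T, k=5)            # graph over cells
U, V = gnmf(X, rank=3, W=W)
labels = V.argmax(axis=1)          # crude cluster assignment from the cell factor
```

The graph term pulls the low-dimensional rows of `V` for neighbouring cells toward each other, which is the sense in which similar cells end up close in the embedding; in practice the rows of `V` would usually be passed to a clustering algorithm rather than assigned by `argmax`.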
2. Qiao TJ, Li F, Yuan SS, Dai LY, Wang J. A Fusion Learning Model Based on Deep Learning for Single-Cell RNA Sequencing Data Clustering. J Comput Biol 2024; 31:576-588. [PMID: 38758925 DOI: 10.1089/cmb.2024.0512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technology provides a means for studying biology from a cellular perspective. The fundamental goal of scRNA-seq data analysis is to discriminate single-cell types using unsupervised clustering. Despite the recent surge of proposed methods, few single-cell clustering algorithms take into account both deep and surface information. Consequently, this article constructs a fusion learning framework based on deep learning, namely scGASI. To learn a clustering similarity matrix, scGASI integrates data affinity recovery and deep feature embedding in a unified scheme based on various top feature sets. Next, scGASI learns the low-dimensional latent representation underlying the data using a graph autoencoder to mine the hidden information residing in the data. To efficiently merge the surface information from the raw space with the deeper latent information from the underlying space, we then construct a fusion learning model based on self-expression. scGASI uses this fusion learning model to learn the similarity matrix of each individual feature set as well as the clustering similarity matrix of all feature sets. Lastly, gene marker identification, visualization, and clustering are accomplished using the clustering similarity matrix. Extensive verification on real data sets demonstrates that scGASI outperforms many widely used clustering techniques in terms of clustering accuracy.
Affiliation(s)
- Tian-Jing Qiao
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Sha-Sha Yuan
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Ling-Yun Dai
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Juan Wang
  - School of Computer Science, Qufu Normal University, Rizhao, China
3. Cai X, Zhang W, Zheng X, Xu Y, Li Y. scEM: A New Ensemble Framework for Predicting Cell Type Composition Based on scRNA-Seq Data. Interdiscip Sci 2024; 16:304-317. [PMID: 38368575 DOI: 10.1007/s12539-023-00601-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 02/19/2024]
Abstract
With the advent of single-cell RNA sequencing (scRNA-seq) technology, a large amount of scRNA-seq data has become available, providing an unprecedented opportunity to explore cellular composition and heterogeneity. Recently, many computational algorithms for predicting cell type composition have been developed, and these methods are typically evaluated on different datasets and performance metrics using diverse techniques. Consequently, the lack of a comprehensive and standardized comparative analysis makes it difficult to gain a clear understanding of the strengths and weaknesses of these methods. To address this gap, we reviewed 20 cutting-edge unsupervised cell type identification methods and evaluated them comprehensively using 24 real scRNA-seq datasets of varying scales. In addition, we proposed a new ensemble cell-type identification method, named scEM, which learns a consensus similarity matrix by applying the entropy weight method to four selected representative methods. The Louvain algorithm is adopted to obtain the final classification of individual cells based on the consensus matrix. Extensive evaluation and comparison with 11 other similarity-based methods on real scRNA-seq datasets demonstrate that the newly developed ensemble algorithm scEM is effective in predicting cell type composition.
Affiliation(s)
- Xianxian Cai
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Wei Zhang
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Xiaoying Zheng
  - Operations Research and Planning Department, Naval University of Engineering, Wuhan, 430033, China
- Yaxin Xu
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Yuanyuan Li
  - School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan, China
4. Cottrell S, Hozumi Y, Wei GW. K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis. Comput Biol Med 2024; 175:108497. [PMID: 38678944 PMCID: PMC11090715 DOI: 10.1016/j.compbiomed.2024.108497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 04/08/2024] [Accepted: 04/21/2024] [Indexed: 05/01/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L2,1-norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses the many limitations of traditional persistent homology. Rather than inducing filtration by varying a distance threshold, we introduce kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets and show that our methods outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF), by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering in the ARI metric.
Affiliation(s)
- Sean Cottrell
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Yuta Hozumi
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
5. Wang GF, Shen L. Cauchy hyper-graph Laplacian nonnegative matrix factorization for single-cell RNA-sequencing data analysis. BMC Bioinformatics 2024; 25:169. [PMID: 38684942 PMCID: PMC11059750 DOI: 10.1186/s12859-024-05797-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 04/24/2024] [Indexed: 05/02/2024] Open
Abstract
Many important biological discoveries have been made as single-cell RNA sequencing (scRNA-seq) technology has advanced. With this technology, it is now possible to investigate the connections among individual cells, genes, and diseases. Clustering is frequently used for the analysis of single-cell data. Nevertheless, biological data usually contain a large amount of noise, and traditional clustering methods are sensitive to noise. Moreover, acquiring higher-order spatial information from the data alone is insufficient. As a result, obtaining trustworthy clustering results is challenging. We propose Cauchy hyper-graph Laplacian non-negative matrix factorization (CHLNMF) as a novel approach to address these issues. In CHLNMF, we replace the Euclidean-distance-based measurement of conventional non-negative matrix factorization (NMF) with the Cauchy loss function (CLF), which can lessen the influence of noise. The model also incorporates a hyper-graph constraint, which takes into account the high-order relationships among the samples. The optimal solution of the CHLNMF model is then found using a half-quadratic optimization approach. Finally, using seven scRNA-seq datasets, we compare CHLNMF with nine other leading methods. Analysis of the experimental results establishes the validity of our technique.
Affiliation(s)
- Gao-Fei Wang
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, Shandong, China
- Longying Shen
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, Shandong, China
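The CHLNMF abstract above states that replacing the Euclidean measurement with a Cauchy loss lessens the influence of noise. The sketch below illustrates why, by comparing a squared residual with one common form of the Cauchy loss, ln(1 + (r/c)^2), and its derivative. The scale parameter `c`, the exact form of the loss, and the function names are assumptions for illustration; the paper's own formulation may differ.

```python
import numpy as np

def squared_loss(r):
    """Squared residual, the per-entry term of the Euclidean (Frobenius) objective."""
    return r ** 2

def cauchy_loss(r, c=1.0):
    """One common form of the Cauchy loss: ln(1 + (r / c)^2)."""
    return np.log1p((r / c) ** 2)

def cauchy_influence(r, c=1.0):
    """Derivative of cauchy_loss with respect to r: 2 r / (c^2 + r^2)."""
    return 2.0 * r / (c ** 2 + r ** 2)

# Residuals from small to very large (hypothetical values).
residuals = np.array([0.1, 1.0, 10.0, 100.0])
print("squared loss:    ", squared_loss(residuals))
print("cauchy loss:     ", cauchy_loss(residuals))
print("cauchy influence:", cauchy_influence(residuals))
```

The squared loss grows quadratically, so a single large residual dominates the objective, while the Cauchy loss grows only logarithmically and its derivative shrinks toward zero for residuals much larger than `c`. That bounded influence is the usual reason for calling the loss robust to outliers.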
6. Hozumi Y, Tanemura KA, Wei GW. Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection. J Chem Inf Model 2024; 64:2829-2838. [PMID: 37402705 PMCID: PMC11009150 DOI: 10.1021/acs.jcim.3c00674] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/06/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing the downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method, for the first time. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring the knowledge of the true labels. The R-S plot provides a unique alternative to the uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
Affiliation(s)
- Yuta Hozumi
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Kiyoto Aramis Tanemura
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
7. Ren L, Wang J, Li W, Guo M, Yu G. Single-cell RNA-seq data clustering by deep information fusion. Brief Funct Genomics 2024; 23:128-137. [PMID: 37208992 DOI: 10.1093/bfgp/elad017] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/26/2022] [Revised: 02/13/2023] [Indexed: 05/21/2023] Open
Abstract
Determining cell types from single-cell transcriptomics data is fundamental for downstream analysis. However, cell clustering and data imputation still face computational challenges due to the high dropout rate, sparsity, and dimensionality of single-cell data. Although some deep-learning-based solutions have been proposed to handle these challenges, they still cannot leverage gene attribute information and cell topology in a sensible way to obtain a consistent clustering. In this paper, we present scDeepFC, a deep information fusion-based single-cell data clustering method for cell clustering and data imputation. Specifically, scDeepFC uses a deep auto-encoder (DAE) network and a deep graph convolution network to embed high-dimensional gene attribute information and high-order cell-cell topological information into different low-dimensional representations, and then fuses them to generate a more comprehensive and accurate consensus representation via a deep information fusion network. In addition, scDeepFC integrates the zero-inflated negative binomial (ZINB) model into the DAE to model dropout events. By jointly optimizing the ZINB loss and the cell graph reconstruction loss, scDeepFC generates a salient embedding representation for clustering cells and imputing missing data. Extensive experiments on real single-cell datasets show that scDeepFC outperforms other popular single-cell analysis methods. Both the gene attribute and the cell topology information can improve cell clustering.
Affiliation(s)
- Liangrui Ren
  - School of Software, Shandong University, 250101 Ji'nan, China
- Jun Wang
  - Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, 250101 Ji'nan, China
- Wei Li
  - School of Control Science and Engineering, Shandong University, 250061 Ji'nan, China
- Maozu Guo
  - College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, 100044, Bei'jing, China
- Guoxian Yu
  - School of Software, Shandong University, 250101 Ji'nan, China
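The scDeepFC abstract above mentions a zero-inflated negative binomial (ZINB) loss for modelling dropout events. As a minimal sketch of what such a loss computes for a single count, the code below evaluates the ZINB negative log-likelihood under the common mean/dispersion parameterisation. It is not the scDeepFC code; the function names and the parameter names `mu`, `theta`, and `pi` are assumptions for illustration, and in an autoencoder these three values would be predicted per gene and cell by the decoder.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial.

    mu > 0 is the mean and theta > 0 the dispersion (variance = mu + mu^2 / theta).
    """
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated negative binomial.

    pi (0 <= pi < 1) is the probability of a structural zero, i.e. a dropout.
    """
    log_nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero comes either from dropout or from the negative binomial itself.
        return -math.log(pi + (1.0 - pi) * math.exp(log_nb))
    # A positive count can only come from the negative binomial component.
    return -(math.log1p(-pi) + log_nb)

# Hypothetical parameters: the same count is scored with and without zero inflation.
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.3))
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.0))
print(zinb_nll(5, mu=2.0, theta=1.5, pi=0.3))
```

With `pi > 0` an observed zero is cheaper than under the plain negative binomial, which is how the model absorbs excess zeros instead of forcing the mean `mu` downward.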
8. Xu Y, Zhang W, Zheng X, Cai X. Combining Global-Constrained Concept Factorization and a Regularized Gaussian Graphical Model for Clustering Single-Cell RNA-seq Data. Interdiscip Sci 2024; 16:1-15. [PMID: 37815679 DOI: 10.1007/s12539-023-00587-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 09/14/2023] [Accepted: 09/17/2023] [Indexed: 10/11/2023]
Abstract
Single-cell RNA sequencing technology is one of the most cost-effective ways to uncover transcriptomic heterogeneity. With the rapid rise of this technology, enormous amounts of scRNA-seq data have been produced. Due to the high dimensionality, noise, sparsity, and missing features of the available scRNA-seq data, accurately clustering the data for downstream analysis is a significant challenge. Many computational methods have been designed to address this issue; nevertheless, the efficacy of the available methods is still inadequate. In addition, most similarity-based methods require the number of clusters as input, which is difficult to determine in real applications. In this study, we developed a novel computational method, named GCFG, for clustering scRNA-seq data by considering both global and local information. This method characterizes the global properties of the data by applying concept factorization, and a regularized Gaussian graphical model is used to evaluate the local embedding relationship of the data. To learn the cell-cell similarity matrix, we integrated the two components and developed an iterative optimization algorithm. The categorization of single cells is obtained by applying Louvain, a modularity-based community discovery algorithm, to the similarity matrix. The performance of GCFG is assessed on 14 real scRNA-seq datasets in terms of ACC and ARI, and comparison with 17 other competitive methods suggests that GCFG is effective and robust.
Affiliation(s)
- Yaxin Xu
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Wei Zhang
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Xiaoying Zheng
  - Operations Research and Planning Department, Naval University of Engineering, Wuhan, 430033, China
- Xianxian Cai
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
9. Zhang DJ, Gao YL, Zhao JX, Zheng CH, Liu JX. A New Graph Autoencoder-Based Consensus-Guided Model for scRNA-seq Cell Type Detection. IEEE Trans Neural Netw Learn Syst 2024; 35:2473-2483. [PMID: 35857730 DOI: 10.1109/tnnls.2022.3190289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology is famous for providing a microscopic view to help capture cellular heterogeneity. This characteristic has advanced the field of genomics by enabling the delicate differentiation of cell types. However, the properties of single-cell datasets, such as high dropout events, noise, and high dimensionality, are still a research challenge in the single-cell field. To utilize single-cell data more efficiently and to better explore the heterogeneity among cells, a new graph autoencoder (GAE)-based consensus-guided model (scGAC) is proposed in this article. The data are preprocessed into multiple top-level feature datasets. Then, feature learning is performed by using GAEs to generate new feature matrices, followed by similarity learning based on distance fusion methods. The learned similarity matrices are fed back to the GAEs to guide their feature learning process. Finally, the abovementioned steps are iterated continuously to integrate the final consistent similarity matrix and perform other related downstream analyses. The scGAC model can accurately identify critical features and effectively preserve the internal structure of the data. This can further improve the accuracy of cell type identification.
10. Lan W, Liu M, Chen J, Ye J, Zheng R, Zhu X, Peng W. JLONMFSC: Clustering scRNA-seq data based on joint learning of non-negative matrix factorization and subspace clustering. Methods 2024; 222:1-9. [PMID: 38128706 DOI: 10.1016/j.ymeth.2023.11.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 11/07/2023] [Accepted: 11/29/2023] [Indexed: 12/23/2023] Open
Abstract
The development of single-cell RNA sequencing (scRNA-seq) has provided new perspectives for studying biological problems at the single-cell level. One of the key issues in scRNA-seq data analysis is to divide cells into several clusters in order to discover the heterogeneity and diversity of cells. However, existing scRNA-seq data are high-dimensional, sparse, and noisy, which challenges existing single-cell clustering methods. In this study, we propose a joint learning framework (JLONMFSC) for clustering scRNA-seq data. In our method, the dimension of the original data is reduced to minimize the effect of noise. In addition, graph-regularized matrix factorization is used to learn the local features. Further, Low-Rank Representation (LRR) subspace clustering is used to learn the global features. Finally, joint learning of the local and global features is performed to obtain the clustering results. We compare the proposed algorithm with eight state-of-the-art algorithms for clustering performance on six datasets, and the experimental results demonstrate that JLONMFSC achieves better performance on all datasets. The code is available at https://github.com/lanbiolab/JLONMFSC.
Affiliation(s)
- Wei Lan
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
  - Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning, China
- Mingyang Liu
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Jianwei Chen
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Jin Ye
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Ruiqing Zheng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Xiaoshu Zhu
  - School of Computer Science and Information Security, Guilin University of Science and Technology, Guilin, China
- Wei Peng
  - School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
11. Chen Y, Zheng R, Liu J, Li M. scMLC: an accurate and robust multiplex community detection method for single-cell multi-omics data. Brief Bioinform 2024; 25:bbae101. [PMID: 38493339 PMCID: PMC10944569 DOI: 10.1093/bib/bbae101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Revised: 01/03/2024] [Accepted: 02/15/2024] [Indexed: 03/18/2024] Open
Abstract
Clustering cells based on single-cell multi-modal sequencing technologies provides an unprecedented opportunity to create high-resolution cell atlases, reveal critical cellular states, and study health and disease. However, effectively integrating different sequencing data for cell clustering remains a challenging task. Motivated by the successful application of Louvain to scRNA-seq data, we propose a single-cell multi-modal Louvain clustering framework, called scMLC, to tackle this problem. scMLC builds multiplex single- and cross-modal cell-to-cell networks to capture modality-specific and consistent information between modalities and then adopts a robust multiplex community detection method to obtain reliable cell clusters. In comparison with 15 state-of-the-art clustering methods on seven real datasets that simultaneously measure gene expression and chromatin accessibility, scMLC achieves better accuracy and stability on most datasets. Synthetic results also indicate that the cell-network-based integration strategy for multi-omics data is superior to other strategies in terms of generalization. Moreover, scMLC is flexible and can be extended to single-cell sequencing data with more than two modalities.
Affiliation(s)
- Yuxuan Chen
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jin Liu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
12. Li J, Wang Y. nPCA: a linear dimensionality reduction method using a multilayer perceptron. Front Genet 2024; 14:1290447. [PMID: 38259616 PMCID: PMC10800564 DOI: 10.3389/fgene.2023.1290447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 12/13/2023] [Indexed: 01/24/2024] Open
Abstract
Background: Linear dimensionality reduction techniques are widely used in many applications. The goal of dimensionality reduction is to eliminate noise from the data and extract its main features. Several dimension reduction methods have been developed, such as linear principal component analysis (PCA), nonlinear t-distributed stochastic neighbor embedding (t-SNE), and the deep-learning-based autoencoder (AE). However, PCA only determines the projection direction with the highest variance, t-SNE is sometimes only suitable for visualization, and AE and nonlinear methods discard the linear projection. Results: To retain the linear projection of the raw data and generate a better dimension reduction result for either visualization or downstream analysis, we present neural principal component analysis (nPCA), an unsupervised deep learning approach capable of retaining richer information from the raw data as a promising improvement to PCA. To evaluate the performance of the nPCA algorithm, we compare performance on 10 public datasets and 6 single-cell RNA sequencing (scRNA-seq) datasets of the pancreas, benchmarking our method against other classic linear dimensionality reduction methods. Conclusion: We concluded that nPCA is a competitive alternative method for dimensionality reduction tasks.
Affiliation(s)
- Juzeng Li
  - Ministry of Education Key Laboratory of Contemporary Anthropology, Department of Anthropology and Human Genetics, School of Life Sciences, Fudan University, Shanghai, China
- Yi Wang
  - Ministry of Education Key Laboratory of Contemporary Anthropology, Department of Anthropology and Human Genetics, School of Life Sciences, Fudan University, Shanghai, China
  - Human Phenome Institute, Fudan University, Shanghai, China
13. Wang TG, Shang JL, Liu JX, Li F, Yuan S, Wang J. Joint L2,p-norm and random walk graph constrained PCA for single-cell RNA-seq data. Comput Methods Biomech Biomed Engin 2024; 27:498-511. [PMID: 36912759 DOI: 10.1080/10255842.2023.2188106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2022] [Accepted: 03/02/2023] [Indexed: 03/14/2023]
Abstract
The development and widespread utilization of high-throughput sequencing technologies in biology has fueled the rapid growth of single-cell RNA sequencing (scRNA-seq) data over the past decade. The development of scRNA-seq technology has significantly expanded researchers' understanding of cellular heterogeneity. Accurate cell type identification is the prerequisite for any research on heterogeneous cell populations. However, due to the high noise and high dimensionality of scRNA-seq data, improving the effectiveness of cell type identification remains a challenge. As an effective dimensionality reduction method, Principal Component Analysis (PCA) is an essential tool for visualizing high-dimensional scRNA-seq data and identifying cell subpopulations. However, traditional PCA has some defects when used to mine the nonlinear manifold structure of the data and usually suffers from over-density of principal components (PCs). Therefore, we present a novel method in this paper called joint L2,p-norm and random walk graph constrained PCA (RWPPCA). RWPPCA aims to retain the data's local information while mapping high-dimensional data to a low-dimensional space, to obtain sparse principal components more accurately, and then to identify cell types more precisely. Specifically, RWPPCA combines the random walk (RW) algorithm with graph regularization to more accurately determine the local geometric relationships between data points. Moreover, to mitigate the adverse effects of dense PCs, the L2,p-norm is introduced to make the PCs sparser, thus increasing their interpretability. We then evaluate the effectiveness of RWPPCA on simulated data and scRNA-seq data. The results show that RWPPCA performs well in cell type identification and outperforms the other methods compared.
Affiliation(s)
- Tai-Ge Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jun-Liang Shang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Shasha Yuan
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Juan Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
14. Zheng R, Xu Z, Zeng Y, Wang E, Li M. SPIDE: A single cell potency inference method based on the local cell-specific network entropy. Methods 2023; 220:90-97. [PMID: 37952704 DOI: 10.1016/j.ymeth.2023.11.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Revised: 10/25/2023] [Accepted: 11/06/2023] [Indexed: 11/14/2023] Open
Abstract
For a given single-cell RNA-seq dataset, it is critical to pinpoint key cellular stages and quantify cells' differentiation potency along a differentiation pathway in a time-course manner. Currently, several methods based on the entropy of gene functions or PPI networks have been proposed to solve the problem. Nevertheless, these methods still suffer from inaccurate interactions and noise originating from scRNA-seq profiles. In this study, we propose a cell potency inference method based on cell-specific network entropy, called SPIDE. SPIDE introduces the local weighted cell-specific network for each cell to maintain cell heterogeneity and calculates the entropy by incorporating gene expression with network structure. We compared three cell entropy estimation models on eight scRNA-seq datasets. The results show that SPIDE obtains conclusions consistent with real cell differentiation potency on most datasets. Moreover, SPIDE accurately recovers the continuous changes of potency during cell differentiation and significantly correlates with the stemness of tumor cells in colorectal cancer. To conclude, our study provides a universal and accurate framework for cell entropy estimation, which deepens our understanding of cell differentiation, the development of diseases and other related biological research.
Affiliation(s)
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ziwei Xu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Yanping Zeng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Edwin Wang
  - Department of Biochemistry and Molecular Biology, Cumming School of Medicine, University of Calgary, Calgary T2N 4N1, Alberta, Canada
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China.
|
15
|
Liu J, Zeng W, Kan S, Li M, Zheng R. CAKE: a flexible self-supervised framework for enhancing cell visualization, clustering and rare cell identification. Brief Bioinform 2023; 25:bbad475. [PMID: 38145950 PMCID: PMC10749894 DOI: 10.1093/bib/bbad475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 11/13/2023] [Accepted: 11/30/2023] [Indexed: 12/27/2023]
Abstract
Single-cell sequencing technology has provided unprecedented opportunities for comprehensively deciphering cell heterogeneity. Nevertheless, the high dimensionality and intricate nature of cell heterogeneity have presented substantial challenges to computational methods. Numerous novel clustering methods have been proposed to address this issue. However, none of these methods achieves consistently better performance under different biological scenarios. In this study, we developed CAKE, a novel and scalable self-supervised clustering method, which consists of a contrastive learning model with a mixture neighborhood augmentation for cell representation learning, and a self-Knowledge Distiller model for the refinement of clustering results. These designs provide more condensed and cluster-friendly cell representations and improve the clustering performance in terms of accuracy and robustness. Furthermore, in addition to accurately identifying the major cell types, CAKE could also find more biologically meaningful cell subgroups and rare cell types. Comprehensive experiments on real single-cell RNA sequencing datasets demonstrated the superiority of CAKE in visualization and clustering over other comparison methods, and indicated its extensive application in the field of cell heterogeneity analysis. Contact: Ruiqing Zheng (rqzheng@csu.edu.cn).
Affiliation(s)
- Jin Liu
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Weixing Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Shichao Kan
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
|
16
|
Jiang H, Huang Y, Li Q, Feng B. ScLSTM: single-cell type detection by siamese recurrent network and hierarchical clustering. BMC Bioinformatics 2023; 24:417. [PMID: 37932672 PMCID: PMC10629177 DOI: 10.1186/s12859-023-05494-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 09/21/2023] [Indexed: 11/08/2023]
Abstract
MOTIVATION: Categorizing cells into distinct types can shed light on biological tissue functions and interactions, and uncover specific mechanisms under pathological conditions. Since gene expression throughout a population of cells is averaged out by conventional sequencing techniques, it is challenging to distinguish between different cell types. The accumulation of single-cell RNA sequencing (scRNA-seq) data provides the foundation for a more precise classification of cell types. It is crucial to build a high-accuracy clustering approach to categorize cell types, since the imbalance of cell types and differences in the distribution of scRNA-seq data affect single-cell clustering and visualization outcomes. RESULT: To achieve single-cell type detection, we propose a meta-learning-based single-cell clustering model called ScLSTM. Specifically, ScLSTM transforms the single-cell type detection problem into a hierarchical classification problem based on feature extraction by the siamese long short-term memory (LSTM) network. The similarity matrix derived from the improved sigmoid kernel is mapped to the siamese LSTM feature space to analyze the differences between cells. ScLSTM demonstrated superior classification performance on 8 scRNA-seq data sets of different platforms, species, and tissues. Further quantitative analysis and visualization of the human breast cancer data set validated the superiority and capability of ScLSTM in recognizing cell types.
Affiliation(s)
- Hanjing Jiang
  - Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, Institute of Artificial Intelligence, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
- Yabing Huang
  - Department of Pathology, Renmin Hospital of Wuhan University, Wuhan, 430060, China.
- Qianpeng Li
  - Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
- Boyuan Feng
  - Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, Institute of Artificial Intelligence, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
|
17
|
Wang J, Wang LP, Yuan SS, Li F, Liu JX, Shang JL. NLRRC: A Novel Clustering Method of Jointing Non-Negative LRR and Random Walk Graph Regularized NMF for Single-Cell Type Identification. IEEE J Biomed Health Inform 2023; 27:5199-5209. [PMID: 37506010 DOI: 10.1109/jbhi.2023.3299748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2023]
Abstract
The development of single-cell RNA sequencing (scRNA-seq) technology has opened up a new perspective for us to study disease mechanisms at the single cell level. Cell clustering reveals the natural grouping of cells, which is a vital step in scRNA-seq data analysis. However, the high noise and dropout of single-cell data pose numerous challenges to cell clustering. In this study, we propose a novel matrix factorization method named NLRRC for single-cell type identification. NLRRC joins non-negative low-rank representation (LRR) and random walk graph regularized NMF (RWNMFC) to accurately reveal the natural grouping of cells. Specifically, we find the lowest rank representation of single-cell samples by non-negative LRR to reduce the difficulty of analyzing high-dimensional samples and capture the global information of the samples. Meanwhile, by using random walk graph regularization (RWGR) and NMF, RWNMFC captures manifold structure and cluster information before generating a cluster allocation matrix. The cluster assignment matrix contains cluster labels, which can be used directly to get the clustering results. The performance of NLRRC is validated on simulated and real single-cell datasets. The results of the experiments illustrate that NLRRC has a significant advantage in single-cell type identification.
|
18
|
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023]
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations. Single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, thus making analysis with traditional computational approaches difficult or infeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing single-cell research.
Affiliation(s)
- Nafiseh Erfanian
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- A Ali Heydari
  - Department of Applied Mathematics, University of California, Merced, CA, USA; Health Sciences Research Institute, University of California, Merced, CA, USA
- Adib Miraki Feriz
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- Pablo Iañez
  - Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
- Afshin Derakhshani
  - Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
- Mohsen Farahpour
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Seyyed Mohammad Razavi
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Saeed Nasseri
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Hossein Safarpour
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran.
- Amirhossein Sahebkar
  - Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran.
|
19
|
Liu Q, Wang D, Zhou L, Li J, Wang G. MTGDC: A Multi-Scale Tensor Graph Diffusion Clustering for Single-Cell RNA Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3056-3067. [PMID: 37418411 DOI: 10.1109/tcbb.2023.3293112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/09/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a new technology that focuses on the expression levels for each cell to study cell heterogeneity. Thus, new computational methods matching scRNA-seq are designed to detect cell types among various cell groups. Herein, we propose a Multi-scale Tensor Graph Diffusion Clustering (MTGDC) for single-cell RNA sequencing data. It has the following mechanisms: 1) To mine potential similarity distributions among cells, we design a multi-scale affinity learning method to construct a fully connected graph between cells; 2) For each affinity matrix, we propose an efficient tensor graph diffusion learning framework to learn high-order information among multi-scale affinity matrices. First, the tensor graph is explicitly introduced to measure cell-cell edges with local high-order relationship information. To further preserve more global topology structure information in the tensor graph, MTGDC implicitly considers the propagation of information via a data diffusion process by designing a simple and efficient tensor graph diffusion update algorithm; 3) Finally, we mix together the multi-scale tensor graphs to obtain the fusion high-order affinity matrix and apply it to spectral clustering. Experiments and case studies showed that MTGDC had obvious advantages over the state-of-the-art algorithms in robustness, accuracy, visualization, and speed.
|
20
|
Wang ZC, Liu JX, Shang JL, Dai LY, Zheng CH, Wang J. ARGLRR: A Sparse Low-Rank Representation Single-Cell RNA-Sequencing Data Clustering Method Combined with a New Graph Regularization. J Comput Biol 2023; 30:848-860. [PMID: 37471220 DOI: 10.1089/cmb.2023.0077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/22/2023]
Abstract
The development of single-cell transcriptome sequencing technologies has opened new ways to study biological phenomena at the cellular level. A key application of such technologies involves the employment of single-cell RNA sequencing (scRNA-seq) data to identify distinct cell types through clustering, which in turn provides evidence for revealing heterogeneity. Despite the promise of this approach, the inherent characteristics of scRNA-seq data, such as higher noise levels and lower coverage, pose major challenges to existing clustering methods and compromise their accuracy. In this study, we propose a method called Adjusted Random walk Graph regularization Sparse Low-Rank Representation (ARGLRR), a practical sparse subspace clustering method, to identify cell types. The fundamental low-rank representation (LRR) model is concerned with the global structure of data. To address the limited ability of the LRR method to capture local structure, we introduced adjusted random walk graph regularization in its framework. ARGLRR allows for the capture of both local and global structures in scRNA-seq data. Additionally, the imposition of similarity constraints into the LRR framework further improves the ability of the proposed model to estimate cell-to-cell similarity and capture global structural relationships between cells. ARGLRR surpasses other advanced comparison approaches on nine known scRNA-seq data sets judging by the results. In the normalized mutual information and Adjusted Rand Index metrics on the scRNA-seq data sets clustering experiments, ARGLRR outperforms the best-performing comparative method by 6.99% and 5.85%, respectively. In addition, we visualize the result using Uniform Manifold Approximation and Projection. Visualization results show that the usage of ARGLRR enhances the separation of different cell types within the similarity matrix.
Affiliation(s)
- Zhen-Chang Wang
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Jun-Liang Shang
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Ling-Yun Dai
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Chun-Hou Zheng
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Juan Wang
  - School of Computer Science, Qufu Normal University, Rizhao, China
|
21
|
Yan X, Zheng R, Chen J, Li M. scNCL: transferring labels from scRNA-seq to scATAC-seq data with neighborhood contrastive regularization. Bioinformatics 2023; 39:btad505. [PMID: 37584660 PMCID: PMC10457667 DOI: 10.1093/bioinformatics/btad505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 07/17/2023] [Accepted: 08/12/2023] [Indexed: 08/17/2023]
Abstract
MOTIVATION: scATAC-seq has enabled chromatin accessibility landscape profiling at the single-cell level, providing opportunities for determining cell-type-specific regulation codes. However, the high dimension, extreme sparsity, and large scale of scATAC-seq data have posed great challenges to cell-type identification. Thus, there has been a growing interest in leveraging the well-annotated scRNA-seq data to help annotate scATAC-seq data. However, substantial computational obstacles remain to transfer information from scRNA-seq to scATAC-seq, especially for their heterogeneous features. RESULTS: We propose a new transfer learning method, scNCL, which utilizes prior knowledge and contrastive learning to tackle the problem of heterogeneous features. Briefly, scNCL transforms scATAC-seq features into a gene activity matrix based on prior knowledge. Since feature transformation can cause information loss, scNCL introduces neighborhood contrastive learning to preserve the neighborhood structure of scATAC-seq cells in raw feature space. To learn transferable latent features, scNCL uses a feature projection loss and an alignment loss to harmonize embeddings between scRNA-seq and scATAC-seq. Experiments on various datasets demonstrated that scNCL not only realizes accurate and robust label transfer for common types, but also achieves reliable detection of novel types. scNCL is also computationally efficient and scalable to million-scale datasets. Moreover, we prove scNCL can help refine cell-type annotations in existing scATAC-seq atlases. AVAILABILITY AND IMPLEMENTATION: The source code and data used in this paper can be found at https://github.com/CSUBioGroup/scNCL-release.
Affiliation(s)
- Xuhua Yan
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jinmiao Chen
  - Singapore Immunology Network (SIgN), Agency for Science, Technology and Research (A*STAR), Singapore 138648, Singapore
  - Immunology Translational Research Program, Department of Microbiology and Immunology, Yong Loo Lin School of Medicine, National University of Singapore (NUS), Singapore 117545, Singapore
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
22
|
Chen S, Zheng R, Tian L, Wu FX, Li M. A posterior probability based Bayesian method for single-cell RNA-seq data imputation. Methods 2023; 216:21-38. [PMID: 37315825 DOI: 10.1016/j.ymeth.2023.06.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/28/2023] [Revised: 05/19/2023] [Accepted: 06/07/2023] [Indexed: 06/16/2023]
Abstract
Single-cell RNA-sequencing (scRNA-seq) data suffer from a lot of zeros. Such dropout events impede the downstream data analyses. We propose BayesImpute to infer and impute dropouts from the scRNA-seq data. Using the expression rate and coefficient of variation of the genes within the cell subpopulation, BayesImpute first determines likely dropouts, and then constructs the posterior distribution for each gene and uses the posterior mean to impute dropout values. Some simulated and real experiments show that BayesImpute can effectively identify dropout events and reduce the introduction of false positive signals. Additionally, BayesImpute successfully recovers the true expression levels of missing values, restores the gene-to-gene and cell-to-cell correlation coefficient, and maintains the biological information in bulk RNA-seq data. Furthermore, BayesImpute boosts the clustering and visualization of cell subpopulations and improves the identification of differentially expressed genes. We further demonstrate that, in comparison to other statistical-based imputation methods, BayesImpute is scalable and fast with minimal memory usage.
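The two-stage idea described in this abstract (flag likely dropouts from per-subpopulation expression rate and coefficient of variation, then impute only those entries) can be illustrated with a rough sketch. Everything specific here is an assumption made for illustration: the thresholds `rate_min` and `cv_max`, the function names, and above all the imputation rule, which substitutes a plain subpopulation mean of nonzero values for the posterior mean that BayesImpute actually derives.

```python
import numpy as np

def flag_likely_dropouts(X, labels, rate_min=0.5, cv_max=1.0):
    """Mark zeros as likely dropouts (hypothetical thresholds).

    X is a cells x genes non-negative matrix; labels gives one
    subpopulation label per cell.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mask = np.zeros(X.shape, dtype=bool)
    for lab in np.unique(labels):
        rows = np.where(labels == lab)[0]
        sub = X[rows]
        rate = (sub > 0).mean(axis=0)      # expression rate per gene
        mean = sub.mean(axis=0)
        std = sub.std(axis=0)
        cv = np.full(mean.shape, np.inf)   # CV undefined when mean is 0
        np.divide(std, mean, out=cv, where=mean > 0)
        candidate = np.where((rate >= rate_min) & (cv <= cv_max))[0]
        # Zeros of consistently expressed, low-variation genes are flagged.
        mask[np.ix_(rows, candidate)] = sub[:, candidate] == 0
    return mask

def impute_with_group_mean(X, labels, mask):
    """Stand-in imputation: mean of nonzero values in the subpopulation."""
    X = np.asarray(X, dtype=float).copy()
    labels = np.asarray(labels)
    for lab in np.unique(labels):
        rows = labels == lab
        sub = X[rows]
        counts = (sub > 0).sum(axis=0)
        sums = sub.sum(axis=0)
        means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        fill = mask[rows]
        sub[fill] = np.broadcast_to(means, sub.shape)[fill]
        X[rows] = sub
    return X

X = np.array([[5.0, 0.0], [4.0, 0.0], [0.0, 0.0], [5.0, 7.0]])
labels = np.array([0, 0, 0, 1])
mask = flag_likely_dropouts(X, labels)
print(impute_with_group_mean(X, labels, mask))
```

In this toy input, only the zero of the first gene in the third cell is flagged, because that gene is expressed in most cells of its subpopulation; the second gene is silent throughout subpopulation 0, so its zeros are left alone.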
Affiliation(s)
- Siqi Chen
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Luyi Tian
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Fang-Xiang Wu
  - Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China.
|
23
|
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (NEW YORK, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Affiliation(s)
- Shixiong Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Jilin 130012, China
- Jiecong Lin
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Qiuzhen Lin
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
|
24
|
Wang LP, Liu JX, Shang JL, Kong XZ, Guan BX, Wang J. KGLRR: A low-rank representation K-means with graph regularization constraint method for Single-cell type identification. Comput Biol Chem 2023; 104:107862. [PMID: 37031647 DOI: 10.1016/j.compbiolchem.2023.107862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 02/26/2023] [Accepted: 03/30/2023] [Indexed: 04/05/2023]
Abstract
Single-cell RNA sequencing technology provides a tremendous opportunity for studying disease mechanisms at the single-cell level. Cell type identification is a key step in the research of disease mechanisms. Many clustering algorithms have been proposed to identify cell types. Most clustering algorithms perform similarity calculation before cell clustering. Because clustering and similarity calculation are independent, a low-rank matrix obtained only by similarity calculation may be unable to fully reveal the patterns in single-cell data. In this study, to capture accurate single-cell clustering information, we propose a novel method based on a low-rank representation model, called KGLRR, that combines the low-rank representation approach with K-means clustering. The cluster centroid is updated as the cell dimension decreases to better form new clusters and improve the quality of clustering information. In addition, the low-rank representation model ignores local geometric information, so the graph regularization constraint is introduced. KGLRR is tested on both simulated and real single-cell datasets to validate the effectiveness of the new method. The experimental results show that KGLRR is more robust and accurate in cell type identification than other advanced algorithms.
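A graph regularization constraint of the kind mentioned in this abstract is usually written as a Laplacian smoothness penalty, Tr(Z^T L Z) with L = D - W, which equals half the weighted sum of squared distances between the representations of connected cells. The abstract does not say how KGLRR builds its graph, so the binary symmetric k-nearest-neighbor graph below, the function names, and the parameter `k` are all assumptions made for a minimal sketch.

```python
import numpy as np

def knn_affinity(X, k=2):
    """Binary, symmetrized k-nearest-neighbor affinity (assumed graph)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                 # keep an edge if either side has it

def graph_penalty(Z, W):
    """Laplacian penalty Tr(Z^T L Z), with rows of Z as cell representations."""
    Z = np.asarray(Z, dtype=float)
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    return float(np.trace(Z.T @ L @ Z))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                   # 6 cells, 4 features
Z = rng.normal(size=(6, 2))                   # candidate low-dimensional codes
print(graph_penalty(Z, knn_affinity(X, k=2)))
```

The penalty is zero when all neighboring cells share the same representation and grows as neighbors are pulled apart, which is how such a term injects local geometric information into a model that otherwise captures only global structure.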
Affiliation(s)
- Lin-Ping Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jun-Liang Shang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Xiang-Zhen Kong
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Bo-Xin Guan
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Juan Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China.
|
25
|
Li X, Lin Y, Xie C, Li Z, Chen M, Wang P, Zhou J. A Clustering Method Unifying Cell-Type Recognition and Subtype Identification for Tumor Heterogeneity Analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:822-832. [PMID: 36044493 DOI: 10.1109/tcbb.2022.3203185] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
The rapid development of single-cell technology has opened up a whole new perspective for identifying cell types in multicellular organisms and understanding the relationships between them. Distinguishing different cell types and subtypes can identify the components of different immune cells and different tumor clones in the tumor microenvironment, which is the basic work of tumor heterogeneity analysis and can help researchers understand the mechanism of tumor immune escape. Existing algorithms treat both cell types and subtypes as populations of cells with specific gene expression patterns, which is not conducive to accurate cell typing. To address this, we propose a cell similarity metric that unifies cell type recognition and subtype identification (UCRSI), with the assumption that selectively expressed genes represent differences in underlying cell type in an on/off manner, while differences in expression level represent different cell subtypes in a more/less manner. Our method calculates these two kinds of differences separately, then combines them using a consensus adjacency matrix, and finally completes cell typing using a spectral clustering algorithm. The results show that UCRSI can reconstruct expert annotation of single-cell RNA sequencing datasets more robustly than existing methods. UCRSI is also useful for analyzing tumor heterogeneity and improving visualization of large-scale cell clustering.
|
26
|
Ning Z, Dai Z, Zhang H, Chen Y, Yuan Z. A clustering method for small scRNA-seq data based on subspace and weighted distance. PeerJ 2023; 11:e14706. [PMID: 36710872 PMCID: PMC9879162 DOI: 10.7717/peerj.14706] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/25/2022] [Accepted: 12/15/2022] [Indexed: 01/24/2023]
Abstract
Background: Identifying cell types using unsupervised methods is essential for scRNA-seq research. However, conventional similarity measures introduce challenges to single-cell data clustering because of the high dimensionality, high noise, and high dropout rate of the data. Methods: We proposed a clustering method for small scRNA-seq data based on Subspace and Weighted Distance (SSWD), which follows the assumption that the sets of gene subspace composed of similar density-distributing genes can better distinguish cell groups. To accurately capture the intrinsic relationship among cells or genes, a new distance metric that combines Euclidean and Pearson distance through a weighting strategy was proposed. The relative Calinski-Harabasz (CH) index was used to estimate the cluster numbers instead of the CH index because it is comparable across degrees of freedom. Results: We compared SSWD with seven prevailing methods on eight publicly available scRNA-seq datasets. The experimental results show that SSWD has better clustering accuracy and better partitioning ability for cell groups. SSWD can be downloaded at https://github.com/ningzilan/SSWD.
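To make the idea of a weighted Euclidean and Pearson distance concrete, here is a minimal sketch. The abstract does not give the actual weighting strategy used by SSWD, so the convex combination with a fixed weight `w`, the rescaling of both terms to [0, 1], and the function name `weighted_distance` are assumptions for illustration only.

```python
import numpy as np

def weighted_distance(X, w=0.5):
    """Blend of scaled Euclidean and Pearson distances between rows of X.

    Hypothetical form: w * d_euclid / max(d_euclid) + (1 - w) * (1 - r) / 2.
    Rows with zero variance would yield NaN correlations and should be
    filtered out beforehand.
    """
    X = np.asarray(X, dtype=float)
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d_euc = np.sqrt(np.maximum(d2, 0.0))   # clip tiny negative round-off
    if d_euc.max() > 0:
        d_euc = d_euc / d_euc.max()        # rescale to [0, 1]
    r = np.corrcoef(X)                     # Pearson correlation between rows
    d_pear = (1.0 - r) / 2.0               # maps r in [-1, 1] to [0, 1]
    D = w * d_euc + (1.0 - w) * d_pear
    np.fill_diagonal(D, 0.0)
    return D

X = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
print(np.round(weighted_distance(X, w=0.5), 3))
```

In the toy input, the first two rows are perfectly correlated but differ in magnitude, so only the Euclidean term separates them, while the first and third rows are close in magnitude but anti-correlated, so the Pearson term dominates. Blending the two is one way to capture both kinds of relationship.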
Affiliation(s)
- Zilan Ning
  - Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China; Hunan Agricultural University, College of Information and Intelligence, Changsha, Hunan, China
- Zhijun Dai
  - Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
- Hongyan Zhang
  - Hunan Agricultural University, College of Information and Intelligence, Changsha, Hunan, China
- Yuan Chen
  - Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
- Zheming Yuan
  - Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
|
27
|
Chen S, Yan X, Zheng R, Li M. Bubble: a fast single-cell RNA-seq imputation using an autoencoder constrained by bulk RNA-seq data. Brief Bioinform 2023; 24:6960616. [PMID: 36567258 DOI: 10.1093/bib/bbac580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Revised: 11/10/2022] [Accepted: 11/28/2022] [Indexed: 12/27/2022]
Abstract
Single-cell RNA-sequencing technology (scRNA-seq) brings research to single-cell resolution. However, a major drawback of scRNA-seq is high sparsity, i.e. expressed genes with no reads due to technical noise or limited sequencing depth during the scRNA-seq protocol. These are called 'dropout' events, and they are likely to affect downstream analyses such as differential expression analysis, the clustering and visualization of cell subpopulations, cellular trajectory inference, etc. Therefore, there is a need to develop a method to identify and impute these dropout events. We propose Bubble, which first identifies dropout events among all zeros based on the expression rate and coefficient of variation of genes within a cell subpopulation, and then leverages an autoencoder constrained by bulk RNA-seq data to impute only those values. Unlike other deep learning-based imputation methods, Bubble fuses the matched bulk RNA-seq data as a constraint to reduce the introduction of false positive signals. Using simulated and several real scRNA-seq datasets, we demonstrate that Bubble enhances the recovery of missing values and of gene-to-gene and cell-to-cell correlations, and reduces the introduction of false positive signals. Regarding some crucial downstream analyses of scRNA-seq data, Bubble facilitates the identification of differentially expressed genes, improves the performance of clustering and visualization, and aids the construction of cellular trajectories. More importantly, Bubble provides fast and scalable imputation with minimal memory usage.
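The first step in this abstract, separating likely dropout zeros from true zeros using a gene's expression rate and coefficient of variation within a cell subpopulation, can be sketched as follows. The abstract gives neither the decision rule nor any thresholds, so the rule used here (zeros are flagged when the gene is frequently and consistently expressed in the subpopulation) and the parameters `min_rate` and `max_cv` are assumptions made for illustration, not Bubble's actual criteria.

```python
import math

def flag_dropouts(values, min_rate=0.5, max_cv=1.0):
    """Flag zeros of one gene within one cell subpopulation as likely dropouts.

    values   -- expression of a single gene across the cells of a subpopulation
    min_rate -- assumed minimum fraction of cells expressing the gene
    max_cv   -- assumed maximum coefficient of variation (std / mean)
    Returns a list of booleans, True where a zero is treated as a dropout.
    """
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        # Gene is silent in the whole subpopulation: treat zeros as true zeros.
        return [False] * n
    rate = sum(1 for v in values if v > 0) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    cv = math.sqrt(variance) / mean
    likely_expressed = rate >= min_rate and cv <= max_cv
    return [likely_expressed and v == 0 for v in values]

# Gene expressed in most cells of the cluster: its single zero is flagged.
print(flag_dropouts([5, 6, 0, 5, 4]))
# Gene expressed in one cell only: zeros are kept as true zeros.
print(flag_dropouts([0, 0, 0, 9, 0]))
```

Only the flagged positions would then be passed to an imputation model, leaving the remaining zeros untouched.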
Affiliation(s)
- Siqi Chen
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Xuhua Yan
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
28
|
Zhou J, Li X, Ma Y, Wu Z, Xie Z, Zhang Y, Wei Y. Optimal modeling of anti-breast cancer candidate drugs screening based on multi-model ensemble learning with imbalanced data. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:5117-5134. [PMID: 36896538 DOI: 10.3934/mbe.2023237] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/18/2023]
Abstract
Imbalanced data seriously biases machine learning models, which leads to false positives in the screening of therapeutic drugs for breast cancer. To deal with this problem, a multi-model ensemble framework based on a tree model, a linear model, and a deep-learning model is proposed. Based on the methodology constructed in this study, we screened the 20 most critical molecular descriptors from 729 molecular descriptors of 1974 anti-breast cancer drug candidates. To measure the pharmacokinetic properties and safety of the drug candidates, the screened molecular descriptors were then used for subsequent bioactivity, absorption, distribution, metabolism, excretion, toxicity, and other prediction tasks. The results show that the method constructed in this study is superior to and more stable than the individual models used in the ensemble approach.
Affiliation(s)
- Juan Zhou
- School of Software, East China Jiaotong University, Nanchang 330013, China
| | - Xiong Li
- School of Software, East China Jiaotong University, Nanchang 330013, China
| | - Yuanting Ma
- School of Economics and Management, East China Jiaotong University, Nanchang 330013, China
| | - Zejiu Wu
- School of Science, East China Jiaotong University, Nanchang 330013, China
| | - Ziruo Xie
- School of Software, East China Jiaotong University, Nanchang 330013, China
| | - Yuqi Zhang
- School of Foreign Languages, East China Jiaotong University, Nanchang 330013, China
| | - Yiming Wei
- School of Software, East China Jiaotong University, Nanchang 330013, China
|
29
|
Li D, Liang H, Qin P, Wang J. A self-training subspace clustering algorithm based on adaptive confidence for gene expression data. Front Genet 2023; 14:1132370. [PMID: 37025450 PMCID: PMC10070828 DOI: 10.3389/fgene.2023.1132370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2022] [Accepted: 03/07/2023] [Indexed: 04/08/2023] Open
Abstract
Gene clustering is an important technique for identifying co-expressed gene groups from gene expression data, and it provides a powerful tool for investigating the functional relationships of genes in biological processes. Self-training is an important semi-supervised learning method and has exhibited good performance on the gene clustering problem. However, the self-training process inevitably suffers from mislabeling, the accumulation of which degrades the semi-supervised learning performance on gene expression data. To solve this problem, this paper proposes a self-training subspace clustering algorithm based on adaptive confidence for gene expression data (SSCAC), which combines the low-rank representation of gene expression data with adaptive adjustment of label confidence to better guide the partition of unlabeled data. The superiority of the proposed SSCAC algorithm is mainly reflected in the following aspects. 1) To improve the discriminative property of gene expression data, the low-rank representation with a distance penalty is used to mine the potential subspace structure of the data. 2) Considering the problem of mislabeling in self-training, a semi-supervised clustering objective function with label confidence is proposed, and a self-training subspace clustering framework is constructed on this basis. 3) To mitigate the negative impact of mislabeled data, an adaptive adjustment strategy based on the gravitational search algorithm is proposed for label confidence. Compared with a variety of state-of-the-art unsupervised and semi-supervised learning algorithms, the SSCAC algorithm has demonstrated its superiority through extensive experiments on two benchmark gene expression datasets.
Affiliation(s)
- Dan Li
- Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Hongnan Liang
- Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Pan Qin
- Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- *Correspondence: Pan Qin; Jia Wang
| | - Jia Wang
- Department of Breast Surgery, The Second Hospital of Dalian Medical University, Dalian, Liaoning, China
- *Correspondence: Pan Qin; Jia Wang
|
30
|
Liu Q, Zhao X, Wang G. A Clustering Ensemble Method for Cell Type Detection by Multiobjective Particle Optimization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1-14. [PMID: 34860653 DOI: 10.1109/tcbb.2021.3132400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a new technology that differs from previous sequencing methods, which measure the average expression level of each gene across a large population of cells. Thus, new computational methods are required to reveal cell types among cell populations. We present a clustering ensemble algorithm using optimized multiobjective particle (CEMP). It features several mechanisms: 1) A multi-subspace projection method for mapping the original data to low-dimensional subspaces is applied in order to detect complex data structure at both the gene level and the sample level. 2) The basic partition module in different subspaces is utilized to generate clustering solutions. 3) A transforming representation between clusters and particles is used to bridge the gap between the discrete clustering ensemble optimization problem and the continuous multiobjective optimization algorithm. 4) We propose a clustering ensemble optimization. To guide the multiobjective ensemble optimization process, three cluster metrics are embedded into CEMP as objective functions with which the final clustering is dynamically evaluated. Experiments on 9 real scRNA-seq datasets indicated that CEMP had superior performance over several other clustering algorithms in clustering accuracy and robustness. The case study conducted on mouse neuronal cells successfully identified the main cell types and cell subtypes.
|
31
|
Liu JX, Yin MM, Gao YL, Shang J, Zheng CH. MSF-LRR: Multi-Similarity Information Fusion Through Low-Rank Representation to Predict Disease-Associated Microbes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:534-543. [PMID: 35085090 DOI: 10.1109/tcbb.2022.3146176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
An increase in microbial activity has been shown to be intimately connected with the pathogenesis of diseases. Considering the expense of traditional verification methods, researchers are working to develop high-efficiency methods for detecting potential disease-related microbes. In this article, a new prediction method, MSF-LRR, is established, which uses Low-Rank Representation (LRR) to perform multi-similarity information fusion to predict disease-related microbes. Considering that most existing methods only use one class of similarity, three classes of microbe and disease similarity are added. Then, LRR is used to obtain low-rank structural similarity information. Additionally, the method adaptively extracts the local low-rank structure of the data from a global perspective, to make the information used for the prediction more effective. Finally, a neighbor-based prediction method that utilizes the concept of collaborative filtering is applied to predict unknown microbe-disease pairs. As a result, the AUC value of MSF-LRR is superior to that of other existing algorithms under 5-fold cross-validation. Furthermore, in case studies, excluding originally known associations, 16 and 19 of the top 20 microbes associated with Bacterial Vaginosis and Irritable Bowel Syndrome, respectively, have been confirmed by the recent literature. In summary, MSF-LRR is a good predictor of potential microbe-disease associations and can contribute to drug discovery and biological research.
|
32
|
Liu Y, Li HD, Xu Y, Liu YW, Peng X, Wang J. IsoCell: An Approach to Enhance Single Cell Clustering by Integrating Isoform-Level Expression Through Orthogonal Projection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:465-475. [PMID: 35100120 DOI: 10.1109/tcbb.2022.3147193] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Single cell RNA sequencing (scRNA-seq) provides a powerful approach for profiling transcriptomes at single cell resolution. An essential application of scRNA-seq is the discovery of cell types with the aid of clustering analysis. Currently, existing single cell clustering methods are exclusively based on gene-level expression data, without considering alternative splicing information. It has been shown that alternative splicing has an important influence on biological processes such as cell differentiation and cell cycle. We therefore hypothesize that adding information about alternative splicing may help enhance single cell clustering. This motivates us to develop a way to integrate isoform-level expression and gene-level expression. We report an approach to enhance single cell clustering by integrating isoform-level expression through orthogonal projection. First, we construct an orthogonal projection matrix based on gene expression data. Second, isoforms are projected to the gene space to remove the redundant information between them. Third, isoform selection is performed based on the residual of the projected expression and the selected isoforms are combined with gene expression data for subsequent clustering. We applied our method to sixteen scRNA-seq datasets. We find that alternative splicing contains differential information among cell types and can be integrated to enhance single cell clustering. Compared with using only gene-level expression data, the integration of isoform-level expression leads to better clustering performances for most of the datasets. The integration of isoform-level expression also has potential in the detection of novel cell subgroups. Our study shows that integrating isoform and gene-level expression is a promising way to improve single cell clustering. The IsoCell R package is freely available at both Github (https://github.com/genemine/IsoCell) and Zenodo (https://zenodo.org/record/4395707).
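The three steps in this abstract (build a projection from gene expression, project isoforms onto the gene space, rank isoforms by the residual) can be sketched as follows. This is an illustration under stated assumptions rather than the IsoCell implementation: each gene or isoform is represented as a vector of expression values across cells, the gene space is taken to be the span of the gene vectors, and the projection is computed with modified Gram-Schmidt instead of an explicit projection matrix. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for the span of the vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis

def residual_norm(isoform, basis):
    """Norm of the part of an isoform profile not explained by the gene space."""
    r = list(isoform)
    for b in basis:
        coeff = dot(r, b)
        r = [ri - coeff * bi for ri, bi in zip(r, b)]
    return math.sqrt(dot(r, r))

# Toy data: two genes and two isoforms measured in four cells.
genes = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(genes)

redundant_isoform = [2.0, 3.0, 0.0, 0.0]  # lies entirely in the gene space
novel_isoform = [1.0, 1.0, 5.0, 0.0]      # carries signal in the third cell

print(residual_norm(redundant_isoform, basis))
print(residual_norm(novel_isoform, basis))
```

Isoforms with a large residual carry information that the gene-level data does not, so a selection step would keep those and discard isoforms whose residual is near zero.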
|
33
|
Wang J, Zhang N, Yuan S, Shang J, Dai L, Li F, Liu J. Non-negative low-rank representation based on dictionary learning for single-cell RNA-sequencing data analysis. BMC Genomics 2022; 23:851. [PMID: 36564711 PMCID: PMC9789616 DOI: 10.1186/s12864-022-09027-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Accepted: 11/21/2022] [Indexed: 12/24/2022] Open
Abstract
In the analysis of single-cell RNA-sequencing (scRNA-seq) data, effectively and accurately identifying cell clusters from a large number of mixed cells is still a challenge. The low-rank representation (LRR) method has achieved excellent results in subspace clustering. However, in previous studies, most LRR-based methods choose the original data matrix as the dictionary. In addition, LRR-based methods usually use a spectral clustering algorithm to complete cell clustering. There is therefore a matching problem between the spectral clustering method and the affinity matrix, which makes it difficult to ensure an optimal clustering result. Considering these two points, we propose the DLNLRR method to better identify cell types. First, DLNLRR updates the dictionary during the optimization process instead of using a predefined fixed dictionary, so it performs dictionary learning and LRR learning at the same time. Second, DLNLRR performs subspace clustering without relying on a spectral clustering algorithm; that is, clustering can be performed directly on the low-rank matrix. Finally, we carry out a large number of experiments on real single-cell datasets, and the experimental results show that DLNLRR is superior to other scRNA-seq data analysis algorithms in cell type identification.
Affiliation(s)
- Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
| | - Nana Zhang
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
| | - Shasha Yuan
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
| | - Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
| | - Lingyun Dai
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
| | - Feng Li
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
| | - Jinxing Liu
- School of Computer Science, Qufu Normal University, Rizhao, China (grid.412638.a; 0000 0001 0227 8151)
|
34
|
Guo L, Zhang X, Zhang R, Wang Q, Xue X, Liu Z. Robust graph representation clustering based on adaptive data correction. APPL INTELL 2022. [DOI: 10.1007/s10489-022-04268-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
|
35
|
Shu Z, Long Q, Zhang L, Yu Z, Wu XJ. Robust Graph Regularized NMF with Dissimilarity and Similarity Constraints for ScRNA-seq Data Clustering. J Chem Inf Model 2022; 62:6271-6286. [PMID: 36459053 DOI: 10.1021/acs.jcim.2c01305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Abstract
The notable progress in single-cell RNA sequencing (ScRNA-seq) technology helps to accurately discover the heterogeneity and diversity of cells. Clustering is an extremely important step in ScRNA-seq data analysis. However, directly clustering ScRNA-seq data cannot achieve satisfactory performance because of its high dimensionality and noise. To address these issues, we propose a novel ScRNA-seq data representation model, termed Robust Graph regularized Non-Negative Matrix Factorization with Dissimilarity and Similarity constraints (RGNMF-DS), for ScRNA-seq data clustering. To accurately characterize the structure information of the labeled samples and the unlabeled samples, respectively, the proposed RGNMF-DS model adopts a pair of complementary regularizers (i.e., similarity and dissimilarity regularizers) to guide matrix decomposition. In addition, we construct a graph regularizer to discover the local geometric structure hidden in ScRNA-seq data. Moreover, we adopt the l2,1-norm to measure the reconstruction error and thereby effectively improve the robustness of the proposed RGNMF-DS model to noise. Experimental results on several ScRNA-seq datasets demonstrate that the proposed RGNMF-DS model outperforms other state-of-the-art competitors in clustering.
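The robustness argument for the l2,1-norm in this abstract can be made concrete with a small sketch. It assumes a residual matrix stored with one row per sample; whether samples sit in rows or columns is a convention that the abstract does not state, and the function names are hypothetical. The l2,1-norm sums the Euclidean norms of the per-sample residuals, so a sample's contribution grows linearly with its error, whereas in the squared Frobenius norm it grows quadratically.

```python
import math

def l21_norm(residual_rows):
    """Sum of the Euclidean norms of the rows (one row per sample)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in residual_rows)

def frobenius_squared(residual_rows):
    """Sum of all squared entries, the usual least-squares reconstruction error."""
    return sum(v * v for row in residual_rows for v in row)

# Residual of a single noisy sample, then the same sample with doubled error.
outlier = [[3.0, 4.0]]
doubled = [[6.0, 8.0]]

print(l21_norm(outlier), l21_norm(doubled))
print(frobenius_squared(outlier), frobenius_squared(doubled))
```

Doubling the outlier's error doubles its l2,1 contribution but quadruples its squared Frobenius contribution, which is why a few very noisy cells dominate a least-squares objective more than an l2,1 objective.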
Affiliation(s)
- Zhenqiu Shu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Qinghan Long
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Luping Zhang
- Library of Kunming Medical University, Kunming 650031, China
| | - Zhengtao Yu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Xiao-Jun Wu
- Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China
|
36
|
Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] Open
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Affiliation(s)
- Min Su
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Tao Pan
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Qiu-Zhen Chen
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Wei-Wei Zhou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China
| | - Yi Gong
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China; Department of Immunology, Nanjing Medical University, Nanjing, 211166, China
| | - Gang Xu
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Huan-Yu Yan
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Si Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Qiao-Zhen Shi
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Ya Zhang
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Xiao He
- Department of Laboratory Medicine, Women and Children's Hospital of Chongqing Medical University, Chongqing, 401174, China
| | | | - Shi-Cai Fan
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110, Guangdong, China
| | - Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China
| | - Murray J Cairns
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW, 2308, Australia; Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW, 2305, Australia
| | - Xi Wang
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Yong-Sheng Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
|
37
|
Unified K-means coupled self-representation and neighborhood kernel learning for clustering single-cell RNA-sequencing data. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
|
38
|
Yan X, Zheng R, Li M. GLOBE: a contrastive learning-based framework for integrating single-cell transcriptome datasets. Brief Bioinform 2022; 23:6651304. [PMID: 35901449 DOI: 10.1093/bib/bbac311] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/02/2022] [Revised: 06/29/2022] [Accepted: 07/09/2022] [Indexed: 11/13/2022] Open
Abstract
Integration of single-cell transcriptome datasets from multiple sources plays an important role in investigating complex biological systems. The key to integrating transcriptome datasets is batch effect removal. Recent methods attempt to apply a contrastive learning strategy to correct batch effects. Despite their encouraging performance, the optimal contrastive learning framework for batch effect removal is still under exploration. We develop an improved contrastive learning-based batch correction framework, GLOBE. GLOBE defines adaptive translation transformations for each cell to guarantee the stability of approximating batch effects. To enhance the consistency of representation alignment, GLOBE utilizes a loss function that is both hardness-aware and consistency-aware to learn batch effect-invariant representations. Moreover, GLOBE computes a batch-corrected gene matrix in a transparent way to support diverse downstream analyses. Benchmarking results on a wide spectrum of datasets show that GLOBE outperforms other state-of-the-art methods in terms of robust batch mixing and superior conservation of biological signals. We further apply GLOBE to integrate two developing mouse neocortex datasets and show that GLOBE succeeds in removing batch effects while preserving the contiguous structure of cells in the raw data. Finally, a comprehensive study is conducted to validate the effectiveness of GLOBE.
Affiliation(s)
- Xuhua Yan
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
|
39
|
Ding Q, Yang W, Luo M, Xu C, Xu Z, Pang F, Cai Y, Anashkina AA, Su X, Chen N, Jiang Q. CBLRR: a cauchy-based bounded constraint low-rank representation method to cluster single-cell RNA-seq data. Brief Bioinform 2022; 23:6649282. [DOI: 10.1093/bib/bbac300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 06/17/2022] [Accepted: 07/02/2022] [Indexed: 11/14/2022] Open
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology provides unprecedented opportunities for exploring biological phenomena at the single-cell level. The discovery of cell types is one of the major applications for researchers exploring the heterogeneity of cells. Some computational methods have been proposed to solve the problem of scRNA-seq data clustering. However, unavoidable technical noise and notorious dropouts reduce the accuracy of clustering methods. Here, we propose the cauchy-based bounded constraint low-rank representation (CBLRR), a low-rank representation-based method that introduces the cauchy loss function (CLF) and bounded nuclear norm regulation to alleviate this issue. Specifically, as an effective loss function, the CLF is proven to enhance the robustness of cell type identification. Then, we adopt the bounded constraint to keep the entry values of single-cell data within the restricted interval. Finally, the performance of CBLRR is evaluated on 15 scRNA-seq datasets and compared with other state-of-the-art methods. The experimental results demonstrate that CBLRR performs accurately and robustly on clustering scRNA-seq data. Furthermore, CBLRR is an effective tool for clustering cells and shows great potential for downstream analysis of single-cell data. The source code of CBLRR is available online at https://github.com/Ginnay/CBLRR.
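Why a Cauchy loss helps with noisy entries can be shown with a minimal comparison against the squared loss. The form used below, log(1 + (r/c)^2) with scale parameter `c`, is one common definition of the Cauchy loss; the abstract does not state the exact form or scaling that CBLRR uses, so both the formula and the default `c` are assumptions.

```python
import math

def cauchy_loss(residual, c=1.0):
    """One common form of the Cauchy loss; c is an assumed scale parameter."""
    return math.log(1.0 + (residual / c) ** 2)

def squared_loss(residual):
    """Ordinary least-squares loss for comparison."""
    return residual ** 2

# Loss values for a small, a moderate, and a very large reconstruction error.
for r in (0.1, 1.0, 10.0, 100.0):
    print(r, round(cauchy_loss(r), 4), squared_loss(r))
```

The Cauchy loss grows only logarithmically for large residuals, so a handful of entries corrupted by technical noise or dropouts contribute far less to the objective than they would under a squared loss.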
Affiliation(s)
- Qian Ding
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Wenyi Yang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Meng Luo
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Chang Xu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Zhaochun Xu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Fenglan Pang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yideng Cai
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Anastasia A Anashkina
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
| | - Xi Su
- Foshan Maternity & Child Healthcare Hospital, Southern Medical University, Foshan, Guangdong, China
| | - Na Chen
- Department of Hematology, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, Shandong, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
|
40
|
Liu G, Li M, Wang H, Lin S, Xu J, Li R, Tang M, Li C. D3K: The Dissimilarity-Density-Dynamic Radius K-means Clustering Algorithm for scRNA-Seq Data. Front Genet 2022; 13:912711. [PMID: 35846121 PMCID: PMC9284269 DOI: 10.3389/fgene.2022.912711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 04/25/2022] [Indexed: 12/02/2022] Open
Abstract
Single-cell sequencing data sets have always been challenging to cluster because of their high dimensionality and numerous noise points. The traditional K-means algorithm is not suitable for this type of data. Therefore, this study proposes a Dissimilarity-Density-Dynamic Radius-K-means clustering algorithm. The algorithm adds a dynamic radius parameter to the calculation. It flexibly adjusts the active radius according to the data characteristics, which can eliminate the influence of noise points and optimize the clustering results. At the same time, the algorithm calculates weights from the dissimilarity density of the data set, the average contrast of candidate clusters, and the dissimilarity of candidate clusters. It obtains a set of high-quality initial center points, which removes the randomness of the K-means algorithm in selecting the center points. Finally, compared with similar algorithms, this algorithm shows a better clustering effect on single-cell data. Each clustering index is higher than those of other single-cell clustering algorithms, which overcomes the shortcomings of the traditional K-means algorithm.
Affiliation(s)
- Guoyun Liu
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Manzhi Li
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, China
- *Correspondence: Manzhi Li
| | - Hongtao Wang
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Shijun Lin
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Junlin Xu
- College of Information Science and Engineering, Hunan University, Changsha, China
| | - Ruixi Li
- Geneis Beijing Co., Ltd., Beijing, China
| | - Min Tang
- School of Life Sciences, Jiangsu University, Zhenjiang, China
| | - Chun Li
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
|
41
|
Xu J, Cui L, Zhuang J, Meng Y, Bing P, He B, Tian G, Kwok Pui C, Wu T, Wang B, Yang J. Evaluating the performance of dropout imputation and clustering methods for single-cell RNA sequencing data. Comput Biol Med 2022; 146:105697. [PMID: 35697529 DOI: 10.1016/j.compbiomed.2022.105697] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/25/2022] [Revised: 05/16/2022] [Accepted: 06/04/2022] [Indexed: 11/03/2022]
Abstract
Recent advances in single-cell RNA sequencing (scRNA-seq) provide exciting opportunities for transcriptome analysis at single-cell resolution. Clustering individual cells is a key step to reveal cell subtypes and infer cell lineage in scRNA-seq analysis. Although many dedicated algorithms have been proposed, clustering quality remains a computational challenge for scRNA-seq data, which is exacerbated by inflated zero counts due to various technical noise. To address this challenge, we assess the combinations of nine popular dropout imputation methods and eight clustering methods on a collection of 10 well-annotated scRNA-seq datasets with different sample sizes. Our results show that (i) imputation algorithms do typically improve the performance of clustering methods, and the quality of data visualization using t-Distributed Stochastic Neighbor Embedding; and (ii) the performance of a particular combination of imputation and clustering methods varies with dataset size. For example, the combination of single-cell analysis via expression recovery and Sparse Subspace Clustering (SSC) methods usually works well on smaller datasets, while the combination of adaptively-thresholded low-rank approximation and single-cell interpretation via multikernel learning (SIMLR) usually achieves the best performance on larger datasets.
Affiliation(s)
- Junlin Xu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, 410082, China
- Lingyu Cui
  - College of Life Science, Northeast Forestry University, Harbin, Heilongjiang, 150000, China
- Jujuan Zhuang
  - School of Science, Dalian Maritime University, Dalian, Liaoning, 116026, China
- Yajie Meng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, 410082, China
- Pingping Bing
  - Academician Workstation, Changsha Medical University, Changsha, 410219, China
- Binsheng He
  - Academician Workstation, Changsha Medical University, Changsha, 410219, China
- Geng Tian
  - Geneis Beijing Co., Ltd., Beijing, 100102, China; Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, 266000, China
- Choi Kwok Pui
  - Department of Statistics and Data Science, Department of Mathematics, National University of Singapore, Singapore, 117546, Republic of Singapore
- Taoyang Wu
  - School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK
- Bing Wang
  - School of Electrical & Information Engineering, Anhui University of Technology, Anhui, 243002, China
- Jialiang Yang
  - Geneis Beijing Co., Ltd., Beijing, 100102, China; Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, 266000, China
|
42
|
Liang Z, Zheng R, Chen S, Yan X, Li M. A deep matrix factorization based approach for single-cell RNA-seq data clustering. Methods 2022; 205:114-122. [PMID: 35777719 DOI: 10.1016/j.ymeth.2022.06.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Revised: 05/28/2022] [Accepted: 06/24/2022] [Indexed: 11/17/2022] Open
Abstract
The rapid development of single-cell sequencing technologies makes it possible to analyze cellular heterogeneity at the single-cell level. Cell clustering is one of the most fundamental and common steps in heterogeneity analysis. However, due to the high noise level, high dimensionality and high sparsity of the data, accurate cell clustering is still challenging. Here, we present DeepCI, a new clustering approach for scRNA-seq data. Using two autoencoders to obtain cell embeddings and gene embeddings, DeepCI can simultaneously learn a low-dimensional cell representation and a clustering. In addition, the recovered gene expression matrix can be obtained by the matrix multiplication of the cell and gene embeddings. To evaluate the performance of DeepCI, we applied it to several real scRNA-seq datasets for clustering and visualization analysis. The experimental results show that DeepCI achieves better overall performance than several popular single-cell analysis methods. We also evaluated the imputation performance of DeepCI in a dedicated experiment. The corresponding results show that the imputed gene expression of known specific marker genes can greatly improve the accuracy of cell type classification.
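The recovery step mentioned in the abstract above (a product of cell and gene embeddings) can be illustrated with a minimal sketch. The sizes, the random stand-in embeddings, and the names `cell_emb`, `gene_emb`, and `recovered` are assumptions made for illustration; the abstract does not describe the autoencoders themselves, so they are not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, k = 6, 8, 3  # hypothetical sizes, not from the paper

# Random stand-ins for the embeddings that the two autoencoders would learn.
cell_emb = rng.random((n_cells, k))  # one k-dimensional row per cell
gene_emb = rng.random((n_genes, k))  # one k-dimensional row per gene

# Recovered expression matrix (cells x genes): entry (i, j) is the inner
# product of cell i's embedding and gene j's embedding.
recovered = cell_emb @ gene_emb.T
```

Because the product passes through a `k`-dimensional bottleneck, the recovered matrix has rank at most `k`, which is one way a factorization of this kind can smooth over dropout zeros.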
Affiliation(s)
- Zhenlan Liang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Siqi Chen
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xuhua Yan
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
43
|
Pouryahya M, Oh JH, Javanmard P, Mathews JC, Belkhatir Z, Deasy JO, Tannenbaum AR. aWCluster: A Novel Integrative Network-Based Clustering of Multiomics for Subtype Analysis of Cancer Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1472-1483. [PMID: 33226952 PMCID: PMC9518829 DOI: 10.1109/tcbb.2020.3039511] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 06/11/2023]
Abstract
The remarkable growth of multi-platform genomic profiles has led to the challenge of multiomics data integration. In this study, we present a novel network-based multiomics clustering founded on the Wasserstein distance from optimal mass transport. This distance has many important geometric properties making it a suitable choice for application in machine learning and clustering. Our proposed method of aggregating multiomics and Wasserstein distance clustering (aWCluster) is applied to breast carcinoma as well as bladder carcinoma, colorectal adenocarcinoma, renal carcinoma, lung non-small cell adenocarcinoma, and endometrial carcinoma from The Cancer Genome Atlas project. Subtypes were characterized by the concordant effect of mRNA expression, DNA copy number alteration, and DNA methylation of genes and their neighbors in the interaction network. aWCluster successfully clusters all cancer types into classes with significantly different survival rates. Also, a gene ontology enrichment analysis of significant genes in the low survival subgroup of breast cancer leads to the well-known phenomenon of tumor hypoxia and the transcription factor ETS1 whose expression is induced by hypoxia. We believe aWCluster has the potential to discover novel subtypes and biomarkers by accentuating the genes that have concordant multiomics measurements in their interaction network, which are challenging to find without the network inference or with single omics analysis.
|
44
|
Meng X, Xiang J, Zheng R, Wu FX, Li M. DPCMNE: Detecting Protein Complexes From Protein-Protein Interaction Networks Via Multi-Level Network Embedding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1592-1602. [PMID: 33417563 DOI: 10.1109/tcbb.2021.3050102] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/12/2023]
Abstract
Biological functions of a cell are typically carried out through protein complexes. The detection of protein complexes is therefore of great significance for understanding cellular organization and protein function. In the past decades, many computational methods have been proposed to detect protein complexes. However, most existing methods search only the local topological information to mine dense subgraphs as protein complexes, ignoring the global topological information. To tackle this issue, we propose the DPCMNE method to detect protein complexes via multi-level network embedding. It can preserve both the local and global topological information of biological networks. First, DPCMNE employs a hierarchical compressing strategy to recursively compress the input protein-protein interaction (PPI) network into multi-level smaller PPI networks. Then, a network embedding method is applied to these smaller PPI networks to learn protein embeddings at different levels of granularity. The embeddings learned from all the compressed PPI networks are concatenated to represent the final protein embeddings of the original input PPI network. Finally, a core-attachment based strategy is adopted to detect protein complexes in the weighted PPI network constructed from the pairwise similarity of protein embeddings. To assess the efficiency of our proposed method, DPCMNE is compared with eight other clustering algorithms on two yeast datasets. The experimental results show that DPCMNE outperforms these state-of-the-art complex detection methods in terms of F1 and F1+Acc. Furthermore, the results of functional enrichment analysis indicate that protein complexes detected by DPCMNE are more biologically significant in terms of P-score.
|
45
|
Jiang H, Huang Y, Li Q. Spectral clustering of single cells using Siamese nerual network combined with improved affinity matrix. Brief Bioinform 2022; 23:6567703. [PMID: 35419595 DOI: 10.1093/bib/bbac113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/13/2022] [Revised: 03/02/2022] [Accepted: 03/08/2022] [Indexed: 11/14/2022] Open
Abstract
The development of single-cell RNA-sequencing (scRNA-seq) has pushed past the limitations of bulk sequencing techniques in analyzing cell heterogeneity and diversity. Detecting clusters of cells is a key step in the analysis of scRNA-seq data. However, the high dimensionality of scRNA-seq data and imbalances in the numbers of different subcellular types are ubiquitous in real scRNA-seq data sets, which poses a huge challenge to single-cell-type detection. We propose a meta-learning-based model, SiaClust, which combines a Siamese Convolutional Neural Network (CNN) with improved spectral clustering, to achieve scRNA-seq cell type detection. Specifically, with the help of the constrained Sigmoid kernel, the raw high-dimensional data is mapped to a low-dimensional space, and the Siamese CNN learns the differences between the cell types in the low-dimensional feature space. The similarity matrix learned by the Siamese CNN is used in combination with improved spectral clustering and t-distributed Stochastic Neighbor Embedding (t-SNE) for visualization. SiaClust highlights the differences between cell types by comparing the similarity of the samples while blurring the differences within cell types, which makes it better at processing high-dimensional and imbalanced data. In an extensive evaluation on data generated from nine different species and tissues through different scRNA-seq protocols, and in comparisons with state-of-the-art single-cell clustering models, SiaClust significantly improves clustering accuracy. More importantly, SiaClust accurately locates the exact sites of dropout genes, and is more flexible with data size and cell type.
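To make the role of the similarity (affinity) matrix in spectral clustering concrete, here is a minimal sketch of plain two-way spectral partitioning. It is not SiaClust: the affinity matrix below is hand-written rather than learned by a Siamese CNN, the "improved" spectral clustering of the paper is not reproduced, and the function name `spectral_bipartition` is hypothetical.

```python
import numpy as np

def spectral_bipartition(affinity):
    """Split samples into two groups using the normalized graph Laplacian."""
    affinity = np.asarray(affinity, dtype=float)
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    # The sign pattern of the Fiedler vector gives the two-way split.
    return (fiedler > 0).astype(int)

# Toy affinity: cells 0-2 are similar to each other, cells 3-5 likewise,
# and the two groups are only weakly connected.
strong, weak = 1.0, 0.01
W = np.full((6, 6), weak)
W[:3, :3] = strong
W[3:, 3:] = strong
np.fill_diagonal(W, 0.0)

labels = spectral_bipartition(W)
```

Which group receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. For more than two clusters, a common extension is to run k-means on the first few eigenvectors instead of thresholding one.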
Affiliation(s)
- Hanjing Jiang
  - Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, Institute of Artificial Intelligence, School of Artificial Intelligence and Automation, 430074, Wuhan, China
- Yabing Huang
  - Renmin Hospital of Wuhan University, Department of Pathology, 430060, Wuhan, China
- Qianpeng Li
  - Chinese Academy of Sciences, Institute of Automation, 100190, Beijing, China
|
46
|
Wang CY, Gao YL, Liu JX, Kong XZ, Zheng CH. Single-Cell RNA Sequencing Data Clustering by Low-Rank Subspace Ensemble Framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1154-1164. [PMID: 33026977 DOI: 10.1109/tcbb.2020.3029187] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology reveals the gene expression status and gene structure of individual cells, reflecting the heterogeneity and diversity of cells. Traditional methods of scRNA-seq data analysis treat the data as lying in a single subspace, which hides structural information in other subspaces. In this paper, we propose a low-rank subspace ensemble clustering framework (LRSEC) to analyze scRNA-seq data. Assuming that the scRNA-seq data exist in multiple subspaces, the low-rank model is used to find the lowest rank representation of the data in the subspace. It is worth noting that the penalty factor of the low-rank kernel function is uncertain, and different penalty factors correspond to different low-rank structures. Moreover, it is difficult for a single clustering model to find the cellular structure of all datasets. To strengthen the correlation between model solutions, we construct a new ensemble clustering framework, LRSEC, by using the low-rank model as the basic learner. The LRSEC framework captures the global structure of data through low-rank subspaces and has better clustering performance than a single clustering model. We validate the performance of the LRSEC framework on seven small datasets and one large dataset and obtain satisfactory results.
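The abstract above does not say how LRSEC combines its base learners. As a generic illustration of ensemble clustering only, the sketch below builds a co-association (consensus) matrix from several labelings, such as those obtained with different penalty factors. The function name `co_association` and the toy labelings are assumptions, and this is a standard ensemble device rather than the paper's confirmed procedure.

```python
import numpy as np

def co_association(labelings):
    """Fraction of runs in which each pair of cells shares a cluster."""
    labelings = np.asarray(labelings)
    # same[r, i, j] is True when run r puts cells i and j in the same cluster.
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)

# Three hypothetical base clusterings of five cells. Label values are
# arbitrary within each run; only co-membership matters.
runs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
consensus = co_association(runs)
```

The consensus matrix can then be treated as a similarity matrix and passed to any similarity-based clustering method, so that pairs of cells grouped together by most base learners end up in the same final cluster.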
|
47
|
Mirzal A. Statistical Analysis of Microarray Data Clustering using NMF, Spectral Clustering, Kmeans, and GMM. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1173-1192. [PMID: 32956065 DOI: 10.1109/tcbb.2020.3025486] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
In the unsupervised learning literature, clustering of microarray gene expression datasets has been studied extensively, with nonnegative matrix factorization (NMF), spectral clustering, kmeans, and Gaussian mixture model (GMM) among the most used methods. However, there is still a limited number of works that use statistical analysis to measure the significance of performance differences between these methods. In this paper, we present a statistical analysis of performance differences between ten NMF, six spectral clustering, four GMM, and the standard kmeans algorithms in clustering eleven publicly available microarray gene expression datasets, with the number of clusters ranging from two to ten. The experimental results show that statistically NMFs and kmeans have similar performances and outperform spectral clustering. As spectral clustering can be used to uncover hidden manifold structures, the underperformance of spectral methods leads us to question whether the datasets have manifold structures. Visual inspection using multidimensional scaling plots indicates that such structures do not exist. Moreover, as the plots indicate that clusters in some datasets have elliptical boundaries, GMM methods are also utilized. The experimental results show that GMM methods outperform the other methods to some degree, and thus imply that the datasets follow Gaussian distributions.
|
48
|
Huizing GJ, Peyré G, Cantini L. Optimal transport improves cell-cell similarity inference in single-cell omics data. Bioinformatics 2022; 38:2169-2177. [PMID: 35157031 PMCID: PMC9004651 DOI: 10.1093/bioinformatics/btac084] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 07/12/2021] [Revised: 12/17/2021] [Accepted: 02/08/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: High-throughput single-cell molecular profiling is revolutionizing biology and medicine by unveiling the diversity of cell types and states contributing to development and disease. The identification and characterization of cellular heterogeneity are typically achieved through unsupervised clustering, which crucially relies on a similarity metric. RESULTS: We here propose the use of Optimal Transport (OT) as a cell-cell similarity metric for single-cell omics data. OT defines distances to compare high-dimensional data represented as probability distributions. To speed up computations and cope with the high dimensionality of single-cell data, we consider the entropic regularization of the classical OT distance. We then extensively benchmark OT against state-of-the-art metrics over 13 independent datasets, including simulated, scRNA-seq, scATAC-seq and single-cell DNA methylation data. First, we test the ability of the metrics to detect the similarity between cells belonging to the same groups (e.g. cell types, cell lines of origin). Then, we apply unsupervised clustering and test the quality of the resulting clusters. OT is found to improve cell-cell similarity inference and cell clustering in all simulated and real scRNA-seq data, as well as in scATAC-seq and single-cell DNA methylation data. AVAILABILITY AND IMPLEMENTATION: All our analyses are reproducible through the OT-scOmics Jupyter notebook available at https://github.com/ComputationalSystemsBiology/OT-scOmics. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
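The entropic regularization mentioned in the abstract above is commonly solved with Sinkhorn iterations. The sketch below is a minimal NumPy version under stated assumptions: each cell is a normalized profile over three hypothetical features, the ground cost between features is simply the index distance, and `reg` and `n_iter` are arbitrary illustrative values. It is not the OT-scOmics implementation.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.5, n_iter=500):
    """Entropic-regularized OT between histograms a and b (Sinkhorn iterations)."""
    K = np.exp(-cost / reg)  # Gibbs kernel derived from the ground cost
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)    # rescale to match the column marginals
        u = a / (K @ v)      # rescale to match the row marginals
    plan = u[:, None] * K * v[None, :]
    return plan, float((plan * cost).sum())

# Two toy "cells", each a probability distribution over three features.
cell_a = np.array([0.7, 0.2, 0.1])
cell_b = np.array([0.1, 0.2, 0.7])
idx = np.arange(3)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)  # assumed ground cost

plan, cross_cost = sinkhorn(cell_a, cell_b, cost)
_, self_cost = sinkhorn(cell_a, cell_a, cost)
```

A smaller `reg` brings the result closer to the unregularized OT distance but needs more iterations and can underflow in the kernel, which is why practical implementations often work in the log domain. Note that the regularized cost of a cell against itself is not exactly zero, although it is smaller than the cost between the two dissimilar cells here.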
Affiliation(s)
- Gabriel Peyré
  - Département de Mathématiques et Applications de l’Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure, Université PSL, 75005 Paris, France
|
49
|
Zhang NN, Liu JX, Zheng CH, Wang J. SLRRSC: single-cell type recognition method based on similarity and graph regularization constraints. IEEE J Biomed Health Inform 2022; 26:3556-3566. [PMID: 35120014 DOI: 10.1109/jbhi.2022.3148286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/06/2022]
Abstract
Single-cell clustering is a crucial task in scRNA-seq analysis, as it reveals the natural grouping of cells. However, due to the high noise and high dimensionality of scRNA-seq data, effectively and accurately identifying cell types from a large mixture of cells is still a challenge. Considering this, in this paper, we propose a novel subspace clustering algorithm termed SLRRSC. This method is developed based on the low-rank representation (LRR) model, and it aims to capture the global and local properties inherent in data. To make the LRR matrix describe the spatial relationship of samples more accurately, we introduce manifold-based graph regularization and a similarity constraint into the LRR-based method SLRRSC. The graph regularization can preserve the local geometric structure of the data in low-rank decomposition, so that the low-rank representation matrix contains more local structure information. By imposing a similarity constraint on the low-rank matrix, the similarity information between sample pairs is further introduced into the SLRRSC model to improve the ability of the low-rank method to learn global structure. At the same time, the similarity constraint makes the low-rank representation matrix symmetric, which makes it more interpretable in clustering applications. We compare the effectiveness of the SLRRSC algorithm with other single-cell clustering methods on simulated data and real single-cell datasets. The results show that this method can obtain a more accurate sample similarity matrix and effectively solve the problem of cell type recognition.
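The abstract above does not state the SLRRSC objective, so the following is only the standard identity behind manifold-based graph regularization, written in assumed notation: $z_i$ is the $i$-th column of the representation matrix $Z$, $W = (w_{ij})$ is a symmetric sample-similarity graph, $D$ is its diagonal degree matrix, and $L = D - W$ is the graph Laplacian.

```latex
% Assumed notation, not taken from the abstract:
% z_i = i-th column of Z, W symmetric, d_{ii} = \sum_j w_{ij}, L = D - W.
\frac{1}{2}\sum_{i,j} w_{ij}\,\lVert z_i - z_j \rVert_2^2
  = \sum_i d_{ii}\, z_i^\top z_i - \sum_{i,j} w_{ij}\, z_i^\top z_j
  = \operatorname{tr}\!\left(Z D Z^\top\right) - \operatorname{tr}\!\left(Z W Z^\top\right)
  = \operatorname{tr}\!\left(Z L Z^\top\right)
```

Minimizing this term penalizes representations $z_i$ and $z_j$ that differ for samples with large $w_{ij}$, which is the sense in which a regularizer of this form preserves local geometric structure.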
|
50
|
|