1
|
Hozumi Y, Tanemura KA, Wei GW. Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection. J Chem Inf Model 2024; 64:2829-2838. [PMID: 37402705 PMCID: PMC11009150 DOI: 10.1021/acs.jcim.3c00674] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/06/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing the downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method, for the first time. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring the knowledge of the true labels. The R-S plot provides a unique alternative to the uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
Collapse
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Kiyoto Aramis Tanemura
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
Collapse
|
2
|
Qiao Q, Yuan SS, Shang J, Liu JX. Multi-View Enhanced Tensor Nuclear Norm and Local Constraint Model for Cancer Clustering and Feature Gene Selection. J Comput Biol 2023; 30:889-899. [PMID: 37471239 DOI: 10.1089/cmb.2023.0107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/22/2023] Open
Abstract
The analysis of cancer data from multi-omics can effectively promote cancer research. The main focus of this article is to cluster cancer samples and identify feature genes to reveal the correlation between cancers and genes, with the primary approach being the analysis of multi-view cancer omics data. Our proposed solution, the Multi-View Enhanced Tensor Nuclear Norm and Local Constraint (MVET-LC) model, aims to utilize the consistency and complementarity of omics data to support biological research. The model is designed to maximize the utilization of multi-view data and incorporates a nuclear norm and local constraint to achieve this goal. The first step involves introducing the concept of enhanced partial sum of tensor nuclear norm, which significantly enhances the flexibility of the tensor nuclear norm. After that, we incorporate total variation regularization into the MVET-LC model to further augment its performance. It enables MVET-LC to make use of the relationship between tensor data structures and sparse data while paying attention to the feature details of the tensor data. To tackle the iterative optimization problem of MVET-LC, the alternating direction method of multipliers is utilized. Through experimental validation, it is demonstrated that our proposed model outperforms other comparison models.
Collapse
Affiliation(s)
- Qian Qiao
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Sha-Sha Yuan
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Jin-Xing Liu
- School of Computer Science, Qufu Normal University, Rizhao, China
| |
Collapse
|
3
|
Shu Z, Long Q, Zhang L, Yu Z, Wu XJ. Robust Graph Regularized NMF with Dissimilarity and Similarity Constraints for ScRNA-seq Data Clustering. J Chem Inf Model 2022; 62:6271-6286. [PMID: 36459053 DOI: 10.1021/acs.jcim.2c01305] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022]
Abstract
The notable progress in single-cell RNA sequencing (ScRNA-seq) technology is beneficial to accurately discover the heterogeneity and diversity of cells. Clustering is an extremely important step during the ScRNA-seq data analysis. However, it cannot achieve satisfactory performances by directly clustering ScRNA-seq data due to its high dimensionality and noise. To address these issues, we propose a novel ScRNA-seq data representation model, termed Robust Graph regularized Non-Negative Matrix Factorization with Dissimilarity and Similarity constraints (RGNMF-DS), for ScRNA-seq data clustering. To accurately characterize the structure information of the labeled samples and the unlabeled samples, respectively, the proposed RGNMF-DS model adopts a couple of complementary regularizers (i.e., similarity and dissimilar regularizers) to guide matrix decomposition. In addition, we construct a graph regularizer to discover the local geometric structure hidden in ScRNA-seq data. Moreover, we adopt the l2,1-norm to measure the reconstruction error and thereby effectively improve the robustness of the proposed RGNMF-DS model to the noises. Experimental results on several ScRNA-seq datasets have demonstrated that our proposed RGNMF-DS model outperforms other state-of-the-art competitors in clustering.
Collapse
Affiliation(s)
- Zhenqiu Shu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Qinghan Long
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Luping Zhang
- Library of Kunming Medical University, Kunming 650031, China
| | - Zhengtao Yu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Xiao-Jun Wu
- Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China
| |
Collapse
|
4
|
Gene Expression Analysis through Parallel Non-Negative Matrix Factorization. COMPUTATION 2021. [DOI: 10.3390/computation9100106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Genetic expression analysis is a principal tool to explain the behavior of genes in an organism when exposed to different experimental conditions. In the state of art, many clustering algorithms have been proposed. It is overwhelming the amount of biological data whose high-dimensional structure exceeds mostly current computational architectures. The computational time and memory consumption optimization actually become decisive factors in choosing clustering algorithms. We propose a clustering algorithm based on Non-negative Matrix Factorization and K-means to reduce data dimensionality but whilst preserving the biological context and prioritizing gene selection, and it is implemented within parallel GPU-based environments through the CUDA library. A well-known dataset is used in our tests and the quality of the results is measured through the Rand and Accuracy Index. The results show an increase in the acceleration of 6.22× compared to the sequential version. The algorithm is competitive in the biological datasets analysis and it is invariant with respect to the classes number and the size of the gene expression matrix.
Collapse
|
5
|
Ren LR, Liu JX, Gao YL, Kong XZ, Zheng CH. Kernel Risk-Sensitive Loss based Hyper-graph Regularized Robust Extreme Learning Machine and Its Semi-supervised Extension for Classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107226] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
6
|
Zhao YY, Jiao CN, Wang ML, Liu JX, Wang J, Zheng CH. HTRPCA: Hypergraph Regularized Tensor Robust Principal Component Analysis for Sample Clustering in Tumor Omics Data. Interdiscip Sci 2021; 14:22-33. [PMID: 34115312 DOI: 10.1007/s12539-021-00441-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2021] [Revised: 04/27/2021] [Accepted: 05/15/2021] [Indexed: 12/24/2022]
Abstract
In recent years, clustering analysis of cancer genomics data has gained widespread attention. However, limited by the dimensions of the matrix, the traditional methods cannot fully mine the underlying geometric structure information in the data. Besides, noise and outliers inevitably exist in the data. To solve the above two problems, we come up with a new method which uses tensor to represent cancer omics data and applies hypergraph to save the geometric structure information in original data. This model is called hypergraph regularized tensor robust principal component analysis (HTRPCA). The data processed by HTRPCA becomes two parts, one of which is a low-rank component that contains pure underlying structure information between samples, and the other is some sparse interference points. So we can use the low-rank component for clustering. This model can retain complex geometric information between more sample points due to the addition of the hypergraph regularization. Through clustering, we can demonstrate the effectiveness of HTRPCA, and the experimental results on TCGA datasets demonstrate that HTRPCA precedes other advanced methods. This paper proposes a new method of using tensors to represent cancer omics data and introduces hypergraph items to save the geometric structure information of the original data. At the same time, the model decomposes the original tensor into low-order tensors and sparse tensors. The low-rank tensor was used to cluster cancer samples to verify the effectiveness of the method.
Collapse
Affiliation(s)
- Yu-Ying Zhao
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Cui-Na Jiao
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Mao-Li Wang
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Jin-Xing Liu
- School of Computer Science, Qufu Normal University, Rizhao, China. .,Rizhao Huilian Zhongchuang Institute of Intelligent Technology, Rizhao, 276826, China.
| | - Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Chun-Hou Zheng
- School of Computer Science, Qufu Normal University, Rizhao, China
| |
Collapse
|
7
|
Ren LR, Gao YL, Liu JX, Zhu R, Kong XZ. L 2,1-Extreme Learning Machine: An Efficient Robust Classifier for Tumor Classification. Comput Biol Chem 2020; 89:107368. [PMID: 32919230 DOI: 10.1016/j.compbiolchem.2020.107368] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2019] [Revised: 08/17/2020] [Accepted: 08/25/2020] [Indexed: 12/12/2022]
Abstract
With the development of cancer research, various gene expression datasets containing cancer information show an explosive growth trend. In addition, due to the continuous maturity of single-cell RNA sequencing (scRNA-seq) technology, the protein information and pedigree information of a single cell are also continuously mined. It is a technical problem of how to classify these high-dimensional data correctly. In recent years, Extreme Learning Machine (ELM) has been widely used in the field of supervised learning and unsupervised learning. However, the traditional ELM does not consider the robustness of the method. To improve the robustness of ELM, in this paper, a novel ELM method based on L2,1-norm named L2,1-Extreme Learning Machine (L2,1 -ELM) has been proposed. The method introduces L2,1-norm on loss function to improve the robustness, and minimizes the influence of noise and outliers. Firstly, we evaluate the new method on five UCI datasets. The experiment results prove that our method can achieve competitive results. Next, the novel method is applied to the problem of classification of cancer samples and single-cell RNA sequencing datasets. The experimental results on The Cancer Genome Atlas (TCGA) datasets and scRNA-seq datasets prove that ELM and its variants has great potential in the classification of cancer samples.
Collapse
Affiliation(s)
- Liang-Rui Ren
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China.
| | - Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China.
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China.
| | - Rong Zhu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China.
| | - Xiang-Zhen Kong
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China.
| |
Collapse
|