1
Wang R, Schlick T. How Large is the Universe of RNA-Like Motifs? A Clustering Analysis of RNA Graph Motifs Using Topological Descriptors. ARXIV 2025:arXiv:2501.04258v1. [PMID: 39867422 PMCID: PMC11760235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
Identifying novel and functional RNA structures remains a significant challenge in RNA motif design and is crucial for developing RNA-based therapeutics. Here we introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% (121 dual graphs) correspond to approximately 200,000 known RNA atomic fragments/substructures (collected in 2021) using the RNA-as-Graphs (RAG) mapping method. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters. The cluster with the higher percentage of known dual graphs for RNA is defined as the "RNA-like" cluster, while the other is considered as "non-RNA-like". The distance of each dual graph to the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. From validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Among the top 15 graphs identified as high-likelihood candidates for novel RNA motifs, 4 were confirmed from the RNA dataset collected in 2022. Significantly, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs (subgraphs preserve pseudoknots and junctions). 
Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features (e.g. Betti-0 and Betti-1 numbers). These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, and perhaps anti-viral therapeutics, by subgraph assembly.
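The cluster-labelling step described in the abstract above (split all graphs into two clusters, call the cluster with the higher share of known RNA graphs "RNA-like", then rank graphs by distance to that cluster's center) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract only says "clustering algorithms", so plain k-means with k = 2 is an assumption, and the two-dimensional feature vectors, the `known` index set, and the function names are hypothetical stand-ins for the 19 PSG-based features.

```python
import math

def kmeans2(points, iters=50):
    """Lloyd's algorithm with k=2 (assumed clustering method).
    Seeds: the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    centers = [c0, c1]
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
                  for p in points]
        new_centers = []
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the old center if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def rank_by_rna_likeness(features, known):
    """Label the cluster with the larger fraction of known graphs as RNA-like
    and sort all graphs by distance to that cluster's center."""
    labels, centers = kmeans2(features)

    def known_fraction(k):
        idx = [i for i, lab in enumerate(labels) if lab == k]
        return sum(i in known for i in idx) / len(idx) if idx else 0.0

    rna_like = max((0, 1), key=known_fraction)
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], centers[rna_like]))
    return rna_like, labels, order

# Hypothetical toy data: six graphs with 2 features each; graphs 0 and 1 are "known".
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.4), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
known = {0, 1}
rna_like, labels, order = rank_by_rna_likeness(features, known)
print("RNA-like cluster:", rna_like)
print("graphs ranked by closeness to its center:", order)
```

Graphs that are not in `known` but sit early in `order` play the role of the high-likelihood candidates mentioned in the abstract.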
Affiliation(s)
- Rui Wang
- Simons Center for Computational Physical Chemistry, New York University, New York, NY 10003, USA
- Tamar Schlick
- Simons Center for Computational Physical Chemistry, New York University, New York, NY 10003, USA
- Department of Chemistry, New York University, New York, NY 10003, USA
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
- New York University-East China Normal University Center for Computational Chemistry, New York University Shanghai, Shanghai 200122, China
2
Hozumi Y, Wei GW. Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE. PLoS One 2024; 19:e0311791. [PMID: 39671349 PMCID: PMC11642954 DOI: 10.1371/journal.pone.0311791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 09/24/2024] [Indexed: 12/15/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing scRNA-seq data. CCP utilizes gene-gene correlations to partition the genes and, based on the partition, employs cell-cell interactions to obtain super-genes. Because CCP is a data-domain approach that does not require matrix diagonalization, it can be used in many downstream machine learning tasks. In this work, we utilize CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). Using 21 publicly available datasets, we have found that CCP significantly improves UMAP and tSNE visualization and dramatically improves their accuracy. More specifically, CCP improves UMAP by 22% in ARI, 14% in NMI and 15% in ECM, and improves tSNE by 11% in ARI, 9% in NMI and 8% in ECM.
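For readers unfamiliar with the ARI figures quoted above, the adjusted Rand index compares a predicted clustering with reference labels and corrects for chance agreement. The sketch below computes it from the contingency table using the standard formula; it is illustrative only, the function name and toy labels are hypothetical, and it does not reproduce the paper's evaluation (NMI and ECM are not shown).

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # reference cluster sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)     # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical cell-type labels versus cluster assignments.
truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]
print(adjusted_rand_index(truth, same_up_to_renaming))
print(adjusted_rand_index(truth, crossed))
```

The score does not depend on how clusters are named, which is why it suits comparisons between embeddings whose cluster labels are arbitrary.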
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
3
Wang GF, Shen L. Cauchy hyper-graph Laplacian nonnegative matrix factorization for single-cell RNA-sequencing data analysis. BMC Bioinformatics 2024; 25:169. [PMID: 38684942 PMCID: PMC11059750 DOI: 10.1186/s12859-024-05797-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 04/24/2024] [Indexed: 05/02/2024] Open
Abstract
Many important biological facts have been found as single-cell RNA sequencing (scRNA-seq) technology has advanced. With the use of this technology, it is now possible to investigate the connections among individual cells, genes, and illnesses. For the analysis of single-cell data, clustering is frequently used. Nevertheless, biological data usually contain a large amount of noise, and traditional clustering methods are sensitive to noise. Moreover, acquiring higher-order spatial information from the data alone is insufficient. As a result, getting trustworthy clustering results is challenging. We propose the Cauchy hyper-graph Laplacian non-negative matrix factorization (CHLNMF) as a unique approach to address these issues. In CHLNMF, we replace the measurement based on Euclidean distance in the conventional non-negative matrix factorization (NMF) with the Cauchy loss function (CLF), which can lessen the influence of noise. The model also incorporates the hyper-graph constraint, which takes into account the high-order link among the samples. The CHLNMF model's best solution is then found using a half-quadratic optimization approach. Finally, using seven scRNA-seq datasets, we compare the CHLNMF technique with nine other top methods. The validity of our technique was established by analysis of the experimental outcomes.
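The claim that a Cauchy loss is less sensitive to noise than a Euclidean (squared) loss can be illustrated numerically. The sketch below assumes the common form ln(1 + r²/c²) for the Cauchy loss with a scale parameter `c`; the exact form and scaling used in CHLNMF may differ, and the function names are hypothetical. The weight function is the generic reweighting factor ρ'(r)/(2r) from robust estimation, shown only to convey why large residuals matter less; it is not the paper's update rule.

```python
import math

def squared_loss(r):
    """Euclidean-style penalty: grows quadratically with the residual."""
    return r * r

def cauchy_loss(r, c=1.0):
    """Assumed Cauchy penalty ln(1 + (r/c)^2): grows only logarithmically."""
    return math.log1p((r / c) ** 2)

def cauchy_weight(r, c=1.0):
    """rho'(r) / (2r) for the penalty above, i.e. 1 / (c^2 + r^2).
    Large residuals (likely noise) receive small weights."""
    return 1.0 / (c * c + r * r)

for r in (0.1, 1.0, 10.0):
    print(f"r={r:5.1f}  squared={squared_loss(r):8.3f}  "
          f"cauchy={cauchy_loss(r):6.3f}  weight={cauchy_weight(r):6.4f}")
```

A residual ten times larger contributes a hundred times more to the squared loss but only a few units more to the Cauchy loss, so a handful of noisy entries cannot dominate the factorization objective.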
Affiliation(s)
- Gao-Fei Wang
- School of Computer Science, Qufu Normal University, Rizhao, 276826, Shandong, China
- Longying Shen
- School of Computer Science, Qufu Normal University, Rizhao, 276826, Shandong, China
4
Su Z, Tong Y, Wei GW. Hodge Decomposition of Single-Cell RNA Velocity. J Chem Inf Model 2024; 64:3558-3568. [PMID: 38572676 PMCID: PMC11035094 DOI: 10.1021/acs.jcim.4c00132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 03/21/2024] [Accepted: 03/22/2024] [Indexed: 04/05/2024]
Abstract
RNA velocity can capture cell dynamic information in biological processes, yet a comprehensive analysis of cell state transitions and their associated chemical and biological processes is still lacking. In this work, we provide the Hodge decomposition, coupled with discrete exterior calculus (DEC), to unveil cell dynamics by examining the decomposed curl-free, divergence-free, and harmonic components of the RNA velocity field in a low-dimensional representation, such as a UMAP or a t-SNE representation. Decomposition results show that the decomposed components distinctly reveal key cell dynamic features such as cell cycle, bifurcation, and cell lineage differentiation, regardless of the choice of the low-dimensional representation. The consistency across different representations demonstrates that the Hodge decomposition is a reliable and robust way to extract these cell dynamic features, offering unique analysis and insightful visualization of single-cell RNA velocity fields.
Affiliation(s)
- Zhe Su
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yiying Tong
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
5
Feng H, Cottrell S, Hozumi Y, Wei GW. Multiscale differential geometry learning of networks with applications to single-cell RNA sequencing data. Comput Biol Med 2024; 171:108211. [PMID: 38422960 PMCID: PMC10965033 DOI: 10.1016/j.compbiomed.2024.108211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 02/02/2024] [Accepted: 02/25/2024] [Indexed: 03/02/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a transformative technology, offering unparalleled insights into the intricate landscape of cellular diversity and gene expression dynamics. scRNA-seq analysis represents a challenging and cutting-edge frontier within the field of biological research. Differential geometry serves as a powerful mathematical tool in various applications of scientific research. In this study, we introduce, for the first time, a multiscale differential geometry (MDG) strategy for addressing the challenges encountered in scRNA-seq data analysis. We assume that intrinsic properties of cells lie on a family of low-dimensional manifolds embedded in the high-dimensional space of scRNA-seq data. Multiscale cell-cell interactive manifolds are constructed to reveal complex relationships in the cell-cell network, where curvature-based features for cells can decipher the intricate structural and biological information. We showcase the utility of our novel approach by demonstrating its effectiveness in classifying cell types. This innovative application of differential geometry in scRNA-seq analysis opens new avenues for understanding the intricacies of biological networks and holds great potential for network analysis in other fields.
Affiliation(s)
- Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Sean Cottrell
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA