1
|
Bilous M, Hérault L, Gabriel AA, Teleman M, Gfeller D. Building and analyzing metacells in single-cell genomics data. Mol Syst Biol 2024; 20:744-766. [PMID: 38811801 PMCID: PMC11220014 DOI: 10.1038/s44320-024-00045-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 05/03/2024] [Accepted: 05/08/2024] [Indexed: 05/31/2024] Open
Abstract
The advent of high-throughput single-cell genomics technologies has fundamentally transformed biological sciences. Currently, millions of cells from complex biological tissues can be phenotypically profiled across multiple modalities. The scaling of computational methods to analyze and visualize such data is a constant challenge, and tools need to be regularly updated, if not redesigned, to cope with ever-growing numbers of cells. Over the last few years, metacells have been introduced to reduce the size and complexity of single-cell genomics data while preserving biologically relevant information and improving interpretability. Here, we review recent studies that capitalize on the concept of metacells, and the many variants in nomenclature that have been used. We further outline how and when metacells should (or should not) be used to analyze single-cell genomics data and what should be considered when analyzing such data at the metacell level. To facilitate the exploration of metacells, we provide a comprehensive tutorial on the construction and analysis of metacells from single-cell RNA-seq data (https://github.com/GfellerLab/MetacellAnalysisTutorial) as well as a fully integrated pipeline to rapidly build, visualize and evaluate metacells with different methods (https://github.com/GfellerLab/MetacellAnalysisToolkit).
Affiliation(s)
- Mariia Bilous
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
| | - Léonard Hérault
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
| | - Aurélie Ag Gabriel
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
| | - Matei Teleman
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
| | - David Gfeller
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland.
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland.
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland.
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland.
| |
|
2
|
Aihara G, Clifton K, Chen M, Li Z, Atta L, Miller BF, Satija R, Hickey JW, Fan J. SEraster: a rasterization preprocessing framework for scalable spatial omics data analysis. Bioinformatics 2024; 40:btae412. [PMID: 38902953 PMCID: PMC11226864 DOI: 10.1093/bioinformatics/btae412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Revised: 05/15/2024] [Accepted: 06/19/2024] [Indexed: 06/22/2024] Open
Abstract
MOTIVATION: Spatial omics data demand computational analysis but many analysis tools have computational resource requirements that increase with the number of cells analyzed. This presents scalability challenges as researchers use spatial omics technologies to profile millions of cells. RESULTS: To enhance the scalability of spatial omics data analysis, we developed a rasterization preprocessing framework called SEraster that aggregates cellular information into spatial pixels. We apply SEraster to both real and simulated spatial omics data prior to spatial variable gene expression analysis to demonstrate that such preprocessing can reduce computational resource requirements while maintaining high performance, including as compared to other down-sampling approaches. We further integrate SEraster with existing analysis tools to characterize cell-type spatial co-enrichment across length scales. Finally, we apply SEraster to enable analysis of a mouse pup spatial omics dataset with over a million cells to identify tissue-level and cell-type-specific spatially variable genes as well as spatially co-enriched cell types that recapitulate expected organ structures. AVAILABILITY AND IMPLEMENTATION: SEraster is implemented as an R package on GitHub (https://github.com/JEFworks-Lab/SEraster) with additional tutorials at https://JEF.works/SEraster.
Affiliation(s)
- Gohta Aihara
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21211, United States
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
| | - Kalen Clifton
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21211, United States
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
| | - Mayling Chen
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21211, United States
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
| | - Zhuoyan Li
- New York Genome Center, New York, NY 10013, United States
| | - Lyla Atta
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21211, United States
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
| | - Brendan F Miller
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21211, United States
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
| | - Rahul Satija
- New York Genome Center, New York, NY 10013, United States
- Center for Genomics and Systems Biology, New York University, New York, NY 10003, United States
| | - John W Hickey
- Department of Biomedical Engineering, Duke University, Durham, NC 27708, United States
| | - Jean Fan
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21211, United States
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
| |
|
3
|
Cottrell S, Wang R, Wei GW. PLPCA: Persistent Laplacian-Enhanced PCA for Microarray Data Analysis. J Chem Inf Model 2024; 64:2405-2420. [PMID: 37738663 PMCID: PMC10999748 DOI: 10.1021/acs.jcim.3c01023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/24/2023]
Abstract
Over the years, Principal Component Analysis (PCA) has served as the baseline approach for dimensionality reduction in gene expression data analysis. Its primary objective is to identify a subset of disease-causing genes from a vast pool of thousands of genes. However, PCA possesses inherent limitations that hinder its interpretability, introduce class ambiguity, and fail to capture complex geometric structures in the data. Although these limitations have been partially addressed in the literature by incorporating various regularizers, such as graph Laplacian regularization, existing PCA based methods still face challenges related to multiscale analysis and capturing higher-order interactions in the data. To address these challenges, we propose a novel approach called Persistent Laplacian-enhanced Principal Component Analysis (PLPCA). PLPCA amalgamates the advantages of earlier regularized PCA methods with persistent spectral graph theory, specifically persistent Laplacians derived from algebraic topology. In contrast to graph Laplacians, persistent Laplacians enable multiscale analysis through filtration and can incorporate higher-order simplicial complexes to capture higher-order interactions in the data. We evaluate and validate the performance of PLPCA using ten benchmark microarray data sets that exhibit a wide range of dimensions and data imbalance ratios. Our extensive studies over these data sets demonstrate that PLPCA provides up to 12% improvement to the current state-of-the-art PCA models on five evaluation metrics for classification tasks after dimensionality reduction.
Affiliation(s)
- Sean Cottrell
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Rui Wang
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
4
|
Ko KD, Sartorelli V. A deep learning adversarial autoencoder with dynamic batching displays high performance in denoising and ordering scRNA-seq data. iScience 2024; 27:109027. [PMID: 38361616 PMCID: PMC10867661 DOI: 10.1016/j.isci.2024.109027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 11/20/2023] [Accepted: 01/22/2024] [Indexed: 02/17/2024] Open
Abstract
By providing a high-resolution view of cell-to-cell variation in gene expression, single-cell RNA sequencing (scRNA-seq) offers insights into cell heterogeneity, differentiation dynamics, and disease mechanisms. However, challenges such as low capture rates and dropout events can introduce noise in data analysis. Here, we propose a deep neural generative framework, the dynamic batching adversarial autoencoder (DB-AAE), which excels at denoising scRNA-seq datasets. DB-AAE directly captures optimal features from input data and enhances feature preservation, including cell type-specific gene expression patterns. Comprehensive evaluation on simulated and real datasets demonstrates that DB-AAE outperforms other methods in denoising accuracy and biological signal preservation. It also improves the accuracy of other algorithms in establishing pseudo-time inference. This study highlights DB-AAE's effectiveness and potential as a valuable tool for enhancing the quality and reliability of downstream analyses in scRNA-seq research.
Affiliation(s)
- Kyung Dae Ko
- Laboratory of Muscle Stem Cells & Gene Regulation, NIAMS, NIH, Bethesda, MD, USA
| | - Vittorio Sartorelli
- Laboratory of Muscle Stem Cells & Gene Regulation, NIAMS, NIH, Bethesda, MD, USA
| |
|
5
|
Liu Y, Li F, Shang J, Liu J, Wang J, Ge D. scFED: Clustering Identifying Cell Types of scRNA-Seq Data Based on Feature Engineering Denoising. Interdiscip Sci 2023; 15:590-601. [PMID: 37402002 DOI: 10.1007/s12539-023-00574-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 05/31/2023] [Accepted: 06/06/2023] [Indexed: 07/05/2023]
Abstract
Recently developed single-cell RNA-seq (scRNA-seq) technology has given researchers the chance to investigate disease development at the single-cell level. Clustering is one of the most essential strategies for analyzing scRNA-seq data. Choosing high-quality feature sets can significantly enhance the outcomes of single-cell clustering and classification. However, for technical reasons, computationally burdensome and highly expressed genes cannot provide a stable and predictive feature set. In this study, we introduce scFED, a feature-engineered gene selection framework. scFED identifies prospective feature sets to eliminate noise fluctuation and fuses them with existing knowledge from the tissue-specific cellular taxonomy reference database (CellMatch) to avoid the influence of subjective factors. It then applies a reconstruction approach for noise reduction and crucial information amplification. We apply scFED to four genuine single-cell datasets and compare it with other techniques. According to the results, scFED improves clustering, decreases the dimension of the scRNA-seq data, improves cell type identification when combined with clustering algorithms, and has higher performance than other methods. Therefore, scFED offers certain benefits in scRNA-seq data gene selection.
Affiliation(s)
- Yang Liu
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Feng Li
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China.
| | - Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Jinxing Liu
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Daohui Ge
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| |
|
6
|
Toseef M, Olayemi Petinrin O, Wang F, Rahaman S, Liu Z, Li X, Wong KC. Deep transfer learning for clinical decision-making based on high-throughput data: comprehensive survey with benchmark results. Brief Bioinform 2023:bbad254. [PMID: 37455245 DOI: 10.1093/bib/bbad254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 06/04/2023] [Accepted: 06/20/2023] [Indexed: 07/18/2023] Open
Abstract
The rapid growth of omics-based data has revolutionized biomedical research and precision medicine, allowing machine learning models to be developed for cutting-edge performance. However, despite the wealth of high-throughput data available, the performance of these models is hindered by the lack of sufficient training data, particularly in clinical research (in vivo experiments). As a result, translating this knowledge into clinical practice, such as predicting drug responses, remains a challenging task. Transfer learning is a promising tool that bridges the gap between data domains by transferring knowledge from the source to the target domain. Researchers have proposed transfer learning to predict clinical outcomes by leveraging pre-clinical data (mouse, zebrafish), highlighting its vast potential. In this work, we present a comprehensive literature review of deep transfer learning methods for health informatics and clinical decision-making, focusing on high-throughput molecular data. Previous reviews mostly covered image-based transfer learning works, while we present a more detailed analysis of transfer learning papers. Furthermore, we evaluated original studies based on different evaluation settings across cross-validations, data splits and model architectures. The result shows that those transfer learning methods have great potential; high-throughput sequencing data and state-of-the-art deep learning models lead to significant insights and conclusions. Additionally, we explored various datasets in transfer learning papers with statistics and visualization.
Affiliation(s)
- Muhammad Toseef
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | | | - Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Saifur Rahaman
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Zhe Liu
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong SAR
| |
|
7
|
Qi R, Zou Q. Trends and Potential of Machine Learning and Deep Learning in Drug Study at Single-Cell Level. Research (Wash D C) 2023; 6:0050. [PMID: 36930772 PMCID: PMC10013796 DOI: 10.34133/research.0050] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 09/22/2022] [Accepted: 12/27/2022] [Indexed: 01/12/2023]
Abstract
Cancer treatments always face challenging problems, particularly drug resistance due to tumor cell heterogeneity. The existing datasets include the relationship between gene expression and drug sensitivities; however, the majority are based on tissue-level studies. Studying drugs at the single-cell level offers a prospect for overcoming minimal residual disease caused by subclonal resistant cancer cells retained after initial curative therapy. Fortunately, machine learning techniques can help us understand how different types of cells respond to different cancer drugs from the perspective of single-cell gene expression. Good modeling using single-cell data and drug response information will not only improve machine learning for cell-drug outcome prediction but also facilitate the discovery of drugs for specific cancer subgroups and specific cancer treatments. In this paper, we review machine learning and deep learning approaches in drug research. By analyzing the application of these methods on cancer cell lines and single-cell data and comparing the technical gap between single-cell sequencing data analysis and single-cell drug sensitivity analysis, we hope to explore the trends and potential of drug research at the single-cell data level and provide more inspiration for drug research at the single-cell level. We anticipate that this review will stimulate the innovative use of machine learning methods to address new challenges in precision medicine more broadly.
Affiliation(s)
- Ren Qi
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China; School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
8
|
Wang H, Zhao J, Zheng C, Su Y. scDSSC: Deep Sparse Subspace Clustering for scRNA-seq Data. PLoS Comput Biol 2022; 18:e1010772. [PMID: 36534702 PMCID: PMC9810169 DOI: 10.1371/journal.pcbi.1010772] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/31/2022] [Revised: 01/03/2023] [Accepted: 11/28/2022] [Indexed: 12/23/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables researchers to characterize transcriptomic profiles at single-cell resolution with increasingly high throughput. Clustering is a crucial step in single-cell analysis. Clustering analysis of transcriptomes profiled by scRNA-seq can reveal the heterogeneity and diversity of cells. However, single-cell studies still face great challenges due to the high noise and dimensionality of the data. Subspace clustering aims at discovering the intrinsic structure of data in an unsupervised fashion. In this paper, we propose a deep sparse subspace clustering method, scDSSC, combining noise reduction and dimensionality reduction for scRNA-seq data, which simultaneously learns feature representation and clustering via explicit modelling of scRNA-seq data generation. Experiments on a variety of scRNA-seq datasets ranging from thousands to tens of thousands of cells have shown that scDSSC can significantly improve clustering performance and facilitate the interpretability of clustering and downstream analysis. Compared to some popular scRNA-seq analysis methods, scDSSC outperformed state-of-the-art methods under various clustering performance metrics.
Affiliation(s)
- HaiYun Wang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - JianPing Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - ChunHou Zheng
- School of Artificial Intelligence, Anhui University, Hefei, China
| | - YanSen Su
- School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
9
|
Wang HY, Zhao JP, Su YS, Zheng CH. scCDG: A Method Based on DAE and GCN for scRNA-Seq Data Analysis. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3685-3694. [PMID: 34752401 DOI: 10.1109/tcbb.2021.3126641] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 06/13/2023]
Abstract
Identifying cell types is one of the main goals of single-cell RNA sequencing (scRNA-seq) analysis, and clustering is a common method for this task. However, the massive amount of data and the high noise level pose challenges for single-cell clustering. To address these challenges, we introduce a novel method named single-cell clustering based on denoising autoencoder and graph convolution network (scCDG), which consists of two core models. The first model is a denoising autoencoder (DAE) used to fit the data distribution for data denoising. The second model is a graph autoencoder using a graph convolution network (GCN), which projects the data into a low-dimensional (compressed) space while simultaneously preserving the topological structure information and feature information in scRNA-seq data. Extensive analysis on seven real scRNA-seq datasets demonstrates that scCDG outperforms state-of-the-art methods in some research sub-fields, including single-cell clustering, visualization of the transcriptome landscape, and trajectory inference.
|
10
|
Bilous M, Tran L, Cianciaruso C, Gabriel A, Michel H, Carmona SJ, Pittet MJ, Gfeller D. Metacells untangle large and complex single-cell transcriptome networks. BMC Bioinformatics 2022; 23:336. [PMID: 35963997 PMCID: PMC9375201 DOI: 10.1186/s12859-022-04861-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/10/2022] [Accepted: 07/23/2022] [Indexed: 12/13/2022] Open
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) technologies offer unique opportunities for exploring heterogeneous cell populations. However, in-depth single-cell transcriptomic characterization of complex tissues often requires profiling tens to hundreds of thousands of cells. Such large numbers of cells represent an important hurdle for downstream analyses, interpretation and visualization. Results: We develop a framework called SuperCell to merge highly similar cells into metacells and perform standard scRNA-seq data analyses at the metacell level. Our systematic benchmarking demonstrates that metacells not only preserve but often improve the results of downstream analyses including visualization, clustering, differential expression, cell type annotation, gene correlation, imputation, RNA velocity and data integration. By capitalizing on the redundancy inherent to scRNA-seq data, metacells significantly facilitate and accelerate the construction and interpretation of single-cell atlases, as demonstrated by the integration of 1.46 million cells from COVID-19 patients in less than two hours on a standard desktop. Conclusions: SuperCell is a framework to build and analyze metacells in a way that efficiently preserves the results of scRNA-seq data analyses while significantly accelerating and facilitating them.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04861-1.
Affiliation(s)
- Mariia Bilous
- Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Loc Tran
- Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Chiara Cianciaruso
- Department of Pathology and Immunology, University of Geneva, Geneva, Switzerland
| | - Aurélie Gabriel
- Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Hugo Michel
- Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland
| | - Santiago J Carmona
- Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Mikael J Pittet
- Department of Pathology and Immunology, University of Geneva, Geneva, Switzerland; Department of Oncology, Geneva University Hospitals, Geneva, Switzerland; Center for Systems Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - David Gfeller
- Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| |
|
11
|
Ren J, Zhang Q, Zhou Y, Hu Y, Lyu X, Fang H, Yang J, Yu R, Shi X, Li Q. A Downsampling Method Enables Robust Clustering and Integration of Single-Cell Transcriptome Data. J Biomed Inform 2022; 130:104093. [DOI: 10.1016/j.jbi.2022.104093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2021] [Revised: 04/06/2022] [Accepted: 05/03/2022] [Indexed: 11/27/2022]
|
12
|
Wang CY, Gao YL, Liu JX, Kong XZ, Zheng CH. Single-Cell RNA Sequencing Data Clustering by Low-Rank Subspace Ensemble Framework. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1154-1164. [PMID: 33026977 DOI: 10.1109/tcbb.2020.3029187] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology reveals the gene expression status and gene structure of individual cells, reflecting the heterogeneity and diversity of cells. Traditional methods of scRNA-seq data analysis treat the data as lying in a single subspace and thereby hide structural information in other subspaces. In this paper, we propose a low-rank subspace ensemble clustering framework (LRSEC) to analyze scRNA-seq data. Assuming that the scRNA-seq data exist in multiple subspaces, the low-rank model is used to find the lowest rank representation of the data in the subspace. It is worth noting that the penalty factor of the low-rank kernel function is uncertain, and different penalty factors correspond to different low-rank structures. Moreover, it is difficult for a single clustering model to find the cellular structure of all datasets. To strengthen the correlation between model solutions, we construct a new ensemble clustering framework, LRSEC, by using the low-rank model as the basic learner. The LRSEC framework captures the global structure of data through low-rank subspaces and achieves better clustering performance than a single clustering model. We validate the performance of the LRSEC framework on seven small datasets and one large dataset and obtain satisfactory results.
|
13
|
Ranjan B, Sun W, Park J, Mishra K, Schmidt F, Xie R, Alipour F, Singhal V, Joanito I, Honardoost MA, Yong JMY, Koh ET, Leong KP, Rayan NA, Lim MGL, Prabhakar S. DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data. Nat Commun 2021; 12:5849. [PMID: 34615861 PMCID: PMC8494900 DOI: 10.1038/s41467-021-26085-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 12/02/2020] [Accepted: 09/15/2021] [Indexed: 11/09/2022] Open
Abstract
Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
Affiliation(s)
- Bobby Ranjan
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Wenjie Sun
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Jinyu Park
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Kunal Mishra
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Florian Schmidt
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Ronald Xie
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Fatemeh Alipour
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Vipul Singhal
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Ignasius Joanito
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Mohammad Amin Honardoost
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Department of Medicine, School of Medicine, National University of Singapore, 21 Lower Kent Ridge Road, Singapore, 119077, Singapore
- Jacy Mei Yun Yong
- Department of Rheumatology, Allergy and Immunology, Tan Tock Seng Hospital, Singapore, 308433, Singapore
- Ee Tzun Koh
- Department of Rheumatology, Allergy and Immunology, Tan Tock Seng Hospital, Singapore, 308433, Singapore
- Khai Pang Leong
- Department of Rheumatology, Allergy and Immunology, Tan Tock Seng Hospital, Singapore, 308433, Singapore
- Nirmala Arul Rayan
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Michelle Gek Liang Lim
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
- Shyam Prabhakar
- Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore
|
14
|
Zhang W, Xue X, Zheng X, Fan Z. NMFLRR: Clustering scRNA-seq data by integrating non-negative matrix factorization with low rank representation. IEEE J Biomed Health Inform 2021; 26:1394-1405. [PMID: 34310328 DOI: 10.1109/jbhi.2021.3099127] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 02/01/2023]
Abstract
Fast-developing single-cell technologies create unprecedented opportunities to reveal cell heterogeneity and diversity. Accurate classification of single cells is a critical prerequisite for recovering the mechanisms of heterogeneity. However, currently available scRNA-seq profiles are high-dimensional, sparse, and noisy, which poses challenges for existing clustering methods in grouping cells that belong to the same subpopulation based on transcriptomic profiles. Although many computational methods have been proposed, developing novel and effective computational methods to accurately identify cell types remains a considerable challenge. We present a new computational framework to identify cell types by integrating low-rank representation (LRR) and non-negative matrix factorization (NMF); this framework is named NMFLRR. The LRR captures the global properties of the original data by using nuclear norms, and a locality-constrained graph regularization term is introduced to characterize the data's local geometric information. The similarity matrix and low-dimensional features of the data can be obtained simultaneously by applying the alternating direction method of multipliers (ADMM) algorithm to update each variable alternately in an iterative way. We finally obtain the predicted cell types by using a spectral algorithm based on the optimized similarity matrix. Nine real scRNA-seq datasets were used to test the performance of NMFLRR and fifteen other competitive methods, and the accuracy and robustness of the simulation results suggest that NMFLRR is a promising algorithm for the classification of single cells. The simulation code is freely available at: https://github.com/wzhangwhu/NMFLRR_code.
|
15
|
Rahnavard A, Chatterjee S, Sayoldin B, Crandall KA, Tekola-Ayele F, Mallick H. Omics community detection using multi-resolution clustering. Bioinformatics 2021; 37:3588-3594. [PMID: 33974004 PMCID: PMC8545346 DOI: 10.1093/bioinformatics/btab317] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/02/2021] [Revised: 03/23/2021] [Accepted: 04/26/2021] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION The discovery of biologically interpretable and clinically actionable communities in heterogeneous omics data is a necessary first step towards deriving mechanistic insights into complex biological phenomena. Here we present a novel clustering approach, omeClust, for community detection in omics profiles by simultaneously incorporating similarities among measurements and the overall complex structure of the data. RESULTS We show that omeClust outperforms published methods in inferring the true community structure as measured by both sensitivity and misclassification rate on simulated datasets. We further validated omeClust in diverse, multiple omics datasets, revealing new communities and functionally related groups in microbial strains, cell line gene expression patterns, and fetal genomic variation. We also derived enrichment scores attributable to putatively meaningful biological factors in these datasets that can serve as hypothesis generators facilitating new sets of testable hypotheses. AVAILABILITY omeClust is open-source software, and the implementation is available online at http://github.com/omicsEye/omeClust. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ali Rahnavard
- Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Suvo Chatterjee
- Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Bahar Sayoldin
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
- Keith A Crandall
- Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Fasil Tekola-Ayele
- Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Himel Mallick
- Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA
|
16
|
Yan R, Fan C, Yin Z, Wang T, Chen X. Potential applications of deep learning in single-cell RNA sequencing analysis for cell therapy and regenerative medicine. Stem Cells 2021; 39:511-521. [PMID: 33587792 DOI: 10.1002/stem.3336] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/26/2020] [Accepted: 12/07/2020] [Indexed: 12/26/2022]
Abstract
When used in cell therapy and regenerative medicine strategies, stem cells have the potential to treat many previously incurable diseases. However, current application methods using stem cells are underdeveloped, as these cells are used directly regardless of their culture medium and subgroup. For example, when using mesenchymal stem cells (MSCs) in cell therapy, researchers do not consider their source and culture method nor their application angle and function (soft tissue regeneration, hard tissue regeneration, suppression of immune function, or promotion of immune function). By combining machine learning methods (such as deep learning) with data sets obtained through single-cell RNA sequencing (scRNA-seq) technology, we can discover the hidden structure of these cells, predict their effects more accurately, and effectively use subpopulations with differentiation potential for stem cell therapy. scRNA-seq technology has changed the study of transcription, because it can profile gene expression at single-cell anatomical resolution. However, this powerful technology is sensitive to biological and technical noise. The subsequent data analysis can be computationally difficult because it involves denoising single-cell data, reducing dimensionality, imputing missing values, and accounting for the zero-inflated nature of the data. In this review, we discuss how deep learning methods can be combined with scRNA-seq data for research, how to interpret scRNA-seq data in more depth, improve the follow-up analysis of stem cells, identify potential subgroups, and promote the implementation of cell therapy and regenerative medicine measures.
Affiliation(s)
- Ruojin Yan
- Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
- Chunmei Fan
- Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
- Zi Yin
- Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
- Tingzhang Wang
- Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Hangzhou, People's Republic of China; NMPA Key laboratory for Testing and Risk Warning of Pharmaceutical Microbiology, Hangzhou, People's Republic of China
- Xiao Chen
- Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China; China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
|
17
|
Wang HY, Zhao JP, Zheng CH. SUSCC: Secondary Construction of Feature Space based on UMAP for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data. Interdiscip Sci 2021; 13:83-90. [PMID: 33475958 DOI: 10.1007/s12539-020-00411-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/07/2020] [Revised: 12/08/2020] [Accepted: 12/19/2020] [Indexed: 10/22/2022]
Abstract
Clustering is a common method to identify cell types in single-cell analysis, but the increasing size of scRNA-seq datasets brings challenges to single-cell clustering. There is therefore an urgent need for a faster and more accurate clustering method for large-scale scRNA-seq data. In this paper, we propose a new method for single-cell clustering. First, a count matrix is constructed through normalization and gene filtration. Second, the raw data of the gene expression matrix are projected onto a feature space built by a secondary construction of the feature space based on UMAP (Uniform Manifold Approximation and Projection). Third, the low-dimensional matrix in the feature space is randomly divided, according to a given proportion, into two sub-matrices used for clustering and classification, respectively. Finally, one subset is clustered by the k-means algorithm, and the other subset is then classified by the k-nearest neighbor algorithm based on the clustering results. Experimental results show that our method can cluster scRNA-seq datasets effectively.
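The last two steps of this abstract (cluster one random subset with k-means, then assign the remaining cells with k-nearest neighbors) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the cells have already been embedded in a low-dimensional feature space (the normalization, gene filtration and UMAP steps are not reproduced), and the function names `kmeans`, `knn_predict` and `cluster_then_classify`, as well as the defaults `cluster_fraction=0.5` and `n_neighbors=3`, are hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def knn_predict(train_points, train_labels, query, n_neighbors=3):
    """Majority vote among the n_neighbors closest labelled points."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(query, train_points[i]))[:n_neighbors]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def cluster_then_classify(embedding, k, cluster_fraction=0.5, n_neighbors=3, seed=0):
    """Cluster a random subset with k-means, label the rest with kNN."""
    rng = random.Random(seed)
    order = list(range(len(embedding)))
    rng.shuffle(order)
    n_cluster = max(k, int(len(order) * cluster_fraction))
    cluster_idx, classify_idx = order[:n_cluster], order[n_cluster:]
    cluster_points = [embedding[i] for i in cluster_idx]
    _, cluster_labels = kmeans(cluster_points, k, seed=seed)
    labels = [None] * len(embedding)
    for i, lab in zip(cluster_idx, cluster_labels):
        labels[i] = lab
    for i in classify_idx:
        labels[i] = knn_predict(cluster_points, cluster_labels,
                                embedding[i], n_neighbors)
    return labels
```

Calling `cluster_then_classify(embedding, k=2)` on a list of coordinate tuples returns one cluster label per cell. The appeal of the split is that k-means only ever sees a fraction of the cells, which is presumably where the speed-up on large datasets comes from; the quality of the final labels then depends on that subset covering every population.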
Affiliation(s)
- Hai-Yun Wang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Jian-Ping Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China; Institute of Mathematics and Physics, Xinjiang University, Urumqi, China
- Chun-Hou Zheng
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China; College of Computer Science and Technology, Anhui University, Hefei, China
|
18
|
Zhang W, Li Y, Zou X. SCCLRR: A Robust Computational Method for Accurate Clustering Single Cell RNA-Seq Data. IEEE J Biomed Health Inform 2021; 25:247-256. [PMID: 32356764 DOI: 10.1109/jbhi.2020.2991172] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 11/09/2022]
Abstract
Single-cell RNA transcriptome data present a tremendous opportunity for studying cellular heterogeneity. Identifying subpopulations based on scRNA-seq data has been a hot topic in recent years. Although many researchers have focused on designing elegant computational methods for identifying new cell types, the performance of these methods is still unsatisfactory due to the high dimensionality, sparsity and noise of scRNA-seq data. In this study, we propose a new cell type detection method, named SCCLRR, that learns a robust and accurate similarity matrix. The method simultaneously captures both global and local intrinsic properties of the data based on a low-rank representation (LRR) mathematical model. The integrated normalized Euclidean distance and cosine similarity are used to balance the intrinsic linear and nonlinear manifold of the data in the local regularization term. To solve the non-convex optimization model, we present an iterative optimization procedure using the alternating direction method of multipliers (ADMM) algorithm. We evaluate the performance of the SCCLRR method on nine real scRNA-seq datasets and compare it with seven state-of-the-art methods. The simulation results show that SCCLRR outperforms other methods and is robust and effective for clustering scRNA-seq data. (The code of SCCLRR is freely available for academic use at https://github.com/wzhangwhu/SCCLRR).
|
19
|
Song Q, Su J, Miller LD, Zhang W. scLM: Automatic Detection of Consensus Gene Clusters Across Multiple Single-cell Datasets. Genomics Proteomics & Bioinformatics 2020; 19:330-341. [PMID: 33359676 PMCID: PMC8602751 DOI: 10.1016/j.gpb.2020.09.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 12/19/2019] [Revised: 08/11/2020] [Accepted: 10/27/2020] [Indexed: 12/16/2022]
Abstract
In gene expression profiling studies, including single-cell RNA sequencing (scRNA-seq) analyses, the identification and characterization of co-expressed genes provides critical information on cell identity and function. Gene co-expression clustering in scRNA-seq data presents certain challenges. We show that commonly used methods for single-cell data are not capable of identifying co-expressed genes accurately, and produce results that substantially limit biological expectations of co-expressed genes. Herein, we present single-cell Latent-variable Model (scLM), a gene co-clustering algorithm tailored to single-cell data that performs well at detecting gene clusters with significant biological context. Importantly, scLM can simultaneously cluster multiple single-cell datasets, i.e., consensus clustering, enabling users to leverage single-cell data from multiple sources for novel comparative analysis. scLM takes raw count data as input and preserves biological variation without being influenced by batch effects from multiple datasets. Results from both simulation data and experimental data demonstrate that scLM outperforms the existing methods with considerably improved accuracy. To illustrate the biological insights of scLM, we apply it to our in-house and public experimental scRNA-seq datasets. scLM identifies novel functional gene modules and refines cell states, which facilitates mechanism discovery and understanding of complex biosystems such as cancers. A user-friendly R package with all the key features of the scLM method is available at https://github.com/QSong-github/scLM.
Affiliation(s)
- Qianqian Song
- Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC 27157, USA; Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC 27157, USA
- Jing Su
- Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC 27157, USA; Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Lance D Miller
- Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC 27157, USA; Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC 27157, USA
- Wei Zhang
- Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC 27157, USA; Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC 27157, USA
|
20
|
Singh R. Single-Cell Sequencing in Human Genital Infections. Advances in Experimental Medicine and Biology 2020; 1255:203-220. [PMID: 32949402 DOI: 10.1007/978-981-15-4494-1_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
Abstract
Human genital infections are one of the most concerning issues worldwide and can be categorized into sexually transmitted, urinary tract and vaginal infections. These infections, if left untreated, can disseminate to other parts of the body and cause more complicated illnesses such as pelvic inflammatory disease, urethritis, and anogenital cancers. Effective treatment of these infections is further complicated by the emergence of antimicrobial resistance in the pathogens that cause genital infections. Furthermore, the development and application of single-cell sequencing technologies have opened new possibilities to study drug-resistant clones and cell-to-cell variation, to discover acquired drug resistance mutations and the transcriptional diversity of a pathogen across different infection stages, to identify rare cell types and investigate different cellular states of the pathogens that cause genital infections, and to develop novel therapeutic strategies. In this chapter, I will provide a complete review of the applications of single-cell sequencing in human genital infections before discussing their limitations and challenges.
Affiliation(s)
- Reema Singh
- Department of Biochemistry, Microbiology and Immunology, College of Medicine, University of Saskatchewan, Saskatoon, SK, Canada; Vaccine and Infectious Disease Organization-International Vaccine Centre, Saskatoon, SK, Canada
|
21
|
Qi Y, Guo Y, Jiao H, Shang X. A flexible network-based imputing-and-fusing approach towards the identification of cell types from single-cell RNA-seq data. BMC Bioinformatics 2020; 21:240. [PMID: 32527285 PMCID: PMC7291547 DOI: 10.1186/s12859-020-03547-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/10/2019] [Accepted: 05/13/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) provides an effective tool to investigate transcriptomic characteristics at single-cell resolution. Due to the low amounts of transcripts in single cells and the technical biases in experiments, raw scRNA-seq data are usually very noisy, which complicates downstream analyses. Although many methods have been proposed to impute noisy scRNA-seq data in recent years, few of them take into account the prior associations across genes in imputation or integrate multiple types of imputation data to identify cell types. RESULTS We present a new framework, NetImpute, for the identification of cell types from scRNA-seq data by integrating multiple types of biological networks. We employ a statistical method to detect the noisy data items in scRNA-seq data and develop a new imputation model to estimate their real values by integrating the PPI network and gene pathways. Meanwhile, based on the data imputed by multiple types of biological networks, we propose an integrated approach to identify cell types from scRNA-seq data. Comprehensive experiments demonstrate that the proposed network-based imputation model can estimate the real values of noisy data items accurately, and that integrating the imputation data based on multiple types of biological networks can improve the identification of cell types from scRNA-seq data. CONCLUSIONS Incorporating the prior gene associations in biological networks can potentially help to improve the imputation of noisy scRNA-seq data, and integrating multiple types of network-based imputation data can enhance the identification of cell types. The proposed NetImpute provides an open framework for incorporating multiple types of biological network data to identify cell types from scRNA-seq data.
Affiliation(s)
- Yang Qi
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Yang Guo
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Huixin Jiao
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
|
22
|
Zheng R, Liang Z, Chen X, Tian Y, Cao C, Li M. An Adaptive Sparse Subspace Clustering for Cell Type Identification. Front Genet 2020; 11:407. [PMID: 32425984 PMCID: PMC7212354 DOI: 10.3389/fgene.2020.00407] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/23/2019] [Accepted: 03/31/2020] [Indexed: 01/04/2023] Open
Abstract
The rapid development of single-cell transcriptome sequencing technology has provided us with a cell-level perspective to study biological problems. Identification of cell types is one of the fundamental issues in computational analysis of single-cell data. Due to the large amount of noise from single-cell technologies and the high dimension of expression profiles, traditional clustering methods are poorly suited to this task. To address the problem, we have designed an adaptive sparse subspace clustering method, called AdaptiveSSC, to identify cell types. AdaptiveSSC is based on the assumption that the expression profiles of cells of the same type lie in the same subspace, so that one cell can be expressed as a linear combination of the other cells. Moreover, it uses a data-driven adaptive sparse constraint to construct the similarity matrix. The comparison results on 10 scRNA-seq datasets show that AdaptiveSSC outperforms original subspace clustering and other state-of-the-art methods in most cases. Moreover, the learned similarity matrix can also be integrated with a modified t-SNE to obtain an improved visualization result.
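The self-expression idea in this abstract (each cell written as a sparse linear combination of the other cells, with the coefficients reused as a similarity matrix) can be illustrated with a small sketch. It is not AdaptiveSSC itself: the paper's data-driven adaptive sparse constraint is replaced here by a single fixed L1 penalty `lam`, which is an assumption made purely for illustration, and the function names are hypothetical. Each cell's coefficients are fitted by coordinate descent on a standard lasso objective.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the proximal step of the L1 penalty)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_self_expression(cells, lam=0.1, sweeps=100):
    """Return C with cells[i] ~= sum_j C[i][j] * cells[j] and C[i][i] == 0.

    For each cell i this minimizes 0.5 * ||x_i - sum_j c_j x_j||^2 + lam * ||c||_1
    over the other cells by cyclic coordinate descent.
    """
    n = len(cells)
    dim = len(cells[0])
    sq_norms = [sum(v * v for v in cell) for cell in cells]
    coef = [[0.0] * n for _ in range(n)]
    for i in range(n):
        c = coef[i]
        # residual = x_i - sum_j c[j] * cells[j]; equals x_i while c is all zeros
        residual = list(cells[i])
        for _ in range(sweeps):
            for j in range(n):
                if j == i or sq_norms[j] == 0.0:
                    continue  # a cell never explains itself; skip all-zero cells
                # Correlation of cell j with the residual that excludes cell j.
                rho = (sum(cells[j][d] * residual[d] for d in range(dim))
                       + c[j] * sq_norms[j])
                new_cj = soft_threshold(rho, lam) / sq_norms[j]
                delta = new_cj - c[j]
                if delta != 0.0:
                    for d in range(dim):
                        residual[d] -= delta * cells[j][d]
                    c[j] = new_cj
    return coef


def similarity_from_coefficients(coef):
    """Symmetrize the coefficients: W[i][j] = |C[i][j]| + |C[j][i]|."""
    n = len(coef)
    return [[abs(coef[i][j]) + abs(coef[j][i]) for j in range(n)]
            for i in range(n)]
```

For four toy cells `[(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 3.0)]`, the first two lie on one axis and the last two on the other, so the resulting similarity matrix links cells only within each pair. A matrix of this kind would typically be passed to spectral clustering to obtain the cell-type labels.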
Affiliation(s)
- Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha, China
- Zhenlan Liang
- School of Computer Science and Engineering, Central South University, Changsha, China
- Xiang Chen
- School of Computer Science and Engineering, Central South University, Changsha, China
- Yu Tian
- School of Computer Science and Engineering, Central South University, Changsha, China
- Chen Cao
- Departments of Biochemistry & Molecular Biology and Medical Genetics, Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha, China
|
23
|
Krzak M, Raykov Y, Boukouvalas A, Cutillo L, Angelini C. Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods. Front Genet 2019; 10:1253. [PMID: 31921297 PMCID: PMC6918801 DOI: 10.3389/fgene.2019.01253] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 07/19/2019] [Accepted: 11/13/2019] [Indexed: 01/04/2023] Open
Abstract
Single-cell RNA-seq (scRNAseq) is a powerful tool to study heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and the impact that these choices can have in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and describe the ranges of possibilities that are offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in terms of dimensionality, number of cell populations or levels of noise. Remarkably, the results presented here show that great variability in the performance of the models is strongly attributed to the choice of the user-specific parameter settings. We describe several tendencies in the performance attributed to their modes of usage and different types of datasets, and identify which methods are strongly affected by data dimensionality in terms of computational time. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.
Affiliation(s)
- Monika Krzak
- Institute for Applied Mathematics “Mauro Picone”, Naples, Italy
- Yordan Raykov
- Department of Mathematics, Aston University, Birmingham, United Kingdom
- Luisa Cutillo
- School of Mathematics, University of Leeds, Leeds, United Kingdom
|
24
|
Wang T, Johnson TS, Shao W, Lu Z, Helm BR, Zhang J, Huang K. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol 2019; 20:165. [PMID: 31405383 PMCID: PMC6691531 DOI: 10.1186/s13059-019-1764-6] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Received: 05/09/2019] [Accepted: 07/17/2019] [Indexed: 12/21/2022] Open
Abstract
To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for identifying cell lineages and bona fide transcriptional signals, it is necessary to combine data from multiple experiments. We present BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.
Affiliation(s)
- Tongxin Wang
- Department of Computer Science, Indiana University Bloomington, Bloomington, IN, USA
- Travis S Johnson
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Wei Shao
- Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Zixiao Lu
- Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou, China
- Bryan R Helm
- Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Jie Zhang
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
- Kun Huang
- Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Regenstrief Institute, Indianapolis, IN, USA
|