1
|
Jiang H, Miao X, Thairu MW, Beebe M, Grupe DW, Davidson RJ, Handelsman J, Sankaran K. Multimedia: multimodal mediation analysis of microbiome data. Microbiol Spectr 2024:e0113124. [PMID: 39688588 DOI: 10.1128/spectrum.01131-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2024] [Accepted: 10/30/2024] [Indexed: 12/18/2024] Open
Abstract
Mediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regressions, this analysis quantifies how complementary data relate to one another and respond to treatments. Despite these advances, existing software's rigid assumptions often result in users viewing mediation analysis as a black box. We designed the multimedia R package to make advanced mediation analysis techniques accessible, ensuring that statistical components are interpretable and adaptable. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, bootstrap confidence interval construction, and sensitivity analysis, enabling experimentation with various mediator and outcome models while maintaining a simple overall workflow. The software includes modules for regularized linear, compositional, random forest, hierarchical, and hurdle modeling, making it well-suited to microbiome data. We illustrate the package through two case studies. The first re-analyzes a study of the microbiome and metabolome of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites, not found in the original study. The second analyzes new data about the influence of mindfulness practice on the microbiome. The mediation analysis highlights shifts in taxa previously associated with depression that cannot be explained indirectly by diet or sleep behaviors alone. A gallery of examples and further documentation can be found at https://go.wisc.edu/830110. IMPORTANCE Microbiome studies routinely gather complementary data to capture different aspects of a microbiome's response to a change, such as the introduction of a therapeutic. Mediation analysis clarifies the extent to which responses occur sequentially via mediators, thereby supporting causal, rather than purely descriptive, interpretation. Multimedia is a modular R package with close ties to the wider microbiome software ecosystem that makes statistically rigorous, flexible mediation analysis easily accessible, setting the stage for precise and causally informed microbiome engineering.
Collapse
Affiliation(s)
- Hanying Jiang
- Statistics Department, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Xinran Miao
- Statistics Department, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Margaret W Thairu
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Mara Beebe
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Dan W Grupe
- Center for Healthy Minds, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Richard J Davidson
- Center for Healthy Minds, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Psychology Department, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Psychiatry Department, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Jo Handelsman
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Plant Pathology Department, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Kris Sankaran
- Statistics Department, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, USA
| |
Collapse
|
2
|
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.08.01.605536. [PMID: 39149356 PMCID: PMC11326131 DOI: 10.1101/2024.08.01.605536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 08/17/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable REsidual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Collapse
Affiliation(s)
- Delaram Pouyabahar
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
| | - Tallulah Andrews
- Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada
- Department of Computer Science, University of Western Ontario, London, Ontario, Canada
| | - Gary D Bader
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
- Princess Margaret Research Institute, University Health Network, Toronto, Ontario, Canada
- CIFAR Multiscale Human Program, CIFAR, Toronto, Ontario, Canada
| |
Collapse
|
3
|
Hozumi Y, Wei GW. Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE. PLoS One 2024; 19:e0311791. [PMID: 39671349 PMCID: PMC11642954 DOI: 10.1371/journal.pone.0311791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Accepted: 09/24/2024] [Indexed: 12/15/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing scRNA-seq data. CCP utilizes gene-gene correlations to partition the genes and, based on the partition, employs cell-cell interactions to obtain super-genes. Because CCP is a data-domain approach that does not require matrix diagonalization, it can be used in many downstream machine learning tasks. In this work, we utilize CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). By using 21 publicly available datasets, we have found that CCP significantly improves UMAP and tSNE visualization and dramatically improve their accuracy. More specifically, CCP improves UMAP by 22% in ARI, 14% in NMI and 15% in ECM, and improves tSNE by 11% in ARI, 9% in NMI and 8% in ECM.
Collapse
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
| |
Collapse
|
4
|
Luo E, Hao M, Wei L, Zhang X. scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics 2024; 40:btae518. [PMID: 39171840 PMCID: PMC11368386 DOI: 10.1093/bioinformatics/btae518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2024] [Revised: 08/10/2024] [Accepted: 08/20/2024] [Indexed: 08/23/2024] Open
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not very realistic yet, especially when we need to generate data with controlled conditions. In the meantime, diffusion models have shown their power in generating data with high fidelity, providing a new opportunity for scRNA-seq generation. RESULTS In this study, we developed scDiffusion, a generative model combining the diffusion model and foundation model to generate high-quality scRNA-seq data with controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under multiple condition combinations. We also proposed a new control strategy called Gradient Interpolation. This strategy allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion could generate single-cell gene expression data closely resembling real scRNA-seq data. Also, scDiffusion can conditionally produce data on specific cell types including rare cell types. Furthermore, we could use the multiple-condition generation of scDiffusion to generate cell type that was out of the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting the real scRNA-seq data and can provide insights into cell fate research. AVAILABILITY AND IMPLEMENTATION scDiffusion is openly available at the GitHub repository https://github.com/EperLuo/scDiffusion or Zenodo https://zenodo.org/doi/10.5281/zenodo.13268742.
Collapse
Affiliation(s)
- Erpai Luo
- MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Minsheng Hao
- MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Lei Wei
- MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xuegong Zhang
- MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- School of Life Sciences and School of Medicine, Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
| |
Collapse
|
5
|
Hsu CY, Liu Q, Shyr Y. A distribution-free and analytic method for power and sample size calculation in single-cell differential expression. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae540. [PMID: 39231036 PMCID: PMC11407695 DOI: 10.1093/bioinformatics/btae540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Revised: 08/28/2024] [Accepted: 09/02/2024] [Indexed: 09/06/2024]
Abstract
MOTIVATION Differential expression analysis in single-cell transcriptomics unveils cell type-specific responses to various treatments or biological conditions. To ensure the robustness and reliability of the analysis, it is essential to have a solid experimental design with ample statistical power and sample size. However, existing methods for power and sample size calculation often assume a specific distribution for single-cell transcriptomics data, potentially deviating from the true data distribution. Moreover, they commonly overlook cell-cell correlations within individual samples, posing challenges in accurately representing biological phenomena. Additionally, due to the complexity of deriving an analytic formula, most methods employ time-consuming simulation-based strategies. RESULTS We propose an analytic-based method named scPS for calculating power and sample sizes based on generalized estimating equations. scPS stands out by making no assumptions about the data distribution and considering cell-cell correlations within individual samples. scPS is a rapid and powerful approach for designing experiments in single-cell differential expression analysis. AVAILABILITY AND IMPLEMENTATION scPS is freely available at https://github.com/cyhsuTN/scPS and Zenodo https://zenodo.org/records/13375996.
Collapse
Affiliation(s)
- Chih-Yuan Hsu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Qi Liu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Yu Shyr
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| |
Collapse
|
6
|
Zhang J, Larschan E, Bigness J, Singh R. scNODE : generative model for temporal single cell transcriptomic data prediction. Bioinformatics 2024; 40:ii146-ii154. [PMID: 39230694 PMCID: PMC11373355 DOI: 10.1093/bioinformatics/btae393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024] Open
Abstract
SUMMARY Measurement of single-cell gene expression at different timepoints enables the study of cell development. However, due to the resource constraints and technical challenges associated with the single-cell experiments, researchers can only profile gene expression at discrete and sparsely sampled timepoints. This missing timepoint information impedes downstream cell developmental analyses. We propose scNODE, an end-to-end deep learning model that can predict in silico single-cell gene expression at unobserved timepoints. scNODE integrates a variational autoencoder with neural ordinary differential equations to predict gene expression using a continuous and nonlinear latent space. Importantly, we incorporate a dynamic regularization term to learn a latent space that is robust against distribution shifts when predicting single-cell gene expression at unobserved timepoints. Our evaluations on three real-world scRNA-seq datasets show that scNODE achieves higher predictive performance than state-of-the-art methods. We further demonstrate that scNODE's predictions help cell trajectory inference under the missing timepoint paradigm and the learned latent space is useful for in silico perturbation analysis of relevant genes along a developmental cell path. AVAILABILITY AND IMPLEMENTATION The data and code are publicly available at https://github.com/rsinghlab/scNODE.
Collapse
Affiliation(s)
- Jiaqi Zhang
- Department of Computer Science, Brown University, Providence, RI 02906, United States
| | - Erica Larschan
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI 02912, United States
| | - Jeremy Bigness
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
| | - Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI 02906, United States
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
| |
Collapse
|
7
|
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. RESEARCH SQUARE 2024:rs.3.rs-4819117. [PMID: 39149508 PMCID: PMC11326389 DOI: 10.21203/rs.3.rs-4819117/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/17/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable Residual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Collapse
Affiliation(s)
- Delaram Pouyabahar
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
| | - Tallulah Andrews
- Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada
- Department of Computer Science, University of Western Ontario, London, Ontario, Canada
| | - Gary D Bader
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
- Princess Margaret Research Institute, University Health Network, Toronto, Ontario, Canada
- CIFAR Multiscale Human Program, CIFAR, Toronto, Ontario, Canada
| |
Collapse
|
8
|
Duo H, Li Y, Lan Y, Tao J, Yang Q, Xiao Y, Sun J, Li L, Nie X, Zhang X, Liang G, Liu M, Hao Y, Li B. Systematic evaluation with practical guidelines for single-cell and spatially resolved transcriptomics data simulation under multiple scenarios. Genome Biol 2024; 25:145. [PMID: 38831386 PMCID: PMC11149245 DOI: 10.1186/s13059-024-03290-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2023] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advancements in life sciences. To develop bioinformatics tools for scRNA-seq and SRT data and perform unbiased benchmarks, data simulation has been widely adopted by providing explicit ground truth and generating customized datasets. However, the performance of simulation methods under multiple scenarios has not been comprehensively assessed, making it challenging to choose suitable methods without practical guidelines. RESULTS We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 have the best accuracy performance across various platforms. Unexpectedly, some methods tailored to scRNA-seq data have potential compatibility for simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods under corresponding simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores but they cannot generate realistic simulated data. Users should consider the trade-offs between method accuracy and scalability (or functionality) when making decisions. Additionally, execution errors are mainly caused by failed parameter estimations and appearance of missing or infinite values in calculations. We provide practical guidelines for method selection, a standard pipeline Simpipe ( https://github.com/duohongrui/simpipe ; https://doi.org/10.5281/zenodo.11178409 ), and an online tool Simsite ( https://www.ciblab.net/software/simshiny/ ) for data simulation. CONCLUSIONS No method performs best on all criteria, thus a good-yet-not-the-best method is recommended if it solves problems effectively and reasonably. Our comprehensive work provides crucial insights for developers on modeling gene expression data and fosters the simulation process for users.
Collapse
Affiliation(s)
- Hongrui Duo
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Yinghong Li
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, People's Republic of China
| | - Yang Lan
- Institute of Pathology and Southwest Cancer Center, Southwest Hospital, Army Medical University, Chongqing, 400038, People's Republic of China
| | - Jingxin Tao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Qingxia Yang
- Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310058, People's Republic of China
| | - Yingxue Xiao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Jing Sun
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Lei Li
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Xiner Nie
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
| | - Xiaoxi Zhang
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Guizhao Liang
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
| | - Mingwei Liu
- Key Laboratory of Clinical Laboratory Diagnostics, College of Laboratory Medicine, Chongqing Medical University, Chongqing, 400016, People's Republic of China
| | - Youjin Hao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China.
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China.
| |
Collapse
|
9
|
Cottrell S, Hozumi Y, Wei GW. K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis. Comput Biol Med 2024; 175:108497. [PMID: 38678944 PMCID: PMC11090715 DOI: 10.1016/j.compbiomed.2024.108497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2024] [Revised: 04/08/2024] [Accepted: 04/21/2024] [Indexed: 05/01/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements to UMAP, tSNE, and NMF, respectively on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements to UMAP, tSNE, and NMF, respectively on clustering in the ARI metric.
Collapse
Affiliation(s)
- Sean Cottrell
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA.
| |
Collapse
|
10
|
Hozumi Y, Tanemura KA, Wei GW. Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection. J Chem Inf Model 2024; 64:2829-2838. [PMID: 37402705 PMCID: PMC11009150 DOI: 10.1021/acs.jcim.3c00674] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/06/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing the downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method, for the first time. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring the knowledge of the true labels. The R-S plot provides a unique alternative to the uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
Collapse
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Kiyoto Aramis Tanemura
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
Collapse
|
11
|
Liu R, Qian K, He X, Li H. Integration of scRNA-seq data by disentangled representation learning with condition domain adaptation. BMC Bioinformatics 2024; 25:116. [PMID: 38493095 PMCID: PMC10944609 DOI: 10.1186/s12859-024-05706-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Accepted: 02/15/2024] [Indexed: 03/18/2024] Open
Abstract
BACKGROUND The integration of single-cell RNA sequencing data from multiple experimental batches and diverse biological conditions holds significant importance in the study of cellular heterogeneity. RESULTS To expedite the exploration of systematic disparities under various biological contexts, we propose a scRNA-seq integration method called scDisco, which involves a domain-adaptive decoupling representation learning strategy for the integration of dissimilar single-cell RNA data. It constructs a condition-specific domain-adaptive network founded on variational autoencoders. scDisco not only effectively reduces batch effects but also successfully disentangles biological effects and condition-specific effects, and further augmenting condition-specific representations through the utilization of condition-specific Domain-Specific Batch Normalization layers. This enhancement enables the identification of genes specific to particular conditions. The effectiveness and robustness of scDisco as an integration method were analyzed using both simulated and real datasets, and the results demonstrate that scDisco can yield high-quality visualizations and quantitative outcomes. Furthermore, scDisco has been validated using real datasets, affirming its proficiency in cell clustering quality, retaining batch-specific cell types and identifying condition-specific genes. CONCLUSION scDisco is an effective integration method based on variational autoencoders, which improves analytical tasks of reducing batch effects, cell clustering, retaining batch-specific cell types and identifying condition-specific genes.
Collapse
Affiliation(s)
- Renjing Liu
- School of Mathematics and Physics, China University of Geosciences (Wuhan), Wuhan, 430074, China
| | - Kun Qian
- School of Mathematics and Physics, China University of Geosciences (Wuhan), Wuhan, 430074, China
| | - Xinwei He
- School of Mathematics and Physics, China University of Geosciences (Wuhan), Wuhan, 430074, China
| | - Hongwei Li
- School of Mathematics and Physics, China University of Geosciences (Wuhan), Wuhan, 430074, China.
| |
Collapse
|
12
|
Song D, Wang Q, Yan G, Liu T, Sun T, Li JJ. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol 2024; 42:247-252. [PMID: 37169966 PMCID: PMC11182337 DOI: 10.1038/s41587-023-01772-1] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2022] [Accepted: 03/30/2023] [Indexed: 05/13/2023]
Abstract
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.
Collapse
Affiliation(s)
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA
| | - Qingyang Wang
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Guanao Yan
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Tianyang Liu
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Tianyi Sun
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Jingyi Jessica Li
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA.
- Department of Statistics, University of California, Los Angeles, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, CA, USA.
- Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, USA.
| |
Collapse
|
13
|
Yan G, Song D, Li JJ. scReadSim: a single-cell RNA-seq and ATAC-seq read simulator. Nat Commun 2023; 14:7482. [PMID: 37980428 PMCID: PMC10657386 DOI: 10.1038/s41467-023-43162-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2023] [Accepted: 11/02/2023] [Indexed: 11/20/2023] Open
Abstract
Benchmarking single-cell RNA-seq (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) computational tools demands simulators to generate realistic sequencing reads. However, none of the few read simulators aim to mimic real data. To fill this gap, we introduce scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads (in a FASTQ or BAM file) by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data. Moreover, scReadSim provides ground truths, including unique molecular identifier (UMI) counts for scRNA-seq and open chromatin regions for scATAC-seq. In particular, scReadSim allows users to design cell-type-specific ground-truth open chromatin regions for scATAC-seq data generation. In benchmark applications of scReadSim, we show that UMI-tools achieves the top accuracy in scRNA-seq UMI deduplication, and HMMRATAC and MACS3 achieve the top performance in scATAC-seq peak calling.
Collapse
Affiliation(s)
- Guanao Yan
- Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA
| | - Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, 90095-7246, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA.
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, 90095-7246, USA.
- Department of Human Genetics, University of California, Los Angeles, CA, 90095-7088, USA.
- Department of Computational Medicine, University of California, Los Angeles, CA, 90095-1766, USA.
- Department of Biostatistics, University of California, Los Angeles, CA, 90095-1772, USA.
- Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, 02138, USA.
| |
Collapse
|
14
|
Cottrell S, Hozumi Y, Wei GW. K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis. ARXIV 2023:arXiv:2310.14521v1. [PMID: 37961744 PMCID: PMC10635285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements to UMAP, tSNE, and NMF, respectively on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements to UMAP, tSNE, and NMF, respectively on clustering in the ARI metric.
Collapse
Affiliation(s)
- Sean Cottrell
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| |
Collapse
|
15
|
Liu J, Kreimer A, Li WV. Differential variability analysis of single-cell gene expression data. Brief Bioinform 2023; 24:bbad294. [PMID: 37598422 PMCID: PMC10516347 DOI: 10.1093/bib/bbad294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2023] [Revised: 07/18/2023] [Accepted: 07/29/2023] [Indexed: 08/22/2023] Open
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technologies has enabled gene expression profiling at the single-cell resolution, thereby enabling the quantification and comparison of transcriptional variability among individual cells. Although alterations in transcriptional variability have been observed in various biological states, statistical methods for quantifying and testing differential variability between groups of cells are still lacking. To identify the best practices in differential variability analysis of single-cell gene expression data, we propose and compare 12 statistical pipelines using different combinations of methods for normalization, feature selection, dimensionality reduction and variability calculation. Using high-quality synthetic scRNA-seq datasets, we benchmarked the proposed pipelines and found that the most powerful and accurate pipeline performs simple library size normalization, retains all genes in analysis and uses denSNE-based distances to cluster medoids as the variability measure. By applying this pipeline to scRNA-seq datasets of COVID-19 and autism patients, we have identified cellular variability changes between patients with different severity status or between patients and healthy controls.
Collapse
Affiliation(s)
- Jiayi Liu
- Graduate Programs in Molecular Biosciences, Rutgers, The State University of New Jersey, 604 Allison Rd, Piscataway, 08854, NJ, USA
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, 08854, NJ, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, Piscataway, 08854, NJ, USA
| | - Anat Kreimer
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, 08854, NJ, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, Piscataway, 08854, NJ, USA
| | - Wei Vivian Li
- Department of Statistics, University of California, Riverside, 900 University Ave, Riverside, 92521, CA, USA
- Previous affiliation where part of the work was completed: Department of Biostatistics and Epidemiology, Rutgers, The State University of New Jersey, 683 Hoes Lane West, Piscataway, 08854, NJ, USA
| |
Collapse
|
16
|
Xi NM, Li JJ. Exploring the optimization of autoencoder design for imputing single-cell RNA sequencing data. Comput Struct Biotechnol J 2023; 21:4079-4095. [PMID: 37671239 PMCID: PMC10475479 DOI: 10.1016/j.csbj.2023.07.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2023] [Revised: 07/22/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
Autoencoders are the backbones of many imputation methods that aim to relieve the sparsity issue in single-cell RNA sequencing (scRNA-seq) data. The imputation performance of an autoencoder relies on both the neural network architecture and the hyperparameter choice. So far, literature in the single-cell field lacks a formal discussion on how to design the neural network and choose the hyperparameters. Here, we conducted an empirical study to answer this question. Our study used many real and simulated scRNA-seq datasets to examine the impacts of the neural network architecture, the activation function, and the regularization strategy on imputation accuracy and downstream analyses. Our results show that (i) deeper and narrower autoencoders generally lead to better imputation performance; (ii) the sigmoid and tanh activation functions consistently outperform other commonly used functions including ReLU; (iii) regularization improves the accuracy of imputation and downstream cell clustering and DE gene analyses. Notably, our results differ from common practices in the computer vision field regarding the activation function and the regularization strategy. Overall, our study offers practical guidance on how to optimize the autoencoder design for scRNA-seq data imputation.
Collapse
Affiliation(s)
- Nan Miles Xi
- Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL 60660, USA
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554, USA
- Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA
- Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA
- Department of Biostatistics, University of California, Los Angeles, CA 90095-1772, USA
| |
Collapse
|
17
|
Fan S, Dang D, Ye Y, Zhang SW, Gao L, Zhang S. scHi-CSim: a flexible simulator that generates high-fidelity single-cell Hi-C data for benchmarking. J Mol Cell Biol 2023; 15:mjad003. [PMID: 36708167 PMCID: PMC10308180 DOI: 10.1093/jmcb/mjad003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Revised: 09/18/2022] [Accepted: 01/25/2023] [Indexed: 01/29/2023] Open
Abstract
Single-cell Hi-C technology provides an unprecedented opportunity to reveal chromatin structure in individual cells. However, high sequencing cost impedes the generation of biological Hi-C data with high sequencing depths and multiple replicates for downstream analysis. Here, we developed a single-cell Hi-C simulator (scHi-CSim) that generates high-fidelity data for benchmarking. scHi-CSim merges neighboring cells to overcome the sparseness of data, samples interactions in distance-stratified chromosomes to maintain the heterogeneity of single cells, and estimates the empirical distribution of restriction fragments to generate simulated data. We demonstrated that scHi-CSim can generate high-fidelity data by comparing the performance of single-cell clustering and detection of chromosomal high-order structures with raw data. Furthermore, scHi-CSim is flexible to change sequencing depth and the number of simulated replicates. We showed that increasing sequencing depth could improve the accuracy of detecting topologically associating domains. We also used scHi-CSim to generate a series of simulated datasets with different sequencing depths to benchmark scHi-C clustering methods.
Collapse
Affiliation(s)
- Shichen Fan
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
| | - Dachang Dang
- School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
| | - Yusen Ye
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
| | - Shao-Wu Zhang
- School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
| | - Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
- Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
| |
Collapse
|
18
|
Yan D, Sun Z, Fang J, Cao S, Wang W, Chang X, Badirli S, Fu H, Liu Y. scRAA: the development of a robust and automatic annotation procedure for single-cell RNA sequencing data. J Biopharm Stat 2023:1-14. [PMID: 37162278 DOI: 10.1080/10543406.2023.2208671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
A critical task in single-cell RNA sequencing (scRNA-Seq) data analysis is to identify cell types from heterogeneous tissues. While the majority of classification methods demonstrated high performance in scRNA-Seq annotation problems, a robust and accurate solution is desired to generate reliable outcomes for downstream analyses, for instance, marker genes identification, differentially expressed genes, and pathway analysis. It is hard to establish a universally good metric. Thus, a universally good classification method for all kinds of scenarios does not exist. In addition, reference and query data in cell classification are usually from different experimental batches, and failure to consider batch effects may result in misleading conclusions. To overcome this bottleneck, we propose a robust ensemble approach to classify cells and utilize a batch correction method between reference and query data. We simulated four scenarios that comprise simple to complex batch effect and account for varying cell-type proportions. We further tested our approach on both lung and pancreas data. We found improved prediction accuracy and robust performance across simulation scenarios and real data. The incorporation of batch effect correction between reference and query, and the ensemble approach improve cell-type prediction accuracy while maintaining robustness. We demonstrated these through simulated and real scRNA-Seq data.
Collapse
Affiliation(s)
- Dongyan Yan
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Zhe Sun
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Jiyuan Fang
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Shanshan Cao
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Wenjie Wang
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Xinyue Chang
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Sarkhan Badirli
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Haoda Fu
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Yushi Liu
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| |
Collapse
|
19
|
Crowell HL, Morillo Leonardo SX, Soneson C, Robinson MD. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol 2023; 24:62. [PMID: 36991470 PMCID: PMC10061781 DOI: 10.1186/s13059-023-02904-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2021] [Accepted: 03/20/2023] [Indexed: 03/31/2023] Open
Abstract
BACKGROUND With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant-on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard results credible and transferable to real data. RESULTS Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigate the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. CONCLUSIONS Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Collapse
Affiliation(s)
- Helena L Crowell
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
| | | | - Charlotte Soneson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Current address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Mark D Robinson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
| |
Collapse
|
20
|
Jeon H, Xie J, Jeon Y, Jung KJ, Gupta A, Chang W, Chung D. Statistical Power Analysis for Designing Bulk, Single-Cell, and Spatial Transcriptomics Experiments: Review, Tutorial, and Perspectives. Biomolecules 2023; 13:221. [PMID: 36830591 PMCID: PMC9952882 DOI: 10.3390/biom13020221] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2022] [Revised: 01/20/2023] [Accepted: 01/21/2023] [Indexed: 01/26/2023] Open
Abstract
Gene expression profiling technologies have been used in various applications such as cancer biology. The development of gene expression profiling has expanded the scope of target discovery in transcriptomic studies, and each technology produces data with distinct characteristics. In order to guarantee biologically meaningful findings using transcriptomic experiments, it is important to consider various experimental factors in a systematic way through statistical power analysis. In this paper, we review and discuss the power analysis for three types of gene expression profiling technologies from a practical standpoint, including bulk RNA-seq, single-cell RNA-seq, and high-throughput spatial transcriptomics. Specifically, we describe the existing power analysis tools for each research objective for each of the bulk RNA-seq and scRNA-seq experiments, along with recommendations. On the other hand, since there are no power analysis tools for high-throughput spatial transcriptomics at this point, we instead investigate the factors that can influence power analysis.
Collapse
Affiliation(s)
- Hyeongseon Jeon
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
| | - Juan Xie
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- The Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH 43210, USA
| | - Yeseul Jeon
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Department of Statistics and Data Science, Yonsei University, Seoul 03722, Republic of Korea
- Department of Applied Statistics, Yonsei University, Seoul 03722, Republic of Korea
| | - Kyeong Joo Jung
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
| | - Arkobrato Gupta
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- The Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH 43210, USA
| | - Won Chang
- Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH 45221, USA
| | - Dongjun Chung
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- The Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH 43210, USA
| |
Collapse
|
21
|
Sun L, Wang G, Zhang Z. SimCH: simulation of single-cell RNA sequencing data by modeling cellular heterogeneity at gene expression level. Brief Bioinform 2023; 24:6961608. [PMID: 36575569 DOI: 10.1093/bib/bbac590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2022] [Revised: 11/08/2022] [Accepted: 12/02/2022] [Indexed: 12/29/2022] Open
Abstract
Single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) has been a powerful technology for transcriptome analysis. However, the systematic validation of diverse computational tools used in scRNA-seq analysis remains challenging. Here, we propose a novel simulation tool, termed as Simulation of Cellular Heterogeneity (SimCH), for the flexible and comprehensive assessment of scRNA-seq computational methods. The Gaussian Copula framework is recruited to retain gene coexpression of experimental data shown to be associated with cellular heterogeneity. The synthetic count matrices generated by suitable SimCH modes closely match experimental data originating from either homogeneous or heterogeneous cell populations and either unique molecular identifier (UMI)-based or non-UMI-based techniques. We demonstrate how SimCH can benchmark several types of computational methods, including cell clustering, discovery of differentially expressed genes, trajectory inference, batch correction and imputation. Moreover, we show how SimCH can be used to conduct power evaluation of cell clustering methods. Given these merits, we believe that SimCH can accelerate single-cell research.
Collapse
Affiliation(s)
- Lei Sun
- School of Information Engineering, Yangzhou University, Yangzhou, P.R. China.,School of Artificial Intelligence, Yangzhou University, Yangzhou, P.R. China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, P.R. China
| | - Gongming Wang
- School of Information Engineering, Yangzhou University, Yangzhou, P.R. China.,School of Artificial Intelligence, Yangzhou University, Yangzhou, P.R. China.,China Unicom Software Research Institute Jinan Branch, Jinan, P.R. China
| | - Zhihua Zhang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, P.R. China.,School of Life Science, University of Chinese Academy of Sciences, Beijing, P.R. China
| |
Collapse
|
22
|
Cheng A, Hu G, Li WV. Benchmarking cell-type clustering methods for spatially resolved transcriptomics data. Brief Bioinform 2023; 24:bbac475. [PMID: 36410733 PMCID: PMC9851325 DOI: 10.1093/bib/bbac475] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2022] [Revised: 09/20/2022] [Accepted: 10/04/2022] [Indexed: 11/23/2022] Open
Abstract
Spatially resolved transcriptomics technologies enable the measurement of transcriptome information while retaining the spatial context at the regional, cellular or sub-cellular level. While previous computational methods have relied on gene expression information alone for clustering single-cell populations, more recent methods have begun to leverage spatial location and histology information to improve cell clustering and cell-type identification. In this study, using seven semi-synthetic datasets with real spatial locations, simulated gene expression and histology images as well as ground truth cell-type labels, we evaluate 15 clustering methods based on clustering accuracy, robustness to data variation and input parameters, computational efficiency, and software usability. Our analysis demonstrates that even though incorporating the additional spatial and histology information leads to increased accuracy in some datasets, it does not consistently improve clustering compared with using only gene expression data. Our results indicate that for the clustering of spatial transcriptomics data, there are still opportunities to enhance the overall accuracy and robustness by improving information extraction and feature selection from spatial and histology data.
Collapse
Affiliation(s)
- Andrew Cheng
- Department of Computer Science, Rutgers, The State University of New Jersey, 110 Frelinghuysen Road, Piscataway, 08854, NJ, USA
| | - Guanyu Hu
- Department of Statistics, University of Missouri-Columbia, 146 Middlebush Hall, Columbia, 65211, MO, USA
| | - Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, 683 Hoes Lane West, Piscataway, 08854, NJ, USA
- Department of Statistics, University of California, Riverside, 900 University Ave., Riverside, 92521, CA, USA
| |
Collapse
|
23
|
Qi Y, Han S, Tang L, Liu L. Imputation method for single-cell RNA-seq data using neural topic model. Gigascience 2022; 12:giad098. [PMID: 38000911 PMCID: PMC10673642 DOI: 10.1093/gigascience/giad098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2023] [Revised: 09/02/2023] [Accepted: 10/23/2023] [Indexed: 11/26/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technology studies transcriptome and cell-to-cell differences from higher single-cell resolution and different perspectives. Despite the advantage of high capture efficiency, downstream functional analysis of scRNA-seq data is made difficult by the excess of zero values (i.e., the dropout phenomenon). To effectively address this problem, we introduced scNTImpute, an imputation framework based on a neural topic model. A neural network encoder is used to extract underlying topic features of single-cell transcriptome data to infer high-quality cell similarity. At the same time, we determine which transcriptome data are affected by the dropout phenomenon according to the learning of the mixture model by the neural network. On the basis of stable cell similarity, the same gene information in other similar cells is borrowed to impute only the missing expression values. By evaluating the performance of real data, scNTImpute can accurately and efficiently identify the dropout values and imputes them accurately. In the meantime, the clustering of cell subsets is improved and the original biological information in cell clustering is solved, which is covered by technical noise. The source code for the scNTImpute module is available as open source at https://github.com/qiyueyang-7/scNTImpute.git.
Collapse
Affiliation(s)
- Yueyang Qi
- Yunnan Normal University, School of Information, Kunming 650500, China
| | - Shuangkai Han
- Yunnan Normal University, School of Information, Kunming 650500, China
| | - Lin Tang
- Yunnan Normal University, Faculty of Education, Kunming 650500, China
| | - Lin Liu
- Yunnan Normal University, School of Information, Kunming 650500, China
| |
Collapse
|
24
|
Gagnon J, Pi L, Ryals M, Wan Q, Hu W, Ouyang Z, Zhang B, Li K. Recommendations of scRNA-seq Differential Gene Expression Analysis Based on Comprehensive Benchmarking. LIFE (BASEL, SWITZERLAND) 2022; 12:life12060850. [PMID: 35743881 PMCID: PMC9225332 DOI: 10.3390/life12060850] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/23/2022] [Revised: 05/31/2022] [Accepted: 06/04/2022] [Indexed: 12/13/2022]
Abstract
To guide analysts to select the right tool and parameters in differential gene expression analyses of single-cell RNA sequencing (scRNA-seq) data, we developed a novel simulator that recapitulates the data characteristics of real scRNA-seq datasets while accounting for all the relevant sources of variation in a multi-subject, multi-condition scRNA-seq experiment: the cell-to-cell variation within a subject, the variation across subjects, the variability across cell types, the mean/variance relationship of gene expression across genes, library size effects, group effects, and covariate effects. By applying it to benchmark 12 differential gene expression analysis methods (including cell-level and pseudo-bulk methods) on simulated multi-condition, multi-subject data of the 10x Genomics platform, we demonstrated that methods originating from the negative binomial mixed model such as glmmTMB and NEBULA-HL outperformed other methods. Utilizing NEBULA-HL in a statistical analysis pipeline for single-cell analysis will enable scientists to better understand the cell-type-specific transcriptomic response to disease or treatment effects and to discover new drug targets. Further, application to two real datasets showed the outperformance of our differential expression (DE) pipeline, with unified findings of differentially expressed genes (DEG) and a pseudo-time trajectory transcriptomic result. In the end, we made recommendations for filtering strategies of cells and genes based on simulation results to achieve optimal experimental goals.
Collapse
Affiliation(s)
- Jake Gagnon
- Analytics and Data Sciences, Biogen, Inc., 225 Binney St., Cambridge, MA 02142, USA;
| | - Lira Pi
- PharmaLex, 1700 District Ave., Burlington, MA 01803, USA; (L.P.); (M.R.); (Q.W.)
| | - Matthew Ryals
- PharmaLex, 1700 District Ave., Burlington, MA 01803, USA; (L.P.); (M.R.); (Q.W.)
| | - Qingwen Wan
- PharmaLex, 1700 District Ave., Burlington, MA 01803, USA; (L.P.); (M.R.); (Q.W.)
| | - Wenxing Hu
- Research Department, Biogen, Inc., 225 Binney St., Cambridge, MA 02142, USA;
| | - Zhengyu Ouyang
- BioInfoRx, Inc., 510 Charmany Dr., Suite 275A, Madison, WI 53719, USA;
| | - Baohong Zhang
- Research Department, Biogen, Inc., 225 Binney St., Cambridge, MA 02142, USA;
- Correspondence: (B.Z.); (K.L.)
| | - Kejie Li
- Research Department, Biogen, Inc., 225 Binney St., Cambridge, MA 02142, USA;
- Correspondence: (B.Z.); (K.L.)
| |
Collapse
|
25
|
Xiong KX, Zhou HL, Lin C, Yin JH, Kristiansen K, Yang HM, Li GB. Chord: an ensemble machine learning algorithm to identify doublets in single-cell RNA sequencing data. Commun Biol 2022; 5:510. [PMID: 35637301 PMCID: PMC9151659 DOI: 10.1038/s42003-022-03476-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2021] [Accepted: 05/11/2022] [Indexed: 12/16/2022] Open
Abstract
High-throughput single-cell RNA sequencing (scRNA-seq) is a popular method, but it is accompanied by doublet rate problems that disturb the downstream analysis. Several computational approaches have been developed to detect doublets. However, most of these methods may yield satisfactory performance in some datasets but lack stability in others; thus, it is difficult to regard a single method as the gold standard which can be applied to all types of scenarios. It is a difficult and time-consuming task for researchers to choose the most appropriate software. We here propose Chord which implements a machine learning algorithm that integrates multiple doublet detection methods to address these issues. Chord had higher accuracy and stability than the individual approaches on different datasets containing real and synthetic data. Moreover, Chord was designed with a modular architecture port, which has high flexibility and adaptability to the incorporation of any new tools. Chord is a general solution to the doublet detection problem. For the unmet need to choose the suitable doublet detection method, an ensemble machine learning algorithm called Chord was developed, which integrates multiple methods and achieves higher accuracy and stability on different scRNA-seq datasets.
Collapse
Affiliation(s)
- Ke-Xu Xiong
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | - Han-Lin Zhou
- BGI-Shenzhen, Shenzhen, 518083, China. .,BGI College & Henan Institute of Medical and Pharmaceutical Science, Zhengzhou University, Zhengzhou, China. .,BGI-Henan, BGI-Shenzhen, Xinxiang, 453000, China. .,Guangdong Provincial Key Laboratory of Human Disease Genomics, Shenzhen Key Laboratory of Genomics, BGI-Shenzhen, Shenzhen, 518083, China. .,Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Copenhagen, DK-2100, Denmark.
| | - Cong Lin
- BGI-Shenzhen, Shenzhen, 518083, China.,BGI-Henan, BGI-Shenzhen, Xinxiang, 453000, China.,Guangdong Provincial Key Laboratory of Human Disease Genomics, Shenzhen Key Laboratory of Genomics, BGI-Shenzhen, Shenzhen, 518083, China.,Shenzhen Key Laboratory of Single-Cell Omics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Jian-Hua Yin
- BGI-Shenzhen, Shenzhen, 518083, China.,BGI-Henan, BGI-Shenzhen, Xinxiang, 453000, China.,Guangdong Provincial Key Laboratory of Human Disease Genomics, Shenzhen Key Laboratory of Genomics, BGI-Shenzhen, Shenzhen, 518083, China.,Shenzhen Key Laboratory of Single-Cell Omics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Karsten Kristiansen
- BGI-Shenzhen, Shenzhen, 518083, China.,Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Copenhagen, DK-2100, Denmark
| | - Huan-Ming Yang
- BGI-Shenzhen, Shenzhen, 518083, China.,James D. Watson Institute of Genome Science, 310008, Hangzhou, China
| | - Gui-Bo Li
- BGI-Shenzhen, Shenzhen, 518083, China. .,BGI College & Henan Institute of Medical and Pharmaceutical Science, Zhengzhou University, Zhengzhou, China. .,BGI-Henan, BGI-Shenzhen, Xinxiang, 453000, China. .,Guangdong Provincial Key Laboratory of Human Disease Genomics, Shenzhen Key Laboratory of Genomics, BGI-Shenzhen, Shenzhen, 518083, China. .,Shenzhen Key Laboratory of Single-Cell Omics, BGI-Shenzhen, Shenzhen, 518083, China.
| |
Collapse
|
26
|
Qian K, Fu S, Li H, Li WV. scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data. Genome Biol 2022; 23:82. [PMID: 35313930 PMCID: PMC8935111 DOI: 10.1186/s13059-022-02649-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2021] [Accepted: 03/07/2022] [Indexed: 12/30/2022] Open
Abstract
The increasing number of scRNA-seq data emphasizes the need for integrative analysis to interpret similarities and differences between single-cell samples. Although different batch effect removal methods have been developed, none are suitable for heterogeneous single-cell samples coming from multiple biological conditions. We propose a method, scINSIGHT, to learn coordinated gene expression patterns that are common among, or specific to, different biological conditions, and identify cellular identities and processes across single-cell samples. We compare scINSIGHT with state-of-the-art methods using simulated and real data, which demonstrate its improved performance. Our results show the applicability of scINSIGHT in diverse biomedical and clinical problems.
Collapse
Affiliation(s)
- Kun Qian
- School of Mathematics and Physics, China University of Geosciences, Wuhan, 430074, Hubei, China
| | - Shiwei Fu
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, 08854, NJ, USA
| | - Hongwei Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan, 430074, Hubei, China
| | - Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, 08854, NJ, USA.
| |
Collapse
|
27
|
PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19. Front Med 2022; 16:251-262. [PMID: 35192147 PMCID: PMC8861993 DOI: 10.1007/s11684-021-0915-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2021] [Accepted: 12/20/2021] [Indexed: 12/20/2022]
Abstract
Pathogenic microbes can induce cellular dysfunction, immune response, and cause infectious disease and other diseases including cancers. However, the cellular distributions of pathogens and their impact on host cells remain rarely explored due to the limited methods. Taking advantage of single-cell RNA-sequencing (scRNA-seq) analysis, we can assess the transcriptomic features at the single-cell level. Still, the tools used to interpret pathogens (such as viruses, bacteria, and fungi) at the single-cell level remain to be explored. Here, we introduced PathogenTrack, a python-based computational pipeline that uses unmapped scRNA-seq data to identify intracellular pathogens at the single-cell level. In addition, we established an R package named Yeskit to import, integrate, analyze, and interpret pathogen abundance and transcriptomic features in host cells. Robustness of these tools has been tested on various real and simulated scRNA-seq datasets. PathogenTrack is competitive to the state-of-the-art tools such as Viral-Track, and the first tools for identifying bacteria at the single-cell level. Using the raw data of bronchoalveolar lavage fluid samples (BALF) from COVID-19 patients in the SRA database, we found the SARS-CoV-2 virus exists in multiple cell types including epithelial cells and macrophages. SARS-CoV-2-positive neutrophils showed increased expression of genes related to type I interferon pathway and antigen presenting module. Additionally, we observed the Haemophilus parahaemolyticus in some macrophage and epithelial cells, indicating a co-infection of the bacterium in some severe cases of COVID-19. The PathogenTrack pipeline and the Yeskit package are publicly available at GitHub.
Collapse
|
28
|
Qin F, Luo X, Xiao F, Cai G. SCRIP: an accurate simulator for single-cell RNA sequencing data. Bioinformatics 2022; 38:1304-1311. [PMID: 34874992 DOI: 10.1093/bioinformatics/btab824] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2021] [Revised: 11/22/2021] [Accepted: 12/01/2021] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Recent advancements in single-cell RNA sequencing (scRNA-seq) have enabled time-efficient transcriptome profiling in individual cells. To optimize sequencing protocols and develop reliable analysis methods for various application scenarios, solid simulation methods for scRNA-seq data are required. However, due to the noisy nature of scRNA-seq data, currently available simulation methods cannot sufficiently capture and simulate important properties of real data, especially the biological variation. In this study, we developed scRNA-seq information producer (SCRIP), a novel simulator for scRNA-seq that is accurate and enables simulation of bursting kinetics. RESULTS Compared to existing simulators, SCRIP showed a significantly higher accuracy of stimulating key data features, including mean-variance dependency in all experiments. SCRIP also outperformed other methods in recovering cell-cell distances. The application of SCRIP in evaluating differential expression analysis methods showed that edgeR outperformed other examined methods in differential expression analyses, and ZINB-WaVE improved the AUC at high dropout rates. Collectively, this study provides the research community with a rigorous tool for scRNA-seq data simulation. AVAILABILITY AND IMPLEMENTATION https://CRAN.R-project.org/package=SCRIP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fei Qin
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Xizhi Luo
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Feifei Xiao
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Guoshuai Cai
- Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| |
Collapse
|
29
|
Bacher R, Chu LF, Argus C, Bolin JM, Knight P, Thomson J, Stewart R, Kendziorski C. Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization. Nucleic Acids Res 2022; 50:e12. [PMID: 34850101 PMCID: PMC8789062 DOI: 10.1093/nar/gkab1071] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2020] [Revised: 10/14/2021] [Accepted: 10/20/2021] [Indexed: 11/14/2022] Open
Abstract
Considerable effort has been devoted to refining experimental protocols to reduce levels of technical variability and artifacts in single-cell RNA-sequencing data (scRNA-seq). We here present evidence that equalizing the concentration of cDNA libraries prior to pooling, a step not consistently performed in single-cell experiments, improves gene detection rates, enhances biological signals, and reduces technical artifacts in scRNA-seq data. To evaluate the effect of equalization on various protocols, we developed Scaffold, a simulation framework that models each step of an scRNA-seq experiment. Numerical experiments demonstrate that equalization reduces variation in sequencing depth and gene-specific expression variability. We then performed a set of experiments in vitro with and without the equalization step and found that equalization increases the number of genes that are detected in every cell by 17-31%, improves discovery of biologically relevant genes, and reduces nuisance signals associated with cell cycle. Further support is provided in an analysis of publicly available data.
Collapse
Affiliation(s)
- Rhonda Bacher
- Department of Biostatistics, University of Florida, FL, USA
| | - Li-Fang Chu
- Department of Comparative Biology and Experimental Medicine, University of Calgary, Calgary, AB, Canada
- Morgridge Institute for Research, Madison, WI, USA
| | - Cara Argus
- Morgridge Institute for Research, Madison, WI, USA
| | | | - Parker Knight
- Department of Mathematics, University of Florida, FL, USA
| | | | - Ron Stewart
- Morgridge Institute for Research, Madison, WI, USA
| | | |
Collapse
|
30
|
Sun T, Song D, Li WV, Li JJ. Simulating Single-Cell Gene Expression Count Data with Preserved Gene Correlations by scDesign2. J Comput Biol 2022; 29:23-26. [PMID: 35020490 PMCID: PMC8812500 DOI: 10.1089/cmb.2021.0440] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
scDesign2 is a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. This article shows how to download and install the scDesign2 R package, how to fit probabilistic models (one per cell type) to real data and simulate synthetic data from the fitted models, and how to use scDesign2 to guide experimental design and benchmark computational methods. Finally, a note is given about cell clustering as a preprocessing step before model fitting and data simulation.
Collapse
Affiliation(s)
- Tianyi Sun
- Department of Statistics, University of California, Los Angeles, California, USA
| | - Dongyuan Song
- Interdepartmental Program of Bioinformatics, University of California, Los Angeles, California, USA
| | - Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, New Jersey, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, California, USA.,Address correspondence to: Dr. Jingyi Jessica Li, Department of Statistics, University of California, Los Angeles, 8125 Math Sciences Building, Los Angeles, CA 90095-1554, USA
| |
Collapse
|
31
|
Cao Y, Yang P, Yang JYH. A benchmark study of simulation methods for single-cell RNA sequencing data. Nat Commun 2021; 12:6911. [PMID: 34824223 PMCID: PMC8617278 DOI: 10.1038/s41467-021-27130-w] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2021] [Accepted: 10/26/2021] [Indexed: 11/09/2022] Open
Abstract
Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data especially when ground truth is experimentally unattainable. The reliability of evaluation depends on the ability of simulation methods to capture properties of experimental data. However, while many scRNA-seq data simulation methods have been proposed, a systematic evaluation of these methods is lacking. We develop a comprehensive evaluation framework, SimBench, including a kernel density estimation measure to benchmark 12 simulation methods through 35 scRNA-seq experimental datasets. We evaluate the simulation methods on a panel of data properties, ability to maintain biological signals, scalability and applicability. Our benchmark uncovers performance differences among the methods and highlights the varying difficulties in simulating data characteristics. Furthermore, we identify several limitations including maintaining heterogeneity of distribution. These results, together with the framework and datasets made publicly available as R packages, will guide simulation methods selection and their future development.
Collapse
Affiliation(s)
- Yue Cao
- Charles Perkins Centre, The University of Sydney, Sydney, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, Australia
| | - Pengyi Yang
- Charles Perkins Centre, The University of Sydney, Sydney, Australia.
- School of Mathematics and Statistics, The University of Sydney, Sydney, Australia.
- Computational Systems Biology Group, Children's Medical Research Institute, Westmead, NSW, Australia.
| | - Jean Yee Hwa Yang
- Charles Perkins Centre, The University of Sydney, Sydney, Australia.
- School of Mathematics and Statistics, The University of Sydney, Sydney, Australia.
| |
Collapse
|
32
|
Millard N, Korsunsky I, Weinand K, Fonseka CY, Nathan A, Kang JB, Raychaudhuri S. Maximizing statistical power to detect differentially abundant cell states with scPOST. CELL REPORTS METHODS 2021; 1:100120. [PMID: 35005693 PMCID: PMC8740883 DOI: 10.1016/j.crmeth.2021.100120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/12/2021] [Revised: 07/27/2021] [Accepted: 10/27/2021] [Indexed: 11/30/2022]
Abstract
To estimate a study design's power to detect differential abundance, we require a framework that simulates many multi-sample single-cell datasets. However, current simulation methods are challenging for large-scale power analyses because they are computationally resource intensive and do not support easy simulation of multi-sample datasets. Current methods also lack modeling of important inter-sample variation, such as the variation in the frequency of cell states between samples that is observed in single-cell data. Thus, we developed single-cell POwer Simulation Tool (scPOST) to address these limitations and help investigators quickly simulate multi-sample single-cell datasets. Users may explore a range of effect sizes and study design choices (such as increasing the number of samples or cells per sample) to determine their effect on power, and thus choose the optimal study design for their planned experiments.
Collapse
Affiliation(s)
- Nghia Millard
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Ilya Korsunsky
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Kathryn Weinand
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Chamith Y. Fonseka
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Aparna Nathan
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Joyce B. Kang
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Soumya Raychaudhuri
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA 02115, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Science Centre, The University of Manchester, Manchester 46962, UK
| |
Collapse
|
33
|
Schmid KT, Höllbacher B, Cruceanu C, Böttcher A, Lickert H, Binder EB, Theis FJ, Heinig M. scPower accelerates and optimizes the design of multi-sample single cell transcriptomic studies. Nat Commun 2021; 12:6625. [PMID: 34785648 PMCID: PMC8595682 DOI: 10.1038/s41467-021-26779-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Accepted: 10/22/2021] [Indexed: 12/13/2022] Open
Abstract
Single cell RNA-seq has revolutionized transcriptomics by providing cell type resolution for differential gene expression and expression quantitative trait loci (eQTL) analyses. However, efficient power analysis methods for single cell data and inter-individual comparisons are lacking. Here, we present scPower; a statistical framework for the design and power analysis of multi-sample single cell transcriptomic experiments. We modelled the relationship between sample size, the number of cells per individual, sequencing depth, and the power of detecting differentially expressed genes within cell types. We systematically evaluated these optimal parameter combinations for several single cell profiling platforms, and generated broad recommendations. In general, shallow sequencing of high numbers of cells leads to higher overall power than deep sequencing of fewer cells. The model, including priors, is implemented as an R package and is accessible as a web tool. scPower is a highly customizable tool that experimentalists can use to quickly compare a multitude of experimental designs and optimize for a limited budget.
Collapse
Affiliation(s)
- Katharina T Schmid
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Department of Informatics, Technical University Munich, Munich, Germany
| | - Barbara Höllbacher
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Department of Informatics, Technical University Munich, Munich, Germany
| | - Cristiana Cruceanu
- Department of Translational Research, Max Planck Institute for Psychiatry, Munich, Germany
| | - Anika Böttcher
- Institute of Diabetes and Regeneration Research, Helmholtz Diabetes Center, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Neuherberg, Germany
- School of Medicine, Technical University of Munich, Munich, Germany
| | - Heiko Lickert
- Institute of Diabetes and Regeneration Research, Helmholtz Diabetes Center, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Neuherberg, Germany
- School of Medicine, Technical University of Munich, Munich, Germany
| | - Elisabeth B Binder
- Department of Translational Research, Max Planck Institute for Psychiatry, Munich, Germany
- Department of Psychiatry and Behavioral Sciences, Emory University School of Medicine, Georgia, USA
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Department of Mathematics, Technical University Munich, Munich, Germany
| | - Matthias Heinig
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.
- Department of Informatics, Technical University Munich, Munich, Germany.
| |
Collapse
|
34
|
Sheng J, Li WV. Selecting gene features for unsupervised analysis of single-cell gene expression data. Brief Bioinform 2021; 22:bbab295. [PMID: 34351383 PMCID: PMC8574996 DOI: 10.1093/bib/bbab295] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2021] [Revised: 06/17/2021] [Accepted: 07/12/2021] [Indexed: 11/15/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies facilitate the characterization of transcriptomic landscapes in diverse species, tissues, and cell types with unprecedented molecular resolution. In order to evaluate various biological hypotheses using high-dimensional single-cell gene expression data, most computational and statistical methods depend on a gene feature selection step to identify genes with high biological variability and reduce computational complexity. Even though many gene selection methods have been developed for scRNA-seq analysis, there lacks a systematic comparison of the assumptions, statistical models, and selection criteria used by these methods. In this article, we summarize and discuss 17 computational methods for selecting gene features in unsupervised analysis of single-cell gene expression data, with unified notations and statistical frameworks. Our discussion provides a useful summary to help practitioners select appropriate methods based on their assumptions and applicability, and to assist method developers in designing new computational tools for unsupervised learning of scRNA-seq data.
Collapse
Affiliation(s)
- Jie Sheng
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA
| | - Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, NJ 08854, USA
| |
Collapse
|
35
|
Xi NM, Li JJ. Protocol for executing and benchmarking eight computational doublet-detection methods in single-cell RNA sequencing data analysis. STAR Protoc 2021; 2:100699. [PMID: 34382023 PMCID: PMC8339294 DOI: 10.1016/j.xpro.2021.100699] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational techniques have been developed for detecting doublets from scRNA-seq data. We developed an R package DoubletCollection to integrate the installation and execution of eight doublet detection methods. DoubletCollection provides a unified interface to perform and visualize downstream analysis after doublet detection. Here, we present a protocol of using DoubletCollection to benchmark doublet-detection methods. This protocol can accommodate new doublet-detection methods in the fast-growing scRNA-seq field. For details on the use and execution of this protocol, please refer to Xi and Li (2020). Integrate the installation and execution of eight doublet-detection methods An interface to perform and visualize downstream analysis after doublet detection A collection of real and synthetic scRNA-seq datasets with doublet annotations A user-friendly R package to perform doublet detection on scRNA-seq count matrices
Collapse
Affiliation(s)
- Nan Miles Xi
- Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL 60660, USA.,Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095-1554, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095-1554, USA.,Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095-7088, USA.,Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095-1766, USA.,Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095-1772, USA
| |
Collapse
|
36
|
scLink: Inferring Sparse Gene Co-expression Networks from Single-cell Expression Data. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 19:475-492. [PMID: 34252628 PMCID: PMC8896229 DOI: 10.1016/j.gpb.2020.11.006] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Revised: 10/23/2020] [Accepted: 12/26/2020] [Indexed: 11/23/2022]
Abstract
A system-level understanding of the regulation and coordination mechanisms of gene expression is essential for studying the complexity of biological processes in health and disease. With the rapid development of single-cell RNA sequencing technologies, it is now possible to investigate gene interactions in a cell type-specific manner. Here we propose the scLink method, which uses statistical network modeling to understand the co-expression relationships among genes and construct sparse gene co-expression networks from single-cell gene expression data. We use both simulation and real data studies to demonstrate the advantages of scLink and its ability to improve single-cell gene network analysis. The scLink R package is available at https://github.com/Vivianstats/scLink.
Collapse
|
37
|
Jiang R, Li WV, Li JJ. mbImpute: an accurate and robust imputation method for microbiome data. Genome Biol 2021; 22:192. [PMID: 34183041 PMCID: PMC8240317 DOI: 10.1186/s13059-021-02400-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2021] [Accepted: 06/04/2021] [Indexed: 12/22/2022] Open
Abstract
A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data-mbImpute-to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances.
Collapse
Affiliation(s)
- Ruochen Jiang
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
| | - Wei Vivian Li
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, 08854, NJ, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA.
| |
Collapse
|
38
|
Sun T, Song D, Li WV, Li JJ. scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biol 2021; 22:163. [PMID: 34034771 PMCID: PMC8147071 DOI: 10.1186/s13059-021-02367-2] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2020] [Accepted: 04/27/2021] [Indexed: 12/13/2022] Open
Abstract
A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.
Collapse
Affiliation(s)
- Tianyi Sun
- grid.19006.3e0000 0000 9632 6718Department of Statistics, University of California, Los Angeles, 90095-1554 CA USA
| | - Dongyuan Song
- grid.19006.3e0000 0000 9632 6718Interdepartmental Program of Bioinformatics, University of California, Los Angeles, 90095-7246 CA USA
| | - Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, 08854, NJ, USA.
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA. .,Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA. .,Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA. .,Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA.
| |
Collapse
|
39
|
Zimmerman KD, Langefeld CD. Hierarchicell: an R-package for estimating power for tests of differential expression with single-cell data. BMC Genomics 2021; 22:319. [PMID: 33932993 PMCID: PMC8088563 DOI: 10.1186/s12864-021-07635-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2021] [Accepted: 04/21/2021] [Indexed: 11/20/2022] Open
Abstract
Background Study design is a critical aspect of any experiment, and sample size calculations for statistical power that are consistent with that study design are central to robust and reproducible results. However, the existing power calculators for tests of differential expression in single-cell RNA-seq data focus on the total number of cells and not the number of independent experimental units, the true unit of interest for power. Thus, current methods grossly overestimate the power. Results Hierarchicell is the first single-cell power calculator to explicitly simulate and account for the hierarchical correlation structure (i.e., within sample correlation) that exists in single-cell RNA-seq data. Hierarchicell, an R-package available on GitHub, estimates the within sample correlation structure from real data to simulate hierarchical single-cell RNA-seq data and estimate power for tests of differential expression. This multi-stage approach models gene dropout rates, intra-individual dispersion, inter-individual variation, variable or fixed number of cells per individual, and the correlation among cells within an individual. Without modeling the within sample correlation structure and without properly accounting for the correlation in downstream analysis, we demonstrate that estimates of power are falsely inflated. Hierarchicell can be used to estimate power for binary and continuous phenotypes based on user-specified number of independent experimental units (e.g., individuals) and cells within the experimental unit. Conclusions Hierarchicell is a user-friendly R-package that provides accurate estimates of power for testing hypotheses of differential expression in single-cell RNA-seq data. This R-package represents an important addition to single-cell RNA analytic tools and will help researchers design experiments with appropriate and accurate power, increasing discovery and improving robustness and reproducibility. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07635-w.
Collapse
Affiliation(s)
- Kip D Zimmerman
- Center for Precision Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA.
| | - Carl D Langefeld
- Center for Precision Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA. .,Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, NC, USA. .,Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston-Salem, NC, USA.
| |
Collapse
|
40
|
Su K, Wu Z, Wu H. Simulation, power evaluation and sample size recommendation for single-cell RNA-seq. Bioinformatics 2021; 36:4860-4868. [PMID: 32614380 DOI: 10.1093/bioinformatics/btaa607] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2019] [Revised: 06/16/2020] [Accepted: 06/23/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage for high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq in the context of differential expression (DE), this topic in the field of single-cell RNA sequencing is understudied. Moreover, the unique data characteristics present in scRNA-seq such as sparsity and heterogeneity increase the challenge. RESULTS We propose POWSC, a simulation-based method, to provide power evaluation and sample size recommendation for single-cell RNA-sequencing DE analysis. POWSC consists of a data simulator that creates realistic expression data, and a power assessor that provides a comprehensive evaluation and visualization of the power and sample size relationship. The data simulator in POWSC outperforms two other state-of-art simulators in capturing key characteristics of real datasets. The power assessor in POWSC provides a variety of power evaluations including stratified and marginal power analyses for DEs characterized by two forms (phase transition or magnitude tuning), under different comparison scenarios. In addition, POWSC offers information for optimizing the tradeoffs between sample size and sequencing depth with the same total reads. AVAILABILITY AND IMPLEMENTATION POWSC is an open-source R package available online at https://github.com/suke18/POWSC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Kenong Su
- Department of Computer Science, Emory University, Atlanta, GA 30329, USA
| | - Zhijin Wu
- Department of Biostatistics, Brown University, Providence, RI 02912, USA
| | - Hao Wu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
| |
Collapse
|
41
|
Xi NM, Li JJ. Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data. Cell Syst 2021; 12:176-194.e6. [PMID: 33338399 PMCID: PMC7897250 DOI: 10.1016/j.cels.2020.11.008] [Citation(s) in RCA: 92] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2020] [Revised: 10/06/2020] [Accepted: 11/19/2020] [Indexed: 12/29/2022]
Abstract
In single-cell RNA sequencing (scRNA-seq), doublets form when two cells are encapsulated into one reaction volume. The existence of doublets, which appear to be-but are not-real cells, is a key confounder in scRNA-seq data analysis. Computational methods have been developed to detect doublets in scRNA-seq data; however, the scRNA-seq field lacks a comprehensive benchmarking of these methods, making it difficult for researchers to choose an appropriate method for specific analyses. We conducted a systematic benchmark study of nine cutting-edge computational doublet-detection methods. Our study included 16 real datasets, which contained experimentally annotated doublets, and 112 realistic synthetic datasets. We compared doublet-detection methods regarding detection accuracy under various experimental settings, impacts on downstream analyses, and computational efficiencies. Our results show that existing methods exhibited diverse performance and distinct advantages in different aspects. Overall, the DoubletFinder method has the best detection accuracy, and the cxds method has the highest computational efficiency. A record of this paper's transparent peer review process is included in the Supplemental Information.
Collapse
Affiliation(s)
- Nan Miles Xi
- Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA; Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA; Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA.
| |
Collapse
|