1
|
Luo Q, Chen Y, Lan X. COMSE: analysis of single-cell RNA-seq data using community detection-based feature selection. BMC Biol 2024; 22:167. [PMID: 39113021 PMCID: PMC11304914 DOI: 10.1186/s12915-024-01963-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2023] [Accepted: 07/10/2024] [Indexed: 08/10/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing enables studying cells individually, yet high gene dimensions and low cell numbers challenge analysis. And only a subset of the genes detected are involved in the biological processes underlying cell-type specific functions. RESULT In this study, we present COMSE, an unsupervised feature selection framework using community detection to capture informative genes from scRNA-seq data. COMSE identified homogenous cell substates with high resolution, as demonstrated by distinguishing different cell cycle stages. Evaluations based on real and simulated scRNA-seq datasets showed COMSE outperformed methods even with high dropout rates in cell clustering assignment. We also demonstrate that by identifying communities of genes associated with batch effects, COMSE parses signals reflecting biological difference from noise arising due to differences in sequencing protocols, thereby enabling integrated analysis of scRNA-seq datasets of different sources. CONCLUSIONS COMSE provides an efficient unsupervised framework that selects highly informative genes in scRNA-seq data improving cell sub-states identification and cell clustering. It identifies gene subsets that reveal biological and technical heterogeneity, supporting applications like batch effect correction and pathway analysis. It also provides robust results for bulk RNA-seq data analysis.
Collapse
Affiliation(s)
- Qinhuan Luo
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing, 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing, 100084, China
| | - Yaozhu Chen
- School of Artificial Intelligence, Beijing Normal University, Beijing, 100875, China
| | - Xun Lan
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing, 100084, China.
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing, 100084, China.
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing, 100084, China.
| |
Collapse
|
2
|
Singh A, Khiabanian H. Feature selection followed by a novel residuals-based normalization that includes variance stabilization simplifies and improves single-cell gene expression analysis. BMC Bioinformatics 2024; 25:248. [PMID: 39080559 PMCID: PMC11290295 DOI: 10.1186/s12859-024-05872-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2024] [Accepted: 07/16/2024] [Indexed: 08/02/2024] Open
Abstract
Normalization is a crucial step in the analysis of single-cell RNA-sequencing (scRNA-seq) counts data. Its principal objectives are reduction of systematic biases primarily introduced through technical sources and transformation of counts to make them more amenable for the application of established statistical frameworks. In the standard workflows, normalization is followed by feature selection to identify highly variable genes (HVGs) that capture most of the biologically meaningful variation across the cells. Here, we make the case for a revised workflow by proposing a simple feature selection method and showing that we can perform feature selection before normalization by relying on observed counts. We highlight that the feature selection step can be used to not only select HVGs but to also identify stable genes. We further propose a novel variance stabilization transformation inclusive residuals-based normalization method that in fact relies on the stable genes to inform the reduction of systematic biases. We demonstrate significant improvements in downstream clustering analyses through the application of our proposed methods on biological truth-known as well as simulated counts datasets. We have implemented this novel workflow for analyzing high-throughput scRNA-seq data in an R package called Piccolo.
Collapse
Affiliation(s)
- Amartya Singh
- Center for Systems and Computational Biology, Rutgers Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ, USA.
| | - Hossein Khiabanian
- Center for Systems and Computational Biology, Rutgers Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ, USA
- Department of Pathology and Laboratory Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ, USA
- Regeneron Genetics Center, Regeneron Pharmaceuticals, Tarrytown, NY, USA
| |
Collapse
|
3
|
Fang C, Selega A, Campbell KR. Beyond benchmarking and towards predictive models of dataset-specific single-cell RNA-seq pipeline performance. Genome Biol 2024; 25:159. [PMID: 38886757 PMCID: PMC11184819 DOI: 10.1186/s13059-024-03304-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Accepted: 06/06/2024] [Indexed: 06/20/2024] Open
Abstract
BACKGROUND The advent of single-cell RNA-sequencing (scRNA-seq) has driven significant computational methods development for all steps in the scRNA-seq data analysis pipeline, including filtering, normalization, and clustering. The large number of methods and their resulting parameter combinations has created a combinatorial set of possible pipelines to analyze scRNA-seq data, which leads to the obvious question: which is best? Several benchmarking studies compare methods but frequently find variable performance depending on dataset and pipeline characteristics. Alternatively, the large number of scRNA-seq datasets along with advances in supervised machine learning raise a tantalizing possibility: could the optimal pipeline be predicted for a given dataset? RESULTS Here, we begin to answer this question by applying 288 scRNA-seq analysis pipelines to 86 datasets and quantifying pipeline success via a range of measures evaluating cluster purity and biological plausibility. We build supervised machine learning models to predict pipeline success given a range of dataset and pipeline characteristics. We find that prediction performance is significantly better than random and that in many cases pipelines predicted to perform well provide clustering outputs similar to expert-annotated cell type labels. We identify characteristics of datasets that correlate with strong prediction performance that could guide when such prediction models may be useful. CONCLUSIONS Supervised machine learning models have utility for recommending analysis pipelines and therefore the potential to alleviate the burden of choosing from the near-infinite number of possibilities. Different aspects of datasets influence the predictive performance of such models which will further guide users.
Collapse
Affiliation(s)
- Cindy Fang
- Lunenfeld-Tanenbaum Research Institute, Toronto, Canada
- Program in Bioinformatics and Computational Biology, University of Toronto, Toronto, Canada
- Present address: Department of Biostatistics, Johns Hopkins University, Baltimore, USA
| | - Alina Selega
- Lunenfeld-Tanenbaum Research Institute, Toronto, Canada
- Vector Institute, Toronto, Canada
| | - Kieran R Campbell
- Lunenfeld-Tanenbaum Research Institute, Toronto, Canada.
- Vector Institute, Toronto, Canada.
- Departments of Molecular Genetics, Statistical Sciences, Computer Science, University of Toronto, Toronto, Canada.
- Ontario Institute for Cancer Research, Toronto, Canada.
| |
Collapse
|
4
|
Caron DP, Specht WL, Chen D, Wells SB, Szabo PA, Jensen IJ, Farber DL, Sims PA. Multimodal hierarchical classification of CITE-seq data delineates immune cell states across lineages and tissues. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.07.06.547944. [PMID: 37461466 PMCID: PMC10350048 DOI: 10.1101/2023.07.06.547944] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 07/27/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is invaluable for profiling cellular heterogeneity and dissecting transcriptional states, but transcriptomic profiles do not always delineate subsets defined by surface proteins, as in cells of the immune system. Cellular Indexing of Transcriptomes and Epitopes (CITE-seq) enables simultaneous profiling of single-cell transcriptomes and surface proteomes; however, accurate cell type annotation requires a classifier that integrates multimodal data. Here, we describe MultiModal Classifier Hierarchy (MMoCHi), a marker-based approach for classification, reconciling gene and protein expression without reliance on reference atlases. We benchmark MMoCHi using sorted T lymphocyte subsets and annotate a cross-tissue human immune cell dataset. MMoCHi outperforms leading transcriptome-based classifiers and multimodal unsupervised clustering in its ability to identify immune cell subsets that are not readily resolved and to reveal novel subset markers. MMoCHi is designed for adaptability and can integrate annotation of cell types and developmental states across diverse lineages, samples, or modalities.
Collapse
Affiliation(s)
- Daniel P. Caron
- Department of Microbiology and Immunology, Columbia University Irving Medical Center, New York, NY, USA
| | - William L. Specht
- Department of Microbiology and Immunology, Columbia University Irving Medical Center, New York, NY, USA
| | - David Chen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
| | - Steven B. Wells
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
| | - Peter A. Szabo
- Department of Microbiology and Immunology, Columbia University Irving Medical Center, New York, NY, USA
| | - Isaac J. Jensen
- Department of Microbiology and Immunology, Columbia University Irving Medical Center, New York, NY, USA
| | - Donna L. Farber
- Department of Microbiology and Immunology, Columbia University Irving Medical Center, New York, NY, USA
- Department of Surgery, Columbia University Irving Medical Center, New York, NY, USA
| | - Peter A. Sims
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY, USA
| |
Collapse
|
5
|
Dong X, Leary JR, Yang C, Brusko MA, Brusko TM, Bacher R. Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference. Brief Bioinform 2024; 25:bbae216. [PMID: 38725155 PMCID: PMC11082074 DOI: 10.1093/bib/bbae216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Revised: 03/01/2024] [Accepted: 04/25/2024] [Indexed: 05/13/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) experiments have become instrumental in developmental and differentiation studies, enabling the profiling of cells at a single or multiple time-points to uncover subtle variations in expression profiles reflecting underlying biological processes. Benchmarking studies have compared many of the computational methods used to reconstruct cellular dynamics; however, researchers still encounter challenges in their analysis due to uncertainty with respect to selecting the most appropriate methods and parameters. Even among universal data processing steps used by trajectory inference methods such as feature selection and dimension reduction, trajectory methods' performances are highly dataset-specific. To address these challenges, we developed Escort, a novel framework for evaluating a dataset's suitability for trajectory inference and quantifying trajectory properties influenced by analysis decisions. Escort evaluates the suitability of trajectory analysis and the combined effects of processing choices using trajectory-specific metrics. Escort navigates single-cell trajectory analysis through these data-driven assessments, reducing uncertainty and much of the decision burden inherent to trajectory inference analyses. Escort is implemented in an accessible R package and R/Shiny application, providing researchers with the necessary tools to make informed decisions during trajectory analysis and enabling new insights into dynamic biological processes at single-cell resolution.
Collapse
Affiliation(s)
- Xiaoru Dong
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, United States
| | - Jack R Leary
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, United States
| | - Chuanhao Yang
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, United States
| | - Maigan A Brusko
- Diabetes Institute, University of Florida, Gainesville, FL 32610, United States
- Department of Pathology, Immunology, and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, FL 32610, United States
| | - Todd M Brusko
- Diabetes Institute, University of Florida, Gainesville, FL 32610, United States
- Department of Pathology, Immunology, and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, FL 32610, United States
- Department of Pediatrics, College of Medicine, University of Florida, Gainesville, FL 32610, United States
| | - Rhonda Bacher
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, United States
- Diabetes Institute, University of Florida, Gainesville, FL 32610, United States
| |
Collapse
|
6
|
Zhang K, Zemke NR, Armand EJ, Ren B. A fast, scalable and versatile tool for analysis of single-cell omics data. Nat Methods 2024; 21:217-227. [PMID: 38191932 PMCID: PMC10864184 DOI: 10.1038/s41592-023-02139-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Accepted: 11/23/2023] [Indexed: 01/10/2024]
Abstract
Single-cell omics technologies have revolutionized the study of gene regulation in complex tissues. A major computational challenge in analyzing these datasets is to project the large-scale and high-dimensional data into low-dimensional space while retaining the relative relationships between cells. This low dimension embedding is necessary to decompose cellular heterogeneity and reconstruct cell-type-specific gene regulatory programs. Traditional dimensionality reduction techniques, however, face challenges in computational efficiency and in comprehensively addressing cellular diversity across varied molecular modalities. Here we introduce a nonlinear dimensionality reduction algorithm, embodied in the Python package SnapATAC2, which not only achieves a more precise capture of single-cell omics data heterogeneities but also ensures efficient runtime and memory usage, scaling linearly with the number of cells. Our algorithm demonstrates exceptional performance, scalability and versatility across diverse single-cell omics datasets, including single-cell assay for transposase-accessible chromatin using sequencing, single-cell RNA sequencing, single-cell Hi-C and single-cell multi-omics datasets, underscoring its utility in advancing single-cell analysis.
Collapse
Affiliation(s)
- Kai Zhang
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Westlake Laboratory of Life Sciences and Biomedicine, School of Life Sciences, Westlake University, Hangzhou, China
| | - Nathan R Zemke
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Center for Epigenomics, University of California, San Diego School of Medicine, La Jolla, CA, USA
| | - Ethan J Armand
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA, USA
| | - Bing Ren
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA.
- Center for Epigenomics, University of California, San Diego School of Medicine, La Jolla, CA, USA.
- Ludwig Institute for Cancer Research, La Jolla, CA, USA.
- Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA.
| |
Collapse
|
7
|
Swaminath S, Russell AB. The use of single-cell RNA-seq to study heterogeneity at varying levels of virus-host interactions. PLoS Pathog 2024; 20:e1011898. [PMID: 38236826 PMCID: PMC10796064 DOI: 10.1371/journal.ppat.1011898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2024] Open
Abstract
The outcome of viral infection depends on the diversity of the infecting viral population and the heterogeneity of the cell population that is infected. Until almost a decade ago, the study of these dynamic processes during viral infection was challenging and limited to certain targeted measurements. Presently, with the use of single-cell sequencing technology, the complex interface defined by the interactions of cells with infecting virus can now be studied across the breadth of the transcriptome in thousands of individual cells simultaneously. In this review, we will describe the use of single-cell RNA sequencing (scRNA-seq) to study the heterogeneity of viral infections, ranging from individual virions to the immune response between infected individuals. In addition, we highlight certain key experimental limitations and methodological decisions that are critical to analyzing scRNA-seq data at each scale.
Collapse
Affiliation(s)
- Sharmada Swaminath
- School of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
| | - Alistair B. Russell
- School of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
| |
Collapse
|
8
|
Dong X, Leary JR, Yang C, Brusko MA, Brusko TM, Bacher R. Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.18.572214. [PMID: 38187768 PMCID: PMC10769271 DOI: 10.1101/2023.12.18.572214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) experiments have become instrumental in developmental and differentiation studies, enabling the profiling of cells at a single or multiple time-points to uncover subtle variations in expression profiles reflecting underlying biological processes. Benchmarking studies have compared many of the computational methods used to reconstruct cellular dynamics, however researchers still encounter challenges in their analysis due to uncertainties in selecting the most appropriate methods and parameters. Even among universal data processing steps used by trajectory inference methods such as feature selection and dimension reduction, trajectory methods' performances are highly dataset-specific. To address these challenges, we developed Escort, a framework for evaluating a dataset's suitability for trajectory inference and quantifying trajectory properties influenced by analysis decisions. Escort navigates single-cell trajectory analysis through data-driven assessments, reducing uncertainty and much of the decision burden associated with trajectory inference. Escort is implemented in an accessible R package and R/Shiny application, providing researchers with the necessary tools to make informed decisions during trajectory analysis and enabling new insights into dynamic biological processes at single-cell resolution.
Collapse
Affiliation(s)
- Xiaoru Dong
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, USA
| | - Jack R. Leary
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, USA
| | - Chuanhao Yang
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, USA
| | - Maigan A. Brusko
- Diabetes Institute, University of Florida, Gainesville, FL 32610, USA
- Department of Pathology, Immunology, and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Todd M. Brusko
- Diabetes Institute, University of Florida, Gainesville, FL 32610, USA
- Department of Pathology, Immunology, and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Department of Pediatrics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Rhonda Bacher
- Department of Biostatistics, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, USA
- Diabetes Institute, University of Florida, Gainesville, FL 32610, USA
| |
Collapse
|
9
|
Karakurt HU, Pir P. SUMA: a lightweight machine learning model-powered shared nearest neighbour-based clustering application interface for scRNA-Seq data. Turk J Biol 2023; 47:413-422. [PMID: 38681777 PMCID: PMC11045205 DOI: 10.55730/1300-0152.2675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 12/28/2023] [Accepted: 12/18/2023] [Indexed: 05/01/2024] Open
Abstract
Background/aim Single-cell transcriptomics (scRNA-Seq) explores cellular diversity at the gene expression level. Due to the inherent sparsity and noise in scRNA-Seq data and the uncertainty on the types of sequenced cells, effective clustering and cell type annotation are essential. The graph-based clustering of scRNA-Seq data is a simple yet powerful approach that presents data as a "shared nearest neighbour" graph and clusters the cells using graph clustering algorithms. These algorithms are dependent on several user-defined parameters.Here we present SUMA, a lightweight tool that uses a random forest model to predict the optimum number of neighbours to obtain the optimum clustering results. Moreover, we integrated our method with other commonly used methods in an RShiny application. SUMA can be used in a local environment (https://github.com/hkarakurt8742/SUMA) or as a browser tool (https://hkarakurt.shinyapps.io/suma/). Materials and methods Publicly available scRNA-Seq datasets and 3 different graph-based clustering algorithms were used to develop SUMA, and a large range for number of neighbours and variant genes was taken into consideration. The quality of clustering was assessed using the adjusted Rand index (ARI) and true labels of each dataset. The data were split into training and test datasets, and the model was built and optimised using Scikit-learn (Python) and randomForest (R) libraries. Results The accuracy of our machine learning model was 0.96, while the AUC of the ROC curve was 0.98. The model indicated that the number of cells in scRNA-Seq data is the most important feature when deciding the number of neighbours. Conclusion We developed and evaluated the SUMA model and implemented the method in the SUMAShiny app, which integrates SUMA with different clustering methods and enables nonbioinformatician users to cluster and visualise their scRNA data easily. The SUMAShiny app is available both for desktop and browser use.
Collapse
Affiliation(s)
- Hamza Umut Karakurt
- Department of Bioengineering, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkiye
- Idea Technology Solutions R&D Center, İstanbul, Turkiye
| | - Pınar Pir
- Department of Bioengineering, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkiye
- Idea Technology Solutions R&D Center, İstanbul, Turkiye
| |
Collapse
|
10
|
Chen YT, Gao LL. Testing for a difference in means of a single feature after clustering. ARXIV 2023:arXiv:2311.16375v1. [PMID: 38076519 PMCID: PMC10705581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/23/2023]
Abstract
For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a single feature between a pair of clusters obtained using hierarchical or k-means clustering. The test based on the proposed p-value controls the selective Type I error rate in finite samples and can be efficiently computed. We further illustrate the validity and power of our proposal in simulation and demonstrate its use on single-cell RNA-sequencing data.
Collapse
Affiliation(s)
- Yiqun T Chen
- Department of Biomedical Data Science, Stanford University
| | - Lucy L Gao
- Department of Statistics, University of British Columbia, November 29, 2023
| |
Collapse
|
11
|
Atitey K, Motsinger-Reif AA, Anchang B. Model-based evaluation of spatiotemporal data reduction methods with unknown ground truth through optimal visualization and interpretability metrics. Brief Bioinform 2023; 25:bbad455. [PMID: 38113074 PMCID: PMC10729792 DOI: 10.1093/bib/bbad455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Revised: 11/06/2023] [Accepted: 11/20/2023] [Indexed: 12/21/2023] Open
Abstract
Optimizing and benchmarking data reduction methods for dynamic or spatial visualization and interpretation (DSVI) face challenges due to many factors, including data complexity, lack of ground truth, time-dependent metrics, dimensionality bias and different visual mappings of the same data. Current studies often focus on independent static visualization or interpretability metrics that require ground truth. To overcome this limitation, we propose the MIBCOVIS framework, a comprehensive and interpretable benchmarking and computational approach. MIBCOVIS enhances the visualization and interpretability of high-dimensional data without relying on ground truth by integrating five robust metrics, including a novel time-ordered Markov-based structural metric, into a semi-supervised hierarchical Bayesian model. The framework assesses method accuracy and considers interaction effects among metric features. We apply MIBCOVIS using linear and nonlinear dimensionality reduction methods to evaluate optimal DSVI for four distinct dynamic and spatial biological processes captured by three single-cell data modalities: CyTOF, scRNA-seq and CODEX. These data vary in complexity based on feature dimensionality, unknown cell types and dynamic or spatial differences. Unlike traditional single-summary score approaches, MIBCOVIS compares accuracy distributions across methods. Our findings underscore the joint evaluation of visualization and interpretability, rather than relying on separate metrics. We reveal that prioritizing average performance can obscure method feature performance. Additionally, we explore the impact of data complexity on visualization and interpretability. Specifically, we provide optimal parameters and features and recommend methods, like the optimized variational contractive autoencoder, for targeted DSVI for various data complexities. MIBCOVIS shows promise for evaluating dynamic single-cell atlases and spatiotemporal data reduction models.
Collapse
Affiliation(s)
- Komlan Atitey
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
| | - Alison A Motsinger-Reif
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
| | - Benedict Anchang
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
| |
Collapse
|
12
|
Domingo J, Kutsyr-Kolesnyk O, Leon T, Perez-Moraga R, Ayala G, Roson B. A cell abundance analysis based on efficient PAM clustering for a better understanding of the dynamics of endometrial remodelling. BMC Bioinformatics 2023; 24:440. [PMID: 37990148 PMCID: PMC10664584 DOI: 10.1186/s12859-023-05569-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2023] [Accepted: 11/15/2023] [Indexed: 11/23/2023] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) is a powerful tool for investigating cell abundance changes during tissue regeneration and remodeling processes. Differential cell abundance supports the initial clustering of all cells; then, the number of cells per cluster and sample are evaluated, and the dependence of these counts concerning the phenotypic covariates of the samples is studied. Analysis heavily depends on the clustering method. Partitioning Around Medoids (PAM or k-medoids) represents a well-established clustering procedure that leverages the downstream interpretation of clusters by pinpointing real individuals in the dataset as cluster centers (medoids) without reducing dimensions. Of note, PAM suffers from high computational costs and memory requirements. RESULTS This paper proposes a method for differential abundance analysis using PAM as a clustering method and negative binomial regression as a statistical model to relate covariates to cluster/cell counts. We used this approach to study the differential cell abundance of human endometrial cell types throughout the natural secretory phase of the menstrual cycle. We developed a new R package -scellpam-, that incorporates an efficient parallel C++ implementation of PAM, and applied this package in this study. We compared the PAM-BS clustering method with other methods and evaluated both the computational aspects of its implementation and the quality of the classifications obtained using distinct published datasets with known subpopulations that demonstrate promising results. CONCLUSIONS The implementation of PAM-BS, included in the scellpam package, exhibits robust performance in terms of speed and memory usage compared to other related methods. PAM allowed quick and robust clustering of sets of cells with a size ranging from 70,000 to 300,000 cells. https://cran.r-project.org/web/packages/scellpam/index.html . Finally, our approach provides important new insights into the transient subpopulations associated with the fertile time frame when applied to the study of changes in the human endometrium during the secretory phase of the menstrual cycle.
Collapse
Affiliation(s)
- Juan Domingo
- Department of Informatics, ETSE, University of Valencia, Avda. de la Universidad, s/n, 46100, Burjasot, Valencia, Spain.
| | - Oleksandra Kutsyr-Kolesnyk
- Department of Statistics and Operations Research, University of Valencia, Avda. Vicente Andres Estelles, 46100, Burjasot, Valencia, Spain
| | - Teresa Leon
- Department of Statistics and Operations Research, University of Valencia, Avda. Vicente Andres Estelles, 46100, Burjasot, Valencia, Spain
| | - Raul Perez-Moraga
- Carlos Simon Foundation, INCLIVA Health Research Institute, Eduardo Primo Yufera, 46012, Valencia, Valencia, Spain
- Igenomix R&D, Technology Park, 46980, Paterna, Valencia, Spain
| | - Guillermo Ayala
- Department of Statistics and Operations Research, University of Valencia, Avda. Vicente Andres Estelles, 46100, Burjasot, Valencia, Spain
| | - Beatriz Roson
- Carlos Simon Foundation, INCLIVA Health Research Institute, Eduardo Primo Yufera, 46012, Valencia, Valencia, Spain.
| |
Collapse
|
13
|
Li H, Zhang Z, Squires M, Chen X, Zhang X. scMultiSim: simulation of single cell multi-omics and spatial data guided by gene regulatory networks and cell-cell interactions. RESEARCH SQUARE 2023:rs.3.rs-3301625. [PMID: 37790516 PMCID: PMC10543280 DOI: 10.21203/rs.3.rs-3301625/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/05/2023]
Abstract
Simulated single-cell data is essential for designing and evaluating computational methods in the absence of experimental ground truth. Existing simulators typically focus on modeling one or two specific biological factors or mechanisms that affect the output data, which limits their capacity to simulate the complexity and multi-modality in real data. Here, we present scMultiSim, an in silico simulator that generates multi-modal single-cell data, including gene expression, chromatin accessibility, RNA velocity, and spatial cell locations while accounting for the relationships between modalities. scMultiSim jointly models various biological factors that affect the output data, including cell identity, within-cell gene regulatory networks (GRNs), cell-cell interactions (CCIs), and chromatin accessibility, hile also incorporating technical noises. Moreover, it allows users to adjust each factor's effect easily. We validated scMultiSim's simulated biological effects and demonstrated its applications by benchmarking a wide range of computational tasks, including multi-modal and multi-batch data integration, RNA velocity estimation, GRN inference and CCI inference using spatially resolved gene expression data, many of them were not benchmarked before due to the lack of proper tools. Compared to existing simulators, scMultiSim can benchmark a much broader range of existing computational problems and even new potential tasks.
Collapse
Affiliation(s)
- Hechen Li
- Georgia Institute of Technology, Atlanta, USA
| | - Ziqi Zhang
- Georgia Institute of Technology, Atlanta, USA
| | | | - Xi Chen
- Southern University of Science and Technology, Shenzhen, China
| | | |
Collapse
|
14
|
Zhang K, Zemke NR, Armand EJ, Ren B. SnapATAC2: a fast, scalable and versatile tool for analysis of single-cell omics data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.11.557221. [PMID: 37745443 PMCID: PMC10515871 DOI: 10.1101/2023.09.11.557221] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/26/2023]
Abstract
Single-cell omics technologies have ushered in a new era for the study of dynamic gene regulation in complex tissues during development and disease pathogenesis. A major computational challenge in analyzing these datasets is to project the large-scale and high dimensional data into low-dimensional space while retaining the relative relationships between cells in order to decompose the cellular heterogeneity and reconstruct cell-type-specific gene regulatory programs. Conventional dimensionality reduction methods suffer from computational inefficiency, difficulty to capture the full spectrum of cellular heterogeneity, or inability to apply across diverse molecular modalities. Here, we report a fast and nonlinear dimensionality reduction algorithm that not only more accurately captures the heterogeneities of single-cell omics data, but also features runtime and memory usage that is computational efficient and linearly proportional to cell numbers. We implement this algorithm in a Python package named SnapATAC2, and demonstrate its superior performance, remarkable scalability and general adaptability using an array of single-cell omics data types, including single-cell ATAC-seq, single-cell RNA-seq, single-cell Hi-C, and single-cell multiomics datasets.
Collapse
|
15
|
Domingo J, Leon T, Dura E. Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets. BMC Bioinformatics 2023; 24:342. [PMID: 37710192 PMCID: PMC10503022 DOI: 10.1186/s12859-023-05471-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2023] [Accepted: 09/08/2023] [Indexed: 09/16/2023] Open
Abstract
BACKGROUND Partitioning around medoids (PAM) is one of the most widely used and successful clustering method in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set means they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic on the number of individuals) and also by the high computational cost of computing such distance matrix and, less importantly, by the cost of the clustering algorithm itself. RESULTS Therefore, new software has been provided that addresses these issues. This software, provided under GPL license and usable as either an R package or a C++ library, calculates in parallel the distance matrix for different distances/dissimilarities ([Formula: see text], [Formula: see text], Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1) using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data of single cell experiments. It has been tested in problems including clustering of single cell experiments with up to 289,000 cells with the expression of about 29,000 genes per cell. CONCLUSIONS Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves the packages that implement the PAM itself. The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib The package is useful for single cell RNA-seq studies but it is also applicable in other contexts where clustering of large data sets is required.
Collapse
Affiliation(s)
- Juan Domingo
- Department of Informatics, ETSE, University of Valencia, Avda. de la Universidad, s/n, 46100, Burjassot, Valencia, Spain.
| | - Teresa Leon
- Department of Statistics and Operations Research, University of Valencia, Avda. Vicente Andres Estelles, 46100, Burjassot, Valencia, Spain
| | - Esther Dura
- Department of Informatics, ETSE, University of Valencia, Avda. de la Universidad, s/n, 46100, Burjassot, Valencia, Spain
| |
Collapse
|
16
|
Odaka M, Magnin M, Inoue K. Gene network inference from single-cell omics data and domain knowledge for constructing COVID-19-specific ICAM1-associated pathways. Front Genet 2023; 14:1250545. [PMID: 37719701 PMCID: PMC10501835 DOI: 10.3389/fgene.2023.1250545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2023] [Accepted: 08/16/2023] [Indexed: 09/19/2023] Open
Abstract
Introduction: Intercellular adhesion molecule 1 (ICAM-1) is a critical molecule responsible for interactions between cells. Previous studies have suggested that ICAM-1 triggers cell-to-cell transmission of HIV-1 or HTLV-1, that SARS-CoV-2 shares several features with these viruses via interactions between cells, and that SARS-CoV-2 cell-to-cell transmission is associated with COVID-19 severity. From these previous arguments, it is assumed that ICAM-1 can be related to SARS-CoV-2 cell-to-cell transmission in COVID-19 patients. Indeed, the time-dependent change of the ICAM-1 expression level has been detected in COVID-19 patients. However, signaling pathways that consist of ICAM-1 and other molecules interacting with ICAM-1 are not identified in COVID-19. For example, the current COVID-19 Disease Map has no entry for those pathways. Therefore, discovering unknown ICAM1-associated pathways will be indispensable for clarifying the mechanism of COVID-19. Materials and methods: This study builds ICAM1-associated pathways by gene network inference from single-cell omics data and multiple knowledge bases. First, single-cell omics data analysis extracts coexpressed genes with significant differences in expression levels with spurious correlations removed. Second, knowledge bases validate the models. Finally, mapping the models onto existing pathways identifies new ICAM1-associated pathways. Results: Comparison of the obtained pathways between different cell types and time points reproduces the known pathways and indicates the following two unknown pathways: (1) upstream pathway that includes proteins in the non-canonical NF-κB pathway and (2) downstream pathway that contains integrins and cytoskeleton or motor proteins for cell transformation. Discussion: In this way, data-driven and knowledge-based approaches are integrated into gene network inference for ICAM1-associated pathway construction. The results can contribute to repairing and completing the COVID-19 Disease Map, thereby improving our understanding of the mechanism of COVID-19.
Collapse
Affiliation(s)
- Mitsuhiro Odaka
- The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan
- Principles of Informatics Research Division, National Institute of Informatics, Tokyo, Japan
- Laboratoire des Sciences du Numérique de Nantes, École Centrale de Nantes, Nantes Université, UMR 6004, Nantes, France
- Japan Society for the Promotion of Science, Tokyo, Japan
| | - Morgan Magnin
- Principles of Informatics Research Division, National Institute of Informatics, Tokyo, Japan
- Laboratoire des Sciences du Numérique de Nantes, École Centrale de Nantes, Nantes Université, UMR 6004, Nantes, France
| | - Katsumi Inoue
- The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan
- Principles of Informatics Research Division, National Institute of Informatics, Tokyo, Japan
- Laboratoire des Sciences du Numérique de Nantes, École Centrale de Nantes, Nantes Université, UMR 6004, Nantes, France
| |
Collapse
|
17
|
Song D, Li K, Ge X, Li JJ. ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping. RESEARCH SQUARE 2023:rs.3.rs-3211191. [PMID: 37577698 PMCID: PMC10418557 DOI: 10.21203/rs.3.rs-3211191/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/15/2023]
Abstract
In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used twice to define cell clusters as potential cell types and DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
Collapse
Affiliation(s)
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
| | - Kexin Li
- Department of Statistics, University of California, Los Angeles, CA 90095-1554
| | - Xinzhou Ge
- Department of Statistics, University of California, Los Angeles, CA 90095-1554
| | - Jingyi Jessica Li
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
- Department of Statistics, University of California, Los Angeles, CA 90095-1554
- Department of Human Genetics, University of California, Los Angeles, CA 90095-7088
- Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766
- Department of Biostatistics, University of California, Los Angeles, CA 90095-1772
- Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA 02138
| |
Collapse
|
18
|
Dudchenko O, Ordovas-Montanes J, Bingle CD. Respiratory epithelial cell types, states and fates in the era of single-cell RNA-sequencing. Biochem J 2023; 480:921-939. [PMID: 37410389 PMCID: PMC10422933 DOI: 10.1042/bcj20220572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2022] [Revised: 06/19/2023] [Accepted: 06/20/2023] [Indexed: 07/07/2023]
Abstract
Standalone and consortia-led single-cell atlases of healthy and diseased human airways generated with single-cell RNA-sequencing (scRNA-seq) have ushered in a new era in respiratory research. Numerous discoveries, including the pulmonary ionocyte, potentially novel cell fates, and a diversity of cell states among common and rare epithelial cell types have highlighted the extent of cellular heterogeneity and plasticity in the respiratory tract. scRNA-seq has also played a pivotal role in our understanding of host-virus interactions in coronavirus disease 2019 (COVID-19). However, as our ability to generate large quantities of scRNA-seq data increases, along with a growing number of scRNA-seq protocols and data analysis methods, new challenges related to the contextualisation and downstream applications of insights are arising. Here, we review the fundamental concept of cellular identity from the perspective of single-cell transcriptomics in the respiratory context, drawing attention to the need to generate reference annotations and to standardise the terminology used in literature. Findings about airway epithelial cell types, states and fates obtained from scRNA-seq experiments are compared and contrasted with information accumulated through the use of conventional methods. This review attempts to discuss major opportunities and to outline some of the key limitations of the modern-day scRNA-seq that need to be addressed to enable efficient and meaningful integration of scRNA-seq data from different platforms and studies, with each other as well as with data from other high-throughput sequencing-based genomic, transcriptomic and epigenetic analyses.
Collapse
Affiliation(s)
- Oleksandr Dudchenko
- Department of Infection, Immunity and Cardiovascular Disease, The Medical School, University of Sheffield, Sheffield, South Yorkshire, U.K
| | - Jose Ordovas-Montanes
- Division of Gastroenterology, Hepatology and Nutrition, Boston Children's Hospital, Boston, MA, U.S.A
- Programme in Immunology, Harvard Medical School, Boston, MA, U.S.A
| | - Colin D. Bingle
- Department of Infection, Immunity and Cardiovascular Disease, The Medical School, University of Sheffield, Sheffield, South Yorkshire, U.K
| |
Collapse
|
19
|
Arts JA, Laberthonnière C, Lima Cunha D, Zhou H. Single-Cell RNA Sequencing: Opportunities and Challenges for Studies on Corneal Biology in Health and Disease. Cells 2023; 12:1808. [PMID: 37443842 PMCID: PMC10340756 DOI: 10.3390/cells12131808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2023] [Revised: 06/27/2023] [Accepted: 07/04/2023] [Indexed: 07/15/2023] Open
Abstract
The structure and major cell types of the multi-layer human cornea have been extensively studied. However, various cell states in specific cell types and key genes that define the cell states are not fully understood, hindering our comprehension of corneal homeostasis, related diseases, and therapeutic discovery. Single-cell RNA sequencing is a revolutionary and powerful tool for identifying cell states within tissues such as the cornea. This review provides an overview of current single-cell RNA sequencing studies on the human cornea, highlighting similarities and differences between them, and summarizing the key genes that define corneal cell states reported in these studies. In addition, this review discusses the opportunities and challenges of using single-cell RNA sequencing to study corneal biology in health and disease.
Collapse
Affiliation(s)
- Julian A. Arts
- Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Radboud University, 6525 GA Nijmegen, The Netherlands; (J.A.A.)
| | - Camille Laberthonnière
- Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Radboud University, 6525 GA Nijmegen, The Netherlands; (J.A.A.)
| | - Dulce Lima Cunha
- Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Radboud University, 6525 GA Nijmegen, The Netherlands; (J.A.A.)
| | - Huiqing Zhou
- Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Radboud University, 6525 GA Nijmegen, The Netherlands; (J.A.A.)
- Department of Human Genetics, Radboud University Medical Center, 6500 HB Nijmegen, The Netherlands
| |
Collapse
|
20
|
Eid SA, Noureldein M, Kim B, Hinder LM, Mendelson FE, Hayes JM, Hur J, Feldman EL. Single-cell RNA-seq uncovers novel metabolic functions of Schwann cells beyond myelination. J Neurochem 2023; 166:367-388. [PMID: 37328915 PMCID: PMC11141588 DOI: 10.1111/jnc.15877] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2023] [Revised: 05/04/2023] [Accepted: 05/30/2023] [Indexed: 06/18/2023]
Abstract
Schwann cells (SCs) support peripheral nerves under homeostatic conditions, independent of myelination, and contribute to damage in prediabetic peripheral neuropathy (PN). Here, we used single-cell RNA sequencing to characterize the transcriptional profiles and intercellular communication of SCs in the nerve microenvironment using the high-fat diet-fed mouse, which mimics human prediabetes and neuropathy. We identified four major SC clusters, myelinating, nonmyelinating, immature, and repair in healthy and neuropathic nerves, in addition to a distinct cluster of nerve macrophages. Myelinating SCs acquired a unique transcriptional profile, beyond myelination, in response to metabolic stress. Mapping SC intercellular communication identified a shift in communication, centered on immune response and trophic support pathways, which primarily impacted nonmyelinating SCs. Validation analyses revealed that neuropathic SCs become pro-inflammatory and insulin resistant under prediabetic conditions. Overall, our study offers a unique resource for interrogating SC function, communication, and signaling in nerve pathophysiology to help inform SC-specific therapies.
Collapse
Affiliation(s)
- Stéphanie A. Eid
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Mohamed Noureldein
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Bhumsoo Kim
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Lucy M. Hinder
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Faye E. Mendelson
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - John M. Hayes
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Junguk Hur
- Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, USA
| | - Eva L. Feldman
- Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
| |
Collapse
|
21
|
Van de Sande B, Lee JS, Mutasa-Gottgens E, Naughton B, Bacon W, Manning J, Wang Y, Pollard J, Mendez M, Hill J, Kumar N, Cao X, Chen X, Khaladkar M, Wen J, Leach A, Ferran E. Applications of single-cell RNA sequencing in drug discovery and development. Nat Rev Drug Discov 2023; 22:496-520. [PMID: 37117846 PMCID: PMC10141847 DOI: 10.1038/s41573-023-00688-4] [Citation(s) in RCA: 52] [Impact Index Per Article: 52.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/10/2023] [Indexed: 04/30/2023]
Abstract
Single-cell technologies, particularly single-cell RNA sequencing (scRNA-seq) methods, together with associated computational tools and the growing availability of public data resources, are transforming drug discovery and development. New opportunities are emerging in target identification owing to improved disease understanding through cell subtyping, and highly multiplexed functional genomics screens incorporating scRNA-seq are enhancing target credentialling and prioritization. ScRNA-seq is also aiding the selection of relevant preclinical disease models and providing new insights into drug mechanisms of action. In clinical development, scRNA-seq can inform decision-making via improved biomarker identification for patient stratification and more precise monitoring of drug response and disease progression. Here, we illustrate how scRNA-seq methods are being applied in key steps in drug discovery and development, and discuss ongoing challenges for their implementation in the pharmaceutical industry.
Collapse
Affiliation(s)
| | | | | | - Bart Naughton
- Computational Neurobiology, Eisai, Cambridge, MA, USA
| | - Wendi Bacon
- EMBL-EBI, Wellcome Genome Campus, Hinxton, UK
- The Open University, Milton Keynes, UK
| | | | - Yong Wang
- Precision Bioinformatics, Prometheus Biosciences, San Diego, CA, USA
| | | | - Melissa Mendez
- Genomic Sciences, GlaxoSmithKline, Collegeville, PA, USA
| | - Jon Hill
- Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., Ridgefield, CT, USA
| | - Namit Kumar
- Informatics & Predictive Sciences, Bristol Myers Squibb, San Diego, CA, USA
| | - Xiaohong Cao
- Genomic Research Center, AbbVie Inc., Cambridge, MA, USA
| | - Xiao Chen
- Magnet Biomedicine, Cambridge, MA, USA
| | - Mugdha Khaladkar
- Human Genetics and Computational Biology, GlaxoSmithKline, Collegeville, PA, USA
| | - Ji Wen
- Oncology Research and Development Unit, Pfizer, La Jolla, CA, USA
| | | | | |
Collapse
|
22
|
Zhu J, Yang Y. scMEB: a fast and clustering-independent method for detecting differentially expressed genes in single-cell RNA-seq data. BMC Genomics 2023; 24:280. [PMID: 37231345 DOI: 10.1186/s12864-023-09374-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2022] [Accepted: 05/11/2023] [Indexed: 05/27/2023] Open
Abstract
BACKGROUND Cell clustering is a prerequisite for identifying differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data. Obtaining a perfect clustering result is of central importance for subsequent analyses, but not easy. Additionally, the increase in cell throughput due to the advancement of scRNA-seq protocols exacerbates many computational issues, especially regarding method runtime. To address these difficulties, a new, accurate, and fast method for detecting DEGs in scRNA-seq data is needed. RESULTS Here, we propose single-cell minimum enclosing ball (scMEB), a novel and fast method for detecting single-cell DEGs without prior cell clustering results. The proposed method utilizes a small part of known non-DEGs (stably expressed genes) to build a minimum enclosing ball and defines the DEGs based on the distance of a mapped gene to the center of the hypersphere in a feature space. CONCLUSIONS We compared scMEB to two different approaches that could be used to identify DEGs without cell clustering. The investigation of 11 real datasets revealed that scMEB outperformed rival methods in terms of cell clustering, predicting genes with biological functions, and identifying marker genes. Moreover, scMEB was much faster than the other methods, making it particularly effective for finding DEGs in high-throughput scRNA-seq data. We have developed a package scMEB for the proposed method, which could be available at https://github.com/FocusPaka/scMEB .
Collapse
Affiliation(s)
- Jiadi Zhu
- School of Mathematics and Statistics, Xidian University, Xi'an, China
| | - Youlong Yang
- School of Mathematics and Statistics, Xidian University, Xi'an, China.
| |
Collapse
|
23
|
Chen YT, Witten DM. Selective inference for k-means clustering. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2023; 24:152. [PMID: 38264325 PMCID: PMC10805457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/25/2024]
Abstract
We consider the problem of testing for a difference in means between clusters of observations identified via k -means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly-tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k -means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k -means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k -means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data.
Collapse
Affiliation(s)
- Yiqun T Chen
- Data Science Institute and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Daniela M Witten
- Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195-4322, USA
| |
Collapse
|
24
|
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (NEW YORK, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Collapse
Affiliation(s)
- Shixiong Zhang
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin 130012, China
| | - Jiecong Lin
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Qiuzhen Lin
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| |
Collapse
|
25
|
Qiu Y, Yan C, Zhao P, Zou Q. SSNMDI: a novel joint learning model of semi-supervised non-negative matrix factorization and data imputation for clustering of single-cell RNA-seq data. Brief Bioinform 2023; 24:7147025. [PMID: 37122068 DOI: 10.1093/bib/bbad149] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2022] [Revised: 02/18/2023] [Accepted: 03/28/2023] [Indexed: 05/02/2023] Open
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) technology attracts extensive attention in the biomedical field. It can be used to measure gene expression and analyze the transcriptome at the single-cell level, enabling the identification of cell types based on unsupervised clustering. Data imputation and dimension reduction are conducted before clustering because scRNA-seq has a high 'dropout' rate, noise and linear inseparability. However, independence of dimension reduction, imputation and clustering cannot fully characterize the pattern of the scRNA-seq data, resulting in poor clustering performance. Herein, we propose a novel and accurate algorithm, SSNMDI, that utilizes a joint learning approach to simultaneously perform imputation, dimensionality reduction and cell clustering in a non-negative matrix factorization (NMF) framework. In addition, we integrate the cell annotation as prior information, then transform the joint learning into a semi-supervised NMF model. Through experiments on 14 datasets, we demonstrate that SSNMDI has a faster convergence speed, better dimensionality reduction performance and a more accurate cell clustering performance than previous methods, providing an accurate and robust strategy for analyzing scRNA-seq data. Biological analysis are also conducted to validate the biological significance of our method, including pseudotime analysis, gene ontology and survival analysis. We believe that we are among the first to introduce imputation, partial label information, dimension reduction and clustering to the single-cell field. AVAILABILITY AND IMPLEMENTATION The source code for SSNMDI is available at https://github.com/yushanqiu/SSNMDI.
Collapse
Affiliation(s)
- Yushan Qiu
- College of Mathematics and Statistics, Shenzhen University, 518000, Guangdong, China
| | - Chang Yan
- College of Mathematics and Statistics, Shenzhen University, 518000, Guangdong, China
| | - Pu Zhao
- College of Life and Health Sciences, Northeastern University, Shenyang, 110169, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610056, China
| |
Collapse
|
26
|
Crowell HL, Morillo Leonardo SX, Soneson C, Robinson MD. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol 2023; 24:62. [PMID: 36991470 PMCID: PMC10061781 DOI: 10.1186/s13059-023-02904-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2021] [Accepted: 03/20/2023] [Indexed: 03/31/2023] Open
Abstract
BACKGROUND With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant-on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard results credible and transferable to real data. RESULTS Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigate the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. CONCLUSIONS Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Collapse
Affiliation(s)
- Helena L Crowell
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
| | | | - Charlotte Soneson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Current address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Mark D Robinson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
| |
Collapse
|
27
|
Li H, Zhang Z, Squires M, Chen X, Zhang X. scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks. RESEARCH SQUARE 2023:rs.3.rs-2675530. [PMID: 36993284 PMCID: PMC10055660 DOI: 10.21203/rs.3.rs-2675530/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Simulated single-cell data is essential for designing and evaluating computational methods in the absence of experimental ground truth. Existing simulators typically focus on modeling one or two specific biological factors or mechanisms that affect the output data, which limits their capacity to simulate the complexity and multi-modality in real data. Here, we present scMultiSim, an in silico simulator that generates multi-modal single-cell data, including gene expression, chromatin accessibility, RNA velocity, and spatial cell locations while accounting for the relationships between modalities. scMultiSim jointly models various biological factors that affect the output data, including cell identity, within-cell gene regulatory networks (GRNs), cell-cell interactions (CCIs), and chromatin accessibility, while also incorporating technical noises. Moreover, it allows users to adjust each factor's effect easily. We validated scMultiSim's simulated biological effects and demonstrated its applications by benchmarking a wide range of computational tasks, including cell clustering and trajectory inference, multi-modal and multi-batch data integration, RNA velocity estimation, GRN inference and CCI inference using spatially resolved gene expression data. Compared to existing simulators, scMultiSim can benchmark a much broader range of existing computational problems and even new potential tasks.
Collapse
Affiliation(s)
- Hechen Li
- Georgia Institute of Technology, Atlanta, USA
| | - Ziqi Zhang
- Georgia Institute of Technology, Atlanta, USA
| | | | - Xi Chen
- Southern University of Science and Technology, China
| | | |
Collapse
|
28
|
Ding L, Shi H, Qian C, Burdyshaw C, Veloso JP, Khatamian A, Pan Q, Dhungana Y, Xie Z, Risch I, Yang X, Huang X, Yan L, Rusch M, Brewer M, Yan KK, Chi H, Yu J. scMINER: a mutual information-based framework for identifying hidden drivers from single-cell omics data. RESEARCH SQUARE 2023:rs.3.rs-2476875. [PMID: 36747874 PMCID: PMC9901036 DOI: 10.21203/rs.3.rs-2476875/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
The sparse nature of single-cell omics data makes it challenging to dissect the wiring and rewiring of the transcriptional and signaling drivers that regulate cellular states. Many of the drivers, referred to as "hidden drivers", are difficult to identify via conventional expression analysis due to low expression and inconsistency between RNA and protein activity caused by post-translational and other modifications. To address this issue, we developed scMINER, a mutual information (MI)-based computational framework for unsupervised clustering analysis and cell-type specific inference of intracellular networks, hidden drivers and network rewiring from single-cell RNA-seq data. We designed scMINER to capture nonlinear cell-cell and gene-gene relationships and infer driver activities. Systematic benchmarking showed that scMINER outperforms popular single-cell clustering algorithms, especially in distinguishing similar cell types. With respect to network inference, scMINER does not rely on the binding motifs which are available for a limited set of transcription factors, therefore scMINER can provide quantitative activity assessment for more than 6,000 transcription and signaling drivers from a scRNA-seq experiment. As demonstrations, we used scMINER to expose hidden transcription and signaling drivers and dissect their regulon rewiring in immune cell heterogeneity, lineage differentiation, and tissue specification. Overall, activity-based scMINER is a widely applicable, highly accurate, reproducible and scalable method for inferring cellular transcriptional and signaling networks in each cell state from scRNA-seq data. The scMINER software is publicly accessible via: https://github.com/jyyulab/scMINER.
Collapse
Affiliation(s)
- Liang Ding
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Hao Shi
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Department of Immunology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Chenxi Qian
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Chad Burdyshaw
- Department of Information Services, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Joao Pedro Veloso
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Alireza Khatamian
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Qingfei Pan
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Yogesh Dhungana
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Graduate School of Biomedical Sciences, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Zhen Xie
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Department of Physiology, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Isabel Risch
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Department of Immunology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Xu Yang
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Xin Huang
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Lei Yan
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Michael Rusch
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Michael Brewer
- Department of Information Services, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Koon-Kiu Yan
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Hongbo Chi
- Department of Immunology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Jiyang Yu
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| |
Collapse
|
29
|
Hsu LL, Culhane AC. Correspondence analysis for dimension reduction, batch integration, and visualization of single-cell RNA-seq data. Sci Rep 2023; 13:1197. [PMID: 36681709 PMCID: PMC9867729 DOI: 10.1038/s41598-022-26434-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2022] [Accepted: 12/14/2022] [Indexed: 01/22/2023] Open
Abstract
Effective dimension reduction is essential for single cell RNA-seq (scRNAseq) analysis. Principal component analysis (PCA) is widely used, but requires continuous, normally-distributed data; therefore, it is often coupled with log-transformation in scRNAseq applications, which can distort the data and obscure meaningful variation. We describe correspondence analysis (CA), a count-based alternative to PCA. CA is based on decomposition of a chi-squared residual matrix, avoiding distortive log-transformation. To address overdispersion and high sparsity in scRNAseq data, we propose five adaptations of CA, which are fast, scalable, and outperform standard CA and glmPCA, to compute cell embeddings with more performant or comparable clustering accuracy in 8 out of 9 datasets. In particular, we find that CA with Freeman-Tukey residuals performs especially well across diverse datasets. Other advantages of the CA framework include visualization of associations between genes and cell populations in a "CA biplot," and extension to multi-table analysis; we introduce corralm for integrative multi-table dimension reduction of scRNAseq data. We implement CA for scRNAseq data in corral, an R/Bioconductor package which interfaces directly with single cell classes in Bioconductor. Switching from PCA to CA is achieved through a simple pipeline substitution and improves dimension reduction of scRNAseq datasets.
Collapse
Affiliation(s)
- Lauren L Hsu
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
- Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Aedín C Culhane
- Limerick Digital Cancer Research Centre, Health Research Institute, School of Medicine, University of Limerick, Limerick, Ireland.
| |
Collapse
|
30
|
Harmanci A, Harmanci AS, Klisch TJ, Patel AJ. XCVATR: detection and characterization of variant impact on the Embeddings of single -cell and bulk RNA-sequencing samples. BMC Genomics 2022; 23:841. [PMID: 36539717 PMCID: PMC9764736 DOI: 10.1186/s12864-022-09004-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2021] [Accepted: 11/09/2022] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND RNA-sequencing has become a standard tool for analyzing gene activity in bulk samples and at the single-cell level. By increasing sample sizes and cell counts, this technique can uncover substantial information about cellular transcriptional states. Beyond quantification of gene expression, RNA-seq can be used for detecting variants, including single nucleotide polymorphisms, small insertions/deletions, and larger variants, such as copy number variants. Notably, joint analysis of variants with cellular transcriptional states may provide insights into the impact of mutations, especially for complex and heterogeneous samples. However, this analysis is often challenging due to a prohibitively high number of variants and cells, which are difficult to summarize and visualize. Further, there is a dearth of methods that assess and summarize the association between detected variants and cellular transcriptional states. RESULTS Here, we introduce XCVATR (eXpressed Clusters of Variant Alleles in Transcriptome pRofiles), a method that identifies variants and detects local enrichment of expressed variants within embedding of samples and cells in single-cell and bulk RNA-seq datasets. XCVATR visualizes local "clumps" of small and large-scale variants and searches for patterns of association between each variant and cellular states, as described by the coordinates of cell embedding, which can be computed independently using any type of distance metrics, such as principal component analysis or t-distributed stochastic neighbor embedding. Through simulations and analysis of real datasets, we demonstrate that XCVATR can detect enrichment of expressed variants and provide insight into the transcriptional states of cells and samples. We next sequenced 2 new single cell RNA-seq tumor samples and applied XCVATR. XCVATR revealed subtle differences in CNV impact on tumors. CONCLUSIONS XCVATR is publicly available to download from https://github.com/harmancilab/XCVATR .
Collapse
Affiliation(s)
- Arif Harmanci
- grid.267308.80000 0000 9206 2401University of Texas Health Science Center, School of Biomedical Informatics, Center for Secure Artificial intelligence For hEalthcare (SAFE), Center for Precision Health, Houston, USA
| | - Akdes Serin Harmanci
- grid.39382.330000 0001 2160 926XDepartment of Neurosurgery, Baylor College of Medicine, Houston, TX 77030 USA
| | - Tiemo J. Klisch
- grid.416975.80000 0001 2200 2638Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX 77030 USA ,grid.39382.330000 0001 2160 926XDepartment of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
| | - Akash J. Patel
- grid.39382.330000 0001 2160 926XDepartment of Neurosurgery, Baylor College of Medicine, Houston, TX 77030 USA ,grid.416975.80000 0001 2200 2638Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX 77030 USA ,grid.39382.330000 0001 2160 926XDepartment of Otolaryngology – Head and Neck Surgery, Baylor College of Medicine, Houston, TX 77030 USA
| |
Collapse
|
31
|
Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] Open
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Collapse
Affiliation(s)
- Min Su
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Tao Pan
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Qiu-Zhen Chen
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Wei-Wei Zhou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China
| | - Yi Gong
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China.,Department of Immunology, Nanjing Medical University, Nanjing, 211166, China
| | - Gang Xu
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Huan-Yu Yan
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Si Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Qiao-Zhen Shi
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China
| | - Ya Zhang
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China
| | - Xiao He
- Department of Laboratory Medicine, Women and Children's Hospital of Chongqing Medical University, Chongqing, 401174, China
| | | | - Shi-Cai Fan
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110, Guangdong, China
| | - Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China.
| | - Murray J Cairns
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW, 2308, Australia. .,Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW, 2305, Australia.
| | - Xi Wang
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China.
| | - Yong-Sheng Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China.
| |
Collapse
|
32
|
Jiménez‐Santos MJ, García‐Martín S, Fustero‐Torre C, Di Domenico T, Gómez‐López G, Al‐Shahrour F. Bioinformatics roadmap for therapy selection in cancer genomics. Mol Oncol 2022; 16:3881-3908. [PMID: 35811332 PMCID: PMC9627786 DOI: 10.1002/1878-0261.13286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2022] [Revised: 06/22/2022] [Accepted: 07/08/2022] [Indexed: 12/24/2022] Open
Abstract
Tumour heterogeneity is one of the main characteristics of cancer and can be categorised into inter- or intratumour heterogeneity. This heterogeneity has been revealed as one of the key causes of treatment failure and relapse. Precision oncology is an emerging field that seeks to design tailored treatments for each cancer patient according to epidemiological, clinical and omics data. This discipline relies on bioinformatics tools designed to compute scores to prioritise available drugs, with the aim of helping clinicians in treatment selection. In this review, we describe the current approaches for therapy selection depending on which type of tumour heterogeneity is being targeted and the available next-generation sequencing data. We cover intertumour heterogeneity studies and individual treatment selection using genomics variants, expression data or multi-omics strategies. We also describe intratumour dissection through clonal inference and single-cell transcriptomics, in each case providing bioinformatics tools for tailored treatment selection. Finally, we discuss how these therapy selection workflows could be integrated into the clinical practice.
Collapse
Affiliation(s)
| | | | - Coral Fustero‐Torre
- Bioinformatics UnitSpanish National Cancer Research Centre (CNIO)MadridSpain
| | - Tomás Di Domenico
- Bioinformatics UnitSpanish National Cancer Research Centre (CNIO)MadridSpain
| | - Gonzalo Gómez‐López
- Bioinformatics UnitSpanish National Cancer Research Centre (CNIO)MadridSpain
| | - Fátima Al‐Shahrour
- Bioinformatics UnitSpanish National Cancer Research Centre (CNIO)MadridSpain
| |
Collapse
|
33
|
Couckuyt A, Seurinck R, Emmaneel A, Quintelier K, Novak D, Van Gassen S, Saeys Y. Challenges in translational machine learning. Hum Genet 2022; 141:1451-1466. [PMID: 35246744 PMCID: PMC8896412 DOI: 10.1007/s00439-022-02439-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2021] [Accepted: 02/08/2022] [Indexed: 11/25/2022]
Abstract
Machine learning (ML) algorithms are increasingly being used to help implement clinical decision support systems. In this new field, we define as "translational machine learning", joint efforts and strong communication between data scientists and clinicians help to span the gap between ML and its adoption in the clinic. These collaborations also improve interpretability and trust in translational ML methods and ultimately aim to result in generalizable and reproducible models. To help clinicians and bioinformaticians refine their translational ML pipelines, we review the steps from model building to the use of ML in the clinic. We discuss experimental setup, computational analysis, interpretability and reproducibility, and emphasize the challenges involved. We highly advise collaboration and data sharing between consortia and institutes to build multi-centric cohorts that facilitate ML methodologies that generalize across centers. In the end, we hope that this review provides a way to streamline translational ML and helps to tackle the challenges that come with it.
Collapse
Affiliation(s)
- Artuur Couckuyt
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
| | - Ruth Seurinck
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
| | - Annelies Emmaneel
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
| | - Katrien Quintelier
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Department of Pulmonary Diseases, Erasmus MC, Rotterdam, The Netherlands
| | - David Novak
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
| | - Sofie Van Gassen
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
| | - Yvan Saeys
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium.
- Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium.
| |
Collapse
|
34
|
Li Z, Zhou X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol 2022; 23:168. [PMID: 35927760 PMCID: PMC9351148 DOI: 10.1186/s13059-022-02734-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2022] [Accepted: 07/21/2022] [Indexed: 02/08/2023] Open
Abstract
Spatial transcriptomic studies are reaching single-cell spatial resolution, with data often collected from multiple tissue sections. Here, we present a computational method, BASS, that enables multi-scale and multi-sample analysis for single-cell resolution spatial transcriptomics. BASS performs cell type clustering at the single-cell scale and spatial domain detection at the tissue regional scale, with the two tasks carried out simultaneously within a Bayesian hierarchical modeling framework. We illustrate the benefits of BASS through comprehensive simulations and applications to three datasets. The substantial power gain brought by BASS allows us to reveal accurate transcriptomic and cellular landscape in both cortex and hypothalamus.
Collapse
Affiliation(s)
- Zheng Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA.,Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA. .,Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA.
| |
Collapse
|
35
|
Algabri YA, Li L, Liu ZP. scGENA: A Single-Cell Gene Coexpression Network Analysis Framework for Clustering Cell Types and Revealing Biological Mechanisms. Bioengineering (Basel) 2022; 9:bioengineering9080353. [PMID: 36004879 PMCID: PMC9405199 DOI: 10.3390/bioengineering9080353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2022] [Revised: 07/27/2022] [Accepted: 07/27/2022] [Indexed: 11/16/2022] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) is a recent high-throughput technique that can measure gene expression, reveal cell heterogeneity, rare and complex cell populations, and discover cell types and their relationships. The analysis of scRNA-seq data is challenging because of transcripts sparsity, replication noise, and outlier cell populations. A gene coexpression network (GCN) analysis effectively deciphers phenotypic differences in specific states by describing gene–gene pairwise relationships. The underlying gene modules with different coexpression patterns partially bridge the gap between genotype and phenotype. This study presents a new framework called scGENA (single-cell gene coexpression network analysis) for GCN analysis based on scRNA-seq data. Although there are several methods for scRNA-seq data analysis, we aim to build an integrative pipeline for several purposes that cover primary data preprocessing, including data exploration, quality control, normalization, imputation, and dimensionality reduction of clustering as downstream of GCN analysis. To demonstrate this integrated workflow, an scRNA-seq dataset of the human diabetic pancreas with 1600 cells and 39,851 genes was implemented to perform all these processes in practice. As a result, scGENA is demonstrated to uncover interesting gene modules behind complex diseases, which reveal biological mechanisms. scGENA provides a state-of-the-art method for gene coexpression analysis for scRNA-seq data.
Collapse
|
36
|
Huang H, Wang Y, Rudin C, Browne EP. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun Biol 2022; 5:719. [PMID: 35853932 PMCID: PMC9296444 DOI: 10.1038/s42003-022-03628-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2021] [Accepted: 06/23/2022] [Indexed: 12/11/2022] Open
Abstract
Dimension reduction (DR) algorithms project data from high dimensions to lower dimensions to enable visualization of interesting high-dimensional structure. DR algorithms are widely used for analysis of single-cell transcriptomic data. Despite widespread use of DR algorithms such as t-SNE and UMAP, these algorithms have characteristics that lead to lack of trust: they do not preserve important aspects of high-dimensional structure and are sensitive to arbitrary user choices. Given the importance of gaining insights from DR, DR methods should be evaluated carefully before trusting their results. In this paper, we introduce and perform a systematic evaluation of popular DR methods, including t-SNE, art-SNE, UMAP, PaCMAP, TriMap and ForceAtlas2. Our evaluation considers five components: preservation of local structure, preservation of global structure, sensitivity to parameter choices, sensitivity to preprocessing choices, and computational efficiency. This evaluation can help us to choose DR tools that align with the scientific goals of the user.
Collapse
Affiliation(s)
- Haiyang Huang
- Department of Computer Science, Duke University, Durham, NC, 27708, USA
| | - Yingfan Wang
- Department of Computer Science, Duke University, Durham, NC, 27708, USA
| | - Cynthia Rudin
- Department of Computer Science, Duke University, Durham, NC, 27708, USA
| | - Edward P Browne
- Department of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
| |
Collapse
|
37
|
Ellis D, Wu D, Datta S. SAREV: A review on statistical analytics of single-cell RNA sequencing data. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2022; 14:e1558. [PMID: 36034329 PMCID: PMC9400796 DOI: 10.1002/wics.1558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/07/2020] [Accepted: 04/09/2021] [Indexed: 06/15/2023]
Abstract
Due to the development of next-generation RNA sequencing (NGS) technologies, there has been tremendous progress in research involving determining the role of genomics, transcriptomics and epigenomics in complex biological systems. However, scientists have realized that information obtained using earlier technology, frequently called 'bulk RNA-seq' data, provides information averaged across all the cells present in a tissue. Relatively newly developed single cell (scRNA-seq) technology allows us to provide transcriptomic information at a single-cell resolution. Nevertheless, these high-resolution data have their own complex natures and demand novel statistical data analysis methods to provide effective and highly accurate results on complex biological systems. In this review, we cover many such recently developed statistical methods for researchers wanting to pursue scRNA-seq statistical and computational research as well as scientific research about these existing methods and free software tools available for their generated data. This review is certainly not exhaustive due to page limitations. We have tried to cover the popular methods starting from quality control to the downstream analysis of finding differentially expressed genes and concluding with a brief description of network analysis.
Collapse
Affiliation(s)
- Dorothy Ellis
- Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
| | - Dongyuan Wu
- Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
| | - Susmita Datta
- Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
| |
Collapse
|
38
|
Anchang B, Mendez-Giraldez R, Xu X, Archer TK, Chen Q, Hu G, Plevritis SK, Motsinger-Reif AA, Li JL. Visualization, benchmarking and characterization of nested single-cell heterogeneity as dynamic forest mixtures. Brief Bioinform 2022; 23:6534382. [PMID: 35192692 PMCID: PMC8921621 DOI: 10.1093/bib/bbac017] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2021] [Revised: 11/19/2021] [Accepted: 01/13/2022] [Indexed: 11/13/2022] Open
Abstract
A major topic of debate in developmental biology centers on whether development is continuous, discontinuous, or a mixture of both. Pseudo-time trajectory models, optimal for visualizing cellular progression, model cell transitions as continuous state manifolds and do not explicitly model real-time, complex, heterogeneous systems and are challenging for benchmarking with temporal models. We present a data-driven framework that addresses these limitations with temporal single-cell data collected at discrete time points as inputs and a mixture of dependent minimum spanning trees (MSTs) as outputs, denoted as dynamic spanning forest mixtures (DSFMix). DSFMix uses decision-tree models to select genes that account for variations in multimodality, skewness and time. The genes are subsequently used to build the forest using tree agglomerative hierarchical clustering and dynamic branch cutting. We first motivate the use of forest-based algorithms compared to single-tree approaches for visualizing and characterizing developmental processes. We next benchmark DSFMix to pseudo-time and temporal approaches in terms of feature selection, time correlation, and network similarity. Finally, we demonstrate how DSFMix can be used to visualize, compare and characterize complex relationships during biological processes such as epithelial-mesenchymal transition, spermatogenesis, stem cell pluripotency, early transcriptional response from hormones and immune response to coronavirus disease. Our results indicate that the expression of genes during normal development exhibits a high proportion of non-uniformly distributed profiles that are mostly right-skewed and multimodal; the latter being a characteristic of major steady states during development. Our study also identifies and validates gene signatures driving complex dynamic processes during somatic or germline differentiation.
Collapse
Affiliation(s)
- Benedict Anchang
- Corresponding author: Benedict Anchang, Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences. 111 T W Alexander Dr, Research Triangle Park, NC 27709, USA and Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, USA. Tel +1 984-287-3350; E-mail:
| | - Raul Mendez-Giraldez
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Stanford, California, USA
| | - Xiaojiang Xu
- Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, Stanford, California, USA
| | - Trevor K Archer
- Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA
| | - Qing Chen
- Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA
| | - Guang Hu
- Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA
| | - Sylvia K Plevritis
- Department of Biomedical Data Science, Center for Cancer Systems Biology, Stanford University, Stanford, California, USA
| | - Alison Anne Motsinger-Reif
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Stanford, California, USA
| | - Jian-Liang Li
- Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, Stanford, California, USA
| |
Collapse
|
39
|
A Primer for Single-Cell Sequencing in Non-Model Organisms. Genes (Basel) 2022; 13:genes13020380. [PMID: 35205423 PMCID: PMC8872538 DOI: 10.3390/genes13020380] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2022] [Revised: 02/12/2022] [Accepted: 02/17/2022] [Indexed: 02/05/2023] Open
Abstract
Single-cell sequencing technologies have led to a revolution in our knowledge of the diversity of cell types, connections between biological levels of organization, and relationships between genotype and phenotype. These advances have mainly come from using model organisms; however, using single-cell sequencing in non-model organisms could enable investigations of questions inaccessible with typical model organisms. This primer describes a general workflow for single-cell sequencing studies and considerations for using non-model organisms (limited to multicellular animals). Importantly, single-cell sequencing, when further applied in non-model organisms, will allow for a deeper understanding of the mechanisms between genotype and phenotype and the basis for biological variation.
Collapse
|
40
|
Yu L, Cao Y, Yang JYH, Yang P. Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data. Genome Biol 2022; 23:49. [PMID: 35135612 PMCID: PMC8822786 DOI: 10.1186/s13059-022-02622-0] [Citation(s) in RCA: 31] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2021] [Accepted: 01/27/2022] [Indexed: 01/24/2023] Open
Abstract
Background A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. Various scRNA-seq data clustering algorithms have been specifically designed to automatically estimate the number of cell types through optimising the number of clusters in a dataset. The lack of benchmark studies, however, complicates the choice of the methods. Results We systematically benchmark a range of popular clustering algorithms on estimating the number of cell types in a variety of settings by sampling from the Tabula Muris data to create scRNA-seq datasets with a varying number of cell types, varying number of cells in each cell type, and different cell type proportions. The large number of datasets enables us to assess the performance of the algorithms, covering four broad categories of approaches, from various aspects using a panel of criteria. We further cross-compared the performance on datasets with high cell numbers using Tabula Muris and Tabula Sapiens data. Conclusions We identify the strengths and weaknesses of each method on multiple criteria including the deviation of estimation from the true number of cell types, variability of estimation, clustering concordance of cells to their predefined cell types, and running time and peak memory usage. We then summarise these results into a multi-aspect recommendation to the users. The proposed stability-based approach for estimating the number of cell types is implemented in an R package and is freely available from (https://github.com/PYangLab/scCCESS). Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02622-0.
Collapse
Affiliation(s)
- Lijia Yu
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia.,Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia.,Charles Perkins Centre, University of Sydney, Sydney, NSW, 2006, Australia
| | - Yue Cao
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia.,Charles Perkins Centre, University of Sydney, Sydney, NSW, 2006, Australia
| | - Jean Y H Yang
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia.,Charles Perkins Centre, University of Sydney, Sydney, NSW, 2006, Australia
| | - Pengyi Yang
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia. .,Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia. .,Charles Perkins Centre, University of Sydney, Sydney, NSW, 2006, Australia.
| |
Collapse
|
41
|
Baruzzo G, Patuzzi I, Di Camillo B. Beware to ignore the rare: how imputing zero-values can improve the quality of 16S rRNA gene studies results. BMC Bioinformatics 2022; 22:618. [PMID: 35130833 PMCID: PMC8822630 DOI: 10.1186/s12859-022-04587-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2022] [Accepted: 01/27/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND 16S rRNA-gene sequencing is a valuable approach to characterize the taxonomic content of the whole bacterial population inhabiting a metabolic and spatial niche, providing an important opportunity to study bacteria and their role in many health and environmental mechanisms. The analysis of data produced by amplicon sequencing, however, brings very specific methodological issues that need to be properly addressed to obtain reliable biological conclusions. Among these, 16S count data tend to be very sparse, with many null values reflecting species that are present but got unobserved due to the multiplexing constraints. However, current data workflows do not consider a step in which the information about unobserved species is recovered. RESULTS In this work, we evaluate for the first time the effects of introducing in the 16S data workflow a new preprocessing step, zero-imputation, to recover this lost information. Due to the lack of published zero-imputation methods specifically designed for 16S count data, we considered a set of zero-imputation strategies available for other frameworks, and benchmarked them using in silico 16S count data reflecting different experimental designs. Additionally, we assessed the effect of combining zero-imputation and normalization, i.e. the only preprocessing step in current 16S workflow. Overall, we benchmarked 35 16S preprocessing pipelines assessing their ability to handle data sparsity, identify species presence/absence, recovery sample proportional abundance distributions, and improve typical downstream analyses such as computation of alpha and beta diversity indices and differential abundance analysis. CONCLUSIONS The results clearly show that 16S data analysis greatly benefits from a properly-performed zero-imputation step, despite the choice of the right zero-imputation method having a pivotal role. In addition, we identify a set of best-performing pipelines that could be a valuable indication for data analysts.
Collapse
Affiliation(s)
- Giacomo Baruzzo
- Department of Information Engineering, University of Padova, Padua, Italy
| | - Ilaria Patuzzi
- Department of Information Engineering, University of Padova, Padua, Italy
- Microbial Ecology Unit, Istituto Zooprofilattico Sperimentale Delle Venezie, Padua, Italy
- Research & Development Division, EuBiome S.R.L., Padua, Italy
| | - Barbara Di Camillo
- Department of Information Engineering, University of Padova, Padua, Italy.
- CRIBI Biotechnology Centre, University of Padova, Padua, Italy.
- Department of Comparative Biomedicine and Food Science, University of Padova, Padua, Italy.
| |
Collapse
|
42
|
Li Z, Feng H. A neural network-based method for exhaustive cell label assignment using single cell RNA-seq data. Sci Rep 2022; 12:910. [PMID: 35042860 PMCID: PMC8766435 DOI: 10.1038/s41598-021-04473-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2021] [Accepted: 12/21/2021] [Indexed: 02/01/2023] Open
Abstract
The fast-advancing single cell RNA sequencing (scRNA-seq) technology enables researchers to study the transcriptome of heterogeneous tissues at a single cell level. The initial important step of analyzing scRNA-seq data is usually to accurately annotate cells. The traditional approach of annotating cell types based on unsupervised clustering and marker genes is time-consuming and laborious. Taking advantage of the numerous existing scRNA-seq databases, many supervised label assignment methods have been developed. One feature that many label assignment methods shares is to label cells with low confidence as "unassigned." These unassigned cells can be the result of assignment difficulties due to highly similar cell types or caused by the presence of unknown cell types. However, when unknown cell types are not expected, existing methods still label a considerable number of cells as unassigned, which is not desirable. In this work, we develop a neural network-based cell annotation method called NeuCA (Neural network-based Cell Annotation) for scRNA-seq data obtained from well-studied tissues. NeuCA can utilize the hierarchical structure information of the cell types to improve the annotation accuracy, which is especially helpful when data contain closely correlated cell types. We show that NeuCA can achieve more accurate cell annotation results compared with existing methods. Additionally, the applications on eight real datasets show that NeuCA has stable performance for intra- and inter-study annotation, as well as cross-condition annotation. NeuCA is freely available as an R/Bioconductor package at https://bioconductor.org/packages/NeuCA .
Collapse
Affiliation(s)
- Ziyi Li
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
| | - Hao Feng
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, 44106, USA.
| |
Collapse
|
43
|
Schiebout C, Frost HR. CAMML: Multi-Label Immune Cell-Typing and Stemness Analysis for Single-Cell RNA-sequencing. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2022; 27:199-210. [PMID: 34890149 PMCID: PMC8669732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
Inferring the cell types in single-cell RNA-sequencing (scRNA-seq) data is of particular importance for understanding the potential cellular mechanisms and phenotypes occurring in complex tissues, such as the tumor-immune microenvironment (TME). The sparsity and noise of scRNA-seq data, combined with the fact that immune cell types often occur on a continuum, make cell typing of TME scRNA-seq data a significant challenge. Several single-label cell typing methods have been put forth to address the limitations of noise and sparsity, but accounting for the often overlapped spectrum of cell types in the immune TME remains an obstacle. To address this, we developed a new scRNA-seq cell-typing method, Cell-typing using variance Adjusted Mahalanobis distances with Multi-Labeling (CAMML). CAMML leverages cell type-specific weighted gene sets to score every cell in a dataset for every potential cell type. This allows cells to be labelled either by their highest scoring cell type as a single label classification or based on a score cut-off to give multi-label classification. For single-label cell typing, CAMML performance is comparable to existing cell typing methods, SingleR and Garnett. For scenarios where cells may exhibit features of multiple cell types (e.g., undifferentiated cells), the multi-label classification supported by CAMML offers important benefits relative to the current state-of-the-art methods. By integrating data across studies, omics platforms, and species, CAMML serves as a robust and adaptable method for overcoming the challenges of scRNA-seq analysis.
Collapse
|
44
|
You Y, Tian L, Su S, Dong X, Jabbari JS, Hickey PF, Ritchie ME. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol 2021; 22:339. [PMID: 34906205 PMCID: PMC8672463 DOI: 10.1186/s13059-021-02552-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2021] [Accepted: 11/22/2021] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Single-cell RNA-sequencing (scRNA-seq) technologies and associated analysis methods have rapidly developed in recent years. This includes preprocessing methods, which assign sequencing reads to genes to create count matrices for downstream analysis. While several packaged preprocessing workflows have been developed to provide users with convenient tools for handling this process, how they compare to one another and how they influence downstream analysis have not been well studied. RESULTS Here, we systematically benchmark the performance of 10 end-to-end preprocessing workflows (Cell Ranger, Optimus, salmon alevin, alevin-fry, kallisto bustools, dropSeqPipe, scPipe, zUMIs, celseq2, and scruff) using datasets yielding different biological complexity levels generated by CEL-Seq2 and 10x Chromium platforms. We compare these workflows in terms of their quantification properties directly and their impact on normalization and clustering by evaluating the performance of different method combinations. While the scRNA-seq preprocessing workflows compared vary in their detection and quantification of genes across datasets, after downstream analysis with performant normalization and clustering methods, almost all combinations produce clustering results that agree well with the known cell type labels that provided the ground truth in our analysis. CONCLUSIONS In summary, the choice of preprocessing method was found to be less important than other steps in the scRNA-seq analysis process. Our study comprehensively compares common scRNA-seq preprocessing workflows and summarizes their characteristics to guide workflow users.
Collapse
Affiliation(s)
- Yue You
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
| | - Luyi Tian
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
| | - Shian Su
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
| | - Xueyi Dong
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
| | - Jafar S. Jabbari
- Australian Genome Research Facility, Victorian Comprehensive Cancer Centre, Melbourne, Australia
- Microbiological Diagnostic Unit Public Health Laboratory, Department of Microbiology and Immunology, The University of Melbourne at The Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
| | - Peter F. Hickey
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
- Single-Cell Open Research Endeavour (SCORE), The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
| | - Matthew E. Ritchie
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
- School of Mathematics and Statistics, The University of Melbourne, Parkville, Australia
| |
Collapse
|
45
|
Ostner J, Carcy S, Müller CL. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data. Front Genet 2021; 12:766405. [PMID: 34950190 PMCID: PMC8689185 DOI: 10.3389/fgene.2021.766405] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2021] [Accepted: 11/01/2021] [Indexed: 12/11/2022] Open
Abstract
Accurate generative statistical modeling of count data is of critical relevance for the analysis of biological datasets from high-throughput sequencing technologies. Important instances include the modeling of microbiome compositions from amplicon sequencing surveys and the analysis of cell type compositions derived from single-cell RNA sequencing. Microbial and cell type abundance data share remarkably similar statistical features, including their inherent compositionality and a natural hierarchical ordering of the individual components from taxonomic or cell lineage tree information, respectively. To this end, we introduce a Bayesian model for tree-aggregated amplicon and single-cell compositional data analysis (tascCODA) that seamlessly integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data. By combining latent parameters based on the tree structure with spike-and-slab Lasso penalization, tascCODA can determine covariate effects across different levels of the population hierarchy in a data-driven parsimonious way. In the context of differential abundance testing, we validate tascCODA's excellent performance on a comprehensive set of synthetic benchmark scenarios. Our analyses on human single-cell RNA-seq data from ulcerative colitis patients and amplicon data from patients with irritable bowel syndrome, respectively, identified aggregated cell type and taxon compositional changes that were more predictive and parsimonious than those proposed by other schemes. We posit that tascCODA constitutes a valuable addition to the growing statistical toolbox for generative modeling and analysis of compositional changes in microbial or cell population data.
Collapse
Affiliation(s)
- Johannes Ostner
- Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
| | - Salomé Carcy
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Department of Biology, École Normale Supérieure, PSL University, Paris, France
| | - Christian L. Müller
- Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Center for Computational Mathematics, Flatiron Institute, New York, NY, United States
| |
Collapse
|
46
|
Morelli L, Giansanti V, Cittaro D. Nested Stochastic Block Models applied to the analysis of single cell data. BMC Bioinformatics 2021; 22:576. [PMID: 34847879 PMCID: PMC8630903 DOI: 10.1186/s12859-021-04489-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2021] [Accepted: 11/19/2021] [Indexed: 12/30/2022] Open
Abstract
Single cell profiling has been proven to be a powerful tool in molecular biology to understand the complex behaviours of heterogeneous system. The definition of the properties of single cells is the primary endpoint of such analysis, cells are typically clustered to underpin the common determinants that can be used to describe functional properties of the cell mixture under investigation. Several approaches have been proposed to identify cell clusters; while this is matter of active research, one popular approach is based on community detection in neighbourhood graphs by optimisation of modularity. In this paper we propose an alternative and principled solution to this problem, based on Stochastic Block Models. We show that such approach not only is suitable for identification of cell groups, it also provides a solid framework to perform other relevant tasks in single cell analysis, such as label transfer. To encourage the use of Stochastic Block Models, we developed a python library, schist, that is compatible with the popular scanpy framework.
Collapse
Affiliation(s)
- Leonardo Morelli
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Università Vita-Salute San Raffaele, Milan, Italy
| | - Valentina Giansanti
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
| | - Davide Cittaro
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy.
| |
Collapse
|
47
|
Chan A, Jiang W, Blyth E, Yang J, Patrick E. treekoR: identifying cellular-to-phenotype associations by elucidating hierarchical relationships in high-dimensional cytometry data. Genome Biol 2021; 22:324. [PMID: 34844647 PMCID: PMC8628061 DOI: 10.1186/s13059-021-02526-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2021] [Accepted: 10/26/2021] [Indexed: 12/13/2022] Open
Abstract
High-throughput single-cell technologies hold the promise of discovering novel cellular relationships with disease. However, analytical workflows constructed for these technologies to associate cell proportions with disease often employ unsupervised clustering techniques that overlook the valuable hierarchical structures that have been used to define cell types. We present treekoR, a framework that empirically recapitulates these structures, facilitating multiple quantifications and comparisons of cell type proportions. Our results from twelve case studies reinforce the importance of quantifying proportions relative to parent populations in the analyses of cytometry data — as failing to do so can lead to missing important biological insights.
Collapse
Affiliation(s)
- Adam Chan
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia.,Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
| | - Wei Jiang
- Centre for Cancer Research, Westmead Institute for Medical Research, The University of Sydney, Sydney, New South Wales, Australia.,Faculty of Medicine and Health, The University of Sydney, Sydney, New South Wales, Australia
| | - Emily Blyth
- Centre for Cancer Research, Westmead Institute for Medical Research, The University of Sydney, Sydney, New South Wales, Australia.,Faculty of Medicine and Health, The University of Sydney, Sydney, New South Wales, Australia.,Blood Transplant and Cell Therapies Program, Department of Haematology, Westmead Hospital, Westmead, NSW, Australia
| | - Jean Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia.,Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia.,Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
| | - Ellis Patrick
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia. .,Centre for Cancer Research, Westmead Institute for Medical Research, The University of Sydney, Sydney, New South Wales, Australia. .,Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China.
| |
Collapse
|
48
|
Bej S, Galow AM, David R, Wolfien M, Wolkenhauer O. Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling. BMC Bioinformatics 2021; 22:557. [PMID: 34798805 PMCID: PMC8603509 DOI: 10.1186/s12859-021-04469-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2021] [Accepted: 11/03/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the area for the detection of rare cells was highly facilitated by this technology. However, an automated, unbiased, and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare-cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as an input to generate synthetic cells to then identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. RESULTS We demonstrate the effectiveness of our method for three independent use cases, each consisting of already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). This use case was designed to take a larger imbalance ratio (~1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (~1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single-cell capture procedures and the impact of "less" rare-cell types. The third dataset refers to the murine data of the Allen Brain Atlas, including more than 1 million cells. For validation purposes only, all datasets have also been analyzed traditionally using common data analysis approaches, such as the Seurat workflow. CONCLUSIONS In comparison to baseline testing without oversampling, our approach identifies rare-cells with a robust precision-recall balance, including a high accuracy and low false positive detection rate. A practical benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code basis in R and Python is publicly available at FairdomHub, as well as GitHub, and can easily be transferred to identify other rare-cell types.
Collapse
Affiliation(s)
- Saptarshi Bej
- Department of Systems Biology and Bioinformatics, University of Rostock, 18057, Rostock, Germany
- Leibniz-Institute for Food Systems Biology, Technical University of Munich, 85354, Freising, Germany
| | - Anne-Marie Galow
- Institute of Genome Biology, Research Institute for Farm Animal Biology, 18196, Dummerstorf, Germany
| | - Robert David
- Department of Cardiac Surgery, Rostock University Medical Centre, 18057, Rostock, Germany
- Department of Life, Light and Matter, University of Rostock, 18059, Rostock, Germany
| | - Markus Wolfien
- Department of Systems Biology and Bioinformatics, University of Rostock, 18057, Rostock, Germany
| | - Olaf Wolkenhauer
- Department of Systems Biology and Bioinformatics, University of Rostock, 18057, Rostock, Germany.
- Leibniz-Institute for Food Systems Biology, Technical University of Munich, 85354, Freising, Germany.
- Stellenbosch Institute of Advanced Study, Stellenbosch University, Stellenbosch, 7602, South Africa.
| |
Collapse
|
49
|
Zappia L, Theis FJ. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol 2021; 22:301. [PMID: 34715899 PMCID: PMC8555270 DOI: 10.1186/s13059-021-02519-4] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022] Open
Abstract
Recent years have seen a revolution in single-cell RNA-sequencing (scRNA-seq) technologies, datasets, and analysis methods. Since 2016, the scRNA-tools database has cataloged software tools for analyzing scRNA-seq data. With the number of tools in the database passing 1000, we provide an update on the state of the project and the field. This data shows the evolution of the field and a change of focus from ordering cells on continuous trajectories to integrating multiple samples and making use of reference datasets. We also find that open science practices reward developers with increased recognition and help accelerate the field.
Collapse
Affiliation(s)
- Luke Zappia
- Institute of Computational Biology, Helmholtz Zentrum München, 85764, Neuherberg, Germany
- Department of Mathematics, Technical University of Munich, 85748, Garching bei München, Germany
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Zentrum München, 85764, Neuherberg, Germany.
- Department of Mathematics, Technical University of Munich, 85748, Garching bei München, Germany.
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, 85354, Freising, Germany.
| |
Collapse
|
50
|
Xie B, Jiang Q, Mora A, Li X. Automatic cell type identification methods for single-cell RNA sequencing. Comput Struct Biotechnol J 2021; 19:5874-5887. [PMID: 34815832 PMCID: PMC8572862 DOI: 10.1016/j.csbj.2021.10.027] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2021] [Revised: 09/23/2021] [Accepted: 10/18/2021] [Indexed: 11/24/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a powerful tool for scientists of many research disciplines due to its ability to elucidate the heterogeneous and complex cell-type compositions of different tissues and cell populations. Traditional cell-type identification methods for scRNA-seq data analysis are time-consuming and knowledge-dependent for manual annotation. By contrast, automatic cell-type identification methods may have the advantages of being fast, accurate, and more user friendly. Here, we discuss and evaluate thirty-two published automatic methods for scRNA-seq data analysis in terms of their prediction accuracy, F1-score, unlabeling rate and running time. We highlight the advantages and disadvantages of these methods and provide recommendations of method choice depending on the available information. The challenges and future applications of these automatic methods are further discussed. In addition, we provide a free scRNA-seq data analysis package encompassing the discussed automatic methods to help the easy usage of them in real-world applications.
Collapse
Affiliation(s)
- Bingbing Xie
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510060, Guangdong, China
| | - Qin Jiang
- Affiliated Eye Hospital of Nanjing Medical University, Nanjing, China
| | - Antonio Mora
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Xinzao, Panyu District, Guangzhou 511436, Guangdong, China
| | - Xuri Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510060, Guangdong, China
| |
Collapse
|