1
Sin DD. What Single Cell RNA Sequencing Has Taught Us about Chronic Obstructive Pulmonary Disease. Tuberc Respir Dis (Seoul) 2024; 87:252-260. [PMID: 38369875 PMCID: PMC11222093 DOI: 10.4046/trd.2024.0001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 02/17/2024] [Indexed: 02/20/2024]
Abstract
Chronic obstructive pulmonary disease (COPD) affects close to 400 million people worldwide and is the 3rd leading cause of mortality. It is a heterogeneous disorder with multiple endophenotypes, each driven by specific molecular networks and processes. Therapeutic discovery in COPD has lagged behind other disease areas owing to a lack of understanding of its pathobiology and scarcity of biomarkers to guide therapies. Single cell RNA sequencing (scRNA-seq) is a powerful new tool to identify important cellular and molecular networks that play a crucial role in disease pathogenesis. This paper provides an overview of the scRNA-seq technology and its application in COPD and the lessons learned to date from scRNA-seq experiments in COPD.
Affiliation(s)
- Don D. Sin
  - Centre for Heart Lung Innovation, St. Paul’s Hospital and Division of Respiratory Medicine, University of British Columbia, Vancouver, BC, Canada
2
Yarlagadda S, Giorgio TD. A guide to single-cell RNA sequencing analysis using web-based tools for non-bioinformatician. FEBS J 2024; 291:2545-2561. [PMID: 38148322 DOI: 10.1111/febs.17036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 12/14/2023] [Indexed: 12/28/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a technique that has proven to be a powerful tool for a wide range of fields and research studies. However, scRNA-seq data analysis has been dominated by scientists highly trained in bioinformatics or those with extensive computational experience and understanding. Recently, this trend has begun to shift as more user-friendly web-based scRNA-seq analysis tools have been developed that require little computational experience to use. However, barriers persist for nonbioinformaticians in using this technique. Complex, unfamiliar language and scarce comprehensive literature guidance to provide a framework for understanding scRNA-seq analysis outputs are among the obstacles. This work introduces many popular web-based tools for scRNA-seq and provides a general overview of their user interfaces and features. Then, a comprehensive start-to-finish introductory scRNA-seq analysis pipeline is described in detail, which aims to enable researchers to carry out scRNA-seq analysis, regardless of computational experience. Companion video tutorials can be found at "EasyScRNAseqTutorials" on YouTube (https://www.youtube.com/@scrnaseqtutorials). However, as scRNA-seq continues to penetrate new fields and expand in importance, there remains a need for more literature to help overcome barriers to its use by explaining further the highly complex and advanced analyses that are introduced within this paper.
Affiliation(s)
- Todd D Giorgio
  - Biomedical Engineering, Vanderbilt University, Nashville, TN, USA
3
Ali M, Yang T, He H, Zhang Y. Plant biotechnology research with single-cell transcriptome: recent advancements and prospects. Plant Cell Reports 2024; 43:75. [PMID: 38381195 DOI: 10.1007/s00299-024-03168-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 02/05/2024] [Indexed: 02/22/2024]
Abstract
KEY MESSAGE Single-cell transcriptomic techniques have emerged as powerful tools in plant biology, offering high-resolution insights into gene expression at the individual cell level. This review highlights the rapid expansion of single-cell technologies in plants, their potential in understanding plant development, and their role in advancing plant biotechnology research.
Single-cell techniques have emerged as powerful tools to enhance our understanding of biological systems, providing high-resolution transcriptomic analysis at the single-cell level. In plant biology, the adoption of single-cell transcriptomics has seen rapid expansion of available technologies and applications. This review article focuses on the latest advancements in the field of single-cell transcriptomics in plants and discusses the potential role of these approaches in plant development and expediting plant biotechnology research in the near future. Furthermore, inherent challenges and limitations of single-cell technology are critically examined, with a view to overcoming them and enhancing our knowledge and understanding.
Affiliation(s)
- Muhammad Ali
  - School of Agriculture, Sun Yat-Sen University, Shenzhen, 518107, China
  - Peking University-Institute of Advanced Agricultural Sciences, Weifang, China
- Tianxia Yang
  - School of Agriculture, Sun Yat-Sen University, Shenzhen, 518107, China
  - State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding (MOE), China Agricultural University, Beijing, China
- Hai He
  - School of Agriculture, Sun Yat-Sen University, Shenzhen, 518107, China
- Yu Zhang
  - School of Agriculture, Sun Yat-Sen University, Shenzhen, 518107, China
4
Song D, Wang Q, Yan G, Liu T, Sun T, Li JJ. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol 2024; 42:247-252. [PMID: 37169966 PMCID: PMC11182337 DOI: 10.1038/s41587-023-01772-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 09/20/2022] [Accepted: 03/30/2023] [Indexed: 05/13/2023]
Abstract
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.
Affiliation(s)
- Dongyuan Song
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA
- Qingyang Wang
  - Department of Statistics, University of California, Los Angeles, CA, USA
- Guanao Yan
  - Department of Statistics, University of California, Los Angeles, CA, USA
- Tianyang Liu
  - Department of Statistics, University of California, Los Angeles, CA, USA
- Tianyi Sun
  - Department of Statistics, University of California, Los Angeles, CA, USA
- Jingyi Jessica Li
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA
  - Department of Statistics, University of California, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, CA, USA
  - Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, USA
5
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. Genome Biol 2023; 24:236. [PMID: 37858253 PMCID: PMC10588049 DOI: 10.1186/s13059-023-03067-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 09/20/2023] [Indexed: 10/21/2023]
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Research Computing Center, University of Chicago, Chicago, IL, USA
- Kaixuan Luo
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Abhishek Sarkar
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Vesalius Therapeutics, Cambridge, MA, USA
- Anthony Hung
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Karl Tayeb
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
- Sebastian Pott
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Department of Statistics, University of Chicago, Chicago, IL, USA
6
Pan Y, Landis JT, Moorad R, Wu D, Marron JS, Dittmer DP. The Poisson distribution model fits UMI-based single-cell RNA-sequencing data. BMC Bioinformatics 2023; 24:256. [PMID: 37330471 PMCID: PMC10276395 DOI: 10.1186/s12859-023-05349-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 05/24/2023] [Indexed: 06/19/2023]
Abstract
BACKGROUND Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. RESULTS We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. CONCLUSIONS This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
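The abstract's remark that zeros arise naturally from matrix entries with a very small Poisson parameter can be checked with a small simulation. The sketch below is only an illustration of that one property (for a Poisson count with rate lambda, the probability of a zero is exp(-lambda)); it is not the DIPD pipeline or the scpoisson package. The rates, sample size, and function names are assumptions made up for the example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def zero_fraction(lam, n_entries, rng):
    """Fraction of simulated matrix entries that are exactly zero."""
    zeros = sum(1 for _ in range(n_entries) if sample_poisson(lam, rng) == 0)
    return zeros / n_entries


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical per-entry Poisson parameters; not taken from the paper.
rates = [0.01, 0.1, 0.5]
results = {lam: (zero_fraction(lam, 20000, rng), math.exp(-lam)) for lam in rates}

for lam, (observed, expected) in results.items():
    print(f"lambda={lam}: observed zero fraction {observed:.3f}, exp(-lambda) {expected:.3f}")
```

In this toy setting the simulated zero fraction should track exp(-lambda) closely, so a high share of zeros does not by itself require a separate zero-inflation component.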
Affiliation(s)
- Yue Pan
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Justin T Landis
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Razia Moorad
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Di Wu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, USA
- J S Marron
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Dirk P Dittmer
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
7
Gao LL, Bien J, Witten D. Selective Inference for Hierarchical Clustering. J Am Stat Assoc 2022; 119:332-342. [PMID: 38660582 PMCID: PMC11036349 DOI: 10.1080/01621459.2022.2116331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2020] [Accepted: 08/16/2022] [Indexed: 10/17/2022]
Abstract
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
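The inflation described in this abstract is easy to reproduce on toy data. The sketch below draws samples from a single Gaussian (so there are no true groups), splits them with a one-dimensional two-means cut, and then applies a classical two-sample test to the resulting clusters. It illustrates only the problem, not the paper's selective inference correction. The two-means cut is swapped in for the hierarchical clustering the paper studies, the test uses a normal approximation rather than an exact t distribution, and all names and settings are assumptions for the example.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sum_sq_dev(xs):
    """Sum of squared deviations from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def two_means_split(values):
    """Best 1-D two-cluster cut: minimise within-cluster sum of squares over sorted cuts."""
    xs = sorted(values)
    best_cut, best_cost = 2, float("inf")
    for cut in range(2, len(xs) - 1):  # keep at least two points per cluster
        cost = sum_sq_dev(xs[:cut]) + sum_sq_dev(xs[cut:])
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return xs[:best_cut], xs[best_cut:]


def z_test_p_value(a, b):
    """Two-sided p-value for a difference in means (Welch statistic, normal approximation)."""
    se = math.sqrt(sum_sq_dev(a) / (len(a) - 1) / len(a)
                   + sum_sq_dev(b) / (len(b) - 1) / len(b))
    z = (mean(a) - mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rates(n_sims=300, n_points=40, alpha=0.05, seed=0):
    """Share of null simulations rejected when groups come from clustering vs. a fixed split."""
    rng = random.Random(seed)
    clustered = fixed = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_points)]  # one population, no real groups
        left, right = two_means_split(data)
        clustered += z_test_p_value(left, right) < alpha
        half = n_points // 2  # groups fixed before looking at the values
        fixed += z_test_p_value(data[:half], data[half:]) < alpha
    return clustered / n_sims, fixed / n_sims


clustered_rate, fixed_rate = rejection_rates()
print(f"rejection rate, groups from clustering: {clustered_rate:.2f}")
print(f"rejection rate, groups fixed a priori:  {fixed_rate:.2f}")
```

Under these assumptions the clustered comparison should reject far more often than the nominal 5% level, while the a priori split should stay close to it, which is the gap the selective inference approach is designed to close.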
Affiliation(s)
- Lucy L. Gao
  - Department of Statistics, University of British Columbia
- Jacob Bien
  - Department of Data Sciences and Operations, University of Southern California
- Daniela Witten
  - Departments of Statistics and Biostatistics, University of Washington
8
LSH-GAN enables in-silico generation of cells for small sample high dimensional scRNA-seq data. Commun Biol 2022; 5:577. [PMID: 35688990 PMCID: PMC9187761 DOI: 10.1038/s42003-022-03473-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2021] [Accepted: 05/02/2022] [Indexed: 11/08/2022]
Abstract
A fundamental problem in downstream analysis of scRNA-seq data is the lack of enough cell samples compared to the feature size. This is mostly due to the budgetary constraints of single cell experiments or simply because of the small number of available patient samples. Here, we present an improved version of generative adversarial network (GAN) called LSH-GAN to address this issue by producing new realistic cell samples. We update the training procedure of the generator of GAN using locality sensitive hashing, which speeds up the sample generation and thus maintains the feasibility of applying the standard procedures of downstream analysis. LSH-GAN outperforms the benchmarks for realistic generation of quality cell samples. Experimental results show that generated samples of LSH-GAN improve the performance of downstream analyses such as feature (gene) selection and cell clustering. Overall, LSH-GAN therefore addresses the key challenges of small sample scRNA-seq data analysis.
9
Spatially informed cell-type deconvolution for spatial transcriptomics. Nat Biotechnol 2022; 40:1349-1359. [PMID: 35501392 PMCID: PMC9464662 DOI: 10.1038/s41587-022-01273-7] [Citation(s) in RCA: 115] [Impact Index Per Article: 57.5] [Received: 06/09/2021] [Accepted: 03/07/2022] [Indexed: 12/16/2022]
Abstract
Many spatially resolved transcriptomic technologies do not have single-cell resolution but measure the average gene expression for each spot from a mixture of cells of potentially heterogeneous cell types. Here, we introduce a deconvolution method, conditional autoregressive deconvolution (CARD), that combines cell type–specific expression information from single-cell RNA sequencing (scRNA-seq) with correlation in cell type composition across tissue locations. Modeling spatial correlation allows us to borrow the cell-type composition information across locations, improving accuracy of deconvolution even with a mismatched scRNA-seq reference. CARD can also impute cell type compositions and gene expression levels at unmeasured tissue locations, enable the construction of a refined spatial tissue map with a resolution arbitrarily higher than that measured in the original study, and perform deconvolution without a scRNA-seq reference. Applications to four datasets including a pancreatic cancer dataset identified multiple cell types and molecular markers with distinct spatial localization that define the progression, heterogeneity, and compartmentalization of pancreatic cancer.
10
Jovic D, Liang X, Zeng H, Lin L, Xu F, Luo Y. Single-cell RNA sequencing technologies and applications: A brief overview. Clin Transl Med 2022; 12:e694. [PMID: 35352511 PMCID: PMC8964935 DOI: 10.1002/ctm2.694] [Citation(s) in RCA: 266] [Impact Index Per Article: 133.0] [Received: 08/05/2021] [Revised: 12/09/2021] [Accepted: 12/20/2021] [Indexed: 12/19/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology has become the state-of-the-art approach for unravelling the heterogeneity and complexity of RNA transcripts within individual cells, as well as revealing the composition of different cell types and functions within highly organized tissues/organs/organisms. Since it was first described in 2009, studies based on scRNA-seq have provided massive information across different fields, leading to exciting new discoveries and a better understanding of the composition and interaction of cells within humans, model animals and plants. In this review, we provide a concise overview of the scRNA-seq technology and of the experimental and computational procedures for transforming the biological and molecular processes into computational and statistical data. We also provide an explanation of the key technological steps in implementing the technology. We highlight a few examples of how scRNA-seq can provide unique information for better understanding health and diseases. One important application of the scRNA-seq technology is to build a better, high-resolution catalogue of cells in all living organisms, commonly known as an atlas, which is a key resource for better understanding and treating diseases. While great promise has been demonstrated with the technology in all areas, we further highlight a few remaining challenges to be overcome and its great potential in transforming current protocols in disease diagnosis and treatment.
Affiliation(s)
- Dragomirka Jovic
  - Lars Bolund Institute of Regenerative Medicine, Qingdao‐Europe Advanced Institute for Life Sciences, Qingdao, China
  - BGI‐Shenzhen, Shenzhen, China
- Xue Liang
  - Lars Bolund Institute of Regenerative Medicine, Qingdao‐Europe Advanced Institute for Life Sciences, Qingdao, China
  - BGI‐Shenzhen, Shenzhen, China
  - Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Hua Zeng
  - Nanjing University of Chinese Medicine, Nanjing, China
- Lin Lin
  - Department of Biomedicine, Aarhus University, Aarhus, Denmark
  - Steno Diabetes Center Aarhus, Aarhus University Hospital, Aarhus, Denmark
- Fengping Xu
  - Lars Bolund Institute of Regenerative Medicine, Qingdao‐Europe Advanced Institute for Life Sciences, Qingdao, China
  - BGI‐Shenzhen, Shenzhen, China
- Yonglun Luo
  - Lars Bolund Institute of Regenerative Medicine, Qingdao‐Europe Advanced Institute for Life Sciences, Qingdao, China
  - BGI‐Shenzhen, Shenzhen, China
  - Department of Biomedicine, Aarhus University, Aarhus, Denmark
  - Steno Diabetes Center Aarhus, Aarhus University Hospital, Aarhus, Denmark
11
Li Z, Feng H. A neural network-based method for exhaustive cell label assignment using single cell RNA-seq data. Sci Rep 2022; 12:910. [PMID: 35042860 PMCID: PMC8766435 DOI: 10.1038/s41598-021-04473-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/21/2021] [Accepted: 12/21/2021] [Indexed: 02/01/2023]
Abstract
The fast-advancing single cell RNA sequencing (scRNA-seq) technology enables researchers to study the transcriptome of heterogeneous tissues at a single cell level. The initial important step of analyzing scRNA-seq data is usually to accurately annotate cells. The traditional approach of annotating cell types based on unsupervised clustering and marker genes is time-consuming and laborious. Taking advantage of the numerous existing scRNA-seq databases, many supervised label assignment methods have been developed. One feature that many label assignment methods share is that they label cells with low confidence as "unassigned." These unassigned cells can be the result of assignment difficulties due to highly similar cell types or caused by the presence of unknown cell types. However, when unknown cell types are not expected, existing methods still label a considerable number of cells as unassigned, which is not desirable. In this work, we develop a neural network-based cell annotation method called NeuCA (Neural network-based Cell Annotation) for scRNA-seq data obtained from well-studied tissues. NeuCA can utilize the hierarchical structure information of the cell types to improve the annotation accuracy, which is especially helpful when data contain closely correlated cell types. We show that NeuCA can achieve more accurate cell annotation results compared with existing methods. Additionally, the applications on eight real datasets show that NeuCA has stable performance for intra- and inter-study annotation, as well as cross-condition annotation. NeuCA is freely available as an R/Bioconductor package at https://bioconductor.org/packages/NeuCA.
Affiliation(s)
- Ziyi Li
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Hao Feng
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, 44106, USA
12
Bej S, Galow AM, David R, Wolfien M, Wolkenhauer O. Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling. BMC Bioinformatics 2021; 22:557. [PMID: 34798805 PMCID: PMC8603509 DOI: 10.1186/s12859-021-04469-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/09/2021] [Accepted: 11/03/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the area for the detection of rare cells was highly facilitated by this technology. However, an automated, unbiased, and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare-cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as an input to generate synthetic cells to then identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. RESULTS We demonstrate the effectiveness of our method for three independent use cases, each consisting of already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). This use case was designed to take a larger imbalance ratio (~1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (~1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single-cell capture procedures and the impact of "less" rare-cell types. The third dataset refers to the murine data of the Allen Brain Atlas, including more than 1 million cells. For validation purposes only, all datasets have also been analyzed traditionally using common data analysis approaches, such as the Seurat workflow. 
CONCLUSIONS In comparison to baseline testing without oversampling, our approach identifies rare-cells with a robust precision-recall balance, including a high accuracy and low false positive detection rate. A practical benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code basis in R and Python is publicly available at FairdomHub, as well as GitHub, and can easily be transferred to identify other rare-cell types.
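The oversampling idea in this abstract, generating synthetic rare cells so that the minority class is no longer swamped, can be sketched in a few lines. The version below is a simplified stand-in written for illustration, not the sc-SynO or LoRAS implementation: it builds each synthetic cell as a random convex combination of noisy copies ("shadow samples") of minority cells, and it picks those cells at random rather than from a local neighbourhood. The expression values, noise level, and function name are all hypothetical.

```python
import random


def synthetic_minority_samples(minority, n_new, n_shadow=3, noise_sd=0.05, seed=0):
    """Create n_new synthetic points as convex combinations of noisy minority copies."""
    rng = random.Random(seed)
    n_features = len(minority[0])
    synthetic = []
    for _ in range(n_new):
        # Shadow samples: randomly chosen minority points with small Gaussian noise added.
        shadows = [
            [value + rng.gauss(0.0, noise_sd) for value in rng.choice(minority)]
            for _ in range(n_shadow)
        ]
        # Random positive weights normalised to sum to one (a convex combination).
        raw = [rng.random() + 1e-12 for _ in range(n_shadow)]
        total = sum(raw)
        weights = [r / total for r in raw]
        synthetic.append(
            [sum(w * s[j] for w, s in zip(weights, shadows)) for j in range(n_features)]
        )
    return synthetic


# Toy, made-up expression values for two genes; not from any dataset in the paper.
minority = [[5.0, 0.2], [4.8, 0.1], [5.3, 0.4], [5.1, 0.3], [4.9, 0.2]]
n_majority = 130  # chosen to give the roughly 1 to 26 ratio of the second use case
n_needed = n_majority - len(minority)

new_cells = synthetic_minority_samples(minority, n_needed)
imbalance_before = n_majority / len(minority)
imbalance_after = n_majority / (len(minority) + len(new_cells))
print(f"majority per minority cell before: {imbalance_before:.1f}, after: {imbalance_after:.1f}")
```

A classifier would then be trained on the majority cells plus the original and synthetic minority cells. Restricting the shadow samples to nearest neighbours, as the published algorithm does, would keep synthetic cells closer to the local structure of the rare population.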
Affiliation(s)
- Saptarshi Bej
  - Department of Systems Biology and Bioinformatics, University of Rostock, 18057, Rostock, Germany
  - Leibniz-Institute for Food Systems Biology, Technical University of Munich, 85354, Freising, Germany
- Anne-Marie Galow
  - Institute of Genome Biology, Research Institute for Farm Animal Biology, 18196, Dummerstorf, Germany
- Robert David
  - Department of Cardiac Surgery, Rostock University Medical Centre, 18057, Rostock, Germany
  - Department of Life, Light and Matter, University of Rostock, 18059, Rostock, Germany
- Markus Wolfien
  - Department of Systems Biology and Bioinformatics, University of Rostock, 18057, Rostock, Germany
- Olaf Wolkenhauer
  - Department of Systems Biology and Bioinformatics, University of Rostock, 18057, Rostock, Germany
  - Leibniz-Institute for Food Systems Biology, Technical University of Munich, 85354, Freising, Germany
  - Stellenbosch Institute of Advanced Study, Stellenbosch University, Stellenbosch, 7602, South Africa
13
Zappia L, Theis FJ. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol 2021; 22:301. [PMID: 34715899 PMCID: PMC8555270 DOI: 10.1186/s13059-021-02519-4] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 08/24/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
Recent years have seen a revolution in single-cell RNA-sequencing (scRNA-seq) technologies, datasets, and analysis methods. Since 2016, the scRNA-tools database has cataloged software tools for analyzing scRNA-seq data. With the number of tools in the database passing 1000, we provide an update on the state of the project and the field. This data shows the evolution of the field and a change of focus from ordering cells on continuous trajectories to integrating multiple samples and making use of reference datasets. We also find that open science practices reward developers with increased recognition and help accelerate the field.
Affiliation(s)
- Luke Zappia
  - Institute of Computational Biology, Helmholtz Zentrum München, 85764, Neuherberg, Germany
  - Department of Mathematics, Technical University of Munich, 85748, Garching bei München, Germany
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Zentrum München, 85764, Neuherberg, Germany
  - Department of Mathematics, Technical University of Munich, 85748, Garching bei München, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, 85354, Freising, Germany
14
Liu S, Thennavan A, Garay JP, Marron JS, Perou CM. MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data. Genome Biol 2021; 22:232. [PMID: 34412669 PMCID: PMC8375188 DOI: 10.1186/s13059-021-02445-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/19/2021] [Accepted: 07/29/2021] [Indexed: 01/02/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) provides new opportunities to characterize cell populations, typically accomplished through some type of clustering analysis. Estimation of the optimal cluster number (K) is a crucial step but often ignored. Our approach improves most current scRNA-seq cluster methods by providing an objective estimation of the number of groups using a multi-resolution perspective. MultiK is a tool for objective selection of insightful Ks and achieves high robustness through a consensus clustering approach. We demonstrate that MultiK identifies reproducible groups in scRNA-seq data, thus providing an objective means to estimating the number of possible groups or cell-type populations present.
Collapse
Affiliation(s)
- Siyao Liu
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Marsico Hall, 5th floor, CB#7599, 125 Mason Farm Road, Chapel Hill, NC, 27599, USA
- Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, NC, 27599, USA
| | - Aatish Thennavan
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Marsico Hall, 5th floor, CB#7599, 125 Mason Farm Road, Chapel Hill, NC, 27599, USA
- Oral and Craniofacial Biomedicine Program, School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
| | - Joseph P Garay
- Department of Surgery, Oregon Health & Science University, Portland, OR, 97239, USA
| | - J S Marron
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Marsico Hall, 5th floor, CB#7599, 125 Mason Farm Road, Chapel Hill, NC, 27599, USA.
- Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, 352 Hanes Hall CB#3260, Chapel Hill, NC, 27599, USA.
| | - Charles M Perou
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Marsico Hall, 5th floor, CB#7599, 125 Mason Farm Road, Chapel Hill, NC, 27599, USA.
- Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
- Department of Pathology and Laboratory Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
15
Lütge M, Pikor NB, Ludewig B. Differentiation and activation of fibroblastic reticular cells. Immunol Rev 2021; 302:32-46. [PMID: 34046914 PMCID: PMC8361914 DOI: 10.1111/imr.12981] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/02/2021] [Revised: 04/17/2021] [Accepted: 04/30/2021] [Indexed: 12/29/2022]
Abstract
Secondary lymphoid organs (SLO) are underpinned by fibroblastic reticular cells (FRC) that form dedicated microenvironmental niches to secure induction and regulation of innate and adaptive immunity. Distinct FRC subsets are strategically positioned in SLOs to provide niche factors and govern efficient immune cell interaction. In recent years, the use of specialized mouse models in combination with single-cell transcriptomics has facilitated the elaboration of the molecular FRC landscape at an unprecedented resolution. While single-cell RNA-sequencing has advanced the resolution of FRC subset characterization and function, the high dimensionality of the generated data necessitates careful analysis and validation. Here, we reviewed novel findings from high-resolution transcriptomic analyses that refine our understanding of FRC differentiation and activation processes in the context of infection and inflammation. We further discuss concepts, strategies, and limitations for the analysis of single-cell transcriptome data from FRCs and the wide-ranging implications for our understanding of stromal cell biology.
Affiliation(s)
- Mechthild Lütge
- Institute of Immunobiology, Medical Research Center, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Natalia B Pikor
- Institute of Immunobiology, Medical Research Center, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Burkhard Ludewig
- Institute of Immunobiology, Medical Research Center, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Institute of Experimental Immunology, University of Zürich, Zürich, Switzerland
16
Wang YXR, Li L, Li JJ, Huang H. Network Modeling in Biology: Statistical Methods for Gene and Brain Networks. Stat Sci 2021; 36:89-108. [PMID: 34305304 DOI: 10.1214/20-sts792] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
The rise of network data in many different domains has offered researchers new insight into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks where network edges can be directly observed, both gene and brain networks require careful estimation of edges using covariates as a first step. We provide a discussion on existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.
Affiliation(s)
- Y X Rachel Wang
- School of Mathematics and Statistics, University of Sydney, Australia
| | - Lexin Li
- Department of Biostatistics and Epidemiology, School of Public Health, University of California, Berkeley
| | | | - Haiyan Huang
- Department of Statistics, University of California, Berkeley
17
Abstract
Normalization is an important step in the analysis of single-cell RNA-seq data. While no single method outperforms all others in all datasets, the choice of normalization can have profound impact on the results. Data-driven metrics can be used to rank normalization methods and select the best performers. Here, we show how to use R/Bioconductor to calculate normalization factors, apply them to compute normalized data, and compare several normalization approaches. Finally, we briefly show how to perform downstream analysis steps on the normalized data.
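As a rough illustration of what "calculating normalization factors and applying them" involves, here is a minimal sketch in Python of library-size scaling, one of the simplest approaches. The chapter itself works in R/Bioconductor, so this is not the authors' code; the function names `size_factors` and `log_normalize` and the toy counts are hypothetical.

```python
import math


def size_factors(counts):
    """Return one scaling factor per cell, proportional to its total count.

    `counts` is a list of cells, each a list of per-gene counts.
    Factors are scaled so that their mean is 1.
    Assumption: every cell has a nonzero total count.
    """
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def log_normalize(counts):
    """Divide each cell by its size factor, then apply log(1 + x)."""
    factors = size_factors(counts)
    return [
        [math.log1p(value / factor) for value in cell]
        for cell, factor in zip(counts, factors)
    ]


# Toy data: cell B has the same expression profile as cell A
# but was sequenced twice as deeply.
counts = [
    [10, 0, 5],   # cell A, total 15
    [20, 0, 10],  # cell B, total 30
]
factors = size_factors(counts)
normalized = log_normalize(counts)
```

After scaling, the two cells have identical normalized profiles, which is the behaviour a depth correction is meant to produce. Comparing several normalization methods, as the abstract describes, would mean swapping `size_factors` for alternative estimators and scoring the results with data-driven metrics.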
Affiliation(s)
- Davide Risso
- Department of Statistical Sciences, University of Padova, Padova, Italy.
18
Li Y, Xu Q, Wu D, Chen G. Exploring Additional Valuable Information From Single-Cell RNA-Seq Data. Front Cell Dev Biol 2020; 8:593007. [PMID: 33335900 PMCID: PMC7736616 DOI: 10.3389/fcell.2020.593007] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/04/2020] [Accepted: 10/26/2020] [Indexed: 12/28/2022]
Abstract
Single-cell RNA-seq (scRNA-seq) technologies are broadly applied to dissect the cellular heterogeneity and expression dynamics, providing unprecedented insights into single-cell biology. Most of the scRNA-seq studies mainly focused on the dissection of cell types/states, developmental trajectory, gene regulatory network, and alternative splicing. However, besides these routine analyses, many other valuable scRNA-seq investigations can be conducted. Here, we first review cell-to-cell communication exploration, RNA velocity inference, identification of large-scale copy number variations and single nucleotide changes, and chromatin accessibility prediction based on single-cell transcriptomics data. Next, we discuss the identification of novel genes/transcripts through transcriptome reconstruction approaches, as well as the profiling of long non-coding RNAs and circular RNAs. Additionally, we survey the integration of single-cell and bulk RNA-seq datasets for deconvoluting the cell composition of large-scale bulk samples and linking single-cell signatures to patient outcomes. These additional analyses could largely facilitate corresponding basic science and clinical applications.
Affiliation(s)
- Yunjin Li
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China
| | - Qiyue Xu
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China
| | - Duojiao Wu
- Institute of Clinical Science, Zhongshan Hospital, Fudan University, Shanghai, China
| | - Geng Chen
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China
19
Ye P, Ye W, Ye C, Li S, Ye L, Ji G, Wu X. scHinter: imputing dropout events for single-cell RNA-seq data with limited sample size. Bioinformatics 2020; 36:789-797. [PMID: 31392316 DOI: 10.1093/bioinformatics/btz627] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/25/2019] [Revised: 07/18/2019] [Accepted: 08/06/2019] [Indexed: 01/18/2023]
Abstract
MOTIVATION: Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful technique for studying dynamic gene regulation at unprecedented resolution. However, scRNA-seq data suffer from extremely high dropout rates and cell-to-cell variability, demanding new methods to recover lost gene expression. Despite the availability of various dropout imputation approaches for scRNA-seq, most studies focus on data with a medium or large number of cells, while few studies have explicitly investigated the differential performance across different sample sizes or the applicability of the approach on small or imbalanced data. It is imperative to develop new imputation approaches with higher generalizability for data with various sample sizes. RESULTS: We proposed a method called scHinter for imputing dropout events for scRNA-seq with special emphasis on data with limited sample size. scHinter incorporates a voting-based ensemble distance and leverages the synthetic minority oversampling technique for random interpolation. A hierarchical framework is also embedded in scHinter to increase the reliability of the imputation for small samples. We demonstrated the ability of scHinter to recover gene expression measurements across a wide spectrum of scRNA-seq datasets with varied sample sizes. We comprehensively examined the impact of sample size and cluster number on imputation. Comprehensive evaluation of scHinter across diverse scRNA-seq datasets with imbalanced or limited sample size showed that scHinter achieved higher and more robust performance than competing approaches, including MAGIC, scImpute, SAVER and netSmooth. AVAILABILITY AND IMPLEMENTATION: Freely available for download at https://github.com/BMILAB/scHinter. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pengchao Ye
- Department of Automation, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Fujian 361005, China
| | - Wenbin Ye
- Department of Automation, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Fujian 361005, China
| | - Congting Ye
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen, Fujian 361005, China
| | - Shuchao Li
- Department of Automation, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Fujian 361005, China
| | - Lishan Ye
- Zhongshan Hospital of Xiamen University, Xiamen, Fujian 361004, China
| | - Guoli Ji
- Department of Automation, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Fujian 361005, China
| | - Xiaohui Wu
- Department of Automation, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Fujian 361005, China
20
Kim TH, Zhou X, Chen M. Demystifying "drop-outs" in single-cell UMI data. Genome Biol 2020; 21:196. [PMID: 32762710 PMCID: PMC7412673 DOI: 10.1186/s13059-020-02096-y] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 04/21/2020] [Accepted: 07/08/2020] [Indexed: 01/10/2023]
Abstract
Many existing pipelines for scRNA-seq data apply pre-processing steps such as normalization or imputation to account for excessive zeros or "drop-outs." Here, we extensively analyze diverse UMI data sets to show that clustering should be the foremost step of the workflow. We observe that most drop-outs disappear once cell-type heterogeneity is resolved, while imputing or normalizing heterogeneous data can introduce unwanted noise. We propose a novel framework HIPPO (Heterogeneity-Inspired Pre-Processing tOol) that leverages zero proportions to explain cellular heterogeneity and integrates feature selection with iterative clustering. HIPPO leads to downstream analysis with greater flexibility and interpretability compared to alternatives.
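The claim that drop-outs largely disappear once heterogeneity is resolved can be illustrated with a small worked example. The sketch below compares a gene's observed zero fraction with the fraction expected if its counts followed a single Poisson distribution with the same mean. The Poisson reference, the function name `zero_excess`, and the toy counts are assumptions of this sketch; HIPPO's actual test statistic, feature selection, and iterative clustering are not reproduced here.

```python
import math


def zero_excess(gene_counts):
    """Observed zero fraction minus the Poisson-expected zero fraction.

    For a Poisson variable with mean m, P(count == 0) = exp(-m).
    A large positive excess suggests the cells are not one homogeneous group.
    """
    n_cells = len(gene_counts)
    mean = sum(gene_counts) / n_cells
    observed = sum(1 for count in gene_counts if count == 0) / n_cells
    expected = math.exp(-mean)
    return observed - expected


# Hypothetical gene measured in a mixture of two cell types:
# one type does not express it, the other expresses about 4 UMIs per cell.
mixed = [0, 0, 0, 0, 0, 4, 5, 3, 4, 4]

# The expressing cell type on its own, after the mixture is split.
resolved = [4, 5, 3, 4, 4]

excess_mixed = zero_excess(mixed)        # 0.5 - exp(-2), clearly positive
excess_resolved = zero_excess(resolved)  # 0 - exp(-4), close to zero
```

In the mixture, half the cells show zeros although a single Poisson with mean 2 predicts only about 14%. Within the resolved cell type the excess vanishes, so no imputation is called for.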
Affiliation(s)
- Tae Hyun Kim
- Department of Statistics, University of Chicago, Chicago, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, USA.
| | - Mengjie Chen
- Department of Human Genetics and Department of Medicine, University of Chicago, Chicago, USA.
21
Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nat Commun 2020; 11:3274. [PMID: 32612268 PMCID: PMC7330047 DOI: 10.1038/s41467-020-16905-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/08/2019] [Accepted: 05/29/2020] [Indexed: 01/22/2023]
Abstract
Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs—the reference panel and the chain-type designs—true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), which is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data. It is not clear which designs, other than completely randomized ones, are valid for scRNA-seq experiments so that batch effects can be adjusted. Here the authors show that under flexible reference panel and chain-type designs, biological variability can also be separated from batch effects, at least by BUSseq.
22
Network-Based Single-Cell RNA-Seq Data Imputation Enhances Cell Type Identification. Genes (Basel) 2020; 11:genes11040377. [PMID: 32244427 PMCID: PMC7230610 DOI: 10.3390/genes11040377] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/07/2020] [Revised: 03/24/2020] [Accepted: 03/24/2020] [Indexed: 12/14/2022]
Abstract
Single-cell RNA sequencing is a powerful technology for obtaining transcriptomes at single-cell resolutions. However, it suffers from dropout events (i.e., excess zero counts) since only a small fraction of transcripts get sequenced in each cell during the sequencing process. This inherent sparsity of expression profiles hinders further characterizations at cell/gene-level such as cell type identification and downstream analysis. To alleviate this dropout issue we introduce a network-based method, netImpute, by leveraging the hidden information in gene co-expression networks to recover real signals. netImpute employs Random Walk with Restart (RWR) to adjust the gene expression level in a given cell by borrowing information from its neighbors in a gene co-expression network. Performance evaluation and comparison with existing tools on simulated data and seven real datasets show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. While the idea of netImpute is general and can be applied with other types of networks such as cell co-expression network or protein–protein interaction (PPI) network, evaluation results show that gene co-expression network is consistently more beneficial, presumably because PPI network usually lacks cell type context, while cell co-expression network can cause information loss for rare cell types. Evaluation results on several biological datasets show that netImpute can more effectively recover missing transcripts in scRNA-seq data and enhance the identification and visualization of heterogeneous cell types than existing methods.
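To make the Random Walk with Restart step concrete, here is a minimal, generic RWR sketch in plain Python. It is not netImpute's implementation: the restart probability of 0.5, the column normalization, the convergence tolerance, and the three-gene toy network are all assumptions chosen for illustration.

```python
def random_walk_with_restart(adjacency, start, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    `adjacency` is a square list-of-lists gene-gene network.
    `start` is one cell's expression vector (assumed to have a nonzero sum).
    W is the column-normalized adjacency; p0 is `start` scaled to sum to 1.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(start)
    p0 = [value / total for value in start]
    p = p0[:]
    for _ in range(max_iter):
        updated = [
            (1 - restart) * sum(transition[i][j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(updated, p))
        p = updated
        if change < tol:
            break
    return p


# Hypothetical co-expression network: genes 0 and 1 are linked, gene 2 is isolated.
network = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
# One cell: gene 0 detected, gene 1 lost to dropout, gene 2 not expressed.
expression = [4, 0, 0]
smoothed = random_walk_with_restart(network, expression)
```

In this toy case the dropped-out gene 1 receives a share of the signal from its co-expressed neighbour (the scores converge to 2/3 and 1/3), while the isolated gene 2 stays at zero. A real pipeline would rescale the scores back to expression units, a step this sketch leaves out.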
23
Huh R, Yang Y, Jiang Y, Shen Y, Li Y. SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble. Nucleic Acids Res 2020; 48:86-95. [PMID: 31777938 PMCID: PMC6943136 DOI: 10.1093/nar/gkz959] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 05/08/2019] [Revised: 10/03/2019] [Accepted: 10/10/2019] [Indexed: 12/19/2022]
Abstract
Clustering is an essential step in the analysis of single cell RNA-seq (scRNA-seq) data to shed light on tissue complexity including the number of cell types and transcriptomic signatures of each cell type. Due to its importance, novel methods have been developed recently for this purpose. However, different approaches generate varying estimates regarding the number of clusters and the single-cell level cluster assignments. This type of unsupervised clustering is challenging, and it is often hard to gauge which method to use because none of the existing methods outperforms the others across all scenarios. We present SAME-clustering, a mixture model-based approach that takes clustering solutions from multiple methods and selects a maximally diverse subset to produce an improved ensemble solution. We tested SAME-clustering across 15 scRNA-seq datasets generated by different platforms, with number of clusters varying from 3 to 15, and number of single cells from 49 to 32 695. Results show that our SAME-clustering ensemble method yields enhanced clustering, in terms of both cluster assignments and number of clusters. The mixture model ensemble clustering is not limited to clustering scRNA-seq data and may be useful to a wide range of clustering applications.
Affiliation(s)
- Ruth Huh
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Yuchen Yang
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Yuchao Jiang
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Yin Shen
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
- Department of Neurology, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Yun Li
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- To whom correspondence should be addressed. Tel: +1 919 843 2832; Fax: +1 919 843 4682;
24
Abstract
One primary reason that makes single-cell RNA-seq analysis challenging is dropouts, where the data only captures a small fraction of the transcriptome of each cell. Almost all computational algorithms developed for single-cell RNA-seq adopted gene selection, dimension reduction or imputation to address the dropouts. Here, an opposite view is explored. Instead of treating dropouts as a problem to be fixed, we embrace it as a useful signal. We represent the dropout pattern by binarizing single-cell RNA-seq count data, and present a co-occurrence clustering algorithm to cluster cells based on the dropout pattern. We demonstrate in multiple published datasets that the binary dropout pattern is as informative as the quantitative expression of highly variable genes for the purpose of identifying cell types. We expect that recognizing the utility of dropouts provides an alternative direction for developing computational algorithms for single-cell RNA-seq analysis.
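The binarization step described above is simple to sketch. The example below converts counts to a detected/not-detected pattern and compares cells by the overlap of their detected genes. The Jaccard similarity is a stand-in chosen for illustration and is an assumption of this sketch; the paper's co-occurrence clustering algorithm is not reproduced. The names `binarize` and `jaccard` and the toy matrix are hypothetical.

```python
def binarize(counts):
    """Turn a cells-by-genes count matrix into a 0/1 detection pattern."""
    return [[1 if count > 0 else 0 for count in cell] for cell in counts]


def jaccard(pattern_a, pattern_b):
    """Share of genes detected in both cells among genes detected in either."""
    both = sum(1 for a, b in zip(pattern_a, pattern_b) if a and b)
    either = sum(1 for a, b in zip(pattern_a, pattern_b) if a or b)
    return both / either if either else 0.0


counts = [
    [5, 0, 2, 0],  # cell 0
    [1, 0, 9, 0],  # cell 1: same genes detected as cell 0, different magnitudes
    [0, 3, 0, 7],  # cell 2: complementary detection pattern
]
patterns = binarize(counts)

similar = jaccard(patterns[0], patterns[1])    # identical dropout patterns
different = jaccard(patterns[0], patterns[2])  # no shared detected genes
```

Cells 0 and 1 differ in expression magnitude but share a dropout pattern, so they score as identical, while cell 2 shares nothing with cell 0. This is the sense in which the binary pattern alone can carry cell-type information.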
25
Casey MJ, Stumpf PS, MacArthur BD. Theory of cell fate. Wiley Interdiscip Rev Syst Biol Med 2020; 12:e1471. [PMID: 31828979 PMCID: PMC7027507 DOI: 10.1002/wsbm.1471] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/30/2019] [Revised: 10/15/2019] [Accepted: 11/06/2019] [Indexed: 11/17/2022]
Abstract
Cell fate decisions are controlled by complex intracellular molecular regulatory networks. Studies increasingly reveal the scale of this complexity: not only do cell fate regulatory networks contain numerous positive and negative feedback loops, they also involve a range of different kinds of nonlinear protein-protein and protein-DNA interactions. This inherent complexity and nonlinearity makes cell fate decisions hard to understand using experiment and intuition alone. In this primer, we will outline how tools from mathematics can be used to understand cell fate dynamics. We will briefly introduce some notions from dynamical systems theory, and discuss how they offer a framework within which to build a rigorous understanding of what we mean by a cell "fate", and how cells change fate. We will also outline how modern experiments, particularly high-throughput single-cell experiments, are enabling us to test and explore the limits of these ideas, and build a better understanding of cellular identities. This article is categorized under: Models of Systems Properties and Processes > Mechanistic Models; Biological Mechanisms > Cell Fates; Models of Systems Properties and Processes > Cellular Models.
Affiliation(s)
- Michael J. Casey
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
| | - Patrick S. Stumpf
- Institute for Life Sciences, University of Southampton, Southampton, UK
- Centre for Human Development, Stem Cells and Regeneration, Faculty of Medicine, University of Southampton, Southampton, UK
| | - Ben D. MacArthur
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
- Centre for Human Development, Stem Cells and Regeneration, Faculty of Medicine, University of Southampton, Southampton, UK
26
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini CSO, Aparicio S, Baaijens J, Balvert M, Barbanson BD, Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo TH, Lelieveldt BP, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Rączkowska A, Reinders M, Ridder JD, Saliba AE, Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in single-cell data science. Genome Biol 2020; 21:31. [PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6] [Citation(s) in RCA: 564] [Impact Index Per Article: 141.0] [Received: 08/02/2019] [Accepted: 01/02/2020] [Indexed: 02/08/2023]
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Affiliation(s)
- David Lähnemann
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf, Germany
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Johannes Köster
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA
| | - Ewa Szczurek
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
| | - Davis J. McCarthy
- Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia
- Melbourne Integrative Genomics, School of BioSciences–School of Mathematics & Statistics, Faculty of Science, University of Melbourne, Melbourne, Australia
| | - Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD USA
| | - Mark D. Robinson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zürich, Zürich, Switzerland
| | - Catalina A. Vallejos
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK
- The Alan Turing Institute, British Library, London, UK
| | - Kieran R. Campbell
- Department of Statistics, University of British Columbia, Vancouver, Canada
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
- Data Science Institute, University of British Columbia, Vancouver, Canada
| | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Ahmed Mahfouz
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Luca Pinello
- Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA
- Department of Pathology, Harvard Medical School, Boston, USA
- Broad Institute of Harvard and MIT, Cambridge, MA USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, USA
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | | | - Samuel Aparicio
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
| | - Jasmijn Baaijens
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
| | - Marleen Balvert
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
| | - Buys de Barbanson
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
| | - Antonio Cappuccio
- Institute for Advanced Study, University of Amsterdam, Amsterdam, The Netherlands
| | - Giacomo Corleone
- Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, London, UK
| | - Bas E. Dutilh
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Maria Florescu
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
| | - Victor Guryev
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Rens Holmer
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
| | - Katharina Jahn
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Thamar Jessurun Lobo
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Emma M. Keizer
- Biometris, Wageningen University & Research, Wageningen, The Netherlands
| | - Indu Khatri
- Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, Leiden, The Netherlands
| | - Szymon M. Kielbasa
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
| | - Jan O. Korbel
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Alexey M. Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Tzu-Hao Kuo
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Boudewijn P.F. Lelieveldt
- PRB lab, Delft University of Technology, Delft, The Netherlands
- Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
| | - Ion I. Mandoiu
- Computer Science & Engineering Department, University of Connecticut, Storrs, USA
| | - John C. Marioni
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- Max Planck Institute for Informatics, Saarbrücken, Germany
| | - Felix Mölder
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
| | - Amir Niknejad
- Computation molecular design, Zuse Institute Berlin, Berlin, Germany
- Mathematics Department, Mount Saint Vincent, New York, USA
| | - Alicja Rączkowska
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
| | - Marcel Reinders
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Jeroen de Ridder
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
| | - Antoine-Emmanuel Saliba
- Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany
| | - Antonios Somarakis
- Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
| | - Oliver Stegle
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center–DKFZ, Heidelberg, Germany
| | - Fabian J. Theis
- Institute of Computational Biology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany
| | - Huan Yang
- Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research–LACDR–Leiden University, Leiden, The Netherlands
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia
| | - Alice C. McHardy
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | | | - Sohrab P. Shah
- Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA
| | - Alexander Schönhuth
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
| |
|
27
|
Geddes TA, Kim T, Nan L, Burchfield JG, Yang JYH, Tao D, Yang P. Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis. BMC Bioinformatics 2019; 20:660. [PMID: 31870278 PMCID: PMC6929272 DOI: 10.1186/s12859-019-3179-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 10/23/2019] [Accepted: 10/28/2019] [Indexed: 01/23/2023] Open
Abstract
Background Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification. Results Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used. Conclusions Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. 
The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS
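The final step described in this abstract, combining many clusterings of the same cells into one ensemble, can be illustrated with a co-association matrix. This is only a minimal sketch: the abstract does not state which consensus function the authors use, so the co-association approach and the function name `co_association` are assumptions made for illustration, and the random projection and autoencoder steps are not shown.

```python
def co_association(partitions):
    """Fraction of clusterings in which each pair of cells shares a cluster.

    partitions: list of label lists, one per base clustering, all of the
    same length (one label per cell). Labels only need to be comparable
    within a single clustering.
    """
    n_cells = len(partitions[0])
    n_runs = len(partitions)
    matrix = [[0.0] * n_cells for _ in range(n_cells)]
    for labels in partitions:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    # Normalise counts to fractions in [0, 1].
    return [[value / n_runs for value in row] for row in matrix]


# Three hypothetical base clusterings of four cells.
example_partitions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
consensus = co_association(example_partitions)
```

Entries close to 1 mark pairs of cells that are grouped together regardless of the projection used; a final clustering could then be run on this matrix.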
Affiliation(s)
- Thomas A Geddes
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
| | - Taiyun Kim
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
| | - Lihao Nan
- UBTECH Sydney Artificial Intelligence Centre and the School of Computer Science, Faculty of Engineering and Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia
| | - James G Burchfield
- Charles Perkins Centre, School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
| | - Jean Y H Yang
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
| | - Dacheng Tao
- UBTECH Sydney Artificial Intelligence Centre and the School of Computer Science, Faculty of Engineering and Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia
| | - Pengyi Yang
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW 2145, Australia
| |
|
28
|
Cao Y, Lin Y, Ormerod JT, Yang P, Yang JYH, Lo KK. scDC: single cell differential composition analysis. BMC Bioinformatics 2019; 20:721. [PMID: 31870280 PMCID: PMC6929335 DOI: 10.1186/s12859-019-3211-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 11/07/2019] [Accepted: 11/12/2019] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Differences in cell-type composition across subjects and conditions often carry biological significance. Recent advancements in single cell sequencing technologies enable cell-types to be identified at the single cell level, and as a result, cell-type composition of tissues can now be studied in exquisite detail. However, a number of challenges remain with cell-type composition analysis - none of the existing methods can identify cell types perfectly, and variability related to cell sampling exists in any single cell experiment. This necessitates the development of a method for estimating uncertainty in cell-type composition. RESULTS We developed a novel single cell differential composition (scDC) analysis method that performs differential cell-type composition analysis via bootstrap resampling. scDC captures the uncertainty associated with cell-type proportions of each subject via bias-corrected and accelerated bootstrap confidence intervals. We assessed the performance of our method using a number of simulated datasets and synthetic datasets curated from publicly available single cell datasets. In simulated datasets, scDC correctly recovered the true cell-type proportions. In synthetic datasets, the cell-type compositions returned by scDC were highly concordant with reference cell-type compositions from the original data. Since the majority of datasets tested in this study have only 2 to 5 subjects per condition, the addition of confidence intervals enabled better comparisons of compositional differences between subjects and across conditions. CONCLUSIONS scDC is a novel statistical method for performing differential cell-type composition analysis for scRNA-seq data. It uses bootstrap resampling to estimate the standard errors associated with cell-type proportion estimates and performs significance testing through GLM and GLMM models.
We have made this method available to the scientific community as part of the scdney (Single Cell Data Integrative Analysis) R package, available from https://github.com/SydneyBioX/scdney.
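The idea of bootstrapping a cell-type proportion can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the scDC implementation: it uses a plain percentile interval rather than the bias-corrected and accelerated interval named in the abstract, it resamples fixed labels without re-clustering, and the function name `bootstrap_proportion_ci` and its parameters are hypothetical.

```python
import random


def bootstrap_proportion_ci(labels, cell_type, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the proportion of one cell type.

    labels: list of cell-type labels for the cells of a single subject.
    Returns (lower, upper) bounds of an approximate (1 - alpha) interval.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_cells = len(labels)
    proportions = []
    for _ in range(n_boot):
        # Resample cells with replacement and recompute the proportion.
        sample = rng.choices(labels, k=n_cells)
        proportions.append(sample.count(cell_type) / n_cells)
    proportions.sort()
    lower = proportions[int((alpha / 2) * n_boot)]
    upper = proportions[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical subject with 30 T cells and 70 B cells.
subject_labels = ["T"] * 30 + ["B"] * 70
interval = bootstrap_proportion_ci(subject_labels, "T")
```

With only a few subjects per condition, intervals of this kind give a sense of how much an observed difference in proportions could be due to cell sampling alone.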
Affiliation(s)
- Yue Cao
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| | - Yingxin Lin
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| | - John T Ormerod
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| | - Pengyi Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW 2145, Australia
| | - Jean Y H Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
| | - Kitty K Lo
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
| |
|
29
|
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol 2019; 20:295. [PMID: 31870412 PMCID: PMC6927135 DOI: 10.1186/s13059-019-1861-6] [Citation(s) in RCA: 206] [Impact Index Per Article: 41.2] [Received: 03/12/2019] [Accepted: 10/15/2019] [Indexed: 12/23/2022] Open
Abstract
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
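Feature selection using deviance can be sketched as follows. This is a minimal illustration, assuming a per-gene binomial deviance against a null model in which each gene has the same relative abundance in every cell; the abstract does not spell out the formula, so this form and the function name `binomial_deviance` are assumptions rather than the authors' published code. The input is assumed to contain at least one non-zero count.

```python
import math


def binomial_deviance(counts):
    """Per-gene binomial deviance for a cells x genes table of UMI counts.

    counts: list of rows (one per cell), each a list of non-negative
    integer counts (one per gene). Larger deviance means the gene departs
    more from constant relative abundance across cells.
    """
    n_genes = len(counts[0])
    totals = [sum(row) for row in counts]  # total UMIs per cell
    grand_total = sum(totals)
    # Pooled relative abundance of each gene under the null model.
    pi = [sum(row[j] for row in counts) / grand_total for j in range(n_genes)]

    def x_log_ratio(x, expected):
        # Convention: 0 * log(0 / expected) = 0.
        return x * math.log(x / expected) if x > 0 else 0.0

    deviances = []
    for j in range(n_genes):
        total = 0.0
        for row, n in zip(counts, totals):
            y = row[j]
            total += x_log_ratio(y, n * pi[j])
            total += x_log_ratio(n - y, n * (1.0 - pi[j]))
        deviances.append(2.0 * total)
    return deviances


# Two hypothetical cells, three genes; gene 0 has constant abundance.
toy_counts = [[10, 0, 90], [10, 50, 40]]
toy_deviance = binomial_deviance(toy_counts)
```

Genes would then be ranked by deviance and the top-ranked ones kept, in place of selecting highly variable genes on log-transformed counts.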
Affiliation(s)
- F. William Townes
- Department of Biostatistics, Harvard University, Cambridge, MA USA
- Present Address: Department of Computer Science, Princeton University, Princeton, NJ USA
| | - Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD USA
| | - Martin J. Aryee
- Department of Biostatistics, Harvard University, Cambridge, MA USA
- Molecular Pathology Unit, Massachusetts General Hospital, Charlestown, MA USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA USA
- Department of Pathology, Harvard Medical School, Boston, MA USA
| | - Rafael A. Irizarry
- Department of Biostatistics, Harvard University, Cambridge, MA USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA USA
| |
|
30
|
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol 2019; 20:295. [PMID: 31870412 DOI: 10.1101/574574] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 03/12/2019] [Accepted: 10/15/2019] [Indexed: 05/24/2023] Open
Abstract
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
Affiliation(s)
- F William Townes
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
- Present Address: Department of Computer Science, Princeton University, Princeton, NJ, USA
| | - Stephanie C Hicks
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | - Martin J Aryee
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
- Molecular Pathology Unit, Massachusetts General Hospital, Charlestown, MA, USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, USA
- Department of Pathology, Harvard Medical School, Boston, MA, USA
| | - Rafael A Irizarry
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| |
|
31
|
Cheng C, Easton J, Rosencrance C, Li Y, Ju B, Williams J, Mulder HL, Pang Y, Chen W, Chen X. Latent cellular analysis robustly reveals subtle diversity in large-scale single-cell RNA-seq data. Nucleic Acids Res 2019; 47:e143. [PMID: 31566233 PMCID: PMC6902034 DOI: 10.1093/nar/gkz826] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/21/2019] [Revised: 08/30/2019] [Accepted: 09/26/2019] [Indexed: 12/21/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for characterizing cell-to-cell variation and cellular dynamics in populations that otherwise appear homogeneous in basic and translational biological research. However, significant challenges arise in the analysis of scRNA-seq data, including a low signal-to-noise ratio with high data sparsity, potential batch effects, and scalability problems when hundreds of thousands of cells are to be analyzed, among others. The inherent complexities of scRNA-seq data and the dynamic nature of cellular processes lead to suboptimal performance of many currently available algorithms, even for basic tasks such as identifying biologically meaningful heterogeneous subpopulations. In this study, we developed Latent Cellular Analysis (LCA), a machine learning-based analytical pipeline that combines cosine-similarity measurement by latent cellular states with a graph-based clustering algorithm. LCA provides heuristic solutions for population number inference, dimension reduction, feature selection, and control of technical variations without explicit gene filtering. We show that LCA is robust, accurate, and powerful by comparison with multiple state-of-the-art computational methods when applied to large-scale real and simulated scRNA-seq data. Importantly, the ability of LCA to learn from representative subsets of the data provides scalability, thereby addressing a significant challenge posed by growing sample sizes in scRNA-seq data analysis.
Affiliation(s)
- Changde Cheng
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - John Easton
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Celeste Rosencrance
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Yan Li
- The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
| | - Bensheng Ju
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Justin Williams
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Heather L Mulder
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Yakun Pang
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Wenan Chen
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Xiang Chen
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| |
|
32
|
Krzak M, Raykov Y, Boukouvalas A, Cutillo L, Angelini C. Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods. Front Genet 2019; 10:1253. [PMID: 31921297 PMCID: PMC6918801 DOI: 10.3389/fgene.2019.01253] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 07/19/2019] [Accepted: 11/13/2019] [Indexed: 01/04/2023] Open
Abstract
Single-cell RNA-seq (scRNAseq) is a powerful tool to study heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and the impact that these choices can have in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and describe the ranges of possibilities that are offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in terms of dimensionality, number of cell populations or levels of noise. Remarkably, the results presented here show that great variability in the performance of the models is strongly attributed to the choice of the user-specific parameter settings. We describe several tendencies in the performance attributed to their modes of usage and different types of datasets, and identify which methods are strongly affected by data dimensionality in terms of computational time. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.
Affiliation(s)
- Monika Krzak
- Institute for Applied Mathematics “Mauro Picone”, Naples, Italy
| | - Yordan Raykov
- Department of Mathematics, Aston University, Birmingham, United Kingdom
| | | | - Luisa Cutillo
- School of Mathematics, University of Leeds, Leeds, United Kingdom
| | | |
|
33
|
Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol 2019; 20:269. [PMID: 31823809 PMCID: PMC6902413 DOI: 10.1186/s13059-019-1898-6] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 05/16/2019] [Accepted: 11/22/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq. RESULTS We aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used dimensionality reduction methods for scRNA-seq studies. Specifically, we compare 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different dimensionality reduction methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluate the computational scalability of different dimensionality reduction methods by recording their computational cost. CONCLUSIONS Based on the comprehensive evaluation results, we provide important guidelines for choosing dimensionality reduction methods for scRNA-seq data analysis. We also provide all analysis scripts used in the present study at www.xzlab.org/reproduce.html.
Affiliation(s)
- Shiquan Sun
- School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, People's Republic of China
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Jiaqiang Zhu
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Ying Ma
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
- Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
| |
|
34
|
Chaudhry F, Isherwood J, Bawa T, Patel D, Gurdziel K, Lanfear DE, Ruden DM, Levy PD. Single-Cell RNA Sequencing of the Cardiovascular System: New Looks for Old Diseases. Front Cardiovasc Med 2019; 6:173. [PMID: 31921894 PMCID: PMC6914766 DOI: 10.3389/fcvm.2019.00173] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 08/08/2019] [Accepted: 11/12/2019] [Indexed: 12/18/2022] Open
Abstract
Cardiovascular disease encompasses a wide range of conditions, resulting in the highest number of deaths worldwide. The underlying pathologies surrounding cardiovascular disease include a vast and complicated network of both cellular and molecular mechanisms. Unique phenotypic alterations in specific cell types, visualized as varying RNA expression levels (both coding and non-coding), have been identified as crucial factors in the pathology underlying conditions such as heart failure and atherosclerosis. Recent advances in single-cell RNA sequencing (scRNA-seq) have elucidated a new realm of cell subpopulations and transcriptional variations that are associated with normal and pathological physiology in a wide variety of diseases. This breakthrough in the phenotypical understanding of our cells has brought novel insight into cardiovascular basic science. scRNA-seq allows for separation of widely distinct cell subpopulations which were, until recently, simply averaged together with bulk-tissue RNA-seq. scRNA-seq has been used to identify novel cell types in the heart and vasculature that could be implicated in a variety of disease pathologies. Furthermore, scRNA-seq has been able to identify significant heterogeneity of phenotypes within individual cell subtype populations. The ability to characterize single cells based on transcriptional phenotypes allows researchers to map the development of cells and identify changes in specific subpopulations due to diseases at very high throughput. This review looks at recent scRNA-seq studies of various aspects of the cardiovascular system and discusses their potential value to our understanding of the cardiovascular system and pathology.
Affiliation(s)
- Farhan Chaudhry
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| | - Jenna Isherwood
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, United States
| | - Tejeshwar Bawa
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| | - Dhruvil Patel
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| | - Katherine Gurdziel
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, United States
| | - David E Lanfear
- Heart and Vascular Institute, Henry Ford Health System, Detroit, MI, United States
| | - Douglas M Ruden
- Department of Obstetrics and Gynecology, Center for Urban Responses to Environmental Stressors, Wayne State University, Detroit, MI, United States
| | - Phillip D Levy
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| |
|
35
|
Xu G, Liu Y, Li H, Liu L, Zhang S, Zhang Z. Dissecting the human immune system with single cell RNA sequencing technology. J Leukoc Biol 2019; 107:613-623. [PMID: 31803960 DOI: 10.1002/jlb.5mr1019-179r] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 07/30/2019] [Revised: 10/24/2019] [Accepted: 11/13/2019] [Indexed: 12/23/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful new technology that allows the analysis of transcriptomes from individual cells and is ideally suited to dissecting immune cell heterogeneity. ScRNA-seq has already been applied to identify novel immune cell subsets, elaborate cellular differentiation trajectories, and elucidate immunopathogenic mechanisms. Here, we briefly discuss recent progress and challenges in scRNA-seq technology, including the workflow, recent applications in immunology, and potential hurdles that need to be overcome. This review will highlight how single cell technology promotes our understanding of human immunology.
Affiliation(s)
- Gang Xu
- Institute of Hepatology, National Clinical Research Center for Infectious Disease, Shenzhen Third People's Hospital, the Second Affiliated Hospital of Southern University of Science and Technology, Shenzhen, Guangdong Province, China
- Guangdong Key Lab of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Longgang District, Shenzhen, China
| | - Yang Liu
- Institute of Hepatology, National Clinical Research Center for Infectious Disease, Shenzhen Third People's Hospital, the Second Affiliated Hospital of Southern University of Science and Technology, Shenzhen, Guangdong Province, China
- Guangdong Key Lab of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Longgang District, Shenzhen, China
| | - Hanjie Li
- Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Lei Liu
- Institute of Hepatology, National Clinical Research Center for Infectious Disease, Shenzhen Third People's Hospital, the Second Affiliated Hospital of Southern University of Science and Technology, Shenzhen, Guangdong Province, China
- Guangdong Key Lab of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Longgang District, Shenzhen, China
| | - Shuye Zhang
- Shanghai Public Health Clinical Center and Institute of Biomedical Sciences, Fudan University, Shanghai, China
| | - Zheng Zhang
- Institute of Hepatology, National Clinical Research Center for Infectious Disease, Shenzhen Third People's Hospital, the Second Affiliated Hospital of Southern University of Science and Technology, Shenzhen, Guangdong Province, China
- Guangdong Key Lab of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Longgang District, Shenzhen, China
- Key Laboratory of Immunology, Sino-French Hoffmann Institute, School of Basic Medical Sciences; Guangdong Provincial Key Laboratory of Allergy & Clinical Immunology, The Second Affiliated Hospital, Guangzhou Medical University, Guangzhou, China
| |
|
36
|
Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods 2019; 16:1007-1015. [PMID: 31501550 DOI: 10.1038/s41592-019-0529-1] [Citation(s) in RCA: 184] [Impact Index Per Article: 36.8] [Received: 01/15/2019] [Accepted: 07/16/2019] [Indexed: 01/23/2023]
Abstract
Single-cell RNA sequencing has enabled the decomposition of complex tissues into functionally distinct cell types. Often, investigators wish to assign cells to cell types through unsupervised clustering followed by manual annotation or via 'mapping' to existing data. However, manual interpretation scales poorly to large datasets, mapping approaches require purified or pre-annotated data and both are prone to batch effects. To overcome these issues, we present CellAssign, a probabilistic model that leverages prior knowledge of cell-type marker genes to annotate single-cell RNA sequencing data into predefined or de novo cell types. CellAssign automates the process of assigning cells in a highly scalable manner across large datasets while controlling for batch and sample effects. We demonstrate the advantages of CellAssign through extensive simulations and analysis of tumor microenvironment composition in high-grade serous ovarian cancer and follicular lymphoma.
|
37
|
Yu X, Chen YA, Conejo-Garcia JR, Chung CH, Wang X. Estimation of immune cell content in tumor using single-cell RNA-seq reference data. BMC Cancer 2019; 19:715. [PMID: 31324168 PMCID: PMC6642583 DOI: 10.1186/s12885-019-5927-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/09/2019] [Accepted: 07/12/2019] [Indexed: 12/12/2022] Open
Abstract
Background The rapid development of single-cell RNA sequencing (scRNA-seq) provides unprecedented opportunities to study the tumor ecosystem that involves a heterogeneous mixture of cell types. However, the majority of previous and current studies related to translational and molecular oncology have only focused on the bulk tumor and there is a wealth of gene expression data accumulated with matched clinical outcomes. Results In this paper, we introduce a scheme for characterizing cell compositions from bulk tumor gene expression by integrating signatures learned from scRNA-seq data. We derived the reference expression matrix for each cell type based on cell subpopulations identified in a head and neck cancer dataset. Our results suggest that the scRNA-Seq-derived reference matrix outperforms the existing gene panel and reference matrix with respect to distinguishing immune cell subtypes. Conclusions Findings and resources created from this study enable future and secondary analysis of tumor RNA mixtures in head and neck cancer for a more accurate cellular deconvolution, and can facilitate the profiling of the immune infiltration in other solid tumors due to the expression homogeneity observed in immune cells.
Affiliation(s)
- Xiaoqing Yu
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Y Ann Chen
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Jose R Conejo-Garcia
  - Department of Immunology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Christine H Chung
  - Department of Head and Neck-Endocrine Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Xuefeng Wang
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
|
38
|
Weber LM, Saelens W, Cannoodt R, Soneson C, Hapfelmeier A, Gardner PP, Boulesteix AL, Saeys Y, Robinson MD. Essential guidelines for computational method benchmarking. Genome Biol 2019; 20:125. [PMID: 31221194 PMCID: PMC6584985 DOI: 10.1186/s13059-019-1738-8] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Indexed: 12/14/2022] Open
Abstract
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.
Affiliation(s)
- Lukas M Weber
  - Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
- Wouter Saelens
  - Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
- Robrecht Cannoodt
  - Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
- Charlotte Soneson
  - Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
  - Present address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- Alexander Hapfelmeier
  - Institute of Medical Informatics, Statistics and Epidemiology, Technical University of Munich, 81675, Munich, Germany
- Paul P Gardner
  - Department of Biochemistry, University of Otago, Dunedin, 9016, New Zealand
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University, 81377, Munich, Germany
- Yvan Saeys
  - Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
- Mark D Robinson
  - Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
|
39
|
Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 2019; 15:e8746. [PMID: 31217225 PMCID: PMC6582955 DOI: 10.15252/msb.20188746] [Citation(s) in RCA: 953] [Impact Index Per Article: 190.6] [Received: 11/16/2018] [Revised: 03/15/2019] [Accepted: 04/03/2019] [Indexed: 12/21/2022] Open
Abstract
Single-cell RNA-seq has enabled gene expression to be studied at an unprecedented resolution. The promise of this technology is attracting a growing user base for single-cell analysis methods. As more analysis tools are becoming available, it is becoming increasingly difficult to navigate this landscape and produce an up-to-date workflow to analyse one's data. Here, we detail the steps of a typical single-cell RNA-seq analysis, including pre-processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell- and gene-level downstream analysis. We formulate current best-practice recommendations for these steps based on independent comparison studies. We have integrated these best-practice recommendations into a workflow, which we apply to a public dataset to further illustrate how these steps work in practice. Our documented case study can be found at https://www.github.com/theislab/single-cell-tutorial. This review will serve as a workflow tutorial for new entrants into the field, and help established users update their analysis pipelines.
Affiliation(s)
- Malte D Luecken
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
  - Department of Mathematics, Technische Universität München, Garching bei München, Germany
|
40
|
Crow M, Gillis J. Single cell RNA-sequencing: replicability of cell types. Curr Opin Neurobiol 2019; 56:69-77. [PMID: 30654233 PMCID: PMC6551252 DOI: 10.1016/j.conb.2018.12.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/05/2018] [Revised: 12/03/2018] [Accepted: 12/09/2018] [Indexed: 01/09/2023]
Abstract
Recent technical advances have enabled transcriptomics experiments at an unprecedented scale, and single-cell profiles from neural tissue are accumulating rapidly. There has been considerable effort to use these profiles to understand cell diversity, primarily through unsupervised clustering and differential expression analysis. However, current practices to validate these findings vary. In this review, we describe recent efforts to evaluate clusters from single-cell RNA-sequencing data, and provide a framework for considering current evidence and practices in terms of their capacity to establish principles of cell biology. Single-cell RNA-sequencing has already transformed neuroscience. By facilitating detailed comparative and genetic perturbation analyses, it may provide the tools to uncover fundamental mechanisms of neural diversity throughout the tree of life.
Affiliation(s)
- Megan Crow
  - Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
- Jesse Gillis
  - Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
|
41
|
Tian L, Dong X, Freytag S, Lê Cao KA, Su S, JalalAbadi A, Amann-Zalcenstein D, Weber TS, Seidi A, Jabbari JS, Naik SH, Ritchie ME. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019; 16:479-487. [DOI: 10.1038/s41592-019-0425-8] [Citation(s) in RCA: 183] [Impact Index Per Article: 36.6] [Received: 10/03/2018] [Accepted: 04/18/2019] [Indexed: 11/09/2022]
|
42
|
Ye W, Ji G, Ye P, Long Y, Xiao X, Li S, Su Y, Wu X. scNPF: an integrative framework assisted by network propagation and network fusion for preprocessing of single-cell RNA-seq data. BMC Genomics 2019; 20:347. [PMID: 31068142 PMCID: PMC6505295 DOI: 10.1186/s12864-019-5747-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/21/2018] [Accepted: 04/29/2019] [Indexed: 12/15/2022] Open
Abstract
Background: Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful tool for profiling genome-scale transcriptomes of individual cells and capturing transcriptome-wide cell-to-cell variability. However, scRNA-seq technologies suffer from high levels of technical noise and variability, hindering reliable quantification of lowly and moderately expressed genes. Since most downstream analyses of scRNA-seq data, such as cell type clustering and differential expression analysis, rely on the gene-cell expression matrix, preprocessing is a critical preliminary step in the analysis of scRNA-seq data. Results: We present scNPF, an integrative scRNA-seq preprocessing framework assisted by network propagation and network fusion, for recovering gene expression loss, correcting gene expression measurements, and learning similarities between cells. scNPF leverages the context-specific topology inherent in the given data and the prior knowledge derived from publicly available molecular gene-gene interaction networks to augment gene-gene relationships in a data-driven manner. We have demonstrated the great potential of scNPF in scRNA-seq preprocessing for accurately recovering gene expression values and learning cell similarity networks. Comprehensive evaluation of scNPF across a wide spectrum of scRNA-seq data sets showed that scNPF achieved comparable or higher performance than the competing approaches according to various metrics of internal validation and clustering accuracy. We have made scNPF an easy-to-use R package, which can be used as a versatile preprocessing plug-in for most existing scRNA-seq analysis pipelines or tools. Conclusions: scNPF is a universal tool for preprocessing of scRNA-seq data, which jointly incorporates the global topology of prior interaction networks and the context-specific information encapsulated in the scRNA-seq data to capture both shared and complementary knowledge from diverse data sources. scNPF could be used to recover gene signatures and learn cell-to-cell similarities from emerging scRNA-seq data to facilitate downstream analyses such as dimension reduction, cell type clustering, and visualization.
Affiliation(s)
- Wenbin Ye
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
- Guoli Ji
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
  - Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
- Pengchao Ye
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
- Yuqi Long
  - Software Quality Testing Engineering Research Center, China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou, 510610, China
- Xuesong Xiao
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
- Shuchao Li
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
- Yaru Su
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350116, China
- Xiaohui Wu
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
  - Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
|
43
|
Sun Z, Chen L, Xin H, Jiang Y, Huang Q, Cillo AR, Tabib T, Kolls JK, Bruno TC, Lafyatis R, Vignali DAA, Chen K, Ding Y, Hu M, Chen W. A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nat Commun 2019; 10:1649. [PMID: 30967541 PMCID: PMC6456731 DOI: 10.1038/s41467-019-09639-3] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 08/06/2018] [Accepted: 03/15/2019] [Indexed: 02/08/2023] Open
Abstract
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Results from extensive simulation studies and applications of BAMM-SC to in-house experimental scRNA-seq datasets using blood, lung and skin cells from humans or mice demonstrate that BAMM-SC outperformed existing clustering methods with considerably improved clustering accuracy, particularly in the presence of heterogeneity among individuals. With the development of large-scale single-cell RNA-seq technology, population-scale scRNA-seq studies are emerging. Here, the authors develop BAMM-SC, a tool for clustering droplet-based scRNA-seq data from multiple individuals simultaneously.
Affiliation(s)
- Zhe Sun
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, 15261, USA
- Li Chen
  - Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL, 36849, USA
- Hongyi Xin
  - Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, 15224, USA
- Yale Jiang
  - Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, 15224, USA
  - School of Medicine, Tsinghua University, Beijing, 100084, China
- Qianhui Huang
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, 48109, USA
- Anthony R Cillo
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15262, USA
- Tracy Tabib
  - Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15261, USA
- Jay K Kolls
  - School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Tullia C Bruno
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15262, USA
  - Tumor Microenvironment Center, UPMC Hillman Cancer Center, Pittsburgh, PA, 15232, USA
- Robert Lafyatis
  - Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15261, USA
- Dario A A Vignali
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15262, USA
  - Tumor Microenvironment Center, UPMC Hillman Cancer Center, Pittsburgh, PA, 15232, USA
  - Cancer Immunology and Immunotherapy Program, UPMC Hillman Cancer Center, Pittsburgh, PA, 15232, USA
- Kong Chen
  - Division of Pulmonary, Allergy and Critical Care Medicine, Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, USA
- Ying Ding
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, 15261, USA
- Ming Hu
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
- Wei Chen
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, 15261, USA
  - Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, 15224, USA
|
44
|
Diaz-Mejia JJ, Meng EC, Pico AR, MacParland SA, Ketela T, Pugh TJ, Bader GD, Morris JH. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data. F1000Res 2019; 8:ISCB Comm J-296. [PMID: 31508207 PMCID: PMC6720041 DOI: 10.12688/f1000research.18490.2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 08/19/2019] [Indexed: 10/15/2023] Open
Abstract
Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated steps from normalization to cell clustering. However, assigning cell type labels to cell clusters is often conducted manually, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. This is partially due to the scarcity of reference cell type signatures and because some methods support limited cell type signatures. Methods: In this study, we benchmarked five methods representing first-generation enrichment analysis (ORA), second-generation approaches (GSEA and GSVA), machine learning tools (CIBERSORT) and network-based neighbor voting (METANEIGHBOR), for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used five scRNA-seq datasets: human liver, 11 Tabula Muris mouse tissues, two human peripheral blood mononuclear cell datasets, and mouse retinal neurons, for which reference cell type signatures were available. The datasets span Drop-seq, 10X Chromium and Seq-Well technologies and range in size from ~3,700 to ~68,000 cells. Results: Our results show that, in general, all five methods perform well in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.91, sd = 0.06), whereas precision-recall analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). We observed an influence of the number of genes in cell type signatures on performance, with smaller signatures leading more frequently to incorrect results. Conclusions: GSVA was the overall top performer and was more robust in cell type signature subsampling simulations, although different methods performed well using different datasets. METANEIGHBOR and GSVA were the fastest methods. CIBERSORT and METANEIGHBOR were more influenced than the other methods by analyses including only expected cell types. We provide an extensible framework that can be used to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.
Affiliation(s)
- J. Javier Diaz-Mejia
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
- Elaine C. Meng
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
- Sonya A. MacParland
  - Multi-Organ Transplant Program, Toronto General Hospital Research Institute, Toronto, ON, M5G 2C4, Canada
  - Department of Immunology, University of Toronto, Toronto, ON, M5S 1A8, Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Troy Ketela
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- Trevor J. Pugh
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
  - Ontario Institute for Cancer Research, Toronto, ON, M5G 0A3, Canada
- Gary D. Bader
  - The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5G 1A8, Canada
- John H. Morris
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
|
45
|
Diaz-Mejia JJ, Meng EC, Pico AR, MacParland SA, Ketela T, Pugh TJ, Bader GD, Morris JH. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data. F1000Res 2019; 8:ISCB Comm J-296. [PMID: 31508207 PMCID: PMC6720041 DOI: 10.12688/f1000research.18490.3] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Accepted: 10/09/2019] [Indexed: 01/28/2023] Open
Abstract
Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated steps from normalization to cell clustering. However, assigning cell type labels to cell clusters is often conducted manually, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. This is partially due to the scarcity of reference cell type signatures and because some methods support limited cell type signatures. Methods: In this study, we benchmarked five methods representing first-generation enrichment analysis (ORA), second-generation approaches (GSEA and GSVA), machine learning tools (CIBERSORT) and network-based neighbor voting (METANEIGHBOR), for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used five scRNA-seq datasets: human liver, 11 Tabula Muris mouse tissues, two human peripheral blood mononuclear cell datasets, and mouse retinal neurons, for which reference cell type signatures were available. The datasets span Drop-seq, 10X Chromium and Seq-Well technologies and range in size from ~3,700 to ~68,000 cells. Results: Our results show that, in general, all five methods perform well in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.91, sd = 0.06), whereas precision-recall analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). We observed an influence of the number of genes in cell type signatures on performance, with smaller signatures leading more frequently to incorrect results. Conclusions: GSVA was the overall top performer and was more robust in cell type signature subsampling simulations, although different methods performed well using different datasets. METANEIGHBOR and GSVA were the fastest methods. CIBERSORT and METANEIGHBOR were more influenced than the other methods by analyses including only expected cell types. We provide an extensible framework that can be used to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.
Affiliation(s)
- J. Javier Diaz-Mejia
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
- Elaine C. Meng
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
- Sonya A. MacParland
  - Multi-Organ Transplant Program, Toronto General Hospital Research Institute, Toronto, ON, M5G 2C4, Canada
  - Department of Immunology, University of Toronto, Toronto, ON, M5S 1A8, Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Troy Ketela
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- Trevor J. Pugh
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
  - Ontario Institute for Cancer Research, Toronto, ON, M5G 0A3, Canada
- Gary D. Bader
  - The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5G 1A8, Canada
- John H. Morris
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
|
46
|
Diaz-Mejia JJ, Meng EC, Pico AR, MacParland SA, Ketela T, Pugh TJ, Bader GD, Morris JH. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data. F1000Res 2019; 8:ISCB Comm J-296. [PMID: 31508207 PMCID: PMC6720041 DOI: 10.12688/f1000research.18490.1] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Accepted: 03/08/2019] [Indexed: 12/11/2022] Open
Abstract
Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated computational steps like data normalization, dimensionality reduction and cell clustering. However, assigning cell type labels to cell clusters is still conducted manually by most researchers, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. Two bottlenecks to automating this task are the scarcity of reference cell type gene expression signatures and the fact that some dedicated methods are available only as web servers with limited cell type gene expression signatures. Methods: In this study, we benchmarked four methods (CIBERSORT, GSEA, GSVA, and ORA) for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used scRNA-seq datasets from liver, peripheral blood mononuclear cells and retinal neurons for which reference cell type gene expression signatures were available. Results: Our results show that, in general, all four methods show a high performance in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.94, sd = 0.036), whereas precision-recall curve analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). Conclusions: CIBERSORT and GSVA were the top two performers. Additionally, GSVA was the fastest of the four methods and was more robust in cell type gene expression signature subsampling simulations. We provide an extensible framework to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.
Affiliation(s)
- J. Javier Diaz-Mejia
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
- Elaine C. Meng
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
- Sonya A. MacParland
  - Multi-Organ Transplant Program, Toronto General Hospital Research Institute, Toronto, ON, M5G 2C4, Canada
  - Department of Immunology, University of Toronto, Toronto, ON, M5S 1A8, Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Troy Ketela
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- Trevor J. Pugh
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
  - Ontario Institute for Cancer Research, Toronto, ON, M5G 0A3, Canada
- Gary D. Bader
  - The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5G 1A8, Canada
- John H. Morris
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
|
47
|
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res 2018; 7:1297. [PMID: 30228881 PMCID: PMC6124389 DOI: 10.12688/f1000research.15809.1] [Citation(s) in RCA: 99] [Impact Index Per Article: 16.5] [Accepted: 08/07/2018] [Indexed: 01/21/2023] Open
Abstract
Background: The commercially available 10x Genomics protocol to generate droplet-based single-cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or identical cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regard to which method to use. Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as three silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also the robustness of a dozen methods. Results: We found that some methods, including Seurat and Cell Ranger, outperform other methods, although performance seems to be dependent on the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other. Conclusions: In light of this, we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.
Affiliation(s)
- Saskia Freytag
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, Australia
- Luyi Tian
  - Department of Medical Biology, University of Melbourne, Parkville, Australia
  - Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Milica Ng
  - Bio21 Institute, CSL Limited, Parkville, Australia
- Melanie Bahlo
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, Australia
|
48
|
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res 2018; 7:1297. [PMID: 30228881 PMCID: PMC6124389 DOI: 10.12688/f1000research.15809.2] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Accepted: 12/14/2018] [Indexed: 12/23/2022] Open
Abstract
Background: The commercially available 10x Genomics protocol to generate droplet-based single cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or same cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regards to which method to use. Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as multiple silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also running time and robustness of a dozen methods. Results: We found that Seurat outperformed other methods, although performance seems to be dependent on many factors, including the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other. Conclusions: In light of this we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.
Affiliation(s)
- Saskia Freytag
- Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
- Luyi Tian
- Department of Medical Biology, University of Melbourne, Parkville, Australia
- Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Milica Ng
- Bio21 Institute, CSL Limited, Parkville, Australia
- Melanie Bahlo
- Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
|