101
Fu R, Gillen AE, Sheridan RM, Tian C, Daya M, Hao Y, Hesselberth JR, Riemondy KA. clustifyr: an R package for automated single-cell RNA sequencing cluster classification. F1000Res 2020; 9:223. [PMID: 32765839 PMCID: PMC7383722 DOI: 10.12688/f1000research.22969.2] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Accepted: 07/08/2020] [Indexed: 01/02/2023] Open
Abstract
Assignment of cell types from single-cell RNA sequencing (scRNA-seq) data remains a time-consuming and error-prone process. Current packages for identity assignment use limited types of reference data and often have rigid data structure requirements. We developed the clustifyr R package to leverage several external data types, including gene expression profiles, to assign likely cell types using data from scRNA-seq, bulk RNA-seq, microarray expression data, or signature gene lists. We benchmark various parameters of a correlation-based approach and implement gene list enrichment methods. clustifyr is a lightweight and effective cell-type assignment tool developed for compatibility with various scRNA-seq analysis workflows. clustifyr is publicly available at https://github.com/rnabioco/clustifyr
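The correlation-based idea in this abstract can be illustrated with a minimal Python sketch. It is not clustifyr's API (clustifyr is an R package); the function name `assign_clusters`, the use of Pearson correlation, and the `threshold` cutoff are all assumptions made for illustration. Each cluster's average expression profile is correlated against each reference cell-type profile, and the best-scoring reference label is assigned when its correlation clears the cutoff.

```python
import numpy as np

def assign_clusters(cluster_avg, reference, ref_labels, threshold=0.5):
    """Label clusters by their best-correlated reference profile.

    cluster_avg: genes x clusters matrix of average expression per cluster.
    reference:   genes x cell-types matrix (same gene order).
    threshold:   hypothetical cutoff below which a cluster stays unassigned.
    """
    n_clusters = cluster_avg.shape[1]
    # np.corrcoef treats rows as variables, so profiles are transposed into rows.
    # The upper-right block holds cluster-vs-reference Pearson correlations.
    corr = np.corrcoef(cluster_avg.T, reference.T)[:n_clusters, n_clusters:]
    best = corr.argmax(axis=1)
    labels = [
        ref_labels[j] if corr[i, j] >= threshold else "unassigned"
        for i, j in enumerate(best)
    ]
    return labels, corr

# Toy data: 5 genes, 2 reference cell types, 2 query clusters.
ref_labels = ["T cell", "B cell"]
reference = np.array([[10, 0], [8, 1], [0, 9], [1, 7], [5, 5]], dtype=float)
cluster_avg = np.array([[9, 1], [7, 0], [1, 8], [0, 9], [4, 6]], dtype=float)
labels, corr = assign_clusters(cluster_avg, reference, ref_labels)
print(labels)
```

A rank-based correlation could be substituted by ranking each column before calling `np.corrcoef`; which measure works best is the kind of parameter the abstract says the authors benchmark.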
Affiliation(s)
- Rui Fu
- RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Austin E Gillen
- RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Ryan M Sheridan
- RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Chengzhe Tian
- Department of Biochemistry, University of Colorado Boulder, Boulder, CO, 80303, USA
- Michelle Daya
- Biomedical Informatics & Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Yue Hao
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, 27695, USA
- Jay R Hesselberth
- RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Kent A Riemondy
- RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA
102
Network-Based Single-Cell RNA-Seq Data Imputation Enhances Cell Type Identification. Genes (Basel) 2020; 11:genes11040377. [PMID: 32244427 PMCID: PMC7230610 DOI: 10.3390/genes11040377] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/07/2020] [Revised: 03/24/2020] [Accepted: 03/24/2020] [Indexed: 12/14/2022] Open
Abstract
Single-cell RNA sequencing is a powerful technology for obtaining transcriptomes at single-cell resolutions. However, it suffers from dropout events (i.e., excess zero counts) since only a small fraction of transcripts get sequenced in each cell during the sequencing process. This inherent sparsity of expression profiles hinders further characterizations at cell/gene-level such as cell type identification and downstream analysis. To alleviate this dropout issue we introduce a network-based method, netImpute, by leveraging the hidden information in gene co-expression networks to recover real signals. netImpute employs Random Walk with Restart (RWR) to adjust the gene expression level in a given cell by borrowing information from its neighbors in a gene co-expression network. Performance evaluation and comparison with existing tools on simulated data and seven real datasets show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. While the idea of netImpute is general and can be applied with other types of networks such as cell co-expression network or protein–protein interaction (PPI) network, evaluation results show that gene co-expression network is consistently more beneficial, presumably because PPI network usually lacks cell type context, while cell co-expression network can cause information loss for rare cell types. Evaluation results on several biological datasets show that netImpute can more effectively recover missing transcripts in scRNA-seq data and enhance the identification and visualization of heterogeneous cell types than existing methods.
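As a rough illustration of Random Walk with Restart on a gene network, the sketch below smooths each cell's expression vector over a gene-by-gene adjacency matrix. It is a generic RWR iteration, not the netImpute implementation: the function name `rwr_smooth`, the column normalization, and the `restart` value are assumptions, and the abstract does not say how netImpute builds or normalizes its co-expression network.

```python
import numpy as np

def rwr_smooth(expr, adjacency, restart=0.5, tol=1e-8, max_iter=1000):
    """Random Walk with Restart over a gene network.

    expr:      genes x cells expression matrix (the restart distribution).
    adjacency: genes x genes nonnegative matrix, e.g. a co-expression graph.
    restart:   probability of jumping back to the observed expression.
    """
    col_sums = adjacency.sum(axis=0, keepdims=True).astype(float)
    col_sums[col_sums == 0] = 1.0          # isolated genes keep a zero column
    W = adjacency / col_sums               # column-normalized transition matrix
    F = expr.astype(float)
    for _ in range(max_iter):
        # One step: diffuse along network edges, then mix in the observed data.
        F_next = (1 - restart) * W @ F + restart * expr
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F

# Toy data: genes 0 and 1 are linked, gene 2 is isolated.
adjacency = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
expr = np.array([[4, 0], [0, 0], [0, 2]], dtype=float)
smoothed = rwr_smooth(expr, adjacency)
print(smoothed)
```

In the toy data, gene 1 has a zero count in cell 0 but is linked to gene 0, which is expressed there, so the iteration assigns gene 1 a nonzero value in that cell. This is the "borrowing information from neighbors" behavior the abstract describes.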
103
Huh R, Yang Y, Jiang Y, Shen Y, Li Y. SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble. Nucleic Acids Res 2020; 48:86-95. [PMID: 31777938 PMCID: PMC6943136 DOI: 10.1093/nar/gkz959] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 05/08/2019] [Revised: 10/03/2019] [Accepted: 10/10/2019] [Indexed: 12/19/2022] Open
Abstract
Clustering is an essential step in the analysis of single cell RNA-seq (scRNA-seq) data to shed light on tissue complexity, including the number of cell types and the transcriptomic signatures of each cell type. Due to its importance, novel methods have been developed recently for this purpose. However, different approaches generate varying estimates regarding the number of clusters and the single-cell level cluster assignments. This type of unsupervised clustering is challenging, and it is often hard to gauge which method to use because none of the existing methods outperforms the others across all scenarios. We present SAME-clustering, a mixture model-based approach that takes clustering solutions from multiple methods and selects a maximally diverse subset to produce an improved ensemble solution. We tested SAME-clustering across 15 scRNA-seq datasets generated by different platforms, with number of clusters varying from 3 to 15, and number of single cells from 49 to 32,695. Results show that our SAME-clustering ensemble method yields enhanced clustering, in terms of both cluster assignments and number of clusters. The mixture model ensemble clustering is not limited to clustering scRNA-seq data and may be useful to a wide range of clustering applications.
Affiliation(s)
- Ruth Huh
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yuchen Yang
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yuchao Jiang
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yin Shen
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
- Department of Neurology, University of California, San Francisco, San Francisco, CA 94143, USA
- Yun Li
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- To whom correspondence should be addressed. Tel: +1 919 843 2832; Fax: +1 919 843 4682;
104
Loss of the branched-chain amino acid transporter CD98hc alters the development of colonic macrophages in mice. Commun Biol 2020; 3:130. [PMID: 32188932 PMCID: PMC7080761 DOI: 10.1038/s42003-020-0842-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 03/22/2019] [Accepted: 02/21/2020] [Indexed: 12/14/2022] Open
Abstract
Comprehensive development is critical for gut macrophages being essential for the intestinal immune system. However, the underlying mechanisms of macrophage development in the colon remain elusive. To investigate the function of branched-chain amino acids in the development of gut macrophages, an inducible knock-out mouse model for the branched-chain amino acid transporter CD98hc in CX3CR1+ macrophages was generated. The relatively selective deletion of CD98hc in macrophage populations leads to attenuated severity of chemically-induced colitis that we assessed by clinical, endoscopic, and histological scoring. Single-cell RNA sequencing of colonic lamina propria macrophages revealed that conditional deletion of CD98hc alters the “monocyte waterfall”-development to MHC II+ macrophages. The change in the macrophage development after deletion of CD98hc is associated with increased apoptotic gene expression. Our results show that CD98hc deletion changes the development of colonic macrophages. CD98hc in macrophages attenuates the severity of colitis. This change in the macrophage development is associated with increased expression of apoptotic genes, suggesting that CD98hc maintains the gut homeostasis by ensuring the development of gut macrophages.
105
Casey MJ, Stumpf PS, MacArthur BD. Theory of cell fate. Wiley Interdisciplinary Reviews. Systems Biology and Medicine 2020; 12:e1471. [PMID: 31828979 PMCID: PMC7027507 DOI: 10.1002/wsbm.1471] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/30/2019] [Revised: 10/15/2019] [Accepted: 11/06/2019] [Indexed: 11/17/2022]
Abstract
Cell fate decisions are controlled by complex intracellular molecular regulatory networks. Studies increasingly reveal the scale of this complexity: not only do cell fate regulatory networks contain numerous positive and negative feedback loops, they also involve a range of different kinds of nonlinear protein-protein and protein-DNA interactions. This inherent complexity and nonlinearity make cell fate decisions hard to understand using experiment and intuition alone. In this primer, we will outline how tools from mathematics can be used to understand cell fate dynamics. We will briefly introduce some notions from dynamical systems theory, and discuss how they offer a framework within which to build a rigorous understanding of what we mean by a cell "fate", and how cells change fate. We will also outline how modern experiments, particularly high-throughput single-cell experiments, are enabling us to test and explore the limits of these ideas, and build a better understanding of cellular identities. This article is categorized under: Models of Systems Properties and Processes > Mechanistic Models; Biological Mechanisms > Cell Fates; Models of Systems Properties and Processes > Cellular Models.
Affiliation(s)
- Michael J. Casey
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
- Patrick S. Stumpf
- Institute for Life Sciences, University of Southampton, Southampton, UK
- Centre for Human Development, Stem Cells and Regeneration, Faculty of Medicine, University of Southampton, Southampton, UK
- Ben D. MacArthur
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
- Centre for Human Development, Stem Cells and Regeneration, Faculty of Medicine, University of Southampton, Southampton, UK
106
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini CSO, Aparicio S, Baaijens J, Balvert M, Barbanson BD, Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo TH, Lelieveldt BP, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Rączkowska A, Reinders M, Ridder JD, Saliba AE, Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in single-cell data science. Genome Biol 2020; 21:31. [PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6] [Citation(s) in RCA: 564] [Impact Index Per Article: 141.0] [Received: 08/02/2019] [Accepted: 01/02/2020] [Indexed: 02/08/2023] Open
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Affiliation(s)
- David Lähnemann
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf, Germany
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Johannes Köster
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA
- Ewa Szczurek
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
- Davis J. McCarthy
- Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia
- Melbourne Integrative Genomics, School of BioSciences–School of Mathematics & Statistics, Faculty of Science, University of Melbourne, Melbourne, Australia
- Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD USA
- Mark D. Robinson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zürich, Zürich, Switzerland
- Catalina A. Vallejos
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK
- The Alan Turing Institute, British Library, London, UK
- Kieran R. Campbell
- Department of Statistics, University of British Columbia, Vancouver, Canada
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
- Data Science Institute, University of British Columbia, Vancouver, Canada
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Ahmed Mahfouz
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Luca Pinello
- Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA
- Department of Pathology, Harvard Medical School, Boston, USA
- Broad Institute of Harvard and MIT, Cambridge, MA USA
- Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, USA
- Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Samuel Aparicio
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
- Jasmijn Baaijens
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Marleen Balvert
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
- Buys de Barbanson
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
- Antonio Cappuccio
- Institute for Advanced Study, University of Amsterdam, Amsterdam, The Netherlands
- Giacomo Corleone
- Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, London, UK
- Bas E. Dutilh
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands
- Maria Florescu
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
- Victor Guryev
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Rens Holmer
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Katharina Jahn
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Thamar Jessurun Lobo
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Emma M. Keizer
- Biometris, Wageningen University & Research, Wageningen, The Netherlands
- Indu Khatri
- Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, Leiden, The Netherlands
- Szymon M. Kielbasa
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Jan O. Korbel
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Alexey M. Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Tzu-Hao Kuo
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Boudewijn P.F. Lelieveldt
- PRB lab, Delft University of Technology, Delft, The Netherlands
- Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Ion I. Mandoiu
- Computer Science & Engineering Department, University of Connecticut, Storrs, USA
- John C. Marioni
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- Max Planck Institute for Informatics, Saarbrücken, Germany
- Felix Mölder
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Amir Niknejad
- Computation molecular design, Zuse Institute Berlin, Berlin, Germany
- Mathematics Department, Mount Saint Vincent, New York, USA
- Alicja Rączkowska
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
- Marcel Reinders
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Jeroen de Ridder
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Antoine-Emmanuel Saliba
- Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany
- Antonios Somarakis
- Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Oliver Stegle
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center–DKFZ, Heidelberg, Germany
- Fabian J. Theis
- Institute of Computational Biology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany
- Huan Yang
- Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research–LACDR–Leiden University, Leiden, The Netherlands
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Alice C. McHardy
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Sohrab P. Shah
- Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA
- Alexander Schönhuth
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
107
Geddes TA, Kim T, Nan L, Burchfield JG, Yang JYH, Tao D, Yang P. Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis. BMC Bioinformatics 2019; 20:660. [PMID: 31870278 PMCID: PMC6929272 DOI: 10.1186/s12859-019-3179-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 10/23/2019] [Accepted: 10/28/2019] [Indexed: 01/23/2023] Open
Abstract
Background: Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification. Results: Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used. Conclusions: Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses.
The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS
Affiliation(s)
- Thomas A Geddes
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Taiyun Kim
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Lihao Nan
- UBTECH Sydney Artificial Intelligence Centre and the School of Computer Science, Faculty of Engineering and Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia
- James G Burchfield
- Charles Perkins Centre, School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Jean Y H Yang
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Dacheng Tao
- UBTECH Sydney Artificial Intelligence Centre and the School of Computer Science, Faculty of Engineering and Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia
- Pengyi Yang
- Charles Perkins Centre, School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Sydney, NSW 2006, Australia
- Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW 2145, Australia
108
Cao Y, Lin Y, Ormerod JT, Yang P, Yang JYH, Lo KK. scDC: single cell differential composition analysis. BMC Bioinformatics 2019; 20:721. [PMID: 31870280 PMCID: PMC6929335 DOI: 10.1186/s12859-019-3211-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 11/07/2019] [Accepted: 11/12/2019] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND: Differences in cell-type composition across subjects and conditions often carry biological significance. Recent advancements in single cell sequencing technologies enable cell-types to be identified at the single cell level, and as a result, cell-type composition of tissues can now be studied in exquisite detail. However, a number of challenges remain with cell-type composition analysis: none of the existing methods can identify cell-type perfectly, and variability related to cell sampling exists in any single cell experiment. This necessitates the development of a method for estimating uncertainty in cell-type composition. RESULTS: We developed a novel single cell differential composition (scDC) analysis method that performs differential cell-type composition analysis via bootstrap resampling. scDC captures the uncertainty associated with cell-type proportions of each subject via bias-corrected and accelerated bootstrap confidence intervals. We assessed the performance of our method using a number of simulated datasets and synthetic datasets curated from publicly available single cell datasets. In simulated datasets, scDC correctly recovered the true cell-type proportions. In synthetic datasets, the cell-type compositions returned by scDC were highly concordant with reference cell-type compositions from the original data. Since the majority of datasets tested in this study have only 2 to 5 subjects per condition, the addition of confidence intervals enabled better comparisons of compositional differences between subjects and across conditions. CONCLUSIONS: scDC is a novel statistical method for performing differential cell-type composition analysis for scRNA-seq data. It uses bootstrap resampling to estimate the standard errors associated with cell-type proportion estimates and performs significance testing through GLM and GLMM models.
We have made this method available to the scientific community as part of the scdney (Single Cell Data Integrative Analysis) R package, available from https://github.com/SydneyBioX/scdney.
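The bootstrap step described in this abstract can be sketched as follows. This is not the scDC code: the function name `bootstrap_proportions` and its parameters are hypothetical, and for brevity the sketch uses plain percentile intervals rather than the bias-corrected and accelerated (BCa) intervals the authors report. Cells from one subject are resampled with replacement, cell-type proportions are recomputed on each resample, and the spread of those proportions gives an interval.

```python
import numpy as np

def bootstrap_proportions(labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap intervals for cell-type proportions in one subject.

    labels: one cell-type label per cell.
    Returns {cell_type: (point_estimate, lower, upper)}.
    """
    rng = np.random.default_rng(seed)
    types, codes = np.unique(np.asarray(labels), return_inverse=True)
    n = codes.size
    props = np.empty((n_boot, types.size))
    for b in range(n_boot):
        # Resample cells with replacement and recompute the composition.
        sample = rng.choice(codes, size=n, replace=True)
        props[b] = np.bincount(sample, minlength=types.size) / n
    point = np.bincount(codes, minlength=types.size) / n
    lower = np.quantile(props, alpha / 2, axis=0)
    upper = np.quantile(props, 1 - alpha / 2, axis=0)
    return {str(t): (point[i], lower[i], upper[i]) for i, t in enumerate(types)}

# Toy subject with 60 cells of type A and 40 of type B.
intervals = bootstrap_proportions(["A"] * 60 + ["B"] * 40)
print(intervals)
```

With only a few subjects per condition, as the abstract notes, intervals of this kind are what make between-subject comparisons of composition interpretable. The significance testing via GLM and GLMM is a separate step not shown here.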
Affiliation(s)
- Yue Cao
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Yingxin Lin
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- John T Ormerod
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Pengyi Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW 2145, Australia
- Jean Y H Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Kitty K Lo
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
109
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol 2019; 20:295. [PMID: 31870412 PMCID: PMC6927135 DOI: 10.1186/s13059-019-1861-6] [Citation(s) in RCA: 206] [Impact Index Per Article: 41.2] [Received: 03/12/2019] [Accepted: 10/15/2019] [Indexed: 12/23/2022] Open
Abstract
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
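Feature selection by deviance can be illustrated with a short sketch. It assumes a per-gene binomial approximation to the multinomial model and a null hypothesis that each gene takes a constant share of every cell's total UMIs; the function name `binomial_deviance` is hypothetical, and the sketch is not taken from the authors' software. Genes whose counts depart most from the constant-share null receive the largest deviance and would be ranked first.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance against a constant-proportion null.

    counts: cells x genes matrix of UMI counts.
    Returns one deviance value per gene; larger means more informative.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)      # total UMIs per cell
    pi = counts.sum(axis=0) / n.sum()          # pooled share of each gene
    expected = n * pi                          # null expectation per cell and gene
    rest = n - counts                          # UMIs from all other genes
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with a zero count contribute nothing (0 * log 0 is taken as 0).
        t1 = np.where(counts > 0, counts * np.log(counts / expected), 0.0)
        t2 = np.where(rest > 0, rest * np.log(rest / (n - expected)), 0.0)
    return 2.0 * (t1 + t2).sum(axis=0)

# Toy data: gene 0 has the same share in both cells, genes 1 and 2 vary.
counts = np.array([[5, 10, 5],
                   [5, 0, 15]])
dev = binomial_deviance(counts)
ranking = np.argsort(dev)[::-1]   # gene indices, most informative first
print(dev, ranking)
```

Because the statistic is computed on raw counts, no log transformation or counts-per-million normalization is needed beforehand, which is the contrast the abstract draws with selection by highly variable genes.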
Affiliation(s)
- F. William Townes
- Department of Biostatistics, Harvard University, Cambridge, MA USA
- Present Address: Department of Computer Science, Princeton University, Princeton, NJ USA
| | - Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD USA
| | - Martin J. Aryee
- Department of Biostatistics, Harvard University, Cambridge, MA USA
- Molecular Pathology Unit, Massachusetts General Hospital, Charlestown, MA USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA USA
- Department of Pathology, Harvard Medical School, Boston, MA USA
| | - Rafael A. Irizarry
- Department of Biostatistics, Harvard University, Cambridge, MA USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA USA
| |
|
110
|
Cheng C, Easton J, Rosencrance C, Li Y, Ju B, Williams J, Mulder HL, Pang Y, Chen W, Chen X. Latent cellular analysis robustly reveals subtle diversity in large-scale single-cell RNA-seq data. Nucleic Acids Res 2019; 47:e143. [PMID: 31566233 PMCID: PMC6902034 DOI: 10.1093/nar/gkz826] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/21/2019] [Revised: 08/30/2019] [Accepted: 09/26/2019] [Indexed: 12/21/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for characterizing the cell-to-cell variation and cellular dynamics in populations which appear homogeneous otherwise in basic and translational biological research. However, significant challenges arise in the analysis of scRNA-seq data, including the low signal-to-noise ratio with high data sparsity, potential batch effects, scalability problems when hundreds of thousands of cells are to be analyzed among others. The inherent complexities of scRNA-seq data and dynamic nature of cellular processes lead to suboptimal performance of many currently available algorithms, even for basic tasks such as identifying biologically meaningful heterogeneous subpopulations. In this study, we developed the Latent Cellular Analysis (LCA), a machine learning-based analytical pipeline that combines cosine-similarity measurement by latent cellular states with a graph-based clustering algorithm. LCA provides heuristic solutions for population number inference, dimension reduction, feature selection, and control of technical variations without explicit gene filtering. We show that LCA is robust, accurate, and powerful by comparison with multiple state-of-the-art computational methods when applied to large-scale real and simulated scRNA-seq data. Importantly, the ability of LCA to learn from representative subsets of the data provides scalability, thereby addressing a significant challenge posed by growing sample sizes in scRNA-seq data analysis.
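The cosine-similarity measurement named in this abstract can be sketched as follows. This is not the LCA pipeline: the abstract does not say how latent cellular states are derived, so the sketch assumes each cell is already represented by a short numeric vector, and `cosine_similarity`, `similarity_matrix`, and `cells` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is
        # treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of lists."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Three toy cells in a two-dimensional latent space (made-up values).
cells = [
    [1.0, 0.0],
    [2.0, 0.0],  # same direction as the first cell, different magnitude
    [0.0, 3.0],  # orthogonal to the first two
]
sim = similarity_matrix(cells)
```

Because cosine similarity ignores vector length, the first two cells are maximally similar even though their magnitudes differ. A matrix of this kind could then be thresholded or turned into a nearest-neighbour graph before a graph-based clustering step.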
Affiliation(s)
- Changde Cheng
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - John Easton
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Celeste Rosencrance
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Yan Li
- The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
| | - Bensheng Ju
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Justin Williams
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Heather L Mulder
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Yakun Pang
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Wenan Chen
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Xiang Chen
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| |
|
111
|
Krzak M, Raykov Y, Boukouvalas A, Cutillo L, Angelini C. Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods. Front Genet 2019; 10:1253. [PMID: 31921297 PMCID: PMC6918801 DOI: 10.3389/fgene.2019.01253] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 07/19/2019] [Accepted: 11/13/2019] [Indexed: 01/04/2023] Open
Abstract
Single-cell RNA-seq (scRNAseq) is a powerful tool to study heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and the impact that these choices can have in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and describe the ranges of possibilities that are offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in terms of dimensionality, number of cell populations or levels of noise. Remarkably, the results presented here show that great variability in the performance of the models is strongly attributed to the choice of the user-specific parameter settings. We describe several tendencies in the performance attributed to their modes of usage and different types of datasets, and identify which methods are strongly affected by data dimensionality in terms of computational time. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.
Affiliation(s)
- Monika Krzak
- Institute for Applied Mathematics “Mauro Picone”, Naples, Italy
| | - Yordan Raykov
- Department of Mathematics, Aston University, Birmingham, United Kingdom
| | | | - Luisa Cutillo
- School of Mathematics, University of Leeds, Leeds, United Kingdom
| | | |
|
112
|
Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol 2019; 20:269. [PMID: 31823809 PMCID: PMC6902413 DOI: 10.1186/s13059-019-1898-6] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 05/16/2019] [Accepted: 11/22/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq. RESULTS We aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used dimensionality reduction methods for scRNA-seq studies. Specifically, we compare 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different dimensionality reduction methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluate the computational scalability of different dimensionality reduction methods by recording their computational cost. CONCLUSIONS Based on the comprehensive evaluation results, we provide important guidelines for choosing dimensionality reduction methods for scRNA-seq data analysis. We also provide all analysis scripts used in the present study at www.xzlab.org/reproduce.html.
Affiliation(s)
- Shiquan Sun
- School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, People's Republic of China
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Jiaqiang Zhu
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Ying Ma
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA.
- Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA.
| |
|
113
|
Chaudhry F, Isherwood J, Bawa T, Patel D, Gurdziel K, Lanfear DE, Ruden DM, Levy PD. Single-Cell RNA Sequencing of the Cardiovascular System: New Looks for Old Diseases. Front Cardiovasc Med 2019; 6:173. [PMID: 31921894 PMCID: PMC6914766 DOI: 10.3389/fcvm.2019.00173] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 08/08/2019] [Accepted: 11/12/2019] [Indexed: 12/18/2022] Open
Abstract
Cardiovascular disease encompasses a wide range of conditions, resulting in the highest number of deaths worldwide. The underlying pathologies surrounding cardiovascular disease include a vast and complicated network of both cellular and molecular mechanisms. Unique phenotypic alterations in specific cell types, visualized as varying RNA expression-levels (both coding and non-coding), have been identified as crucial factors in the pathology underlying conditions such as heart failure and atherosclerosis. Recent advances in single-cell RNA sequencing (scRNA-seq) have elucidated a new realm of cell subpopulations and transcriptional variations that are associated with normal and pathological physiology in a wide variety of diseases. This breakthrough in the phenotypical understanding of our cells has brought novel insight into cardiovascular basic science. scRNA-seq allows for separation of widely distinct cell subpopulations which were, until recently, simply averaged together with bulk-tissue RNA-seq. scRNA-seq has been used to identify novel cell types in the heart and vasculature that could be implicated in a variety of disease pathologies. Furthermore, scRNA-seq has been able to identify significant heterogeneity of phenotypes within individual cell subtype populations. The ability to characterize single cells based on transcriptional phenotypes allows researchers the ability to map development of cells and identify changes in specific subpopulations due to diseases at a very high throughput. This review looks at recent scRNA-seq studies of various aspects of the cardiovascular system and discusses their potential value to our understanding of the cardiovascular system and pathology.
Affiliation(s)
- Farhan Chaudhry
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| | - Jenna Isherwood
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, United States
| | - Tejeshwar Bawa
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| | - Dhruvil Patel
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| | - Katherine Gurdziel
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, United States
| | - David E Lanfear
- Heart and Vascular Institute, Henry Ford Health System, Detroit, MI, United States
| | - Douglas M Ruden
- Department of Obstetrics and Gynecology, Center for Urban Responses to Environmental Stressors, Wayne State University, Detroit, MI, United States
| | - Phillip D Levy
- Department of Emergency Medicine and Integrative Biosciences Center, Wayne State University, Detroit, MI, United States
| |
|
114
|
Tarashansky AJ, Xue Y, Li P, Quake SR, Wang B. Self-assembling manifolds in single-cell RNA sequencing data. eLife 2019; 8:e48994. [PMID: 31524596 PMCID: PMC6795480 DOI: 10.7554/elife.48994] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 06/03/2019] [Accepted: 09/16/2019] [Indexed: 12/14/2022] Open
Abstract
Single-cell RNA sequencing has spurred the development of computational methods that enable researchers to classify cell types, delineate developmental trajectories, and measure molecular responses to external perturbations. Many of these technologies rely on their ability to detect genes whose cell-to-cell variations arise from the biological processes of interest rather than transcriptional or technical noise. However, for datasets in which the biologically relevant differences between cells are subtle, identifying these genes is challenging. We present the self-assembling manifold (SAM) algorithm, an iterative soft feature selection strategy to quantify gene relevance and improve dimensionality reduction. We demonstrate its advantages over other state-of-the-art methods with experimental validation in identifying novel stem cell populations of Schistosoma mansoni, a prevalent parasite that infects hundreds of millions of people. Extending our analysis to a total of 56 datasets, we show that SAM is generalizable and consistently outperforms other methods in a variety of biological and quantitative benchmarks.
Affiliation(s)
| | - Yuan Xue
- Department of Bioengineering, Stanford University, Stanford, United States
| | - Pengyang Li
- Department of Bioengineering, Stanford University, Stanford, United States
| | - Stephen R Quake
- Department of Bioengineering, Stanford University, Stanford, United States
- Department of Applied Physics, Stanford University, Stanford, United States
- Chan Zuckerberg Biohub, San Francisco, United States
| | - Bo Wang
- Department of Bioengineering, Stanford University, Stanford, United States
- Department of Developmental Biology, Stanford University School of Medicine, Stanford, United States
| |
|
115
|
Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, Mahfouz A. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol 2019; 20:194. [PMID: 31500660 PMCID: PMC6734286 DOI: 10.1186/s13059-019-1795-z] [Citation(s) in RCA: 305] [Impact Index Per Article: 61.0] [Received: 05/18/2019] [Accepted: 08/17/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. RESULTS Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods' sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. CONCLUSIONS We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub ( https://github.com/tabdelaal/scRNAseq_Benchmark ). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.
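Two of the evaluation criteria listed in this abstract, accuracy and percentage of unclassified cells, can be sketched as below. The abstract does not define them precisely, so this sketch assumes that accuracy is computed over all cells (an unclassified cell counts as wrong) and that rejected cells carry the placeholder label "Unassigned". The function `evaluate_predictions` and the example labels are hypothetical and are not taken from the benchmark's code.

```python
def evaluate_predictions(true_labels, predicted_labels, unassigned="Unassigned"):
    """Summarise cell-type predictions against known annotations.

    Returns a dict with:
      accuracy             fraction of all cells whose prediction matches
      percent_unclassified percentage of cells the classifier rejected
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one cell is required")

    n_cells = len(true_labels)
    n_correct = sum(
        1 for truth, call in zip(true_labels, predicted_labels) if truth == call
    )
    n_unclassified = sum(1 for call in predicted_labels if call == unassigned)
    return {
        "accuracy": n_correct / n_cells,
        "percent_unclassified": 100.0 * n_unclassified / n_cells,
    }


# Toy example: four cells, one rejected and one misclassified.
truth = ["T cell", "B cell", "T cell", "NK cell"]
calls = ["T cell", "B cell", "Unassigned", "T cell"]
result = evaluate_predictions(truth, calls)
```

Reporting the two numbers side by side matters because a classifier with a rejection option can raise its apparent precision simply by declining to label difficult cells.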
Affiliation(s)
- Tamim Abdelaal
- Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
| | - Lieke Michielsen
- Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
| | - Davy Cats
- Sequencing Analysis Support Core, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
| | - Dylan Hoogduin
- Sequencing Analysis Support Core, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
| | - Hailiang Mei
- Sequencing Analysis Support Core, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
| | - Marcel J. T. Reinders
- Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
| | - Ahmed Mahfouz
- Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
- Delft Bioinformatics Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands
| |
|
116
|
Yu X, Chen YA, Conejo-Garcia JR, Chung CH, Wang X. Estimation of immune cell content in tumor using single-cell RNA-seq reference data. BMC Cancer 2019; 19:715. [PMID: 31324168 PMCID: PMC6642583 DOI: 10.1186/s12885-019-5927-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/09/2019] [Accepted: 07/12/2019] [Indexed: 12/12/2022] Open
Abstract
Background The rapid development of single-cell RNA sequencing (scRNA-seq) provides unprecedented opportunities to study the tumor ecosystem that involves a heterogeneous mixture of cell types. However, the majority of previous and current studies related to translational and molecular oncology have only focused on the bulk tumor and there is a wealth of gene expression data accumulated with matched clinical outcomes. Results In this paper, we introduce a scheme for characterizing cell compositions from bulk tumor gene expression by integrating signatures learned from scRNA-seq data. We derived the reference expression matrix to each cell type based on cell subpopulations identified in head and neck cancer dataset. Our results suggest that scRNA-Seq-derived reference matrix outperforms the existing gene panel and reference matrix with respect to distinguishing immune cell subtypes. Conclusions Findings and resources created from this study enable future and secondary analysis of tumor RNA mixtures in head and neck cancer for a more accurate cellular deconvolution, and can facilitate the profiling of the immune infiltration in other solid tumors due to the expression homogeneity observed in immune cells.
Affiliation(s)
- Xiaoqing Yu
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
| | - Y Ann Chen
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
| | - Jose R Conejo-Garcia
- Department of Immunology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
| | - Christine H Chung
- Department of Head and Neck-Endocrine Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
| | - Xuefeng Wang
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA.
| |
|
117
|
Weber LM, Saelens W, Cannoodt R, Soneson C, Hapfelmeier A, Gardner PP, Boulesteix AL, Saeys Y, Robinson MD. Essential guidelines for computational method benchmarking. Genome Biol 2019; 20:125. [PMID: 31221194 PMCID: PMC6584985 DOI: 10.1186/s13059-019-1738-8] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Indexed: 12/14/2022] Open
Abstract
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.
Affiliation(s)
- Lukas M Weber
- Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
| | - Wouter Saelens
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
| | - Robrecht Cannoodt
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
| | - Charlotte Soneson
- Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
- Present address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
| | - Alexander Hapfelmeier
- Institute of Medical Informatics, Statistics and Epidemiology, Technical University of Munich, 81675, Munich, Germany
| | - Paul P Gardner
- Department of Biochemistry, University of Otago, Dunedin, 9016, New Zealand
| | - Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University, 81377, Munich, Germany
| | - Yvan Saeys
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium.
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium.
| | - Mark D Robinson
- Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland.
| |
|
118
|
Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 2019; 15:e8746. [PMID: 31217225 PMCID: PMC6582955 DOI: 10.15252/msb.20188746] [Citation(s) in RCA: 953] [Impact Index Per Article: 190.6] [Received: 11/16/2018] [Revised: 03/15/2019] [Accepted: 04/03/2019] [Indexed: 12/21/2022] Open
Abstract
Single-cell RNA-seq has enabled gene expression to be studied at an unprecedented resolution. The promise of this technology is attracting a growing user base for single-cell analysis methods. As more analysis tools are becoming available, it is becoming increasingly difficult to navigate this landscape and produce an up-to-date workflow to analyse one's data. Here, we detail the steps of a typical single-cell RNA-seq analysis, including pre-processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell- and gene-level downstream analysis. We formulate current best-practice recommendations for these steps based on independent comparison studies. We have integrated these best-practice recommendations into a workflow, which we apply to a public dataset to further illustrate how these steps work in practice. Our documented case study can be found at https://www.github.com/theislab/single-cell-tutorial. This review will serve as a workflow tutorial for new entrants into the field, and help established users update their analysis pipelines.
Affiliation(s)
- Malte D Luecken
- Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
- Department of Mathematics, Technische Universität München, Garching bei München, Germany
| |
|
119
|
Crow M, Gillis J. Single cell RNA-sequencing: replicability of cell types. Curr Opin Neurobiol 2019; 56:69-77. [PMID: 30654233 PMCID: PMC6551252 DOI: 10.1016/j.conb.2018.12.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/05/2018] [Revised: 12/03/2018] [Accepted: 12/09/2018] [Indexed: 01/09/2023]
Abstract
Recent technical advances have enabled transcriptomics experiments at an unprecedented scale, and single-cell profiles from neural tissue are accumulating rapidly. There has been considerable effort to use these profiles to understand cell diversity, primarily through unsupervised clustering and differential expression analysis. However, current practices to validate these findings vary. In this review, we describe recent efforts to evaluate clusters from single-cell RNA-sequencing data, and provide a framework for considering current evidence and practices in terms of their capacity to establish principles of cell biology. Single-cell RNA-sequencing has already transformed neuroscience. By facilitating detailed comparative and genetic perturbation analyses, it may provide the tools to uncover fundamental mechanisms of neural diversity throughout the tree of life.
Affiliation(s)
- Megan Crow
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
| | - Jesse Gillis
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA.
| |
|
120
|
Ye W, Ji G, Ye P, Long Y, Xiao X, Li S, Su Y, Wu X. scNPF: an integrative framework assisted by network propagation and network fusion for preprocessing of single-cell RNA-seq data. BMC Genomics 2019; 20:347. [PMID: 31068142 PMCID: PMC6505295 DOI: 10.1186/s12864-019-5747-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/21/2018] [Accepted: 04/29/2019] [Indexed: 12/15/2022] Open
Abstract
Background Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful tool for profiling genome-scale transcriptomes of individual cells and capturing transcriptome-wide cell-to-cell variability. However, scRNA-seq technologies suffer from high levels of technical noise and variability, hindering reliable quantification of lowly and moderately expressed genes. Since most downstream analyses on scRNA-seq, such as cell type clustering and differential expression analysis, rely on the gene-cell expression matrix, preprocessing of scRNA-seq data is a critical preliminary step in the analysis of scRNA-seq data. Results We presented scNPF, an integrative scRNA-seq preprocessing framework assisted by network propagation and network fusion, for recovering gene expression loss, correcting gene expression measurements, and learning similarities between cells. scNPF leverages the context-specific topology inherent in the given data and the priori knowledge derived from publicly available molecular gene-gene interaction networks to augment gene-gene relationships in a data driven manner. We have demonstrated the great potential of scNPF in scRNA-seq preprocessing for accurately recovering gene expression values and learning cell similarity networks. Comprehensive evaluation of scNPF across a wide spectrum of scRNA-seq data sets showed that scNPF achieved comparable or higher performance than the competing approaches according to various metrics of internal validation and clustering accuracy. We have made scNPF an easy-to-use R package, which can be used as a versatile preprocessing plug-in for most existing scRNA-seq analysis pipelines or tools. Conclusions scNPF is a universal tool for preprocessing of scRNA-seq data, which jointly incorporates the global topology of priori interaction networks and the context-specific information encapsulated in the scRNA-seq data to capture both shared and complementary knowledge from diverse data sources. 
scNPF could be used to recover gene signatures and learn cell-to-cell similarities from emerging scRNA-seq data to facilitate downstream analyses such as dimension reduction, cell type clustering, and visualization.
Affiliation(s)
- Wenbin Ye
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Guoli Ji
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
| | - Pengchao Ye
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Yuqi Long
- Software Quality Testing Engineering Research Center, China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou, 510610, China
| | - Xuesong Xiao
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Shuchao Li
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Yaru Su
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350116, China
| | - Xiaohui Wu
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
|
121
|
Sun Z, Chen L, Xin H, Jiang Y, Huang Q, Cillo AR, Tabib T, Kolls JK, Bruno TC, Lafyatis R, Vignali DAA, Chen K, Ding Y, Hu M, Chen W. A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nat Commun 2019; 10:1649. [PMID: 30967541 PMCID: PMC6456731 DOI: 10.1038/s41467-019-09639-3] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 08/06/2018] [Accepted: 03/15/2019] [Indexed: 02/08/2023]
Abstract
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Results from extensive simulation studies and applications of BAMM-SC to in-house experimental scRNA-seq datasets using blood, lung and skin cells from humans or mice demonstrate that BAMM-SC outperformed existing clustering methods with considerably improved clustering accuracy, particularly in the presence of heterogeneity among individuals. With the development of large-scale single-cell RNA-seq technology, population-scale scRNA-seq studies are emerging. Here, the authors develop BAMM-SC, a tool for clustering droplet-based scRNA-seq data from multiple individuals simultaneously.
Affiliation(s)
- Zhe Sun
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, 15261, USA
| | - Li Chen
- Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL, 36849, USA
| | - Hongyi Xin
- Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, 15224, USA
| | - Yale Jiang
- Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, 15224, USA; School of Medicine, Tsinghua University, Beijing, 100084, China
| | - Qianhui Huang
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Anthony R Cillo
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15262, USA
| | - Tracy Tabib
- Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15261, USA
| | - Jay K Kolls
- School of Medicine, Tulane University, New Orleans, LA, 70112, USA
| | - Tullia C Bruno
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15262, USA; Tumor Microenvironment Center, UPMC Hillman Cancer Center, Pittsburgh, PA, 15232, USA
| | - Robert Lafyatis
- Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15261, USA
| | - Dario A A Vignali
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15262, USA; Tumor Microenvironment Center, UPMC Hillman Cancer Center, Pittsburgh, PA, 15232, USA; Cancer Immunology and Immunotherapy Program, UPMC Hillman Cancer Center, Pittsburgh, PA, 15232, USA
| | - Kong Chen
- Division of Pulmonary, Allergy and Critical Care Medicine, Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, USA
| | - Ying Ding
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, 15261, USA.
| | - Ming Hu
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA.
| | - Wei Chen
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, 15261, USA; Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, 15224, USA
|
122
|
Choi YH, Kim JK. Dissecting Cellular Heterogeneity Using Single-Cell RNA Sequencing. Mol Cells 2019; 42:189-199. [PMID: 30764602 PMCID: PMC6449718 DOI: 10.14348/molcells.2019.2446] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 12/11/2019] [Revised: 01/09/2019] [Accepted: 01/09/2019] [Indexed: 12/22/2022]
Abstract
Cell-to-cell variability in gene expression exists even in a homogeneous population of cells. Dissecting such cellular heterogeneity within a biological system is a prerequisite for understanding how a biological system is developed, homeostatically regulated, and responds to external perturbations. Single-cell RNA sequencing (scRNA-seq) allows the quantitative and unbiased characterization of cellular heterogeneity by providing genome-wide molecular profiles from tens of thousands of individual cells. A major question in analyzing scRNA-seq data is how to account for the observed cell-to-cell variability. In this review, we provide an overview of scRNA-seq protocols, computational approaches for dissecting cellular heterogeneity, and future directions of single-cell transcriptomic analysis.
Affiliation(s)
- Yoon Ha Choi
- Department of New Biology, DGIST, Daegu 42988, Korea
|
123
|
Diaz-Mejia JJ, Meng EC, Pico AR, MacParland SA, Ketela T, Pugh TJ, Bader GD, Morris JH. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data. F1000Res 2019; 8:ISCB Comm J-296. [PMID: 31508207 PMCID: PMC6720041 DOI: 10.12688/f1000research.18490.3] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Accepted: 10/09/2019] [Indexed: 01/28/2023]
Abstract
Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated steps from normalization to cell clustering. However, assigning cell type labels to cell clusters is often conducted manually, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. This is partially due to the scarcity of reference cell type signatures and because some methods support limited cell type signatures. Methods: In this study, we benchmarked five methods representing first-generation enrichment analysis (ORA), second-generation approaches (GSEA and GSVA), machine learning tools (CIBERSORT) and network-based neighbor voting (METANEIGHBOR), for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used five scRNA-seq datasets: human liver, 11 Tabula Muris mouse tissues, two human peripheral blood mononuclear cell datasets, and mouse retinal neurons, for which reference cell type signatures were available. The datasets span Drop-seq, 10X Chromium and Seq-Well technologies and range in size from ~3,700 to ~68,000 cells. Results: Our results show that, in general, all five methods perform well in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.91, sd = 0.06), whereas precision-recall analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). We observed an influence of the number of genes in cell type signatures on performance, with smaller signatures leading more frequently to incorrect results. Conclusions: GSVA was the overall top performer and was more robust in cell type signature subsampling simulations, although different methods performed well using different datasets. METANEIGHBOR and GSVA were the fastest methods. CIBERSORT and METANEIGHBOR were more influenced than the other methods by analyses including only expected cell types. 
We provide an extensible framework that can be used to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.
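To make the simplest of the benchmarked approaches concrete, the following is a minimal sketch of over-representation analysis (ORA) for labeling a cluster. It assumes the conventional one-sided hypergeometric test; the abstract does not specify which test the study used, and the function names, gene identifiers, and signatures here are hypothetical.

```python
from math import comb

def ora_pvalue(cluster_genes, signature, universe):
    """One-sided hypergeometric p-value for the overlap between a
    cluster's marker genes and a cell type signature."""
    universe = set(universe)
    cluster = set(cluster_genes) & universe
    sig = set(signature) & universe
    n_universe, n_sig, n_cluster = len(universe), len(sig), len(cluster)
    overlap = len(cluster & sig)
    # P(X >= overlap) when drawing n_cluster genes without replacement.
    tail = sum(
        comb(n_sig, k) * comb(n_universe - n_sig, n_cluster - k)
        for k in range(overlap, min(n_sig, n_cluster) + 1)
    )
    return tail / comb(n_universe, n_cluster)

def label_cluster(cluster_genes, signatures, universe):
    """Return the signature name with the smallest p-value, plus all p-values."""
    pvals = {
        name: ora_pvalue(cluster_genes, genes, universe)
        for name, genes in signatures.items()
    }
    return min(pvals, key=pvals.get), pvals

# Hypothetical toy data: 10 genes, two 3-gene signatures.
universe = [f"g{i}" for i in range(10)]
signatures = {"type_A": ["g0", "g1", "g2"], "type_B": ["g7", "g8", "g9"]}
best, pvals = label_cluster(["g0", "g1", "g2"], signatures, universe)
```

With only three genes per signature, the smallest achievable p-value here is 1/120, which gives some intuition for the abstract's observation that smaller signatures more often lead to incorrect results. A real analysis would also correct for multiple testing across signatures.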
Affiliation(s)
- J. Javier Diaz-Mejia
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
| | - Elaine C. Meng
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
| | | | - Sonya A. MacParland
- Multi-Organ Transplant Program, Toronto General Hospital Research Institute, Toronto, ON, M5G 2C4, Canada
- Department of Immunology, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, M5G 1L7, Canada
| | - Troy Ketela
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
| | - Trevor J. Pugh
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Ontario Institute for Cancer Research, Toronto, ON, M5G 0A3, Canada
| | - Gary D. Bader
- The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5G 1A8, Canada
| | - John H. Morris
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
|
124
|
Diaz-Mejia JJ, Meng EC, Pico AR, MacParland SA, Ketela T, Pugh TJ, Bader GD, Morris JH. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data. F1000Res 2019; 8:ISCB Comm J-296. [PMID: 31508207 PMCID: PMC6720041 DOI: 10.12688/f1000research.18490.1] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Accepted: 03/08/2019] [Indexed: 12/11/2022]
Abstract
Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated computational steps like data normalization, dimensionality reduction and cell clustering. However, assigning cell type labels to cell clusters is still conducted manually by most researchers, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. Two bottlenecks to automating this task are the scarcity of reference cell type gene expression signatures and the fact that some dedicated methods are available only as web servers with limited cell type gene expression signatures. Methods: In this study, we benchmarked four methods (CIBERSORT, GSEA, GSVA, and ORA) for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used scRNA-seq datasets from liver, peripheral blood mononuclear cells and retinal neurons for which reference cell type gene expression signatures were available. Results: Our results show that, in general, all four methods show a high performance in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.94, sd = 0.036), whereas precision-recall curve analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). Conclusions: CIBERSORT and GSVA were the top two performers. Additionally, GSVA was the fastest of the four methods and was more robust in cell type gene expression signature subsampling simulations. We provide an extensible framework to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.
Affiliation(s)
- J. Javier Diaz-Mejia
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
| | - Elaine C. Meng
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
| | | | - Sonya A. MacParland
- Multi-Organ Transplant Program, Toronto General Hospital Research Institute, Toronto, ON, M5G 2C4, Canada
- Department of Immunology, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, M5G 1L7, Canada
| | - Troy Ketela
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
| | - Trevor J. Pugh
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2M9, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Ontario Institute for Cancer Research, Toronto, ON, M5G 0A3, Canada
| | - Gary D. Bader
- The Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5G 1A8, Canada
| | - John H. Morris
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94143, USA
|
125
|
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res 2018; 7:1297. [PMID: 30228881 PMCID: PMC6124389 DOI: 10.12688/f1000research.15809.1] [Citation(s) in RCA: 99] [Impact Index Per Article: 16.5] [Accepted: 08/07/2018] [Indexed: 01/21/2023]
Abstract
Background: The commercially available 10x Genomics protocol to generate droplet-based single-cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or identical cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regard to which method to use. Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as three silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also robustness of a dozen methods. Results: We found that some methods, including Seurat and Cell Ranger, outperform other methods, although performance seems to be dependent on the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other. Conclusions: In light of this, we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.
Affiliation(s)
- Saskia Freytag
- Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
| | - Luyi Tian
- Department of Medical Biology, University of Melbourne, Parkville, Australia
- Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
| | | | - Milica Ng
- Bio21 Institute, CSL Limited, Parkville, Australia
| | - Melanie Bahlo
- Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
|
126
|
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res 2018; 7:1297. [PMID: 30228881 PMCID: PMC6124389 DOI: 10.12688/f1000research.15809.2] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Accepted: 12/14/2018] [Indexed: 12/23/2022]
Abstract
Background: The commercially available 10x Genomics protocol to generate droplet-based single-cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or identical cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regard to which method to use. Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as multiple silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also running time and robustness of a dozen methods. Results: We found that Seurat outperformed other methods, although performance seems to be dependent on many factors, including the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other. Conclusions: In light of this, we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.
Affiliation(s)
- Saskia Freytag
- Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
| | - Luyi Tian
- Department of Medical Biology, University of Melbourne, Parkville, Australia
- Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
| | | | - Milica Ng
- Bio21 Institute, CSL Limited, Parkville, Australia
| | - Melanie Bahlo
- Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
|