51
Jiang T, Zhou W, Sheng Q, Yu J, Xie Y, Ding N, Zhang Y, Xu J, Li Y. ImmCluster: an ensemble resource for immunology cell type clustering and annotations in normal and cancerous tissues. Nucleic Acids Res 2022; 51:D1325-D1332. [PMID: 36271790 PMCID: PMC9825417 DOI: 10.1093/nar/gkac922] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/14/2022] [Revised: 09/22/2022] [Accepted: 10/06/2022] [Indexed: 01/30/2023]
Abstract
Single-cell transcriptomics has enabled the transcriptional profiling of thousands of immune cells in complex tissues and cancers. However, subtle transcriptomic differences in immune cell subpopulations and the high dimensionality of transcriptomic data make the clustering and annotation of immune cells challenging. Herein, we introduce ImmCluster (http://bio-bigdata.hrbmu.edu.cn/ImmCluster) for immunology cell type clustering and annotation. We manually curated 346 well-known marker genes from 1163 studies. ImmCluster integrates over 420 000 immune cells from nine healthy tissues and over 648 000 cells from different tumour samples of 17 cancer types to generate stable marker-gene sets and develop context-specific immunology references. In addition, ImmCluster provides cell clustering using seven reference-based and four marker gene-based computational methods, and an ensemble method was developed to provide more consistent cell clustering than individual methods. Five major analytic modules are provided for interactively exploring the annotations of immune cells: clustering and annotating immune cell clusters; gene expression of markers; functional assignment in cancer hallmarks, cell states and immune pathways; cell-cell communications and the corresponding ligand-receptor interactions; and online tools. ImmCluster generates diverse plots and tables, enabling users to identify significant associations in immune cell clusters simultaneously. ImmCluster is a valuable resource for analysing cellular heterogeneity in cancer microenvironments.
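The ensemble idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how ImmCluster combines its eleven methods, so the per-cell majority vote below is an assumption, and the method names and labels are hypothetical.

```python
from collections import Counter


def ensemble_labels(predictions):
    """Combine per-cell labels from several methods by majority vote.

    `predictions` maps a method name to a list of labels, one per cell.
    Ties go to the label that appears first among the methods.
    """
    methods = list(predictions)
    n_cells = len(predictions[methods[0]])
    consensus = []
    for i in range(n_cells):
        votes = Counter(predictions[m][i] for m in methods)
        consensus.append(votes.most_common(1)[0][0])
    return consensus


# Hypothetical output of three annotation methods for four cells.
calls = {
    "method_a": ["T cell", "B cell", "NK cell", "T cell"],
    "method_b": ["T cell", "B cell", "T cell", "Monocyte"],
    "method_c": ["T cell", "Monocyte", "NK cell", "Monocyte"],
}
consensus = ensemble_labels(calls)
```

A vote of this kind can only smooth over disagreements between methods; it cannot recover a cell type that none of the individual methods propose.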
Affiliation(s)
- Yunjin Xie
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150081, China
- Na Ding
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150081, China
- Yunpeng Zhang
- Correspondence may also be addressed to Yunpeng Zhang.
- Juan Xu
- Correspondence may also be addressed to Juan Xu.
- Yongsheng Li
- To whom correspondence should be addressed. Tel: +86 13604805482
52
Chen Y, Zhang S. Automatic Cell Type Annotation Using Marker Genes for Single-Cell RNA Sequencing Data. Biomolecules 2022; 12:biom12101539. [PMID: 36291748 PMCID: PMC9599378 DOI: 10.3390/biom12101539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2022] [Revised: 10/01/2022] [Accepted: 10/11/2022] [Indexed: 11/16/2022]
Abstract
Recent advancement in single-cell RNA sequencing (scRNA-seq) technology is gaining more and more attention. Cell type annotation plays an essential role in scRNA-seq data analysis. Several computational methods have been proposed for automatic annotation. Traditional cell type annotation is to first cluster the cells using unsupervised learning methods based on the gene expression profiles, then to label the clusters using the aggregated cluster-level expression profiles and the marker genes’ information. Such procedure relies heavily on the clustering results. As the purity of clusters cannot be guaranteed, false detection of cluster features may lead to wrong annotations. In this paper, we improve this procedure and propose an Automatic Cell type Annotation Method (ACAM). ACAM delineates a clear framework to conduct automatic cell annotation through representative cluster identification, representative cluster annotation using marker genes, and the remaining cells’ classification. Experiments on seven real datasets show the better performance of ACAM compared to six well-known cell type annotation methods.
Affiliation(s)
- Yu Chen
- School of Mathematical Sciences, Fudan University, Shanghai 200433, China
- Shuqin Zhang
- School of Mathematical Sciences, Fudan University, Shanghai 200433, China
- Key Laboratory of Mathematics for Nonlinear Science (Ministry of Education), Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China
- Correspondence:
53
Grabski IN, Irizarry RA. A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data. Biostatistics 2022; 23:1150-1164. [PMID: 35770795 PMCID: PMC9802389 DOI: 10.1093/biostatistics/kxac021] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/21/2021] [Revised: 05/10/2022] [Accepted: 05/22/2022] [Indexed: 01/07/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find limitations with current approaches due to the reliance on known marker genes or from overfitting because of systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.
Affiliation(s)
- Isabella N Grabski
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Rafael A Irizarry
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA and Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
54
Schiebout C, Frost HR. CAMML with the Integration of Marker Proteins (ChIMP). Bioinformatics 2022; 38:5206-5213. [PMID: 36214642 PMCID: PMC9710548 DOI: 10.1093/bioinformatics/btac674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Revised: 09/12/2022] [Accepted: 10/06/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Cell typing is a critical task in the analysis of single-cell data, particularly when studying complex diseased tissues. Unfortunately, the sparsity and noise of single-cell data make accurate cell typing of individual cells difficult. To address these challenges, we previously developed the CAMML method for multi-label cell typing of single-cell RNA-sequencing (scRNA-seq) data. CAMML uses weighted gene sets to score each profiled cell for multiple potential cell types. While CAMML outperforms other scRNA-seq cell typing techniques, it only leverages transcriptomic data so cannot take advantage of newer multi-omic single-cell assays that jointly profile gene expression and protein abundance (e.g. joint scRNA-seq/CITE-seq). RESULTS We developed the CAMML with the Integration of Marker Proteins (ChIMP) method to support multi-label cell typing of individual cells jointly profiled via scRNA-seq and CITE-seq. ChIMP combines cell type scores computed on scRNA-seq data via the CAMML approach with discretized CITE-seq measurements for cell type marker proteins. The multi-omic cell type scores generated by ChIMP allow researchers to more precisely and conservatively cell type joint scRNA-seq/CITE-seq data. AVAILABILITY AND IMPLEMENTATION An implementation of this work is available on CRAN at https://cran.r-project.org/web/packages/CAMML/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
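A minimal sketch of the combination step described in this abstract, under loudly stated assumptions: the abstract gives neither the discretization rule nor the way the two signals are merged, so the fixed threshold and the multiplicative gating below are illustrative choices, not the published ChIMP algorithm. All names and numbers are hypothetical.

```python
def discretize(protein_level, threshold):
    """Return 1 if the marker protein is called present, else 0 (assumed rule)."""
    return 1 if protein_level >= threshold else 0


def gated_scores(rna_scores, protein_levels, threshold=1.0):
    """Keep an RNA-based cell type score only when its marker protein is present."""
    return {
        cell_type: score * discretize(protein_levels[cell_type], threshold)
        for cell_type, score in rna_scores.items()
    }


# Hypothetical RNA-derived scores and marker protein levels for one cell.
rna_scores = {"T cell": 0.8, "B cell": 0.6, "NK cell": 0.3}
protein_levels = {"T cell": 2.5, "B cell": 0.2, "NK cell": 1.4}
combined = gated_scores(rna_scores, protein_levels)
```

Gating of this kind is conservative: a cell keeps a non-zero score for a type only if both the transcriptomic score and the protein call support it, and several types can remain non-zero, which preserves the multi-label output.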
55
Wan H, Chen L, Deng M. scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception. Genomics, Proteomics & Bioinformatics 2022; 20:939-958. [PMID: 36608843 PMCID: PMC10025768 DOI: 10.1016/j.gpb.2022.12.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/21/2022] [Revised: 11/30/2022] [Accepted: 12/11/2022] [Indexed: 01/05/2023]
Abstract
Current cell-type annotation tools for single-cell RNA sequencing (scRNA-seq) data mainly utilize well-annotated source data to help identify cell types in target data. However, on account of privacy preservation, their requirements for raw source data may not always be satisfied. In this case, achieving feature alignment between source and target data explicitly is impossible. Additionally, these methods are barely able to discover the presence of novel cell types. A subjective threshold is often selected by users to detect novel cells. We propose a universal annotation framework for scRNA-seq data called scEMAIL, which automatically detects novel cell types without accessing source data during adaptation. For new cell-type identification, a novel cell-type perception module is designed with three steps. First, an expert ensemble system measures uncertainty of each cell from three complementary aspects. Second, based on this measurement, bimodality tests are applied to detect the presence of new cell types. Third, once assured of their presence, an adaptive threshold via manifold mixup partitions target cells into "known" and "unknown" groups. Model adaptation is then conducted to alleviate the batch effect. We gather multi-order neighborhood messages globally and impose local affinity regularizations on "known" cells. These constraints mitigate wrong classifications of the source model via reliable self-supervised information of neighbors. scEMAIL is accurate and robust under various scenarios in both simulation and real data. It is also flexible to be applied to challenging single-cell ATAC-seq data without loss of superiority. The source code of scEMAIL can be accessed at https://github.com/aster-ww/scEMAIL and https://ngdc.cncb.ac.cn/biocode/tools/BT007335/releases/v1.0.
Affiliation(s)
- Hui Wan
- School of Mathematical Sciences, Peking University, Beijing 100871, China
- Liang Chen
- Huawei Technologies Co., Ltd., Beijing 100080, China
- Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Statistical Science, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China
56
Madadi Y, Sun J, Chen H, Williams R, Yousefi S. Detecting retinal neural and stromal cell classes and ganglion cell subtypes based on transcriptome data with deep transfer learning. Bioinformatics 2022; 38:4321-4329. [PMID: 35876552 PMCID: PMC9991888 DOI: 10.1093/bioinformatics/btac514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Revised: 07/11/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION To develop and assess the accuracy of deep learning models that identify different retinal cell types, as well as different retinal ganglion cell (RGC) subtypes, based on patterns of single-cell RNA sequencing (scRNA-seq) in multiple datasets. RESULTS Deep domain adaptation models were developed and tested using three different datasets. The first dataset included 44 808 single retinal cells from mice (39 cell types) with 24 658 genes, the second dataset included 6225 single RGCs from mice (41 subtypes) with 13 616 genes and the third dataset included 35 699 single RGCs from mice (45 subtypes) with 18 222 genes. We used four loss functions in the learning process to align the source and target distributions, reduce misclassification errors and maximize robustness. Models were evaluated based on classification accuracy and confusion matrix. The accuracy of the model for correctly classifying 39 different retinal cell types in the first dataset was ∼92%. Accuracy in the second and third datasets reached ∼97% and 97% in correctly classifying 40 and 45 different RGCs subtypes, respectively. Across a range of seven different batches in the first dataset, the accuracy of the lead model ranged from 74% to nearly 100%. The lead model provided high accuracy in identifying retinal cell types and RGC subtypes based on scRNA-seq data. The performance was reasonable based on data from different batches as well. The validated model could be readily applied to scRNA-seq data to identify different retinal cell types and subtypes. AVAILABILITY AND IMPLEMENTATION The code and datasets are available on https://github.com/DM2LL/Detecting-Retinal-Cell-Classes-and-Ganglion-Cell-Subtypes. We have also added the class labels of all samples to the datasets. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yeganeh Madadi
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- University of Tehran, Tehran, Iran
- Jian Sun
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- Hao Chen
- Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert Williams
- Department of Genetics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Siamak Yousefi
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- Department of Genetics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
57
Li Z, Wang Y, Ganan-Gomez I, Colla S, Do KA. A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data. Bioinformatics 2022; 38:4885-4892. [PMID: 36083008 PMCID: PMC9801963 DOI: 10.1093/bioinformatics/btac617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Revised: 09/06/2022] [Accepted: 09/08/2022] [Indexed: 01/07/2023]
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) has been widely used to decompose complex tissues into functionally distinct cell types. The first and usually the most important step of scRNA-seq data analysis is to accurately annotate the cell labels. In recent years, many supervised annotation methods have been developed and shown to be more convenient and accurate than unsupervised cell clustering. One challenge faced by all the supervised annotation methods is the identification of the novel cell type, which is defined as the cell type that is not present in the training data, only exists in the testing data. Existing methods usually label the cells simply based on the correlation coefficients or confidence scores, which sometimes results in an excessive number of unlabeled cells. RESULTS We developed a straightforward yet effective method combining autoencoder with iterative feature selection to automatically identify novel cells from scRNA-seq data. Our method trains an autoencoder with the labeled training data and applies the autoencoder to the testing data to obtain reconstruction errors. By iteratively selecting features that demonstrate a bi-modal pattern and reclustering the cells using the selected feature, our method can accurately identify novel cells that are not present in the training data. We further combined this approach with a support vector machine to provide a complete solution for annotating the full range of cell types. Extensive numerical experiments using five real scRNA-seq datasets demonstrated favorable performance of the proposed method over existing methods serving similar purposes. AVAILABILITY AND IMPLEMENTATION Our R software package CAMLU is publicly available through the Zenodo repository (https://doi.org/10.5281/zenodo.7054422) or GitHub repository (https://github.com/ziyili20/CAMLU). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
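The final separation step in this abstract can be sketched in a much simplified form. The code below assumes a per-cell reconstruction error is already available (no autoencoder is trained here) and swaps the iterative feature selection and reclustering for a plain one-dimensional two-means split, so it illustrates the bimodality idea rather than the CAMLU procedure. The error values are made up.

```python
def two_means_1d(values, n_iter=20):
    """Split 1-D values into a low and a high group with two-means clustering.

    Returns one boolean per value: True if it falls in the high group.
    """
    lo, hi = min(values), max(values)
    for _ in range(n_iter):
        cut = (lo + hi) / 2
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        if not low or not high:
            break  # only one group left, nothing more to refine
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    cut = (lo + hi) / 2
    return [v > cut for v in values]


# Hypothetical reconstruction errors: cells unlike the training data reconstruct poorly.
errors = [0.11, 0.09, 0.12, 0.95, 0.10, 1.05, 0.08]
is_novel = two_means_1d(errors)
```

Cells flagged `True` would be treated as candidates for a novel type, while the rest would go on to a supervised classifier such as the support vector machine the abstract mentions.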
Affiliation(s)
- Ziyi Li
- To whom correspondence should be addressed.
- Yizhuo Wang
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Irene Ganan-Gomez
- Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Simona Colla
- Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Kim-Anh Do
- To whom correspondence should be addressed.
58
Galdos FX, Xu S, Goodyer WR, Duan L, Huang YV, Lee S, Zhu H, Lee C, Wei N, Lee D, Wu SM. devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data. Nat Commun 2022; 13:5271. [PMID: 36071107 PMCID: PMC9452519 DOI: 10.1038/s41467-022-33045-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/14/2021] [Accepted: 08/31/2022] [Indexed: 11/09/2022]
Abstract
A major informatic challenge in single cell RNA-sequencing analysis is the precise annotation of datasets where cells exhibit complex multilayered identities or transitory states. Here, we present devCellPy a highly accurate and precise machine learning-enabled tool that enables automated prediction of cell types across complex annotation hierarchies. To demonstrate the power of devCellPy, we construct a murine cardiac developmental atlas from published datasets encompassing 104,199 cells from E6.5-E16.5 and train devCellPy to generate a cardiac prediction algorithm. Using this algorithm, we observe a high prediction accuracy (>90%) across multiple layers of annotation and across de novo murine developmental data. Furthermore, we conduct a cross-species prediction of cardiomyocyte subtypes from in vitro-derived human induced pluripotent stem cells and unexpectedly uncover a predominance of left ventricular (LV) identity that we confirmed by an LV-specific TBX5 lineage tracing system. Together, our results show devCellPy to be a useful tool for automated cell prediction across complex cellular hierarchies, species, and experimental systems. A major informatic challenge in single cell RNA-sequencing analysis is the precise annotation of datasets where cells exhibit complex multilayered identities or transitory states. Here the authors present devCellPy, a Python-based package that enables the automated prediction of cell types across complex cellular hierarchies, species, and experimental systems with high accuracy, particularly for developmental scRNA-seq datasets.
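Prediction across an annotation hierarchy, as described in this abstract, can be sketched as a walk down a tree with one classifier per parent label. This is an illustration of the general idea only: the function name, the dictionary layout, and the threshold rules standing in for trained models are all hypothetical and are not devCellPy's API.

```python
def predict_hierarchy(cell, classifiers, root="root"):
    """Walk down an annotation hierarchy, one classifier per parent label.

    `classifiers` maps a parent label to a function that returns a child label.
    The walk stops when the current label has no classifier of its own.
    """
    path = []
    label = root
    while label in classifiers:
        label = classifiers[label](cell)
        path.append(label)
    return path


# Toy stand-ins for trained models: simple threshold rules on gene expression.
classifiers = {
    "root": lambda c: "cardiomyocyte" if c["Tnnt2"] > 1.0 else "endothelial",
    "cardiomyocyte": lambda c: "ventricular" if c["Myl2"] > c["Myl7"] else "atrial",
}
cell = {"Tnnt2": 3.2, "Myl2": 2.0, "Myl7": 0.4}
path = predict_hierarchy(cell, classifiers)
```

Each layer only has to separate the children of one parent, which is what lets accuracy be reported per annotation layer.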
Affiliation(s)
- Francisco X Galdos
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA; Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
- Sidra Xu
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- William R Goodyer
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA; Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA; Division of Pediatric Cardiology, Department of Pediatrics, Stanford University School of Medicine, Palo Alto, USA
- Lauren Duan
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Yuhsin V Huang
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Soah Lee
- Biopharmaceutical Convergence, School of Pharmacy, Sungkyunkwan University, Suwon, South Korea
- Han Zhu
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA; Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, USA
- Carissa Lee
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Nicholas Wei
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Daniel Lee
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Sean M Wu
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA; Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA; Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, USA
59
scWizard: a web-based automated tool for classifying and annotating single cells and downstream analysis of single-cell RNA-seq data in cancers. Comput Struct Biotechnol J 2022; 20:4902-4909. [PMID: 36147672 PMCID: PMC9474308 DOI: 10.1016/j.csbj.2022.08.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Revised: 07/27/2022] [Accepted: 08/12/2022] [Indexed: 11/22/2022]
Abstract
scWizard provides a comprehensive analysis pipeline for integration strategies of cancer scRNA-seq data. scWizard classifies 47 cell subtypes within the TME using a hierarchical deep neural network model. scWizard annotates cell subtypes within the TME more accurately than five other methods. scWizard is a point-and-click tool for researchers without proficient programming skills.
The growing number of single-cell RNA-seq (scRNA-Seq) datasets allows the characterization of cell types across various cancer types. However, there is still a lack of effective tools that integrate the various analyses of single cells, especially for fine annotation of cell subtypes within the tumor microenvironment (TME). We developed scWizard, a point-and-click tool that packages an automated process including our cell annotation method based on deep neural network learning and 11 downstream analysis methods. scWizard used 113,976 cells across 13 cancer types as a built-in reference dataset for training the hierarchical model, which automatically classifies and annotates 7 major cell types and 47 cell subtypes in the TME. scWizard provides a built-in pre-training set that users can choose flexibly, and annotates subtypes of tumor-derived T-lymphocytes/natural killer cells (T/NK) and myeloid cells from different cancer types more accurately than five existing methods. scWizard is robust across three independent cancer datasets, with an accuracy of 0.98 in annotating major cell types, 0.85 in annotating myeloid cell subtypes and 0.79 in annotating T/NK cell subtypes, indicating the wide applicability of scWizard to different cell types in cancers. Finally, the automatic analysis and visualization functions of scWizard are demonstrated using an intrahepatic cholangiocarcinoma (ICC) scRNA-Seq dataset as a case study. scWizard focuses on decoding the TME, covers various analysis flows for cancer scRNA-Seq studies, and provides an easy-to-use tool with a user-friendly interface to further accelerate biological discovery in cancer research.
60
Upadhyay P, Ray S. A Regularized Multi-Task Learning Approach for Cell Type Detection in Single-Cell RNA Sequencing Data. Front Genet 2022; 13:788832. [PMID: 35495159 PMCID: PMC9043858 DOI: 10.3389/fgene.2022.788832] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/03/2021] [Accepted: 02/16/2022] [Indexed: 11/29/2022]
Abstract
Cell type prediction is one of the most challenging goals in single-cell RNA sequencing (scRNA-seq) data. Existing methods use unsupervised learning to identify signature genes in each cluster, followed by a literature survey to look up those genes for assigning cell types. However, finding potential marker genes in each cluster is cumbersome, which impedes the systematic analysis of single-cell RNA sequencing data. To address this challenge, we proposed a framework based on regularized multi-task learning (RMTL) that enables us to simultaneously learn the subpopulation associated with a particular cell type. Learning the structure of subpopulations is treated as a separate task in the multi-task learner. Regularization is used to modulate the multi-task model (e.g., W1, W2, … Wt) jointly, according to the specific prior. For validating our model, we trained it with reference data constructed from a single-cell RNA sequencing experiment and applied it to a query dataset. We also predicted completely independent data (the query dataset) from the reference data which are used for training. We have checked the efficacy of the proposed method by comparing it with other state-of-the-art techniques well known for cell type detection. Results revealed that the proposed method performed accurately in detecting the cell type in scRNA-seq data and thus can be utilized as a useful tool in the scRNA-seq pipeline.
Affiliation(s)
- Piu Upadhyay
- B.P. Poddar Institute of Management and Technology, Kolkata, India
- Sumanta Ray
- Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Health Analytics Network, Pittsburgh, PA, United States
- *Correspondence: Sumanta Ray
61
Yin Q, Liu Q, Fu Z, Zeng W, Zhang B, Zhang X, Jiang R, Lv H. scGraph: a graph neural network-based approach to automatically identify cell types. Bioinformatics 2022; 38:2996-3003. [PMID: 35394015 DOI: 10.1093/bioinformatics/btac199] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/22/2021] [Revised: 12/13/2021] [Accepted: 04/07/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Single cell technologies play a crucial role in revolutionizing biological research over the past decade, which strengthens our understanding in cell differentiation, development, and regulation from a single-cell level perspective. Single-cell RNA sequencing (scRNA-seq) is one of the most common single cell technologies, which enables probing transcriptional states in thousands of cells in one experiment. Identification of cell types from scRNA-seq measurements is a fundamental and crucial question to answer. Most previous studies directly take gene expression as input while ignoring the comprehensive gene-gene interactions. RESULTS We propose scGraph, an automatic cell identification algorithm leveraging gene interaction relationships to enhance the performance of the cell type identification. ScGraph is based on a graph neural network to aggregate the information of interacting genes. In a series of experiments, we demonstrate that scGraph is accurate and outperforms eight comparison methods in the task of cell type identification. Moreover, scGraph automatically learns the gene interaction relationships from biological data and the pathway enrichment analysis shows consistent findings with previous analysis, providing insights on the analysis of regulatory mechanism. AVAILABILITY scGraph is freely available at https://github.com/QijinYin/scGraph and https://figshare.com/articles/software/scGraph/17157743. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
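To make "aggregate the information of interacting genes" concrete, here is a minimal sketch of one round of neighbourhood averaging over a gene interaction graph. It has no learned weights and is not the scGraph architecture; the gene names, the graph, and the choice of a plain mean are assumptions for illustration.

```python
def aggregate(expression, neighbors):
    """Replace each gene's value by the mean of itself and its interacting genes."""
    out = {}
    for gene, value in expression.items():
        group = [value] + [expression[n] for n in neighbors.get(gene, [])]
        out[gene] = sum(group) / len(group)
    return out


# Hypothetical expression values for one cell and a three-gene interaction graph.
expression = {"A": 1.0, "B": 3.0, "C": 5.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
smoothed = aggregate(expression, neighbors)
```

A graph neural network layer follows the same pattern but applies trainable transformations before and after the aggregation, which is what would let a model learn which interactions matter.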
Affiliation(s)
- Qijin Yin
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Qiao Liu
- Department of Statistics, Stanford University, Stanford, CA 94305
- Zhuoran Fu
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Wanwen Zeng
- Department of Statistics, Stanford University, Stanford, CA 94305; College of Software, Nankai University, Tianjin, 300350, China
- Boheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Hairong Lv
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China; Fuzhou Institute of Data Technology, Changle, Fuzhou, 350200, China
62
Cao X, Xing L, Majd E, He H, Gu J, Zhang X. A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data. Front Genet 2022; 13:836798. [PMID: 35281805 PMCID: PMC8905542 DOI: 10.3389/fgene.2022.836798] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/15/2021] [Accepted: 01/18/2022] [Indexed: 11/13/2022]
Abstract
The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. 
The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers.
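The evaluation metrics named in this abstract can be computed directly from true labels, predicted labels, and classifier scores. The following is a minimal sketch for the binary case, assuming 0/1 labels; the function names `binary_metrics` and `roc_auc` are illustrative and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1-score, and false-positive rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example data, for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]
metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

For multi-class cell phenotypes, one common choice is to compute these per class in a one-vs-rest fashion and average them; whether the benchmark did so is not stated in the abstract.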
Affiliation(s)
- Xiaowen Cao
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
  - Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Li Xing
  - Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
- Elham Majd
  - Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Hua He
  - School of Science, Hebei University of Technology, Tianjin, China
- Junhua Gu
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Xuekui Zhang
  - Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
|
63
|
Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022; 13:1246. [PMID: 35273156 PMCID: PMC8913782 DOI: 10.1038/s41467-022-28803-w] [Citation(s) in RCA: 141] [Impact Index Per Article: 70.5] [Received: 06/10/2021] [Accepted: 02/03/2022] [Indexed: 12/29/2022] Open
Abstract
Identification of cell populations often relies on manual annotation of cell clusters using established marker genes. However, the selection of marker genes is a time-consuming process that may lead to sub-optimal annotations, as the markers must be informative of both the individual cell clusters and the various cell types present in the sample. Here, we developed a computational platform, ScType, which enables fully-automated and ultra-fast cell-type identification based solely on given scRNA-seq data, along with a comprehensive cell marker database as background information. Using six scRNA-seq datasets from various human and mouse tissues, we show how ScType provides unbiased and accurate cell type annotations by guaranteeing the specificity of positive and negative marker genes across cell clusters and cell types. We also demonstrate how ScType distinguishes between healthy and malignant cell populations, based on single-cell calling of single-nucleotide variants, making it a versatile tool for anticancer applications. The widely applicable method is deployed both as an interactive web-tool (https://sctype.app) and as an open-source R-package. Cell types are typically identified in single cell transcriptomic data by manual annotation of cell clusters using established marker genes. Here the authors present a fully-automated computational platform that can quickly and accurately distinguish between cell types.
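The idea of scoring a cluster with both positive and negative marker genes can be illustrated with a small sketch. This is a simplified stand-in and not ScType's actual implementation: it assumes expression values are already scaled per gene, and the marker table `marker_db` and the cluster profile are made-up illustrative inputs.

```python
from math import sqrt


def marker_score(expr, positive, negative):
    """Score one cluster for one cell type.

    expr maps gene -> scaled expression for the cluster. Positive markers
    raise the score and negative markers lower it; each sum is divided by
    the square root of the number of markers found so that long marker
    lists do not dominate.
    """
    pos = [expr[g] for g in positive if g in expr]
    neg = [expr[g] for g in negative if g in expr]
    pos_part = sum(pos) / sqrt(len(pos)) if pos else 0.0
    neg_part = sum(neg) / sqrt(len(neg)) if neg else 0.0
    return pos_part - neg_part


def annotate_cluster(expr, marker_db):
    """Return the best-scoring cell type and the score of every cell type."""
    scores = {
        cell_type: marker_score(expr, m["positive"], m["negative"])
        for cell_type, m in marker_db.items()
    }
    return max(scores, key=scores.get), scores


# Illustrative marker table and cluster profile (assumed, not from the paper).
marker_db = {
    "T cell": {"positive": ["CD3D", "CD3E"], "negative": ["MS4A1"]},
    "B cell": {"positive": ["MS4A1", "CD79A"], "negative": ["CD3D"]},
}
cluster_expr = {"CD3D": 2.0, "CD3E": 1.5, "MS4A1": -0.5, "CD79A": -0.8}
best, scores = annotate_cluster(cluster_expr, marker_db)
```

Here the cluster expresses the T cell markers strongly and the B cell markers weakly, so the T cell score is the highest.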
Affiliation(s)
- Aleksandr Ianevski
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology (HIIT), Aalto University, Helsinki, Finland
- Anil K Giri
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
- Tero Aittokallio
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology (HIIT), Aalto University, Helsinki, Finland
  - Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Oslo, Norway
  - Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo, Norway
|
64
|
Sharon M, Vinogradov E, Argov CM, Lazarescu O, Zoabi Y, Hekselman I, Yeger-Lotem E. The differential activity of biological processes in tissues and cell subsets can illuminate disease-related processes and cell-type identities. Bioinformatics 2022; 38:1584-1592. [PMID: 35015838 DOI: 10.1093/bioinformatics/btab883] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/25/2021] [Revised: 12/09/2021] [Accepted: 01/02/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The distinct functionalities of human tissues and cell types underlie complex phenotype-genotype relationships, yet often remain elusive. Harnessing the multitude of bulk and single-cell human transcriptomes while focusing on processes can help reveal these distinct functionalities. RESULTS The Tissue-Process Activity (TiPA) method aims to identify processes that are preferentially active or under-expressed in specific contexts, by comparing the expression levels of process genes between contexts. We tested TiPA on 1579 tissue-specific processes and bulk tissue transcriptomes, finding that it performed better than another method. Next, we used TiPA to ask whether the activity of certain processes could underlie the tissue-specific manifestation of 1233 hereditary diseases. We found that 21% of the disease-causing genes indeed participated in such processes, thereby illuminating their genotype-phenotype relationships. Lastly, we applied TiPA to single-cell transcriptomes of 108 human cell types, revealing that process activities often match cell-type identities and can thus aid annotation efforts. Hence, differential activity of processes can highlight the distinct functionality of tissues and cells in a robust and meaningful manner. AVAILABILITY AND IMPLEMENTATION TiPA code is available in GitHub (https://github.com/moranshar/TiPA). In addition, all data are available as part of the Supplementary Material. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Moran Sharon
  - Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Ekaterina Vinogradov
  - Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Chanan M Argov
  - Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Or Lazarescu
  - Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Yazeed Zoabi
  - Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Idan Hekselman
  - Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Esti Yeger-Lotem
  - Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
  - The National Institute for Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer Sheva, Israel
|
65
|
Wang CY, Gao YL, Liu JX, Kong XZ, Zheng CH. Single-Cell RNA Sequencing Data Clustering by Low-Rank Subspace Ensemble Framework. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1154-1164. [PMID: 33026977 DOI: 10.1109/tcbb.2020.3029187] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology reveals the gene expression status and gene structure of individual cells, reflecting the heterogeneity and diversity of cells. Traditional methods of scRNA-seq data analysis treat the data as lying in a single subspace, which hides structural information in other subspaces. In this paper, we propose a low-rank subspace ensemble clustering framework (LRSEC) to analyze scRNA-seq data. Assuming that the scRNA-seq data exist in multiple subspaces, the low-rank model is used to find the lowest-rank representation of the data in the subspace. It is worth noting that the penalty factor of the low-rank kernel function is uncertain, and different penalty factors correspond to different low-rank structures. Moreover, a single clustering model struggles to find the cellular structure of all datasets. To strengthen the correlation between model solutions, we construct a new ensemble clustering framework, LRSEC, by using the low-rank model as the basic learner. The LRSEC framework captures the global structure of data through low-rank subspaces and has better clustering performance than a single clustering model. We validate the performance of the LRSEC framework on seven small datasets and one large dataset and obtain satisfactory results.
|
66
|
Duan D, He S, Huang E, Li Z, Feng H. NeuCA web server: a neural network-based cell annotation tool with web-app and GUI. Bioinformatics 2022; 38:2361-2363. [PMID: 35176143 PMCID: PMC9004646 DOI: 10.1093/bioinformatics/btac108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2021] [Revised: 01/24/2022] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open
Abstract
SUMMARY: Correctly annotating each cell's type is an important initial step in single-cell RNA sequencing (scRNA-seq) data analysis. Here, we present the NeuCA web server, a neural network-based scRNA-seq cell annotation tool with a web-app portal and graphical user interface, for automatically assigning cell labels. The NeuCA algorithm is accurate and exhaustive, maximizing the usage of measured cells for downstream analysis. The NeuCA web server provides over 20 ready-to-use pre-trained classifiers for commonly used tissue types. As the first web-app tool with neural-network infrastructure implemented, NeuCA web will help the research community analyze and annotate scRNA-seq data. AVAILABILITY AND IMPLEMENTATION: The NeuCA web server is implemented as an R Shiny application, online at https://statbioinfo.shinyapps.io/NeuCA/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Daoyu Duan
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
- Sijia He
  - College of Arts and Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
- Emina Huang
  - Department of Surgery, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Ziyi Li
  - To whom correspondence should be addressed.
- Hao Feng
  - To whom correspondence should be addressed.
|
67
|
Li J, Sheng Q, Shyr Y, Liu Q. scMRMA: single cell multiresolution marker-based annotation. Nucleic Acids Res 2022; 50:e7. [PMID: 34648021 PMCID: PMC8789072 DOI: 10.1093/nar/gkab931] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/16/2021] [Revised: 09/10/2021] [Accepted: 09/28/2021] [Indexed: 01/22/2023] Open
Abstract
Single-cell RNA sequencing has become a powerful tool for identifying and characterizing cellular heterogeneity. One essential step to understanding cellular heterogeneity is determining cell identities. The widely used strategy predicts identities by projecting cells or cell clusters unidirectionally against a reference to find the best match. Here, we develop a bidirectional method, scMRMA, where a hierarchical reference guides iterative clustering and deep annotation with enhanced resolutions. Taking full advantage of the reference, scMRMA greatly improves the annotation accuracy. scMRMA achieved better performance than existing methods in four benchmark datasets and successfully revealed the expansion of CD8 T cell populations in squamous cell carcinoma after anti-PD-1 treatment.
Affiliation(s)
- Jia Li
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Quanhu Sheng
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Yu Shyr
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Qi Liu
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
|
68
|
Li Z, Feng H. A neural network-based method for exhaustive cell label assignment using single cell RNA-seq data. Sci Rep 2022; 12:910. [PMID: 35042860 PMCID: PMC8766435 DOI: 10.1038/s41598-021-04473-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/21/2021] [Accepted: 12/21/2021] [Indexed: 02/01/2023] Open
Abstract
The fast-advancing single cell RNA sequencing (scRNA-seq) technology enables researchers to study the transcriptome of heterogeneous tissues at a single cell level. The initial important step of analyzing scRNA-seq data is usually to accurately annotate cells. The traditional approach of annotating cell types based on unsupervised clustering and marker genes is time-consuming and laborious. Taking advantage of the numerous existing scRNA-seq databases, many supervised label assignment methods have been developed. One feature that many label assignment methods share is to label cells with low confidence as "unassigned." These unassigned cells can be the result of assignment difficulties due to highly similar cell types or caused by the presence of unknown cell types. However, when unknown cell types are not expected, existing methods still label a considerable number of cells as unassigned, which is not desirable. In this work, we develop a neural network-based cell annotation method called NeuCA (Neural network-based Cell Annotation) for scRNA-seq data obtained from well-studied tissues. NeuCA can utilize the hierarchical structure information of the cell types to improve the annotation accuracy, which is especially helpful when data contain closely correlated cell types. We show that NeuCA can achieve more accurate cell annotation results compared with existing methods. Additionally, the applications on eight real datasets show that NeuCA has stable performance for intra- and inter-study annotation, as well as cross-condition annotation. NeuCA is freely available as an R/Bioconductor package at https://bioconductor.org/packages/NeuCA.
Affiliation(s)
- Ziyi Li
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Hao Feng
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, 44106, USA
|
69
|
Nguyen V, Griss J. scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data. BMC Bioinformatics 2022; 23:44. [PMID: 35038984 PMCID: PMC8762856 DOI: 10.1186/s12859-022-04574-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/30/2021] [Accepted: 01/11/2022] [Indexed: 12/02/2022] Open
Abstract
Background: Automatic cell type identification is essential to alleviate a key bottleneck in scRNA-seq data analysis. While most existing classification tools show good sensitivity and specificity, they often fail to leave unclassified those cells whose type is missing from the reference. Additionally, many tools do not scale to the continuously increasing size of current scRNA-seq datasets. Therefore, additional tools are needed to solve these challenges. Results: scAnnotatR is a novel R package that provides a complete framework to classify cells in scRNA-seq datasets using pre-trained classifiers. It supports both Seurat and Bioconductor’s SingleCellExperiment and is thereby compatible with the vast majority of R-based analysis workflows. scAnnotatR uses hierarchically organised SVMs to distinguish a specific cell type versus all others. It shows comparable or even superior accuracy, sensitivity and specificity compared to existing tools while being able to leave unknown cell types unclassified. Moreover, scAnnotatR is the only one of the best-performing tools able to process datasets containing more than 600,000 cells. Conclusions: scAnnotatR is freely available on GitHub (https://github.com/grisslab/scAnnotatR) and through Bioconductor (from version 3.14). It is consistently among the best-performing tools in terms of classification accuracy while scaling to the largest datasets. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04574-5.
Affiliation(s)
- Vy Nguyen
  - Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090, Vienna, Austria
- Johannes Griss
  - Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090, Vienna, Austria
|
70
|
Zeng Y, Wei Z, Pan Z, Lu Y, Yang Y. A robust and scalable graph neural network for accurate single-cell classification. Brief Bioinform 2022; 23:6501353. [PMID: 35018408 DOI: 10.1093/bib/bbab570] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/28/2021] [Revised: 12/01/2021] [Accepted: 12/11/2021] [Indexed: 12/25/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) techniques provide high-resolution data on cellular heterogeneity in diverse tissues, and a critical step in the data analysis is cell type identification. Traditional methods usually cluster the cells and manually identify cell clusters through marker genes, which is time-consuming and subjective. With the launch of several large-scale single-cell projects, millions of sequenced cells have been annotated, and it is promising to transfer labels from the annotated datasets to newly generated datasets. One powerful way to transfer labels is to learn cell relations through a graph neural network (GNN), but traditional GNNs struggle to process millions of cells because of the expensive message-passing procedure at each training epoch. Here, we have developed a robust and scalable GNN-based method for accurate single-cell classification (GraphCS), where the graph is constructed to connect similar cells within and between labelled and unlabelled scRNA-seq datasets for propagation of shared information. To overcome the slow information propagation of GNNs at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, enabling sublinear complexity over cell numbers. Compared with existing methods, GraphCS demonstrates better performance on simulated, cross-platform, cross-species and cross-omics scRNA-seq datasets. More importantly, our model provides high speed and scalability on large datasets, and can achieve superior performance for 1 million cells within 50 min.
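The precomputation step described in this abstract can be illustrated with the exact form of Generalized PageRank propagation, Z = sum over k of w_k * P^k * X, where P is a row-normalised cell-cell adjacency matrix and X holds the cell features. The sketch below computes this exactly on a tiny graph; it does not implement the approximate, sublinear algorithm the paper relies on, and the function names and the weights are illustrative assumptions.

```python
def row_normalize(adj):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gpr_propagate(adj, features, weights):
    """Return sum_k weights[k] * P^k @ features, computed once up front.

    The result can be fed to an ordinary classifier, so no message passing
    is needed during training.
    """
    p = row_normalize(adj)
    current = [row[:] for row in features]            # P^0 @ X
    result = [[weights[0] * v for v in row] for row in features]
    for w in weights[1:]:
        current = matmul(p, current)                  # next power of P applied to X
        for i, row in enumerate(current):
            for j, v in enumerate(row):
                result[i][j] += w * v
    return result


# Three cells in a chain, with self-loops; one feature per cell.
adj = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
features = [[1.0], [0.0], [0.0]]
diffused = gpr_propagate(adj, features, weights=[0.5, 0.3, 0.2])
```

The feature that starts on the first cell spreads to its neighbours with decreasing weight per hop, which is the shared information a downstream classifier would see.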
Affiliation(s)
- Yuansong Zeng
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zhuoyi Wei
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zixiang Pan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yutong Lu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China
|
71
|
Zhang Y, Zhang F, Wang Z, Wu S, Tian W. scMAGIC: accurately annotating single cells using two rounds of reference-based classification. Nucleic Acids Res 2022; 50:e43. [PMID: 34986249 PMCID: PMC9071478 DOI: 10.1093/nar/gkab1275] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/20/2021] [Revised: 11/08/2021] [Accepted: 12/14/2021] [Indexed: 11/21/2022] Open
Abstract
Here, we introduce scMAGIC (Single Cell annotation using MArker Genes Identification and two rounds of reference-based Classification [RBC]), a novel method that uses well-annotated single-cell RNA sequencing (scRNA-seq) data as the reference to assist in the classification of query scRNA-seq data. A key innovation in scMAGIC is the introduction of a second-round RBC in which those query cells whose cell identities are confidently validated in the first round are used as a new reference to again classify query cells, therefore eliminating the batch effects between the reference and the query data. scMAGIC significantly outperforms 13 competing RBC methods with their optimal parameter settings across 86 benchmark tests, especially when the cell types in the query dataset are not completely covered by the reference dataset and when there exist significant batch effects between the reference and the query datasets. Moreover, when no reference dataset is available, scMAGIC can annotate query cells with reasonably high accuracy by using an atlas dataset as the reference.
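The two-round idea in this abstract can be sketched with a toy nearest-centroid classifier: round one labels query cells against the reference, and round two rebuilds the centroids from the confidently labelled query cells so that a systematic shift between reference and query no longer matters. This is only an illustration under assumed inputs; scMAGIC's actual classifier, marker gene identification, and confidence validation are not reproduced here, and `min_margin` is a hypothetical confidence threshold.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(cell, centroids):
    """Return (label, margin): nearest centroid and the gap to the runner-up."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(cell, c)), label)
        for label, c in centroids.items()
    )
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else float("inf")
    return best_label, margin


def two_round_annotation(reference, query, min_margin):
    """reference: {label: [vectors]}; query: [vectors].

    Returns the labels after round one and after round two.
    """
    ref_centroids = {lab: centroid(cells) for lab, cells in reference.items()}
    first = [classify(cell, ref_centroids) for cell in query]
    round1 = [label for label, _ in first]

    # Keep only confidently labelled query cells as the new reference.
    confident = {}
    for cell, (label, margin) in zip(query, first):
        if margin >= min_margin:
            confident.setdefault(label, []).append(cell)
    if not confident:
        return round1, round1

    query_centroids = {lab: centroid(cells) for lab, cells in confident.items()}
    round2 = [classify(cell, query_centroids)[0] for cell in query]
    return round1, round2


# Made-up data: the query is shifted along the first axis (a batch effect).
reference = {
    "A": [[0.0, 0.0], [0.2, 0.0], [-0.2, 0.0]],
    "B": [[4.0, 0.0], [4.2, 0.0], [3.8, 0.0]],
}
query = [[1.0, 0.0], [1.5, 0.0], [2.3, 0.0],   # true type A
         [5.0, 0.0], [5.5, 0.0], [6.0, 0.0]]   # true type B
round1, round2 = two_round_annotation(reference, query, min_margin=3.0)
```

In this toy case the third query cell falls on the wrong side of the reference boundary in round one, and is recovered in round two once the centroids come from the query data itself.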
Affiliation(s)
- Yu Zhang
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Feng Zhang
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
  - Department of Histoembryology, Genetics and Developmental Biology, Shanghai Key Laboratory of Reproductive Medicine, Key Laboratory of Cell Differentiation and Apoptosis of Chinese Ministry of Education, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Zekun Wang
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Siyi Wu
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Weidong Tian
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
  - Qilu Children's Hospital of Shandong University, No 23976 Jingshi Road, Jinan, Shandong, China
  - Children's Hospital of Fudan University, Shanghai 201102, China
|
72
|
Schiebout C, Frost HR. CAMML: Multi-Label Immune Cell-Typing and Stemness Analysis for Single-Cell RNA-sequencing. Pac Symp Biocomput 2022; 27:199-210. [PMID: 34890149 PMCID: PMC8669732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
Inferring the cell types in single-cell RNA-sequencing (scRNA-seq) data is of particular importance for understanding the potential cellular mechanisms and phenotypes occurring in complex tissues, such as the tumor-immune microenvironment (TME). The sparsity and noise of scRNA-seq data, combined with the fact that immune cell types often occur on a continuum, make cell typing of TME scRNA-seq data a significant challenge. Several single-label cell typing methods have been put forth to address the limitations of noise and sparsity, but accounting for the often overlapped spectrum of cell types in the immune TME remains an obstacle. To address this, we developed a new scRNA-seq cell-typing method, Cell-typing using variance Adjusted Mahalanobis distances with Multi-Labeling (CAMML). CAMML leverages cell type-specific weighted gene sets to score every cell in a dataset for every potential cell type. This allows cells to be labelled either by their highest scoring cell type as a single label classification or based on a score cut-off to give multi-label classification. For single-label cell typing, CAMML performance is comparable to existing cell typing methods, SingleR and Garnett. For scenarios where cells may exhibit features of multiple cell types (e.g., undifferentiated cells), the multi-label classification supported by CAMML offers important benefits relative to the current state-of-the-art methods. By integrating data across studies, omics platforms, and species, CAMML serves as a robust and adaptable method for overcoming the challenges of scRNA-seq analysis.
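The difference between the single-label and multi-label modes described above comes down to how per-cell scores are turned into labels. The sketch below assumes the per-cell-type scores have already been computed (in CAMML they come from variance-adjusted Mahalanobis distances over weighted gene sets, which is not reproduced here); the score values and the cut-off are made up for illustration.

```python
def single_label(scores):
    """Label a cell with its highest-scoring cell type."""
    return max(scores, key=scores.get)


def multi_label(scores, cutoff):
    """Label a cell with every cell type whose score reaches the cut-off."""
    return sorted(ct for ct, s in scores.items() if s >= cutoff)


# Hypothetical scores for one cell that shares features of two cell types.
cell_scores = {"T cell": 0.92, "NK cell": 0.81, "B cell": 0.05}
top = single_label(cell_scores)
labels = multi_label(cell_scores, cutoff=0.5)
```

The single-label call reports only the T cell identity, while the multi-label call also keeps the NK cell identity, which is the behaviour the abstract highlights for cells on a continuum.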
|
73
|
OUP accepted manuscript. Brief Funct Genomics 2022; 21:159-176. [DOI: 10.1093/bfgp/elac002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/21/2021] [Revised: 01/20/2022] [Accepted: 01/25/2022] [Indexed: 11/14/2022] Open
|
74
|
Yin Q, Wang Y, Guan J, Ji G. scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data. Brief Bioinform 2021; 23:6463428. [PMID: 34913057 DOI: 10.1093/bib/bbab508] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/15/2021] [Revised: 10/28/2021] [Accepted: 11/04/2021] [Indexed: 12/12/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) allows quantitative analysis of gene expression at the level of single cells, which is beneficial for studying cell heterogeneity. The recognition of cell types facilitates the construction of cell atlases in complex tissues or organisms, which is the basis of almost all downstream scRNA-seq data analyses. Using disease-related scRNA-seq data to predict disease status can facilitate the specific diagnosis and personalized treatment of disease. Since single-cell gene expression data are high-dimensional and sparse with dropouts, we propose scIAE, an integrative autoencoder-based ensemble classification framework, which first performs multiple random projections and applies integrative and devisable autoencoders (integrating stacked, denoising and sparse autoencoders) to obtain compressed representations. Base classifiers are then built on the lower-dimensional representations, and the predictions from all base models are integrated. The comparison of scIAE and common feature extraction methods shows that scIAE is effective and robust, independent of the choice of dimension, which is beneficial to subsequent cell classification. By testing scIAE on different types of data and comparing it with existing general and single-cell-specific classification methods, we show that scIAE has great classification power in intra-dataset, cross-batch, cross-platform and cross-species cell type annotation, as well as in disease status prediction. The architecture of scIAE is flexible and devisable, and it is available at https://github.com/JGuan-lab/scIAE.
Affiliation(s)
- Qingyang Yin
  - Department of Automation, Xiamen University, Xiamen, Fujian 361102, China
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
- Yang Wang
  - Department of Automation, Xiamen University, Xiamen, Fujian 361102, China
- Jinting Guan
  - Department of Automation, Xiamen University, Xiamen, Fujian 361102, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Guoli Ji
  - Department of Automation, Xiamen University, Xiamen, Fujian 361102, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China
|
75
|
Xie B, Jiang Q, Mora A, Li X. Automatic cell type identification methods for single-cell RNA sequencing. Comput Struct Biotechnol J 2021; 19:5874-5887. [PMID: 34815832 PMCID: PMC8572862 DOI: 10.1016/j.csbj.2021.10.027] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 04/27/2021] [Revised: 09/23/2021] [Accepted: 10/18/2021] [Indexed: 11/24/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a powerful tool for scientists of many research disciplines due to its ability to elucidate the heterogeneous and complex cell-type compositions of different tissues and cell populations. Traditional cell-type identification methods for scRNA-seq data analysis rely on manual annotation, which is time-consuming and knowledge-dependent. By contrast, automatic cell-type identification methods may have the advantages of being fast, accurate, and more user friendly. Here, we discuss and evaluate thirty-two published automatic methods for scRNA-seq data analysis in terms of their prediction accuracy, F1-score, unlabeling rate and running time. We highlight the advantages and disadvantages of these methods and provide recommendations for method choice depending on the available information. The challenges and future applications of these automatic methods are further discussed. In addition, we provide a free scRNA-seq data analysis package encompassing the discussed automatic methods to ease their use in real-world applications.
Affiliation(s)
- Bingbing Xie
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510060, Guangdong, China
- Qin Jiang
  - Affiliated Eye Hospital of Nanjing Medical University, Nanjing, China
- Antonio Mora
  - Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Xinzao, Panyu District, Guangzhou 511436, Guangdong, China
- Xuri Li
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510060, Guangdong, China
|
76
|
Kimmel JC, Kelley DR. Semisupervised adversarial neural networks for single-cell classification. Genome Res 2021; 31:1781-1793. [PMID: 33627475 PMCID: PMC8494222 DOI: 10.1101/gr.268581.120] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 07/30/2020] [Accepted: 02/18/2021] [Indexed: 11/25/2022]
Abstract
Annotating cell identities is a common bottleneck in the analysis of single-cell genomics experiments. Here, we present scNym, a semisupervised, adversarial neural network that learns to transfer cell identity annotations from one experiment to another. scNym takes advantage of information in both labeled data sets and new, unlabeled data sets to learn rich representations of cell identity that enable effective annotation transfer. We show that scNym effectively transfers annotations across experiments despite biological and technical differences, achieving performance superior to existing methods. We also show that scNym models can synthesize information from multiple training and target data sets to improve performance. We show that in addition to high accuracy, scNym models are well calibrated and interpretable with saliency methods.
Affiliation(s)
- Jacob C Kimmel
  - Calico Life Sciences, LLC, South San Francisco, California 94080, USA
- David R Kelley
  - Calico Life Sciences, LLC, South San Francisco, California 94080, USA
|
77
|
Pinkney HR, Black MA, Diermeier SD. Single-Cell RNA-Seq Reveals Heterogeneous lncRNA Expression in Xenografted Triple-Negative Breast Cancer Cells. Biology (Basel) 2021; 10:987. [PMID: 34681087 PMCID: PMC8533545 DOI: 10.3390/biology10100987] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/05/2021] [Revised: 09/23/2021] [Accepted: 09/26/2021] [Indexed: 12/03/2022]
Abstract
Breast cancer is the most commonly diagnosed cancer in the world, with triple-negative breast cancer (TNBC) making up 12% of these diagnoses. TNBC tumours are highly heterogeneous in both inter-tumour and intra-tumour gene expression profiles, where they form subclonal populations of varying levels of aggressiveness. These aspects make it difficult to study and treat TNBC, requiring further research into tumour heterogeneity as well as potential therapeutic targets and biomarkers. Recently, it was discovered that the majority of the transcribed genome comprises non-coding RNAs, in particular long non-coding RNAs (lncRNAs). LncRNAs are transcripts of >200 nucleotides in length that do not encode a protein. They have been characterised as regulatory molecules and their expression can be associated with a malignant phenotype. We set out to explore TNBC tumour heterogeneity in vivo at a single cell level to investigate whether lncRNA expression varies across different cells within the tumour, even if cells are coming from the same cell line, and whether lncRNA expression is sufficient to define cellular subpopulations. We applied single-cell expression profiling due to its ability to capture expression signals of lncRNAs expressed in small subpopulations of cells. Overall, we observed most lncRNAs to be expressed at low, but detectable levels in TNBC xenografts, with a median of 25 lncRNAs detected per cell. LncRNA expression alone was insufficient to define a subpopulation of cells, and lncRNAs showed highly heterogeneous expression patterns, including ubiquitous expression, subpopulation-specific expression, and a hybrid pattern of lncRNAs expressed in several, but not all subpopulations. These findings reinforce that transcriptionally defined tumour cell subpopulations can be identified in cell-line derived xenografts, and show that single-cell RNA-seq (scRNA-seq) can detect and characterise lncRNA expression across these subpopulations in xenografted tumours. Future studies will aim to investigate the spatial distribution of lncRNAs within xenografts and patient tissues, and study the potential of subclone-specific lncRNAs as new therapeutic targets and/or biomarkers.
Affiliation(s)
- Holly R. Pinkney
  - Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand
- Michael A. Black
  - Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand
- Sarah D. Diermeier
  - Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand
  - Amaroq Therapeutics Ltd., Dunedin 9016, New Zealand
|
78
|
Shao X, Yang H, Zhuang X, Liao J, Yang P, Cheng J, Lu X, Chen H, Fan X. scDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Res 2021; 49:e122. [PMID: 34500471 PMCID: PMC8643674 DOI: 10.1093/nar/gkab775] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 06/21/2021] [Revised: 08/04/2021] [Accepted: 08/26/2021] [Indexed: 01/16/2023] Open
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) have furthered the simultaneous classification of thousands of cells in a single assay based on transcriptome profiling. In most analysis protocols, single-cell type annotation relies on marker genes or RNA-seq profiles, resulting in poor extrapolation. Still, the accurate cell-type annotation for single-cell transcriptomic data remains a great challenge. Here, we introduce scDeepSort (https://github.com/ZJUFanLab/scDeepSort), a pre-trained cell-type annotation tool for single-cell transcriptomics that uses a deep learning model with a weighted graph neural network (GNN). Using human and mouse scRNA-seq data resources, we demonstrate the high performance and robustness of scDeepSort in labeling 764 741 cells involving 56 human and 32 mouse tissues. Significantly, scDeepSort outperformed other known methods in annotating 76 external test datasets, reaching an 83.79% accuracy across 265 489 cells in humans and mice. Moreover, we demonstrate the universality of scDeepSort using more challenging datasets and using references from different scRNA-seq technology. Above all, scDeepSort is the first attempt to annotate cell types of scRNA-seq data with a pre-trained GNN model, which can realize the accurate cell-type annotation without additional references, i.e. markers or RNA-seq profiles.
Affiliation(s)
- Xin Shao
  - Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - iMedicine Lab, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University, Hangzhou 310058, China
- Haihong Yang
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  - Hangzhou Innovation Center, Zhejiang University, Hangzhou 310058, China
- Xiang Zhuang
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
- Jie Liao
  - Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - iMedicine Lab, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University, Hangzhou 310058, China
- Penghui Yang
  - Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Junyun Cheng
  - Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Xiaoyan Lu
  - Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - Innovation Center in Zhejiang University, State Key Laboratory of Component-Based Chinese Medicine, Hangzhou 310058, China
- Huajun Chen
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  - The First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou 310003, China
  - Hangzhou Innovation Center, Zhejiang University, Hangzhou 310058, China
- Xiaohui Fan
  - Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - iMedicine Lab, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University, Hangzhou 310058, China
  - Innovation Center in Zhejiang University, State Key Laboratory of Component-Based Chinese Medicine, Hangzhou 310058, China
  - Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou 310058, China
|
79
|
Ma W, Su K, Wu H. Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction. Genome Biol 2021; 22:264. [PMID: 34503564 PMCID: PMC8427961 DOI: 10.1186/s13059-021-02480-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/10/2021] [Accepted: 08/25/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. RESULTS In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data. CONCLUSIONS Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. All the code used for our analysis is available on GitHub (https://github.com/marvinquiet/RefConstruction_supervisedCelltyping).
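To make the F-test feature selection recommended in this abstract concrete, the sketch below ranks genes by a one-way ANOVA F statistic computed across cell types. It is a generic textbook F-test in plain Python, not the code from the authors' repository. The function names `f_statistic` and `rank_genes` and the toy expression values are hypothetical.

```python
def f_statistic(values, labels):
    """One-way ANOVA F statistic of one gene's expression across cell types."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more cells than groups")

    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)

    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(expression, labels, top_n):
    """Return the top_n gene names with the largest F statistic."""
    scores = {gene: f_statistic(values, labels)
              for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    cell_types = ["a", "a", "a", "b", "b", "b"]
    toy_expression = {
        "gene_informative": [1, 2, 3, 7, 8, 9],
        "gene_flat": [1, 5, 9, 2, 5, 8],
    }
    print(rank_genes(toy_expression, cell_types, top_n=1))
```

The selected genes would then be passed to a classifier such as an MLP; that step is omitted here to keep the example free of third-party dependencies.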
Affiliation(s)
- Wenjing Ma
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
- Kenong Su
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
- Hao Wu
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
|
80
|
Cortal A, Martignetti L, Six E, Rausell A. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat Biotechnol 2021; 39:1095-1102. [PMID: 33927417 DOI: 10.1038/s41587-021-00896-6] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 08/05/2020] [Accepted: 03/15/2021] [Indexed: 02/08/2023]
Abstract
Because of the stochasticity associated with high-throughput single-cell sequencing, current methods for exploring cell-type diversity rely on clustering-based computational approaches in which heterogeneity is characterized at cell subpopulation rather than at full single-cell resolution. Here we present Cell-ID, a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell sequencing data. We applied Cell-ID to data from multiple human and mouse samples, including blood cells, pancreatic islets and airway, intestinal and olfactory epithelium, as well as to comprehensive mouse cell atlas datasets. We demonstrate that Cell-ID signatures are reproducible across different donors, tissues of origin, species and single-cell omics technologies, and can be used for automatic cell-type annotation and cell matching across datasets. Cell-ID improves biological interpretation at individual cell level, enabling discovery of previously uncharacterized rare cell types or cell states. Cell-ID is distributed as an open-source R software package.
Affiliation(s)
- Akira Cortal
  - Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
- Loredana Martignetti
  - Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
- Emmanuelle Six
  - Laboratory of Human Lymphohematopoiesis, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
- Antonio Rausell
  - Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
  - Molecular Genetics Service, AP-HP, Necker Hospital for Sick Children, Paris, France
|
81
|
Zhou X, Chai H, Zeng Y, Zhao H, Yang Y. scAdapt: virtual adversarial domain adaptation network for single cell RNA-seq data classification across platforms and species. Brief Bioinform 2021; 22:6326525. [PMID: 34308480 DOI: 10.1093/bib/bbab281] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/16/2021] [Revised: 06/29/2021] [Accepted: 07/02/2021] [Indexed: 11/14/2022] Open
Abstract
In single cell analyses, cell types are conventionally identified based on expressions of known marker genes, whose identifications are time-consuming and irreproducible. To solve this issue, many supervised approaches have been developed to identify cell types based on the rapid accumulation of public datasets. However, these approaches are sensitive to batch effects or biological variations since the data distributions are different in cross-platforms or species predictions. In this study, we developed scAdapt, a virtual adversarial domain adaptation network, to transfer cell labels between datasets with batch effects. scAdapt used both the labeled source and unlabeled target data to train an enhanced classifier and aligned the labeled source centroids and pseudo-labeled target centroids to generate a joint embedding. The scAdapt was demonstrated to outperform existing methods for classification in simulated, cross-platforms, cross-species, spatial transcriptomic and COVID-19 immune datasets. Further quantitative evaluations and visualizations for the aligned embeddings confirm the superiority in cell mixing and the ability to preserve discriminative cluster structure present in the original datasets.
Affiliation(s)
- Xiang Zhou
  - School of Computer Science and Engineering at the Sun Yat-sen University, China
- Hua Chai
  - School of Computer Science and Engineering at the Sun Yat-sen University, China
- Yuansong Zeng
  - School of Computer Science and Engineering at the Sun Yat-sen University, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital at the Sun Yat-sen University, China
- Yuedong Yang
  - School of Computer Science and Engineering and the National Super Computer Center at Guangzhou, Sun Yat-sen University, China
|
82
|
Zelco A, Börjesson V, de Kanter JK, Lebrero-Fernandez C, Lauschke VM, Rocha-Ferreira E, Nilsson G, Nair S, Svedin P, Bemark M, Hagberg H, Mallard C, Holstege FCP, Wang X. Single-cell atlas reveals meningeal leukocyte heterogeneity in the developing mouse brain. Genes Dev 2021; 35:1190-1207. [PMID: 34301765 PMCID: PMC8336895 DOI: 10.1101/gad.348190.120] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/17/2020] [Accepted: 06/28/2021] [Indexed: 12/19/2022]
Abstract
Here, Zelco et al. used single-cell RNA sequencing to generate the first comprehensive transcriptional atlas of neonatal mouse meningeal leukocytes under normal conditions and after perinatal brain injury. They found that early after hypoxic–ischemic insult, neutrophil numbers increased and exhibited increased granulopoiesis, suggesting that the meninges are an important site of immune cell expansion with implications for the initiation of inflammatory cascades after neonatal brain injury. The meninges are important for brain development and pathology. Using single-cell RNA sequencing, we have generated the first comprehensive transcriptional atlas of neonatal mouse meningeal leukocytes under normal conditions and after perinatal brain injury. We identified almost all known leukocyte subtypes and found differences between neonatal and adult border-associated macrophages, thus highlighting that neonatal border-associated macrophages are functionally immature with regards to immune responses compared with their adult counterparts. We also identified novel meningeal microglia-like cell populations that may participate in white matter development. Early after the hypoxic–ischemic insult, neutrophil numbers increased and they exhibited increased granulopoiesis, suggesting that the meninges are an important site of immune cell expansion with implications for the initiation of inflammatory cascades after neonatal brain injury. Our study provides a single-cell resolution view of the importance of meningeal leukocytes at the early stage of development in health and disease.
Affiliation(s)
- Aura Zelco
  - Centre of Perinatal Medicine and Health, Institute of Neuroscience and Physiology, Department of Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
- Vanja Börjesson
  - Bioinformatics Core Facility, Sahlgrenska Academy, University of Gothenburg, Gothenburg 413 90, Sweden
- Jurrian K de Kanter
  - Princess Máxima Center for Pediatric Oncology, 3584 CS Utrecht, The Netherlands
- Cristina Lebrero-Fernandez
  - Department of Microbiology and Immunology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
- Volker M Lauschke
  - Department of Physiology and Pharmacology, Karolinska Institute, Stockholm 17177, Sweden
  - Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart 70 376, Germany
- Eridan Rocha-Ferreira
  - Centre of Perinatal Medicine and Health, Institute of Clinical Sciences, Department of Obstetrics and Gynecology, Sahlgrenska Academy, Gothenburg University, Gothenburg 40530, Sweden
- Gisela Nilsson
  - Centre of Perinatal Medicine and Health, Institute of Neuroscience and Physiology, Department of Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
- Syam Nair
  - Centre of Perinatal Medicine and Health, Institute of Clinical Sciences, Department of Obstetrics and Gynecology, Sahlgrenska Academy, Gothenburg University, Gothenburg 40530, Sweden
- Pernilla Svedin
  - Centre of Perinatal Medicine and Health, Institute of Neuroscience and Physiology, Department of Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
- Mats Bemark
  - Department of Microbiology and Immunology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
- Henrik Hagberg
  - Centre of Perinatal Medicine and Health, Institute of Clinical Sciences, Department of Obstetrics and Gynecology, Sahlgrenska Academy, Gothenburg University, Gothenburg 40530, Sweden
- Carina Mallard
  - Centre of Perinatal Medicine and Health, Institute of Neuroscience and Physiology, Department of Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
- Frank C P Holstege
  - Princess Máxima Center for Pediatric Oncology, 3584 CS Utrecht, The Netherlands
- Xiaoyang Wang
  - Centre of Perinatal Medicine and Health, Institute of Neuroscience and Physiology, Department of Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg 40530, Sweden
  - Centre of Perinatal Medicine and Health, Institute of Clinical Sciences, Department of Obstetrics and Gynecology, Sahlgrenska Academy, Gothenburg University, Gothenburg 40530, Sweden
  - Henan Key Laboratory of Child Brain Injury, Institute of Neuroscience, Third Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China
|
83
|
Wei Z, Zhang S. CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data. Bioinformatics 2021; 37:i51-i58. [PMID: 34252936 PMCID: PMC8686678 DOI: 10.1093/bioinformatics/btab286] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Accepted: 04/23/2021] [Indexed: 12/13/2022] Open
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) technology has been widely applied to capture the heterogeneity of different cell types within complex tissues. An essential step in scRNA-seq data analysis is the annotation of cell types. Traditional cell-type annotation mainly clusters the cells first and then uses the aggregated cluster-level expression profiles and the marker genes to label each cluster. Such methods depend heavily on the clustering results, which are insufficient for accurate annotation. Results: In this article, we propose a semi-supervised learning method for cell-type annotation called CALLR. It combines unsupervised learning represented by the graph Laplacian matrix constructed from all the cells and supervised learning using sparse logistic regression. By alternately updating the cell clusters and annotation labels, high annotation accuracy can be achieved. The model is formulated as an optimization problem, and a computationally efficient algorithm is developed to solve it. Experiments on 10 real datasets show that CALLR outperforms the compared (semi-)supervised learning methods and the popular clustering methods. Availability and implementation: The implementation of CALLR is available at https://github.com/MathSZhang/CALLR. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ziyang Wei
  - Department of Statistics, University of Chicago, Chicago, IL 60637, USA
  - School of Mathematical Sciences, Fudan University, Shanghai 200433, China
- Shuqin Zhang
  - School of Mathematical Sciences, Fudan University, Shanghai 200433, China
  - Laboratory of Mathematics for Nonlinear Science, Fudan University, Shanghai 200433, China
  - Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China
|
84
|
Kaymaz Y, Ganglberger F, Tang M, Haslinger C, Fernandez-Albert F, Lawless N, Sackton T. HieRFIT: A hierarchical cell type classification tool for projections from complex single-cell atlas datasets. Bioinformatics 2021; 37:4431-4436. [PMID: 34255817 DOI: 10.1093/bioinformatics/btab499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2020] [Revised: 05/25/2021] [Accepted: 07/02/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The emergence of single-cell RNA sequencing (scRNA-seq) has led to an explosion in novel methods to study biological variation among individual cells, and to classify cells into functional and biologically meaningful categories. RESULTS Here, we present a new cell type projection tool, HieRFIT (Hierarchical Random Forest for Information Transfer), based on hierarchical random forests. HieRFIT uses a priori information about cell type relationships to improve classification accuracy, taking as input a hierarchical tree structure representing the class relationships, along with the reference data. We use an ensemble approach combining multiple random forest models, organized in a hierarchical decision tree structure. We show that our hierarchical classification approach improves accuracy and reduces incorrect predictions, especially for inter-dataset tasks, which reflect real-life applications. We use a scoring scheme that adjusts probability distributions for candidate class labels and resolves uncertainties while avoiding the assignment of cells to incorrect types by labeling cells at internal nodes of the hierarchy when necessary. AVAILABILITY HieRFIT is implemented as an R package, and it is available at https://github.com/yasinkaymaz/HieRFIT/releases/tag/v1.0.0. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
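The idea of labeling a cell at an internal node of the hierarchy when the evidence for any single leaf is weak can be sketched as follows. This is only an illustration in Python, not HieRFIT's scoring scheme or its R interface. The `assign_label` function, the toy cell-type tree, the rule that an internal node's score is the sum of its leaf probabilities, and the 0.6 threshold are all assumptions made for the example.

```python
def assign_label(tree, leaf_probs, root="root", threshold=0.6):
    """Walk down the hierarchy and stop at the deepest node whose score
    still clears the threshold (hypothetical back-off rule)."""

    def score(node):
        # A leaf's score is its own probability; an internal node's score
        # is the sum over its descendant leaves.
        children = tree.get(node, [])
        if not children:
            return leaf_probs.get(node, 0.0)
        return sum(score(child) for child in children)

    node = root
    while tree.get(node):
        best_child = max(tree[node], key=score)
        if score(best_child) < threshold:
            break  # not confident enough to go deeper
        node = best_child
    return node


if __name__ == "__main__":
    # Toy hierarchy: parent -> list of children.
    cell_type_tree = {
        "root": ["lymphoid", "myeloid"],
        "lymphoid": ["T cell", "B cell"],
        "myeloid": ["monocyte"],
    }
    confident = {"T cell": 0.85, "B cell": 0.10, "monocyte": 0.05}
    ambiguous = {"T cell": 0.45, "B cell": 0.45, "monocyte": 0.10}
    print(assign_label(cell_type_tree, confident))
    print(assign_label(cell_type_tree, ambiguous))
```

In this toy case the first cell is labeled at the leaf, while the second is labeled "lymphoid" because neither T cell nor B cell is individually supported.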
Affiliation(s)
- Yasin Kaymaz
  - Informatics Group, Harvard University, Cambridge, MA, USA
- Ming Tang
  - Informatics Group, Harvard University, Cambridge, MA, USA
- Christian Haslinger
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH and Co KG, Biberach an der Riss, DE
- Francesc Fernandez-Albert
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH and Co KG, Biberach an der Riss, DE
- Nathan Lawless
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH and Co KG, Biberach an der Riss, DE
|
85
|
Song Q, Su J, Zhang W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat Commun 2021; 12:3826. [PMID: 34158507 PMCID: PMC8219725 DOI: 10.1038/s41467-021-24172-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/13/2020] [Accepted: 06/07/2021] [Indexed: 12/20/2022] Open
Abstract
Single-cell omics is the fastest-growing type of genomics data in the literature and public genomics repositories. Leveraging the growing repository of labeled datasets and transferring labels from existing datasets to newly generated datasets will empower the exploration of single-cell omics data. However, the current label transfer methods have limited performance, largely due to the intrinsic heterogeneity among cell populations and extrinsic differences between datasets. Here, we present a robust graph artificial intelligence model, single-cell Graph Convolutional Network (scGCN), to achieve effective knowledge transfer across disparate datasets. Through benchmarking with other label transfer methods on a total of 30 single cell omics datasets, scGCN consistently demonstrates superior accuracy on leveraging cells from different tissues, platforms, and species, as well as cells profiled at different molecular layers. scGCN is implemented as an integrated workflow in Python, which is available at https://github.com/QSong-github/scGCN.
Affiliation(s)
- Qianqian Song
  - Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC, USA
  - Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC, USA
- Jing Su
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, USA
  - Section on Gerontology and Geriatric Medicine, Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA
- Wei Zhang
  - Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC, USA
  - Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC, USA
|
86
|
Challenges and Opportunities in the Statistical Analysis of Multiplex Immunofluorescence Data. Cancers (Basel) 2021; 13:3031. [PMID: 34204319 PMCID: PMC8233801 DOI: 10.3390/cancers13123031] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 05/14/2021] [Revised: 06/11/2021] [Accepted: 06/14/2021] [Indexed: 12/21/2022] Open
Abstract
Simple Summary: Immune modulation is considered a hallmark of cancer initiation and progression, and has offered promising opportunities for therapeutic manipulation. Multiplex immunofluorescence (mIF) technology has enabled the tumor immune microenvironment (TIME) to be studied at an increased scale, in terms of both the number of markers and the number of samples. Another benefit of mIF technology is the ability to measure not only the abundance but also the spatial location of multiple cell types within a tissue sample simultaneously, allowing for assessment of the co-localization of different types of immune markers. Thus, the use of mIF technologies has enabled researchers to characterize patient, clinical, and tumor characteristics in the hope of identifying patients who might benefit from immunotherapy treatments. In this review we outline some of the challenges and opportunities in the statistical analyses of mIF data to study the TIME.
Abstract: Immune modulation is considered a hallmark of cancer initiation and progression. The recent development of immunotherapies has ushered in a new era of cancer treatment. These therapeutics have led to revolutionary breakthroughs; however, the efficacy of immunotherapy has been modest and is often restricted to a subset of patients. Hence, identification of which cancer patients will benefit from immunotherapy is essential. Multiplex immunofluorescence (mIF) microscopy allows for the assessment and visualization of the tumor immune microenvironment (TIME). The data output following image and machine learning analyses for cell segmenting and phenotyping consists of the following information for each tumor sample: the number of positive cells for each marker and phenotype(s) of interest, number of total cells, percent of positive cells for each marker, and spatial locations for all measured cells. There are many challenges in the analysis of mIF data, including many tissue samples with zero positive cells or “zero-inflated” data, repeated measurements from multiple TMA cores or tissue slides per subject, and spatial analyses to determine the level of clustering and co-localization between the cell types in the TIME. In this review paper, we will discuss the challenges in the statistical analysis of mIF data and opportunities for further research.
|
87
|
A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets. Genes (Basel) 2021; 12:898. [PMID: 34200671 PMCID: PMC8229796 DOI: 10.3390/genes12060898] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/31/2021] [Revised: 06/04/2021] [Accepted: 06/04/2021] [Indexed: 01/05/2023] Open
Abstract
Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10⁻⁷⁶, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference-data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
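The abstract above reports top-1 and top-5 hit rates for cluster annotation driven by gene-cell type associations. The following is a minimal sketch of that evaluation idea, not the scALE algorithm itself: it assumes each cluster is summarized by a list of marker genes, ranks cell types by the mean association score of those genes, and then computes top-k accuracy. The function names and the toy scores are hypothetical.

```python
def rank_cell_types(marker_genes, gca):
    """Rank cell types by mean association strength with a cluster's marker genes.

    gca maps (gene, cell_type) -> association score; missing pairs count as 0.
    """
    cell_types = {cell_type for _, cell_type in gca}
    scores = {
        ct: sum(gca.get((g, ct), 0.0) for g in marker_genes) / len(marker_genes)
        for ct in cell_types
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda ct: (-scores[ct], ct))


def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of clusters whose true label appears among the first k predictions."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


# Hypothetical toy association scores, not values from the study.
gca = {
    ("CD3E", "T cell"): 0.9, ("CD3E", "B cell"): 0.1,
    ("MS4A1", "B cell"): 0.8, ("MS4A1", "T cell"): 0.05,
    ("PECAM1", "endothelial cell"): 0.7,
}
clusters = [["CD3E"], ["MS4A1"], ["PECAM1", "CD3E"]]
truths = ["T cell", "B cell", "endothelial cell"]

ranked = [rank_cell_types(markers, gca) for markers in clusters]
top1 = top_k_accuracy(ranked, truths, 1)
top2 = top_k_accuracy(ranked, truths, 2)
```

In this toy setup the third cluster is misranked at the top position but recovered at the second, which is the kind of gap between top-1 and top-k rates that the abstract describes.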
|
88
|
Balzer MS, Ma Z, Zhou J, Abedini A, Susztak K. How to Get Started with Single Cell RNA Sequencing Data Analysis. J Am Soc Nephrol 2021; 32:1279-1292. [PMID: 33722930 PMCID: PMC8259643 DOI: 10.1681/asn.2020121742] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 02/04/2023]
Abstract
Over the last 5 years, single cell methods have enabled the monitoring of gene and protein expression, genetic, and epigenetic changes in thousands of individual cells in a single experiment. With the improved measurement and the decreasing cost of the reactions and sequencing, the size of these datasets is increasing rapidly. The critical bottleneck remains the analysis of the wealth of information generated by single cell experiments. In this review, we give a simplified overview of the analysis pipelines, as they are typically used in the field today. We aim to enable researchers starting out in single cell analysis to gain an overview of challenges and the most commonly used analytical tools. In addition, we hope to empower others to gain an understanding of how typical readouts from single cell datasets are presented in the published literature.
Affiliation(s)
- Michael S. Balzer
  - Renal Electrolyte and Hypertension Division, Department of Medicine, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
- Ziyuan Ma
  - Renal Electrolyte and Hypertension Division, Department of Medicine, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
- Jianfu Zhou
  - Renal Electrolyte and Hypertension Division, Department of Medicine, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
- Amin Abedini
  - Renal Electrolyte and Hypertension Division, Department of Medicine, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
- Katalin Susztak
  - Renal Electrolyte and Hypertension Division, Department of Medicine, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
|
89
|
Liu J, Fan Z, Zhao W, Zhou X. Machine Intelligence in Single-Cell Data Analysis: Advances and New Challenges. Front Genet 2021; 12:655536. [PMID: 34135939 PMCID: PMC8203333 DOI: 10.3389/fgene.2021.655536] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/19/2021] [Accepted: 04/26/2021] [Indexed: 12/18/2022]
Abstract
The rapid development of single-cell technologies allows for dissecting cellular heterogeneity at different omics layers with an unprecedented resolution. In-depth analysis of cellular heterogeneity will boost our understanding of complex biological systems or processes, including cancer, the immune system and chronic diseases, thereby providing valuable insights for clinical and translational research. In this review, we will focus on the application of machine learning methods in single-cell multi-omics data analysis. We will start with the pre-processing of single-cell RNA sequencing (scRNA-seq) data, including data imputation, cross-platform batch effect removal, and cell cycle and cell-type identification. Next, we will introduce advanced data analysis tools and methods used for copy number variation estimation, single-cell pseudo-time trajectory analysis, phylogenetic tree inference, cell-cell interaction, regulatory network inference, and integrated analysis of scRNA-seq and spatial transcriptome data. Finally, we will present the latest analytical challenges, such as multi-omics integration and integrated analysis of scRNA-seq data.
Affiliation(s)
- Jiajia Liu
  - College of Electronic and Information Engineering, Tongji University, Shanghai, China
  - School of Biomedical Informatics, The University of Texas Health Science Centre at Houston, Houston, TX, United States
- Zhiwei Fan
  - School of Biomedical Informatics, The University of Texas Health Science Centre at Houston, Houston, TX, United States
  - West China School of Public Health, West China Fourth Hospital, Sichuan University, Chengdu, China
- Weiling Zhao
  - School of Biomedical Informatics, The University of Texas Health Science Centre at Houston, Houston, TX, United States
- Xiaobo Zhou
  - School of Biomedical Informatics, The University of Texas Health Science Centre at Houston, Houston, TX, United States
|
90
|
Shen Y, Chu Q, Timko MP, Fan L. scDetect: a rank-based ensemble learning algorithm for cell type identification of single-cell RNA sequencing in cancer. Bioinformatics 2021; 37:4115-4122. [PMID: 34048541 DOI: 10.1093/bioinformatics/btab410] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/04/2020] [Revised: 05/16/2021] [Accepted: 05/27/2021] [Indexed: 01/23/2023]
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) has enabled the characterization of different cell types in many tissues and tumor samples. Cell type identification is essential for single-cell RNA profiling, which is currently transforming the life sciences. Often, this is achieved by searching for combinations of genes that have previously been implicated as being cell-type specific, an approach that is not quantitative and does not explicitly take advantage of other scRNA-seq studies. Batch effects and different data platforms greatly decrease the predictive performance in inter-laboratory and cross-data-type validation.
RESULTS: Here, we present a new ensemble learning method named scDetect that combines gene expression rank-based analysis and a majority vote ensemble machine-learning probability-based prediction method capable of highly accurate classification of cells based on scRNA-seq data from different sequencing platforms. Because of tumor heterogeneity, in order to accurately predict tumor cells in single-cell RNA-seq data, we have also incorporated cell copy number variation consensus clustering and an epithelial score in the classification. We applied scDetect to scRNA-seq data from pancreatic tissue, mononuclear cells, and tumor biopsy cells and show that scDetect classified individual cells with high accuracy and better than other publicly available tools.
AVAILABILITY: scDetect is open source software. Source code and test data are freely available from GitHub (https://github.com/IVDgenomicslab/scDetect/) and Zenodo (https://zenodo.org/record/4764132#.YKCOlrH5AYN). The examples and tutorial page is at https://ivdgenomicslab.github.io/scDetect-Introduction/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
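The two ingredients named in the abstract above, rank-based expression features and a majority vote over several predictions, can be illustrated in a few lines. This is a generic sketch under simple assumptions (no tied expression values, string labels), not the scDetect implementation; `rank_transform` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter


def rank_transform(expression):
    """Replace each gene's value by its rank within the cell (1 = lowest).

    Ties are broken by gene position here; a real pipeline may average tied ranks.
    """
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranks = [0] * len(expression)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def majority_vote(labels):
    """Return the most common label; on a tie, the label seen first wins."""
    return Counter(labels).most_common(1)[0][0]


# The same hypothetical cell measured on two platforms with different scales.
platform_a = [1, 50, 10]
platform_b = [100, 9000, 700]
ranks_a = rank_transform(platform_a)
ranks_b = rank_transform(platform_b)

# Predictions from three hypothetical classifiers for one cell.
consensus = majority_vote(["T cell", "B cell", "T cell"])
```

Because ranks depend only on the ordering of genes within a cell, the two platforms yield identical features despite very different raw magnitudes, which is why rank-based inputs are a common way to reduce platform effects.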
Affiliation(s)
- Yifei Shen
  - Centre of Clinical Laboratory, First Affiliated Hospital, College of Medicine, Zhejiang University, China
  - Key Laboratory of Clinical In Vitro Diagnostic Techniques of Zhejiang Province, China
  - Institute of Laboratory Medicine, Zhejiang University, China
- Qinjie Chu
  - Institute of Bioinformatics, Zhejiang University, China
- Michael P Timko
  - Departments of Biology and Public Health Sciences, University of Virginia, USA
- Longjiang Fan
  - Institute of Bioinformatics, Zhejiang University, China
  - Department of Medical Oncology, First Affiliated Hospital, College of Medicine, Zhejiang University, China
|
91
|
Duan B, Chen S, Chen X, Zhu C, Tang C, Wang S, Gao Y, Fu S, Liu Q. Integrating multiple references for single-cell assignment. Nucleic Acids Res 2021; 49:e80. [PMID: 34037791 PMCID: PMC8373058 DOI: 10.1093/nar/gkab380] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/18/2021] [Revised: 04/13/2021] [Accepted: 04/27/2021] [Indexed: 01/09/2023]
Abstract
Efficient single-cell assignment is essential for single-cell sequencing data analysis. With the explosive growth of single-cell sequencing data, multiple single-cell sequencing data sources are available for the same kind of tissue, which can be integrated to further improve single-cell assignment; however, an efficient integration strategy is still lacking due to the great challenges of data heterogeneity existing in multiple references. To this end, we present mtSC, a flexible single-cell assignment framework that integrates multiple references based on multitask deep metric learning designed specifically for cell type identification within tissues with multiple single-cell sequencing data as references. We evaluated mtSC on a comprehensive set of publicly available benchmark datasets and demonstrated its state-of-the-art effectiveness for integrative single-cell assignment with multiple references.
Affiliation(s)
- Bin Duan
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shaoqi Chen
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Xiaohan Chen
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Chenyu Zhu
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Chen Tang
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shuguang Wang
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Yicheng Gao
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shaliu Fu
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Qi Liu
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
|
92
|
Michielsen L, Reinders MJT, Mahfouz A. Hierarchical progressive learning of cell identities in single-cell data. Nat Commun 2021; 12:2799. [PMID: 33990598 PMCID: PMC8121839 DOI: 10.1038/s41467-021-23196-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 08/26/2020] [Accepted: 04/16/2021] [Indexed: 12/11/2022]
Abstract
Supervised methods are increasingly used to identify cell populations in single-cell data. Yet, current methods are limited in their ability to learn from multiple datasets simultaneously, are hampered by the annotation of datasets at different resolutions, and do not preserve annotations when retrained on new datasets. The latter point is especially important as researchers cannot rely on downstream analysis performed using earlier versions of the dataset. Here, we present scHPL, a hierarchical progressive learning method which allows continuous learning from single-cell data by leveraging the different resolutions of annotations across multiple datasets to learn and continuously update a classification tree. We evaluate the classification and tree learning performance using simulated as well as real datasets and show that scHPL can successfully learn known cellular hierarchies from multiple datasets while preserving the original annotations. scHPL is available at https://github.com/lcmmichielsen/scHPL.
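The idea of a classification tree that grows as new datasets arrive, while earlier labels stay valid, can be sketched as follows. This is a toy illustration and not scHPL's code or API: simple thresholds on marker expression stand in for trained classifiers, and the `Node`, `classify`, and `attach` names are hypothetical.

```python
class Node:
    """One cell population in the tree, with a rule that accepts or rejects a cell."""

    def __init__(self, label, accepts=None):
        self.label = label
        self.accepts = accepts  # callable(cell) -> bool; unused for the root
        self.children = []


def classify(node, cell):
    """Descend while some child accepts the cell; return the deepest accepted label."""
    for child in node.children:
        if child.accepts(cell):
            return classify(child, cell)
    return node.label


def attach(root, parent_label, new_node):
    """Add new_node under the node named parent_label, leaving existing nodes unchanged."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.label == parent_label:
            node.children.append(new_node)
            return True
        stack.extend(node.children)
    return False


# Hypothetical thresholds on marker expression stand in for trained classifiers.
root = Node("unassigned")
attach(root, "unassigned", Node("T cell", lambda c: c.get("CD3E", 0) > 1))

cd8_cell = {"CD3E": 5, "CD8A": 4}
other_cell = {"CD3E": 0, "CD8A": 0}
label_before = classify(root, cd8_cell)  # coarse label from the first dataset

# A second dataset annotated at finer resolution adds a subtype under "T cell".
attach(root, "T cell", Node("CD8 T cell", lambda c: c.get("CD8A", 0) > 1))
label_after = classify(root, cd8_cell)
```

Cells that no child accepts keep the label of the last node that accepted them, so a T cell without the finer marker is still reported as "T cell" after the tree has been extended.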
Affiliation(s)
- Lieke Michielsen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Marcel J T Reinders
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Ahmed Mahfouz
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
|
93
|
Sánchez-Corrales YE, Pohle RVC, Castellano S, Giustacchini A. Taming Cell-to-Cell Heterogeneity in Acute Myeloid Leukaemia With Machine Learning. Front Oncol 2021; 11:666829. [PMID: 33996595 PMCID: PMC8117935 DOI: 10.3389/fonc.2021.666829] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/11/2021] [Accepted: 04/06/2021] [Indexed: 12/21/2022]
Abstract
Acute Myeloid Leukaemia (AML) is a phenotypically and genetically heterogeneous blood cancer characterised by very poor prognosis, with disease relapse being the primary cause of treatment failure. AML heterogeneity arises from different genetic and non-genetic sources, including its proposed hierarchical structure, with leukemic stem cells (LSCs) and progenitors giving rise to a variety of more mature leukemic subsets. Recent advances in single-cell molecular and phenotypic profiling have highlighted the intra- and inter-patient heterogeneous nature of AML, which has so far limited the success of cell-based immunotherapy approaches against single targets. Machine Learning (ML) can be uniquely used to find non-trivial patterns in high-dimensional datasets and identify rare sub-populations. Here we review some recent ML tools that, applied to single-cell data, could help disentangle cell heterogeneity in AML by identifying distinct core molecular signatures of leukemic cell subsets. We discuss the advantages and limitations of unsupervised and supervised ML approaches to cluster and classify cell populations in AML, for the identification of biomarkers and the design of personalised therapies.
Affiliation(s)
- Yara E. Sánchez-Corrales
  - Genetics and Genomic Medicine Department, Great Ormond Street Institute of Child Health, University College London, London, United Kingdom
- Ruben V. C. Pohle
  - Molecular and Cellular Immunology Section, Great Ormond Street Institute of Child Health, University College London, London, United Kingdom
- Sergi Castellano
  - Genetics and Genomic Medicine Department, Great Ormond Street Institute of Child Health, University College London, London, United Kingdom
  - University College London (UCL) Genomics, Great Ormond Street Institute of Child Health, University College London, London, United Kingdom
- Alice Giustacchini
  - Molecular and Cellular Immunology Section, Great Ormond Street Institute of Child Health, University College London, London, United Kingdom
|
94
|
Chen S, Yan G, Zhang W, Li J, Jiang R, Lin Z. RA3 is a reference-guided approach for epigenetic characterization of single cells. Nat Commun 2021; 12:2177. [PMID: 33846355 PMCID: PMC8041798 DOI: 10.1038/s41467-021-22495-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/16/2020] [Accepted: 03/18/2021] [Indexed: 12/13/2022]
Abstract
The recent advancements in single-cell technologies, including single-cell chromatin accessibility sequencing (scCAS), have enabled profiling the epigenetic landscapes for thousands of individual cells. However, the characteristics of scCAS data, including high dimensionality, high degree of sparsity and high technical variation, make the computational analysis challenging. Reference-guided approaches, which utilize the information in existing datasets, may facilitate the analysis of scCAS data. Here, we present RA3 (Reference-guided Approach for the Analysis of single-cell chromatin Accessibility data), which utilizes the information in massive existing bulk chromatin accessibility and annotated scCAS data. RA3 simultaneously models (1) the shared biological variation among scCAS data and the reference data, and (2) the unique biological variation in scCAS data that identifies distinct subpopulations. We show that RA3 achieves superior performance when used on several scCAS datasets, and on references constructed using various approaches. Altogether, these analyses demonstrate the wide applicability of RA3 in analyzing scCAS data.
Affiliation(s)
- Shengquan Chen
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Guanao Yan
  - School of Mathematical Sciences, Zhejiang University, Hangzhou, China
- Wenyu Zhang
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Jinzhao Li
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Rui Jiang
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zhixiang Lin
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
|
95
|
Huang Q, Liu Y, Du Y, Garmire LX. Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data. Genomics Proteomics Bioinformatics 2021; 19:267-281. [PMID: 33359678 PMCID: PMC8602772 DOI: 10.1016/j.gpb.2020.07.004] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 10/30/2019] [Revised: 07/16/2020] [Accepted: 10/27/2020] [Indexed: 01/13/2023]
Abstract
Annotating cell types is a critical step in single-cell RNA sequencing (scRNA-seq) data analysis. Some supervised or semi-supervised classification methods have recently emerged to enable automated cell type identification. However, comprehensive evaluations of these methods are lacking. Moreover, it is not clear whether some classification methods originally designed for analyzing other bulk omics data are adaptable to scRNA-seq analysis. In this study, we evaluated ten cell type annotation methods publicly available as R packages. Eight of them are popular methods developed specifically for single-cell research, including Seurat, scmap, SingleR, CHETAH, SingleCellNet, scID, Garnett, and SCINA. The other two methods were repurposed from deconvoluting DNA methylation data, i.e., linear constrained projection (CP) and robust partial correlations (RPC). We conducted systematic comparisons on a wide variety of public scRNA-seq datasets as well as simulation data. We assessed the accuracy through intra-dataset and inter-dataset predictions; the robustness over practical challenges such as gene filtering, high similarity among cell types, and increased cell type classes; as well as the detection of rare and unknown cell types. Overall, methods such as Seurat, SingleR, CP, RPC, and SingleCellNet performed well, with Seurat being the best at annotating major cell types. Additionally, Seurat, SingleR, CP, and RPC were more robust against downsampling. However, Seurat did have a major drawback at predicting rare cell populations, and it was suboptimal at differentiating cell types highly similar to each other, compared to SingleR and RPC. All the code and data are available from https://github.com/qianhuiSenn/scRNA_cell_deconv_benchmark.
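The abstract above notes that a method can score well on major cell types yet perform poorly on rare populations. A small worked example shows why overall accuracy alone can hide this: the sketch below compares accuracy with per-class F1 on a hypothetical, heavily imbalanced label set. The labels and counts are invented for illustration and are not taken from the benchmark.

```python
def accuracy(truth, pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def per_class_f1(truth, pred):
    """F1 score for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(truth) | set(pred)):
        tp = sum(t == label and p == label for t, p in zip(truth, pred))
        fp = sum(t != label and p == label for t, p in zip(truth, pred))
        fn = sum(t == label and p != label for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical dataset: 95 T cells and 5 rare plasmacytoid dendritic cells (pDC).
truth = ["T cell"] * 95 + ["pDC"] * 5
# A classifier that labels every cell as the majority type.
pred = ["T cell"] * 100

overall = accuracy(truth, pred)
f1 = per_class_f1(truth, pred)
macro_f1 = sum(f1.values()) / len(f1)
```

Here the overall accuracy is 0.95 even though the rare population is never detected (its F1 is 0), so the macro-averaged F1 falls below 0.5. Reporting per-class metrics alongside accuracy makes this kind of failure visible.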
Affiliation(s)
- Qianhui Huang
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Yu Liu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48105, USA
- Yuheng Du
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Lana X Garmire
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48105, USA
|
96
|
Ma SX, Lim SB. Single-Cell RNA Sequencing in Parkinson's Disease. Biomedicines 2021; 9:368. [PMID: 33916045 PMCID: PMC8066089 DOI: 10.3390/biomedicines9040368] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 03/04/2021] [Revised: 03/28/2021] [Accepted: 03/30/2021] [Indexed: 02/07/2023]
Abstract
Single-cell and single-nucleus RNA sequencing (sc/snRNA-seq) technologies have enhanced the understanding of the molecular pathogenesis of neurodegenerative disorders, including Parkinson's disease (PD). Nonetheless, their application in PD has been limited, mainly due to the technical challenges resulting from the scarcity of postmortem brain tissue and the low quality associated with RNA degradation. Despite such challenges, recent advances in animal and human in vitro models that recapitulate features of PD, along with sequencing assays, have fueled studies aiming to obtain an unbiased and global view of the cellular composition and phenotype of PD at single-cell resolution. Here, we reviewed recent sc/snRNA-seq efforts that have successfully characterized diverse cell-type populations and identified cell type-specific disease associations in PD. We also examined how these studies have employed computational and analytical tools to analyze and interpret the rich information derived from sc/snRNA-seq. Finally, we highlighted important limitations and emerging technologies for addressing key technical challenges currently limiting the integration of new findings into clinical practice.
Affiliation(s)
- Shi-Xun Ma
  - Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Su Bin Lim
  - Department of Biochemistry and Molecular Biology, Ajou University School of Medicine, Suwon 16499, Korea
|
97
|
Liu X, Gosline SJC, Pflieger LT, Wallet P, Iyer A, Guinney J, Bild AH, Chang JT. Knowledge-based classification of fine-grained immune cell types in single-cell RNA-Seq data. Brief Bioinform 2021; 22:6157454. [PMID: 33681983 DOI: 10.1093/bib/bbab039] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/05/2020] [Revised: 01/11/2021] [Accepted: 01/27/2021] [Indexed: 11/13/2022]
Abstract
Single-cell RNA sequencing (scRNA-Seq) is an emerging strategy for characterizing immune cell populations. Compared to flow or mass cytometry, scRNA-Seq could potentially identify cell types and activation states that lack precise cell surface markers. However, scRNA-Seq is currently limited by the need to manually classify each immune cell from its transcriptional profile. While recently developed algorithms accurately annotate coarse cell types (e.g. T cells versus macrophages), making fine distinctions (e.g. CD8+ effector memory T cells) remains a difficult challenge. To address this, we developed a machine learning classifier called ImmClassifier that leverages a hierarchical ontology of cell types. We demonstrate that its predictions are highly concordant with flow-based markers from CITE-seq and that it outperforms other tools (+15% recall, +14% precision) in distinguishing fine-grained cell types, with comparable performance on coarse ones. Thus, ImmClassifier can be used to explore more deeply the heterogeneity of the immune system in scRNA-Seq experiments.
Affiliation(s)
- Xuan Liu
  - Department of Integrative Biology & Pharmacology, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Lance T Pflieger
  - Department of Medical Oncology & Therapeutics Research, City of Hope National Medical Center, Duarte, CA 91010, USA
- Pierre Wallet
  - Department of Medical Oncology & Therapeutics Research, City of Hope National Medical Center, Duarte, CA 91010, USA
- Archana Iyer
  - Center for Cancer Systems Immunology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Andrea H Bild
  - Department of Medical Oncology & Therapeutics Research, City of Hope National Medical Center, Duarte, CA 91010, USA
- Jeffrey T Chang
  - Department of Integrative Biology & Pharmacology, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
98
|
Su K, Yu T, Wu H. Accurate feature selection improves single-cell RNA-seq cell clustering. Brief Bioinform 2021; 22:6145899. [PMID: 33611426 DOI: 10.1093/bib/bbab034] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/23/2020] [Revised: 01/06/2021] [Accepted: 01/22/2021] [Indexed: 02/04/2023]
Abstract
Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as 'features'), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such a set could have a significant impact on the clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step relying on some simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we carefully evaluate the impact of feature selection on cell clustering accuracy. In addition, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features. We apply the method on 12 public scRNA-seq datasets and demonstrate that using features selected by FEAST with existing clustering tools significantly improves the clustering accuracy.
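To make the baseline concrete, the snippet below sketches the kind of simple moment-based, unsupervised feature selection that the abstract says existing tools rely on. It is not the FEAST algorithm. It assumes a small dictionary of per-gene expression values across cells and ranks genes by their variance-to-mean ratio; the function name and toy counts are hypothetical.

```python
from statistics import mean, pvariance


def select_features(expression_by_gene, n_features):
    """Keep the n genes with the largest variance-to-mean ratio across cells."""

    def dispersion(values):
        mu = mean(values)
        # Genes with zero mean expression carry no signal and score 0.
        return pvariance(values, mu) / mu if mu > 0 else 0.0

    ranked = sorted(
        expression_by_gene,
        key=lambda gene: (-dispersion(expression_by_gene[gene]), gene),
    )
    return ranked[:n_features]


# Hypothetical counts for four genes across four cells.
counts = {
    "ACTB": [10, 11, 9, 10],   # high but nearly constant expression
    "CD3E": [0, 8, 0, 8],      # varies strongly between cells
    "MS4A1": [6, 0, 6, 0],     # varies strongly between cells
    "GAPDH": [7, 7, 7, 7],     # constant expression
}
selected = select_features(counts, 2)
```

Genes that are highly expressed but nearly constant score low, while genes that switch on and off between cells score high. Such a criterion is purely statistical and has no notion of which genes separate cell types, which is the limitation the abstract motivates.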
Affiliation(s)
- Kenong Su
  - Department of Computer Science, Emory University
- Tianwei Yu
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen
- Hao Wu
  - Department of Biostatistics and Bioinformatics, Emory University, 201 Dowman Dr, Atlanta, GA 30322, USA
|
99
|
Wilson CM, Fridley BL, Conejo-Garcia JR, Wang X, Yu X. Wide and deep learning for automatic cell type identification. Comput Struct Biotechnol J 2021; 19:1052-1062. [PMID: 33613870 PMCID: PMC7878986 DOI: 10.1016/j.csbj.2021.01.027] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/30/2020] [Revised: 01/16/2021] [Accepted: 01/18/2021] [Indexed: 01/19/2023]
Abstract
Cell type classification is an important problem in cancer research, especially with the advent of single cell technologies. Correctly identifying cells within the tumor microenvironment can provide oncologists with a snapshot of how a patient’s immune system reacts to the tumor. Wide and deep learning (WDL) is an approach to construct a cell-classification prediction model that can learn patterns within high-dimensional data (deep) and ensure that biologically relevant features (wide) remain in the final model. In this paper, we demonstrate that regularization can prevent overfitting and that adding a wide component to a neural network can result in a model with better predictive performance. In particular, we observed that a combination of dropout and ℓ2 regularization can lead to a validation loss function that does not depend on the number of training iterations and does not experience a significant decrease in prediction accuracy compared to models with ℓ1, dropout, or no regularization. Additionally, we show WDL can have superior classification accuracy when the training and testing of a model are completed on data that arise from the same cancer type but different platforms. More specifically, WDL compared to traditional deep learning models can substantially increase the overall cell type prediction accuracy (36.5 to 86.9%) and the accuracy for T cell subtypes (CD4: 2.4 to 59.1%, and CD8: 19.5 to 96.1%) when the models were trained using melanoma data obtained from the 10X platform and tested on basal cell carcinoma data obtained using SMART-seq. WDL obtains higher accuracy when compared to state-of-the-art cell classification algorithms CHETAH (70.36%) and SingleR (70.59%).
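The wide-and-deep structure described above can be sketched as a forward pass in which selected marker features bypass the hidden layer and feed the output directly. This is a generic, dependency-free illustration with hypothetical weights and shapes, not the authors' model; dropout is omitted, and the ℓ2 term is shown only as a penalty that would be added to the training loss.

```python
import math


def matvec(weights, vector):
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, x) for x in vector]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def wide_deep_forward(x_deep, x_wide, w_hidden, w_out_deep, w_out_wide):
    """Deep path: input -> hidden -> logits. Wide path: marker features -> logits."""
    hidden = relu(matvec(w_hidden, x_deep))
    deep_logits = matvec(w_out_deep, hidden)
    wide_logits = matvec(w_out_wide, x_wide)
    return softmax([d + w for d, w in zip(deep_logits, wide_logits)])


def l2_penalty(weight_matrices, lam):
    """lam times the sum of squared weights, to be added to the training loss."""
    return lam * sum(w * w for m in weight_matrices for row in m for w in row)


# Hypothetical weights: 3 deep inputs, 2 hidden units, 2 classes, 1 wide marker feature.
w_hidden = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0]]
w_out_deep = [[1.0, 0.0], [0.0, 1.0]]
w_out_wide = [[1.0], [0.0]]
x_deep = [1.0, 2.0, 0.5]

with_marker = wide_deep_forward(x_deep, [3.0], w_hidden, w_out_deep, w_out_wide)
without_marker = wide_deep_forward(x_deep, [0.0], w_hidden, w_out_deep, w_out_wide)
penalty = l2_penalty([w_out_wide], 0.5)
```

With the marker feature present, the wide path shifts the prediction toward class 0; with it absent, the deep path alone favors class 1. This is one way to read the abstract's point that the wide component keeps biologically relevant features in the final model.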
Collapse
Affiliation(s)
- Christopher M Wilson
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, 12902 USF Magnolia Drive, Tampa, FL 33612, USA
- Brooke L Fridley
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, 12902 USF Magnolia Drive, Tampa, FL 33612, USA
- José R Conejo-Garcia
  - Department of Immunology, H. Lee Moffitt Cancer Center & Research Institute, 12902 USF Magnolia Drive, Tampa, FL 33612, USA
- Xuefeng Wang
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, 12902 USF Magnolia Drive, Tampa, FL 33612, USA
- Xiaoqing Yu
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, 12902 USF Magnolia Drive, Tampa, FL 33612, USA
|
100
|
Pasquini G, Rojo Arias JE, Schäfer P, Busskamp V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 2021; 19:961-969. [PMID: 33613863 PMCID: PMC7873570 DOI: 10.1016/j.csbj.2021.01.015] [Citation(s) in RCA: 81] [Impact Index Per Article: 27.0] [Received: 10/30/2020] [Revised: 01/13/2021] [Accepted: 01/13/2021] [Indexed: 12/22/2022]
Abstract
The advent of single-cell sequencing started a new era of transcriptomic and genomic research, advancing our knowledge of the cellular heterogeneity and dynamics. Cell type annotation is a crucial step in analyzing single-cell RNA sequencing data, yet manual annotation is time-consuming and partially subjective. As an alternative, tools have been developed for automatic cell type identification. Different strategies have emerged to ultimately associate gene expression profiles of single cells with a cell type either by using curated marker gene databases, correlating reference expression data, or transferring labels by supervised classification. In this review, we present an overview of the available tools and the underlying approaches to perform automated cell type annotations on scRNA-seq data.
Affiliation(s)
- Giovanni Pasquini
  - Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
  - Universitäts-Augenklinik Bonn, University of Bonn, Department of Ophthalmology, Bonn 53127, Germany
- Jesus Eduardo Rojo Arias
  - Wellcome-MRC Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, UK
- Patrick Schäfer
  - Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Volker Busskamp
  - Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
  - Universitäts-Augenklinik Bonn, University of Bonn, Department of Ophthalmology, Bonn 53127, Germany
|