1
Qiu Y, Yang L, Jiang H, Zou Q. scTPC: a novel semisupervised deep clustering model for scRNA-seq data. Bioinformatics 2024; 40:btae293. [PMID: 38684178 PMCID: PMC11091743 DOI: 10.1093/bioinformatics/btae293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/14/2024] [Accepted: 04/26/2024] [Indexed: 05/02/2024]
Abstract
MOTIVATION Continuous advancements in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to further explore the study of cell heterogeneity, trajectory inference, identification of rare cell types, and neurology. Accurate scRNA-seq data clustering is crucial in single-cell sequencing data analysis. However, the high dimensionality, sparsity, and presence of "false" zero values in the data can pose challenges to clustering. Furthermore, current unsupervised clustering algorithms have not effectively leveraged prior biological knowledge, making cell clustering even more challenging. RESULTS This study investigates a semisupervised clustering model called scTPC, which integrates the triplet constraint, pairwise constraint, and cross-entropy constraint based on deep learning. Specifically, the model begins by pretraining a denoising autoencoder based on a zero-inflated negative binomial distribution. Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partial labeled cells. Finally, to address imbalanced cell-type datasets, a weighted cross-entropy loss is introduced to optimize the model. A series of experimental results on 10 real scRNA-seq datasets and five simulated datasets demonstrate that scTPC achieves accurate clustering with a well-designed framework. AVAILABILITY AND IMPLEMENTATION scTPC is a Python-based algorithm, and the code is available from https://github.com/LF-Yang/Code or https://zenodo.org/records/10951780.
Affiliation(s)
- Yushan Qiu
  - School of Mathematical Sciences, Shenzhen University, Shenzhen, Guangdong 518000, China
- Lingfei Yang
  - School of Mathematical Sciences, Shenzhen University, Shenzhen, Guangdong 518000, China
- Hao Jiang
  - School of Mathematics, Renmin University of China, Haidian District, Beijing 100872, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610056, China
2
Ma M, Li L, Yang SH, Huang C, Zhuang W, Huang S, Xia X, Tang Y, Li Z, Zhao ZB, Chen Q, Qiao G, Lian ZX. Lymphatic endothelial cell-mediated accumulation of CD177+ Treg cells suppresses antitumor immunity in human esophageal squamous cell carcinoma. Oncoimmunology 2024; 13:2327692. [PMID: 38516269 PMCID: PMC10956621 DOI: 10.1080/2162402x.2024.2327692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Accepted: 03/04/2024] [Indexed: 03/23/2024]
Abstract
Regulatory T (Treg) cells are critical in shaping an immunosuppressive microenvironment to favor tumor progression and resistance to therapies. However, the heterogeneity and function of Treg cells in esophageal squamous cell carcinoma (ESCC) remain underexplored. We identified CD177 as a tumor-infiltrating Treg cell marker in ESCC. Interestingly, expression levels of CD177 and PD-1 were mutually exclusive in tumor Treg cells. CD177+ Treg cells expressed high levels of IL35, in association with CD8+ T cell exhaustion, whereas PD-1+ Treg cells expressed high levels of IL10. Pan-cancer analysis revealed that CD177+ Treg cells display increased clonal expansion compared to PD-1+ and double-negative (DN) Treg cells, and CD177+ and PD-1+ Treg cells develop from the same DN Treg cell origin. Importantly, we found CD177+ Treg cell infiltration to be associated with poor overall survival and poor response to anti-PD-1 immunotherapy plus chemotherapy in ESCC patients. Finally, we found that lymphatic endothelial cells are associated with CD177+ Treg cell accumulation in ESCC tumors, which are also decreased after anti-PD-1 immunotherapy plus chemotherapy. Our work identifies CD177+ Treg cell as a tumor-specific Treg cell subset and highlights their potential value as a prognostic marker of survival and response to immunotherapy and a therapeutic target in ESCC.
Affiliation(s)
- Min Ma
  - Chronic Disease Laboratory, School of Medicine, South China University of Technology, Guangzhou, China
  - Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Liang Li
  - Medical Research Institute, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Shu-Han Yang
  - Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Chuan Huang
  - Chronic Disease Laboratory, School of Medicine, South China University of Technology, Guangzhou, China
  - Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Weitao Zhuang
  - Department of Thoracic Surgery, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Shujie Huang
  - Department of Thoracic Surgery, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Xin Xia
  - Department of Thoracic Surgery, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Yong Tang
  - Department of Thoracic Surgery, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Zijun Li
  - Guangdong Provincial Institute of Geriatrics, Concord Medical Center, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Zhi-Bin Zhao
  - Medical Research Institute, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Qingyun Chen
  - Medical Research Institute, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Guibin Qiao
  - Department of Thoracic Surgery, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Zhe-Xiong Lian
  - Chronic Disease Laboratory, School of Medicine, South China University of Technology, Guangzhou, China
  - Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
3
Yang J, Wang W, Zhang X. scSemiGCN: boosting cell-type annotation from noise-resistant graph neural networks with extremely limited supervision. Bioinformatics 2024; 40:btae091. [PMID: 38366925 PMCID: PMC10904148 DOI: 10.1093/bioinformatics/btae091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 01/14/2024] [Accepted: 02/14/2024] [Indexed: 02/19/2024]
Abstract
MOTIVATION Cell-type annotation is fundamental in revealing cell heterogeneity for single-cell data analysis. Although a host of works have been developed, the low signal-to-noise-ratio single-cell RNA-sequencing data that suffers from batch effects and dropout still poses obstacles in discovering grouped patterns for cell types by unsupervised learning and its alternative, semi-supervised learning, which utilizes a few labeled cells as guidance for cell-type annotation. RESULTS We propose a robust cell-type annotation method scSemiGCN based on graph convolutional networks. Built upon a denoised network structure that characterizes reliable cell-to-cell connections, scSemiGCN generates pseudo labels for unannotated cells. Then supervised contrastive learning follows to refine the noisy single-cell data. Finally, message passing with the refined features over the denoised network structure is conducted for semi-supervised cell-type annotation. Comparison over several datasets with six methods under extremely limited supervision validates the effectiveness and efficiency of scSemiGCN for cell-type annotation. AVAILABILITY AND IMPLEMENTATION Implementation of scSemiGCN is available at https://github.com/Jane9898/scSemiGCN.
Affiliation(s)
- Jue Yang
  - School of Mathematics, Sun Yat-sen University, Guangzhou 510000, China
- Weiwen Wang
  - Department of Mathematics, School of Information Science and Technology, Jinan University, Guangzhou 510000, China
- Xiwen Zhang
  - Department of Bioinformatics, College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou 510000, China
4
Wei Z, Chenjun W, Feiyang X, Mingfeng J, Yixuan Z, Qi L, Zhuoxing S, Qi D. scHybridBERT: integrating gene regulation and cell graph for spatiotemporal dynamics in single-cell clustering. Brief Bioinform 2024; 25:bbae018. [PMID: 38517692 PMCID: PMC10959234 DOI: 10.1093/bib/bbae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 12/19/2023] [Accepted: 01/09/2024] [Indexed: 03/24/2024]
Abstract
Graph learning models have received increasing attention in the computational analysis of single-cell RNA sequencing (scRNA-seq) data. Compared with conventional deep neural networks, graph neural networks and language models have exhibited superior performance by extracting graph-structured data from raw gene count matrices. Established deep neural network-based clustering approaches generally focus on temporal expression patterns while ignoring inherent interactions at gene-level as well as cell-level, which could be regarded as spatial dynamics in single-cell data. Both gene-gene and cell-cell interactions are able to boost the performance of cell type detection, under the framework of multi-view modeling. In this study, spatiotemporal embedding and cell graphs are extracted to capture spatial dynamics at the molecular level. In order to enhance the accuracy of cell type detection, this study proposes the scHybridBERT architecture to conduct multi-view modeling of scRNA-seq data using extracted spatiotemporal patterns. In this scHybridBERT method, graph learning models are employed to deal with cell graphs and the Performer model employs spatiotemporal embeddings. Experimental outcomes about benchmark scRNA-seq datasets indicate that the proposed scHybridBERT method is able to enhance the accuracy of single-cell clustering tasks by integrating spatiotemporal embeddings and cell graphs.
Affiliation(s)
- Zhang Wei
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
- Wu Chenjun
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
- Xing Feiyang
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, 200092, Shanghai, China
- Zhang Yixuan
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
- Liu Qi
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, 200092, Shanghai, China
- Shi Zhuoxing
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, 510060, Guangzhou, China
- Dai Qi
  - Zhejiang Sci-Tech University, 310028, Hangzhou, China
5
Wan H, Yuan M, Fu Y, Deng M. Continually adapting pre-trained language model to universal annotation of single-cell RNA-seq data. Brief Bioinform 2024; 25:bbae047. [PMID: 38388681 PMCID: PMC10883808 DOI: 10.1093/bib/bbae047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 12/29/2023] [Accepted: 01/18/2024] [Indexed: 02/24/2024]
Abstract
MOTIVATION Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a hallmark of biomedical research and clinical application. Current annotation tools usually assume the simultaneous acquisition of well-annotated data, but without the ability to expand knowledge from new data. Yet, such tools are inconsistent with the continuous emergence of scRNA-seq data, calling for a continuous cell-type annotation model. In addition, by their powerful ability of information integration and model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology research. Therefore, the systematic combining of continual learning and pre-trained language models for cell-type annotation tasks is inevitable. RESULTS We herein propose a universal cell-type annotation tool, called CANAL, that continuously fine-tunes a pre-trained language model trained on a large amount of unlabeled scRNA-seq data, as new well-labeled data emerges. CANAL essentially alleviates the dilemma of catastrophic forgetting, both in terms of model inputs and outputs. For model inputs, we introduce an experience replay schema that repeatedly reviews previous vital examples in current training stages. This is achieved through a dynamic example bank with a fixed buffer size. The example bank is class-balanced and proficient in retaining cell-type-specific information, particularly facilitating the consolidation of patterns associated with rare cell types. For model outputs, we utilize representation knowledge distillation to regularize the divergence between previous and current models, resulting in the preservation of knowledge learned from past training stages. Moreover, our universal annotation framework considers the inclusion of new cell types throughout the fine-tuning and testing stages. 
We can continuously expand the cell-type annotation library by absorbing new cell types from newly arrived, well-annotated training datasets, as well as automatically identify novel cells in unlabeled datasets. Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL. AVAILABILITY An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch. CONTACT dengmh@pku.edu.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Journal Name online.
Affiliation(s)
- Hui Wan
  - School of Mathematical Sciences, Peking University, Beijing, China, 100871
- Musu Yuan
  - Center for Quantitative Biology, Peking University, Beijing, China, 100871
- Yiwei Fu
  - School of Mathematical Sciences, Peking University, Beijing, China, 100871
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing, China, 100871
  - Center for Quantitative Biology, Peking University, Beijing, China, 100871
  - Center for Statistical Science, Peking University, Beijing, China, 100871
6
Zhai Y, Chen L, Deng M. scEVOLVE: cell-type incremental annotation without forgetting for single-cell RNA-seq data. Brief Bioinform 2024; 25:bbae039. [PMID: 38366803 PMCID: PMC10939389 DOI: 10.1093/bib/bbae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Revised: 01/03/2024] [Accepted: 01/09/2024] [Indexed: 02/18/2024]
Abstract
The evolution in single-cell RNA sequencing (scRNA-seq) technology has opened a new avenue for researchers to inspect cellular heterogeneity with single-cell precision. One crucial aspect of this technology is cell-type annotation, which is fundamental for any subsequent analysis in single-cell data mining. Recently, the scientific community has seen a surge in the development of automatic annotation methods aimed at this task. However, these methods generally operate at a steady-state total cell-type capacity, significantly restricting the cell annotation systems' capacity for continuous knowledge acquisition. Furthermore, creating a unified scRNA-seq annotation system remains challenged by the need to progressively expand its understanding of ever-increasing cell-type concepts derived from a continuous data stream. In response to these challenges, this paper presents a novel and challenging setting for annotation, namely cell-type incremental annotation. This concept is designed to perpetually enhance cell-type knowledge, gleaned from continuously incoming data. This task encounters difficulty with data stream samples that can only be observed once, leading to catastrophic forgetting. To address this problem, we introduce our breakthrough methodology termed scEVOLVE, an incremental annotation method. This innovative approach is built upon the methodology of contrastive sample replay combined with the fundamental principle of partition confidence maximization. Specifically, we initially retain and replay sections of the old data in each subsequent training phase, then establish a unique prototypical learning objective to mitigate the cell-type imbalance problem, as an alternative to using cross-entropy. To effectively emulate a model that trains concurrently with complete data, we introduce a cell-type decorrelation strategy that efficiently scatters feature representations of each cell type uniformly.
We constructed the scEVOLVE framework with simplicity and ease of integration into most deep softmax-based single-cell annotation methods. Thorough experiments conducted on a range of meticulously constructed benchmarks consistently prove that our methodology can incrementally learn numerous cell types over an extended period, outperforming other strategies that fail quickly. As far as our knowledge extends, this is the first attempt to propose and formulate an end-to-end algorithm framework to address this new, practical task. Additionally, scEVOLVE, coded in Python using the Pytorch machine-learning library, is freely accessible at https://github.com/aimeeyaoyao/scEVOLVE.
Affiliation(s)
- Yuyao Zhai
  - School of Mathematical Sciences, Peking University, Beijing, China
- Liang Chen
  - Huawei Technologies Co., Ltd., Beijing, China
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing, China
  - Center for Statistical Science, Peking University, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing, China
7
Jiang H, Zhan S, Ching WK, Chen L. Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian matrix optimization. Bioinformatics 2023; 39:btad414. [PMID: 37382572 PMCID: PMC10329495 DOI: 10.1093/bioinformatics/btad414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 06/03/2023] [Accepted: 06/28/2023] [Indexed: 06/30/2023]
Abstract
MOTIVATION Simultaneous profiling of multi-omics single-cell data represents exciting technological advancements for understanding cellular states and heterogeneity. Cellular indexing of transcriptomes and epitopes by sequencing allowed for parallel quantification of cell-surface protein expression and transcriptome profiling in the same cells; methylome and transcriptome sequencing from single cells allows for analysis of transcriptomic and epigenomic profiling in the same individual cells. However, effective integration method for mining the heterogeneity of cells over the noisy, sparse, and complex multi-modal data is in growing need. RESULTS In this article, we propose a multi-modal high-order neighborhood Laplacian matrix optimization framework for integrating the multi-omics single-cell data: scHoML. Hierarchical clustering method was presented for analyzing the optimal embedding representation and identifying cell clusters in a robust manner. This novel method by integrating high-order and multi-modal Laplacian matrices would robustly represent the complex data structures and allow for systematic analysis at the multi-omics single-cell level, thus promoting further biological discoveries. AVAILABILITY AND IMPLEMENTATION Matlab code is available at https://github.com/jianghruc/scHoML.
Affiliation(s)
- Hao Jiang
  - School of Mathematics, Renmin University of China, Beijing 100872, China
- Senwen Zhan
  - School of Mathematics, Renmin University of China, Beijing 100872, China
- Wai-Ki Ching
  - Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
- Luonan Chen
  - Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, CAS Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
8
Xiong YX, Wang MG, Chen L, Zhang XF. Cell-type annotation with accurate unseen cell-type identification using multiple references. PLoS Comput Biol 2023; 19:e1011261. [PMID: 37379341 PMCID: PMC10335708 DOI: 10.1371/journal.pcbi.1011261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 07/11/2023] [Accepted: 06/11/2023] [Indexed: 06/30/2023]
Abstract
The recent advances in single-cell RNA sequencing (scRNA-seq) techniques have stimulated efforts to identify and characterize the cellular composition of complex tissues. With the advent of various sequencing techniques, automated cell-type annotation using a well-annotated scRNA-seq reference becomes popular. But it relies on the diversity of cell types in the reference, which may not capture all the cell types present in the query data of interest. There are generally unseen cell types in the query data of interest because most data atlases are obtained for different purposes and techniques. Identifying previously unseen cell types is essential for improving annotation accuracy and uncovering novel biological discoveries. To address this challenge, we propose mtANN (multiple-reference-based scRNA-seq data annotation), a new method to automatically annotate query data while accurately identifying unseen cell types with the aid of multiple references. Key innovations of mtANN include the integration of deep learning and ensemble learning to improve prediction accuracy, and the introduction of a new metric that considers three complementary aspects to distinguish between unseen cell types and shared cell types. Additionally, we provide a data-driven method to adaptively select a threshold for identifying previously unseen cell types. We demonstrate the advantages of mtANN over state-of-the-art methods for unseen cell-type identification and cell-type annotation on two benchmark dataset collections, as well as its predictive power on a collection of COVID-19 datasets. The source code and tutorial are available at https://github.com/Zhangxf-ccnu/mtANN.
Affiliation(s)
- Yi-Xuan Xiong
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, China
- Meng-Guo Wang
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, China
- Luonan Chen
  - State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai, China
  - School of Life Science and Technology, ShanghaiTech University, Shanghai, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou, China
  - Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, China
9
Choi JM, Park C, Chae H. meth-SemiCancer: a cancer subtype classification framework via semi-supervised learning utilizing DNA methylation profiles. BMC Bioinformatics 2023; 24:168. [PMID: 37101254 PMCID: PMC10131478 DOI: 10.1186/s12859-023-05272-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/01/2022] [Accepted: 04/05/2023] [Indexed: 04/28/2023]
Abstract
BACKGROUND Identification of the cancer subtype plays a crucial role to provide an accurate diagnosis and proper treatment to improve the clinical outcomes of patients. Recent studies have shown that DNA methylation is one of the key factors for tumorigenesis and tumor growth, where the DNA methylation signatures have the potential to be utilized as cancer subtype-specific markers. However, due to the high dimensionality and the low number of DNA methylome cancer samples with the subtype information, still, to date, a cancer subtype classification method utilizing DNA methylome datasets has not been proposed. RESULTS In this paper, we present meth-SemiCancer, a semi-supervised cancer subtype classification framework based on DNA methylation profiles. The proposed model was first pre-trained based on the methylation datasets with the cancer subtype labels. After that, meth-SemiCancer generated the pseudo-subtypes for the cancer datasets without subtype information based on the model's prediction. Finally, fine-tuning was performed utilizing both the labeled and unlabeled datasets. CONCLUSIONS From the performance comparison with the standard machine learning-based classifiers, meth-SemiCancer achieved the highest average F1-score and Matthews correlation coefficient, outperforming other methods. Fine-tuning the model with the unlabeled patient samples by providing the proper pseudo-subtypes, encouraged meth-SemiCancer to generalize better than the supervised neural network-based subtype classification method. meth-SemiCancer is publicly available at https://github.com/cbi-bioinfo/meth-SemiCancer.
Affiliation(s)
- Joung Min Choi
  - Department of Computer Science, Virginia Tech, Blacksburg, USA
- Chaelin Park
  - Division of Computer Science, Sookmyung Women's University, Seoul, Republic of Korea
- Heejoon Chae
  - Division of Computer Science, Sookmyung Women's University, Seoul, Republic of Korea
10
Yuan M, Chen L, Deng M. Clustering single-cell multi-omics data with MoClust. Bioinformatics 2022; 39:6831092. [PMID: 36383167 PMCID: PMC9805570 DOI: 10.1093/bioinformatics/btac736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2022] [Revised: 11/09/2022] [Accepted: 11/14/2022] [Indexed: 11/17/2022]
Abstract
MOTIVATION Single-cell multi-omics sequencing techniques have rapidly developed in the past few years. Clustering analysis with single-cell multi-omics data may give us novel perspectives to dissect cellular heterogeneity. However, multi-omics data have the properties of inherited large dimension, high sparsity and existence of doublets. Moreover, representations of different omics from even the same cell follow diverse distributions. Without proper distribution alignment techniques, clustering methods will encounter less separable clusters easily affected by less informative omics data. RESULTS We developed MoClust, a novel joint clustering framework that can be applied to several types of single-cell multi-omics data. A selective automatic doublet detection module that can identify and filter out doublets is introduced in the pretraining stage to improve data quality. Omics-specific autoencoders are introduced to characterize the multi-omics data. A contrastive learning way of distribution alignment is adopted to adaptively fuse omics representations into an omics-invariant representation. This novel way of alignment boosts the compactness and separableness of clusters, while accurately weighting the contribution of each omics to the clustering object. Extensive experiments, over both simulated and real multi-omics datasets, demonstrated the powerful alignment, doublet detection and clustering ability features of MoClust. AVAILABILITY AND IMPLEMENTATION An implementation of MoClust is available from https://doi.org/10.5281/zenodo.7306504. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Musu Yuan
  - Center for Quantitative Biology, Peking University, Beijing 100871, China
- Liang Chen
  - To whom correspondence should be addressed.
11
Xu Z, Luo J, Xiong Z. scSemiGAN: a single-cell semi-supervised annotation and dimensionality reduction framework based on generative adversarial network. Bioinformatics 2022; 38:5042-5048. [PMID: 36193998 DOI: 10.1093/bioinformatics/btac652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Revised: 09/05/2022] [Accepted: 10/02/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Cell-type annotation plays a crucial role in single-cell RNA-seq (scRNA-seq) data analysis. As more and more well-annotated scRNA-seq reference data are publicly available, automatical label transference algorithms are gaining popularity over manual marker gene-based annotation methods. However, most existing methods fail to unify cell-type annotation with dimensionality reduction and are unable to generate deep latent representation from the perspective of data generation. RESULTS In this article, we propose scSemiGAN, a single-cell semi-supervised cell-type annotation and dimensionality reduction framework based on a generative adversarial network, to overcome these challenges, modeling scRNA-seq data from the aspect of data generation. Our proposed scSemiGAN is capable of performing deep latent representation learning and cell-type label prediction simultaneously. Through extensive comparison with four state-of-the-art annotation methods on diverse simulated and real scRNA-seq datasets, scSemiGAN achieves competitive or superior performance in multiple downstream tasks including cell-type annotation, latent representation visualization, confounding factor removal and enrichment analysis. AVAILABILITY AND IMPLEMENTATION The code and data of scSemiGAN are available on GitHub: https://github.com/rafa-nadal/scSemiGAN. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhongyuan Xu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
- Zehao Xiong
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
12
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics, Proteomics & Bioinformatics 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Affiliation(s)
- Matthew Brendel
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Chang Su
  - Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA
- Zilong Bai
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Hao Zhang
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Olivier Elemento
  - Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
13
Wan H, Chen L, Deng M. scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception. Genomics, Proteomics & Bioinformatics 2022; 20:939-958. [PMID: 36608843 PMCID: PMC10025768 DOI: 10.1016/j.gpb.2022.12.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/21/2022] [Revised: 11/30/2022] [Accepted: 12/11/2022] [Indexed: 01/05/2023]
Abstract
Current cell-type annotation tools for single-cell RNA sequencing (scRNA-seq) data mainly utilize well-annotated source data to help identify cell types in target data. However, on account of privacy preservation, their requirements for raw source data may not always be satisfied. In this case, achieving feature alignment between source and target data explicitly is impossible. Additionally, these methods are barely able to discover the presence of novel cell types. A subjective threshold is often selected by users to detect novel cells. We propose a universal annotation framework for scRNA-seq data called scEMAIL, which automatically detects novel cell types without accessing source data during adaptation. For new cell-type identification, a novel cell-type perception module is designed with three steps. First, an expert ensemble system measures uncertainty of each cell from three complementary aspects. Second, based on this measurement, bimodality tests are applied to detect the presence of new cell types. Third, once assured of their presence, an adaptive threshold via manifold mixup partitions target cells into "known" and "unknown" groups. Model adaptation is then conducted to alleviate the batch effect. We gather multi-order neighborhood messages globally and impose local affinity regularizations on "known" cells. These constraints mitigate wrong classifications of the source model via reliable self-supervised information of neighbors. scEMAIL is accurate and robust under various scenarios in both simulation and real data. It is also flexible to be applied to challenging single-cell ATAC-seq data without loss of superiority. The source code of scEMAIL can be accessed at https://github.com/aster-ww/scEMAIL and https://ngdc.cncb.ac.cn/biocode/tools/BT007335/releases/v1.0.
Affiliation(s)
- Hui Wan
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
- Liang Chen
  - Huawei Technologies Co., Ltd., Beijing 100080, China
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Statistical Science, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China
14
Yuan M, Chen L, Deng M. Clustering CITE-seq data with a canonical correlation-based deep learning method. Front Genet 2022; 13:977968. [PMID: 36072672 PMCID: PMC9441595 DOI: 10.3389/fgene.2022.977968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2022] [Accepted: 07/22/2022] [Indexed: 12/03/2022]
Abstract
Single-cell multiomics sequencing techniques have developed rapidly in the past few years. Among these techniques, single-cell cellular indexing of transcriptomes and epitopes (CITE-seq) allows simultaneous quantification of gene expression and surface proteins. Clustering CITE-seq data has great potential to provide a more comprehensive and in-depth view of cell states and interactions. However, CITE-seq data inherit the properties of scRNA-seq data, being noisy, high-dimensional, and highly sparse. Moreover, the representations of RNA and surface protein are sometimes weakly correlated and contribute differently to the clustering objective. To overcome these obstacles and find a combined representation well suited for clustering, we proposed scCTClust for the clustering analysis of multiomics data, especially CITE-seq data. Two omics-specific neural networks are introduced to extract cluster information from omics data. A deep canonical correlation method is adopted to find the maximally correlated representations of the two omics. A novel decentralized clustering method is applied to the linear combination of the latent representations of the two omics. The fusion weights, which account for the contribution of each omics layer to clustering, are adaptively updated during training. Extensive experiments on both simulated and real CITE-seq data sets demonstrated the power of scCTClust. We also applied scCTClust to transcriptome–epigenome data to illustrate its potential for generalization.
Affiliation(s)
- Musu Yuan
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - *Correspondence: Musu Yuan
- Liang Chen
  - Department of Probability and Statistics, School of Mathematical Sciences, Peking University, Beijing, China
- Minghua Deng
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Department of Probability and Statistics, School of Mathematical Sciences, Peking University, Beijing, China
  - Center for Statistical Science, Peking University, Beijing, China
15
Dohmen J, Baranovskii A, Ronen J, Uyar B, Franke V, Akalin A. Identifying tumor cells at the single-cell level using machine learning. Genome Biol 2022; 23:123. [PMID: 35637521 PMCID: PMC9150321 DOI: 10.1186/s13059-022-02683-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 11/23/2021] [Accepted: 05/06/2022] [Indexed: 12/15/2022]
Abstract
Tumors are complex tissues of cancerous cells surrounded by a heterogeneous cellular microenvironment with which they interact. Single-cell sequencing enables molecular characterization of single cells within the tumor. However, cell annotation (the assignment of cell type or cell state to each sequenced cell) is a challenge, especially identifying tumor cells within single-cell or spatial sequencing experiments. Here, we propose ikarus, a machine learning pipeline aimed at distinguishing tumor cells from normal cells at the single-cell level. We test ikarus on multiple single-cell datasets, showing that it achieves high sensitivity and specificity in multiple experimental contexts.
Affiliation(s)
- Jan Dohmen
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Artem Baranovskii
  - Non-coding RNAs and Mechanisms of Cytoplasmic Gene Regulation Lab, Berlin Institute for Medical Systems Biology, Hannoversche Str. 28, 10115, Berlin, Germany
  - Free University Berlin, Kaiserswerther Str. 16-18, 14195, Berlin, Germany
- Jonathan Ronen
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Bora Uyar
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Vedran Franke
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Altuna Akalin
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
16
Zhang Y, Zhang F, Wang Z, Wu S, Tian W. scMAGIC: accurately annotating single cells using two rounds of reference-based classification. Nucleic Acids Res 2022; 50:e43. [PMID: 34986249 PMCID: PMC9071478 DOI: 10.1093/nar/gkab1275] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/20/2021] [Revised: 11/08/2021] [Accepted: 12/14/2021] [Indexed: 11/21/2022]
Abstract
Here, we introduce scMAGIC (Single Cell annotation using MArker Genes Identification and two rounds of reference-based Classification [RBC]), a novel method that uses well-annotated single-cell RNA sequencing (scRNA-seq) data as the reference to assist in the classification of query scRNA-seq data. A key innovation in scMAGIC is the introduction of a second-round RBC in which those query cells whose cell identities are confidently validated in the first round are used as a new reference to again classify query cells, therefore eliminating the batch effects between the reference and the query data. scMAGIC significantly outperforms 13 competing RBC methods with their optimal parameter settings across 86 benchmark tests, especially when the cell types in the query dataset are not completely covered by the reference dataset and when there exist significant batch effects between the reference and the query datasets. Moreover, when no reference dataset is available, scMAGIC can annotate query cells with reasonably high accuracy by using an atlas dataset as the reference.
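The two-round idea in the scMAGIC abstract above can be illustrated with a toy sketch. This is not the scMAGIC implementation: the function names, the nearest-centroid Pearson classifier, and the `min_score` confidence threshold are all assumptions made for illustration (the abstract says round-one identities are "confidently validated" but does not say how). The sketch only shows the control flow: classify against the reference, keep the confident query cells, then reclassify every query cell against centroids built from those confident query cells.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def centroids(cells, labels):
    """Mean expression profile per cell type."""
    groups = {}
    for cell, label in zip(cells, labels):
        groups.setdefault(label, []).append(cell)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


def classify(cells, cents):
    """Assign each cell to the best-correlated centroid; return (label, score)."""
    results = []
    for cell in cells:
        scores = {label: pearson(cell, c) for label, c in cents.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results


def two_round_rbc(ref_cells, ref_labels, query_cells, min_score=0.9):
    """Hypothetical two-round reference-based classification (RBC)."""
    # Round 1: classify query cells against the external reference.
    round1 = classify(query_cells, centroids(ref_cells, ref_labels))
    # Keep only confidently labeled query cells (assumed threshold rule).
    confident = [
        (cell, label)
        for cell, (label, score) in zip(query_cells, round1)
        if score >= min_score
    ]
    if not confident:
        return [label for label, _ in round1]
    # Round 2: the confident query cells become the new reference, so the
    # reference and the query now come from the same batch.
    new_cells, new_labels = zip(*confident)
    round2 = classify(query_cells, centroids(new_cells, new_labels))
    return [label for label, _ in round2]
```

Because the second-round reference is drawn from the query data itself, any systematic shift between the original reference and the query no longer affects the final assignment, which is the motivation the abstract gives for the second round.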
Affiliation(s)
- Yu Zhang
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Feng Zhang
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
  - Department of Histoembryology, Genetics and Developmental Biology, Shanghai Key Laboratory of Reproductive Medicine, Key Laboratory of Cell Differentiation and Apoptosis of Chinese Ministry of Education, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Zekun Wang
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Siyi Wu
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Weidong Tian
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
  - Qilu Children's Hospital of Shandong University, No 23976 Jingshi Road, Jinan, Shandong, China
  - Children’s Hospital of Fudan University, Shanghai 201102, China
17
Zhang R, Luo Y, Ma J, Zhang M, Wang S. scPretrain: multi-task self-supervised learning for cell-type classification. Bioinformatics 2022; 38:1607-1614. [PMID: 34999749 DOI: 10.1093/bioinformatics/btac007] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/09/2021] [Revised: 12/25/2021] [Accepted: 01/04/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell-type classification, which aims at characterizing and labeling groups of cells according to their gene expression, is one of the most important steps for single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to automatically classify cells. Most of the existing supervised learning approaches only utilize annotated cells in the training step while ignoring the more abundant unannotated cells. In this article, we proposed scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell-type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature extraction encoder based on each dataset's pseudo-labels, where only unannotated cells are used. In the fine-tuning step, scPretrain fine-tunes this feature extraction encoder using the limited annotated cells in a new dataset. RESULTS We evaluated scPretrain on 60 diverse datasets from different technologies, species and organs, and obtained a significant improvement on both cell-type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers, such as random forest, logistic regression and support-vector machines. scPretrain is able to effectively utilize the massive amount of unlabeled data and be applied to annotating increasingly generated scRNA-seq datasets. AVAILABILITY AND IMPLEMENTATION The data and code underlying this article are available in scPretrain: Multi-task self-supervised learning for cell type classification, at https://github.com/ruiyi-zhang/scPretrain and https://zenodo.org/record/5802306. 
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ruiyi Zhang
  - School of EECS, Peking University, Beijing, China
- Yunan Luo
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jianzhu Ma
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA; Department of Biochemistry, Purdue University, West Lafayette, IN, USA
- Ming Zhang
  - School of EECS, Peking University, Beijing, China
- Sheng Wang
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
18
Wan H, Chen L, Deng M. scNAME: neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data. Bioinformatics 2022; 38:1575-1583. [PMID: 34999761 DOI: 10.1093/bioinformatics/btac011] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 10/07/2021] [Revised: 11/28/2021] [Accepted: 01/05/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION The rapid development of single-cell RNA sequencing (scRNA-seq) makes it possible to study the heterogeneity of individual cell characteristics. Cell clustering is a vital procedure in scRNA-seq analysis, providing insight into complex biological phenomena. However, the noisy, high-dimensional and large-scale nature of scRNA-seq data introduces challenges in clustering analysis. Up to now, many deep learning-based methods have emerged to learn underlying feature representations while clustering. However, these methods are inefficient when it comes to rare cell type identification and barely able to fully utilize gene dependencies or cell similarity integrally. As a result, they cannot detect a clear cell type structure which is required for clustering accuracy as well as downstream analysis. RESULTS Here, we propose a novel scRNA-seq clustering algorithm called scNAME which incorporates a mask estimation task for gene pertinence mining and a neighborhood contrastive learning framework for cell intrinsic structure exploitation. The learned pattern through mask estimation helps reveal uncorrupted data structure and denoise the original single-cell data. In addition, the randomly created augmented data introduced in contrastive learning not only helps improve robustness of clustering, but also increases sample size in each cluster for better data capacity. Beyond this, we also introduce a neighborhood contrastive paradigm with an offline memory bank, global in scope, which can inspire discriminative feature representation and achieve intra-cluster compactness, yet inter-cluster separation. The combination of mask estimation task, neighborhood contrastive learning and global memory bank designed in scNAME is conducive to rare cell type detection. The experimental results of both simulations and real data confirm that our method is accurate, robust and scalable.
We also implement biological analysis, including marker gene identification, gene ontology and pathway enrichment analysis, to validate the biological significance of our method. To the best of our knowledge, we are among the first to introduce a gene relationship exploration strategy, as well as a global cellular similarity repository, in the single-cell field. AVAILABILITY AND IMPLEMENTATION An implementation of scNAME is available from https://github.com/aster-ww/scNAME. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hui Wan
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
- Liang Chen
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China; Center for Statistical Science, Peking University, Beijing 100871, China
19
An active learning approach for clustering single-cell RNA-seq data. J Transl Med 2022; 102:227-235. [PMID: 34244616 PMCID: PMC8742847 DOI: 10.1038/s41374-021-00639-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/14/2021] [Revised: 06/22/2021] [Accepted: 06/23/2021] [Indexed: 11/24/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) data have been widely used to profile cellular heterogeneity at high resolution. Clustering analysis is a crucial step of scRNA-seq data analysis because it provides a chance to identify and uncover undiscovered cell types. Most methods for clustering scRNA-seq data use an unsupervised learning strategy. Since the clustering step is separated from the cell annotation and labeling step, it is not uncommon for a totally exotic clustering with poor biological interpretability to be generated, a result generally undesired by biologists. To solve this problem, we proposed an active learning (AL) framework for clustering scRNA-seq data. The AL model employed a learning algorithm that can actively query biologists for labels, and this manual labeling is expected to be applied to only a subset of cells. To develop an optimal active learning approach, we explored several key parameters of the AL model in experiments with four real scRNA-seq datasets. We demonstrate that the proposed AL model outperformed state-of-the-art unsupervised clustering methods with fewer than 1000 labeled cells. Therefore, we conclude that the AL model is a promising tool for clustering scRNA-seq data that allows us to achieve superior performance effectively and efficiently.
20
Yuan M, Chen L, Deng M. scMRA: a robust deep learning method to annotate scRNA-seq data with multiple reference datasets. Bioinformatics 2022; 38:738-745. [PMID: 34623390 DOI: 10.1093/bioinformatics/btab700] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/09/2021] [Revised: 08/24/2021] [Accepted: 10/05/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION Single-cell RNA-seq (scRNA-seq) has been widely used to resolve cellular heterogeneity. After collecting scRNA-seq data, the natural next step is to integrate the accumulated data to achieve a common ontology of cell types and states. Thus, an effective and efficient cell-type identification method is urgently needed. Meanwhile, high-quality reference data remain a necessity for precise annotation. However, such tailored reference data are always lacking in practice. To address this, we aggregated multiple datasets into a meta-dataset on which annotation is conducted. Existing supervised or semi-supervised annotation methods suffer from batch effects caused by different sequencing platforms, the effect of which increases in severity with multiple reference datasets. RESULTS Herein, a robust deep learning-based single-cell Multiple Reference Annotator (scMRA) is introduced. In scMRA, a knowledge graph is constructed to represent the characteristics of cell types in different datasets, and a graph convolutional network serves as a discriminator based on this graph. scMRA keeps intra-cell-type closeness and the relative position of cell types across datasets. scMRA is remarkably powerful at transferring knowledge from multiple reference datasets to the unlabeled target domain, thereby gaining an advantage over other state-of-the-art annotation methods in multi-reference data experiments. Furthermore, scMRA can remove batch effects. To the best of our knowledge, this is the first attempt to use multiple insufficient reference datasets to annotate target data, and it is, comparatively, the best annotation method for multiple scRNA-seq datasets. AVAILABILITY AND IMPLEMENTATION An implementation of scMRA is available from https://github.com/ddb-qiwang/scMRA-torch. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Musu Yuan
  - School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China
- Liang Chen
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China; Center for Statistical Science, Peking University, Beijing 100871, China
21
Wang Y, Wong KC, Li X. Exploring high-throughput biomolecular data with multiobjective robust continuous clustering. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.11.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
22
Ge J, King JL, Smuts A, Budowle B. Precision DNA Mixture Interpretation with Single-Cell Profiling. Genes (Basel) 2021; 12:1649. [PMID: 34828255 PMCID: PMC8623868 DOI: 10.3390/genes12111649] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/29/2021] [Revised: 10/14/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
Wet-lab based studies have exploited emerging single-cell technologies to address the challenges of interpreting forensic mixture evidence. However, little effort has been dedicated to developing a systematic approach to interpreting the single-cell profiles derived from the mixtures. This study is the first attempt to develop a comprehensive interpretation workflow in which single-cell profiles from mixtures are interpreted individually and holistically. In this approach, the genotypes from each cell are assessed, the number of contributors (NOC) of the single-cell profiles is estimated, followed by developing a consensus profile of each contributor, and finally the consensus profile(s) can be used for a DNA database search or comparing with known profiles to determine their potential sources. The potential of this single-cell interpretation workflow was assessed by simulation with various mixture scenarios and empirical allele drop-out and drop-in rates, the accuracies of estimating the NOC, the accuracies of recovering the true alleles by consensus, and the capabilities of deconvolving mixtures with related contributors. The results support that the single-cell based mixture interpretation can provide a precision that cannot be achieved with current standard CE-STR analyses. A new paradigm for mixture interpretation is available to enhance the interpretation of forensic genetic casework.
Affiliation(s)
- Jianye Ge
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX 76107, USA
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX 76107, USA
- Jonathan L. King
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX 76107, USA
- Amy Smuts
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX 76107, USA
- Bruce Budowle
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX 76107, USA
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX 76107, USA
23
Ma W, Su K, Wu H. Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction. Genome Biol 2021; 22:264. [PMID: 34503564 PMCID: PMC8427961 DOI: 10.1186/s13059-021-02480-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 05/10/2021] [Accepted: 08/25/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. RESULTS In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data. CONCLUSIONS Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and using a multi-layer perceptron (MLP) as the classifier, along with the F-test as the feature selection method. All the code used for our analysis is available on GitHub (https://github.com/marvinquiet/RefConstruction_supervisedCelltyping).
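The F-test feature selection recommended in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names `f_statistic` and `select_top_genes` are hypothetical, and the score used is the standard one-way ANOVA F statistic (between-group mean square divided by within-group mean square) computed per gene across the annotated cell types. The classifier step (the MLP) is not shown.

```python
def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across labeled groups.

    Assumes at least two groups and more cells than groups.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the cells around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_genes(matrix, labels, gene_names, top_k):
    """Rank genes by F statistic; matrix rows are cells, columns are genes."""
    scores = {
        gene: f_statistic([row[j] for row in matrix], labels)
        for j, gene in enumerate(gene_names)
    }
    return sorted(gene_names, key=lambda g: scores[g], reverse=True)[:top_k]
```

Genes whose mean expression differs strongly between cell types relative to their within-type spread score highest, so they are the ones retained as classifier inputs.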
Affiliation(s)
- Wenjing Ma
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
- Kenong Su
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
- Hao Wu
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
24
Wang J, Zou Q, Lin C. A comparison of deep learning-based pre-processing and clustering approaches for single-cell RNA sequencing data. Brief Bioinform 2021; 23:6361043. [PMID: 34472590 DOI: 10.1093/bib/bbab345] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/26/2021] [Revised: 07/22/2021] [Accepted: 08/04/2021] [Indexed: 11/13/2022]
Abstract
The emergence of single-cell RNA sequencing has facilitated the study of genomes, transcriptomes and proteomes. As single-cell RNA-seq datasets are released continuously, one of the major challenges facing traditional RNA analysis tools is the high-dimensional, high-sparsity, high-noise and large-scale nature of single-cell RNA-seq data. Deep learning technologies match the characteristics of single-cell RNA-seq data well and offer unprecedented promise. Here, we give a systematic review of the most popular single-cell RNA-seq analysis methods and tools based on deep learning models, covering data preprocessing (quality control, normalization, data correction, dimensionality reduction and data visualization) and clustering for downstream analysis. We further quantitatively evaluate the deep model-based methods for data correction and clustering on 11 gold standard datasets. Moreover, we discuss the data preferences of these methods and their limitations, and give some suggestions and guidance for users to select appropriate methods and tools.
Affiliation(s)
- Jiacheng Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
- School of Informatics, Xiamen University, Xiamen, China
- Chen Lin
- School of Informatics, Xiamen University, Xiamen, China
|
25
|
Adossa N, Khan S, Rytkönen KT, Elo LL. Computational strategies for single-cell multi-omics integration. Comput Struct Biotechnol J 2021; 19:2588-2596. [PMID: 34025945 PMCID: PMC8114078 DOI: 10.1016/j.csbj.2021.04.060] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 02/19/2021] [Revised: 04/23/2021] [Accepted: 04/24/2021] [Indexed: 02/06/2023]
Abstract
Single-cell omics technologies are currently solving biological and medical problems that previously remained elusive, such as the discovery of new cell types, cellular differentiation trajectories and communication networks across cells and tissues. Current advances, especially in single-cell multi-omics, hold high potential for breakthroughs through the integration of multiple omics layers. To keep pace with recent biotechnological developments, many computational approaches to process and analyze single-cell multi-omics data have been proposed. In this review, we first introduce recent developments in single-cell multi-omics in general and then focus on the available data integration strategies. The integration approaches are divided into three categories: early, intermediate, and late data integration. For each category, we describe the underlying conceptual principles and main characteristics, and provide examples of currently available tools and how they have been applied to analyze single-cell multi-omics data. Finally, we explore the challenges and prospective future directions of single-cell multi-omics data integration, including examples of adapting multi-view analysis approaches used in other disciplines to single-cell multi-omics.
Affiliation(s)
- Nigatu Adossa
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Sofia Khan
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Kalle T. Rytkönen
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Institute of Biomedicine, University of Turku, 20520 Turku, Finland
- Laura L. Elo
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Institute of Biomedicine, University of Turku, 20520 Turku, Finland
|