1
Goggin SM, Zunder ER. A hyperparameter-randomized ensemble approach for robust clustering across diverse datasets. bioRxiv 2023:2023.12.18.571953. [PMID: 38187667 PMCID: PMC10769222 DOI: 10.1101/2023.12.18.571953] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Clustering analysis is widely used to group objects by similarity, but for complex datasets such as those produced by single-cell analysis, the currently available clustering methods are limited by accuracy, robustness, ease of use, and interpretability. To address these limitations, we developed an ensemble clustering method with hyperparameter randomization that outperforms other methods across a broad range of single-cell and synthetic datasets, without the need for manual hyperparameter selection. In addition to hard cluster labels, it also outputs soft cluster memberships to characterize continuum-like regions and per-cell overlap scores to quantify the uncertainty in cluster assignment. We demonstrate the improved clustering interpretability from these features by tracing the intermediate stages between handwritten digits in the MNIST dataset, and between tanycyte subpopulations in the hypothalamus. This approach improves the quality of clustering and subsequent downstream analyses for single-cell datasets, and may also prove useful in other fields of data analysis.
Affiliation(s)
- Sarah M. Goggin
  - Neuroscience Graduate Program, School of Medicine, University of Virginia, Charlottesville, VA 22902
- Eli R. Zunder
  - Neuroscience Graduate Program, School of Medicine, University of Virginia, Charlottesville, VA 22902
  - Department of Biomedical Engineering, School of Engineering, University of Virginia, Charlottesville, VA 22902
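The abstract of entry 1 describes combining many clusterings run with randomized hyperparameters into hard labels plus soft memberships. The sketch below illustrates that general idea with a generic co-association ensemble; it is not the authors' implementation. The toy 1-D base clusterer, the `eps` hyperparameter range, the 0.5 consensus cutoff, and all function names are assumptions made for illustration only.

```python
import random

def threshold_clusters(points, eps):
    """Toy base clusterer: split sorted 1-D points wherever a gap exceeds eps."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    labels = [0] * len(points)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] > eps:
            current += 1
        labels[nxt] = current
    return labels

def co_association(points, n_runs=50, eps_range=(0.05, 3.0), seed=0):
    """Fraction of randomized runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        eps = rng.uniform(*eps_range)  # hyperparameter drawn at random per run
        labels = threshold_clusters(points, eps)
        for i in range(n):
            for j in range(n):
                counts[i][j] += labels[i] == labels[j]
    return [[c / n_runs for c in row] for row in counts]

def consensus_labels(coassoc, cutoff=0.5):
    """Hard labels: connected components of pairs co-clustered more often than cutoff."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > cutoff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def soft_memberships(coassoc, labels):
    """Soft membership: normalized mean co-association with each consensus cluster."""
    k = max(labels) + 1
    n = len(labels)
    rows = []
    for i in range(n):
        scores = [
            sum(coassoc[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in range(k)
        ]
        total = sum(scores)
        rows.append([s / total for s in scores])
    return rows

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # two well-separated toy groups
coassoc = co_association(points)
hard = consensus_labels(coassoc)
soft = soft_memberships(coassoc, hard)
print(hard)
print(soft[0])
```

In this toy setup no single `eps` value has to be chosen by hand: the two groups emerge because they are co-clustered in most randomized runs. Points lying between groups would receive intermediate soft memberships.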
2
Huang M, Ma J, An G, Ye X. Unravelling cancer subtype-specific driver genes in single-cell transcriptomics data with CSDGI. PLoS Comput Biol 2023; 19:e1011450. [PMID: 38096269 PMCID: PMC10754467 DOI: 10.1371/journal.pcbi.1011450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Revised: 12/28/2023] [Accepted: 12/05/2023] [Indexed: 12/29/2023]
Abstract
Cancer is known as a heterogeneous disease. Cancer driver genes (CDGs) need to be inferred to understand tumor heterogeneity in cancer. However, the existing computational methods have identified many common CDGs. A key challenge in exploring cancer progression is to infer cancer subtype-specific driver genes (CSDGs), which provide guidance for the diagnosis, treatment and prognosis of cancer. The significant advancements in single-cell RNA-sequencing (scRNA-seq) technologies have opened up new possibilities for studying human cancers at the individual cell level. In this study, we develop a novel unsupervised method, CSDGI (Cancer Subtype-specific Driver Gene Inference), which applies an encoder-decoder framework consisting of low-rank residual neural networks to infer driver genes corresponding to potential cancer subtypes at the single-cell level. To infer CSDGs, we apply CSDGI to tumor single-cell transcriptomics data. To filter out redundant genes before driver gene inference, we first select differentially expressed genes (DEGs). The experimental results demonstrate that CSDGI is effective at inferring driver genes that are cancer subtype-specific. Functional and disease enrichment analysis shows that these inferred CSDGs indicate the key biological processes and disease pathways. CSDGI is the first method to explore cancer driver genes at the cancer subtype level. We believe that it can be a useful method to understand the mechanisms of cell transformation driving tumours.
Affiliation(s)
- Meng Huang
  - Department of Automation, Xiamen University, Xiamen, China
  - Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Jiangtao Ma
  - Department of Automation, Xiamen University, Xiamen, China
  - School of Engineering, Dali University, Dali, Yunnan, China
- Guangqi An
  - Graduate School of Life and Environmental Sciences, University of Tsukuba, Tsukuba, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba, Japan
3
Huang M, Long C, Ma J. AAFL: automatic association feature learning for gene signature identification of cancer subtypes in single-cell RNA-seq data. Brief Funct Genomics 2023; 22:420-427. [PMID: 37122141 DOI: 10.1093/bfgp/elac047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Revised: 11/04/2022] [Accepted: 11/04/2022] [Indexed: 05/02/2023]
Abstract
Single-cell RNA-sequencing (scRNA-seq) technologies have enabled the study of human cancers in individual cells, which explores the cellular heterogeneity and the genotypic status of tumors. Gene signature identification plays an important role in the precise classification of cancer subtypes. However, most existing gene selection methods only select the same informative genes for each subtype. In this study, we propose a novel gene selection method, automatic association feature learning (AAFL), which automatically identifies different gene signatures for different cell subpopulations (cancer subtypes) at the same time. The proposed AAFL method combines the residual network with the low-rank network, which selects genes that are most associated with the corresponding cell subpopulations. Moreover, the differential expression genes are acquired before gene selection to filter the redundant genes. We apply the proposed feature learning method to the real cancer scRNA-seq data sets (melanoma) to identify cancer subtypes and detect gene signatures of identified cancer subtypes. The experimental results demonstrate that the proposed method can automatically identify different gene signatures for identified cancer subtypes. Gene ontology enrichment analysis shows that the identified gene signatures of different subtypes reveal the key biological processes and pathways. These gene signatures are expected to bring important implications for understanding cellular heterogeneity and the complex ecosystem of tumors.
Affiliation(s)
- Meng Huang
  - Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan
- Changzhou Long
  - Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan
- Jiangtao Ma
  - Department of Automation, Xiamen University, Xiamen, 361005, China
  - School of Engineering, Dali University, Dali, 671000, China
4
Thompson M, Matsumoto M, Ma T, Senabouth A, Palpant NJ, Powell JE, Nguyen Q. scGPS: Determining Cell States and Global Fate Potential of Subpopulations. Front Genet 2021; 12:666771. [PMID: 34349778 PMCID: PMC8326972 DOI: 10.3389/fgene.2021.666771] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/11/2021] [Accepted: 06/04/2021] [Indexed: 12/20/2022]
Abstract
Finding cell states and their transcriptional relatedness is a main outcome from analysing single-cell data. In developmental biology, determining whether cells are related in a differentiation lineage remains a major challenge. A seamless analysis pipeline from cell clustering to estimating the probability of transitions between cell clusters is lacking. Here, we present Single Cell Global fate Potential of Subpopulations (scGPS) to characterise transcriptional relationship between cell states. scGPS decomposes mixed cell populations in one or more samples into clusters (SCORE algorithm) and estimates pairwise transitioning potential (scGPS algorithm) of any pair of clusters. SCORE allows for the assessment and selection of stable clustering results, a major challenge in clustering analysis. scGPS implements a novel approach, with machine learning classification, to flexibly construct trajectory connections between clusters. scGPS also has a feature selection functionality by network and modelling approaches to find biological processes and driver genes that connect cell populations. We applied scGPS in diverse developmental contexts and show superior results compared to a range of clustering and trajectory analysis methods. scGPS is able to identify the dynamics of cellular plasticity in a user-friendly workflow, that is fast and memory efficient. scGPS is implemented in R with optimised functions using C++ and is publicly available in Bioconductor.
Affiliation(s)
- Michael Thompson
  - Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, Australia
- Maika Matsumoto
  - Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, Australia
- Tianqi Ma
  - Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, Australia
- Anne Senabouth
  - Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia
- Nathan J Palpant
  - Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, Australia
- Joseph E Powell
  - Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia
  - UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, NSW, Australia
- Quan Nguyen
  - Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, Australia
5
Patterson-Cross RB, Levine AJ, Menon V. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinformatics 2021; 22:39. [PMID: 33522897 PMCID: PMC7852188 DOI: 10.1186/s12859-021-03957-4] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 08/21/2020] [Accepted: 01/01/2021] [Indexed: 02/07/2023]
Abstract
Background: Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems.
Results: Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality.
Conclusion: chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
Affiliation(s)
- Ryan B Patterson-Cross
  - Spinal Circuits and Plasticity Unit, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Ariel J Levine
  - Spinal Circuits and Plasticity Unit, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Vilas Menon
  - Department of Neurology, Center for Translational and Computational Neuroimmunology, Columbia University, New York City, NY, USA
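The abstract of entry 5 describes repeated clustering of subsamples to score how robust each cluster is under a given parameter value. The sketch below is a simplified illustration of that kind of subsampling-based co-clustering score; it is not the chooseR code. The toy 1-D clusterer, the `eps` parameter, the 80% subsample fraction, the score definition, and the function names are all assumptions.

```python
import random

def cluster_1d(points, idx, eps):
    """Toy clusterer on the chosen indices: split sorted values where a gap exceeds eps."""
    order = sorted(idx, key=lambda i: points[i])
    labels, current = {}, 0
    for pos, i in enumerate(order):
        if pos > 0 and points[i] - points[order[pos - 1]] > eps:
            current += 1
        labels[i] = current
    return labels

def robustness(points, eps, n_boot=40, frac=0.8, seed=0):
    """Per-cluster mean co-clustering frequency over random subsamples.

    Clusters come from the full data; a pair's frequency is how often it is
    clustered together among the subsamples that contain both points.
    """
    rng = random.Random(seed)
    n = len(points)
    full = cluster_1d(points, range(n), eps)
    sampled, together = {}, {}
    size = max(2, int(frac * n))
    for _ in range(n_boot):
        idx = rng.sample(range(n), size)  # subsample without replacement
        sub = cluster_1d(points, idx, eps)
        for a in idx:
            for b in idx:
                if a < b:
                    sampled[(a, b)] = sampled.get((a, b), 0) + 1
                    together[(a, b)] = together.get((a, b), 0) + (sub[a] == sub[b])
    scores = {}
    for c in set(full.values()):
        members = [i for i in range(n) if full[i] == c]
        freqs = [
            together[(a, b)] / sampled[(a, b)]
            for a in members for b in members
            if a < b and (a, b) in sampled
        ]
        # Assumption: clusters with no scored pair (e.g. singletons) get 1.0.
        scores[c] = sum(freqs) / len(freqs) if freqs else 1.0
    return scores

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
for eps in (0.15, 1.0):  # candidate parameter values to compare
    print(eps, robustness(points, eps))
```

With `eps = 1.0` every within-group pair stays together in every subsample, so both clusters score 1.0. With `eps = 0.15` the first cluster depends on its middle point as a bridge, so its score can drop whenever that point is left out of a subsample.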
6
Ye X, Zhang W, Futamura Y, Sakurai T. Detecting Interactive Gene Groups for Single-Cell RNA-Seq Data Based on Co-Expression Network Analysis and Subgraph Learning. Cells 2020; 9:cells9091938. [PMID: 32825786 PMCID: PMC7563496 DOI: 10.3390/cells9091938] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/16/2020] [Revised: 07/17/2020] [Accepted: 08/19/2020] [Indexed: 12/22/2022]
Abstract
High-throughput sequencing technologies have enabled the generation of single-cell RNA-seq (scRNA-seq) data, which explore both genetic heterogeneity and phenotypic variation between cells. Some methods have been proposed to detect the related genes causing cell-to-cell variability for understanding tumor heterogeneity. However, most existing methods detect the related genes separately, without considering gene interactions. In this paper, we proposed a novel learning framework to detect the interactive gene groups for scRNA-seq data based on co-expression network analysis and subgraph learning. We first utilized spectral clustering to identify the subpopulations of cells. For each cell subpopulation, the differentially expressed genes were then selected to construct a gene co-expression network. Finally, the interactive gene groups were detected by learning the dense subgraphs embedded in the gene co-expression networks. We applied the proposed learning framework on a real cancer scRNA-seq dataset to detect interactive gene groups of different cancer subtypes. Systematic gene ontology enrichment analysis was performed to examine the detected gene groups by summarizing the key biological processes and pathways. Our analysis shows that different subtypes exhibit distinct gene co-expression networks and interactive gene groups with different functional enrichment. The interactive genes are expected to yield important references for understanding tumor heterogeneity.
7
Jagannathan NS, Ihsan MO, Kin XX, Welsch RE, Clément MV, Tucker-Kellogg L. Transcompp: understanding phenotypic plasticity by estimating Markov transition rates for cell state transitions. Bioinformatics 2020; 36:2813-2820. [PMID: 31971581 DOI: 10.1093/bioinformatics/btaa021] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/06/2019] [Revised: 12/10/2019] [Accepted: 01/17/2020] [Indexed: 11/12/2022]
Abstract
Motivation: Gradual population-level changes in tissues can be driven by stochastic plasticity, meaning rare stochastic transitions of single-cell phenotype. Quantifying the rates of these stochastic transitions requires time-intensive experiments, and analysis is generally confounded by simultaneous bidirectional transitions and asymmetric proliferation kinetics. To quantify cellular plasticity, we developed Transcompp (Transition Rate ANalysis of Single Cells to Observe and Measure Phenotypic Plasticity), a Markov modeling algorithm that uses optimization and resampling to compute best-fit rates and statistical intervals for stochastic cell-state transitions.
Results: We applied Transcompp to time-series datasets in which purified subpopulations of stem-like or non-stem cancer cells were exposed to various cell culture environments, and allowed to re-equilibrate spontaneously over time. Results revealed that commonly used cell culture reagents hydrocortisone and cholera toxin shifted the cell population equilibrium toward stem-like or non-stem states, respectively, in the basal-like breast cancer cell line MCF10CA1a. In addition, applying Transcompp to patient-derived cells showed that transition rates computed from short-term experiments could predict long-term trajectories and equilibrium convergence of the cultured cell population.
Availability and implementation: Freely available for download at http://github.com/nsuhasj/Transcompp.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- N Suhas Jagannathan
  - Cancer and Stem Cell Biology Programme, Centre for Computational Biology, Duke-NUS Medical School, 169857 Singapore
- Mario O Ihsan
  - Department of Biochemistry, National University of Singapore, 117596 Singapore
  - NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, 117456 Singapore
- Xiao Xuan Kin
  - Department of Biochemistry, National University of Singapore, 117596 Singapore
- Roy E Welsch
  - Sloan School of Management and Center for Statistics and Data Science, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Marie-Véronique Clément
  - Department of Biochemistry, National University of Singapore, 117596 Singapore
  - NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, 117456 Singapore
- Lisa Tucker-Kellogg
  - Cancer and Stem Cell Biology Programme, Centre for Computational Biology, Duke-NUS Medical School, 169857 Singapore
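The abstract of entry 7 describes purified stem-like or non-stem populations re-equilibrating under a Markov model of cell-state transitions. The sketch below only illustrates that forward behaviour for a two-state discrete-time chain; it is not the Transcompp algorithm, which also fits rates by optimization and resampling and accounts for proliferation. The transition probabilities, state order, and function names are invented for illustration.

```python
def step(fractions, P):
    """Advance population fractions by one time step: new_j = sum_i f_i * P[i][j]."""
    n = len(fractions)
    return [sum(fractions[i] * P[i][j] for i in range(n)) for j in range(n)]

def equilibrium(fractions, P, tol=1e-12, max_steps=100000):
    """Iterate the chain until the fractions stop changing (within tol)."""
    for _ in range(max_steps):
        nxt = step(fractions, P)
        if max(abs(a - b) for a, b in zip(nxt, fractions)) < tol:
            return nxt
        fractions = nxt
    return fractions

# Hypothetical per-step transition probabilities; rows are "from", columns are "to".
# State 0 = stem-like, state 1 = non-stem.
P = [
    [0.9, 0.1],  # stem-like stays with 0.9, switches with 0.1
    [0.3, 0.7],  # non-stem switches with 0.3, stays with 0.7
]

from_stem = equilibrium([1.0, 0.0], P)      # start from purified stem-like cells
from_non_stem = equilibrium([0.0, 1.0], P)  # start from purified non-stem cells
print(from_stem, from_non_stem)
```

For a two-state chain with switching probabilities p (state 0 to 1) and q (state 1 to 0), the stationary fraction of state 0 is q / (p + q), here 0.3 / 0.4 = 0.75. Both purified starting populations converge to that same equilibrium, which is the kind of re-equilibration the abstract refers to.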
8
Transfer learning-assisted multi-objective evolutionary clustering framework with decomposition for high-dimensional data. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.07.099] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/21/2022]
9
Qi R, Ma A, Ma Q, Zou Q. Clustering and classification methods for single-cell RNA-sequencing data. Brief Bioinform 2019; 21:1196-1208. [PMID: 31271412 DOI: 10.1093/bib/bbz062] [Citation(s) in RCA: 104] [Impact Index Per Article: 20.8] [Received: 02/15/2019] [Revised: 04/24/2019] [Accepted: 04/25/2019] [Indexed: 12/12/2022]
Abstract
Appropriate ways to measure the similarity between single-cell RNA-sequencing (scRNA-seq) data are ubiquitous in bioinformatics, but using single clustering or classification methods to process scRNA-seq data is generally difficult. This has led to the emergence of integrated methods and tools that aim to automatically process specific problems associated with scRNA-seq data. These approaches have attracted a lot of interest in bioinformatics and related fields. In this paper, we systematically review the integrated methods and tools, highlighting the pros and cons of each approach. We not only pay particular attention to clustering and classification methods but also discuss methods that have emerged recently as powerful alternatives, including nonlinear and linear methods and descending dimension methods. Finally, we focus on clustering and classification methods for scRNA-seq data, in particular, integrated methods, and provide a comprehensive description of scRNA-seq data and download URLs.
Affiliation(s)
- Ren Qi
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, USA
- Qin Ma
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China