1
Zhong Y, Cui S, Yang Y, Cai JJ. Controlled Noise: Evidence of epigenetic regulation of Single-Cell expression variability. Bioinformatics 2024; 40:btae457. [PMID: 39018178 PMCID: PMC11283284 DOI: 10.1093/bioinformatics/btae457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 06/24/2024] [Accepted: 07/16/2024] [Indexed: 07/19/2024] Open
Abstract
MOTIVATION: Understanding single-cell expression variability (scEV) or gene expression noise among cells of the same type and state is crucial for delineating population-level cellular function. While epigenetic mechanisms are widely implicated in gene expression regulation, a definitive link between chromatin accessibility and scEV remains elusive. Recent advances in single-cell techniques enable the study of single-cell multiomics data that include the simultaneous measurement of scATAC-seq and scRNA-seq within individual cells, presenting an unprecedented opportunity to address this gap.
RESULTS: This paper introduces an innovative testing pipeline to investigate the association between chromatin accessibility and scEV. With single-cell multiomics data of scATAC-seq and scRNA-seq, the pipeline hinges on comparing the prediction performance of scATAC-seq data on gene expression levels between highly variable genes (HVGs) and non-highly variable genes (non-HVGs). Applying this pipeline to paired scATAC-seq and scRNA-seq data from human hematopoietic stem and progenitor cells, we observed a significantly superior prediction performance of scATAC-seq data for HVGs compared to non-HVGs. Notably, there was substantial overlap between well-predicted genes and HVGs. The gene pathways enriched from well-predicted genes are highly pertinent to cell type-specific functions. Our findings support the notion that scEV largely stems from cell-to-cell variability in chromatin accessibility, providing compelling evidence for the epigenetic regulation of scEV and offering promising avenues for investigating gene regulation mechanisms at the single-cell level.
AVAILABILITY: The source code and data used in this paper can be found at https://github.com/SiweiCui/EpigeneticControlOfSingle-CellExpressionVariability.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yan Zhong
- School of Statistics, KLATASDS-MOE, East China Normal University, Shanghai, 200062, China
- Siwei Cui
- School of Statistics, KLATASDS-MOE, East China Normal University, Shanghai, 200062, China
- Yongjian Yang
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- James J Cai
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, United States
- Interdisciplinary Program of Genetics, Texas A&M University, College Station, TX 77843, United States
2
Castro DC, Chan-Andersen P, Romanova EV, Sweedler JV. Probe-based mass spectrometry approaches for single-cell and single-organelle measurements. MASS SPECTROMETRY REVIEWS 2024; 43:888-912. [PMID: 37010120 PMCID: PMC10545815 DOI: 10.1002/mas.21841] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/03/2022] [Revised: 02/09/2023] [Accepted: 03/01/2023] [Indexed: 06/19/2023]
Abstract
Exploring the chemical content of individual cells not only reveals underlying cell-to-cell chemical heterogeneity but is also a key component in understanding how cells combine to form emergent properties of cellular networks and tissues. Recent technological advances in many analytical techniques including mass spectrometry (MS) have improved instrumental limits of detection and laser/ion probe dimensions, allowing the analysis of micron and submicron sized areas. In the case of MS, these improvements combined with MS's broad analyte detection capabilities have enabled the rise of single-cell and single-organelle chemical characterization. As the chemical coverage and throughput of single-cell measurements increase, more advanced statistical and data analysis methods have aided in data visualization and interpretation. This review focuses on secondary ion MS and matrix-assisted laser desorption/ionization MS approaches for single-cell and single-organelle characterization, which is followed by advances in mass spectral data visualization and analysis.
Affiliation(s)
- Daniel C. Castro
- Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Peter Chan-Andersen
- Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Elena V. Romanova
- Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Neuroscience Program, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Jonathan V. Sweedler
- Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Neuroscience Program, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL USA
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL USA
3
Cho J, Baik B, Nguyen HCT, Park D, Nam D. Characterizing efficient feature selection for single-cell expression analysis. Brief Bioinform 2024; 25:bbae317. [PMID: 38975891 PMCID: PMC11229035 DOI: 10.1093/bib/bbae317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 03/31/2024] [Accepted: 06/17/2024] [Indexed: 07/09/2024] Open
Abstract
Unsupervised feature selection is a critical step for efficient and accurate analysis of single-cell RNA-seq data. Previous benchmarks used two different criteria to compare feature selection methods: (i) proportion of ground-truth marker genes included in the selected features and (ii) accuracy of cell clustering using ground-truth cell types. Here, we systematically compare the performance of 11 feature selection methods for both criteria. We first demonstrate the discordance between these criteria and suggest using the latter. We then compare the distribution of selected genes in their means between feature selection methods. We show that lowly expressed genes exhibit seriously high coefficients of variation and are mostly excluded by high-performance methods. In particular, high-deviation- and high-expression-based methods outperform the method widely used in the Seurat package in clustering cells and data visualization. We further show they also enable a clear separation of the same cell type from different tissues as well as accurate estimation of cell trajectories.
Affiliation(s)
- Juok Cho
- Department of Biomedical Engineering, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Ulsan 44919, Republic of Korea
- Bukyung Baik
- Department of Biological Sciences, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Ulsan 44919, Republic of Korea
- Hai C T Nguyen
- Department of Biological Sciences, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Ulsan 44919, Republic of Korea
- Daeui Park
- Department of Predictive Toxicology, Korea Institute of Toxicology, 141, Gajeong-ro, Daejeon 34114, Republic of Korea
- Dougu Nam
- Department of Biological Sciences, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Ulsan 44919, Republic of Korea
- Department of Mathematical Sciences, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Ulsan 44919, Republic of Korea
4
Park Y, Muttray NP, Hauschild AC. Species-agnostic transfer learning for cross-species transcriptomics data integration without gene orthology. Brief Bioinform 2024; 25:bbae004. [PMID: 38305455 PMCID: PMC10835749 DOI: 10.1093/bib/bbae004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 11/24/2023] [Accepted: 12/10/2023] [Indexed: 02/03/2024] Open
Abstract
Novel hypotheses in biomedical research are often developed or validated in model organisms such as mice and zebrafish, which thus play a crucial role. However, due to biological differences between species, translating these findings into human applications remains challenging. Moreover, commonly used orthologous gene information is often incomplete and entails a significant information loss during gene-id conversion. To address these issues, we present a novel methodology for species-agnostic transfer learning with heterogeneous domain adaptation. We extended the cross-domain structure-preserving projection toward out-of-sample prediction. Our approach not only allows knowledge integration and translation across various species without relying on gene orthology but also identifies similar Gene Ontology (GO) terms among the most influential genes composing the latent space for integration. Subsequently, during the alignment of latent spaces, each composed of species-specific genes, it is possible to identify functional annotations of genes missing from public orthology databases. We evaluated our approach with four different single-cell sequencing datasets focusing on cell-type prediction and compared it against related machine-learning approaches. In summary, the developed model outperforms related methods working without prior knowledge when predicting unseen cell types based on other species' data. The results demonstrate that our novel approach allows knowledge transfer beyond species barriers without depending on known gene orthology while utilizing the entire gene sets.
Affiliation(s)
- Youngjun Park
- Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
- International Max Planck Research Schools for Genome Science, Georg-August-Universität Göttingen, Göttingen, Germany
- Nils P Muttray
- Applied Statistics, Georg-August-Universität Göttingen, Göttingen, Germany
- Anne-Christin Hauschild
- Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
- Campus-Institute Data Science (CIDAS), Georg-August-Universität Göttingen, Göttingen, Germany
5
Liu J, Kreimer A, Li WV. Differential variability analysis of single-cell gene expression data. Brief Bioinform 2023; 24:bbad294. [PMID: 37598422 PMCID: PMC10516347 DOI: 10.1093/bib/bbad294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 07/18/2023] [Accepted: 07/29/2023] [Indexed: 08/22/2023] Open
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technologies has enabled gene expression profiling at the single-cell resolution, thereby enabling the quantification and comparison of transcriptional variability among individual cells. Although alterations in transcriptional variability have been observed in various biological states, statistical methods for quantifying and testing differential variability between groups of cells are still lacking. To identify the best practices in differential variability analysis of single-cell gene expression data, we propose and compare 12 statistical pipelines using different combinations of methods for normalization, feature selection, dimensionality reduction and variability calculation. Using high-quality synthetic scRNA-seq datasets, we benchmarked the proposed pipelines and found that the most powerful and accurate pipeline performs simple library size normalization, retains all genes in analysis and uses denSNE-based distances to cluster medoids as the variability measure. By applying this pipeline to scRNA-seq datasets of COVID-19 and autism patients, we have identified cellular variability changes between patients with different severity status or between patients and healthy controls.
Affiliation(s)
- Jiayi Liu
- Graduate Programs in Molecular Biosciences, Rutgers, The State University of New Jersey, 604 Allison Rd, Piscataway, 08854, NJ, USA
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, 08854, NJ, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, 08854, NJ, USA
- Anat Kreimer
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, 08854, NJ, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, 08854, NJ, USA
- Wei Vivian Li
- Department of Statistics, University of California, Riverside, 900 University Ave, Riverside, 92521, CA, USA
- Previous affiliation where part of the work was completed: Department of Biostatistics and Epidemiology, Rutgers, The State University of New Jersey, 683 Hoes Lane West, Piscataway, 08854, NJ, USA
6
He X, Qian K, Wang Z, Zeng S, Li H, Li WV. scAce: an adaptive embedding and clustering method for single-cell gene expression data. Bioinformatics 2023; 39:btad546. [PMID: 37672035 PMCID: PMC10500084 DOI: 10.1093/bioinformatics/btad546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 08/01/2023] [Accepted: 09/05/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION: Since the development of single-cell RNA sequencing (scRNA-seq) technologies, clustering analysis of single-cell gene expression data has been an essential tool for distinguishing cell types and identifying novel cell types. Even though many methods are available for scRNA-seq clustering analysis, the majority of them are constrained by the requirement for predetermined cluster numbers or the dependence on a selected initial cluster assignment.
RESULTS: In this article, we propose an adaptive embedding and clustering method named scAce, which constructs a variational autoencoder to simultaneously learn cell embeddings and cluster assignments. In the scAce method, we develop an adaptive cluster merging approach which achieves improved clustering results without the need to estimate the number of clusters in advance. In addition, scAce provides an option to perform clustering enhancement, which can update and enhance cluster assignments based on previous clustering results from other methods. Based on computational analysis of both simulated and real datasets, we demonstrate that scAce outperforms state-of-the-art clustering methods for scRNA-seq data, and achieves better clustering accuracy and robustness.
AVAILABILITY AND IMPLEMENTATION: The scAce package is implemented in Python 3.8 and is freely available from https://github.com/sldyns/scAce.
Affiliation(s)
- Xinwei He
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Kun Qian
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Ziqian Wang
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Shirou Zeng
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Hongwei Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Wei Vivian Li
- Department of Statistics, University of California, Riverside, Riverside 92521, United States
7
Zang Z, Xu Y, Lu L, Geng Y, Yang S, Li SZ. UDRN: Unified Dimensional Reduction Neural Network for feature selection and feature projection. Neural Netw 2023; 161:626-637. [PMID: 36827960 DOI: 10.1016/j.neunet.2023.02.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 11/22/2022] [Accepted: 02/11/2023] [Indexed: 02/17/2023]
Abstract
Dimensional reduction (DR) maps high-dimensional data into a lower-dimensional latent space by minimizing defined optimization objectives. The two independent branches of DR are feature selection (FS) and feature projection (FP). FS focuses on selecting a critical subset of dimensions but risks destroying the data distribution (structure). On the other hand, FP combines all the input features into a lower-dimensional space, aiming to maintain the data structure, but lacks interpretability and sparsity. Moreover, FS and FP are traditionally incompatible categories and have not been unified into an amicable framework. Therefore, we consider that the ideal DR approach combines both FS and FP into a unified end-to-end manifold learning framework, simultaneously performing fundamental feature discovery while maintaining the intrinsic relationships between data samples in the latent space. This paper proposes a unified framework named Unified Dimensional Reduction Network (UDRN) to integrate FS and FP in an end-to-end way. Furthermore, a novel network framework is designed to implement FS and FP tasks separately using a stacked feature selection network and feature projection network. In addition, a stronger manifold assumption and a novel loss function are proposed. Furthermore, the loss function can leverage the priors of data augmentation to enhance the generalization ability of the proposed UDRN. Finally, comprehensive experimental results on four image and four biological datasets, including very high-dimensional data, demonstrate the advantages of UDRN over existing methods (FS, FP, and FS&FP pipeline), especially in downstream tasks such as classification and visualization.
Affiliation(s)
- Zelin Zang
- Zhejiang University, Hangzhou, 310000, China; Westlake University, AI Lab, School of Engineering, Hangzhou, 310000, China; Westlake Institute for Advanced Study, Institute of Advanced Technology, Hangzhou, 310000, China.
- Yongjie Xu
- Zhejiang University, Hangzhou, 310000, China; Westlake University, AI Lab, School of Engineering, Hangzhou, 310000, China; Westlake Institute for Advanced Study, Institute of Advanced Technology, Hangzhou, 310000, China
- Linyan Lu
- China Telecom Corporation Limited, Hangzhou Branch, Hangzhou, 310000, China
- Yulan Geng
- Westlake University, AI Lab, School of Engineering, Hangzhou, 310000, China
- Senqiao Yang
- Westlake University, AI Lab, School of Engineering, Hangzhou, 310000, China
- Stan Z Li
- Westlake University, AI Lab, School of Engineering, Hangzhou, 310000, China; Westlake Institute for Advanced Study, Institute of Advanced Technology, Hangzhou, 310000, China.
8
Ren T, Chen C, Danilov AV, Liu S, Guan X, Du S, Wu X, Sherman MH, Spellman PT, Coussens LM, Adey AC, Mills GB, Wu LY, Xia Z. Supervised learning of high-confidence phenotypic subpopulations from single-cell data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.23.533712. [PMID: 36993424 PMCID: PMC10055361 DOI: 10.1101/2023.03.23.533712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Accurately identifying phenotype-relevant cell subsets from heterogeneous cell populations is crucial for delineating the underlying mechanisms driving biological or clinical phenotypes. Here, by deploying a learning with rejection strategy, we developed a novel supervised learning framework called PENCIL to identify subpopulations associated with categorical or continuous phenotypes from single-cell data. By embedding a feature selection function into this flexible framework, for the first time, we were able to select informative features and identify cell subpopulations simultaneously, which enables the accurate identification of phenotypic subpopulations otherwise missed by methods incapable of concurrent gene selection. Furthermore, the regression mode of PENCIL presents a novel ability for supervised phenotypic trajectory learning of subpopulations from single-cell data. We conducted comprehensive simulations to evaluate PENCIL's versatility in simultaneous gene selection, subpopulation identification and phenotypic trajectory prediction. PENCIL is fast and scalable to analyze 1 million cells within 1 hour. Using the classification mode, PENCIL detected T-cell subpopulations associated with melanoma immunotherapy outcomes. Moreover, when applied to scRNA-seq of a mantle cell lymphoma patient with drug treatment across multiple time points, the regression mode of PENCIL revealed a transcriptional treatment response trajectory. Collectively, our work introduces a scalable and flexible infrastructure to accurately identify phenotype-associated subpopulations from single-cell data.
Affiliation(s)
- Tao Ren
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Canping Chen
- Computational Biology Program, Oregon Health & Science University, Portland, OR, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
- Susan Liu
- Computational Biology Program, Oregon Health & Science University, Portland, OR, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
- Xiangnan Guan
- Department of Oncology Biomarker Development, Genentech Inc, South San Francisco, CA, USA
- Shunyi Du
- Computational Biology Program, Oregon Health & Science University, Portland, OR, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
- Xiwei Wu
- City of Hope National Medical Center, Duarte, CA, USA
- Mara H. Sherman
- Department of Cell, Developmental & Cancer Biology, Oregon Health & Science University, Portland, OR, USA
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Paul T. Spellman
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Department of Molecular and Medical Genetics, Oregon Health & Science University, Portland, OR, USA
- Lisa M. Coussens
- Department of Cell, Developmental & Cancer Biology, Oregon Health & Science University, Portland, OR, USA
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Andrew C. Adey
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Department of Molecular and Medical Genetics, Oregon Health & Science University, Portland, OR, USA
- Gordon B. Mills
- Division of Oncological Sciences, Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Ling-Yun Wu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Zheng Xia
- Computational Biology Program, Oregon Health & Science University, Portland, OR, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
9
Deng T, Chen S, Zhang Y, Xu Y, Feng D, Wu H, Sun X. A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis. Brief Bioinform 2023; 24:bbad042. [PMID: 36754847 PMCID: PMC10025445 DOI: 10.1093/bib/bbad042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/25/2022] [Revised: 12/05/2022] [Accepted: 01/18/2023] [Indexed: 02/10/2023] Open
Abstract
Feature gene selection has a significant impact on the performance of cell clustering in single-cell RNA sequencing (scRNA-seq) analysis. A well-rounded feature selection (FS) method should consider relevance, redundancy and complementarity of the features. Yet most existing FS methods focus on gene relevance to the cell types but neglect redundancy and complementarity, which undermines the cell clustering performance. We develop a novel computational method, GeneClust, to select feature genes for scRNA-seq cell clustering. GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. It can work as a plug-in tool for FS with any existing cell clustering method. Extensive benchmark results demonstrate that GeneClust significantly improves the clustering performance. Moreover, GeneClust can group cofunctional genes in biological processes and pathways into clusters, thus providing a means of investigating gene interactions and identifying potential genes relevant to biological characteristics of the dataset. GeneClust is freely available at https://github.com/ToryDeng/scGeneClust.
Affiliation(s)
- Tao Deng
- School of Data Science, The Chinese University of Hong Kong—Shenzhen, Guangdong, China
- Siyu Chen
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Ying Zhang
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Yuanbin Xu
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Da Feng
- School of Pharmacy, Tongji Medical College, Huazhong University of Sciences and Technology, Hubei, China
- Hao Wu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, GA, USA
- Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Xiaobo Sun
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
10
Cheng A, Hu G, Li WV. Benchmarking cell-type clustering methods for spatially resolved transcriptomics data. Brief Bioinform 2023; 24:bbac475. [PMID: 36410733 PMCID: PMC9851325 DOI: 10.1093/bib/bbac475] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 05/05/2022] [Revised: 09/20/2022] [Accepted: 10/04/2022] [Indexed: 11/23/2022] Open
Abstract
Spatially resolved transcriptomics technologies enable the measurement of transcriptome information while retaining the spatial context at the regional, cellular or sub-cellular level. While previous computational methods have relied on gene expression information alone for clustering single-cell populations, more recent methods have begun to leverage spatial location and histology information to improve cell clustering and cell-type identification. In this study, using seven semi-synthetic datasets with real spatial locations, simulated gene expression and histology images as well as ground truth cell-type labels, we evaluate 15 clustering methods based on clustering accuracy, robustness to data variation and input parameters, computational efficiency, and software usability. Our analysis demonstrates that even though incorporating the additional spatial and histology information leads to increased accuracy in some datasets, it does not consistently improve clustering compared with using only gene expression data. Our results indicate that for the clustering of spatial transcriptomics data, there are still opportunities to enhance the overall accuracy and robustness by improving information extraction and feature selection from spatial and histology data.
Affiliation(s)
- Andrew Cheng
- Department of Computer Science, Rutgers, The State University of New Jersey, 110 Frelinghuysen Road, Piscataway, 08854, NJ, USA
- Guanyu Hu
- Department of Statistics, University of Missouri-Columbia, 146 Middlebush Hall, Columbia, 65211, MO, USA
- Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, 683 Hoes Lane West, Piscataway, 08854, NJ, USA
- Department of Statistics, University of California, Riverside, 900 University Ave., Riverside, 92521, CA, USA
11
Seffernick AE, Mrózek K, Nicolet D, Stone RM, Eisfeld AK, Byrd JC, Archer KJ. High-dimensional genomic feature selection with the ordered stereotype logit model. Brief Bioinform 2022; 23:bbac414. [PMID: 36184192 PMCID: PMC9677495 DOI: 10.1093/bib/bbac414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Revised: 08/18/2022] [Accepted: 08/27/2022] [Indexed: 12/30/2022] Open
Abstract
For many high-dimensional genomic and epigenomic datasets, the outcome of interest is ordinal. While these ordinal outcomes are often thought of as the observed cutpoints of some latent continuous variable, some ordinal outcomes are truly discrete and are comprised of the subjective combination of several factors. The nonlinear stereotype logistic model, which does not assume proportional odds, was developed for these 'assessed' ordinal variables. It has previously been extended to the frequentist high-dimensional feature selection setting, but the Bayesian framework provides some distinct advantages in terms of simultaneous uncertainty quantification and variable selection. Here, we review the stereotype model and Bayesian variable selection methods and demonstrate how to combine them to select genomic features associated with discrete ordinal outcomes. We compared the Bayesian and frequentist methods in terms of variable selection performance. We additionally applied the Bayesian stereotype method to an acute myeloid leukemia RNA-sequencing dataset to further demonstrate its variable selection abilities by identifying features associated with the European LeukemiaNet prognostic risk score.
Affiliation(s)
- Anna Eames Seffernick
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Krzysztof Mrózek
- Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University, Columbus, OH, USA
- The Ohio State Comprehensive Cancer Center, Columbus, OH, USA
| | - Deedra Nicolet
- Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University, Columbus, OH, USA
- The Ohio State Comprehensive Cancer Center, Columbus, OH, USA
- Alliance Statistics and Data Management Center, The Ohio State University Comprehensive Cancer Center, Columbus, OH, USA
| | - Richard M Stone
- Dana Farber/Partners Cancer Care, Harvard University, Boston, MA, USA
| | - Ann-Kathrin Eisfeld
- Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University, Columbus, OH, USA
- The Ohio State Comprehensive Cancer Center, Columbus, OH, USA
| | - John C Byrd
- Department of Internal Medicine, University of Cincinnati, Cincinnati, OH, USA
| | - Kellie J Archer
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
| |
Collapse
|
12
|
Qian K, Fu S, Li H, Li WV. The scINSIGHT Package for Integrating Single-Cell RNA-Seq Data from Different Biological Conditions. J Comput Biol 2022; 29:1233-1236. [PMID: 35920848 PMCID: PMC9700338 DOI: 10.1089/cmb.2022.0244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Data integration is a critical step in the analysis of multiple single-cell RNA sequencing samples to account for heterogeneity due to both biological and technical variability. scINSIGHT is a new integration method for single-cell gene expression data, and can effectively use the information of biological condition to improve the integration of multiple single-cell samples. scINSIGHT is based on a novel non-negative matrix factorization model that learns common and condition-specific gene modules in samples from different biological or experimental conditions. Using these gene modules, scINSIGHT can further identify cellular identities and active biological processes in different cell types or conditions. Here we introduce the installation and main functionality of the scINSIGHT R package, including how to preprocess the data, apply the scINSIGHT algorithm, and analyze the output.
Affiliation(s)
- Kun Qian
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Shiwei Fu
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
  - Department of Statistics, University of California, Riverside, California, USA
- Hongwei Li
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
  - Department of Statistics, University of California, Riverside, California, USA
|
13
|
Uncertainty measurement for a gene space based on class-consistent technology: an application in gene selection. Appl Intell 2022. [DOI: 10.1007/s10489-022-03657-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
14
|
Qian K, Fu S, Li H, Li WV. scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data. Genome Biol 2022; 23:82. [PMID: 35313930 PMCID: PMC8935111 DOI: 10.1186/s13059-022-02649-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/30/2021] [Accepted: 03/07/2022] [Indexed: 12/30/2022] Open
Abstract
The increasing number of scRNA-seq data emphasizes the need for integrative analysis to interpret similarities and differences between single-cell samples. Although different batch effect removal methods have been developed, none are suitable for heterogeneous single-cell samples coming from multiple biological conditions. We propose a method, scINSIGHT, to learn coordinated gene expression patterns that are common among, or specific to, different biological conditions, and identify cellular identities and processes across single-cell samples. We compare scINSIGHT with state-of-the-art methods using simulated and real data, which demonstrate its improved performance. Our results show the applicability of scINSIGHT in diverse biomedical and clinical problems.
Affiliation(s)
- Kun Qian
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, 430074, Hubei, China
- Shiwei Fu
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, 08854, NJ, USA
- Hongwei Li
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, 430074, Hubei, China
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, 08854, NJ, USA
|
15
|
Li WV. Phitest for analyzing the homogeneity of single-cell populations. Bioinformatics 2022; 38:2639-2641. [PMID: 35238346 PMCID: PMC9048696 DOI: 10.1093/bioinformatics/btac130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Revised: 01/24/2022] [Accepted: 02/28/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Single-cell RNA sequencing technologies facilitate the characterization of transcriptomic landscapes in diverse species, tissues and cell types with unprecedented molecular resolution. In order to better understand animal development, physiology, and pathology, unsupervised clustering analysis is often used to identify relevant cell populations. Although considerable progress has been made in terms of clustering algorithms in recent years, it remains challenging to evaluate the quality of the inferred single-cell clusters, which can greatly impact downstream analysis and interpretation. RESULTS We propose a bioinformatics tool named Phitest to analyze the homogeneity of single-cell populations. Phitest is able to distinguish between homogeneous and heterogeneous cell populations, providing an objective and automatic method to optimize the performance of single-cell clustering analysis. AVAILABILITY AND IMPLEMENTATION The PhitestR package is freely available on both Github (https://github.com/Vivianstats/PhitestR) and the Comprehensive R Archive Network (CRAN). There is no new genomic data associated with this article. Published data used in the analysis are described in detail in the Supplementary Data. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
16
|
Liu M, Chen H, Gao D, Ma CY, Zhang ZY. Identification of Helicobacter pylori Membrane Proteins Using Sequence-Based Features. Computational and Mathematical Methods in Medicine 2022; 2022:7493834. [PMID: 35069791 PMCID: PMC8769816 DOI: 10.1155/2022/7493834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2021] [Accepted: 12/16/2021] [Indexed: 11/28/2022]
Abstract
Helicobacter pylori (H. pylori) is the most common risk factor for gastric cancer worldwide. The membrane proteins of the H. pylori are involved in bacterial adherence and play a vital role in the field of drug discovery. Thus, an accurate and cost-effective computational model is needed to predict the uncharacterized membrane proteins of H. pylori. In this study, a reliable benchmark dataset consisted of 114 membrane and 219 nonmembrane proteins was constructed based on UniProt. A support vector machine- (SVM-) based model was developed for discriminating H. pylori membrane proteins from nonmembrane proteins by using sequence information. Cross-validation showed that our method achieved good performance with an accuracy of 91.29%. It is anticipated that the proposed model will be useful for the annotation of H. pylori membrane proteins and the development of new anti-H. pylori agents.
Affiliation(s)
- Mujiexin Liu
  - Ineye Hospital of Chengdu University of TCM, Chengdu University of TCM, Chengdu 610084, China
- Hui Chen
  - School of Healthcare Technology, Chengdu Neusoft University, 611844 Chengdu, China
- Dong Gao
  - School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Cai-Yi Ma
  - School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhao-Yue Zhang
  - School of Healthcare Technology, Chengdu Neusoft University, 611844 Chengdu, China
  - School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
17
|
Wang A, Liu H, Yang J, Chen G. Ensemble feature selection for stable biomarker identification and cancer classification from microarray expression data. Comput Biol Med 2022; 142:105208. [PMID: 35016102 DOI: 10.1016/j.compbiomed.2021.105208] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/29/2021] [Revised: 12/19/2021] [Accepted: 12/31/2021] [Indexed: 01/31/2023]
Abstract
Microarray technology facilitates the simultaneous measurement of expression of tens of thousands of genes and enables us to study cancers and tumors at the molecular level. Because microarray data are typically characterized by small sample size and high dimensionality, accurate and stable feature selection is thus of fundamental importance to the diagnostic accuracy and deep understanding of disease mechanism. Hence, we in this study present an ensemble feature selection framework to improve the discrimination and stability of finally selected features. Specifically, we utilize sampling techniques to obtain multiple sampled datasets, from each of which we use a base feature selector to select a subset of features. Afterwards, we develop two aggregation strategies to combine multiple feature subsets into one set. Finally, comparative experiments are conducted on four publicly available microarray datasets covering both binary and multi-class cases in terms of classification accuracy and three stability metrics. Results show that the proposed method obtains better stability scores and achieves comparable to and even better classification performance than its competitors.
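The sample, select, and aggregate workflow described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name its sampling scheme, base selector, or two aggregation strategies, so this example substitutes bootstrap resampling, a simple standardized mean-difference score, and selection-frequency voting. The function names `base_selector` and `ensemble_select` and the synthetic data are hypothetical.

```python
import numpy as np

def base_selector(X, y, k):
    """Hypothetical base selector: rank features by the absolute difference
    of the two class means divided by the overall standard deviation,
    and return the indices of the k highest-scoring features."""
    mean_diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    score = mean_diff / (X.std(axis=0) + 1e-12)
    return np.argsort(-score, kind="stable")[:k]

def ensemble_select(X, y, k=5, n_rounds=50, seed=0):
    """Assumed ensemble scheme: run the base selector on bootstrap
    resamples and keep the k features selected most often."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    votes = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        idx = rng.choice(n_samples, size=n_samples, replace=True)
        if np.unique(y[idx]).size < 2:
            continue  # skip resamples that lost one of the classes
        votes[base_selector(X[idx], y[idx], k)] += 1
    return np.argsort(-votes, kind="stable")[:k], votes

# Synthetic data for illustration only: 60 samples, 200 features,
# with features 0, 1 and 2 shifted in class 1.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[y == 1, :3] += 4.0

selected, votes = ensemble_select(X, y, k=5, n_rounds=50, seed=0)
print("selected features:", sorted(selected.tolist()))
```

Frequency voting is only one possible way to merge the per-sample subsets; rank averaging would be another, and neither is claimed to match the strategies proposed in the paper.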
Affiliation(s)
- Aiguo Wang
  - School of Electronic Information Engineering, Foshan University, Foshan, China.
- Huancheng Liu
  - School of Electronic Information Engineering, Foshan University, Foshan, China.
- Jing Yang
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China.
- Guilin Chen
  - School of Computer and Information Engineering, Chuzhou University, Chuzhou, China.
|