1. Wang K, Gong Y, Yan Z, Dang Z, Wang J, Wu M, Zhang Y. Protocol for analyzing functional gene module perturbation during the progression of diseases using a single-cell Bayesian biclustering framework. STAR Protoc 2024; 5:103349. [PMID: 39352811 PMCID: PMC11472622 DOI: 10.1016/j.xpro.2024.103349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Revised: 07/26/2024] [Accepted: 09/06/2024] [Indexed: 10/04/2024]
Abstract
The pathogenesis of complex diseases involves intricate gene regulation across cell types, necessitating a comprehensive analysis approach. Here, we present a protocol for analyzing functional gene module (FGM) perturbation during the progression of diseases using a single-cell Bayesian biclustering (scBC) framework. We describe steps for setting up the scBC workspace, preparing and exploring input data, training the model, and reconstructing the data matrix. We then detail procedures for Bayesian biclustering, exploring biclustering results, and uncovering pathway perturbations. For complete details on the use and execution of this protocol, please refer to Gong et al.1.
Affiliation(s)
- Kunyue Wang: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Yuqiao Gong: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Zixin Yan: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Zhiyuan Dang: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Junhao Wang: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Maoying Wu: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Yue Zhang: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science Organization, Shanghai Jiao Tong University, Shanghai, China; Center for Biomedical Data Science, Translational Science Institute, Shanghai Jiao Tong University School of Medicine, Shanghai, China
2. Gong Y, Xu J, Wu M, Gao R, Sun J, Yu Z, Zhang Y. Single-cell biclustering for cell-specific transcriptomic perturbation detection in AD progression. Cell Rep Methods 2024; 4:100742. [PMID: 38554701 PMCID: PMC11045878 DOI: 10.1016/j.crmeth.2024.100742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Revised: 10/30/2023] [Accepted: 03/07/2024] [Indexed: 04/02/2024]
Abstract
The pathogenesis of Alzheimer disease (AD) involves complex gene regulatory changes across different cell types. To help decipher this complexity, we introduce single-cell Bayesian biclustering (scBC), a framework for identifying cell-specific gene network biomarkers in scRNA and snRNA-seq data. Through biclustering, scBC enables the analysis of perturbations in functional gene modules at the single-cell level. Applying the scBC framework to AD snRNA-seq data reveals the perturbations within gene modules across distinct cell groups and sheds light on gene-cell correlations during AD progression. Notably, our method helps to overcome common challenges in single-cell data analysis, including batch effects and dropout events. Incorporating prior knowledge further enables the framework to yield more biologically interpretable results. Comparative analyses on simulated and real-world datasets demonstrate the precision and robustness of our approach compared to other state-of-the-art biclustering methods. scBC holds potential for unraveling the mechanisms underlying polygenic diseases characterized by intricate gene coexpression patterns.
Affiliation(s)
- Yuqiao Gong: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Jingsi Xu: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Maoying Wu: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Ruitian Gao: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Jianle Sun: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Zhangsheng Yu: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science Organization, Shanghai Jiao Tong University, Shanghai, China; Clinical Research Institute, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Center for Biomedical Data Science, Translational Science Institute, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yue Zhang: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science Organization, Shanghai Jiao Tong University, Shanghai, China; Center for Biomedical Data Science, Translational Science Institute, Shanghai Jiao Tong University School of Medicine, Shanghai, China
3. Feng X, Zhang H, Lin H, Long H. Single-cell RNA-seq data analysis based on directed graph neural network. Methods 2023; 211:48-60. [PMID: 36804214 DOI: 10.1016/j.ymeth.2023.02.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Revised: 12/09/2022] [Accepted: 02/13/2023] [Indexed: 02/17/2023]
Abstract
The scale of single-cell RNA sequencing (scRNA-seq) data is surging with the development of high-throughput sequencing technology. However, although single-cell data analysis is a powerful tool, various issues have been reported, such as sequencing sparsity and complex differential patterns in gene expression. Statistical or traditional machine learning methods are inefficient, and their accuracy needs to be improved. Methods based on deep learning cannot directly process non-Euclidean spatial data, such as cell graphs. In this study, we developed graph autoencoders and graph attention networks for scRNA-seq analysis based on a directed graph neural network, named scDGAE. Directed graph neural networks can not only retain the connection properties of the directed graph but also expand the receptive field of the convolution operation. Cosine similarity, median L1 distance, and root-mean-squared error are used to compare the gene imputation performance of different methods with that of scDGAE. Furthermore, adjusted mutual information, normalized mutual information, completeness score, and Silhouette coefficient score are used to compare the cell clustering performance of different methods with that of scDGAE. Experimental results show that the scDGAE model achieves promising performance in gene imputation and cell clustering prediction on four scRNA-seq data sets with gold-standard cell labels. Furthermore, it is a robust framework that can be applied to general scRNA-seq analyses.
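The three imputation metrics named in this abstract can be illustrated with a minimal Python sketch. It assumes each metric is computed between a flattened vector of true expression values and the matching vector of imputed values; the abstract does not state the exact evaluation protocol, and the function names and toy vectors below are illustrative only.

```python
import math
from statistics import median


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def median_l1_distance(x, y):
    """Median of the element-wise absolute differences."""
    return median(abs(a - b) for a, b in zip(x, y))


def rmse(x, y):
    """Root-mean-squared error between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical toy data: true expression values and their imputed estimates.
truth = [0.0, 2.0, 4.0, 1.0]
imputed = [0.5, 2.0, 3.0, 1.0]

print(cosine_similarity(truth, imputed))
print(median_l1_distance(truth, imputed))
print(rmse(truth, imputed))
```

Higher cosine similarity and lower median L1 distance and RMSE indicate imputed values closer to the reference.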
Affiliation(s)
- Xiang Feng: College of Information Science Technology, Hainan Normal University, Haikou, Hainan 571158, China
- Hongqi Zhang: College of Information Science Technology, Hainan Normal University, Haikou, Hainan 571158, China
- Hao Lin: School of Mathematics and Statistics, Hainan Normal University, Haikou, Hainan 571158, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Haixia Long: College of Information Science Technology, Hainan Normal University, Haikou, Hainan 571158, China
4. Feng X, Fang F, Long H, Zeng R, Yao Y. Single-cell RNA-seq data analysis using graph autoencoders and graph attention networks. Front Genet 2022; 13:1003711. [PMID: 36568390 PMCID: PMC9780469 DOI: 10.3389/fgene.2022.1003711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Accepted: 11/21/2022] [Indexed: 12/13/2022]
Abstract
With the development of high-throughput sequencing technology, the scale of single-cell RNA sequencing (scRNA-seq) data has surged. These data are typically high-dimensional, with high dropout noise and high sparsity. Therefore, gene imputation and cell clustering analysis of scRNA-seq data are increasingly important. Statistical or traditional machine learning methods are inefficient, and improved accuracy is needed. Methods based on deep learning cannot directly process non-Euclidean spatial data, such as cell graphs. In this study, we developed scGAEGAT, a multi-modal model with graph autoencoders and graph attention networks for scRNA-seq analysis based on graph neural networks. Cosine similarity, median L1 distance, and root-mean-squared error were used to measure the gene imputation performance of different methods for comparison with scGAEGAT. Furthermore, adjusted mutual information, normalized mutual information, completeness score, and Silhouette coefficient score were used to measure the cell clustering performance of different methods for comparison with scGAEGAT. Experimental results demonstrated promising performance of the scGAEGAT model in gene imputation and cell clustering prediction on four scRNA-seq data sets with gold-standard cell labels.
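As an illustration of one of the clustering metrics listed above, the following Python sketch computes normalized mutual information between predicted cluster labels and gold-standard labels. It assumes normalization by the arithmetic mean of the two label entropies, which is one common convention; the abstract does not say which normalization the authors used, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_mutual_information(a, b):
    """MI divided by the mean of the two entropies (assumed normalization)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both assignments put every cell in a single cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2)


# Hypothetical labels: the prediction matches the gold standard up to renaming.
gold = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]
print(normalized_mutual_information(gold, predicted))
```

The score is invariant to how clusters are named, so a prediction that recovers the same partition under different label values still scores as a perfect match.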
Affiliation(s)
- Xiang Feng: College of Information Science Technology, Hainan Normal University, Haikou, Hainan, China
- Fang Fang: College of Information Engineering, Hainan Vocational University of Science and Technology, Haikou, Hainan, China
- Haixia Long: College of Information Science Technology, Hainan Normal University, Haikou, Hainan, China
- Rao Zeng: College of Information Science Technology, Hainan Normal University, Haikou, Hainan, China
- Yuhua Yao: College of Mathematics and Statistics, Hainan Normal University, Haikou, Hainan, China
5. Fang Q, Su D, Ng W, Feng J. An Effective Biclustering-Based Framework for Identifying Cell Subpopulations From scRNA-seq Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2249-2260. [PMID: 32167906 DOI: 10.1109/tcbb.2020.2979717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) techniques opens up new opportunities for studying cell-specific changes in transcriptomic data. An important research problem in scRNA-seq data analysis is to identify cell subpopulations with distinct functions. However, the expression profiles of individual cells are usually measured over tens of thousands of genes, and it remains difficult to cluster the cells effectively based on these high-dimensional profiles. An additional challenge is that scRNA-seq data are often noisy and sometimes extremely sparse due to technical limitations and sampling deficiencies. In this paper, we propose a biclustering-based framework called DivBiclust that effectively identifies cell subpopulations from high-dimensional, noisy scRNA-seq data. Compared with nine state-of-the-art methods, DivBiclust excels in identifying cell subpopulations with high accuracy, as evidenced by our experiments on ten real scRNA-seq datasets with different sizes and diverse dropout rates. The supplemental materials of DivBiclust, including the source code, data, and a supplementary document, are available at https://www.github.com/Qiong-Fang/DivBiclust.
6. Chang Y, Allen C, Wan C, Chung D, Zhang C, Li Z, Ma Q. IRIS-FGM: an integrative single-cell RNA-Seq interpretation system for functional gene module analysis. Bioinformatics 2021; 37:3045-3047. [PMID: 33595622 PMCID: PMC8479672 DOI: 10.1093/bioinformatics/btab108] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/13/2020] [Revised: 12/28/2020] [Accepted: 02/15/2021] [Indexed: 02/02/2023]
Abstract
Summary: Single-cell RNA-Seq (scRNA-Seq) data are useful for discovering cell heterogeneity and signature genes in specific cell populations in cancer and other complex diseases. Specifically, the investigation of condition-specific functional gene modules (FGMs) can help to understand interactive gene networks and complex biological processes in different cell clusters. QUBIC2 is recognized as one of the most efficient and effective biclustering tools for condition-specific FGM identification from scRNA-Seq data. However, because it is available only as a C implementation, its application has been restricted to a few downstream analysis functionalities. We developed an R package named IRIS-FGM (Integrative scRNA-Seq Interpretation System for Functional Gene Module analysis) to support the investigation of FGMs and cell clustering using scRNA-Seq data. Empowered by QUBIC2, IRIS-FGM can effectively identify condition-specific FGMs, predict cell types/clusters, uncover differentially expressed genes and perform pathway enrichment analysis. It is noteworthy that IRIS-FGM can also take Seurat objects as input, facilitating easy integration with existing analysis pipelines. Availability and implementation: IRIS-FGM is implemented in the R environment (as of version 3.6) with the source code freely available at https://github.com/BMEngineeR/IRISFGM. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuzhou Chang: Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Carter Allen: Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Changlin Wan: Center for Computational Biology and Bioinformatics and Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Dongjun Chung: Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Chi Zhang: Center for Computational Biology and Bioinformatics and Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA. To whom correspondence should be addressed.
- Zihai Li: Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA. To whom correspondence should be addressed.
- Qin Ma: Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA. To whom correspondence should be addressed.
7. Wang YXR, Li L, Li JJ, Huang H. Network Modeling in Biology: Statistical Methods for Gene and Brain Networks. Stat Sci 2021; 36:89-108. [PMID: 34305304 PMCID: PMC8296984 DOI: 10.1214/20-sts792] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
The rise of network data in many different domains has offered researchers new insight into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks, where network edges can be directly observed, both gene and brain networks require careful estimation of edges using covariates as a first step. We provide a discussion of existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.
Affiliation(s)
- Y X Rachel Wang: School of Mathematics and Statistics, University of Sydney, Australia
- Lexin Li: Department of Biostatistics and Epidemiology, School of Public Health, University of California, Berkeley
- Haiyan Huang: Department of Statistics, University of California, Berkeley
8. Banerjee T, Bhattacharya BB, Mukherjee G. A nearest-neighbor based nonparametric test for viral remodeling in heterogeneous single-cell proteomic data. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
9. Xie J, Ma A, Fennell A, Ma Q, Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 2020; 20:1449-1464. [PMID: 29490019 DOI: 10.1093/bib/bby014] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 11/15/2017] [Revised: 01/16/2018] [Indexed: 12/12/2022]
Abstract
Biclustering is a powerful data mining technique that clusters the rows and columns of a matrix-format data set simultaneously. It was first applied to gene expression data in 2000, aiming to identify genes that are co-expressed under a subset of all the conditions/samples. During the past 17 years, dozens of biclustering algorithms and tools have been developed to help make sense of the large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge about how to select appropriate biclustering tools and supporting computational techniques for specific studies. Here, we first give a brief introduction to the existing biclustering algorithms and tools in the public domain, and then systematically summarize the basic applications of biclustering to biological data and the more advanced applications of biclustering to biomedical data. This review will assist researchers in analyzing their big data effectively and in generating valuable biological knowledge and novel insights with higher efficiency.
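To make the idea of a bicluster concrete, the Python sketch below scores a chosen subset of rows (genes) and columns (conditions) with the mean squared residue, a coherence score commonly used in the biclustering literature. The choice of score, the function name, and the toy matrix are assumptions made for illustration; the abstract itself does not single out any particular algorithm or score.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by rows and cols.

    A value of 0 means every selected gene follows the same additive
    pattern across the selected conditions.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 2.0, 9.0, 3.0],
    [2.0, 3.0, 0.0, 4.0],
    [7.0, 1.0, 5.0, 0.0],
    [4.0, 5.0, 8.0, 6.0],
]

# Genes 0, 1, 3 are coherent under conditions 0, 1, 3 but not under all four.
bicluster_score = mean_squared_residue(expression, [0, 1, 3], [0, 1, 3])
full_score = mean_squared_residue(expression, [0, 1, 2, 3], [0, 1, 2, 3])
print(bicluster_score, full_score)
```

The selected submatrix scores close to zero while the full matrix does not, which is the sense in which a bicluster captures co-expression under only a subset of conditions.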
10. Qi R, Ma A, Ma Q, Zou Q. Clustering and classification methods for single-cell RNA-sequencing data. Brief Bioinform 2019; 21:1196-1208. [PMID: 31271412 DOI: 10.1093/bib/bbz062] [Citation(s) in RCA: 104] [Impact Index Per Article: 20.8] [Received: 02/15/2019] [Revised: 04/24/2019] [Accepted: 04/25/2019] [Indexed: 12/12/2022]
Abstract
Appropriate ways to measure the similarity between single-cell RNA-sequencing (scRNA-seq) data are ubiquitous in bioinformatics, but using single clustering or classification methods to process scRNA-seq data is generally difficult. This has led to the emergence of integrated methods and tools that aim to automatically process specific problems associated with scRNA-seq data. These approaches have attracted a lot of interest in bioinformatics and related fields. In this paper, we systematically review the integrated methods and tools, highlighting the pros and cons of each approach. We not only pay particular attention to clustering and classification methods but also discuss methods that have emerged recently as powerful alternatives, including nonlinear and linear methods and descending dimension methods. Finally, we focus on clustering and classification methods for scRNA-seq data, in particular, integrated methods, and provide a comprehensive description of scRNA-seq data and download URLs.
Affiliation(s)
- Ren Qi: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Anjun Ma: Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, USA
- Qin Ma: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China