1
Gong M, Yu Y, Wang Z, Zhang J, Wang X, Fu C, Zhang Y, Wang X. scAuto as a comprehensive framework for single-cell chromatin accessibility data analysis. Comput Biol Med 2024; 171:108230. [PMID: 38442554 DOI: 10.1016/j.compbiomed.2024.108230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 02/06/2024] [Accepted: 02/25/2024] [Indexed: 03/07/2024]
Abstract
Interpreting single-cell chromatin accessibility data is crucial for understanding how intercellular heterogeneity is regulated. Despite progress in computational methods for analyzing these data, a comprehensive analytical framework and a user-friendly online analysis tool are still lacking. To fill this gap, we developed a pre-trained deep learning-based framework, single-cell auto-correlation transformers (scAuto). Following DNABERT's methodology of pre-training and fine-tuning, scAuto learns a general understanding of DNA sequence grammar through self-supervised pre-training on the unlabeled human genome; it is then transferred to the single-cell chromatin accessibility analysis task on scATAC-seq data for supervised fine-tuning. We extensively validated scAuto on the Buenrostro2018 dataset, demonstrating its superior performance on chromatin accessibility prediction, single-cell clustering, and data denoising. Based on scAuto, we further developed an interactive web server for single-cell chromatin accessibility data analysis, which integrates tutorial-style interfaces for users with limited programming skills. The platform is accessible at http://zhanglab.icaup.cn. We expect this work to help analyze single-cell chromatin accessibility data and facilitate the development of precision medicine.
Affiliation(s)
- Meiqin Gong: Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, 610041, China
- Yun Yu: School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Zixuan Wang: College of Electronics and Information Engineering, Sichuan University, Chengdu, 610065, China
- Junming Zhang: School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Xiongyi Wang: School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Cheng Fu: School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yongqing Zhang: School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Xiaodong Wang: Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, 610041, China
2
Zhang W, Ma Z, Wang L, Fan D, Ho YY. Genome-wide search algorithms for identifying dynamic gene co-expression via Bayesian variable selection. Stat Med 2023; 42:5616-5629. [PMID: 37806971 DOI: 10.1002/sim.9928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Revised: 08/08/2023] [Accepted: 09/19/2023] [Indexed: 10/10/2023]
Abstract
A wealth of gene expression data generated by high-throughput techniques provides exciting opportunities for studying gene-gene interactions systematically. Gene-gene interactions in a biological system are tightly regulated and are often highly dynamic. The interactions can change flexibly under various internal cellular signals or external stimuli. Previous studies have developed statistical methods to examine these dynamic changes in gene-gene interactions. However, due to the massive number of possible gene combinations that need to be considered in a typical genomic dataset, intensive computation is a common challenge for exploring gene-gene interactions. On the other hand, oftentimes only a small proportion of gene combinations exhibit dynamic co-expression changes. To solve this problem, we propose Bayesian variable selection approaches based on spike-and-slab priors. The proposed algorithms reduce the computational intensity by focusing on identifying subsets of promising gene combinations in the search space. We also adopt a Bayesian multiple hypothesis testing procedure to identify strong dynamic gene co-expression changes. Simulation studies are performed to compare the proposed approaches with existing exhaustive search heuristics. We demonstrate the implementation of our proposed approach to study the association between gene co-expression patterns and overall survival using the RNA-sequencing dataset from The Cancer Genome Atlas breast cancer BRCA-US project.
Affiliation(s)
- Wenda Zhang: Walmart Global Tech, Sunnyvale, California, USA
- Zichen Ma: Department of Mathematics, Colgate University, Hamilton, New York, USA
- Lianming Wang: Department of Statistics, University of South Carolina, Columbia, South Carolina, USA
- Daping Fan: Department of Cell Biology and Anatomy, University of South Carolina, Columbia, South Carolina, USA
- Yen-Yi Ho: Department of Statistics, University of South Carolina, Columbia, South Carolina, USA
3
Cho H, Liu C, Preisser JS, Wu D. A bivariate zero-inflated negative binomial model and its applications to biomedical settings. Stat Methods Med Res 2023; 32:1300-1317. [PMID: 37167422 PMCID: PMC10500952 DOI: 10.1177/09622802231172028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
The zero-inflated negative binomial distribution has been widely used for count data analyses in various biomedical settings because it can model both excess zeros and overdispersion. When there are correlated count variables, a bivariate model is essential for understanding their full distributional features. Examples include measuring the correlation of two genes in sparse single-cell RNA sequencing (scRNA-seq) data and modeling dental caries count indices on two different tooth surface types. For these purposes, we develop a richly parametrized bivariate zero-inflated negative binomial model that has a simple latent variable framework and eight free parameters with intuitive interpretations. In the scRNA-seq data example, the correlation is estimated after adjusting for the effects of dropout events represented by excess zeros. In the dental caries data, we analyze how treatment with Xylitol lozenges affects the marginal mean and other patterns of response manifested in the two dental caries traits. An R package "bzinb" is available on the Comprehensive R Archive Network.
Affiliation(s)
- Hunyong Cho: Department of Biostatistics, University of North Carolina at Chapel Hill, NC, USA
- Chuwen Liu: Department of Biostatistics, University of North Carolina at Chapel Hill, NC, USA
- John S Preisser: Department of Biostatistics, University of North Carolina at Chapel Hill, NC, USA
- Di Wu: Department of Biostatistics, University of North Carolina at Chapel Hill, NC, USA; Division of Oral and Craniofacial Health Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, NC, USA
4
Sun L, Zhou H, Zhao X, Zhang H, Wang Y, Li G. Small RNA sequencing identified miR-3180 as a potential prognostic biomarker for Chinese hepatocellular carcinoma patients. Front Genet 2023; 14:1102171. [PMID: 37051592 PMCID: PMC10083302 DOI: 10.3389/fgene.2023.1102171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2022] [Accepted: 03/10/2023] [Indexed: 03/28/2023] Open
Abstract
MicroRNAs (miRNAs) and their target genes are aberrantly expressed in many cancers and are linked to carcinogenesis and metastasis, especially among hepatocellular carcinoma (HCC) patients. This study sought to identify new biomarkers related to HCC prognosis using small RNA sequencing from the tumor and matched normal adjacent tissue of 32 patients with HCC. Eight miRNAs were downregulated and 61 were upregulated more than twofold. Of these, five miRNAs, hsa-miR-3180, hsa-miR-5589-5p, hsa-miR-490-5p, hsa-miR-137, and hsa-miR-378i, were significantly associated with 5-year overall survival (OS) rates. Differential upregulation of hsa-miR-3180 and downregulation of hsa-miR-378i in tumor samples supported the finding that low and high concentrations of hsa-miR-3180 (p = 0.029) and hsa-miR-378i (p = 0.047), respectively, were associated with higher 5-year OS. Cox regression analyses indicated that hsa-miR-3180 (HR = 0.08; p = 0.013) and hsa-miR-378i (HR = 18.34; p = 0.045) were independent prognostic factors of poor survival. However, high hsa-miR-3180 expression obtained larger AUCs for OS and progression-free survival (PFS) and had better nomogram prediction than hsa-miR-378i. These findings indicate that hsa-miR-3180 may be associated with HCC progression and could serve as a potential biomarker for this disease.
Affiliation(s)
- Libo Sun: General Surgery Center, Beijing YouAn Hospital, Capital Medical University, Beijing, China
- Hansheng Zhou: Department of Pharmacy, Linyi People’s Hospital, Linyi, Shandong, China
- Xiaofei Zhao: General Surgery Center, Beijing YouAn Hospital, Capital Medical University, Beijing, China
- Haitao Zhang: General Surgery Center, Beijing YouAn Hospital, Capital Medical University, Beijing, China
- Yan Wang*: CAS Key Lab of Mental Health, Institute of Psychology, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Guangming Li*: General Surgery Center, Beijing YouAn Hospital, Capital Medical University, Beijing, China
*Correspondence: Yan Wang; Guangming Li
5
What can scatterplots teach us about doing data science better? International Journal of Data Science and Analytics 2022. [DOI: 10.1007/s41060-022-00362-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
6
Li L, Zeng J, Zhang X. Generalized Liquid Association Analysis for Multimodal Data Integration. J Am Stat Assoc 2022; 118:1984-1996. [PMID: 38099062 PMCID: PMC10720690 DOI: 10.1080/01621459.2021.2024437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2021] [Accepted: 12/27/2021] [Indexed: 10/19/2022]
Abstract
Multimodal data are now prevalent in scientific research. One of the central questions in multimodal integrative analysis is how two data modalities associate and interact with each other given another modality or demographic variables. The problem can be formulated as studying the associations among three sets of random variables, a question that has received relatively little attention in the literature. In this article, we propose a novel generalized liquid association analysis method, which offers a new angle on this important class of three-way association problems. We extend the notion of liquid association of Li (2002) from the univariate setting to the sparse, multivariate, and high-dimensional setting. We establish a population dimension reduction model, transform the problem to sparse Tucker decomposition of a three-way tensor, and develop a higher-order orthogonal iteration algorithm for parameter estimation. We derive the non-asymptotic error bound and asymptotic consistency of the proposed estimator, while allowing the variable dimensions to be larger than and diverge with the sample size. We demonstrate the efficacy of the method through both simulations and a multimodal neuroimaging application for Alzheimer's disease research.
Affiliation(s)
- Lexin Li: University of California at Berkeley
7
Yang Z, Ho YY. Modeling dynamic correlation in zero-inflated bivariate count data with applications to single-cell RNA sequencing data. Biometrics 2021; 78:766-776. [PMID: 33720414 PMCID: PMC8477913 DOI: 10.1111/biom.13457] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/28/2019] [Revised: 03/03/2021] [Accepted: 03/08/2021] [Indexed: 12/13/2022]
Abstract
Interactions between biological molecules in a cell are tightly coordinated and often highly dynamic. As a result of these varying signaling activities, changes in gene coexpression patterns could often be observed. The advancements in next‐generation sequencing technologies bring new statistical challenges for studying these dynamic changes of gene coexpression. In recent years, methods have been developed to examine genomic information from individual cells. Single‐cell RNA sequencing (scRNA‐seq) data are count‐based, and often exhibit characteristics such as overdispersion and zero inflation. To explore the dynamic dependence structure in scRNA‐seq data and other zero‐inflated count data, new approaches are needed. In this paper, we consider overdispersion and zero inflation in count outcomes and propose a ZEro‐inflated negative binomial dynamic COrrelation model (ZENCO). The observed count data are modeled as a mixture of two components: success amplifications and dropout events in ZENCO. A latent variable is incorporated into ZENCO to model the covariate‐dependent correlation structure. We conduct simulation studies to evaluate the performance of our proposed method and to compare it with existing approaches. We also illustrate the implementation of our proposed approach using scRNA‐seq data from a study of minimal residual disease in melanoma.
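As an illustration of the two-component mixture described in this abstract, the following minimal sketch evaluates a generic univariate zero-inflated negative binomial probability mass function. It is not the ZENCO model itself: the mean/size parameterization and the names `nb_pmf`, `zinb_pmf`, `pi_zero`, `mu`, and `size` are assumptions chosen for illustration, and the covariate-dependent correlation structure is omitted.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean `mu` (> 0) and dispersion `size` (> 0).

    Parameterization is an assumption: variance = mu + mu**2 / size.
    """
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, pi_zero, mu, size):
    """Zero-inflated negative binomial pmf.

    With probability `pi_zero` the count is a structural zero (for example,
    a dropout event); otherwise it is drawn from the negative binomial.
    """
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base
```

Under this parameterization the mixture mean is `(1 - pi_zero) * mu`, so excess zeros lower the observed mean relative to the underlying negative binomial component.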
Affiliation(s)
- Zhen Yang: Department of Statistics, University of South Carolina, Columbia, South Carolina, USA
- Yen-Yi Ho: Department of Statistics, University of South Carolina, Columbia, South Carolina, USA
8
Wen X, Gao L, Hu Y. LAceModule: Identification of Competing Endogenous RNA Modules by Integrating Dynamic Correlation. Front Genet 2020; 11:235. [PMID: 32256525 PMCID: PMC7093494 DOI: 10.3389/fgene.2020.00235] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 12/11/2019] [Accepted: 02/27/2020] [Indexed: 12/14/2022] Open
Abstract
Competing endogenous RNAs (ceRNAs) regulate each other by competitively binding microRNAs they share. This is a vital post-transcriptional regulation mechanism and plays critical roles in physiological and pathological processes. Current computational methods for the identification of ceRNA pairs are mainly based on the correlation of the expression of ceRNA candidates and the number of shared microRNAs, without considering the sensitivity of the correlation to the expression levels of the shared microRNAs. To overcome this limitation, we introduced liquid association (LA), a dynamic correlation measure, which can evaluate the sensitivity of the correlation of ceRNAs to microRNAs, as an additional factor for the detection of ceRNAs. To this end, we firstly analyzed the effect of LA on detecting ceRNA pairs. Subsequently, we proposed an LA-based framework, termed LAceModule, to identify ceRNA modules by integrating the conventional Pearson correlation coefficient and dynamic correlation LA with multi-view non-negative matrix factorization. Using breast and liver cancer datasets, the experimental results demonstrated that LA is a useful measure in the detection of ceRNA pairs and modules. We found that the identified ceRNA modules play roles in cell adhesion, cell migration, and cell-cell communication. Furthermore, our results show that ceRNAs may represent potential drug targets and markers for the treatment and prognosis of cancer.
Affiliation(s)
- Xiao Wen: School of Computer Science and Technology, Xidian University, Xi'an, China
- Lin Gao: School of Computer Science and Technology, Xidian University, Xi'an, China
- Yuxuan Hu: School of Computer Science and Technology, Xidian University, Xi'an, China
9
Lu J, Lu Y, Ding Y, Xiao Q, Liu L, Cai Q, Kong Y, Bai Y, Yu T. DNLC: differential network local consistency analysis. BMC Bioinformatics 2019; 20:489. [PMID: 31874600 PMCID: PMC6929334 DOI: 10.1186/s12859-019-3046-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/15/2019] [Accepted: 08/21/2019] [Indexed: 12/04/2022] Open
Abstract
Background: The biological network is highly dynamic. Functional relations between genes can be activated or deactivated depending on the biological conditions. On the genome-scale network, subnetworks that gain or lose local expression consistency may shed light on the regulatory mechanisms related to the changing biological conditions, such as disease status or tissue developmental stages. Results: In this study, we develop a new method to select genes and modules on the existing biological network in which local expression consistency changes significantly between clinical conditions. The method is called DNLC: Differential Network Local Consistency. In simulations, our algorithm detected artificially created local consistency changes effectively. We applied the method to two publicly available datasets, and it detected novel genes and network modules that were biologically plausible. Conclusions: The new method is effective in finding modules in which the gene expression consistency changes between clinical conditions. It is a useful tool that complements traditional differential expression analyses to make discoveries from gene expression data. The R package is available at https://cran.r-project.org/web/packages/DNLC.
Affiliation(s)
- Jianwei Lu: School of Software Engineering, Tongji University, Shanghai, China; Institute of Advanced Translational Medicine, Tongji University, Shanghai, China
- Yao Lu: School of Software Engineering, Tongji University, Shanghai, China
- Yusheng Ding: School of Software Engineering, Tongji University, Shanghai, China
- Qingyang Xiao: Department of Environmental Health, Emory University, Atlanta, GA, USA
- Linqing Liu: School of Software Engineering, Tongji University, Shanghai, China
- Qingpo Cai: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
- Yunchuan Kong: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
- Yun Bai: Department of Pharmaceutical Sciences, School of Pharmacy, Philadelphia College of Osteopathic Medicine, Georgia Campus, Suwanee, GA, USA
- Tianwei Yu: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
10
Zeng T, Dai H. Single-Cell RNA Sequencing-Based Computational Analysis to Describe Disease Heterogeneity. Front Genet 2019; 10:629. [PMID: 31354786 PMCID: PMC6640157 DOI: 10.3389/fgene.2019.00629] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 01/17/2019] [Accepted: 06/17/2019] [Indexed: 12/25/2022] Open
Abstract
The trillions of cells in the human body can be viewed as elementary but essential biological units that achieve different body states, but the low resolution of previous cell isolation and measurement approaches limits our understanding of the cell-specific molecular profiles. The recent establishment and rapid growth of single-cell sequencing technology has facilitated the identification of molecular profiles of heterogeneous cells, especially on the transcription level of single cells [single-cell RNA sequencing (scRNA-seq)]. As a novel method, the robustness of scRNA-seq under changing conditions will determine its practical potential in major research programs and clinical applications. In this review, we first briefly presented the scRNA-seq-related methods from the point of view of experiments and computation. Then, we compared several state-of-the-art scRNA-seq analysis frameworks mainly by analyzing their performance robustness on independent scRNA-seq datasets for the same complex disease. Finally, we elaborated on our hypothesis on consensus scRNA-seq analysis and summarized the potential indicative and predictive roles of individual cells in understanding disease heterogeneity by single-cell technologies.
Affiliation(s)
- Tao Zeng: Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
11
A hypergraph-based method for large-scale dynamic correlation study at the transcriptomic scale. BMC Genomics 2019; 20:397. [PMID: 31117943 PMCID: PMC6530038 DOI: 10.1186/s12864-019-5787-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/30/2018] [Accepted: 05/09/2019] [Indexed: 12/22/2022] Open
Abstract
Background: The biological regulatory system is highly dynamic. Correlations between functionally related genes change over different biological conditions, which are often unobserved in the data. At the gene level, the dynamic correlations result in three-way gene interactions involving a pair of genes that change correlation, and a third gene that reflects the underlying cellular conditions. This type of ternary relation can be quantified by the Liquid Association statistic. Studying these three-way interactions at the gene triplet level has revealed important regulatory mechanisms in the biological system. Currently, due to the extremely large number of possible triplet combinations within a high-throughput gene expression dataset, no method is available to examine the ternary relationship at the biological system level and formally address the false discovery issue. Results: Here we propose a new method, Hypergraph for Dynamic Correlation (HDC), to construct module-level three-way interaction networks. The method is able to present integrative uniform hypergraphs to reflect the global dynamic correlation pattern in the biological system, providing guidance to downstream gene triplet-level analyses. To validate the method’s ability, we conducted two real data experiments using a melanoma RNA-seq dataset from The Cancer Genome Atlas (TCGA) and a yeast cell cycle dataset. The resulting hypergraphs are clearly biologically plausible, and suggest novel relations relevant to the biological conditions in the data. Conclusions: We believe the new approach provides a valuable alternative method to analyze omics data that can extract higher order structures. The software is at https://github.com/yunchuankong/HypergraphDynamicCorrelation.
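To make the Liquid Association statistic mentioned in this abstract concrete, here is a minimal sketch of the moment-based estimator commonly attributed to Li (2002): standardize the two genes X and Y, normal-score transform the third gene Z, and average the triple product. This is not the HDC method. The function names and the simulated data in `simulate_triplet` are hypothetical and exist only for illustration.

```python
import random
import statistics
from statistics import NormalDist


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    unit_normal = NormalDist()
    for rank, idx in enumerate(order, start=1):
        scores[idx] = unit_normal.inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Estimate LA(X, Y | Z) as the mean of X * Y * Z after transformation."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return statistics.fmean(a * b * c for a, b, c in zip(xs, ys, zs))


def simulate_triplet(n, seed):
    """Hypothetical data: the X-Y correlation grows with the level of Z."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [zi * xi + 0.5 * rng.gauss(0, 1) for zi, xi in zip(z, x)]
    return x, y, z
```

A clearly positive value indicates that the correlation between X and Y increases as Z increases; a value near zero indicates no such dependence. Ties in Z are broken arbitrarily by rank order in this sketch.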