1
LeRoy N, Smith J, Zheng G, Rymuza J, Gharavi E, Brown D, Zhang A, Sheffield N. Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings. NAR Genom Bioinform 2024; 6:lqae073. [PMID: 38974799 PMCID: PMC11224678 DOI: 10.1093/nargab/lqae073] [Received: 10/27/2023] [Revised: 04/29/2024] [Accepted: 06/20/2024] [Indexed: 07/09/2024] Open
Abstract
Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available at https://github.com/databio/geniml. Pre-trained models from this work are publicly available on Hugging Face: https://huggingface.co/databio.
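The transfer idea in this abstract can be illustrated with a small sketch. It assumes, as an illustration only, that a cell is represented by averaging the pre-trained vectors of its accessible regions; the abstract does not state the pooling rule, and the region names, vectors, and the `embed_cell` helper below are hypothetical and are not the geniml API.

```python
from statistics import fmean

# Hypothetical pre-trained region embeddings (toy 3-d vectors, not real model output).
region_vectors = {
    "chr1:100-200": [0.9, 0.1, 0.0],
    "chr1:500-650": [0.8, 0.2, 0.1],
    "chr2:300-420": [0.0, 0.9, 0.3],
}


def embed_cell(accessible_regions, vectors):
    """Average the vectors of a cell's accessible regions that the model knows.

    Mean pooling is an assumption made for this sketch.
    """
    known = [vectors[r] for r in accessible_regions if r in vectors]
    if not known:
        raise ValueError("no accessible region is covered by the model")
    # zip(*known) groups the values of each embedding dimension together.
    return [fmean(dim) for dim in zip(*known)]


# Two cells from an "unseen" dataset, embedded without any retraining.
cell_a = embed_cell(["chr1:100-200", "chr1:500-650"], region_vectors)
# The chrX region is absent from the toy model and is skipped.
cell_b = embed_cell(["chr2:300-420", "chrX:1-50"], region_vectors)
```

Because the region vectors are fixed, embedding a new cell reduces to a lookup and an average, which is one plausible reading of where the computational advantage of pre-training comes from.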
Affiliation(s)
- Nathan J LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Jason P Smith
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Guangtao Zheng
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Julia Rymuza
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Erfaneh Gharavi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Donald E Brown
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
2
Rachid Zaim S, Pebworth MP, McGrath I, Okada L, Weiss M, Reading J, Czartoski JL, Torgerson TR, McElrath MJ, Bumol TF, Skene PJ, Li XJ. MOCHA's advanced statistical modeling of scATAC-seq data enables functional genomic inference in large human cohorts. Nat Commun 2024; 15:6828. [PMID: 39122670 PMCID: PMC11316085 DOI: 10.1038/s41467-024-50612-6] [Received: 07/04/2023] [Accepted: 07/13/2024] [Indexed: 08/12/2024] Open
Abstract
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) is being increasingly used to study gene regulation. However, major analytical gaps limit its utility in studying gene regulatory programs in complex diseases. In response, MOCHA (Model-based single cell Open CHromatin Analysis) presents major advances over existing analysis tools, including: 1) improving identification of sample-specific open chromatin, 2) statistical modeling of technical drop-out with zero-inflated methods, 3) mitigation of false positives in single cell analysis, 4) identification of alternative transcription-starting-site regulation, and 5) modules for inferring temporal gene regulatory networks from longitudinal data. These advances, in addition to open chromatin analyses, provide a robust framework after quality control and cell labeling to study gene regulatory programs in human disease. We benchmark MOCHA with four state-of-the-art tools to demonstrate its advances. We also construct cross-sectional and longitudinal gene regulatory networks, identifying potential mechanisms of COVID-19 response. MOCHA provides researchers with a robust analytical tool for functional genomic inference from scATAC-seq data.
Affiliation(s)
- Lauren Okada
  - Allen Institute for Immunology, Seattle, WA, USA
- Morgan Weiss
  - Allen Institute for Immunology, Seattle, WA, USA
- Julie L Czartoski
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- M Juliana McElrath
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Xiao-Jun Li
  - Allen Institute for Immunology, Seattle, WA, USA
3
Tian T, Zhang J, Lin X, Wei Z, Hakonarson H. Dependency-aware deep generative models for multitasking analysis of spatial omics data. Nat Methods 2024; 21:1501-1513. [PMID: 38783067 DOI: 10.1038/s41592-024-02257-y] [Received: 01/02/2023] [Accepted: 03/25/2024] [Indexed: 05/25/2024]
Abstract
Spatially resolved transcriptomics (SRT) technologies have significantly advanced biomedical research, but their data analysis remains challenging due to the discrete nature of the data and the high levels of noise, compounded by complex spatial dependencies. Here, we propose spaVAE, a dependency-aware, deep generative spatial variational autoencoder model that probabilistically characterizes count data while capturing spatial correlations. spaVAE introduces a hybrid embedding combining a Gaussian process prior with a Gaussian prior to explicitly capture spatial correlations among spots. It then optimizes the parameters of deep neural networks to approximate the distributions underlying the SRT data. With the approximated distributions, spaVAE can contribute to several analytical tasks that are essential for SRT data analysis, including dimensionality reduction, visualization, clustering, batch integration, denoising, differential expression, spatial interpolation, resolution enhancement and identification of spatially variable genes. Moreover, we have extended spaVAE to spaPeakVAE and spaMultiVAE to characterize spatial ATAC-seq (assay for transposase-accessible chromatin using sequencing) data and spatial multi-omics data, respectively.
Affiliation(s)
- Tian Tian
  - School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, Hubei, China
  - Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Jie Zhang
  - National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China
- Xiang Lin
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Hakon Hakonarson
  - Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Division of Human Genetics, Department of Pediatrics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
4
Lyu Z, Dahal S, Zeng S, Wang J, Xu D, Joshi T. CrossMP: Enabling Cross-Modality Translation between Single-Cell RNA-Seq and Single-Cell ATAC-Seq through Web-Based Portal. Genes (Basel) 2024; 15:882. [PMID: 39062661 PMCID: PMC11276538 DOI: 10.3390/genes15070882] [Received: 05/25/2024] [Revised: 06/22/2024] [Accepted: 07/03/2024] [Indexed: 07/28/2024] Open
Abstract
In recent years, there has been a growing interest in profiling multiomic modalities within individual cells simultaneously. One such example is integrating combined single-cell RNA sequencing (scRNA-seq) data and single-cell transposase-accessible chromatin sequencing (scATAC-seq) data. Integrated analysis of diverse modalities has helped researchers make more accurate predictions and gain a more comprehensive understanding than with single-modality analysis. However, generating such multimodal data is technically challenging and expensive, leading to limited availability of single-cell co-assay data. Here, we propose a model for cross-modal prediction between the transcriptome and chromatin profiles in single cells. Our model is based on a deep neural network architecture that learns the latent representations from the source modality and then predicts the target modality. It demonstrates reliable performance in accurately translating between these modalities across multiple paired human scATAC-seq and scRNA-seq datasets. Additionally, we developed CrossMP, a web-based portal allowing researchers to upload their single-cell modality data through an interactive web interface and predict the other type of modality data, using high-performance computing resources plugged at the backend.
Affiliation(s)
- Zhen Lyu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- Sabin Dahal
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- Shuai Zeng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Juexin Wang
  - Department of BioHealth Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University Indianapolis, Indianapolis, IN 46202, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
  - MU Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, USA
- Trupti Joshi
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
  - MU Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, USA
  - Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, MO 65211, USA
5
Wang X, Lian Q, Dong H, Xu S, Su Y, Wu X. Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data. Genomics, Proteomics & Bioinformatics 2024; 22:qzae014. [PMID: 39049508 PMCID: PMC11423854 DOI: 10.1093/gpbjnl/qzae014] [Received: 05/01/2022] [Revised: 06/20/2023] [Accepted: 06/25/2023] [Indexed: 07/27/2024]
Abstract
Gene set scoring (GSS) has been routinely conducted for gene expression analysis of bulk or single-cell RNA sequencing (RNA-seq) data, which helps to decipher single-cell heterogeneity and cell type-specific variability by incorporating prior knowledge from functional gene sets. Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a powerful technique for interrogating single-cell chromatin-based gene regulation, and genes or gene sets with dynamic regulatory potentials can be regarded as cell type-specific markers as if in single-cell RNA-seq (scRNA-seq). However, there are few GSS tools specifically designed for scATAC-seq, and the applicability and performance of RNA-seq GSS tools on scATAC-seq data remain to be investigated. Here, we systematically benchmarked ten GSS tools, including four bulk RNA-seq tools, five scRNA-seq tools, and one scATAC-seq method. First, using matched scATAC-seq and scRNA-seq datasets, we found that the performance of GSS tools on scATAC-seq data was comparable to that on scRNA-seq, suggesting their applicability to scATAC-seq. Then, the performance of different GSS tools was extensively evaluated using up to ten scATAC-seq datasets. Moreover, we evaluated the impact of gene activity conversion, dropout imputation, and gene set collections on the results of GSS. Results show that dropout imputation can significantly promote the performance of almost all GSS tools, while the impact of gene activity conversion methods or gene set collections on GSS performance is more dependent on GSS tools or datasets. Finally, we provided practical guidelines for choosing appropriate preprocessing methods and GSS tools in different application scenarios.
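For readers unfamiliar with gene set scoring, the following is a minimal sketch of the general idea on gene-activity values derived from scATAC-seq. It is not one of the ten benchmarked tools: the scoring rule (mean activity of the set minus the cell-wide mean), the marker genes chosen, and all numbers are assumptions made for illustration.

```python
from statistics import fmean

# Toy gene-activity values for two cells (hypothetical numbers).
gene_activity = {
    "cell_1": {"CD3E": 4.0, "CD3D": 3.0, "MS4A1": 0.0, "CD79A": 1.0},
    "cell_2": {"CD3E": 0.0, "CD3D": 1.0, "MS4A1": 5.0, "CD79A": 4.0},
}
t_cell_set = ["CD3E", "CD3D"]  # illustrative marker set


def gene_set_score(activity, gene_set):
    """Mean activity of the set's genes minus the mean over all genes in the cell."""
    in_set = [activity[g] for g in gene_set if g in activity]
    if not in_set:
        raise ValueError("none of the gene set is present in this cell")
    return fmean(in_set) - fmean(list(activity.values()))


# One score per cell; a positive score means the set is more active than the cell average.
scores = {cell: gene_set_score(act, t_cell_set) for cell, act in gene_activity.items()}
```

In this toy input, `cell_1` scores above zero for the T-cell set and `cell_2` scores below zero, which is the kind of per-cell signal a GSS tool is meant to expose.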
Affiliation(s)
- Xi Wang
  - Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
  - Department of Automation, Xiamen University, Xiamen 361005, China
- Qiwei Lian
  - Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
  - Department of Automation, Xiamen University, Xiamen 361005, China
- Haoyu Dong
  - Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
- Shuo Xu
  - Department of Automation, Xiamen University, Xiamen 361005, China
- Yaru Su
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China
- Xiaohui Wu
  - Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
6
Zhao SH, Ji XY, Yuan GZ, Cheng T, Liang HY, Liu SQ, Yang FY, Tang Y, Shi S. A Bibliometric Analysis of the Spatial Transcriptomics Literature from 2006 to 2023. Cell Mol Neurobiol 2024; 44:50. [PMID: 38856921 PMCID: PMC11164738 DOI: 10.1007/s10571-024-01484-3] [Received: 11/08/2023] [Accepted: 05/28/2024] [Indexed: 06/11/2024]
Abstract
In recent years, spatial transcriptomics (ST) research has become a popular field of study and has shown great potential in medicine. However, there are few bibliometric analyses in this field. Thus, in this study, we aimed to find and analyze the frontiers and trends of this medical research field based on the available literature. A computerized search was applied to the WoSCC (Web of Science Core Collection) Database for literature published from 2006 to 2023. Complete records of all literature and cited references were extracted and screened. The bibliometric analysis and visualization were performed using CiteSpace, VOSviewer, Bibliometrix R Package software, and Scimago Graphica. A total of 1467 papers and reviews were included. The analysis revealed that the ST publication and citation results have shown a rapid upward trend over the last 3 years. Nature Communications and Nature were the most productive and most co-cited journals, respectively. In the comprehensive global collaborative network, the United States is the country with the most organizations and publications, followed closely by China and the United Kingdom. The author Joakim Lundeberg published the most cited paper, while Patrik L. Ståhl ranked first among co-cited authors. The hot topics in ST are tissue recognition, cancer, heterogeneity, immunotherapy, differentiation, and models. ST technologies have greatly contributed to in-depth research in medical fields such as oncology and neuroscience, opening up new possibilities for the diagnosis and treatment of diseases. Moreover, artificial intelligence and big data drive additional development in ST fields.
Affiliation(s)
- Shu-Han Zhao
  - Guang'an Men Hospital, China Academy of Chinese Medical Sciences, No. 5 Beixiange Street, Xicheng District, Beijing, 100053, People's Republic of China
  - Beijing University of Chinese Medicine, No. 11, Beisanhuan East Road, Chaoyang District, Beijing, 100029, People's Republic of China
- Xin-Yu Ji
  - Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, No. 16 Nanxiaojie, Dongzhimennei Ave, Beijing, 100700, People's Republic of China
- Guo-Zhen Yuan
  - Guang'an Men Hospital, China Academy of Chinese Medical Sciences, No. 5 Beixiange Street, Xicheng District, Beijing, 100053, People's Republic of China
- Tao Cheng
  - Guang'an Men Hospital, China Academy of Chinese Medical Sciences, No. 5 Beixiange Street, Xicheng District, Beijing, 100053, People's Republic of China
- Hai-Yi Liang
  - Beijing University of Chinese Medicine, No. 11, Beisanhuan East Road, Chaoyang District, Beijing, 100029, People's Republic of China
- Si-Qi Liu
  - Beijing University of Chinese Medicine, No. 11, Beisanhuan East Road, Chaoyang District, Beijing, 100029, People's Republic of China
- Fu-Yi Yang
  - Beijing University of Chinese Medicine, No. 11, Beisanhuan East Road, Chaoyang District, Beijing, 100029, People's Republic of China
- Yang Tang
  - School of Chinese Medicine, Beijing University of Chinese Medicine, No. 11, Beisanhuan East Road, Chaoyang District, Beijing, 100029, People's Republic of China
- Shuai Shi
  - Guang'an Men Hospital, China Academy of Chinese Medical Sciences, No. 5 Beixiange Street, Xicheng District, Beijing, 100053, People's Republic of China
7
Wei Z. Discrete latent embeddings illuminate cellular diversity in single-cell epigenomics. Nature Computational Science 2024; 4:316-317. [PMID: 38811822 DOI: 10.1038/s43588-024-00634-3] [Indexed: 05/31/2024]
Affiliation(s)
- Zhi Wei
  - Department of Computer Science, Ying Wu College of Computing, New Jersey Institute of Technology, Newark, NJ, USA
8
Cui X, Chen X, Li Z, Gao Z, Chen S, Jiang R. Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity. Nature Computational Science 2024; 4:346-359. [PMID: 38730185 DOI: 10.1038/s43588-024-00625-4] [Received: 09/25/2023] [Accepted: 04/05/2024] [Indexed: 05/12/2024]
Abstract
Single-cell epigenomic data have been growing continuously at an unprecedented pace, but their characteristics, such as high dimensionality and sparsity, pose substantial challenges to downstream analysis. Although deep learning models, especially variational autoencoders, have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption somewhat disagrees with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization compared with state-of-the-art methods. We demonstrate the advantages of CASTLE for effective incorporation of existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE's capacity for intuitively distilling cell-type-specific feature spectra that unveil cell heterogeneity and biological implications quantitatively.
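The "discrete latent embedding" in a vector-quantized autoencoder comes from snapping each continuous encoder output to its nearest entry in a learned codebook. The sketch below shows only that lookup step with a hypothetical 2-d codebook; it is a generic illustration of vector quantization, not CASTLE's implementation, and the `quantize` helper is an assumed name.

```python
import math

# Hypothetical codebook of three 2-d code vectors (a trained model would learn these).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]


def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z in Euclidean distance."""
    index = min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
    return index, codebook[index]


# A continuous encoder output is replaced by a discrete code index.
idx, code = quantize([0.9, 1.2], codebook)
```

The index is the discrete representation of the cell; downstream steps can count or compare which codes different cell types use.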
Affiliation(s)
- Xuejian Cui
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Xiaoyang Chen
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zhen Li
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zijing Gao
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Shengquan Chen
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Rui Jiang
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
9
Ran W, Yu Q. Data-driven clustering approach to identify novel clusters of high cognitive impairment risk among Chinese community-dwelling elderly people with normal cognition: A national cohort study. J Glob Health 2024; 14:04088. [PMID: 38638099 PMCID: PMC11026990 DOI: 10.7189/jogh.14.04088] [Indexed: 04/20/2024] Open
Abstract
Background: Cognitive impairment is a highly heterogeneous disorder that necessitates further investigation into the distinct characteristics of populations at varying risk levels of cognitive impairment. Using a large-scale registry cohort of elderly individuals, we applied a data-driven approach to identify novel clusters based on diverse sociodemographic features.
Methods: A prospective cohort of 6398 elderly people from the Chinese Longitudinal Healthy Longevity Survey, followed between 2008-14, was used to develop and validate the model. Participants who were aged ≥60 years, were community-dwelling, and had a score ≥18 on the Chinese version of the Mini-Mental State Examination (MMSE) were included. Sixty-nine sociodemographic features were included in the analysis. The total population was divided into two-thirds for the derivation cohort (n = 4265) and one-third for the validation cohort (n = 2133). In the derivation cohort, an unsupervised Gaussian mixture model was applied to categorise participants into distinct clusters. A classifier was developed based on the most important 10 factors and was applied to categorise participants into their corresponding clusters in a validation cohort. The difference in the three-year risk of cognitive impairment was compared across the clusters.
Results: We identified four clusters with distinct features in the derivation cohort. Cluster 1 was associated with the worst life independence, longest sleep duration, and the oldest age. Cluster 2 demonstrated the highest loneliness, characterised by non-marital status and living alone. Cluster 3 was characterised by the lowest sense of loneliness and the highest proportions in marital status and family co-residence. Cluster 4 demonstrated heightened engagement in exercise and leisure activity, along with independent decision-making, hygiene, and a diverse diet. In comparison to Cluster 4, Cluster 1 exhibited the highest three-year cognitive impairment risk (adjusted odds ratio (aOR) = 3.31; 95% confidence interval (CI) = 1.81-6.05), followed by Cluster 2 and Cluster 3 after adjustment for baseline MMSE, residence, sex, age, years of education, drinking, smoking, hypertension, diabetes, heart disease and stroke or cardiovascular diseases.
Conclusions: A data-driven approach can be instrumental in identifying individuals at high risk of cognitive impairment among cognitively normal elderly populations. Based on various sociodemographic features, these clusters can suggest individualised intervention plans.
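The figures reported in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the 95% interval is a Wald-type interval that is symmetric on the log-odds scale with a critical value of 1.96; the abstract does not say how the interval was computed, so the recovered standard error and z statistic are illustrative only.

```python
import math

# Values reported in the abstract.
a_or, ci_low, ci_high = 3.31, 1.81, 6.05
total_n = 6398

# Assumption: normal critical value for a two-sided 95% interval.
z_crit = 1.96

# On the log scale a Wald interval is centred on the point estimate,
# so the geometric midpoint of the bounds should reproduce the aOR.
log_mid = (math.log(ci_low) + math.log(ci_high)) / 2
recovered_or = math.exp(log_mid)

# Half-width of the log interval divided by z_crit gives the implied standard error.
std_err = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
z_stat = math.log(a_or) / std_err

# Two-thirds / one-third split of the cohort.
derivation_n = round(total_n * 2 / 3)
validation_n = total_n - derivation_n
```

Under these assumptions the geometric midpoint of the bounds agrees with the reported aOR to two decimals, and the split reproduces the reported cohort sizes of 4265 and 2133.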
Affiliation(s)
- Wang Ran
  - Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, China
- Qiutong Yu
  - Medical Education Department, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, China
10
Zeng Y, Luo M, Shangguan N, Shi P, Feng J, Xu J, Chen K, Lu Y, Yu W, Yang Y. Deciphering cell types by integrating scATAC-seq data with genome sequences. Nature Computational Science 2024; 4:285-298. [PMID: 38600256 DOI: 10.1038/s43588-024-00622-7] [Received: 11/01/2023] [Accepted: 03/18/2024] [Indexed: 04/12/2024]
Abstract
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights/biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.
Affiliation(s)
- Yuansong Zeng
  - School of Big Data and Software Engineering, Chongqing University, Chongqing, China
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Mai Luo
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Ningyuan Shangguan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Peiyu Shi
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Junxi Feng
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Jin Xu
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Ken Chen
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yutong Lu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Weijiang Yu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou, China
11
Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 DOI: 10.3390/bioengineering11030263] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024] Open
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
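The three retrieval tasks named in this abstract all reduce to ranking by distance in the shared embedding space. The sketch below illustrates that with cosine similarity over toy 2-d vectors; the vectors, the names `set_A`, `liver`, and so on, the choice of cosine similarity, and the `rank` helper are assumptions for illustration and do not come from the paper.

```python
import math

# Hypothetical co-embeddings: region sets and metadata labels share one 2-d space.
region_set_vecs = {"set_A": [0.9, 0.1], "set_B": [0.2, 0.8], "set_C": [0.7, 0.3]}
label_vecs = {"liver": [1.0, 0.0], "brain": [0.0, 1.0]}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def rank(query_vec, candidates):
    """Candidate names sorted from most to least similar to the query vector."""
    return sorted(candidates, key=lambda name: cosine(query_vec, candidates[name]), reverse=True)


# Task 1: region sets related to a query string (here, the label "liver").
sets_for_liver = rank(label_vecs["liver"], region_set_vecs)

# Task 2: suggested labels for a database region set.
labels_for_set_b = rank(region_set_vecs["set_B"], label_vecs)

# Task 3: database region sets similar to a query region set.
others = {name: vec for name, vec in region_set_vecs.items() if name != "set_A"}
neighbours_of_a = rank(region_set_vecs["set_A"], others)
```

Because labels and region sets live in the same space, a single ranking function covers all three tasks; only the query and the candidate pool change.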
Affiliation(s)
- Erfaneh Gharavi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Donald E Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
12
Goshisht MK. Machine Learning and Deep Learning in Synthetic Biology: Key Architectures, Applications, and Challenges. ACS OMEGA 2024; 9:9921-9945. [PMID: 38463314 PMCID: PMC10918679 DOI: 10.1021/acsomega.3c05913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 01/19/2024] [Accepted: 01/30/2024] [Indexed: 03/12/2024]
Abstract
Machine learning (ML), particularly deep learning (DL), has made rapid and substantial progress in synthetic biology in recent years. Biotechnological applications of biosystems, including pathways, enzymes, and whole cells, are being probed frequently with time. The intricacy and interconnectedness of biosystems make it challenging to design them with the desired properties. ML and DL have a synergy with synthetic biology. Synthetic biology can be employed to produce large data sets for training models (for instance, by utilizing DNA synthesis), and ML/DL models can be employed to inform design (for example, by generating new parts or advising unrivaled experiments to perform). This potential has recently been brought to light by research at the intersection of engineering biology and ML/DL through achievements like the design of novel biological components, best experimental design, automated analysis of microscopy data, protein structure prediction, and biomolecular implementations of ANNs (Artificial Neural Networks). I have divided this review into three sections. In the first section, I describe predictive potential and basics of ML along with myriad applications in synthetic biology, especially in engineering cells, activity of proteins, and metabolic pathways. In the second section, I describe fundamental DL architectures and their applications in synthetic biology. Finally, I describe different challenges causing hurdles in the progress of ML/DL and synthetic biology along with their solutions.
Affiliation(s)
- Manoj Kumar Goshisht
- Department of Chemistry, Natural and Applied Sciences, University of Wisconsin–Green Bay, Green Bay, Wisconsin 54311-7001, United States
13
Gong M, Yu Y, Wang Z, Zhang J, Wang X, Fu C, Zhang Y, Wang X. scAuto as a comprehensive framework for single-cell chromatin accessibility data analysis. Comput Biol Med 2024; 171:108230. [PMID: 38442554 DOI: 10.1016/j.compbiomed.2024.108230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 02/06/2024] [Accepted: 02/25/2024] [Indexed: 03/07/2024]
Abstract
Interpreting single-cell chromatin accessibility data is crucial for understanding intercellular heterogeneity regulation. Despite the progress in computational methods for analyzing this data, there is still a lack of a comprehensive analytical framework and a user-friendly online analysis tool. To fill this gap, we developed a pre-trained deep learning-based framework, single-cell auto-correlation transformers (scAuto), to overcome the challenge. Following DNABERT's methodology of pre-training and fine-tuning, scAuto learns a general understanding of DNA sequence's grammar by being pre-trained on unlabeled human genome via self-supervision; it is then transferred to the single-cell chromatin accessibility analysis task of scATAC-seq data for supervised fine-tuning. We extensively validated scAuto on the Buenrostro2018 dataset, demonstrating its superior performance on chromatin accessibility prediction, single-cell clustering, and data denoising. Based on scAuto, we further developed an interactive web server for single-cell chromatin accessibility data analysis. It integrates tutorial-style interfaces for those with limited programming skills. The platform is accessible at http://zhanglab.icaup.cn. To our knowledge, this work is expected to help analyze single-cell chromatin accessibility data and facilitate the development of precision medicine.
Affiliation(s)
- Meiqin Gong
- Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, 610041, China
- Yun Yu
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Zixuan Wang
- College of Electronics and Information Engineering, Sichuan University, Chengdu, 610065, China
- Junming Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Xiongyi Wang
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Cheng Fu
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Xiaodong Wang
- Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, 610041, China.
14
Tang S, Cui X, Wang R, Li S, Li S, Huang X, Chen S. scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data. Nat Commun 2024; 15:1629. [PMID: 38388573 PMCID: PMC10884038 DOI: 10.1038/s41467-024-46045-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 02/12/2024] [Indexed: 02/24/2024]
Abstract
Single-cell chromatin accessibility sequencing (scCAS) has emerged as a valuable tool for interrogating and elucidating epigenomic heterogeneity and gene regulation. However, scCAS data inherently suffers from limitations such as high sparsity and dimensionality, which pose significant challenges for downstream analyses. Although several methods are proposed to enhance scCAS data, there are still challenges and limitations that hinder the effectiveness of these methods. Here, we propose scCASE, a scCAS data enhancement method based on non-negative matrix factorization which incorporates an iteratively updating cell-to-cell similarity matrix. Through comprehensive experiments on multiple datasets, we demonstrate the advantages of scCASE over existing methods for scCAS data enhancement. The interpretable cell type-specific peaks identified by scCASE can provide valuable biological insights into cell subpopulations. Moreover, to leverage the large compendia of available omics data as a reference, we further expand scCASE to scCASER, which enables the incorporation of external reference data to improve enhancement performance.
Affiliation(s)
- Songming Tang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
- Xuejian Cui
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, 100084, Beijing, China
- Rongxiang Wang
- Department of Computer Science, University of Virginia, Charlottesville, VA, 22903, USA
- Sijie Li
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
- Siyu Li
- School of Statistics and Data Science, Nankai University, Tianjin, 300071, China
- Xin Huang
- Beijing Key Laboratory for Radiobiology, Department of Radiation Biology, Beijing Institute of Radiation Medicine, 100850, Beijing, China
- Shengquan Chen
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China.
15
Zhang K, Zemke NR, Armand EJ, Ren B. A fast, scalable and versatile tool for analysis of single-cell omics data. Nat Methods 2024; 21:217-227. [PMID: 38191932 PMCID: PMC10864184 DOI: 10.1038/s41592-023-02139-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 06/09/2023] [Accepted: 11/23/2023] [Indexed: 01/10/2024]
Abstract
Single-cell omics technologies have revolutionized the study of gene regulation in complex tissues. A major computational challenge in analyzing these datasets is to project the large-scale and high-dimensional data into low-dimensional space while retaining the relative relationships between cells. This low dimension embedding is necessary to decompose cellular heterogeneity and reconstruct cell-type-specific gene regulatory programs. Traditional dimensionality reduction techniques, however, face challenges in computational efficiency and in comprehensively addressing cellular diversity across varied molecular modalities. Here we introduce a nonlinear dimensionality reduction algorithm, embodied in the Python package SnapATAC2, which not only achieves a more precise capture of single-cell omics data heterogeneities but also ensures efficient runtime and memory usage, scaling linearly with the number of cells. Our algorithm demonstrates exceptional performance, scalability and versatility across diverse single-cell omics datasets, including single-cell assay for transposase-accessible chromatin using sequencing, single-cell RNA sequencing, single-cell Hi-C and single-cell multi-omics datasets, underscoring its utility in advancing single-cell analysis.
Affiliation(s)
- Kai Zhang
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Westlake Laboratory of Life Sciences and Biomedicine, School of Life Sciences, Westlake University, Hangzhou, China
- Nathan R Zemke
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Center for Epigenomics, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Ethan J Armand
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA, USA
- Bing Ren
- Department of Cellular and Molecular Medicine, University of California, San Diego School of Medicine, La Jolla, CA, USA.
- Center for Epigenomics, University of California, San Diego School of Medicine, La Jolla, CA, USA.
- Ludwig Institute for Cancer Research, La Jolla, CA, USA.
- Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA.
16
Mihai IS, Chafle S, Henriksson J. Representing and extracting knowledge from single-cell data. Biophys Rev 2024; 16:29-56. [PMID: 38495441 PMCID: PMC10937862 DOI: 10.1007/s12551-023-01091-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/22/2023] [Accepted: 06/28/2023] [Indexed: 03/19/2024]
Abstract
Single-cell analysis is currently one of the most high-resolution techniques to study biology. The large complex datasets that have been generated have spurred numerous developments in computational biology, in particular the use of advanced statistics and machine learning. This review attempts to explain the deeper theoretical concepts that underpin current state-of-the-art analysis methods. Single-cell analysis is covered from cell, through instruments, to current and upcoming models. The aim of this review is to spread concepts which are not yet in common use, especially from topology and generative processes, and how new statistical models can be developed to capture more of biology. This opens epistemological questions regarding our ontology and models, and some pointers will be given to how natural language processing (NLP) may help overcome our cognitive limitations for understanding single-cell data.
Affiliation(s)
- Ionut Sebastian Mihai
- The Laboratory for Molecular Infection Medicine Sweden (MIMS), Umeå, Sweden
- Umeå Centre for Microbial Research (UCMR), Department of Molecular Biology, Umeå University, Umeå, Sweden
- Industrial Doctoral School, Umeå University, Umeå, Sweden
- Sarang Chafle
- The Laboratory for Molecular Infection Medicine Sweden (MIMS), Umeå, Sweden
- Umeå Centre for Microbial Research (UCMR), Department of Molecular Biology, Umeå University, Umeå, Sweden
- Johan Henriksson
- The Laboratory for Molecular Infection Medicine Sweden (MIMS), Umeå, Sweden
- Umeå Centre for Microbial Research (UCMR), Department of Molecular Biology, Umeå University, Umeå, Sweden
17
Yu J, Leng J, Hou Z, Sun D, Wu LY. Incorporating network diffusion and peak location information for better single-cell ATAC-seq data analysis. Brief Bioinform 2024; 25:bbae093. [PMID: 38493346 PMCID: PMC10944575 DOI: 10.1093/bib/bbae093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 12/22/2023] [Accepted: 02/20/2024] [Indexed: 03/18/2024]
Abstract
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) data provided new insights into the understanding of epigenetic heterogeneity and transcriptional regulation. With the increasing abundance of dataset resources, there is an urgent need to extract more useful information through high-quality data analysis methods specifically designed for scATAC-seq. However, analyzing scATAC-seq data poses challenges due to its near binarization, high sparsity and ultra-high dimensionality properties. Here, we proposed a novel network diffusion-based computational method to comprehensively analyze scATAC-seq data, named Single-Cell ATAC-seq Analysis via Network Refinement with Peaks Location Information (SCARP). SCARP formulates the Network Refinement diffusion method under the graph theory framework to aggregate information from different network orders, effectively compensating for missing signals in the scATAC-seq data. By incorporating distance information between adjacent peaks on the genome, SCARP also contributes to depicting the co-accessibility of peaks. These two innovations empower SCARP to obtain lower-dimensional representations for both cells and peaks more effectively. We have demonstrated through sufficient experiments that SCARP facilitated superior analyses of scATAC-seq data. Specifically, SCARP exhibited outstanding cell clustering performance, enabling better elucidation of cell heterogeneity and the discovery of new biologically significant cell subpopulations. Additionally, SCARP was also instrumental in portraying co-accessibility relationships of accessible regions and providing new insight into transcriptional regulation. Consequently, SCARP identified genes that were involved in key Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways related to diseases and predicted reliable cis-regulatory interactions. To sum up, our studies suggested that SCARP is a promising tool to comprehensively analyze the scATAC-seq data.
Affiliation(s)
- Jiating Yu
- School of Mathematics and Statistics, Nanjing University of Information Science & Technology, Nanjing 210044, China
- IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Jiacheng Leng
- IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Zhejiang Lab, Hangzhou 311121, China
- Zhichao Hou
- IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Duanchen Sun
- School of Mathematics, Shandong University, Jinan 250100, China
- Ling-Yun Wu
- IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
18
Chen Y, Zheng R, Liu J, Li M. scMLC: an accurate and robust multiplex community detection method for single-cell multi-omics data. Brief Bioinform 2024; 25:bbae101. [PMID: 38493339 PMCID: PMC10944569 DOI: 10.1093/bib/bbae101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Revised: 01/03/2024] [Accepted: 02/15/2024] [Indexed: 03/18/2024]
Abstract
Clustering cells based on single-cell multi-modal sequencing technologies provides an unprecedented opportunity to create high-resolution cell atlases, reveal critical cellular states and study health and disease. However, effectively integrating different sequencing data for cell clustering remains a challenging task. Motivated by the successful application of Louvain in scRNA-seq data, we propose a single-cell multi-modal Louvain clustering framework, called scMLC, to tackle this problem. scMLC builds multiplex single- and cross-modal cell-to-cell networks to capture modal-specific and consistent information between modalities and then adopts a robust multiplex community detection method to obtain reliable cell clusters. In comparison with 15 state-of-the-art clustering methods on seven real datasets simultaneously measuring gene expression and chromatin accessibility, scMLC achieves better accuracy and stability in most datasets. Synthetic results also indicate that the cell-network-based integration strategy of multi-omics data is superior to other strategies in terms of generalization. Moreover, scMLC is flexible and can be extended to single-cell sequencing data with more than two modalities.
Affiliation(s)
- Yuxuan Chen
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jin Liu
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
19
Miao Z, Kim J. Uniform quantification of single-nucleus ATAC-seq data with Paired-Insertion Counting (PIC) and a model-based insertion rate estimator. Nat Methods 2024; 21:32-36. [PMID: 38049698 PMCID: PMC10776405 DOI: 10.1038/s41592-023-02103-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Accepted: 10/25/2023] [Indexed: 12/06/2023]
Abstract
Existing approaches to scoring single-nucleus assay for transposase-accessible chromatin with sequencing (snATAC-seq) feature matrices from sequencing reads are inconsistent, affecting downstream analyses and displaying artifacts. We show that, even with sparse single-cell data, quantitative counts are informative for estimating the regulatory state of a cell, which calls for a consistent treatment. We propose Paired-Insertion Counting as a uniform method for snATAC-seq feature characterization and provide a probability model for inferring latent insertion dynamics from snATAC-seq count matrices.
Affiliation(s)
- Zhen Miao
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
- Junhyong Kim
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA.
20
Martens LD, Fischer DS, Yépez VA, Theis FJ, Gagneur J. Modeling fragment counts improves single-cell ATAC-seq analysis. Nat Methods 2024; 21:28-31. [PMID: 38049697 PMCID: PMC10776385 DOI: 10.1038/s41592-023-02112-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 05/03/2022] [Accepted: 10/25/2023] [Indexed: 12/06/2023]
Abstract
Single-cell ATAC sequencing coverage in regulatory regions is typically binarized as an indicator of open chromatin. Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration. Fragment counts, but not read counts, should instead be modeled, which preserves quantitative regulatory information. These results have immediate implications for single-cell ATAC sequencing analysis.
Affiliation(s)
- Laura D Martens
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
- Helmholtz Association, Munich School for Data Science (MUDS), Munich, Germany
- David S Fischer
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Vicente A Yépez
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Fabian J Theis
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany.
- Helmholtz Association, Munich School for Data Science (MUDS), Munich, Germany.
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.
- Julien Gagneur
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany.
- Helmholtz Association, Munich School for Data Science (MUDS), Munich, Germany.
- Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany.
21
Li K, Chen X, Song S, Hou L, Chen S, Jiang R. Cofea: correlation-based feature selection for single-cell chromatin accessibility data. Brief Bioinform 2023; 25:bbad458. [PMID: 38113078 PMCID: PMC10782922 DOI: 10.1093/bib/bbad458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 11/19/2023] [Accepted: 11/20/2023] [Indexed: 12/21/2023]
Abstract
Single-cell chromatin accessibility sequencing (scCAS) technologies have enabled characterizing the epigenomic heterogeneity of individual cells. However, the identification of features of scCAS data that are relevant to underlying biological processes remains a significant gap. Here, we introduce a novel method, Cofea, to fill this gap. Through comprehensive experiments on 5 simulated and 54 real datasets, Cofea demonstrates its superiority in capturing cellular heterogeneity and facilitating downstream analysis. Applying this method to the identification of cell type-specific peaks and candidate enhancers, as well as pathway enrichment analysis and partitioned heritability analysis, we illustrate the potential of Cofea to uncover functional biological processes.
Affiliation(s)
- Keyi Li
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Xiaoyang Chen
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Shuang Song
- Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
- Lin Hou
- Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
- Shengquan Chen
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
22
Akhtyamov P, Shaheen L, Raevskiy M, Stupnikov A, Medvedeva YA. scATAC-seq preprocessing and imputation evaluation system for visualization, clustering and digital footprinting. Brief Bioinform 2023; 25:bbad447. [PMID: 38084919 PMCID: PMC10714317 DOI: 10.1093/bib/bbad447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 10/29/2023] [Accepted: 11/14/2023] [Indexed: 12/18/2023]
Abstract
Single-cell ATAC-seq (scATAC-seq) is a recently developed approach that provides a means to investigate open chromatin at the single-cell level and to assess epigenetic regulation and transcription factor binding landscapes. The sparsity of the scATAC-seq data calls for imputation. Similarly, preprocessing (filtering) may be required to reduce computational load due to the large number of open regions. However, optimal strategies for both imputation and preprocessing have not yet been evaluated together. We present SAPIEnS (scATAC-seq Preprocessing and Imputation Evaluation System), a benchmark for scATAC-seq imputation frameworks, a combination of state-of-the-art imputation methods with commonly used preprocessing techniques. We assess different types of scATAC-seq analysis, i.e. clustering, visualization and digital genomic footprinting, and attain optimal preprocessing-imputation strategies. We discuss the benefits of the imputation framework depending on the task and the number of dataset features (peaks). We conclude that preprocessing with the Boruta method is beneficial for the majority of tasks, while imputation is helpful mostly for small datasets. We also implement a SAPIEnS database with pre-computed transcription factor footprints based on imputed data with their activity scores in a specific cell type. SAPIEnS is published at: https://github.com/lab-medvedeva/SAPIEnS. SAPIEnS database is available at: https://sapiensdb.com.
Affiliation(s)
- Pavel Akhtyamov
- Department of Biomedical Physics, Moscow Institute of Physics and Technology (National Research University), 9 Institutskiy per., 141701, Moscow Region, Russian Federation
- The National Medical Research Center for Endocrinology, Dm. Ulyanova, 11, 117036, Moscow, Russian Federation
- Layal Shaheen
- Department of Biomedical Physics, Moscow Institute of Physics and Technology (National Research University), 9 Institutskiy per., 141701, Moscow Region, Russian Federation
- The National Medical Research Center for Endocrinology, Dm. Ulyanova, 11, 117036, Moscow, Russian Federation
- Mikhail Raevskiy
- Department, École Polytechnique Fédérale de Lausanne, Rte Cantonale, 1015, Lausanne, Vaud, Switzerland
- Alexey Stupnikov
- Department of Biomedical Physics, Moscow Institute of Physics and Technology (National Research University), 9 Institutskiy per., 141701, Moscow Region, Russian Federation
- The National Medical Research Center for Endocrinology, Dm. Ulyanova, 11, 117036, Moscow, Russian Federation
- Institute of Bioengineering, Research Center of Biotechnology, Russian Academy of Science, Leninsky prospect, 33, build. 2, 119071, Moscow, Russian Federation
- Yulia A Medvedeva
- Department of Biomedical Physics, Moscow Institute of Physics and Technology (National Research University), 9 Institutskiy per., 141701, Moscow Region, Russian Federation
- The National Medical Research Center for Endocrinology, Dm. Ulyanova, 11, 117036, Moscow, Russian Federation
- Institute of Bioengineering, Research Center of Biotechnology, Russian Academy of Science, Leninsky prospect, 33, build. 2, 119071, Moscow, Russian Federation
23
Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. BIOLOGY 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise due to various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Due to its capability of modeling complex patterns and relationships in large and high-dimensional datasets, many researchers have adopted deep learning models to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. In addition, this review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Affiliation(s)
- Lei Huang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
| | - Meng Song
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
| | - Hui Shen
- Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
| | - Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
| | - Hong-Wen Deng
- Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
| |
|
24
|
Zhang K, Zemke NR, Armand EJ, Ren B. SnapATAC2: a fast, scalable and versatile tool for analysis of single-cell omics data. bioRxiv 2023:2023.09.11.557221. [PMID: 37745443 PMCID: PMC10515871 DOI: 10.1101/2023.09.11.557221] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 09/26/2023]
Abstract
Single-cell omics technologies have ushered in a new era for the study of dynamic gene regulation in complex tissues during development and disease pathogenesis. A major computational challenge in analyzing these datasets is to project the large-scale and high-dimensional data into low-dimensional space while retaining the relative relationships between cells in order to decompose the cellular heterogeneity and reconstruct cell-type-specific gene regulatory programs. Conventional dimensionality reduction methods suffer from computational inefficiency, difficulty capturing the full spectrum of cellular heterogeneity, or inability to apply across diverse molecular modalities. Here, we report a fast and nonlinear dimensionality reduction algorithm that not only more accurately captures the heterogeneities of single-cell omics data, but also features runtime and memory usage that are computationally efficient and scale linearly with the number of cells. We implement this algorithm in a Python package named SnapATAC2, and demonstrate its superior performance, remarkable scalability and general adaptability using an array of single-cell omics data types, including single-cell ATAC-seq, single-cell RNA-seq, single-cell Hi-C, and single-cell multiomics datasets.
|
25
|
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023] Open
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations. Single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, thus making analysis with traditional computational approaches difficult or infeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing single-cell research.
Affiliation(s)
- Nafiseh Erfanian
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
| | - A Ali Heydari
- Department of Applied Mathematics, University of California, Merced, CA, USA; Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Adib Miraki Feriz
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
| | - Pablo Iañez
- Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
| | - Afshin Derakhshani
- Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
| | - Mohsen Farahpour
- Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
| | - Seyyed Mohammad Razavi
- Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
| | - Saeed Nasseri
- Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
| | - Hossein Safarpour
- Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran.
| | - Amirhossein Sahebkar
- Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran.
| |
|
26
|
Halawani R, Buchert M, Chen YPP. Deep learning exploration of single-cell and spatially resolved cancer transcriptomics to unravel tumour heterogeneity. Comput Biol Med 2023; 164:107274. [PMID: 37506451 DOI: 10.1016/j.compbiomed.2023.107274] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/02/2023] [Revised: 07/03/2023] [Accepted: 07/16/2023] [Indexed: 07/30/2023]
Abstract
Tumour heterogeneity is one of the critical confounding aspects in decoding tumour growth. Malignant cells display variations in their gene transcription profiles and mutation spectra even when originating from a single progenitor cell. Single-cell and spatial transcriptomics sequencing have recently emerged as key technologies for unravelling tumour heterogeneity. Single-cell sequencing promotes individual cell-type identification through transcriptome-wide gene expression measurements of each cell. Spatial transcriptomics facilitates identification of cell-cell interactions and the structural organization of heterogeneous cells within a tumour tissue through associating spatial RNA abundance of cells at distinct spots in the tissue section. However, extracting features and analyzing single-cell and spatial transcriptomics data poses challenges. Single-cell transcriptome data is extremely noisy and its sparse nature and dropouts can lead to misinterpretation of gene expression and the misclassification of cell types. Deep learning predictive power can overcome data challenges, provide high-resolution analysis and enhance precision oncology applications that involve early cancer prognosis, diagnosis, patient survival estimation and anti-cancer therapy planning. In this paper, we provide a background to and review of the recent progress of deep learning frameworks to investigate tumour heterogeneity using both single-cell and spatial transcriptomics data types.
Affiliation(s)
- Raid Halawani
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
| | - Michael Buchert
- School of Cancer Medicine, La Trobe University, Melbourne, Victoria, Australia; Olivia Newton-John Cancer Research Institute, Melbourne, Victoria, Australia
| | - Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia.
| |
|
27
|
Raimundo F, Prompsy P, Vert JP, Vallot C. A benchmark of computational pipelines for single-cell histone modification data. Genome Biol 2023; 24:143. [PMID: 37340307 PMCID: PMC10280832 DOI: 10.1186/s13059-023-02981-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/21/2022] [Accepted: 06/07/2023] [Indexed: 06/22/2023] Open
Abstract
BACKGROUND Single-cell histone post-translational modification (scHPTM) assays such as scCUT&Tag or scChIP-seq allow single-cell mapping of diverse epigenomic landscapes within complex tissues and are likely to unlock our understanding of various mechanisms involved in development or diseases. Running scHPTM experiments and analyzing the data produced remain challenging since few consensus guidelines currently exist regarding good practices for experimental design and data analysis pipelines. RESULTS We perform a computational benchmark to assess the impact of experimental parameters and data analysis pipelines on the ability of the cell representation to recapitulate known biological similarities. We run more than ten thousand experiments to systematically study the impact of coverage and number of cells, of the count matrix construction method, of feature selection and normalization, and of the dimension reduction algorithm used. This allows us to identify key experimental parameters and computational choices to obtain a good representation of single-cell HPTM data. We show in particular that the count matrix construction step has a strong influence on the quality of the representation and that using fixed-size bin counts outperforms annotation-based binning. Dimension reduction methods based on latent semantic indexing outperform others, and feature selection is detrimental, while keeping only high-quality cells has little influence on the final representation as long as enough cells are analyzed. CONCLUSIONS This benchmark provides a comprehensive study on how experimental parameters and computational choices affect the representation of single-cell HPTM data. We propose a series of recommendations regarding matrix construction, feature and cell selection, and dimensionality reduction algorithms.
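As an illustration of the "fixed-size bin counts" matrix construction mentioned in this abstract, the following is a minimal sketch, not the benchmark's actual pipeline. The function names `bin_counts` and `to_matrix`, the 5,000 bp bin size, the toy fragments, and the choice to assign each fragment to the bin containing its midpoint are all assumptions made for the example.

```python
from collections import Counter


def bin_counts(fragments, bin_size=5000):
    """Count fragments per (cell, chromosome, bin index) with fixed-size bins.

    fragments: iterable of (cell_barcode, chrom, start, end) tuples.
    Each fragment is assigned to the bin containing its midpoint
    (an assumption of this sketch; other conventions exist).
    """
    counts = Counter()
    for cell, chrom, start, end in fragments:
        midpoint = (start + end) // 2
        counts[(cell, chrom, midpoint // bin_size)] += 1
    return counts


def to_matrix(counts):
    """Turn the sparse counter into a dense cells x bins matrix (list of lists)."""
    cells = sorted({cell for cell, _, _ in counts})
    bins = sorted({(chrom, idx) for _, chrom, idx in counts})
    matrix = [
        [counts.get((cell, chrom, idx), 0) for chrom, idx in bins]
        for cell in cells
    ]
    return cells, bins, matrix


# Hypothetical toy fragments: (cell barcode, chromosome, start, end)
fragments = [
    ("cellA", "chr1", 100, 300),
    ("cellA", "chr1", 4900, 5200),
    ("cellB", "chr1", 200, 400),
    ("cellB", "chr2", 10000, 10200),
]
cells, bins, matrix = to_matrix(bin_counts(fragments, bin_size=5000))
```

In a real analysis the resulting matrix would be stored in a sparse format and passed to a dimension reduction step such as latent semantic indexing; that step is omitted here.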
Affiliation(s)
- Félix Raimundo
- Google Research, Brain team, 75009, Paris, France
- Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France
| | - Pacôme Prompsy
- Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France
- CNRS UMR3244, Institut Curie, PSL Research University, 75005, Paris, France
| | - Jean-Philippe Vert
- Google Research, Brain team, 75009, Paris, France.
- Owkin, Inc, NY, New York, USA.
| | - Céline Vallot
- Translational Research Department, Institut Curie, PSL Research University, 75005, Paris, France.
- CNRS UMR3244, Institut Curie, PSL Research University, 75005, Paris, France.
| |
|
28
|
Yu L, Liu C, Yang JYH, Yang P. Ensemble deep learning of embeddings for clustering multimodal single-cell omics data. Bioinformatics 2023; 39:btad382. [PMID: 37314966 PMCID: PMC10287920 DOI: 10.1093/bioinformatics/btad382] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/13/2023] [Revised: 04/16/2023] [Accepted: 06/12/2023] [Indexed: 06/16/2023] Open
Abstract
MOTIVATION Recent advances in multimodal single-cell omics technologies enable multiple modalities of molecular attributes, such as gene expression, chromatin accessibility, and protein abundance, to be profiled simultaneously at a global level in individual cells. While the increasing availability of multiple data modalities is expected to provide a more accurate clustering and characterization of cells, the development of computational methods that are capable of extracting information embedded across data modalities is still in its infancy. RESULTS We propose SnapCCESS for clustering cells by integrating data modalities in multimodal single-cell omics data using an unsupervised ensemble deep learning framework. By creating snapshots of embeddings of multimodality using variational autoencoders, SnapCCESS can be coupled with various clustering algorithms for generating consensus clustering of cells. We applied SnapCCESS with several clustering algorithms to various datasets generated from popular multimodal single-cell omics technologies. Our results demonstrate that SnapCCESS is effective and more efficient than conventional ensemble deep learning-based clustering methods and outperforms other state-of-the-art multimodal embedding generation methods in integrating data modalities for clustering cells. The improved clustering of cells from SnapCCESS will pave the way for more accurate characterization of cell identity and types, an essential step for various downstream analyses of multimodal single-cell omics data. AVAILABILITY AND IMPLEMENTATION SnapCCESS is implemented as a Python package and is freely available from https://github.com/PYangLab/SnapCCESS under the open-source license of GPL-3. The data used in this study are publicly available (see section 'Data availability').
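The idea of deriving a consensus clustering from several clustering runs can be sketched with a co-association matrix. This is a generic ensemble technique chosen for illustration and is not claimed to be the procedure SnapCCESS uses. The function names, the 0.5 threshold, and the toy label vectors are assumptions.

```python
def co_association(labelings):
    """Fraction of clustering runs in which each pair of cells shares a cluster."""
    n = len(labelings[0])
    together = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    runs = len(labelings)
    return [[count / runs for count in row] for row in together]


def consensus_clusters(matrix, threshold=0.5):
    """Group cells linked by co-association above the threshold.

    Clusters are the connected components of the thresholded graph,
    numbered in order of their first cell.
    """
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical cluster labels for four cells from three clustering runs
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
matrix = co_association(runs)
consensus = consensus_clusters(matrix, threshold=0.5)
```

Because the matrix only records whether two cells were grouped together, the arbitrary label numbering of each individual run does not affect the consensus.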
Affiliation(s)
- Lijia Yu
- Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
| | - Chunlei Liu
- Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
| | - Jean Yee Hwa Yang
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Laboratory of Data Discovery for Health Limited (D4H), Hong Kong Science Park, Hong Kong SAR, China
| | - Pengyi Yang
- Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, NSW 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Laboratory of Data Discovery for Health Limited (D4H), Hong Kong Science Park, Hong Kong SAR, China
| |
|
29
|
Taguchi YH, Turki T. Tensor decomposition discriminates tissues using scATAC-seq. Biochim Biophys Acta Gen Subj 2023; 1867:130360. [PMID: 37003566 DOI: 10.1016/j.bbagen.2023.130360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Revised: 02/14/2023] [Accepted: 02/19/2023] [Indexed: 04/03/2023]
Abstract
ATAC-seq is a powerful tool for measuring the landscape structure of a chromosome. scATAC-seq is a more recent version of ATAC-seq performed in single cells. The main problem with scATAC-seq is data sparsity: most genomic sites appear inaccessible. Here, tensor decomposition (TD) was used to fill in missing values. In this study, TD was applied to massive scATAC-seq datasets binned into intervals of approximately 200 bp, where the number of intervals can reach 13,627,618. Currently, no other methods can deal with such large sparse matrices. The proposed method could not only provide UMAP embedding that coincides with tissue specificity, but also select genes associated with various biological enrichment terms and transcription factor targeting. This suggests that TD is a useful tool to process a large sparse matrix generated from scATAC-seq.
Affiliation(s)
- Y-H Taguchi
- Department of Physics, Chuo University, 1-13-27, Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan.
| | - Turki Turki
- Department of Computer Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia.
| |
|
30
|
Wang D, Hu X, Ye H, Wang Y, Yang Q, Liang X, Wang Z, Zhou Y, Wen M, Yuan X, Zheng X, Ye W, Guo B, Yusuyin M, Russinova E, Zhou Y, Wang K. Cell-specific clock-controlled gene expression program regulates rhythmic fiber cell growth in cotton. Genome Biol 2023; 24:49. [PMID: 36918913 PMCID: PMC10012527 DOI: 10.1186/s13059-023-02886-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 06/29/2022] [Accepted: 02/26/2023] [Indexed: 03/16/2023] Open
Abstract
BACKGROUND The epidermis of the cotton ovule produces fibers, the most important natural cellulose source for the global textile industry. However, the molecular mechanism of fiber cell growth is still poorly understood. RESULTS Here, we develop an optimized protoplasting method, and integrate single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) to systematically characterize the cells of the outer integument of ovules from wild type and fuzzless/lintless (fl) cotton (Gossypium hirsutum). By jointly analyzing the scRNA-seq data from wild type and fl, we identify five cell populations including the fiber cell type and construct the development trajectory for fiber lineage cells. Interestingly, by time-course diurnal transcriptomic analysis, we demonstrate that the primary growth of fiber cells is a highly regulated circadian rhythmic process. Moreover, we identify a small peptide, GhRALF1, that controls fiber growth in a circadian rhythmic manner, possibly through oscillating auxin signaling and proton pump activity in the plasma membrane. By combining these data with scATAC-seq, we further identify two cardinal cis-regulatory elements (CREs; a TCP motif and a TCP-like motif) which are bound by the trans factors GhTCP14s to modulate the circadian rhythmic metabolism of mitochondria and protein translation through regulating approximately one third of genes that are highly expressed in fiber cells. CONCLUSIONS We uncover a fiber-specific circadian clock-controlled gene expression program in regulating fiber growth. This study reveals a new route to improve fiber traits by engineering the circadian clock of fiber cells.
Affiliation(s)
- Dehe Wang
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China
| | - Xiao Hu
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Hanzhe Ye
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Yue Wang
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Qian Yang
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Xiaodong Liang
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Zilin Wang
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Yifan Zhou
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
| | - Miaomiao Wen
- Institute for Advanced Studies, Wuhan University, Wuhan, China; TaiKang Center for Life and Medical Sciences, RNA Institute, Renmin Hospital, Wuhan University, Wuhan, China
| | - Xueyan Yuan
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China
| | - Xiaomin Zheng
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China
| | - Wen Ye
- Medical Research Institute, Frontier Science Center for Immunology and Metabolism, School of Medicine, Wuhan University, Wuhan, China
| | - Boyu Guo
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium; Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - Mayila Yusuyin
- Research Institute of Economic Crops, Xinjiang Academy of Agricultural Sciences, Urumqi, China
| | - Eugenia Russinova
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium; Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - Yu Zhou
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Institute for Advanced Studies, Wuhan University, Wuhan, China; TaiKang Center for Life and Medical Sciences, RNA Institute, Renmin Hospital, Wuhan University, Wuhan, China; Medical Research Institute, Frontier Science Center for Immunology and Metabolism, School of Medicine, Wuhan University, Wuhan, China
| | - Kun Wang
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China; Institute for Advanced Studies, Wuhan University, Wuhan, China
| |
|
31
|
Wang Z, Zhang Y, Yu Y, Zhang J, Liu Y, Zou Q. A Unified Deep Learning Framework for Single-Cell ATAC-Seq Analysis Based on ProdDep Transformer Encoder. Int J Mol Sci 2023; 24:4784. [PMID: 36902216 PMCID: PMC10003007 DOI: 10.3390/ijms24054784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 01/02/2023] [Accepted: 02/22/2023] [Indexed: 03/06/2023] Open
Abstract
Recent advances in the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) have provided cell-specific chromatin accessibility landscapes of cis-regulatory elements, providing deeper insights into cellular states and dynamics. However, few research efforts have been dedicated to modeling the relationship between regulatory grammars and single-cell chromatin accessibility and incorporating different analysis scenarios of scATAC-seq data into a general framework. To this end, we propose a unified deep learning framework based on the ProdDep Transformer Encoder, dubbed PROTRAIT, for scATAC-seq data analysis. Specifically motivated by the deep language model, PROTRAIT leverages the ProdDep Transformer Encoder to capture the syntax of transcription factor (TF)-DNA binding motifs from scATAC-seq peaks for predicting single-cell chromatin accessibility and learning single-cell embeddings. Based on the cell embeddings, PROTRAIT annotates cell types using the Louvain algorithm. Furthermore, PROTRAIT identifies likely noise in raw scATAC-seq data and denoises these values based on predicted chromatin accessibility. In addition, PROTRAIT employs differential accessibility analysis to infer TF activity at single-cell and single-nucleotide resolution. Extensive experiments based on the Buenrostro2018 dataset validate the effectiveness of PROTRAIT for chromatin accessibility prediction, cell type annotation, and scATAC-seq data denoising, therein outperforming current approaches in terms of different evaluation metrics. Besides, we confirm the consistency between the inferred TF activity and the literature. We also demonstrate the scalability of PROTRAIT to analyze datasets containing over one million cells.
Affiliation(s)
- Zixuan Wang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yun Yu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Junming Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
32
|
Choi Y, Li R, Quon G. siVAE: interpretable deep generative models for single-cell transcriptomes. Genome Biol 2023; 24:29. [PMID: 36803416 PMCID: PMC9940350 DOI: 10.1186/s13059-023-02850-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/08/2022] [Accepted: 01/06/2023] [Indexed: 02/22/2023] Open
Abstract
Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.
Affiliation(s)
- Yongin Choi
- Graduate Group in Biomedical Engineering, University of California, Davis, Davis, CA, USA
- Genome Center, University of California, Davis, Davis, CA, USA
| | - Ruoxin Li
- Genome Center, University of California, Davis, Davis, CA, USA
- Graduate Group in Biostatistics, University of California, Davis, Davis, CA, USA
| | - Gerald Quon
- Graduate Group in Biomedical Engineering, University of California, Davis, Davis, CA, USA.
- Genome Center, University of California, Davis, Davis, CA, USA.
- Department of Molecular and Cellular Biology, University of California, Davis, Davis, CA, USA.
| |
|
33
|
Mishra S, Pandey N, Chawla S, Sharma M, Chandra O, Jha IP, SenGupta D, Natarajan KN, Kumar V. Matching queried single-cell open-chromatin profiles to large pools of single-cell transcriptomes and epigenomes for reference supported analysis. Genome Res 2023; 33:218-231. [PMID: 36653120 PMCID: PMC10069468 DOI: 10.1101/gr.277015.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Accepted: 01/09/2023] [Indexed: 01/19/2023]
Abstract
The true benefits of large single-cell transcriptome and epigenome data sets can be realized only with the development of new approaches and search tools for annotating individual cells. Matching a single-cell epigenome profile to a large pool of reference cells remains a major challenge. Here, we present scEpiSearch, which enables searching, comparison, and independent classification of single-cell open-chromatin profiles against a large reference of single-cell expression and open-chromatin data sets. Across performance benchmarks, scEpiSearch outperformed multiple methods in accuracy of search and low-dimensional coembedding of single-cell profiles, irrespective of platforms and species. Here we also demonstrate the unconventional utilities of scEpiSearch by applying it to single-cell epigenome profiles of K562 cells and samples from patients with acute leukaemia to reveal different aspects of their heterogeneity, multipotent behavior, and dedifferentiated states. Applying scEpiSearch to our single-cell open-chromatin profiles from embryonic stem cells (ESCs), we identified ESC subpopulations with more activity and poising for endoplasmic reticulum stress and unfolded protein response. Thus, scEpiSearch solves the nontrivial problem of amalgamating information from a large pool of single cells to identify and study the regulatory states of cells using their single-cell epigenomes.
Affiliation(s)
- Shreya Mishra
- Department for Computational Biology, IIIT Delhi 110020, India
| | - Neetesh Pandey
- Department for Computational Biology, IIIT Delhi 110020, India
| | - Smriti Chawla
- Department for Computational Biology, IIIT Delhi 110020, India
| | - Madhu Sharma
- Department for Computational Biology, IIIT Delhi 110020, India
| | - Omkar Chandra
- Department for Computational Biology, IIIT Delhi 110020, India
| | - Debarka SenGupta
- Department for Computational Biology, IIIT Delhi 110020, India; Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane 4001, Australia
| | - Kedar Nath Natarajan
- DTU Bioengineering, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
| | - Vibhor Kumar
- Department for Computational Biology, IIIT Delhi 110020, India
| |
|
34
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. Cell Rep Methods 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Affiliation(s)
- Zhongxiao Li
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Elva Gao
  - The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Juexiao Zhou
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Wenkai Han
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xiaopeng Xu
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
|
35
|
Zhang Z, Chen S, Lin Z. RefTM: reference-guided topic modeling of single-cell chromatin accessibility data. Brief Bioinform 2023; 24:6895319. [PMID: 36513377 DOI: 10.1093/bib/bbac540] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/22/2022] [Revised: 10/27/2022] [Accepted: 11/09/2022] [Indexed: 12/15/2022] Open
Abstract
Single-cell analysis is a valuable approach for dissecting cellular heterogeneity, and single-cell chromatin accessibility sequencing (scCAS) can profile the epigenetic landscapes of thousands of individual cells. scCAS data are challenging to analyze because of their high dimensionality and a higher degree of sparsity compared with scRNA-seq data. Topic modeling in single-cell data analysis can lead to robust identification of cell types and can provide insight into regulatory mechanisms. A reference-guided approach may facilitate the analysis of scCAS data by utilizing the information in existing datasets. We present RefTM (Reference-guided Topic Modeling of single-cell chromatin accessibility data), which not only utilizes the information in existing bulk chromatin accessibility and annotated scCAS data, but also takes advantage of topic models for single-cell data analysis. RefTM simultaneously models: (1) the shared biological variation among reference data and the target scCAS data; (2) the unique biological variation in scCAS data; (3) other variations from known covariates in scCAS data.
Affiliation(s)
- Zheng Zhang
  - Department of Statistics, The Chinese University of Hong Kong
- Shengquan Chen
  - School of Mathematical Sciences and LPMC, Nankai University
- Zhixiang Lin
  - Department of Statistics, The Chinese University of Hong Kong
|
36
|
Preissl S, Gaulton KJ, Ren B. Characterizing cis-regulatory elements using single-cell epigenomics. Nat Rev Genet 2023; 24:21-43. [PMID: 35840754 PMCID: PMC9771884 DOI: 10.1038/s41576-022-00509-1] [Citation(s) in RCA: 69] [Impact Index Per Article: 69.0] [Accepted: 05/24/2022] [Indexed: 12/24/2022]
Abstract
Cell type-specific gene expression patterns and dynamics during development or in disease are controlled by cis-regulatory elements (CREs), such as promoters and enhancers. Distinct classes of CREs can be characterized by their epigenomic features, including DNA methylation, chromatin accessibility, combinations of histone modifications and conformation of local chromatin. Tremendous progress has been made in cataloguing CREs in the human genome using bulk transcriptomic and epigenomic methods. However, single-cell epigenomic and multi-omic technologies have the potential to provide deeper insight into cell type-specific gene regulatory programmes as well as into how they change during development, in response to environmental cues and through disease pathogenesis. Here, we highlight recent advances in single-cell epigenomic methods and analytical tools and discuss their readiness for human tissue profiling.
Affiliation(s)
- Sebastian Preissl
  - Center for Epigenomics, University of California San Diego, La Jolla, CA, USA.
  - Institute of Experimental and Clinical Pharmacology and Toxicology, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
- Kyle J Gaulton
  - Department of Paediatrics, Paediatric Diabetes Research Center, University of California San Diego, La Jolla, CA, USA.
- Bing Ren
  - Center for Epigenomics, University of California San Diego, La Jolla, CA, USA.
  - Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA.
  - Ludwig Institute for Cancer Research, La Jolla, CA, USA.
|
37
|
Zandavi SM, Liu D, Chung V, Anaissi A, Vafaee F. Fotomics: Fourier transform-based omics imagification for deep learning-based cell-identity mapping using single-cell omics profiles. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10357-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
|
38
|
Duan H, Li F, Shang J, Liu J, Li Y, Liu X. scVAEBGM: Clustering Analysis of Single-Cell ATAC-seq Data Using a Deep Generative Model. Interdiscip Sci 2022; 14:917-928. [PMID: 35939233 DOI: 10.1007/s12539-022-00536-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 06/15/2023]
Abstract
A surge in research has occurred because of current developments in single-cell technologies. Above all, the single-cell Assay for Transposase-Accessible Chromatin with high-throughput sequencing (scATAC-seq) is a popular approach for analyzing chromatin accessibility differences at the single-cell level, either within or between groups. As a result, it is critical to examine cell heterogeneity at a previously unseen level and to identify both recognized and unknown cell types. However, with the ever-increasing number of cells produced by technological development and the characteristics of the data, such as high noise, sparsity and dimensionality, challenges in distinguishing cell types have emerged. We propose scVAEBGM, which integrates a Variational Autoencoder (VAE) with a Bayesian Gaussian-mixture model (BGM) to process and analyze scATAC-seq data. This method takes advantage of a Bayesian Gaussian mixture model to estimate the number of cell types without fixing the cluster number beforehand. In other words, the number of clusters is inferred from the data, thus avoiding biases introduced by subjective assessments when it is set manually. Additionally, the method is more robust to noise and can better represent single-cell data in lower dimensions. We also propose a further clustering strategy: experiments indicate that re-clustering on top of the completed clustering can further improve clustering accuracy. We test on six public datasets, and scVAEBGM outperforms various dimension reduction baselines. In downstream applications, scVAEBGM can reveal biological cell types.
Affiliation(s)
- Hongyu Duan
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Junliang Shang
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Jinxing Liu
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Yan Li
  - Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China
- Xikui Liu
  - Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China
|
39
|
Liu Y, Liang S, Wang B, Zhao J, Zi X, Yan S, Dou T, Jia J, Wang K, Ge C. Advances in Single-Cell Sequencing Technology and Its Application in Poultry Science. Genes (Basel) 2022; 13:genes13122211. [PMID: 36553479 PMCID: PMC9778011 DOI: 10.3390/genes13122211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 11/20/2022] [Accepted: 11/23/2022] [Indexed: 11/29/2022] Open
Abstract
Single-cell sequencing (SCS) uses a single cell as the research material and involves three dimensions: genes, phenotypes and cell biological mechanisms. This type of research can locate target cells, analyze the dynamic changes in the target cells and the relationships between the cells, and pinpoint the molecular mechanism of cell formation. Currently, a common problem faced by animal husbandry scientists is how to apply existing science and technology to promote the production of high-quality livestock and poultry products and to breed livestock for disease resistance; this is also a bottleneck for the sustainable development of animal husbandry. In recent years, although SCS technology has been successfully applied in the fields of medicine and bioscience, its application in poultry science has been rarely reported. With the sustainable development of science and technology and the poultry industry, SCS technology has great potential in the application of poultry science (or animal husbandry). Therefore, it is necessary to review the innovation of SCS technology and its application in poultry science. This article summarizes the current main technical methods of SCS and its application in poultry, which can provide potential references for its future applications in precision breeding, disease prevention and control, immunity, and cell identification.
Affiliation(s)
- Yong Liu
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Shuangmin Liang
  - College of Food Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Bo Wang
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Jinbo Zhao
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Xiannian Zi
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Shixiong Yan
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Tengfei Dou
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Junjing Jia
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Kun Wang
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Changrong Ge
  - College of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
|
40
|
Zeng P, Ma Y, Lin Z. scAWMV: an adaptively weighted multi-view learning framework for the integrative analysis of parallel scRNA-seq and scATAC-seq data. Bioinformatics 2022; 39:6831091. [PMID: 36383176 PMCID: PMC9805575 DOI: 10.1093/bioinformatics/btac739] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/07/2022] [Revised: 10/16/2022] [Accepted: 11/15/2022] [Indexed: 11/17/2022] Open
Abstract
MOTIVATION: Technological advances have enabled us to profile single-cell multi-omics data from the same cells, providing us with an unprecedented opportunity to understand the cellular phenotype and its links to genotype. The number of available protocols and multi-omics datasets [including parallel single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) data profiled from the same cell] is growing steadily. However, such data are highly sparse and tend to have a high level of noise, making data analysis challenging. Methods that integrate the multi-omics data can potentially improve the capacity to reveal cellular heterogeneity.
RESULTS: We propose an adaptively weighted multi-view learning (scAWMV) method for the integrative analysis of parallel scRNA-seq and scATAC-seq data profiled from the same cell. scAWMV considers both the difference in importance across different modalities in multi-omics data and the biological connection of the features in the scRNA-seq and scATAC-seq data. It generates biologically meaningful low-dimensional representations for the transcriptomic and epigenomic profiles via unsupervised learning. Application to four real datasets demonstrates that scAWMV is an efficient method to dissect cellular heterogeneity for single-cell multi-omics data.
AVAILABILITY AND IMPLEMENTATION: The software and datasets are available at https://github.com/pengchengzeng/scAWMV.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pengcheng Zeng
  - Institute of Mathematical Sciences, ShanghaiTech University, Shanghai 201210, China
- Yuanyuan Ma
  - School of Computer and Information Engineering, Anyang Normal University, Henan 455000, China
|
41
|
Zhang R, Meng-Papaxanthos L, Vert JP, Noble WS. Multimodal Single-Cell Translation and Alignment with Semi-Supervised Learning. J Comput Biol 2022; 29:1198-1212. [PMID: 36251758 PMCID: PMC9700358 DOI: 10.1089/cmb.2022.0264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022] Open
Abstract
Single-cell multi-omics technologies enable comprehensive interrogation of cellular regulation, yet most single-cell assays measure only one type of activity (such as transcription, chromatin accessibility, DNA methylation, or 3D chromatin architecture) for each cell. To enable a multimodal view for individual cells, we propose Polarbear, a semi-supervised machine learning framework that facilitates missing modality profile prediction and single-cell cross-modality alignment. Polarbear learns to translate between modalities by using data from co-assay measurements coupled with the large quantity of single-assay data available in public databases. This semi-supervised scheme mitigates issues related to low cell quantities and high sparsity in co-assay data. Polarbear first pre-trains a beta-variational autoencoder for each modality using both co-assay and single-assay profiles to learn robust representations of individual cells, and it then uses the co-assay labels to train a translator between these cell representations. This semi-supervised framework enables us to predict missing modality profiles and match single cells across modalities with improved accuracy compared with fully supervised methods, thus facilitating multimodal data integration.
Affiliation(s)
- Ran Zhang
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
|
42
|
Ding L, Mane R, Wu Z, Jiang Y, Meng X, Jing J, Ou W, Wang X, Liu Y, Lin J, Zhao X, Li H, Wang Y, Li Z. Data-driven clustering approach to identify novel phenotypes using multiple biomarkers in acute ischaemic stroke: A retrospective, multicentre cohort study. EClinicalMedicine 2022; 53:101639. [PMID: 36105873 PMCID: PMC9465270 DOI: 10.1016/j.eclinm.2022.101639] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/18/2022] [Revised: 07/31/2022] [Accepted: 08/10/2022] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND: Acute ischaemic stroke (AIS) is a highly heterogeneous disorder and warrants further investigation to stratify patients with different outcomes and treatment responses. Using a large-scale stroke registry cohort, we applied a data-driven approach to identify novel phenotypes based on multiple biomarkers.
METHODS: In a nationwide, prospective, 201-hospital registry study taking place in China between August 01, 2015 and March 31, 2018, patients with AIS who were over 18 years of age and admitted to the hospital within 7 days from symptom onset were included. 92 biomarkers were included in the analysis. In the derivation cohort (n=9539), an unsupervised Gaussian mixture model was applied to categorize patients into distinct phenotypes. A classifier was developed using the most important biomarkers and was applied to categorize patients into their corresponding phenotypes in a validation cohort (n=2496). The differences in biological features, clinical outcomes, and treatment response were compared across the phenotypes.
FINDINGS: We identified four phenotypes with distinct characteristics in 9288 patients with non-cardioembolic ischaemic stroke. Phenotype 1 was associated with abnormal glucose and lipid metabolism. Phenotype 2 was characterized by inflammation and abnormal renal function. Phenotype 3 had the least laboratory abnormalities and small infarct lesions. Phenotype 4 was characterized by disturbance in homocysteine metabolism. Findings were replicated in the validation cohort. In comparison with phenotype 3, the risk of stroke recurrence (adjusted hazard ratio [aHR] 2.02, 95% confidence interval [CI] 1.04-3.94) and mortality (aHR 18.14, 95% CI 6.62-49.71) at 3 months post-stroke were highest in phenotype 2, followed by phenotype 4 and phenotype 1, after adjustment for age, gender, smoking, drinking, history of stroke, hypertension, diabetes mellitus, dyslipidemia, and coronary heart disease. The Monte Carlo simulation showed that the patients with phenotype 2 could benefit from high-intensity statin therapy.
INTERPRETATION: A data-driven approach could aid in the identification of patients at a higher risk of adverse clinical outcomes following non-cardioembolic ischaemic stroke. These phenotypes, based on different pathophysiology, can suggest individualized treatment plans.
FUNDING: Beijing Natural Science Foundation (grant number Z200016), Beijing Municipal Committee of Science and Technology (grant number Z201100005620010), National Natural Science Foundation of China (grant numbers 82101360, 92046016, 82171270), Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences (grant number 2019-I2M-5-029).
Affiliation(s)
- Lingling Ding
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
  - Research Unit of Artificial Intelligence in Cerebrovascular Disease, Chinese Academy of Medical Sciences, Beijing, China
- Ravikiran Mane
  - CNCRC-Hanalytics Artificial Intelligence Research Centre for Neurological Disorders
- Zhenzhou Wu
  - CNCRC-Hanalytics Artificial Intelligence Research Centre for Neurological Disorders
- Yong Jiang
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
  - Research Unit of Artificial Intelligence in Cerebrovascular Disease, Chinese Academy of Medical Sciences, Beijing, China
- Xia Meng
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
- Jing Jing
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
  - Research Unit of Artificial Intelligence in Cerebrovascular Disease, Chinese Academy of Medical Sciences, Beijing, China
- Weike Ou
  - CNCRC-Hanalytics Artificial Intelligence Research Centre for Neurological Disorders
- Xueyun Wang
  - CNCRC-Hanalytics Artificial Intelligence Research Centre for Neurological Disorders
- Yu Liu
  - CNCRC-Hanalytics Artificial Intelligence Research Centre for Neurological Disorders
- Jinxi Lin
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
- Xingquan Zhao
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
  - Research Unit of Artificial Intelligence in Cerebrovascular Disease, Chinese Academy of Medical Sciences, Beijing, China
- Hao Li
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
- Yongjun Wang
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
  - Research Unit of Artificial Intelligence in Cerebrovascular Disease, Chinese Academy of Medical Sciences, Beijing, China
  - Corresponding author at: Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, No. 119 South 4th Ring West Road, Fengtai District, Beijing, 100070, China.
- Zixiao Li
  - Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - China National Clinical Research Center for Neurological Diseases, Beijing, China
  - Research Unit of Artificial Intelligence in Cerebrovascular Disease, Chinese Academy of Medical Sciences, Beijing, China
  - Chinese Institute for Brain Research, Beijing, China
  - Corresponding author at: Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, No 119 S 4th Ring W Rd, Fengtai District, Beijing 100070, China.
|
43
|
Brombacher E, Hackenberg M, Kreutz C, Binder H, Treppner M. The performance of deep generative models for learning joint embeddings of single-cell multi-omics data. Front Mol Biosci 2022; 9:962644. [PMID: 36387277 PMCID: PMC9643784 DOI: 10.3389/fmolb.2022.962644] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/06/2022] [Accepted: 10/12/2022] [Indexed: 11/07/2023] Open
Abstract
Recent extensions of single-cell studies to multiple data modalities raise new questions regarding experimental design. For example, the challenge of sparsity in single-omics data might be partly resolved by compensating for missing information across modalities. In particular, deep learning approaches, such as deep generative models (DGMs), can potentially uncover complex patterns via a joint embedding. Yet, this also raises the question of sample size requirements for identifying such patterns from single-cell multi-omics data. Here, we empirically examine the quality of DGM-based integrations for varying sample sizes. We first review the existing literature and give a short overview of deep learning methods for multi-omics integration. Next, we consider eight popular tools in more detail and examine their robustness to different cell numbers, covering two of the most common multi-omics types currently favored. Specifically, we use data featuring simultaneous gene expression measurements at the RNA level and protein abundance measurements for cell surface proteins (CITE-seq), as well as data where chromatin accessibility and RNA expression are measured in thousands of cells (10x Multiome). We examine the ability of the methods to learn joint embeddings based on biological and technical metrics. Finally, we provide recommendations for the design of multi-omics experiments and discuss potential future developments.
Affiliation(s)
- Eva Brombacher
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
  - Spemann Graduate School of Biology and Medicine (SGBM), University of Freiburg, Freiburg, Germany
  - Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Freiburg, Germany
- Maren Hackenberg
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
- Clemens Kreutz
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
  - Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
- Martin Treppner
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
|
44
|
Mukherjee P, Park SH, Pathak N, Patino CA, Bao G, Espinosa HD. Integrating Micro and Nano Technologies for Cell Engineering and Analysis: Toward the Next Generation of Cell Therapy Workflows. ACS Nano 2022; 16:15653-15680. [PMID: 36154011 DOI: 10.1021/acsnano.2c05494] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/16/2023]
Abstract
The emerging field of cell therapy offers the potential to treat and even cure a diverse array of diseases for which existing interventions are inadequate. Recent advances in micro and nanotechnology have added a multitude of single cell analysis methods to our research repertoire. At the same time, techniques have been developed for the precise engineering and manipulation of cells. Together, these methods have aided the understanding of disease pathophysiology, helped formulate corrective interventions at the cellular level, and expanded the spectrum of available cell therapeutic options. This review discusses how micro and nanotechnology have catalyzed the development of cell sorting, cellular engineering, and single cell analysis technologies, which have become essential workflow components in developing cell-based therapeutics. The review focuses on the technologies adopted in research studies and explores the opportunities and challenges in combining the various elements of cell engineering and single cell analysis into the next generation of integrated and automated platforms that can accelerate preclinical studies and translational research.
Affiliation(s)
- Prithvijit Mukherjee
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
  - Theoretical and Applied Mechanics Program, Northwestern University, Evanston, Illinois 60208, United States
- So Hyun Park
  - Department of Bioengineering, Rice University, 6500 Main Street, Houston, Texas 77030, United States
- Nibir Pathak
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
  - Theoretical and Applied Mechanics Program, Northwestern University, Evanston, Illinois 60208, United States
- Cesar A Patino
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
- Gang Bao
  - Department of Bioengineering, Rice University, 6500 Main Street, Houston, Texas 77030, United States
- Horacio D Espinosa
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
  - Theoretical and Applied Mechanics Program, Northwestern University, Evanston, Illinois 60208, United States
|
45
|
Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space. Nat Commun 2022; 13:6118. [PMID: 36253379 PMCID: PMC9574176 DOI: 10.1038/s41467-022-33758-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 10/18/2021] [Accepted: 09/30/2022] [Indexed: 12/24/2022] Open
Abstract
Computational tools for integrative analyses of diverse single-cell experiments are facing formidable new challenges including dramatic increases in data scale, sample heterogeneity, and the need to informatively cross-reference new data with foundational datasets. Here, we present SCALEX, a deep-learning method that integrates single-cell data by projecting cells into a batch-invariant, common cell-embedding space in a truly online manner (i.e., without retraining the model). SCALEX substantially outperforms online iNMF and other state-of-the-art non-online integration methods on benchmark single-cell datasets of diverse modalities (e.g., single-cell RNA sequencing, scRNA-seq, and single-cell assay for transposase-accessible chromatin using sequencing, scATAC-seq), especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We showcase SCALEX's advantages by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19 patients, each assembled from diverse data sources and growing with every new dataset. The online data integration capacity and superior performance make SCALEX particularly appropriate for large-scale single-cell applications to build upon previous scientific insights.
Collapse
|
46
|
Zheng Y, Shen S, Keleş S. Normalization and de-noising of single-cell Hi-C data with BandNorm and scVI-3D. Genome Biol 2022; 23:222. [PMID: 36253828 PMCID: PMC9575231 DOI: 10.1186/s13059-022-02774-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/16/2021] [Accepted: 09/19/2022] [Indexed: 11/10/2022] Open
Abstract
Single-cell high-throughput chromatin conformation capture methodologies (scHi-C) enable profiling of long-range genomic interactions. However, data from these technologies are prone to technical noise and biases that hinder downstream analysis. We develop a normalization approach, BandNorm, and a deep generative modeling framework, scVI-3D, to account for scHi-C-specific biases. In benchmarking experiments, BandNorm yields leading performance in a time- and memory-efficient manner for cell-type separation, identification of interacting loci, and recovery of cell-type relationships, while scVI-3D exhibits advantages for rare cell types and in high-sparsity scenarios. Application of BandNorm coupled with gene-associating domain analysis reveals scRNA-seq-validated sub-cell-type identification.
Affiliation(s)
- Ye Zheng
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, USA
| | - Siqi Shen
- Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, USA
| | - Sündüz Keleş
- Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, USA
- Department of Statistics, University of Wisconsin - Madison, Madison, USA
| |
|
47
|
Dong X, Tang K, Xu Y, Wei H, Han T, Wang C. Single-cell gene regulation network inference by large-scale data integration. Nucleic Acids Res 2022; 50:e126. [PMID: 36155797 PMCID: PMC9756951 DOI: 10.1093/nar/gkac819] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/06/2022] [Revised: 08/11/2022] [Accepted: 09/14/2022] [Indexed: 12/24/2022] Open
Abstract
Single-cell ATAC-seq (scATAC-seq) has proven to be a state-of-the-art approach to investigating gene regulation at the single-cell level. However, existing methods cannot precisely uncover cell-type-specific binding of transcription regulators (TRs) or construct gene regulation networks (GRNs) at single-cell resolution. ChIP-seq has been widely used to profile TR binding sites in the past decades. Here, we developed SCRIP, an integrative method to infer single-cell TR activity and targets based on the integration of scATAC-seq and a large-scale TR ChIP-seq reference. Our method showed improved performance in evaluating TR binding activity compared to existing motif-based methods and reached a higher consistency with matched TR expression. In addition, our method enables identifying TR target genes as well as building GRNs at single-cell resolution based on a regulatory potential model. We demonstrate SCRIP's utility in accurate cell-type clustering, lineage tracing, and inferring cell-type-specific GRNs in multiple biological systems. SCRIP is freely available at https://github.com/wanglabtongji/SCRIP.
Affiliation(s)
| | | | - Yunfan Xu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration of Ministry of Education, Department of Orthopedics, Tongji Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Hailin Wei
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration of Ministry of Education, Department of Orthopedics, Tongji Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Tong Han
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration of Ministry of Education, Department of Orthopedics, Tongji Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Chenfei Wang
- To whom correspondence should be addressed. Tel: +86 21 65981195; Fax: +86 21 65981195;
| |
|
48
|
Cao Y, Fu L, Wu J, Peng Q, Nie Q, Zhang J, Xie X. Integrated analysis of multimodal single-cell data with structural similarity. Nucleic Acids Res 2022; 50:e121. [PMID: 36130281 PMCID: PMC9757079 DOI: 10.1093/nar/gkac781] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 04/07/2022] [Revised: 08/15/2022] [Accepted: 09/02/2022] [Indexed: 12/24/2022] Open
Abstract
Multimodal single-cell sequencing technologies provide unprecedented information on cellular heterogeneity from multiple layers of genomic readouts. However, joint analysis of two modalities without properly handling the noise often leads to overfitting of one modality by the other and worse clustering results than vanilla single-modality analysis. How to efficiently utilize the extra information from single-cell multi-omics to delineate cell states and identify meaningful signals remains a significant computational challenge. In this work, we propose a deep learning framework, named SAILERX, for efficient, robust, and flexible analysis of multimodal single-cell data. SAILERX consists of a variational autoencoder with invariant representation learning to correct technical noise from the sequencing process, and a multimodal data alignment mechanism to integrate information from different modalities. Instead of performing hard alignment by projecting both modalities to a shared latent space, SAILERX encourages the local structures of the two modalities, measured by pairwise similarities, to be similar. This strategy is more robust against overfitting to noise, which facilitates various downstream analyses such as clustering, imputation, and marker gene detection. Furthermore, the invariant representation learning component enables SAILERX to perform integrative analysis on both multi- and single-modal datasets, making it an applicable and scalable tool for more general scenarios.
Affiliation(s)
| | | | - Jie Wu
- Department of Biological Chemistry, University of California, Irvine, CA 92697, USA
| | - Qinke Peng
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
| | - Qing Nie
- Department of Mathematics, University of California, Irvine, CA 92697, USA
- Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA
- NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, CA 92697, USA
| | - Jing Zhang
- To whom correspondence should be addressed. Tel: +1 949 824 9979;
| | - Xiaohui Xie
- Correspondence may also be addressed to Xiaohui Xie. Tel: +1 949 824 9289;
| |
|
49
|
scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat Methods 2022; 19:1088-1096. [PMID: 35941239 DOI: 10.1038/s41592-022-01562-8] [Citation(s) in RCA: 39] [Impact Index Per Article: 19.5] [Received: 09/08/2021] [Accepted: 06/27/2022] [Indexed: 12/25/2022]
Abstract
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain important challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single-cell multiome datasets, including cell clustering, scATAC profile denoising, data integration across assays and transcription factor activity inference.
|
50
|
Aerts S. How regulatory sequences learn cell representations. Nat Methods 2022; 19:1041-1043. [PMID: 35941240 DOI: 10.1038/s41592-022-01570-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Stein Aerts
- VIB Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| |
|