1
Marmarelis MG, Littman R, Battaglin F, Niedzwiecki D, Venook A, Ambite JL, Galstyan A, Lenz HJ, Ver Steeg G. q-Diffusion leverages the full dimensionality of gene coexpression in single-cell transcriptomics. Commun Biol 2024; 7:400. [PMID: 38565955 PMCID: PMC11255321 DOI: 10.1038/s42003-024-06104-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 03/25/2024] [Indexed: 04/04/2024] Open
Abstract
Unlocking the full dimensionality of single-cell RNA sequencing data (scRNAseq) is the next frontier to a richer, fuller understanding of cell biology. We introduce q-diffusion, a framework for capturing the coexpression structure of an entire library of genes, improving on state-of-the-art analysis tools. The method is demonstrated via three case studies. In the first, q-diffusion helps gain statistical significance for differential effects on patient outcomes when analyzing the CALGB/SWOG 80405 randomized phase III clinical trial, suggesting precision guidance for the treatment of metastatic colorectal cancer. Secondly, q-diffusion is benchmarked against existing scRNAseq classification methods using an in vitro PBMC dataset, in which the proposed method discriminates IFN-γ stimulation more accurately. The same case study demonstrates improvements in unsupervised cell clustering with the recent Tabula Sapiens human atlas. Finally, a local distributional segmentation approach for spatial scRNAseq, driven by q-diffusion, yields interpretable structures of human cortical tissue.
Affiliation(s)
- Myrl G Marmarelis
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
- Russell Littman
  - University of California Los Angeles, Los Angeles, CA, 90095, USA
- Francesca Battaglin
  - Keck School of Medicine, University of Southern California, 1975 Zonal Ave., Los Angeles, CA, 90033, USA
- Alan Venook
  - University of California San Francisco, San Francisco, CA, 94143, USA
- Jose-Luis Ambite
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
- Aram Galstyan
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
- Heinz-Josef Lenz
  - Keck School of Medicine, University of Southern California, 1975 Zonal Ave., Los Angeles, CA, 90033, USA
- Greg Ver Steeg
  - Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
  - University of California Riverside, Riverside, CA, 92521, USA
2
Mao Y, Lin YY, Wong NKY, Volik S, Sar F, Collins C, Ester M. Phenotype prediction from single-cell RNA-seq data using attention-based neural networks. Bioinformatics 2024; 40:btae067. [PMID: 38390963 PMCID: PMC10902676 DOI: 10.1093/bioinformatics/btae067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 12/15/2023] [Accepted: 02/21/2024] [Indexed: 02/24/2024] Open
Abstract
MOTIVATION: A patient's disease phenotype can be driven and determined by specific groups of cells whose marker genes are either unknown or can only be detected at a late stage using conventional bulk assays such as RNA-Seq technology. Recent advances in single-cell RNA sequencing (scRNA-seq) enable gene expression profiling at cell-level resolution, and therefore have the potential to identify those cells driving the disease phenotype even while the number of these cells is small. However, most existing methods rely heavily on accurate cell type detection, and the number of available annotated samples is usually too small for training deep learning predictive models.
RESULTS: Here, we propose the method ScRAT for phenotype prediction using scRNA-seq data. To train ScRAT with a limited number of samples of different phenotypes, such as coronavirus disease (COVID) and non-COVID, ScRAT first applies a mixup module to increase the number of training samples. A multi-head attention mechanism is employed to learn the most informative cells for each phenotype without relying on a given cell type annotation. Using three public COVID datasets, we show that ScRAT outperforms other phenotype prediction methods. The performance edge of ScRAT over its competitors increases as the number of training samples decreases, indicating the efficacy of our sample mixup. Critical cell types detected based on high-attention cells also support novel findings in the original papers and the recent literature. This suggests that ScRAT overcomes the challenge of missing marker genes and limited sample number, with great potential for revealing novel molecular mechanisms and/or therapies.
AVAILABILITY AND IMPLEMENTATION: The code of our proposed method ScRAT is published at https://github.com/yuzhenmao/ScRAT.
Affiliation(s)
- Yuzhen Mao
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Yen-Yi Lin
  - Department of Urologic Sciences, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Nelson K Y Wong
  - Department of Experimental Therapeutics, BC Cancer, Vancouver, BC V5Z 1L3, Canada
- Funda Sar
  - Department of Urologic Sciences, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Colin Collins
  - Department of Urologic Sciences, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Martin Ester
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
3
Yin Q, Chen L. CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data. Brief Bioinform 2023; 25:bbad449. [PMID: 38061196 PMCID: PMC10703497 DOI: 10.1093/bib/bbad449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 10/30/2023] [Accepted: 11/14/2023] [Indexed: 12/18/2023] Open
Abstract
Identifying cell types is crucial for understanding the functional units of an organism. Machine learning has shown promising performance in identifying cell types, but many existing methods lack biological significance due to poor interpretability. However, it is of the utmost importance to understand what makes cells share the same function and form a specific cell type, motivating us to propose a biologically interpretable method. CellTICS prioritizes marker genes with cell-type-specific expression, using a hierarchy of biological pathways for neural network construction, and applying a multi-predictive-layer strategy to predict cell and sub-cell types. CellTICS usually outperforms existing methods in prediction accuracy. Moreover, CellTICS can reveal pathways that define a cell type or a cell type under specific physiological conditions, such as disease or aging. The nonlinear nature of neural networks enables us to identify many novel pathways. Interestingly, some of the pathways identified by CellTICS exhibit differential expression "variability" rather than differential expression across cell types, indicating that expression stochasticity within a pathway could be an important feature characteristic of a cell type. Overall, CellTICS provides a biologically interpretable method for identifying and characterizing cell types, shedding light on the underlying pathways that define cellular heterogeneity and its role in organismal function. CellTICS is available at https://github.com/qyyin0516/CellTICS.
Affiliation(s)
- Qingyang Yin
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
- Liang Chen
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
4
Martínez-Enguita D, Dwivedi SK, Jörnsten R, Gustafsson M. NCAE: data-driven representations using a deep network-coherent DNA methylation autoencoder identify robust disease and risk factor signatures. Brief Bioinform 2023; 24:bbad293. [PMID: 37587790 PMCID: PMC10516364 DOI: 10.1093/bib/bbad293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 07/25/2023] [Accepted: 07/29/2023] [Indexed: 08/18/2023] Open
Abstract
Precision medicine relies on the identification of robust disease and risk factor signatures from omics data. However, current knowledge-driven approaches may overlook novel or unexpected phenomena due to the inherent biases in biological knowledge. In this study, we present a data-driven signature discovery workflow for DNA methylation analysis utilizing network-coherent autoencoders (NCAEs) with biologically relevant latent embeddings. First, we explored the architecture space of autoencoders trained on a large-scale pan-tissue compendium (n = 75 272) of human epigenome-wide association studies. We observed the emergence of co-localized patterns in the deep autoencoder latent space representations that corresponded to biological network modules. We determined the NCAE configuration with the strongest co-localization and centrality signals in the human protein interactome. Leveraging the NCAE embeddings, we then trained interpretable deep neural networks for risk factor (aging, smoking) and disease (systemic lupus erythematosus) prediction and classification tasks. Remarkably, our NCAE embedding-based models outperformed existing predictors, revealing novel DNA methylation signatures enriched in gene sets and pathways associated with the studied condition in each case. Our data-driven biomarker discovery workflow provides a generally applicable pipeline to capture relevant risk factor and disease information. By surpassing the limitations of knowledge-driven methods, our approach enhances the understanding of complex epigenetic processes, facilitating the development of more effective diagnostic and therapeutic strategies.
Affiliation(s)
- David Martínez-Enguita
  - Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
- Sanjiv K Dwivedi
  - Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
- Rebecka Jörnsten
  - Department of Mathematical Sciences, Chalmers University of Technology, Sweden
- Mika Gustafsson
  - Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
5
Lyu P, Zhai Y, Li T, Qian J. CellAnn: a comprehensive, super-fast, and user-friendly single-cell annotation web server. Bioinformatics 2023; 39:btad521. [PMID: 37610325 PMCID: PMC10477937 DOI: 10.1093/bioinformatics/btad521] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 07/17/2023] [Accepted: 08/22/2023] [Indexed: 08/24/2023] Open
Abstract
MOTIVATION: Single-cell sequencing technology has become routine in studying many biological problems. A core step of analyzing single-cell data is the assignment of cell clusters to specific cell types. Reference-based methods have been proposed for predicting cell types for single-cell clusters. However, their limited scalability and the lack of preprocessed reference datasets prevent them from being practical and easy to use.
RESULTS: Here, we introduce a reference-based cell annotation web server, CellAnn, which is super-fast and easy to use. CellAnn contains a comprehensive reference database with 204 human and 191 mouse single-cell datasets. These reference datasets cover 32 organs. Furthermore, we developed a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which is superior to the existing methods with higher accuracy and higher scalability. Finally, CellAnn is an online tool that integrates all the procedures in cell annotation, including reference searching, transferring cell labels, visualizing results, and harmonizing cell annotation labels. Through the user-friendly interface, users can identify the best annotation by cross-validating with multiple reference datasets. We believe that CellAnn can greatly facilitate single-cell sequencing data analysis.
AVAILABILITY AND IMPLEMENTATION: The web server is available at www.cellann.io, and the source code is available at https://github.com/Pinlyu3/CellAnn_shinyapp.
Affiliation(s)
- Pin Lyu
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21287, United States
- Yijie Zhai
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21287, United States
- Taibo Li
  - Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21218, United States
- Jiang Qian
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21287, United States
6
Xiong YX, Wang MG, Chen L, Zhang XF. Cell-type annotation with accurate unseen cell-type identification using multiple references. PLoS Comput Biol 2023; 19:e1011261. [PMID: 37379341 PMCID: PMC10335708 DOI: 10.1371/journal.pcbi.1011261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 07/11/2023] [Accepted: 06/11/2023] [Indexed: 06/30/2023] Open
Abstract
Recent advances in single-cell RNA sequencing (scRNA-seq) techniques have stimulated efforts to identify and characterize the cellular composition of complex tissues. With the advent of various sequencing techniques, automated cell-type annotation using a well-annotated scRNA-seq reference has become popular. However, this approach relies on the diversity of cell types in the reference, which may not capture all the cell types present in the query data of interest. Query data generally contain unseen cell types because most data atlases are obtained for different purposes and with different techniques. Identifying previously unseen cell types is essential for improving annotation accuracy and uncovering novel biological discoveries. To address this challenge, we propose mtANN (multiple-reference-based scRNA-seq data annotation), a new method to automatically annotate query data while accurately identifying unseen cell types with the aid of multiple references. Key innovations of mtANN include the integration of deep learning and ensemble learning to improve prediction accuracy, and the introduction of a new metric that considers three complementary aspects to distinguish between unseen cell types and shared cell types. Additionally, we provide a data-driven method to adaptively select a threshold for identifying previously unseen cell types. We demonstrate the advantages of mtANN over state-of-the-art methods for unseen cell-type identification and cell-type annotation on two benchmark dataset collections, as well as its predictive power on a collection of COVID-19 datasets. The source code and tutorial are available at https://github.com/Zhangxf-ccnu/mtANN.
Affiliation(s)
- Yi-Xuan Xiong
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, China
- Meng-Guo Wang
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, China
- Luonan Chen
  - State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai, China
  - School of Life Science and Technology, ShanghaiTech University, Shanghai, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou, China
  - Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, China
7
Nie X, Qin D, Zhou X, Duo H, Hao Y, Li B, Liang G. Clustering ensemble in scRNA-seq data analysis: Methods, applications and challenges. Comput Biol Med 2023; 159:106939. [PMID: 37075602 DOI: 10.1016/j.compbiomed.2023.106939] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/20/2023] [Revised: 03/31/2023] [Accepted: 04/14/2023] [Indexed: 04/21/2023]
Abstract
With the rapid development of single-cell RNA-sequencing techniques, various computational methods and tools have been proposed to analyze these high-throughput data, accelerating the discovery of potential biological information. As one of the core steps of single-cell transcriptome data analysis, clustering plays a crucial role in identifying cell types and interpreting cellular heterogeneity. However, the results generated by different clustering methods differ, and such unstable partitions can affect the accuracy of the analysis to a certain extent. To overcome this challenge and obtain more accurate results, clustering ensembles are now frequently applied to cluster analysis of single-cell transcriptome datasets, and the results generated by clustering ensembles are almost always more reliable than those from most single clustering partitions. In this review, we summarize applications and challenges of the clustering ensemble method in single-cell transcriptome data analysis, and provide constructive thoughts and references for researchers in this field.
Affiliation(s)
- Xiner Nie
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, China; College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Dan Qin
  - Department of Biology, College of Science, Northeastern University, Boston, MA, 02115, USA
- Xinyi Zhou
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Hongrui Duo
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Guizhao Liang
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, China
8
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics, Proteomics & Bioinformatics 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Affiliation(s)
- Matthew Brendel
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Chang Su
  - Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA
- Zilong Bai
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Hao Zhang
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Olivier Elemento
  - Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
9
New Developments and Possibilities in Reanalysis and Reinterpretation of Whole Exome Sequencing Datasets for Unsolved Rare Diseases Using Machine Learning Approaches. Int J Mol Sci 2022; 23:ijms23126792. [PMID: 35743235 PMCID: PMC9224427 DOI: 10.3390/ijms23126792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Revised: 06/13/2022] [Accepted: 06/15/2022] [Indexed: 11/21/2022] Open
Abstract
Rare diseases impact the lives of 300 million people in the world. Rapid advances in bioinformatics and genomic technologies have enabled the discovery of causes of 20–30% of rare diseases. However, most rare diseases have remained as unsolved enigmas to date. Newer tools and availability of high throughput sequencing data have enabled the reanalysis of previously undiagnosed patients. In this review, we have systematically compiled the latest developments in the discovery of the genetic causes of rare diseases using machine learning methods. Importantly, we have detailed methods available to reanalyze existing whole exome sequencing data of unsolved rare diseases. We have identified different reanalysis methodologies to solve problems associated with sequence alterations/mutations, variation re-annotation, protein stability, splice isoform malfunctions and oligogenic analysis. In addition, we give an overview of new developments in the field of rare disease research using whole genome sequencing data and other omics.