1
Qin H, Shi X, Zhou H. scSwinFormer: A Transformer-Based Cell-Type Annotation Method for scRNA-Seq Data Using Smooth Gene Embedding and Global Features. J Chem Inf Model 2024; 64:6316-6323. [PMID: 39101690 DOI: 10.1021/acs.jcim.4c00616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/06/2024]
Abstract
Single-cell omics techniques have made it possible to analyze individual cells in biological samples, providing us with a more detailed understanding of cellular heterogeneity and biological systems. Accurate identification of cell types is critical for single-cell RNA sequencing (scRNA-seq) analysis. However, scRNA-seq data are usually high dimensional and sparse, which makes them challenging to analyze. Existing cell-type annotation methods are either constrained in modeling scRNA-seq data or lack consideration of long-term dependencies of characterized genes. In this work, we developed a Transformer-based deep learning method, scSwinFormer, for the cell-type annotation of large-scale scRNA-seq data. Sequence modeling of scRNA-seq data is performed using the smooth gene embedding module, and then, the potential dependencies of genes are captured by the self-attention module. Subsequently, the global information inherent in scRNA-seq data is synthesized using the Cell Token, thereby facilitating accurate cell-type annotation. We evaluated the performance of our model against current state-of-the-art scRNA-seq cell-type annotation methods on multiple real data sets. scSwinFormer outperforms the current state-of-the-art scRNA-seq cell-type annotation methods in both external and benchmark data set experiments.
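The idea of capturing gene dependencies with self-attention and summarizing them in a cell token can be illustrated with a minimal sketch. This is not the scSwinFormer implementation: it uses plain scaled dot-product attention with identity query/key/value projections, and the toy embeddings, the zero-valued `cell_token`, and the function names are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length embedding vectors (lists of floats).
    Returns the attended tokens and the attention weight matrix.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all input tokens
    attended = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return attended, weights

# Hypothetical toy input: three gene embeddings of dimension 4.
gene_embeddings = [
    [0.2, 0.0, 1.0, 0.5],
    [0.9, 0.1, 0.0, 0.3],
    [0.4, 0.8, 0.2, 0.0],
]
# Assumed to be a learnable vector in a real model; zeros here for simplicity.
cell_token = [0.0, 0.0, 0.0, 0.0]

tokens = [cell_token] + gene_embeddings
attended, weights = self_attention(tokens)
cell_summary = attended[0]  # output at the cell-token position
```

In a classifier built this way, the vector at the cell-token position would typically be passed to a small prediction head; that step is omitted here.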
Affiliation(s)
- Hengyu Qin
  - School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
- Xiumin Shi
  - School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
- Han Zhou
  - School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
2
Hu M, Chikina M. Heterogeneous pseudobulk simulation enables realistic benchmarking of cell-type deconvolution methods. Genome Biol 2024; 25:169. [PMID: 38956606 PMCID: PMC11218230 DOI: 10.1186/s13059-024-03292-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Accepted: 05/29/2024] [Indexed: 07/04/2024] Open
Abstract
BACKGROUND Computational cell type deconvolution enables the estimation of cell type abundance from bulk tissues and is important for understanding the tissue microenvironment, especially in tumor tissues. With the rapid development of deconvolution methods, many benchmarking studies have been published that aim to evaluate these methods comprehensively. Benchmarking studies rely on cell-type resolved single-cell RNA-seq data to create simulated pseudobulk datasets by adding individual cell types in controlled proportions. RESULTS In our work, we show that the standard application of this approach, which uses randomly selected single cells, regardless of the intrinsic difference between them, generates synthetic bulk expression values that lack appropriate biological variance. We demonstrate why and how the current bulk simulation pipeline with random cells is unrealistic and propose a heterogeneous simulation strategy as a solution. The heterogeneously simulated bulk samples match up with the variance observed in real bulk datasets and therefore provide concrete benefits for benchmarking in several ways. We demonstrate that conceptual classes of deconvolution methods differ dramatically in their robustness to heterogeneity with reference-free methods performing particularly poorly. For regression-based methods, the heterogeneous simulation provides an explicit framework to disentangle the contributions of reference construction and regression methods to performance. Finally, we perform an extensive benchmark of diverse methods across eight different datasets and find BayesPrism and a hybrid MuSiC/CIBERSORTx approach to be the top performers. CONCLUSIONS Our heterogeneous bulk simulation method and the entire benchmarking framework are implemented in a user-friendly package https://github.com/humengying0907/deconvBenchmarking and https://doi.org/10.5281/zenodo.8206516 , enabling further developments in deconvolution methods.
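The standard pseudobulk construction that the abstract criticizes (pooling randomly selected single cells in controlled cell-type proportions) can be sketched as follows. This is an illustrative sketch only, not the deconvBenchmarking package: the function name, arguments, and toy data are hypothetical, and the authors' heterogeneous strategy is not reproduced because the abstract does not specify it.

```python
import random

def simulate_pseudobulk(expression, cell_types, proportions, n_cells, seed=0):
    """Sum randomly drawn single-cell profiles into one pseudobulk profile.

    expression:  list of per-cell expression vectors (lists of numbers)
    cell_types:  list of cell-type labels, one per cell
    proportions: dict mapping cell type -> target fraction (should sum to 1)
    n_cells:     total number of cells to pool
    """
    rng = random.Random(seed)
    n_genes = len(expression[0])
    bulk = [0.0] * n_genes
    for cell_type, fraction in proportions.items():
        pool = [i for i, label in enumerate(cell_types) if label == cell_type]
        n_draw = round(fraction * n_cells)
        # Sample with replacement, ignoring any differences between cells
        for idx in rng.choices(pool, k=n_draw):
            for g in range(n_genes):
                bulk[g] += expression[idx][g]
    return bulk

# Hypothetical toy data: 4 cells x 2 genes, two cell types.
expression = [[10, 0], [12, 0], [0, 5], [0, 7]]
cell_types = ["T", "T", "B", "B"]
bulk = simulate_pseudobulk(expression, cell_types, {"T": 0.75, "B": 0.25}, n_cells=8)
```

Because every cell of a given type is equally likely to be drawn for every simulated sample, repeated simulations at the same proportions differ only by sampling noise, which is the lack of biological variance the abstract describes.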
Affiliation(s)
- Mengying Hu
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, USA
  - Joint Carnegie Mellon - University of Pittsburgh Computational Biology PhD Program, University of Pittsburgh, Pittsburgh, USA
- Maria Chikina
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, USA.
  - Joint Carnegie Mellon - University of Pittsburgh Computational Biology PhD Program, University of Pittsburgh, Pittsburgh, USA.
3
Jin YW, Hu P, Liu Q. NNICE: a deep quantile neural network algorithm for expression deconvolution. Sci Rep 2024; 14:14040. [PMID: 38890415 PMCID: PMC11189483 DOI: 10.1038/s41598-024-65053-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 06/17/2024] [Indexed: 06/20/2024] Open
Abstract
Cell-type composition is a key indicator of health. Advancements in bulk gene expression data curation, single cell RNA-sequencing technologies, and computational deconvolution approaches offer a new perspective to learn about the composition of different cell types in a quick and affordable way. In this study, we developed a quantile regression and deep learning-based method called Neural Network Immune Contexture Estimator (NNICE) to estimate the cell type abundance and its uncertainty by automatically deconvolving bulk RNA-seq data. The proposed NNICE model was able to successfully recover ground-truth cell type fraction values given unseen bulk mixture gene expression profiles from the same dataset it was trained on. Compared with baseline methods, NNICE achieved better performance in deconvolving both pseudo-bulk gene expression data (Pearson correlation R = 0.9) and real bulk gene expression data (Pearson correlation R = 0.9) across all cell types. In conclusion, NNICE combines statistical inference with deep learning to provide accurate and interpretable cell type deconvolution from bulk gene expression.
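Quantile regression is usually trained with the quantile (pinball) loss, which is what lets a model report an interval rather than a single point estimate. The sketch below shows that standard loss; whether NNICE uses exactly this form is an assumption, and the function name and toy values are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for quantile level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau)
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

# Hypothetical cell-type fractions and predicted lower/upper quantiles.
true_fractions = [0.20, 0.50, 0.30]
lower_pred = [0.15, 0.42, 0.25]   # assumed predictions for tau = 0.05
upper_pred = [0.27, 0.58, 0.36]   # assumed predictions for tau = 0.95
interval_loss = (pinball_loss(true_fractions, lower_pred, 0.05)
                 + pinball_loss(true_fractions, upper_pred, 0.95))
```

Minimizing the loss at a low and a high quantile level yields a lower and an upper bound for each cell-type fraction, which is one way to express the uncertainty mentioned in the abstract.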
Affiliation(s)
- Yong Won Jin
  - Department of Biochemistry & Medical Genetics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, R3E 0J9, Canada
- Pingzhao Hu
  - Department of Biochemistry & Medical Genetics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, R3E 0J9, Canada
  - Department of Biochemistry, Schulich School of Medicine & Dentistry, Western University, London, ON, N6A 5C1, Canada
- Qian Liu
  - Department of Applied Computer Science, University of Winnipeg, Winnipeg, MB, R3B 2E9, Canada.
4
Nguyen H, Nguyen H, Tran D, Draghici S, Nguyen T. Fourteen years of cellular deconvolution: methodology, applications, technical evaluation and outstanding challenges. Nucleic Acids Res 2024; 52:4761-4783. [PMID: 38619038 PMCID: PMC11109966 DOI: 10.1093/nar/gkae267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 03/01/2024] [Accepted: 04/02/2024] [Indexed: 04/16/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-Seq) is a recent technology that allows for the measurement of the expression of all genes in each individual cell contained in a sample. Information at the single-cell level has been shown to be extremely useful in many areas. However, performing single-cell experiments is expensive. Although cellular deconvolution cannot provide the same comprehensive information as single-cell experiments, it can extract cell-type information from bulk RNA data, and therefore it allows researchers to conduct studies at cell-type resolution from existing bulk datasets. For these reasons, a great effort has been made to develop such methods for cellular deconvolution. The large number of methods available, the requirement of coding skills, inadequate documentation, and lack of performance assessment all make it extremely difficult for life scientists to choose a suitable method for their experiment. This paper aims to fill this gap by providing a comprehensive review of 53 deconvolution methods regarding their methodology, applications, performance, and outstanding challenges. More importantly, the article presents a benchmarking of all these 53 methods using 283 cell types from 30 tissues of 63 individuals. We also provide an R package named DeconBenchmark that allows readers to execute and benchmark the reviewed methods (https://github.com/tinnlab/DeconBenchmark).
Affiliation(s)
- Hung Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Ha Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Duc Tran
  - Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
  - Advaita Bioinformatics, Ann Arbor, MI, USA
- Tin Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
5
Hozumi Y, Tanemura KA, Wei GW. Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection. J Chem Inf Model 2024; 64:2829-2838. [PMID: 37402705 PMCID: PMC11009150 DOI: 10.1021/acs.jcim.3c00674] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 07/06/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing the downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method, for the first time. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring the knowledge of the true labels. The R-S plot provides a unique alternative to the uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
Affiliation(s)
- Yuta Hozumi
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Kiyoto Aramis Tanemura
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
6
Davidson NR, Zhang F, Greene CS. BuDDI: Bulk Deconvolution with Domain Invariance to predict cell-type-specific perturbations from bulk. bioRxiv 2024:2023.07.20.549951. [PMID: 37503097 PMCID: PMC10370205 DOI: 10.1101/2023.07.20.549951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
While single-cell experiments provide deep cellular resolution within a single sample, some single-cell experiments are inherently more challenging than bulk experiments due to dissociation difficulties, cost, or limited tissue availability. This creates a situation where we have deep cellular profiles of one sample or condition, and bulk profiles across multiple samples and conditions. To bridge this gap, we propose BuDDI (BUlk Deconvolution with Domain Invariance). BuDDI utilizes domain adaptation techniques to effectively integrate available corpora of case-control bulk and reference scRNA-seq observations to infer cell-type-specific perturbation effects. BuDDI achieves this by learning independent latent spaces within a single variational autoencoder (VAE) encompassing at least four sources of variability: 1) cell type proportion, 2) perturbation effect, 3) structured experimental variability, and 4) remaining variability. Since each latent space is encouraged to be independent, we simulate perturbation responses by independently composing each latent space to simulate cell-type-specific perturbation responses. We evaluated BuDDI's performance on simulated and real data with experimental designs of increasing complexity. We first validated that BuDDI could learn domain invariant latent spaces on data with matched samples across each source of variability. Then we validated that BuDDI could accurately predict cell-type-specific perturbation response when no single-cell perturbed profiles were used during training; instead, only bulk samples had both perturbed and non-perturbed observations. Finally, we validated BuDDI on predicting sex-specific differences, an experimental design where it is not possible to have matched samples. In each experiment, BuDDI outperformed all other comparative methods and baselines. 
As more reference atlases are completed, BuDDI provides a path to combine these resources with bulk-profiled treatment or disease signatures to study perturbations, sex differences, or other factors at single-cell resolution.
Affiliation(s)
- Natalie R Davidson
  - Department of Biomedical Informatics, University of Colorado Anschutz School of Medicine, Aurora, Colorado, United States of America · Funded by the Gordon and Betty Moore Foundation (GBMF 4552), NHGRI of the National Institutes of Health (K99HG012945), NCI of the National Institutes of Health (R01CA237170, R01CA243188, R01CA200854)
- Fan Zhang
  - Department of Medicine Rheumatology, University of Colorado Anschutz School of Medicine, Aurora, Colorado, United States of America; Department of Biomedical Informatics, University of Colorado Anschutz School of Medicine, Aurora, Colorado, United States of America · Funded by the Arthritis National Research Foundation Award, the PhRMA foundation, and the University of Colorado Translational Research Scholars Program Award
- Casey S Greene
  - Department of Biomedical Informatics, University of Colorado Anschutz School of Medicine, Aurora, Colorado, United States of America · Funded by the Gordon and Betty Moore Foundation (GBMF 4552), NCI of the National Institutes of Health (R01CA237170, R01CA243188, R01CA200854)
7
Zhu J, Yang J, Chen X, Wang Y, Wang X, Zhao M, Li G, Wang Y, Zhu Y, Yan F, Liu T, Jiang L. Integrated Bulk and Single-cell RNA Sequencing Data Constructs and Validates a Prognostic Model for Non-small Cell Lung Cancer. J Cancer 2024; 15:796-808. [PMID: 38213729 PMCID: PMC10777029 DOI: 10.7150/jca.90768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 11/28/2023] [Indexed: 01/13/2024] Open
Abstract
Background: Most current research on prognostic model construction for non-small cell lung cancer (NSCLC) involves only bulk RNA-seq data, without integrating single-cell RNA-seq (scRNA-seq) data. In addition, most prognostic models are built from predictive genes alone and ignore other predictive variables such as clinical features. Methods: We obtained scRNA-seq data from the GEO database and bulk RNA-seq data from the TCGA database. We constructed a prognostic model through the Least Absolute Shrinkage and Selection Operator (LASSO) and Cox regression. Furthermore, we performed ESTIMATE, CIBERSORT, and immune checkpoint-related analyses and compared drug sensitivity between risk groups using the pRRophetic method, judged by IC50. Results: Fourteen tumor-related genes were extracted for model construction. The AUCs for 1-, 3-, and 5-year overall survival prediction in the TCGA and three validation cohorts were almost all higher than 0.65, and some exceeded 0.7 or even 0.8. In addition, calibration curves suggested no departure between model prediction and perfect fit. Immune-related and drug sensitivity results also revealed potential targets and strategies for treatment, which can provide clinical guidance. Conclusion: We integrated traditional bulk RNA-seq and scRNA-seq data, along with predictive clinical features, to develop a prognostic model for patients with NSCLC. According to the constructed model, patients in different groups can follow precise and individual therapeutic schedules based on immune characteristics as well as drug sensitivity.
Affiliation(s)
- Junkai Zhu
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Junluo Yang
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Xinyi Chen
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Yang Wang
  - Department of Radiology, Zhujiang Hospital, Southern Medical University, 253 Gongye Middle Avenue, Haizhu District, Guangzhou, Guangdong, 510282, P.R. China
- Xin Wang
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Mengmeng Zhao
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Guanjie Li
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Yuhang Wang
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Yuyao Zhu
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Fangrong Yan
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Tiantian Liu
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
- Liyun Jiang
  - Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing 210009, P.R. China
8
Halawani R, Buchert M, Chen YPP. Deep learning exploration of single-cell and spatially resolved cancer transcriptomics to unravel tumour heterogeneity. Comput Biol Med 2023; 164:107274. [PMID: 37506451 DOI: 10.1016/j.compbiomed.2023.107274] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/02/2023] [Revised: 07/03/2023] [Accepted: 07/16/2023] [Indexed: 07/30/2023]
Abstract
Tumour heterogeneity is one of the critical confounding aspects in decoding tumour growth. Malignant cells display variations in their gene transcription profiles and mutation spectra even when originating from a single progenitor cell. Single-cell and spatial transcriptomics sequencing have recently emerged as key technologies for unravelling tumour heterogeneity. Single-cell sequencing promotes individual cell-type identification through transcriptome-wide gene expression measurements of each cell. Spatial transcriptomics facilitates identification of cell-cell interactions and the structural organization of heterogeneous cells within a tumour tissue through associating spatial RNA abundance of cells at distinct spots in the tissue section. However, extracting features and analyzing single-cell and spatial transcriptomics data poses challenges. Single-cell transcriptome data is extremely noisy and its sparse nature and dropouts can lead to misinterpretation of gene expression and the misclassification of cell types. Deep learning predictive power can overcome data challenges, provide high-resolution analysis and enhance precision oncology applications that involve early cancer prognosis, diagnosis, patient survival estimation and anti-cancer therapy planning. In this paper, we provide a background to and review of the recent progress of deep learning frameworks to investigate tumour heterogeneity using both single-cell and spatial transcriptomics data types.
Affiliation(s)
- Raid Halawani
  - Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
- Michael Buchert
  - School of Cancer Medicine, La Trobe University, Melbourne, Victoria, Australia; Olivia Newton-John Cancer Research Institute, Melbourne, Victoria, Australia
- Yi-Ping Phoebe Chen
  - Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia.
9
Zhang B, Fan T. Knowledge structure and emerging trends in the application of deep learning in genetics research: A bibliometric analysis [2000–2021]. Front Genet 2022; 13:951939. [PMID: 36081985 PMCID: PMC9445221 DOI: 10.3389/fgene.2022.951939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 07/13/2022] [Indexed: 11/13/2022] Open
Abstract
Introduction: Deep learning technology has been widely used in genetic research because of its characteristics of computability, statistical analysis, and predictability. Herein, we aimed to summarize standardized knowledge and potentially innovative approaches for deep learning applications in genetics by evaluating publications to encourage more research. Methods: The Science Citation Index Expanded (SCIE) database was searched for publications on deep learning applications in genomics. Original articles and reviews were considered. In this study, we derived a clustered network from 69,806 references that were cited by the 1,754 related manuscripts identified. We used CiteSpace and VOSviewer to identify countries, institutions, journals, co-cited references, keywords, subject evolution, path, current characteristics, and emerging topics. Results: We assessed the rapidly growing body of publications on deep learning applications in genomics and identified 1,754 articles focusing on this subject. A total of 101 countries and 2,487 institutes contributed publications. The United States of America had the most publications (728/1,754) and the highest h-index, and the US has collaborated closely with China and Germany. The reference clusters of SCI articles were clustered into seven categories: deep learning, logic regression, variant prioritization, random forests, scRNA-seq (single-cell RNA-seq), genomic regulation, and recombination. The keywords representing the research frontiers by year were prediction (2016–2021), sequence (2017–2021), mutation (2017–2021), and cancer (2019–2021). Conclusion: Here, we summarized the current literature related to the status of deep learning for genetics applications and analyzed the current research characteristics and future trajectories in this field.
This work aims to provide resources for further intensive exploration and to encourage more researchers to pursue research on deep learning applications in genetics.
Affiliation(s)
- Bijun Zhang
  - Department of Clinical Genetics, Shengjing Hospital of China Medical University, Shenyang, China
- Ting Fan
  - Department of Computer, School of Intelligent Medicine, China Medical University, Shenyang, China
  - *Correspondence: Ting Fan,
10
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022] Open
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by a heap of genomic data. Artificial intelligence, especially deep learning methods, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods/models in different subareas of human genomics. We assessed the over- and under-charted areas of genomics by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology or genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
  - Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
  - Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia.
11
Vadapalli S, Abdelhalim H, Zeeshan S, Ahmed Z. Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine. Brief Bioinform 2022; 23:6590150. [PMID: 35595537 DOI: 10.1093/bib/bbac191] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 03/08/2022] [Revised: 04/02/2022] [Accepted: 04/26/2022] [Indexed: 12/16/2022] Open
Abstract
Precision medicine uses genetic, environmental and lifestyle factors to more accurately diagnose and treat disease in specific groups of patients, and it is considered one of the most promising medical efforts of our time. The use of genetics is arguably the most data-rich and complex component of precision medicine. The grand challenge today is the successful assimilation of genetics into precision medicine that translates across different ancestries, diverse diseases and other distinct populations, which will require clever use of artificial intelligence (AI) and machine learning (ML) methods. Our goal here was to review and compare scientific objectives, methodologies, datasets, data sources, ethics and gaps of AI/ML approaches used in genomics and precision medicine. We selected high-quality literature published within the last 5 years that was indexed and available through PubMed Central. Our scope was narrowed to articles that reported application of AI/ML algorithms for statistical and predictive analyses using whole genome and/or whole exome sequencing for gene variants, and RNA-seq and microarrays for gene expression. We did not limit our search to specific diseases or data sources. Based on the scope of our review and comparative analysis criteria, we identified 32 different AI/ML approaches applied in variable genomics studies and report widely adapted AI/ML algorithms for predictive diagnostics across several diseases.
Affiliation(s)
- Sreya Vadapalli
  - Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Habiba Abdelhalim
  - Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Saman Zeeshan
  - Rutgers Cancer Institute of New Jersey, Rutgers University, 195 Little Albany St, New Brunswick, NJ, USA
- Zeeshan Ahmed
  - Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
  - Department of Medicine, Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, 125 Paterson St, New Brunswick, NJ, USA
12
Lu J, Chen Y, Zhang X, Guo J, Xu K, Li L. A novel prognostic model based on single-cell RNA sequencing data for hepatocellular carcinoma. Cancer Cell Int 2022; 22:38. [PMID: 35078458 PMCID: PMC8787928 DOI: 10.1186/s12935-022-02469-2] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 11/29/2021] [Accepted: 01/13/2022] [Indexed: 12/24/2022] Open
Abstract
Background The heterogeneous make-up of tumour immune cell infiltrates is a key factor for the therapy response and prognosis of hepatocellular carcinoma (HCC). However, it is still a major challenge to comprehensively understand the tumour immune microenvironment (TIME) at the genetic and cellular levels. Methods HCC single-cell RNA sequencing (scRNA-seq) data were downloaded from the Gene Expression Omnibus (GEO) database, and gene expression data were retrieved from The Cancer Genome Atlas (TCGA) database and International Cancer Genome Consortium (ICGC) database. Cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT) was performed to evaluate the abundance of immune infiltrating cells. We employed weighted gene coexpression network analysis (WGCNA) to construct a gene coexpression network. Univariate Cox and least absolute shrinkage and selection operator (LASSO) analyses were further used to construct a risk model. Moreover, the expression levels of model genes were assessed by qPCR. Results We defined 25 cell clusters based on the scRNA-seq dataset (GSE149614), and the clusters were labelled as various cell types by marker genes. Then, we constructed a weighted coexpression network and identified a total of 6 modules, among which the brown module was most highly correlated with tumours. Moreover, we found that the brown module was most closely related to monocytes (cluster 21). Through univariate Cox and LASSO analyses, we constructed a 3-gene risk model (RiskScore = 0.257 × Expression(CSTB) + 0.263 × Expression(TALDO1) + 0.313 × Expression(CLTA)). This risk model showed excellent predictive efficacy for prognosis in the TCGA-LIHC and ICGC cohorts. Additionally, patients with high risk scores were found to be less likely to benefit from immunotherapy. Conclusions We developed a 3-gene signature (including CLTA, TALDO1 and CSTB) based on the heterogeneity of the TIME to predict the survival outcome and immunotherapy response.
Supplementary Information The online version contains supplementary material available at 10.1186/s12935-022-02469-2.
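The risk model reported in this abstract is a linear combination of three expression values. A minimal Python sketch of that formula follows. The coefficients are taken from the abstract; the function names, the example expression values, and the median split into high- and low-risk groups are assumptions for illustration, since the abstract does not state the expression units or the cut-off the authors used.

```python
# Coefficients as reported in the abstract (univariate Cox + LASSO).
COEFFICIENTS = {"CSTB": 0.257, "TALDO1": 0.263, "CLTA": 0.313}


def risk_score(expression):
    """Return the weighted sum of the three model genes' expression values."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the cohort median.

    The median cut-off is an assumption, not taken from the abstract.
    """
    ordered = sorted(scores.values())
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return {pid: ("high" if s > median else "low") for pid, s in scores.items()}


# Hypothetical expression values, for illustration only.
patients = {
    "P1": {"CSTB": 1.0, "TALDO1": 1.0, "CLTA": 1.0},
    "P2": {"CSTB": 4.0, "TALDO1": 2.0, "CLTA": 3.0},
    "P3": {"CSTB": 2.0, "TALDO1": 0.0, "CLTA": 1.0},
}
scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = split_by_median(scores)
```

Because all three coefficients are positive, higher expression of any model gene raises the score, which is consistent with the abstract's association of high scores with poorer outcomes.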
13
Flores M, Liu Z, Zhang T, Hasib MM, Chiu YC, Ye Z, Paniagua K, Jo S, Zhang J, Gao SJ, Jin YF, Chen Y, Huang Y. Deep learning tackles single-cell analysis-a survey of deep learning for scRNA-seq analysis. Brief Bioinform 2022; 23:bbab531. [PMID: 34929734 PMCID: PMC8769926 DOI: 10.1093/bib/bbab531] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 09/16/2021] [Revised: 11/15/2021] [Accepted: 11/16/2021] [Indexed: 12/17/2022] Open
Abstract
Since being named method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in the data collected from single-cell profiling, resulting in computational challenges in processing these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative to traditional machine learning approaches for single-cell analyses. Here, we survey a total of 25 DL algorithms and their applicability to specific steps in the single-cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.
Affiliation(s)
- Mario Flores
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Zhentao Liu
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Tinghe Zhang
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Md Musaddaqui Hasib
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Yu-Chiao Chiu
  - Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Zhenqing Ye
  - Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
  - Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Karla Paniagua
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Sumin Jo
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Jianqiu Zhang
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Shou-Jiang Gao
  - Department of Microbiology and Molecular Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania, PA 15232, USA
  - UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
- Yu-Fang Jin
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Yidong Chen
  - Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
  - Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Yufei Huang
  - Department of Medicine, School of Medicine, University of Pittsburgh, PA 15232, USA
  - UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
14
Interpretable Autoencoders Trained on Single Cell Sequencing Data Can Transfer Directly to Data from Unseen Tissues. Cells 2021; 11:cells11010085. [PMID: 35011647 PMCID: PMC8750521 DOI: 10.3390/cells11010085] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/27/2021] [Revised: 12/17/2021] [Accepted: 12/21/2021] [Indexed: 01/04/2023] Open
Abstract
Autoencoders have been used to model single-cell mRNA-sequencing data with the purpose of denoising, visualization, data simulation, and dimensionality reduction. We, and others, have shown that autoencoders can be explainable models and interpreted in terms of biology. Here, we show that such autoencoders can generalize to the extent that they can transfer directly without additional training. In practice, we can extract biological modules, denoise, and classify data correctly from an autoencoder that was trained on a different dataset and with different cells (a foreign model). We deconvoluted the biological signal encoded in the bottleneck layer of scRNA-models using saliency maps and mapped salient features to biological pathways. Biological concepts could be associated with specific nodes and interpreted in relation to biological pathways. Even in this unsupervised framework, with no prior information about cell types or labels, the specific biological pathways deduced from the model were in line with findings in previous research. It was hypothesized that autoencoders could learn and represent meaningful biology; here, we show with a systematic experiment that this is true and even transcends the training data. This means that carefully trained autoencoders can be used to assist the interpretation of new unseen data.
15
Ren L, Li J, Wang C, Lou Z, Gao S, Zhao L, Wang S, Chaulagain A, Zhang M, Li X, Tang J. Single cell RNA sequencing for breast cancer: present and future. Cell Death Discov 2021; 7:104. [PMID: 33990550 PMCID: PMC8121804 DOI: 10.1038/s41420-021-00485-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 12/12/2020] [Revised: 03/03/2021] [Accepted: 04/15/2021] [Indexed: 01/01/2023] Open
Abstract
Breast cancer is one of the most common malignant tumors in women. It is a heterogeneous disease related to genetic and environmental factors. Presently, the treatment of breast cancer still faces challenges due to recurrence and metastasis. The emergence of single-cell RNA sequencing (scRNA-seq) technology has brought new strategies to deeply understand the biological behaviors of breast cancer. By analyzing cell phenotypes and transcriptome differences at the single-cell level, scRNA-seq reveals the heterogeneity, dynamic growth and differentiation process of cells. This review summarizes the application of scRNA-seq technology in breast cancer research, such as in studies on cell heterogeneity, cancer cell metastasis, drug resistance, and prognosis. scRNA-seq technology is of great significance to deeply analyze the mechanism of breast cancer occurrence and development, identify new therapeutic targets and develop new therapeutic approaches for breast cancer.
Affiliation(s)
- Lili Ren
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Junyi Li
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Chuhan Wang
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Zheqi Lou
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Shuangshu Gao
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Lingyu Zhao
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Shuoshuo Wang
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Anita Chaulagain
  - Department of Microbiology, Harbin Medical University, Harbin, 150081, China
- Minghui Zhang
  - Department of Oncology, Chifeng City Hospital, Chifeng, 024000, China
- Xiaobo Li
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
- Jing Tang
  - Department of Pathology, Harbin Medical University, Harbin, 150081, China
16
Bhinder B, Gilvary C, Madhukar NS, Elemento O. Artificial Intelligence in Cancer Research and Precision Medicine. Cancer Discov 2021; 11:900-915. [PMID: 33811123 DOI: 10.1158/2159-8290.cd-21-0090] [Citation(s) in RCA: 192] [Impact Index Per Article: 64.0] [Received: 01/20/2021] [Revised: 02/06/2021] [Accepted: 02/08/2021] [Indexed: 11/16/2022]
Abstract
Artificial intelligence (AI) is rapidly reshaping cancer research and personalized clinical care. Availability of high-dimensionality datasets coupled with advances in high-performance computing, as well as innovative deep learning architectures, has led to an explosion of AI use in various aspects of oncology research. These applications range from detection and classification of cancer, to molecular characterization of tumors and their microenvironment, to drug discovery and repurposing, to predicting treatment outcomes for patients. As these advances start penetrating the clinic, we foresee a shifting paradigm in cancer care becoming strongly driven by AI. SIGNIFICANCE: AI has the potential to dramatically affect nearly all aspects of oncology, from enhancing diagnosis to personalizing treatment and discovering novel anticancer drugs. Here, we review the recent enormous progress in the application of AI to oncology, highlight limitations and pitfalls, and chart a path for adoption of AI in the cancer clinic.
Affiliation(s)
- Bhavneet Bhinder
  - Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, New York
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, New York
- Olivier Elemento
  - Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, New York
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, New York
  - OneThree Biotech, New York, New York
17
Yan R, Fan C, Yin Z, Wang T, Chen X. Potential applications of deep learning in single-cell RNA sequencing analysis for cell therapy and regenerative medicine. Stem Cells 2021; 39:511-521. [PMID: 33587792 DOI: 10.1002/stem.3336] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/26/2020] [Accepted: 12/07/2020] [Indexed: 12/26/2022]
Abstract
When used in cell therapy and regenerative medicine strategies, stem cells have the potential to treat many previously incurable diseases. However, current application methods using stem cells are underdeveloped, as these cells are used directly regardless of their culture medium and subgroup. For example, when using mesenchymal stem cells (MSCs) in cell therapy, researchers do not consider their source and culture method nor their application angle and function (soft tissue regeneration, hard tissue regeneration, suppression of immune function, or promotion of immune function). By combining machine learning methods (such as deep learning) with data sets obtained through single-cell RNA sequencing (scRNA-seq) technology, we can discover the hidden structure of these cells, predict their effects more accurately, and effectively use subpopulations with differentiation potential for stem cell therapy. scRNA-seq technology has changed the study of transcription, because it can measure gene expression at single-cell resolution. However, this powerful technology is sensitive to biological and technical noise. The subsequent data analysis can be computationally difficult for a variety of reasons, such as denoising single-cell data, reducing dimensionality, imputing missing values, and accounting for the zero-inflated nature of the data. In this review, we discuss how deep learning methods can be combined with scRNA-seq data for research, and how they can help interpret scRNA-seq data in more depth, improve the follow-up analysis of stem cells, identify potential subgroups, and promote the implementation of cell therapy and regenerative medicine measures.
Affiliation(s)
- Ruojin Yan
  - Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
- Chunmei Fan
  - Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
- Zi Yin
  - Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
- Tingzhang Wang
  - Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Hangzhou, People's Republic of China
  - NMPA Key laboratory for Testing and Risk Warning of Pharmaceutical Microbiology, Hangzhou, People's Republic of China
- Xiao Chen
  - Dr. Li Dak Sum - Yip Yio Chin Center for Stem Cells and Regenerative Medicine and Department of Orthopedic Surgery of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Key Laboratory of Tissue Engineering and Regenerative Medicine of Zhejiang Province, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - Department of Sports Medicine, Zhejiang University School of Medicine, Hangzhou, People's Republic of China
  - China Orthopedic Regenerative Medicine Group (CORMed), Hangzhou, People's Republic of China
18
Chen Z, Wu A. Progress and challenge for computational quantification of tissue immune cells. Brief Bioinform 2021; 22:6065002. [PMID: 33401306 DOI: 10.1093/bib/bbaa358] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/28/2020] [Revised: 10/23/2020] [Accepted: 11/07/2020] [Indexed: 12/28/2022] Open
Abstract
Tissue immune cells have long been recognized as important regulators for the maintenance of balance in the body system. Quantification of the abundance of different immune cells will provide enhanced understanding of the correlation between immune cells and normal or abnormal situations. Currently, computational methods to predict tissue immune cell compositions from bulk transcriptomes have been widely developed. Therefore, summarizing their advantages and disadvantages is appropriate. In addition, an examination of the challenges and possible solutions for these computational models will assist the development of this field. The common hypothesis of these models is that the expression of signature genes for immune cell types might represent the proportion of immune cells that contribute to the tissue transcriptome. In general, we grouped all reported tools into three groups: reference-free, reference-based scoring and reference-based deconvolution methods. In this review, a summary of all the currently reported computational immune cell quantification tools and their applications, limitations, and perspectives is presented. Furthermore, some critical problems are found that have limited the performance and application of these models, including inadequate coverage of immune cell types, the collinearity problem, the impact of the tissue environment on the immune cell expression level, and the deficiency of standard datasets for model validation. To address these issues, tissue-specific training datasets that include all known immune cells, a hierarchical computational framework, and benchmark datasets including both tissue expression profiles and the abundances of all the immune cells are proposed to further promote the development of this field.
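The shared hypothesis described in this abstract, that signature-gene expression tracks the proportion of a cell type in the tissue, can be illustrated with the simplest of the three groups, reference-based scoring. The sketch below averages the expression of each cell type's signature genes in one bulk sample. The gene sets, expression values, and function name are hypothetical and chosen only for illustration, and published tools generally use more elaborate statistics than a plain mean.

```python
def signature_scores(sample, signatures):
    """Mean expression of each cell type's signature genes present in the sample.

    A plain mean is an illustrative simplification, not any specific tool's method.
    """
    scores = {}
    for cell_type, genes in signatures.items():
        values = [sample[g] for g in genes if g in sample]
        scores[cell_type] = sum(values) / len(values) if values else 0.0
    return scores


# Hypothetical signature gene sets and bulk expression values.
signatures = {"T cell": ["CD3D", "CD3E"], "B cell": ["CD19", "MS4A1"]}
bulk_sample = {"CD3D": 8.0, "CD3E": 6.0, "CD19": 1.0, "MS4A1": 3.0, "ACTB": 12.0}
scores = signature_scores(bulk_sample, signatures)
```

Scores of this kind rank samples by relative enrichment of a cell type; they are not proportions, which is one reason the abstract treats deconvolution methods as a separate group.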
Affiliation(s)
- Ziyi Chen
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
- Aiping Wu
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
19
Abstract
The extraordinary diversity, variability, and complexity of cell types in the vertebrate brain is overwhelming and far exceeds that of any other organ. This complexity is the result of multiple cell divisions and intricate gene regulation and cell movements that take place during embryonic development. Understanding the cellular and molecular mechanisms underlying these complicated developmental processes requires the ability to obtain a complete registry of interconnected events often taking place far apart from each other. To assist with this challenging task, developmental neuroscientists take advantage of a broad set of methods and technologies, often adopted from other fields of research. Here, we review some of the methods developed in recent years whose use has rapidly spread for application in the field of developmental neuroscience. We also provide several considerations regarding the promise that these techniques hold for the near future and share some ideas on how existing methods from other research fields could help with the analysis of how neural circuits emerge.
Affiliation(s)
- Augusto Escalante
  - Instituto de Neurociencias (Consejo Superior de Investigaciones Científicas-Universidad Miguel Hernández, CSIC-UMH), Campus San Juan, Av. Ramón y Cajal s/n, Alicante 03550, Spain
- Rocío González-Martínez
  - Instituto de Neurociencias (Consejo Superior de Investigaciones Científicas-Universidad Miguel Hernández, CSIC-UMH), Campus San Juan, Av. Ramón y Cajal s/n, Alicante 03550, Spain
- Eloísa Herrera
  - Instituto de Neurociencias (Consejo Superior de Investigaciones Científicas-Universidad Miguel Hernández, CSIC-UMH), Campus San Juan, Av. Ramón y Cajal s/n, Alicante 03550, Spain
20