151
Galdos FX, Xu S, Goodyer WR, Duan L, Huang YV, Lee S, Zhu H, Lee C, Wei N, Lee D, Wu SM. devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data. Nat Commun 2022; 13:5271. [PMID: 36071107; PMCID: PMC9452519; DOI: 10.1038/s41467-022-33045-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/14/2021] [Accepted: 08/31/2022] [Indexed: 11/09/2022] Open
Abstract
A major informatic challenge in single-cell RNA-sequencing analysis is the precise annotation of datasets where cells exhibit complex multilayered identities or transitory states. Here, we present devCellPy, a highly accurate and precise machine learning-enabled tool that enables automated prediction of cell types across complex annotation hierarchies. To demonstrate the power of devCellPy, we construct a murine cardiac developmental atlas from published datasets encompassing 104,199 cells from E6.5-E16.5 and train devCellPy to generate a cardiac prediction algorithm. Using this algorithm, we observe a high prediction accuracy (>90%) across multiple layers of annotation and across de novo murine developmental data. Furthermore, we conduct a cross-species prediction of cardiomyocyte subtypes from in vitro-derived human induced pluripotent stem cells and unexpectedly uncover a predominance of left ventricular (LV) identity that we confirmed by an LV-specific TBX5 lineage tracing system. Together, our results show devCellPy to be a useful tool for automated cell prediction across complex cellular hierarchies, species, and experimental systems.
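To make prediction across annotation layers concrete, here is a minimal sketch of layered classification: a top-level classifier assigns a broad class, and a second classifier trained only on cells of that class assigns the subtype. The nearest-centroid classifier, the labels, and the two-gene expression values are all illustrative assumptions. The abstract does not describe devCellPy's models or API, so this should not be read as its implementation.

```python
import math
from collections import defaultdict


def fit_centroids(samples, labels):
    """Return {label: mean vector} for a list of equal-length feature vectors."""
    grouped = defaultdict(list)
    for vec, lab in zip(samples, labels):
        grouped[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))


def fit_hierarchy(samples, layer1, layer2):
    """Train one top-level model plus one sub-model per top-level class."""
    top = fit_centroids(samples, layer1)
    sub = {}
    for parent in set(layer1):
        idx = [i for i, lab in enumerate(layer1) if lab == parent]
        sub[parent] = fit_centroids(
            [samples[i] for i in idx], [layer2[i] for i in idx]
        )
    return top, sub


def predict_hierarchy(model, vec):
    """Predict the broad class first, then the subtype within that class."""
    top, sub = model
    parent = predict(top, vec)
    return parent, predict(sub[parent], vec)


# Hypothetical two-gene expression profiles with two annotation layers.
cells = [[9.0, 1.0], [8.0, 3.0], [1.0, 9.0], [3.0, 8.0]]
layer1 = ["cardiomyocyte", "cardiomyocyte", "endothelial", "endothelial"]
layer2 = ["ventricular", "atrial", "endocardial", "vascular"]

model = fit_hierarchy(cells, layer1, layer2)
prediction = predict_hierarchy(model, [8.8, 1.2])
print(prediction)
```

Restricting each sub-model to its parent class keeps a subtype from being assigned under the wrong broad class, at the cost of propagating any top-level error downward.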
Affiliation(s)
- Francisco X Galdos
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
- Sidra Xu
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- William R Goodyer
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
  - Division of Pediatric Cardiology, Department of Pediatrics, Stanford University School of Medicine, Palo Alto, USA
- Lauren Duan
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Yuhsin V Huang
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Soah Lee
  - Biopharmaceutical Convergence, School of Pharmacy, Sungkyunkwan University, Suwon, South Korea
- Han Zhu
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
  - Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, USA
- Carissa Lee
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Nicholas Wei
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Daniel Lee
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Sean M Wu
  - Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
  - Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, USA
152
Zheng H, Wang S, Li X, Hu H. INSISTC: Incorporating network structure information for single-cell type classification. Genomics 2022; 114:110480. [PMID: 36075505; DOI: 10.1016/j.ygeno.2022.110480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2022] [Revised: 08/30/2022] [Accepted: 09/04/2022] [Indexed: 11/27/2022]
Abstract
Uncovering gene regulatory mechanisms in individual cells can provide insight into cell heterogeneity and function. Recently accumulated single-cell RNA-seq data have made it possible to analyze gene regulation at single-cell resolution. Understanding cell-type-specific gene regulation can assist in more accurate identification of cell types and states. Computational approaches utilizing such relationships are under development. Pioneering methods that integrate gene regulatory mechanism discovery with cell-type classification encounter challenges such as determining gene regulatory relationships and incorporating gene regulatory network structure. To fill this gap, we developed INSISTC, a computational method that incorporates gene regulatory network structure information for single-cell type classification. INSISTC is capable of identifying cell-type-specific gene regulatory mechanisms while performing single-cell type classification. INSISTC demonstrated its accuracy in cell type classification and its potential for providing insight into molecular mechanisms specific to individual cells. In comparison with the alternative methods, INSISTC demonstrated its complementary performance for gene regulation interpretation.
Affiliation(s)
- Hansi Zheng
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Saidi Wang
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Xiaoman Li
  - Burnett School of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
- Haiyan Hu
  - Department of Computer Science, Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
153
Contrastive learning enables rapid mapping to multimodal single-cell atlas of multimillion scale. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00518-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
154
Ma WF, Turner AW, Gancayco C, Wong D, Song Y, Mosquera JV, Auguste G, Hodonsky CJ, Prabhakar A, Ekiz HA, van der Laan SW, Miller CL. PlaqView 2.0: A comprehensive web portal for cardiovascular single-cell genomics. Front Cardiovasc Med 2022; 9:969421. [PMID: 36003902; PMCID: PMC9393487; DOI: 10.3389/fcvm.2022.969421] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/15/2022] [Accepted: 07/21/2022] [Indexed: 11/13/2022] Open
Abstract
Single-cell RNA-seq (scRNA-seq) is a powerful genomics technology to interrogate the cellular composition and behaviors of complex systems. While the number of scRNA-seq datasets and available computational analysis tools have grown exponentially, there are limited systematic data sharing strategies to allow rapid exploration and re-analysis of single-cell datasets, particularly in the cardiovascular field. We previously introduced PlaqView, an open-source web portal for the exploration and analysis of published atherosclerosis single-cell datasets. Now, we introduce PlaqView 2.0 (www.plaqview.com), which provides expanded features and functionalities as well as additional cardiovascular single-cell datasets. We showcase improved PlaqView functionality, backend data processing, user-interface, and capacity. PlaqView brings new or improved tools to explore scRNA-seq data, including gene query, metadata browser, cell identity prediction, ad hoc RNA-trajectory analysis, and drug-gene interaction prediction. PlaqView serves as one of the largest central repositories for cardiovascular single-cell datasets, which now includes data from human aortic aneurysm, gene-specific mouse knockouts, and healthy references. PlaqView 2.0 brings advanced tools and high-performance computing directly to users without the need for any programming knowledge. Lastly, we outline steps to generalize and repurpose PlaqView's framework for single-cell datasets from other fields.
Affiliation(s)
- Wei Feng Ma
  - Medical Scientist Training Program, University of Virginia, Charlottesville, VA, United States
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Adam W. Turner
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Christina Gancayco
  - Research Computing, University of Virginia, Charlottesville, VA, United States
- Doris Wong
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
- Yipei Song
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - Department of Computer Engineering, University of Virginia, Charlottesville, VA, United States
- Jose Verdezoto Mosquera
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - Research Computing, University of Virginia, Charlottesville, VA, United States
- Gaëlle Auguste
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Chani J. Hodonsky
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Ajay Prabhakar
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- H. Atakan Ekiz
  - Department of Molecular Biology and Genetics, Izmir Institute of Technology, Gülbahçe, Turkey
- Sander W. van der Laan
  - Central Diagnostics Laboratory, Division Laboratories, Pharmacy, and Biomedical Genetics, University Medical Center Utrecht, Utrecht University, Utrecht, Netherlands
- Clint L. Miller
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
  - Department of Public Health Sciences, University of Virginia, Charlottesville, VA, United States
155
scWizard: a web-based automated tool for classifying and annotating single cells and downstream analysis of single-cell RNA-seq data in cancers. Comput Struct Biotechnol J 2022; 20:4902-4909. [PMID: 36147672; PMCID: PMC9474308; DOI: 10.1016/j.csbj.2022.08.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Revised: 07/27/2022] [Accepted: 08/12/2022] [Indexed: 11/22/2022] Open
Abstract
scWizard provides a comprehensive analysis pipeline for integration strategies of cancer scRNA-seq data. scWizard classifies 47 cell subtypes within the TME using a hierarchical model based on a deep neural network. scWizard annotates cell subtypes within the TME with higher accuracy than five existing methods. The scWizard package is a point-and-click tool that helps researchers without proficient programming skills.
The growing number of single-cell RNA-seq (scRNA-Seq) datasets allows the characterization of cell types across various cancer types. However, there is still a lack of effective tools to integrate the various analyses of single cells, especially for fine annotation of cell subtypes within the tumor microenvironment (TME). We developed scWizard, a point-and-click tool that packages an automated process including our cell annotation method based on deep neural network learning and 11 downstream analysis methods. scWizard used 113,976 cells across 13 cancer types as a built-in reference dataset for training the hierarchical model, enabling it to automatically classify and annotate 7 major cell types and 47 cell subtypes in the TME. scWizard provides a built-in pre-training set for users' flexible choice and gives a higher accuracy for annotating subtypes of tumor-derived T-lymphocytes/natural killer cells (T/NK) and myeloid cells from different cancer types compared with five existing methods. scWizard shows good robustness in three independent cancer datasets, with an accuracy of 0.98 in annotating major cell types, 0.85 in annotating myeloid cell subtypes and 0.79 in annotating T/NK cell subtypes, indicating the wide applicability of scWizard to different cell types in cancers. Finally, the automatic analysis and visualization functions of scWizard are presented using the intrahepatic cholangiocarcinoma (ICC) scRNA-Seq dataset as a case study. scWizard focuses on decoding the TME and covers various analysis flows for cancer scRNA-Seq studies, providing an easy-to-use tool and a user-friendly interface for researchers to further accelerate biological discovery in cancer research.
156
Hou W, Ji Z. Palo: spatially aware color palette optimization for single-cell and spatial data. Bioinformatics 2022; 38:3654-3656. [PMID: 35642896; PMCID: PMC9272793; DOI: 10.1093/bioinformatics/btac368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 05/18/2022] [Accepted: 05/26/2022] [Indexed: 11/15/2022] Open
Abstract
SUMMARY In the exploratory data analysis of single-cell or spatial genomic data, single-cells or spatial spots are often visualized using a two-dimensional plot where cell clusters or spot clusters are marked with different colors. With tens of clusters, current visualization methods often assign visually similar colors to spatially neighboring clusters, making it hard to identify the distinction between clusters. To address this issue, we developed Palo that optimizes the color palette assignment for single-cell and spatial data in a spatially aware manner. Palo identifies pairs of clusters that are spatially neighboring to each other and assigns visually distinct colors to those neighboring pairs. We demonstrate that Palo leads to improved visualization in real single-cell and spatial genomic datasets. AVAILABILITY AND IMPLEMENTATION Palo R package is freely available at Github (https://github.com/Winnie09/Palo) and Zenodo (https://doi.org/10.5281/zenodo.6562505). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
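The idea of giving neighboring clusters visually distinct colors can be illustrated with a small brute-force sketch. It scores every assignment of palette colors to clusters by the summed color distance over cluster pairs, weighted by how strongly the two clusters neighbor each other, and keeps the best one. The neighbor weights, the palette, and the use of Euclidean distance in RGB space are assumptions made for illustration. The abstract does not specify Palo's optimization procedure, and exhaustive search is only feasible for a handful of clusters.

```python
import itertools
import math


def assign_colors(neighbor_weight, palette):
    """Pick the palette permutation that maximizes the summed color distance
    over cluster pairs, weighted by how strongly the two clusters neighbor."""
    n = len(neighbor_weight)
    best_score, best_perm = -1.0, None
    for perm in itertools.permutations(range(len(palette)), n):
        score = sum(
            neighbor_weight[i][j] * math.dist(palette[perm[i]], palette[perm[j]])
            for i in range(n)
            for j in range(i + 1, n)
        )
        if score > best_score:
            best_score, best_perm = score, perm
    return [palette[k] for k in best_perm]


# Hypothetical input: clusters 0 and 1 overlap on the plot, cluster 2 is isolated.
neighbor_weight = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
# Two similar reds and one blue, as RGB triples.
palette = [(255, 0, 0), (230, 30, 30), (0, 0, 255)]

colors = assign_colors(neighbor_weight, palette)
print(colors)
```

In this toy case the two similar reds end up on clusters that do not touch, while the neighboring pair receives one red and the blue.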
Affiliation(s)
- Wenpin Hou
  - Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Zhicheng Ji
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27710, USA
157
Ellis D, Wu D, Datta S. SAREV: A review on statistical analytics of single-cell RNA sequencing data. Wiley Interdisciplinary Reviews. Computational Statistics 2022; 14:e1558. [PMID: 36034329; PMCID: PMC9400796; DOI: 10.1002/wics.1558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2020] [Accepted: 04/09/2021] [Indexed: 06/15/2023]
Abstract
Due to the development of next-generation RNA sequencing (NGS) technologies, there has been tremendous progress in research involving determining the role of genomics, transcriptomics and epigenomics in complex biological systems. However, scientists have realized that information obtained using earlier technology, frequently called 'bulk RNA-seq' data, provides information averaged across all the cells present in a tissue. Relatively newly developed single cell (scRNA-seq) technology allows us to provide transcriptomic information at a single-cell resolution. Nevertheless, these high-resolution data have their own complex natures and demand novel statistical data analysis methods to provide effective and highly accurate results on complex biological systems. In this review, we cover many such recently developed statistical methods for researchers wanting to pursue scRNA-seq statistical and computational research as well as scientific research about these existing methods and free software tools available for their generated data. This review is certainly not exhaustive due to page limitations. We have tried to cover the popular methods starting from quality control to the downstream analysis of finding differentially expressed genes and concluding with a brief description of network analysis.
Affiliation(s)
- Dorothy Ellis
  - Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
- Dongyuan Wu
  - Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
- Susmita Datta
  - Department of Biostatistics, University of Florida, School of Public Health and Health Professions, Gainesville, FL
158
Chen Z, Goldwasser J, Tuckman P, Liu J, Zhang J, Gerstein M. Forest Fire Clustering for single-cell sequencing combines iterative label propagation with parallelized Monte Carlo simulations. Nat Commun 2022; 13:3538. [PMID: 35725981; PMCID: PMC9209427; DOI: 10.1038/s41467-022-31107-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/14/2022] [Accepted: 06/06/2022] [Indexed: 11/09/2022] Open
Abstract
In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies", highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
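A "label entropy" of the kind mentioned above can be computed from any per-cell distribution over labels. The sketch below applies the standard Shannon entropy to hypothetical posterior probabilities; the cell names and values are invented, and how Forest Fire Clustering itself estimates these posteriors is not shown here.

```python
import math


def label_entropy(probs):
    """Shannon entropy (in nats) of one cell's label probability vector."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


# Hypothetical posterior label probabilities for three cells over three labels.
posteriors = {
    "cell_a": [1.0, 0.0, 0.0],        # confidently assigned
    "cell_b": [0.5, 0.5, 0.0],        # split between two labels
    "cell_c": [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
}

entropies = {cell: label_entropy(p) for cell, p in posteriors.items()}
for cell, h in entropies.items():
    print(cell, round(h, 3))
```

Entropy is zero for a confidently labeled cell and largest when the probability mass is spread evenly, which is why high-entropy cells are natural candidates for transitional states.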
Affiliation(s)
- Zhanlin Chen
  - Department of Statistics and Data Science, Yale University, New Haven, CT, 06520, USA
- Jeremy Goldwasser
  - Department of Statistics and Data Science, Yale University, New Haven, CT, 06520, USA
- Philip Tuckman
  - Department of Earth, Atmosphere, and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Jason Liu
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520, USA
- Jing Zhang
  - Department of Computer Science, University of California, Irvine, CA, 92617, USA
- Mark Gerstein
  - Department of Statistics and Data Science, Yale University, New Haven, CT, 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520, USA
  - Department of Computer Science, Yale University, New Haven, CT, 06520, USA
159
Zandavi SM, Koch FC, Vijayan A, Zanini F, Mora F, Ortega D, Vafaee F. Disentangling single-cell omics representation with a power spectral density-based feature extraction. Nucleic Acids Res 2022; 50:5482-5492. [PMID: 35639509; PMCID: PMC9178020; DOI: 10.1093/nar/gkac436] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/21/2021] [Revised: 04/26/2022] [Accepted: 05/10/2022] [Indexed: 12/13/2022] Open
Abstract
Emerging single-cell technologies provide high-resolution measurements of distinct cellular modalities opening new avenues for generating detailed cellular atlases of many and diverse tissues. The high dimensionality, sparsity, and inaccuracy of single cell sequencing measurements, however, can obscure discriminatory information, mask cellular subtype variations and complicate downstream analyses which can limit our understanding of cell function and tissue heterogeneity. Here, we present a novel pre-processing method (scPSD) inspired by power spectral density analysis that enhances the accuracy for cell subtype separation from large-scale single-cell omics data. We comprehensively benchmarked our method on a wide range of single-cell RNA-sequencing datasets and showed that scPSD pre-processing, while being fast and scalable, significantly reduces data complexity, enhances cell-type separation, and enables rare cell identification. Additionally, we applied scPSD to transcriptomics and chromatin accessibility cell atlases and demonstrated its capacity to discriminate over 100 cell types across the whole organism and across different modalities of single-cell omics data.
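For readers unfamiliar with the term, a power spectral density describes how the energy of a vector is distributed across discrete frequencies. The sketch below computes a plain periodogram with a naive discrete Fourier transform on an invented eight-value vector. It only illustrates the general concept: the abstract does not specify the scPSD transform, and treating an expression vector as a signal in this way is an assumption.

```python
import cmath


def periodogram(signal):
    """Power at each discrete frequency: |DFT(signal)[k]|^2 / n."""
    n = len(signal)
    power = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power.append(abs(coeff) ** 2 / n)
    return power


# Hypothetical expression vector for one cell across eight features.
cell = [0.0, 3.0, 0.0, 3.0, 0.0, 3.0, 0.0, 3.0]
psd = periodogram(cell)
print([round(p, 6) for p in psd])
```

With this normalization the powers sum to the squared norm of the input (Parseval's theorem), so the transform redistributes the signal's energy across frequencies without changing its total.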
Affiliation(s)
- Seid Miad Zandavi
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute, Cambridge, MA, USA
  - Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Forrest C Koch
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
- Abhishek Vijayan
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
- Fabio Zanini
  - Prince of Wales Clinical School, UNSW Sydney, Australia
  - Cellular Genomics Future Institute, UNSW Sydney, Australia
- Fatima Valdes Mora
  - Children's Cancer Institute, Lowy Cancer Research Centre, UNSW Sydney, Australia
  - School of Women's and Children's Health, Faculty of Medicine, UNSW, Sydney, Australia
- David Gallego Ortega
  - School of Biomedical Engineering, University of Technology Sydney (UTS), Australia
- Fatemeh Vafaee
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
  - Cellular Genomics Future Institute, UNSW Sydney, Australia
  - UNSW Data Science Hub (uDASH), UNSW Sydney, Australia
160
Li J, Chen S, Pan X, Yuan Y, Shen HB. Cell clustering for spatial transcriptomics data with graph neural networks. Nature Computational Science 2022; 2:399-408. [PMID: 38177586; DOI: 10.1038/s43588-022-00266-5] [Citation(s) in RCA: 44] [Impact Index Per Article: 22.0] [Received: 10/18/2021] [Accepted: 05/19/2022] [Indexed: 01/06/2024]
Abstract
Spatial transcriptomics data can provide high-throughput gene expression profiling and the spatial structure of tissues simultaneously. Most studies have relied on only the gene expression information but cannot utilize the spatial information efficiently. Taking advantage of spatial transcriptomics and graph neural networks, we introduce cell clustering for spatial transcriptomics data with graph neural networks, an unsupervised cell clustering method based on graph convolutional networks to improve ab initio cell clustering and discovery of cell subtypes based on curated cell category annotation. On the basis of its application to five in vitro and in vivo spatial datasets, we show that cell clustering for spatial transcriptomics outperforms other spatial clustering approaches on spatial transcriptomics datasets and can clearly identify all four cell cycle phases from multiplexed error-robust fluorescence in situ hybridization data of cultured cells. From enhanced sequential fluorescence in situ hybridization data of brain, cell clustering for spatial transcriptomics finds functional cell subtypes with different micro-environments, which are all validated experimentally, inspiring biological hypotheses about the underlying interactions among the cell state, cell type and micro-environment.
Affiliation(s)
- Jiachen Li
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Siheng Chen
  - Cooperative Medianet Innovation Center (CMIC), Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Xiaoyong Pan
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Ye Yuan
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
161
Single-cell views of the Plasmodium life cycle. Trends Parasitol 2022; 38:748-757. [DOI: 10.1016/j.pt.2022.05.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2022] [Revised: 05/16/2022] [Accepted: 05/17/2022] [Indexed: 02/08/2023]
162
Dohmen J, Baranovskii A, Ronen J, Uyar B, Franke V, Akalin A. Identifying tumor cells at the single-cell level using machine learning. Genome Biol 2022; 23:123. [PMID: 35637521; PMCID: PMC9150321; DOI: 10.1186/s13059-022-02683-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 11/23/2021] [Accepted: 05/06/2022] [Indexed: 12/15/2022] Open
Abstract
Tumors are complex tissues of cancerous cells surrounded by a heterogeneous cellular microenvironment with which they interact. Single-cell sequencing enables molecular characterization of single cells within the tumor. However, cell annotation (the assignment of cell type or cell state to each sequenced cell) is a challenge, especially identifying tumor cells within single-cell or spatial sequencing experiments. Here, we propose ikarus, a machine learning pipeline aimed at distinguishing tumor cells from normal cells at the single-cell level. We test ikarus on multiple single-cell datasets, showing that it achieves high sensitivity and specificity in multiple experimental contexts.
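Sensitivity and specificity, the two metrics reported above, are straightforward to compute for a binary tumor-versus-normal call. The following sketch uses invented labels for nine cells and treats "tumor" as the positive class; it does not use ikarus or reproduce any result from the paper.

```python
def sensitivity_specificity(truth, predicted, positive="tumor"):
    """Return (sensitivity, specificity) for a binary classification."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    # Sensitivity: fraction of true tumor cells that were called tumor.
    # Specificity: fraction of true normal cells that were called normal.
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ground truth and predictions for nine cells.
truth = ["tumor"] * 4 + ["normal"] * 5
predicted = ["tumor", "tumor", "tumor", "normal",
             "normal", "normal", "normal", "normal", "tumor"]

sens, spec = sensitivity_specificity(truth, predicted)
print(sens, spec)
```

Here one of four tumor cells is missed and one of five normal cells is miscalled, giving a sensitivity of 0.75 and a specificity of 0.8.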
Affiliation(s)
- Jan Dohmen
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Artem Baranovskii
  - Non-coding RNAs and Mechanisms of Cytoplasmic Gene Regulation Lab, Berlin Institute for Medical Systems Biology, Hannoversche Str. 28, 10115, Berlin, Germany
  - Free University Berlin, Kaiserswerther Str. 16-18, 14195, Berlin, Germany
- Jonathan Ronen
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Bora Uyar
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Vedran Franke
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
- Altuna Akalin
  - Bioinformatics and Omics Data Science Platform, Berlin Institute For Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Hannoversche Str. 28, 10115, Berlin, Germany
163
Kumar S, Song M. Overcoming biases in causal inference of molecular interactions. Bioinformatics 2022; 38:2818-2825. [PMID: 35561208; DOI: 10.1093/bioinformatics/btac206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/07/2021] [Revised: 02/03/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Computer inference of biological mechanisms is increasingly approachable due to dynamically rich data sources such as single-cell genomics. Inferred molecular interactions can prioritize hypotheses for wet-lab experiments to expedite biological discovery. However, complex data often come with unwanted biological or technical variations, exposing biases over marginal distribution and sample size in current methods to favor spurious causal relationships. RESULTS Considering function direction and strength as evidence for causality, we present an adapted functional chi-squared test (AdpFunChisq) that rewards functional patterns over non-functional or independent patterns. On synthetic and three biology datasets, we demonstrate the advantages of AdpFunChisq over 10 methods on overcoming biases that give rise to wide fluctuations in the performance of alternative approaches. On single-cell multiomics data of multiple phenotype acute leukemia, we found that the T-cell surface glycoprotein CD3 delta chain may causally mediate specific genes in the viral carcinogenesis pathway. Using the causality-by-functionality principle, AdpFunChisq offers a viable option for robust causal inference in dynamical systems. AVAILABILITY AND IMPLEMENTATION The AdpFunChisq test is implemented in the R package 'FunChisq' (2.5.2 or above) at https://cran.r-project.org/package=FunChisq. All other source code along with pre-processed data is available at Code Ocean https://doi.org/10.24433/CO.2907738.v1. SUPPLEMENTARY INFORMATION Supplementary materials are available at Bioinformatics online.
Affiliation(s)
- Sajal Kumar
  - Department of Computer Science, New Mexico State University, Las Cruces, NM 88003, USA
- Mingzhou Song
  - Department of Computer Science, New Mexico State University, Las Cruces, NM 88003, USA
  - Molecular Biology and Interdisciplinary Life Sciences Graduate Program, New Mexico State University, Las Cruces, NM 88003, USA
164
Storrs EP, Zhou DC, Wendl MC, Wyczalkowski MA, Karpova A, Wang LB, Li Y, Southard-Smith A, Jayasinghe RG, Yao L, Liu R, Wu Y, Terekhanova NV, Zhu H, Herndon JM, Puram S, Chen F, Gillanders WE, Fields RC, Ding L. Pollock: fishing for cell states. Bioinformatics Advances 2022; 2:vbac028. [PMID: 35603231; PMCID: PMC9115775; DOI: 10.1093/bioadv/vbac028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2021] [Revised: 04/06/2022] [Accepted: 05/10/2022] [Indexed: 11/24/2022]
Abstract
Motivation The use of single-cell methods is expanding at an ever-increasing rate. While there are established algorithms that address cell classification, they are limited in terms of cross platform compatibility, reliance on the availability of a reference dataset and classification interpretability. Here, we introduce Pollock, a suite of algorithms for cell type identification that is compatible with popular single-cell methods and analysis platforms, provides a set of pretrained human cancer reference models, and reports interpretability scores that identify the genes that drive cell type classifications. Results Pollock performs comparably to existing classification methods, while offering easily deployable pretrained classification models across a wide variety of tissue and data types. Additionally, it demonstrates utility in immune pan-cancer analysis. Availability and implementation Source code and documentation are available at https://github.com/ding-lab/pollock. Pretrained models and datasets are available for download at https://zenodo.org/record/5895221. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Erik P Storrs
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Daniel Cui Zhou
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Michael C Wendl
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Matthew A Wyczalkowski
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Alla Karpova
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Liang-Bo Wang
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Yize Li
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Austin Southard-Smith
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Reyka G Jayasinghe
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Lijun Yao
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Ruiyang Liu
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Yige Wu
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Nadezhda V Terekhanova
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Houxiang Zhu
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- John M Herndon
- Department of Surgery, Washington University in St. Louis, St. Louis, MO 63110, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO 63110, USA
- Sid Puram
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- Feng Chen
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- William E Gillanders
- Department of Surgery, Washington University in St. Louis, St. Louis, MO 63110, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO 63110, USA
- Ryan C Fields
- Department of Surgery, Washington University in St. Louis, St. Louis, MO 63110, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO 63110, USA
- Li Ding
- Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, USA
- McDonnell Genome Institute, Washington University in St. Louis, St. Louis, MO 63108, USA
- Siteman Cancer Center, Washington University in St. Louis, St. Louis, MO 63110, USA
- To whom correspondence should be addressed.
165
Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, Howlett SK, Suchanek O, Polanski K, King HW, Mamanova L, Huang N, Szabo PA, Richardson L, Bolt L, Fasouli ES, Mahbubani KT, Prete M, Tuck L, Richoz N, Tuong ZK, Campos L, Mousa HS, Needham EJ, Pritchard S, Li T, Elmentaite R, Park J, Rahmani E, Chen D, Menon DK, Bayraktar OA, James LK, Meyer KB, Yosef N, Clatworthy MR, Sims PA, Farber DL, Saeb-Parsy K, Jones JL, Teichmann SA. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 2022; 376:eabl5197. [PMID: 35549406 PMCID: PMC7612735 DOI: 10.1126/science.abl5197] [Citation(s) in RCA: 289] [Impact Index Per Article: 144.5] [Indexed: 02/02/2023]
Abstract
Despite their crucial role in health and disease, our knowledge of immune cells within human tissues remains limited. We surveyed the immune compartment of 16 tissues from 12 adult donors by single-cell RNA sequencing and VDJ sequencing generating a dataset of ~360,000 cells. To systematically resolve immune cell heterogeneity across tissues, we developed CellTypist, a machine learning tool for rapid and precise cell type annotation. Using this approach, combined with detailed curation, we determined the tissue distribution of finely phenotyped immune cell types, revealing hitherto unappreciated tissue-specific features and clonal architecture of T and B cells. Our multitissue approach lays the foundation for identifying highly resolved immune cell types by leveraging a common reference dataset, tissue-integrated expression analysis, and antigen receptor sequencing.
Affiliation(s)
- C Domínguez Conde
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- C Xu
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- LB Jarvis
- Department of Clinical Neurosciences, University of Cambridge
- DB Rainbow
- Department of Clinical Neurosciences, University of Cambridge
- SB Wells
- Department of Systems Biology, Columbia University Irving Medical Center
- T Gomes
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- SK Howlett
- Department of Clinical Neurosciences, University of Cambridge
- O Suchanek
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, UK
- K Polanski
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- HW King
- Centre for Immunobiology, Blizard Institute, Queen Mary University of London, London, UK
- L Mamanova
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- N Huang
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- PA Szabo
- Department of Microbiology and Immunology, Columbia University Irving Medical Center
- L Richardson
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- L Bolt
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- ES Fasouli
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- KT Mahbubani
- Department of Surgery, University of Cambridge and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
- M Prete
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- L Tuck
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- N Richoz
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, UK
- ZK Tuong
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, UK
- L Campos
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- West Suffolk Hospital NHS Trust, Bury Saint Edmunds, UK
- HS Mousa
- Department of Clinical Neurosciences, University of Cambridge
- EJ Needham
- Department of Clinical Neurosciences, University of Cambridge
- S Pritchard
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- T Li
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- R Elmentaite
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- J Park
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- E Rahmani
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- D Chen
- Department of Systems Biology, Columbia University Irving Medical Center
- DK Menon
- Department of Anaesthesia, University of Cambridge, Cambridge, UK
- OA Bayraktar
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- LK James
- Centre for Immunobiology, Blizard Institute, Queen Mary University of London, London, UK
- KB Meyer
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- N Yosef
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
- Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA
- MR Clatworthy
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, UK
- PA Sims
- Department of Systems Biology, Columbia University Irving Medical Center
- DL Farber
- Department of Microbiology and Immunology, Columbia University Irving Medical Center
- K Saeb-Parsy
- Department of Surgery, University of Cambridge and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
- JL Jones
- Department of Clinical Neurosciences, University of Cambridge
- SA Teichmann
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Theory of Condensed Matter, Cavendish Laboratory, Department of Physics, University of Cambridge, JJ Thomson Ave, Cambridge CB3 0HE, UK
166
Zhang Y, Zhang F, Wang Z, Wu S, Tian W. scMAGIC: accurately annotating single cells using two rounds of reference-based classification. Nucleic Acids Res 2022; 50:e43. [PMID: 34986249 PMCID: PMC9071478 DOI: 10.1093/nar/gkab1275] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/20/2021] [Revised: 11/08/2021] [Accepted: 12/14/2021] [Indexed: 11/21/2022]
Abstract
Here, we introduce scMAGIC (Single Cell annotation using MArker Genes Identification and two rounds of reference-based Classification [RBC]), a novel method that uses well-annotated single-cell RNA sequencing (scRNA-seq) data as the reference to assist in the classification of query scRNA-seq data. A key innovation in scMAGIC is the introduction of a second-round RBC in which those query cells whose cell identities are confidently validated in the first round are used as a new reference to again classify query cells, therefore eliminating the batch effects between the reference and the query data. scMAGIC significantly outperforms 13 competing RBC methods with their optimal parameter settings across 86 benchmark tests, especially when the cell types in the query dataset are not completely covered by the reference dataset and when there exist significant batch effects between the reference and the query datasets. Moreover, when no reference dataset is available, scMAGIC can annotate query cells with reasonably high accuracy by using an atlas dataset as the reference.
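The two-round idea described in this abstract can be sketched in a few lines. This is only an illustration, not the scMAGIC implementation: the nearest-centroid classifier, the cosine similarity, the function names, and both thresholds (`confident_sim`, `final_sim`) are assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two expression vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_centroids(cells, labels):
    """Average the expression vectors of all cells sharing a label."""
    groups = defaultdict(list)
    for cell, label in zip(cells, labels):
        groups[label].append(cell)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}


def classify(cells, centroids, min_sim):
    """Assign each cell to its most similar centroid, or None if below min_sim."""
    out = []
    for cell in cells:
        label, sim = max(((lab, cosine(cell, c)) for lab, c in centroids.items()),
                         key=lambda pair: pair[1])
        out.append(label if sim >= min_sim else None)
    return out


def two_round_classify(ref_cells, ref_labels, query_cells,
                       confident_sim=0.95, final_sim=0.7):
    # Round 1: classify query cells against the external reference,
    # keeping only high-confidence assignments.
    round1 = classify(query_cells, build_centroids(ref_cells, ref_labels),
                      confident_sim)
    confident = [(c, l) for c, l in zip(query_cells, round1) if l is not None]
    if not confident:
        return round1
    # Round 2: the confident query cells become the new reference, so the
    # second comparison happens within the query batch itself.
    new_ref = build_centroids([c for c, _ in confident], [l for _, l in confident])
    return classify(query_cells, new_ref, final_sim)


# Hypothetical toy data: three genes, two reference cell types.
ref_cells = [[10, 1, 0], [9, 2, 0], [0, 1, 10], [0, 2, 9]]
ref_labels = ["A", "A", "B", "B"]
query_cells = [[10, 1, 1], [1, 1, 10], [6, 6, 1], [0, 10, 0]]
print(two_round_classify(ref_cells, ref_labels, query_cells))
```

Cells that stay below `final_sim` in the second round are left as `None`, which loosely mirrors the case where a query cell type is not covered by the reference.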
Affiliation(s)
- Yu Zhang
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Feng Zhang
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Department of Histoembryology, Genetics and Developmental Biology, Shanghai Key Laboratory of Reproductive Medicine, Key Laboratory of Cell Differentiation and Apoptosis of Chinese Ministry of Education, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Zekun Wang
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Siyi Wu
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Weidong Tian
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai 200438, P.R. China
- Qilu Children's Hospital of Shandong University, No 23976 Jingshi Road, Jinan, Shandong, China
- Children’s Hospital of Fudan University, Shanghai 201102, China
167
Zeng Y, Wei Z, Zhong F, Pan Z, Lu Y, Yang Y. A parameter-free deep embedded clustering method for single-cell RNA-seq data. Brief Bioinform 2022; 23:6582003. [PMID: 35524494 DOI: 10.1093/bib/bbac172] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/21/2021] [Revised: 03/25/2022] [Accepted: 04/18/2022] [Indexed: 11/12/2022]
Abstract
Clustering analysis is widely used in single-cell ribonucleic acid (RNA)-sequencing (scRNA-seq) data to discover cell heterogeneity and cell states. While many clustering methods have been developed for scRNA-seq analysis, most of these methods require the user to provide the number of clusters. However, it is not easy to know the exact number of cell types in advance, and experience-based determination is not always reliable. Here, we have developed ADClust, an automatic deep embedding clustering method for scRNA-seq data, which can accurately cluster cells without requiring a predefined number of clusters. Specifically, ADClust first obtains a low-dimensional representation through a pre-trained autoencoder and uses the representation to cluster cells into initial micro-clusters. The micro-clusters are then compared with one another by a statistical test, and similar micro-clusters are merged into larger clusters. According to the clustering, cell representations are updated so that each cell is pulled toward the centers of its assigned cluster and of similar clusters, while cells are separated to keep distances between clusters. This is accomplished by jointly optimizing the carefully designed clustering and autoencoder loss functions. The merging process continues until convergence. ADClust was tested on 11 real scRNA-seq datasets and was shown to outperform existing methods in terms of both clustering performance and the accuracy of the determined number of clusters. More importantly, our model provides high speed and scalability for large datasets.
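The merge-until-convergence step in this abstract can be illustrated with a short sketch. It is not the ADClust code: it assumes a one-dimensional embedding, swaps in a Welch t statistic as a stand-in for the statistical test the authors use, and omits the autoencoder update entirely. The function names and the `threshold` value are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic between two 1-D samples (each needs at least 2 points)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / max(se, 1e-12)


def merge_micro_clusters(clusters, threshold=2.0):
    """Repeatedly merge the most similar pair of clusters until every
    remaining pair is separated by at least `threshold`."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Find the pair of clusters that the test finds least distinguishable.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: welch_t(clusters[p[0]], clusters[p[1]]))
        if welch_t(clusters[i], clusters[j]) >= threshold:
            break  # converged: no pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical micro-clusters drawn around two underlying groups.
micro = [[1.0, 1.2, 0.8], [1.1, 0.9, 1.3], [5.0, 5.2, 4.8], [5.1, 4.9, 5.3]]
merged = merge_micro_clusters(micro)
print(len(merged), "clusters after merging")
```

The number of final clusters falls out of the stopping rule rather than being supplied up front, which is the property the abstract emphasizes.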
Affiliation(s)
- Yuansong Zeng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zhuoyi Wei
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Fengqi Zhong
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zixiang Pan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yutong Lu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China
168
Hosseini N, Mehrabian A, Mostafavi H. Modeling climate change effects on spatial distribution of wild Aegilops L. (Poaceae) toward food security management and biodiversity conservation in Iran. Integrated Environmental Assessment and Management 2022; 18:697-708. [PMID: 34617662 DOI: 10.1002/ieam.4531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2021] [Revised: 09/14/2021] [Accepted: 09/28/2021] [Indexed: 06/13/2023]
Abstract
The demand for food resources is increasing quickly because human populations are growing; therefore, food security may become one of the largest human challenges of this century. Crop wild relatives (CWRs) are the most valuable plant genetic resources (PGR) for the conservation of genetic diversity in crops. However, climate change is an added pressure on biodiversity, particularly on this valuable group of plants. It is predicted that more than 50% of this group may be lost by 2055 as a result of the effects of climate change. Iran ranks high in the world in its conservation priorities for CWRs. This study investigates the impacts of climate change on Aegilops L. as important CWRs. MaxEnt was applied to predict the spatial distribution of seven Aegilops species under different climatic scenarios (RCP 2.6 and RCP 8.5) of 2050 and 2080. According to the findings, all species exhibited reduction or expansion responses under all of the above-mentioned climatic scenarios. However, the range change was negative for some species (i.e., Aegilops columnaris, Aegilops cylindrica, Aegilops speltoides, Aegilops tauschii [in all scenarios of 2050 and 2080], and Aegilops kotschyi [RCP 2.6 2050 and 2080]), and positive for others (i.e., Aegilops crassa, Aegilops triuncialis [in all scenarios of 2050 and 2080], and Aegilops kotschyi [RCP 8.5 2050 and 2080]). The results of this study emphasize the need for conservation plans for the country's genetic resources, including regular monitoring and assessment of ecological and demographic changes. Integr Environ Assess Manag 2022;18:697-708. © 2021 SETAC.
Affiliation(s)
- Naser Hosseini
- Department of Plant Sciences and Biotechnology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran
- Ahmadreza Mehrabian
- Department of Plant Sciences and Biotechnology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran
- Hossein Mostafavi
- Department of Biodiversity and Ecosystem Management, Environmental Sciences Research Institute, Shahid Beheshti University, Tehran, Iran
169
Abondio P, De Intinis C, da Silva Gonçalves Vianez Júnior JL, Pace L. Single cell multiomic approaches to disentangle T cell heterogeneity. Immunol Lett 2022; 246:37-51. [DOI: 10.1016/j.imlet.2022.04.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2021] [Revised: 04/16/2022] [Accepted: 04/26/2022] [Indexed: 11/29/2022]
170
Bridges K, Miller-Jensen K. Mapping and Validation of scRNA-Seq-Derived Cell-Cell Communication Networks in the Tumor Microenvironment. Front Immunol 2022; 13:885267. [PMID: 35572582 PMCID: PMC9096838 DOI: 10.3389/fimmu.2022.885267] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 02/27/2022] [Accepted: 03/25/2022] [Indexed: 01/25/2023]
Abstract
Recent advances in single-cell technologies, particularly single-cell RNA-sequencing (scRNA-seq), have permitted high-throughput transcriptional profiling of a wide variety of biological systems. As scRNA-seq supports inference of cell-cell communication, this technology has anchored and continues to anchor groundbreaking studies into the efficacy and mechanism of novel immunotherapies for cancer treatment. In this review, we will highlight methods developed to infer inter- and intracellular signaling from scRNA-seq and discuss how they have contributed to studies of immunotherapeutic intervention in the tumor microenvironment (TME). However, a central challenge remains in validating the hypothesized cell-cell interactions. Therefore, this review will also cover strategies for integration of these scRNA-seq-derived interaction networks with existing experimental and computational approaches. Integration of these networks with imaging, protein secretion measurements, and network analysis and mathematical modeling tools addresses challenges that remain with scRNA-seq to enhance studies of immunosuppressive and immunotherapy-altered signaling in the TME.
Affiliation(s)
- Kate Bridges
- Department of Biomedical Engineering, Yale University, New Haven, CT, United States
- Kathryn Miller-Jensen
- Department of Biomedical Engineering, Yale University, New Haven, CT, United States
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT, United States
- Systems Biology Institute, Yale University, New Haven, CT, United States
171
CASSL: A cell-type annotation method for single cell transcriptomics data using semi-supervised learning. Appl Intell 2022. [DOI: 10.1007/s10489-022-03440-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
172
Upadhyay P, Ray S. A Regularized Multi-Task Learning Approach for Cell Type Detection in Single-Cell RNA Sequencing Data. Front Genet 2022; 13:788832. [PMID: 35495159 PMCID: PMC9043858 DOI: 10.3389/fgene.2022.788832] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/03/2021] [Accepted: 02/16/2022] [Indexed: 11/29/2022]
Abstract
Cell type prediction is one of the most challenging goals in single-cell RNA sequencing (scRNA-seq) data. Existing methods use unsupervised learning to identify signature genes in each cluster, followed by a literature survey to look up those genes for assigning cell types. However, finding potential marker genes in each cluster is cumbersome, which impedes the systematic analysis of single-cell RNA sequencing data. To address this challenge, we proposed a framework based on regularized multi-task learning (RMTL) that enables us to simultaneously learn the subpopulation associated with a particular cell type. Learning the structure of subpopulations is treated as a separate task in the multi-task learner. Regularization is used to modulate the multi-task model (e.g., W1, W2, … Wt) jointly, according to the specific prior. For validating our model, we trained it with reference data constructed from a single-cell RNA sequencing experiment and applied it to a query dataset. We also predicted completely independent data (the query dataset) from the reference data which are used for training. We have checked the efficacy of the proposed method by comparing it with other state-of-the-art techniques well known for cell type detection. Results revealed that the proposed method performed accurately in detecting the cell type in scRNA-seq data and thus can be utilized as a useful tool in the scRNA-seq pipeline.
Affiliation(s)
- Piu Upadhyay
- B.P. Poddar Institute of Management and Technology, Kolkata, India
- Sumanta Ray
- Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Health Analytics Network, Pittsburgh, PA, United States
- *Correspondence: Sumanta Ray
173
Jiang H, Huang Y, Li Q. Spectral clustering of single cells using Siamese nerual network combined with improved affinity matrix. Brief Bioinform 2022; 23:6567703. [PMID: 35419595 DOI: 10.1093/bib/bbac113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/13/2022] [Revised: 03/02/2022] [Accepted: 03/08/2022] [Indexed: 11/14/2022]
Abstract
The development of single-cell RNA-sequencing (scRNA-seq) has pushed past the limitations of bulk sequencing techniques in analyzing cell heterogeneity and diversity. Detecting clusters of cells is a key step in the analysis of scRNA-seq data. However, high dimensionality and imbalance in the number of cells of different subtypes are ubiquitous in real scRNA-seq data sets, which poses a major challenge to cell type detection. We propose a meta-learning-based model, SiaClust, which combines a Siamese Convolutional Neural Network (CNN) with improved spectral clustering to achieve scRNA-seq cell type detection. Specifically, with the help of the constrained Sigmoid kernel, the raw high-dimensional data are mapped to a low-dimensional space, and the Siamese CNN learns the differences between the cell types in the low-dimensional feature space. The similarity matrix learned by the Siamese CNN is used in combination with improved spectral clustering and t-distributed Stochastic Neighbor Embedding (t-SNE) for visualization. SiaClust highlights the differences between cell types by comparing the similarity of samples while blurring the differences within cell types, which makes it better suited to processing high-dimensional and imbalanced data. In an extensive evaluation on data generated from nine different species and tissues through different scRNA-seq protocols, SiaClust significantly improves clustering accuracy compared with state-of-the-art single-cell clustering models. More importantly, SiaClust accurately locates the exact sites of dropout genes and is more flexible with respect to data size and cell type.
Affiliation(s)
- Hanjing Jiang
- Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, Institute of Artificial Intelligence, School of Artificial Intelligence and Automation, 430074, Wuhan, China
- Yabing Huang
- Renmin Hospital of Wuhan University, Department of Pathology, 430060, Wuhan, China
- Qianpeng Li
- Chinese Academy of Sciences, Institute of Automation, 100190, Beijing, China
174
Heydari AA, Davalos OA, Zhao L, Hoyer KK, Sindi SS. ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders. Bioinformatics 2022; 38:2194-2201. [PMID: 35179571 PMCID: PMC9004654 DOI: 10.1093/bioinformatics/btac095] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/24/2021] [Revised: 01/19/2022] [Accepted: 02/15/2022] [Indexed: 02/04/2023]
Abstract
MOTIVATION Single-cell RNA sequencing (scRNAseq) technologies allow for measurements of gene expression at a single-cell resolution. This provides researchers with a tremendous advantage for detecting heterogeneity, delineating cellular maps or identifying rare subpopulations. However, a critical complication remains: the low number of single-cell observations due to limitations by rarity of subpopulation, tissue degradation or cost. This absence of sufficient data may cause inaccuracy or irreproducibility of downstream analysis. In this work, we present Automated Cell-Type-informed Introspective Variational Autoencoder (ACTIVA): a novel framework for generating realistic synthetic data using a single-stream adversarial variational autoencoder conditioned with cell-type information. Within a single framework, ACTIVA can enlarge existing datasets and generate specific subpopulations on demand, as opposed to two separate models [such as single-cell GAN (scGAN) and conditional scGAN (cscGAN)]. Data generation and augmentation with ACTIVA can enhance scRNAseq pipelines and analysis, such as benchmarking new algorithms, studying the accuracy of classifiers and detecting marker genes. ACTIVA will facilitate analysis of smaller datasets, potentially reducing the number of patients and animals necessary in initial studies. RESULTS We train and evaluate models on multiple public scRNAseq datasets. In comparison to GAN-based models (scGAN and cscGAN), we demonstrate that ACTIVA generates cells that are more realistic and harder for classifiers to identify as synthetic which also have better pair-wise correlation between genes. Data augmentation with ACTIVA significantly improves classification of rare subtypes (more than 45% improvement compared with not augmenting and 4% better than cscGAN) all while reducing run-time by an order of magnitude in comparison to both models. 
AVAILABILITY AND IMPLEMENTATION The codes and datasets are hosted on Zenodo (https://doi.org/10.5281/zenodo.5879639). Tutorials are available at https://github.com/SindiLab/ACTIVA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- A Ali Heydari
- Department of Applied Mathematics, University of California, Merced, CA 95343, USA
- Health Sciences Research Institute, University of California, Merced, CA 95343, USA
- Oscar A Davalos
- Health Sciences Research Institute, University of California, Merced, CA 95343, USA
- Quantitative and Systems Biology Graduate Program, University of California, Merced, CA 95343, USA
- Lihong Zhao
- Department of Applied Mathematics, University of California, Merced, CA 95343, USA
- Katrina K Hoyer
- Health Sciences Research Institute, University of California, Merced, CA 95343, USA
- Department of Molecular and Cell Biology, University of California, Merced, CA 95343, USA
- Suzanne S Sindi
- Department of Applied Mathematics, University of California, Merced, CA 95343, USA
- Health Sciences Research Institute, University of California, Merced, CA 95343, USA
175
Yin Q, Liu Q, Fu Z, Zeng W, Zhang B, Zhang X, Jiang R, Lv H. scGraph: a graph neural network-based approach to automatically identify cell types. Bioinformatics 2022; 38:2996-3003. [PMID: 35394015 DOI: 10.1093/bioinformatics/btac199] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Revised: 12/13/2021] [Accepted: 04/07/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Single-cell technologies have played a crucial role in revolutionizing biological research over the past decade, strengthening our understanding of cell differentiation, development, and regulation at the single-cell level. Single-cell RNA sequencing (scRNA-seq) is one of the most common single-cell technologies and enables probing transcriptional states in thousands of cells in one experiment. Identification of cell types from scRNA-seq measurements is a fundamental and crucial task. Most previous studies directly take gene expression as input while ignoring the comprehensive gene-gene interactions.
RESULTS: We propose scGraph, an automatic cell identification algorithm that leverages gene interaction relationships to enhance the performance of cell type identification. scGraph is based on a graph neural network that aggregates the information of interacting genes. In a series of experiments, we demonstrate that scGraph is accurate and outperforms eight comparison methods in the task of cell type identification. Moreover, scGraph automatically learns the gene interaction relationships from biological data, and the pathway enrichment analysis shows findings consistent with previous analyses, providing insights into the analysis of regulatory mechanisms.
AVAILABILITY: scGraph is freely available at https://github.com/QijinYin/scGraph and https://figshare.com/articles/software/scGraph/17157743.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qijin Yin
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Qiao Liu
  - Department of Statistics, Stanford University, Stanford, CA 94305
- Zhuoran Fu
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Wanwen Zeng
  - Department of Statistics, Stanford University, Stanford, CA 94305
  - College of Software, Nankai University, Tianjin, 300350, China
- Boheng Zhang
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Rui Jiang
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Hairong Lv
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
  - Fuzhou Institute of Data Technology, Changle, Fuzhou, 350200, China

176
Kong W, Fu YC, Holloway EM, Garipler G, Yang X, Mazzoni EO, Morris SA. Capybara: A computational tool to measure cell identity and fate transitions. Cell Stem Cell 2022; 29:635-649.e11. [PMID: 35354062 PMCID: PMC9040453 DOI: 10.1016/j.stem.2022.03.001] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 12/20/2021] [Revised: 02/18/2022] [Accepted: 03/03/2022] [Indexed: 01/14/2023]
Abstract
Measuring cell identity in development, disease, and reprogramming is challenging as cell types and states are in continual transition. Here, we present Capybara, a computational tool to classify discrete cell identity and intermediate "hybrid" cell states, supporting a metric to quantify cell fate transition dynamics. We validate hybrid cells using experimental lineage tracing data to demonstrate the multi-lineage potential of these intermediate cell states. We apply Capybara to diagnose shortcomings in several cell engineering protocols, identifying hybrid states in cardiac reprogramming and off-target identities in motor neuron programming, which we alleviate by adding exogenous signaling factors. Further, we establish a putative in vivo correlate for induced endoderm progenitors. Together, these results showcase the utility of Capybara to dissect cell identity and fate transitions, prioritizing interventions to enhance the efficiency and fidelity of stem cell engineering.
Affiliation(s)
- Wenjun Kong
  - Department of Developmental Biology, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Department of Genetics, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Center of Regenerative Medicine, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
- Yuheng C Fu
  - Department of Developmental Biology, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Department of Genetics, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Center of Regenerative Medicine, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
- Emily M Holloway
  - Department of Developmental Biology, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Department of Genetics, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Center of Regenerative Medicine, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
- Görkem Garipler
  - Department of Biology, New York University, New York, NY 10003, USA
- Xue Yang
  - Department of Developmental Biology, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Department of Genetics, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Center of Regenerative Medicine, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
- Samantha A Morris
  - Department of Developmental Biology, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Department of Genetics, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA
  - Center of Regenerative Medicine, Washington University School of Medicine in St. Louis, 660 S. Euclid Avenue, Campus Box 8103, St. Louis, MO 63110, USA

177
Xu W, He H, Guo Z, Li W. Evaluation of machine learning models on protein level inference from prioritized RNA features. Brief Bioinform 2022; 23:6555405. [PMID: 35352096 DOI: 10.1093/bib/bbac091] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/01/2021] [Revised: 02/16/2022] [Accepted: 02/23/2022] [Indexed: 11/12/2022] Open
Abstract
The parallel measurement of transcriptome and proteome has revealed unmatched profiles. Since proteomic analysis is more expensive and challenging than transcriptomic analysis, the question of how to use messenger RNA (mRNA) expression data to predict protein levels is extremely important. Here, we comprehensively evaluated 13 machine learning models for inferring protein expression levels from RNA expression profiles. A total of 20 proteogenomic datasets from three mainstream proteomic platforms with >2500 samples of 13 human tissues were collected for model evaluation. Our results highlighted that appropriate feature selection methods combined with classical machine learning models could achieve excellent predictive performance. The voting ensemble model outperformed other candidate models across datasets. Adding the mRNA proxy model to the regression model further improved the prediction performance. The dataset and gene characteristics could affect the prediction performance. Finally, we applied the model to the brain transcriptome of cerebral cortex regions to infer the protein profile for better understanding the functional characteristics of the brain regions. This benchmarking work not only provides useful hints on the inherent correlation between transcriptome and proteome, but also has practical value for the transcriptome-based prediction of protein expression levels.
Affiliation(s)
- Wenjian Xu
  - Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Rare Disease Center, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Haochen He
  - Department of Radiation Protection and Health Physics, Beijing Institute of Radiation Medicine, Beijing 100850, China
- Zhengguang Guo
  - Core Facility of Instruments, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, 5 Dong Dan San Tiao, Beijing 100005, China
- Wei Li
  - Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Rare Disease Center, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China

178
Zeng Z, Li Y, Li Y, Luo Y. Statistical and machine learning methods for spatially resolved transcriptomics data analysis. Genome Biol 2022; 23:83. [PMID: 35337374 PMCID: PMC8951701 DOI: 10.1186/s13059-022-02653-7] [Citation(s) in RCA: 55] [Impact Index Per Article: 27.5] [Received: 10/16/2021] [Accepted: 03/15/2022] [Indexed: 01/28/2023] Open
Abstract
The recent advancement in spatial transcriptomics technology has enabled multiplexed profiling of cellular transcriptomes and spatial locations. As the capacity and efficiency of the experimental technologies continue to improve, there is an emerging need for the development of analytical approaches. Furthermore, with the continuous evolution of sequencing protocols, the underlying assumptions of current analytical methods need to be re-evaluated and adjusted to harness the increasing data complexity. To motivate and aid future model development, we herein review the recent development of statistical and machine learning methods in spatial transcriptomics, summarize useful resources, and highlight the challenges and opportunities ahead.
Affiliation(s)
- Zexian Zeng
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100084, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100084, China
  - Department of Data Sciences, Dana Farber Cancer Institute, Harvard T.H. Chan School of Public Health, Boston, MA, 02215, USA
- Yawei Li
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Yiming Li
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Yuan Luo
  - Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Northwestern University Clinical and Translational Sciences Institute, Chicago, IL, 60611, USA
  - Institute for Augmented Intelligence in Medicine, Northwestern University, Chicago, IL, 60611, USA
  - Center for Health Information Partnerships, Northwestern University, Chicago, IL, 60611, USA

179
Cao X, Xing L, Majd E, He H, Gu J, Zhang X. A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data. Front Genet 2022; 13:836798. [PMID: 35281805 PMCID: PMC8905542 DOI: 10.3389/fgene.2022.836798] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/15/2021] [Accepted: 01/18/2022] [Indexed: 11/13/2022] Open
Abstract
The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. 
The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers.
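Most of the evaluation metrics named in this abstract (Precision, Recall, F1-score, and false-positive rate) follow directly from the four confusion-matrix counts. The snippet below is a minimal sketch for the binary case, not the benchmark's own code; the function name `binary_metrics` and the toy labels are assumptions made for illustration, and the area under the ROC curve is left out because it needs continuous scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1 and false-positive rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical labels: 1 = cell has the phenotype, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For multi-class cell phenotypes, the same counts are usually computed per class (one class versus the rest) and then averaged.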
Affiliation(s)
- Xiaowen Cao
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
  - Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Li Xing
  - Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
- Elham Majd
  - Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Hua He
  - School of Science, Hebei University of Technology, Tianjin, China
- Junhua Gu
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Xuekui Zhang
  - Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada

180
Sun X, Lin X, Li Z, Wu H. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq. Brief Bioinform 2022; 23:6502554. [PMID: 35021202 PMCID: PMC8921620 DOI: 10.1093/bib/bbab567] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 09/15/2021] [Revised: 11/19/2021] [Accepted: 12/11/2021] [Indexed: 01/26/2023] Open
Abstract
Cell type identification is among the most important tasks in single-cell RNA-sequencing (scRNA-seq) analysis. Many in silico methods have been developed and can be roughly categorized as either supervised or unsupervised. In this study, we investigated the performances of 8 supervised and 10 unsupervised cell type identification methods using 14 public scRNA-seq datasets of different tissues, sequencing protocols and species. We investigated the impacts of a number of factors, including total number of cells, number of cell types, sequencing depth, batch effects, reference bias, cell population imbalance, unknown/novel cell types, and computational efficiency and scalability. Instead of merely comparing individual methods, we focused on the factors' impacts on the general categories of supervised and unsupervised methods. We found that in most scenarios, the supervised methods outperformed the unsupervised methods, except for the identification of unknown cell types. This is particularly true when the supervised methods use a reference dataset with high informational sufficiency, low complexity and high similarity to the query dataset. However, this advantage could be undermined by some undesired dataset properties investigated in this study, which lead to uninformative and biased reference datasets. In these scenarios, unsupervised methods could be comparable to supervised methods. Our study not only explained the cell typing methods' behaviors under different experimental settings but also provided a general guideline for the choice of method according to the scientific goal and dataset properties. Finally, our evaluation workflow is implemented as a modularized R pipeline that allows future evaluation of new methods. Availability: All the source code is available at https://github.com/xsun28/scRNAIdent.
Affiliation(s)
- Xiaobo Sun
  - Department of Statistics, School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, Hubei, China
- Xiaochu Lin
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Ziyi Li
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Hao Wu
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA

181
Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022; 13:1246. [PMID: 35273156 PMCID: PMC8913782 DOI: 10.1038/s41467-022-28803-w] [Citation(s) in RCA: 185] [Impact Index Per Article: 92.5] [Received: 06/10/2021] [Accepted: 02/03/2022] [Indexed: 12/29/2022] Open
Abstract
Identification of cell populations often relies on manual annotation of cell clusters using established marker genes. However, the selection of marker genes is a time-consuming process that may lead to sub-optimal annotations, as the markers must be informative of both the individual cell clusters and the various cell types present in the sample. Here, we developed a computational platform, ScType, which enables fully-automated and ultra-fast cell-type identification based solely on the given scRNA-seq data, along with a comprehensive cell marker database as background information. Using six scRNA-seq datasets from various human and mouse tissues, we show how ScType provides unbiased and accurate cell type annotations by guaranteeing the specificity of positive and negative marker genes across cell clusters and cell types. We also demonstrate how ScType distinguishes between healthy and malignant cell populations, based on single-cell calling of single-nucleotide variants, making it a versatile tool for anticancer applications. The widely applicable method is deployed both as an interactive web-tool (https://sctype.app) and as an open-source R-package.
Editor's summary: Cell types are typically identified in single-cell transcriptomic data by manual annotation of cell clusters using established marker genes. Here the authors present a fully-automated computational platform that can quickly and accurately distinguish between cell types.
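The idea of scoring a cluster with positive and negative marker genes can be illustrated with a small sketch. This is a simplified stand-in, not the ScType R package: the function names, the marker sets, the toy expression values, and the square-root normalisation are all assumptions made for illustration, and any expression scaling or marker-specificity weighting used by the actual tool is omitted because the abstract does not describe it.

```python
import math


def marker_score(cluster_expr, positive, negative):
    """Score one cell type: positive markers add, negative markers subtract.

    Each sum is divided by sqrt(number of markers) so that marker sets of
    different sizes remain roughly comparable (an illustrative choice).
    """
    pos = sum(cluster_expr.get(g, 0.0) for g in positive)
    neg = sum(cluster_expr.get(g, 0.0) for g in negative)
    pos_part = pos / math.sqrt(len(positive)) if positive else 0.0
    neg_part = neg / math.sqrt(len(negative)) if negative else 0.0
    return pos_part - neg_part


def annotate_cluster(cluster_expr, marker_db):
    """Return the best-scoring cell type and all scores for one cluster."""
    scores = {
        cell_type: marker_score(cluster_expr, m["positive"], m["negative"])
        for cell_type, m in marker_db.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical marker database and mean expression of one cluster.
marker_db = {
    "T cell": {"positive": ["CD3E", "CD3D"], "negative": ["CD19"]},
    "B cell": {"positive": ["CD19", "MS4A1"], "negative": ["CD3E"]},
}
cluster_expr = {"CD3E": 2.0, "CD3D": 1.5, "CD19": 0.1, "MS4A1": 0.0}

best, scores = annotate_cluster(cluster_expr, marker_db)
print(best, scores)
```

Negative markers matter here because a cluster that expresses a B cell marker strongly is penalised in the T cell score even if some T cell markers are present.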
Affiliation(s)
- Aleksandr Ianevski
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology (HIIT), Aalto University, Helsinki, Finland
- Anil K Giri
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
- Tero Aittokallio
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology (HIIT), Aalto University, Helsinki, Finland
  - Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Oslo, Norway
  - Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo, Norway

182
Andreatta M, Berenstein AJ, Carmona SJ. scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets. Bioinformatics 2022; 38:2642-2644. [PMID: 35258562 PMCID: PMC9048671 DOI: 10.1093/bioinformatics/btac141] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 11/16/2021] [Revised: 02/21/2022] [Accepted: 03/04/2022] [Indexed: 01/22/2023] Open
Abstract
SUMMARY: A common bioinformatics task in single-cell data analysis is to purify a cell type or cell population of interest from heterogeneous datasets. Here, we present scGate, an algorithm that automates marker-based purification of specific cell populations, without requiring training data or reference gene expression profiles. scGate purifies a cell population of interest using a set of markers organized in a hierarchical structure, akin to gating strategies employed in flow cytometry. scGate outperforms state-of-the-art single-cell classifiers and it can be applied to multiple modalities of single-cell data (e.g. RNA-seq, ATAC-seq, CITE-seq). scGate is implemented as an R package and integrated with the Seurat framework, providing an intuitive tool to isolate cell populations of interest from heterogeneous single-cell datasets.
AVAILABILITY AND IMPLEMENTATION: scGate is available as an R package at https://github.com/carmonalab/scGate (https://doi.org/10.5281/zenodo.6202614). Several reproducible workflows describing the main functions and usage of the package on different single-cell modalities, as well as the code to reproduce the benchmark, can be found at https://github.com/carmonalab/scGate.demo (https://doi.org/10.5281/zenodo.6202585) and https://github.com/carmonalab/scGate.benchmark. Test data are available at https://doi.org/10.6084/m9.figshare.16826071.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
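The flow-cytometry-style hierarchical gating described in this abstract can be sketched in a few lines. This toy example is not scGate itself (which is an R package): the function names, the mean-expression signature score, the threshold of 1.0, and the example cells are assumptions made for illustration only. A cell is kept only if it passes every level of the gating model, each level requiring high positive-marker signal and low negative-marker signal.

```python
def signature_score(cell, genes):
    """Mean expression of a gene set in one cell (0.0 for an empty set)."""
    if not genes:
        return 0.0
    return sum(cell.get(g, 0.0) for g in genes) / len(genes)


def passes_level(cell, level, threshold=1.0):
    """A level passes when positive markers are high and negative ones low."""
    pos_ok = (signature_score(cell, level["positive"]) >= threshold
              if level["positive"] else True)
    neg_ok = signature_score(cell, level["negative"]) < threshold
    return pos_ok and neg_ok


def purify(cells, gating_model, threshold=1.0):
    """Keep the names of cells that pass all levels, applied in order."""
    return [
        name for name, cell in cells.items()
        if all(passes_level(cell, level, threshold) for level in gating_model)
    ]


# Hypothetical two-level model: immune cells first, then T cells.
gating_model = [
    {"positive": ["PTPRC"], "negative": []},
    {"positive": ["CD3E", "CD3D"], "negative": ["CD19", "MS4A1"]},
]
cells = {
    "cell_a": {"PTPRC": 3.0, "CD3E": 2.5, "CD3D": 2.0},   # T cell-like
    "cell_b": {"PTPRC": 2.8, "CD19": 3.1, "MS4A1": 2.2},  # B cell-like
    "cell_c": {"EPCAM": 4.0},                             # non-immune
}

pure = purify(cells, gating_model)
print(pure)
```

Because `all` stops at the first failing level, a cell that is rejected by the coarse gate is never evaluated against the finer ones, which mirrors sequential gating.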
Affiliation(s)
- Massimo Andreatta
  - Ludwig Institute for Cancer Research, Lausanne Branch, and Department of Oncology, CHUV and University of Lausanne, Lausanne, 1011, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Ariel J Berenstein
  - Laboratorio de Biología Molecular, División Patología, Instituto Multidisciplinario de Investigaciones en Patologías Pediátricas (IMIPP), CONICET-GCBA, Buenos Aires C1425EFD, Argentina
- Santiago J Carmona
  - Ludwig Institute for Cancer Research, Lausanne Branch, and Department of Oncology, CHUV and University of Lausanne, Lausanne, 1011, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland

183
Li D, Velazquez JJ, Ding J, Hislop J, Ebrahimkhani MR, Bar-Joseph Z. TraSig: inferring cell-cell interactions from pseudotime ordering of scRNA-Seq data. Genome Biol 2022; 23:73. [PMID: 35255944 PMCID: PMC8900372 DOI: 10.1186/s13059-022-02629-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/20/2021] [Accepted: 02/09/2022] [Indexed: 02/08/2023] Open
Abstract
A major advantage of single-cell RNA-sequencing (scRNA-Seq) data is the ability to reconstruct continuous ordering and trajectories for cells. Here we present TraSig, a computational method for improving the inference of cell-cell interactions in scRNA-Seq studies that utilizes the dynamic information to identify significant ligand-receptor pairs with similar trajectories, which in turn are used to score interacting cell clusters. We applied TraSig to several scRNA-Seq datasets and obtained unique predictions that improve upon those identified by prior methods. Functional experiments validate the ability of TraSig to identify novel signaling interactions that impact vascular development in liver organoids. Software: https://github.com/doraadong/TraSig.
Affiliation(s)
- Dongshunyi Li
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15213, PA, USA
- Jeremy J Velazquez
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, 15213, PA, USA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, 15261, PA, USA
- Jun Ding
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, H4A 3J1, Quebec, Canada
- Joshua Hislop
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, 15213, PA, USA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, 15261, PA, USA
  - Department of Bioengineering, Swanson School of Engineering, University of Pittsburgh, Pittsburgh, 15261, PA, USA
- Mo R Ebrahimkhani
  - Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, 15213, PA, USA
  - Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, 15261, PA, USA
  - Department of Bioengineering, Swanson School of Engineering, University of Pittsburgh, Pittsburgh, 15261, PA, USA
  - McGowan Institute for Regenerative Medicine, University of Pittsburgh, Pittsburgh, 15219, PA, USA
- Ziv Bar-Joseph
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15213, PA, USA
  - Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15213, PA, USA

184
Goyal M, Serrano G, Argemi J, Shomorony I, Hernaez M, Ochoa I. JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation. Bioinformatics 2022; 38:2488-2495. [PMID: 35253844 PMCID: PMC9278043 DOI: 10.1093/bioinformatics/btac140] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/01/2021] [Revised: 02/24/2022] [Accepted: 03/03/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell types, there has been growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods in order to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND, a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified.
RESULTS: We show on several batched datasets that the joint approach to integration and classification of JIND outperforms existing pipelines in accuracy, and that a smaller fraction of cells is rejected as unlabeled as a result of the cell-type-specific confidence thresholds. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that the misclassifications could be due to outliers in the annotated datasets or errors in the original approach used for annotation of the target batch.
AVAILABILITY: Implementation for JIND is available at https://github.com/mohit1997/JIND and at https://doi.org/10.5281/zenodo.6246322.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
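The rejection step mentioned in this abstract, where a cell is left unlabeled when the classifier is not confident enough for the predicted type, can be sketched as follows. This is not JIND's code: the function name, the label "Unassigned", the threshold values, and the probabilities are hypothetical, and how JIND actually learns its thresholds is not described in the abstract, so they are simply given as inputs here.

```python
def assign_with_rejection(probabilities, thresholds,
                          rejected_label="Unassigned"):
    """Pick the most probable cell type, or reject if below its threshold.

    probabilities: dict mapping cell type -> predicted probability
    thresholds:    dict mapping cell type -> minimum confidence required
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= thresholds[best]:
        return best
    return rejected_label


# Hypothetical per-type thresholds: B cells require higher confidence.
thresholds = {"T cell": 0.6, "B cell": 0.9}

# Hypothetical classifier outputs for three cells.
predictions = [
    {"T cell": 0.70, "B cell": 0.30},
    {"T cell": 0.20, "B cell": 0.80},
    {"T cell": 0.55, "B cell": 0.45},
]

labels = [assign_with_rejection(p, thresholds) for p in predictions]
print(labels)
```

Using a separate threshold per cell type, rather than one global cutoff, lets easily confused types be held to a stricter standard without rejecting cells of well-separated types.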
Affiliation(s)
- Mohit Goyal
  - Electrical and Computer Engineering Department, University of Illinois, Urbana, IL, USA
- Guillermo Serrano
  - Computational Biology Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
- Josepmaria Argemi
  - Center for Liver Diseases, Pittsburgh Liver Research Center, Division of Gastroenterology, Hepatology and Nutrition, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Madrid, Spain
  - Liver Unit, Clinica Universitaria de Navarra, Pamplona, Spain
  - Hepatology Program, Center for Applied Medical Research (CIMA) Universidad de Navarra, Pamplona, Spain
- Ilan Shomorony
  - Electrical and Computer Engineering Department, University of Illinois, Urbana, IL, USA
- Mikel Hernaez
  - Computational Biology Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
  - Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, USA
  - Artificial Intelligence and Data Science Institute (DATAI), University of Navarra, Pamplona, Spain
- Idoia Ochoa
  - Electrical and Computer Engineering Department, University of Illinois, Urbana, IL, USA
  - Artificial Intelligence and Data Science Institute (DATAI), University of Navarra, Pamplona, Spain
  - Department of Electrical Engineering, Tecnun, University of Navarra, Donostia, Spain

185
Zhang R, Luo Y, Ma J, Zhang M, Wang S. scPretrain: multi-task self-supervised learning for cell-type classification. Bioinformatics 2022; 38:1607-1614. [PMID: 34999749 DOI: 10.1093/bioinformatics/btac007] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/09/2021] [Revised: 12/25/2021] [Accepted: 01/04/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell-type classification, which aims at characterizing and labeling groups of cells according to their gene expression, is one of the most important steps for single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to automatically classify cells. Most of the existing supervised learning approaches only utilize annotated cells in the training step while ignoring the more abundant unannotated cells. In this article, we proposed scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell-type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature extraction encoder based on each dataset's pseudo-labels, where only unannotated cells are used. In the fine-tuning step, scPretrain fine-tunes this feature extraction encoder using the limited annotated cells in a new dataset. RESULTS We evaluated scPretrain on 60 diverse datasets from different technologies, species and organs, and obtained a significant improvement on both cell-type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers, such as random forest, logistic regression and support-vector machines. scPretrain is able to effectively utilize the massive amount of unlabeled data and be applied to annotating increasingly generated scRNA-seq datasets. AVAILABILITY AND IMPLEMENTATION The data and code underlying this article are available in scPretrain: Multi-task self-supervised learning for cell type classification, at https://github.com/ruiyi-zhang/scPretrain and https://zenodo.org/record/5802306. 
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ruiyi Zhang
  - School of EECS, Peking University, Beijing, China
- Yunan Luo
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jianzhu Ma
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
  - Department of Biochemistry, Purdue University, West Lafayette, IN, USA
- Ming Zhang
  - School of EECS, Peking University, Beijing, China
- Sheng Wang
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
|
186
|
Xu Y, Baumgart SJ, Stegmann CM, Hayat S. MACA: marker-based automatic cell-type annotation for single-cell expression data. Bioinformatics 2022; 38:1756-1760. [PMID: 34935911 DOI: 10.1093/bioinformatics/btab840] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/22/2021] [Revised: 10/07/2021] [Accepted: 12/17/2021] [Indexed: 02/03/2023] Open
Abstract
SUMMARY Accurately identifying cell types is a critical step in single-cell sequencing analyses. Here, we present marker-based automatic cell-type annotation (MACA), a new tool for annotating single-cell transcriptomics datasets. We developed MACA by testing four cell-type scoring methods with two public cell-marker databases as reference in six single-cell studies. MACA compares favorably to four existing marker-based cell-type annotation methods in terms of accuracy and speed. We show that MACA can annotate a large single-nuclei RNA-seq study in minutes on human hearts with ∼290K cells. MACA scales easily to large datasets and can broadly help experts to annotate cell types in single-cell transcriptomics datasets, and we envision MACA provides a new opportunity for integration and standardization of cell-type annotation across multiple datasets. AVAILABILITY AND IMPLEMENTATION MACA is written in python and released under GNU General Public License v3.0. The source code is available at https://github.com/ImXman/MACA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Xu
  - Bayer-Broad Joint Precision Cardiology Lab, 75 Ames Street, Cambridge, MA 02142, USA
- Simon J Baumgart
  - Bayer-Broad Joint Precision Cardiology Lab, 75 Ames Street, Cambridge, MA 02142, USA
- Christian M Stegmann
  - Bayer-Broad Joint Precision Cardiology Lab, 75 Ames Street, Cambridge, MA 02142, USA
- Sikander Hayat
  - Bayer-Broad Joint Precision Cardiology Lab, 75 Ames Street, Cambridge, MA 02142, USA
|
187
|
Lin L, Shi W, Ye J, Li J. Multi‐source single‐cell data integration by MAW barycenter for gaussian mixture models. Biometrics 2022. [DOI: 10.1111/biom.13630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2021] [Accepted: 01/29/2022] [Indexed: 11/26/2022]
Affiliation(s)
- Lin Lin
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
- Wei Shi
  - Department of Statistics and Data Science, National University of Singapore, 117546, Singapore
- Jianbo Ye
  - Amazon Lab126, Sunnyvale, CA 94089, USA
- Jia Li
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA
|
188
|
Li H, Qu L, Yang Y, Zhang H, Li X, Zhang X. Single-cell Transcriptomic Architecture Unraveling the Complexity of Tumor Heterogeneity in Distal Cholangiocarcinoma. Cell Mol Gastroenterol Hepatol 2022; 13:1592-1609.e9. [PMID: 35219893 PMCID: PMC9043309 DOI: 10.1016/j.jcmgh.2022.02.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/16/2021] [Revised: 02/17/2022] [Accepted: 02/17/2022] [Indexed: 01/03/2023]
Abstract
BACKGROUND & AIMS Distal cholangiocarcinomas (dCCAs) are a group of epithelial cell malignancies that occur at the distal common bile duct and account for approximately 40% of all cholangiocarcinoma cases. dCCA remains a highly lethal disease as it typically features remarkable cellular heterogeneity. A comprehensive exploration of cellular diversity and the tumor microenvironment is essential to depict the mechanisms driving dCCA progression. METHODS Single-cell RNA sequencing was used here to dissect the heterogeneity landscape and tumor microenvironment composition of human dCCAs. Seven human dCCAs and adjacent normal bile duct samples were included in the current study for single-cell RNA sequencing and subsequent validation approaches. Additionally, the results of the analyses were compared with bulk transcriptomic datasets from extrahepatic cholangiocarcinoma and single-cell RNA data from intrahepatic cholangiocarcinoma. RESULTS We sequenced a total of 49,717 single cells derived from human dCCAs and adjacent tissues, identifying 11 distinct cell types. Malignant cells displayed remarkable inter- and intra-tumor heterogeneity, with 5 distinct subsets defined in tumor samples. The malignant cells displayed variable degrees of aneuploidy, which can be classified into low- and high-copy number variation groups based on either amplification or deletion of chr17q12-chr17q21.2. Additionally, we identified 4 distinct T lymphocyte subsets, of which cytotoxic CD8+ T cells predominated as effectors in tumor tissues, whereas tumor-infiltrating FOXP3+ CD4+ regulatory T cells exhibited highly immunosuppressive characteristics. CONCLUSION Our single-cell transcriptomic dataset depicts the inter- and intra-tumor heterogeneity of human dCCAs at the expression level.
Affiliation(s)
- Hongguang Li
  - Department of Hepatobiliary Surgery, Shandong Provincial Hospital, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Lingxin Qu
  - Department of Physiology and Pathophysiology, School of Basic Medical Sciences, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Yongheng Yang
  - Department of Physiology and Pathophysiology, School of Basic Medical Sciences, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Haibin Zhang
  - Department of Physiology and Pathophysiology, School of Basic Medical Sciences, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Xuexin Li
  - Division of Genome Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
- Xiaolu Zhang
  - Department of Physiology and Pathophysiology, School of Basic Medical Sciences, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
  - Correspondence: Xiaolu Zhang, Department of Physiology and Pathophysiology, School of Basic Medical Sciences, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, 250012, China. Tel: (+86) 17862933917; fax: (+86) 53188565657.
|
189
|
Wilson SB, Howden SE, Vanslambrouck JM, Dorison A, Alquicira-Hernandez J, Powell JE, Little MH. DevKidCC allows for robust classification and direct comparisons of kidney organoid datasets. Genome Med 2022. [PMID: 35189942 DOI: 10.1101/2021.01.20.427346] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/09/2023] Open
Abstract
BACKGROUND While single-cell transcriptional profiling has greatly increased our capacity to interrogate biology, accurate cell classification within and between datasets is a key challenge. This is particularly so in pluripotent stem cell-derived organoids which represent a model of a developmental system. Here, clustering algorithms and selected marker genes can fail to accurately classify cellular identity while variation in analyses makes it difficult to meaningfully compare datasets. Kidney organoids provide a valuable resource to understand kidney development and disease. However, direct comparison of relative cellular composition between protocols has proved challenging. Hence, an unbiased approach for classifying cell identity is required. METHODS The R package, scPred, was trained on multiple single cell RNA-seq datasets of human fetal kidney. A hierarchical model classified cellular subtypes into nephron, stroma and ureteric epithelial elements. This model, provided in the R package DevKidCC ( github.com/KidneyRegeneration/DevKidCC ), was then used to predict relative cell identity within published kidney organoid datasets generated using distinct cell lines and differentiation protocols, interrogating the impact of such variations. The package contains custom functions for the display of differential gene expression within cellular subtypes. RESULTS DevKidCC was used to directly compare between distinct kidney organoid protocols, identifying differences in relative proportions of cell types at all hierarchical levels of the model and highlighting variations in stromal and unassigned cell types, nephron progenitor prevalence and relative maturation of individual epithelial segments. Of note, DevKidCC was able to distinguish distal nephron from ureteric epithelium, cell types with overlapping profiles that have previously confounded analyses. 
When applied to a variation in protocol via the addition of retinoic acid, DevKidCC identified a consequential depletion of nephron progenitors. CONCLUSIONS The application of DevKidCC to kidney organoids reproducibly classifies component cellular identity within distinct single-cell datasets. The application of the tool is summarised in an interactive Shiny application, as are examples of the utility of in-built functions for data presentation. This tool will enable the consistent and rapid comparison of kidney organoid protocols, driving improvements in patterning to kidney endpoints and validating new approaches.
Affiliation(s)
- Sean B Wilson
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Victoria, Parkville, Australia
- Sara E Howden
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Victoria, Parkville, Australia
- Aude Dorison
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
- Jose Alquicira-Hernandez
  - Garvan-Weizmann Centre for Cellular Genomics, The Kinghorn Cancer Centre, Darlinghurst, New South Wales, Australia
- Joseph E Powell
  - Garvan-Weizmann Centre for Cellular Genomics, The Kinghorn Cancer Centre, Darlinghurst, New South Wales, Australia
  - UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, New South Wales, Australia
- Melissa H Little
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Victoria, Parkville, Australia
  - Department of Anatomy and Neuroscience, The University of Melbourne, Victoria, Parkville, Australia
  - Novo Nordisk Foundation Centre for Stem Cell Medicine, Copenhagen, Denmark
|
190
|
Wilson SB, Howden SE, Vanslambrouck JM, Dorison A, Alquicira-Hernandez J, Powell JE, Little MH. DevKidCC allows for robust classification and direct comparisons of kidney organoid datasets. Genome Med 2022; 14:19. [PMID: 35189942 PMCID: PMC8862535 DOI: 10.1186/s13073-022-01023-z] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 04/17/2021] [Accepted: 02/08/2022] [Indexed: 12/20/2022] Open
Abstract
Background While single-cell transcriptional profiling has greatly increased our capacity to interrogate biology, accurate cell classification within and between datasets is a key challenge. This is particularly so in pluripotent stem cell-derived organoids which represent a model of a developmental system. Here, clustering algorithms and selected marker genes can fail to accurately classify cellular identity while variation in analyses makes it difficult to meaningfully compare datasets. Kidney organoids provide a valuable resource to understand kidney development and disease. However, direct comparison of relative cellular composition between protocols has proved challenging. Hence, an unbiased approach for classifying cell identity is required. Methods The R package, scPred, was trained on multiple single cell RNA-seq datasets of human fetal kidney. A hierarchical model classified cellular subtypes into nephron, stroma and ureteric epithelial elements. This model, provided in the R package DevKidCC (github.com/KidneyRegeneration/DevKidCC), was then used to predict relative cell identity within published kidney organoid datasets generated using distinct cell lines and differentiation protocols, interrogating the impact of such variations. The package contains custom functions for the display of differential gene expression within cellular subtypes. Results DevKidCC was used to directly compare between distinct kidney organoid protocols, identifying differences in relative proportions of cell types at all hierarchical levels of the model and highlighting variations in stromal and unassigned cell types, nephron progenitor prevalence and relative maturation of individual epithelial segments. Of note, DevKidCC was able to distinguish distal nephron from ureteric epithelium, cell types with overlapping profiles that have previously confounded analyses. 
When applied to a variation in protocol via the addition of retinoic acid, DevKidCC identified a consequential depletion of nephron progenitors. Conclusions The application of DevKidCC to kidney organoids reproducibly classifies component cellular identity within distinct single-cell datasets. The application of the tool is summarised in an interactive Shiny application, as are examples of the utility of in-built functions for data presentation. This tool will enable the consistent and rapid comparison of kidney organoid protocols, driving improvements in patterning to kidney endpoints and validating new approaches. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-022-01023-z.
Affiliation(s)
- Sean B Wilson
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Victoria, Parkville, Australia
- Sara E Howden
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Victoria, Parkville, Australia
- Aude Dorison
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
- Jose Alquicira-Hernandez
  - Garvan-Weizmann Centre for Cellular Genomics, The Kinghorn Cancer Centre, Darlinghurst, New South Wales, Australia
- Joseph E Powell
  - Garvan-Weizmann Centre for Cellular Genomics, The Kinghorn Cancer Centre, Darlinghurst, New South Wales, Australia
  - UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, New South Wales, Australia
- Melissa H Little
  - Murdoch Children's Research Institute, Flemington Rd, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Victoria, Parkville, Australia
  - Department of Anatomy and Neuroscience, The University of Melbourne, Victoria, Parkville, Australia
  - Novo Nordisk Foundation Centre for Stem Cell Medicine, Copenhagen, Denmark
|
191
|
Chen X, Chen S, Song S, Gao Z, Hou L, Zhang X, Lv H, Jiang R. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat Mach Intell 2022. [DOI: 10.1038/s42256-021-00432-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 12/13/2022]
|
192
|
Tang H, Yu X, Liu R, Zeng T. Vec2image: an explainable artificial intelligence model for the feature representation and classification of high-dimensional biological data by vector-to-image conversion. Brief Bioinform 2022; 23:6518046. [PMID: 35106553 PMCID: PMC8921615 DOI: 10.1093/bib/bbab584] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/14/2021] [Revised: 12/06/2021] [Accepted: 12/20/2021] [Indexed: 01/05/2023] Open
Abstract
Feature representation and discriminative learning are proven models and technologies in artificial intelligence fields; however, major challenges for machine learning on large biological datasets are learning an effective model with mechanistical explanation on the model determination and prediction. To satisfy such demands, we developed Vec2image, an explainable convolutional neural network framework for characterizing the feature engineering, feature selection and classifier training that is mainly based on the collaboration of principal component coordinate conversion, deep residual neural networks and embedded k-nearest neighbor representation on pseudo images of high-dimensional biological data, where the pseudo images represent feature measurements and feature associations simultaneously. Vec2image has achieved better performance compared with other popular methods and illustrated its efficiency on feature selection in cell marker identification from tissue-specific single-cell datasets. In particular, in a case study on type 2 diabetes (T2D) by multiple human islet scRNA-seq datasets, Vec2image first displayed robust performance on T2D classification model building across different datasets, then a specific Vec2image model was trained to accurately recognize the cell state and efficiently rank feature genes relevant to T2D which uncovered potential T2D cellular pathogenesis; and next the cell activity changes, cell composition imbalances and cell–cell communication dysfunctions were associated to our finding T2D feature genes from both population-shared and individual-specific perspectives. Collectively, Vec2image is a new and efficient explainable artificial intelligence methodology that can be widely applied in human-readable classification and prediction on the basis of pseudo image representation of biological deep sequencing data.
Affiliation(s)
- Hui Tang
  - School of Mathematics, South China University of Technology, Guangzhou, 510640, China
- Xiangtian Yu
  - Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, 200233, China
- Rui Liu
  - School of Mathematics, South China University of Technology, Guangzhou, 510640, China
  - Pazhou Lab, Guangzhou 510330, China
- Tao Zeng
  - Guangzhou Laboratory, Guangzhou, China
  - Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
|
193
|
Amblard E, Bac J, Chervov A, Soumelis V, Zinovyev A. Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data. Bioinformatics 2022; 38:1045-1051. [PMID: 34871374 DOI: 10.1093/bioinformatics/btab795] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/10/2021] [Revised: 11/05/2021] [Accepted: 11/17/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Single-cell RNA-seq (scRNAseq) datasets are characterized by large ambient dimensionality, and their analyses can be affected by various manifestations of the dimensionality curse. One of these manifestations is the hubness phenomenon, i.e. the existence of data points with surprisingly large incoming connectivity degree in the datapoint neighbourhood graph. The conventional approach to dampening the unwanted effects of high dimension consists of applying drastic dimensionality reduction. It remains unexplored whether this step can be avoided, thus retaining more information than is contained in the low-dimensional projections, by directly correcting hubness. RESULTS We investigated hubness in scRNAseq data. We show that hub cells do not represent any visible technical or biological bias. The effect of various hubness reduction methods is investigated with respect to the clustering, trajectory inference and visualization tasks in scRNAseq datasets. We show that hubness reduction generates neighbourhood graphs with properties more suitable for applying machine learning methods; and that it outperforms other state-of-the-art methods for improving neighbourhood graphs. As a consequence, clustering, trajectory inference and visualization perform better, especially for datasets characterized by large intrinsic dimensionality. Hubness is an important phenomenon characterizing data point neighbourhood graphs computed for various types of sequencing datasets. Reducing hubness can be beneficial for the analysis of scRNAseq data with large intrinsic dimensionality, in which case it can be an alternative to drastic dimensionality reduction. AVAILABILITY AND IMPLEMENTATION The code used to analyze the datasets and produce the figures of this article is available from https://github.com/sysbio-curie/schubness. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Elise Amblard
  - Université de Paris, INSERM, HIPI, F-75010 Paris, France
- Jonathan Bac
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - INSERM, U900, F-75005 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
- Alexander Chervov
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - INSERM, U900, F-75005 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
- Andrei Zinovyev
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - INSERM, U900, F-75005 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603000 Nizhny Novgorod, Russia
|
194
|
Teng H, Yuan Y, Bar-Joseph Z. Clustering spatial transcriptomics data. Bioinformatics 2022; 38:997-1004. [PMID: 34623423 PMCID: PMC8796363 DOI: 10.1093/bioinformatics/btab704] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 04/14/2021] [Revised: 08/28/2021] [Accepted: 10/06/2021] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION Recent advancements in fluorescence in situ hybridization (FISH) techniques enable them to concurrently obtain information on the location and gene expression of single cells. A key question in the initial analysis of such spatial transcriptomics data is the assignment of cell types. To date, most studies used methods that only rely on the expression levels of the genes in each cell for such assignments. To fully utilize the data and to improve the ability to identify novel sub-types, we developed a new method, FICT, which combines both expression and neighborhood information when assigning cell types. RESULTS FICT optimizes a probabilistic function that we formalize and for which we provide learning and inference algorithms. We used FICT to analyze both simulated and several real spatial transcriptomics data. As we show, FICT can accurately identify cell types and sub-types, improving on expression only methods and other methods proposed for clustering spatial transcriptomics data. Some of the spatial sub-types identified by FICT provide novel hypotheses about the new functions for excitatory and inhibitory neurons. AVAILABILITY AND IMPLEMENTATION FICT is available at: https://github.com/haotianteng/FICT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Haotian Teng
  - Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Ye Yuan
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Ziv Bar-Joseph
  - Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
195
|
Liu B, Li Y, Zhang L. Analysis and Visualization of Spatial Transcriptomic Data. Front Genet 2022; 12:785290. [PMID: 35154244 PMCID: PMC8829434 DOI: 10.3389/fgene.2021.785290] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 09/29/2021] [Accepted: 12/24/2021] [Indexed: 12/21/2022] Open
Abstract
Human and animal tissues consist of heterogeneous cell types that organize and interact in highly structured manners. Bulk and single-cell sequencing technologies remove cells from their original microenvironments, resulting in a loss of spatial information. Spatial transcriptomics is a recent technological innovation that measures transcriptomic information while preserving spatial information. Spatial transcriptomic data can be generated in several ways. RNA molecules are measured by in situ sequencing, in situ hybridization, or spatial barcoding to recover original spatial coordinates. The inclusion of spatial information expands the range of possibilities for analysis and visualization, and spurred the development of numerous novel methods. In this review, we summarize the core concepts of spatial genomics technology and provide a comprehensive review of current analysis and visualization methods for spatial transcriptomics.
|
196
|
Li J, Sheng Q, Shyr Y, Liu Q. scMRMA: single cell multiresolution marker-based annotation. Nucleic Acids Res 2022; 50:e7. [PMID: 34648021 PMCID: PMC8789072 DOI: 10.1093/nar/gkab931] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/16/2021] [Revised: 09/10/2021] [Accepted: 09/28/2021] [Indexed: 01/22/2023] Open
Abstract
Single-cell RNA sequencing has become a powerful tool for identifying and characterizing cellular heterogeneity. One essential step to understanding cellular heterogeneity is determining cell identities. The widely used strategy predicts identities by projecting cells or cell clusters unidirectionally against a reference to find the best match. Here, we develop a bidirectional method, scMRMA, where a hierarchical reference guides iterative clustering and deep annotation with enhanced resolutions. Taking full advantage of the reference, scMRMA greatly improves the annotation accuracy. scMRMA achieved better performance than existing methods in four benchmark datasets and successfully revealed the expansion of CD8 T cell populations in squamous cell carcinoma after anti-PD-1 treatment.
Affiliation(s)
- Jia Li
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Quanhu Sheng
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Yu Shyr
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Qi Liu
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
|
197
|
Lin Y, Wu TY, Wan S, Yang JYH, Wong WH, Wang YXR. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat Biotechnol 2022; 40:703-710. [DOI: 10.1038/s41587-021-01161-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 04/05/2021] [Accepted: 11/16/2021] [Indexed: 12/11/2022]
|
198
|
Flores M, Liu Z, Zhang T, Hasib MM, Chiu YC, Ye Z, Paniagua K, Jo S, Zhang J, Gao SJ, Jin YF, Chen Y, Huang Y. Deep learning tackles single-cell analysis-a survey of deep learning for scRNA-seq analysis. Brief Bioinform 2022; 23:bbab531. [PMID: 34929734 PMCID: PMC8769926 DOI: 10.1093/bib/bbab531] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 09/16/2021] [Revised: 11/15/2021] [Accepted: 11/16/2021] [Indexed: 12/17/2022] Open
Abstract
Since its selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in data collected from single-cell profilings, resulting in computational challenges to process these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative for single-cell analyses besides the traditional machine learning approaches. Here, we survey a total of 25 DL algorithms and their applicability for a specific step in the single cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.
Affiliation(s)
- Mario Flores
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Zhentao Liu
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Tinghe Zhang
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Md Musaddaqui Hasib
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Yu-Chiao Chiu
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Zhenqing Ye
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Karla Paniagua
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Sumin Jo
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Jianqiu Zhang
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Shou-Jiang Gao
- Department of Microbiology and Molecular Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania, PA 15232, USA
- UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
- Yu-Fang Jin
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Yidong Chen
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Yufei Huang
- Department of Medicine, School of Medicine, University of Pittsburgh, PA 15232, USA
- UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
|
199
|
Watson ER, Taherian Fard A, Mar JC. Computational Methods for Single-Cell Imaging and Omics Data Integration. Front Mol Biosci 2022; 8:768106. [PMID: 35111809 PMCID: PMC8801747 DOI: 10.3389/fmolb.2021.768106] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/14/2021] [Accepted: 11/29/2021] [Indexed: 12/12/2022] Open
Abstract
Integrating single cell omics and single cell imaging allows for a more effective characterisation of the underlying mechanisms that drive a phenotype at the tissue level, creating a comprehensive profile at the cellular level. Although the use of imaging data is well established in biomedical research, its primary application has been to observe phenotypes at the tissue or organ level, often using medical imaging techniques such as MRI, CT, and PET. These imaging technologies complement omics-based data in biomedical research because they are helpful for identifying associations between genotype and phenotype, along with functional changes occurring at the tissue level. Single cell imaging can act as an intermediary between these levels. Meanwhile new technologies continue to arrive that can be used to interrogate the genome of single cells and its related omics datasets. As these two areas, single cell imaging and single cell omics, each advance independently with the development of novel techniques, the opportunity to integrate these data types becomes more and more attractive. This review outlines some of the technologies and methods currently available for generating, processing, and analysing single-cell omics- and imaging data, and how they could be integrated to further our understanding of complex biological phenomena like ageing. We include an emphasis on machine learning algorithms because of their ability to identify complex patterns in large multidimensional data.
Affiliation(s)
- Atefeh Taherian Fard
- Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, Australia
- Jessica Cara Mar
- Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, Australia
|
200
|
Nguyen V, Griss J. scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data. BMC Bioinformatics 2022; 23:44. [PMID: 35038984 PMCID: PMC8762856 DOI: 10.1186/s12859-022-04574-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/30/2021] [Accepted: 01/11/2022] [Indexed: 12/02/2022] Open
Abstract
Background: Automatic cell type identification is essential to alleviate a key bottleneck in scRNA-seq data analysis. While most existing classification tools show good sensitivity and specificity, they often fail to adequately not-classify cells that are missing in the used reference. Additionally, many tools do not scale to the continuously increasing size of current scRNA-seq datasets. Therefore, additional tools are needed to solve these challenges. Results: scAnnotatR is a novel R package that provides a complete framework to classify cells in scRNA-seq datasets using pre-trained classifiers. It supports both Seurat and Bioconductor’s SingleCellExperiment and is thereby compatible with the vast majority of R-based analysis workflows. scAnnotatR uses hierarchically organised SVMs to distinguish a specific cell type versus all others. It shows comparable or even superior accuracy, sensitivity and specificity compared to existing tools while being able to not-classify unknown cell types. Moreover, scAnnotatR is the only of the best performing tools able to process datasets containing more than 600,000 cells. Conclusions: scAnnotatR is freely available on GitHub (https://github.com/grisslab/scAnnotatR) and through Bioconductor (from version 3.14). It is consistently among the best performing tools in terms of classification accuracy while scaling to the largest datasets. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04574-5.
Affiliation(s)
- Vy Nguyen
- Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090, Vienna, Austria
- Johannes Griss
- Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090, Vienna, Austria
|