1. Goss K, Horwitz EM. Single-cell multiomics to advance cell therapy. Cytotherapy 2025; 27:137-145. [PMID: 39530970 DOI: 10.1016/j.jcyt.2024.10.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Revised: 10/21/2024] [Accepted: 10/21/2024] [Indexed: 11/16/2024]
Abstract
Single-cell RNA-sequencing (scRNAseq) was first introduced in 2009 and has evolved with many technological advancements over the last decade. Not only are there several scRNAseq platforms differing in many aspects, but there are also a large number of computational pipelines available for downstream analyses which are being developed at an exponential rate. Such computational data appear in many scientific publications in virtually every field of study; thus, investigators should be able to understand and interpret data in this rapidly evolving field. Here, we discuss key differences in scRNAseq platforms, crucial steps in scRNAseq experiments, standard downstream analyses and introduce newly developed multimodal approaches. We then discuss how single-cell omics has been applied to advance the field of cell therapy.
Affiliation(s)
- Kyndal Goss
- Marcus Center for Advanced Cellular Therapy, Children's Healthcare of Atlanta, Atlanta, Georgia, USA; Aflac Cancer & Blood Disorders Center, Children's Healthcare of Atlanta, Atlanta, Georgia, USA; Graduate Division of Biology and Biomedical Sciences, Emory University Laney Graduate School, Atlanta, Georgia, USA
- Edwin M Horwitz
- Marcus Center for Advanced Cellular Therapy, Children's Healthcare of Atlanta, Atlanta, Georgia, USA; Aflac Cancer & Blood Disorders Center, Children's Healthcare of Atlanta, Atlanta, Georgia, USA; Department of Pediatrics, Emory University School of Medicine, Atlanta, Georgia, USA; Graduate Division of Biology and Biomedical Sciences, Emory University Laney Graduate School, Atlanta, Georgia, USA.
2. Sharifitabar M, Kazempour S, Razavian J, Sajedi S, Solhjoo S, Zare H. A deep neural network to de-noise single-cell RNA sequencing data. bioRxiv 2024:2024.11.20.624552. [PMID: 39605470 PMCID: PMC11601639 DOI: 10.1101/2024.11.20.624552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq), a powerful technique for investigating the transcriptome of individual cells, enables the discovery of heterogeneous cell populations, rare cell types, and transcriptional dynamics in separate cells. Yet, scRNA-seq data analysis is limited by the problem of measurement dropouts, i.e., genes displaying zero expression levels. We introduce ZiPo, a deep artificial neural network for rate estimation and library size prediction in scRNA-seq data which incorporates adjustable zero inflation in the distribution to capture the dropouts. ZiPo builds upon established concepts, including using deep autoencoders and adopting the Poisson and negative binomial distributions, by taking advantage of novel strategies, including library size prediction and residual connections, to improve the overall performance. A significant innovation of ZiPo is the introduction of a scale-invariant loss term, making the weights sparse and, hence, the model biologically more interpretable. ZiPo quickly handles vast singular and mixed datasets, with the processing time directly proportional to the number of cells. In this paper, we demonstrate the power of ZiPo on three datasets and show its advantages over other current techniques. The code used to produce the results in this manuscript is available at https://bitbucket.org/habilzare/alzheimer/src/master/code/deep/ZiPo/.
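The zero-inflated Poisson idea named in this abstract can be illustrated with a minimal sketch. This is not the ZiPo network or its loss: the function names `zip_log_pmf` and `zip_negative_log_likelihood` are hypothetical, and the per-entry rate and dropout probability are assumed to be given rather than predicted by an autoencoder.

```python
import math


def zip_log_pmf(count, rate, zero_prob):
    """Log-probability of an observed count under a zero-inflated Poisson.

    rate      -- Poisson mean for this gene/cell entry (assumed given)
    zero_prob -- probability that the entry is a dropout zero (assumed given)
    """
    if rate <= 0.0 or not (0.0 <= zero_prob < 1.0):
        raise ValueError("need rate > 0 and 0 <= zero_prob < 1")
    if count == 0:
        # A zero is either a dropout or a genuine Poisson zero.
        return math.log(zero_prob + (1.0 - zero_prob) * math.exp(-rate))
    # Non-zero counts can only come from the Poisson component.
    log_poisson = count * math.log(rate) - rate - math.lgamma(count + 1)
    return math.log(1.0 - zero_prob) + log_poisson


def zip_negative_log_likelihood(counts, rates, zero_probs):
    """Sum of negative log-probabilities over matching entries."""
    return -sum(
        zip_log_pmf(k, r, p) for k, r, p in zip(counts, rates, zero_probs)
    )
```

With `zero_prob` set to 0 the expression reduces to the ordinary Poisson log-likelihood, which is one way to see that the zero-inflation term is an adjustable add-on to the base count model.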
3. Özden F, Minary P. Learning to quantify uncertainty in off-target activity for CRISPR guide RNAs. Nucleic Acids Res 2024; 52:e87. [PMID: 39275984 PMCID: PMC11472043 DOI: 10.1093/nar/gkae759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 08/07/2024] [Accepted: 08/23/2024] [Indexed: 09/16/2024] Open
Abstract
CRISPR-based genome editing technologies have revolutionised the field of molecular biology, offering unprecedented opportunities for precise genetic manipulation. However, off-target effects remain a significant challenge, potentially leading to unintended consequences and limiting the applicability of CRISPR-based genome editing technologies in clinical settings. Current literature predominantly focuses on point predictions for off-target activity, which may not fully capture the range of possible outcomes and associated risks. Here, we present crispAI, a neural network architecture-based approach for predicting uncertainty estimates for off-target cleavage activity, providing a more comprehensive risk assessment and facilitating improved decision-making in single guide RNA (sgRNA) design. Our approach makes use of the count noise model Zero Inflated Negative Binomial (ZINB) to model the uncertainty in the off-target cleavage activity data. In addition, we present the first-of-its-kind genome-wide sgRNA efficiency score, crispAI-aggregate, enabling prioritization among sgRNAs with similar point aggregate predictions by providing richer information compared to existing aggregate scores. We show that uncertainty estimates of our approach are calibrated and its predictive performance is superior to the state-of-the-art in silico off-target cleavage activity prediction methods. The tool and the trained models are available at https://github.com/furkanozdenn/crispr-offtarget-uncertainty.
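To make the role of the Zero Inflated Negative Binomial (ZINB) noise model concrete, the following sketch evaluates a ZINB probability mass function and its mean and variance, which is the kind of quantity an uncertainty estimate can be read from. It is an illustration only, not crispAI code: the function names and the mean/inverse-dispersion parameterisation (`mu`, `theta`, `pi`) are assumptions chosen for the example.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Negative binomial log-pmf with mean mu and inverse dispersion theta."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def zinb_pmf(k, mu, theta, pi):
    """ZINB probability of count k; pi is the extra-zero probability."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    if k == 0:
        # Zeros come from the inflation component or from the NB itself.
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def zinb_mean_variance(mu, theta, pi):
    """Closed-form mean and variance of the ZINB distribution."""
    mean = (1.0 - pi) * mu
    variance = (1.0 - pi) * (mu + mu * mu / theta) + pi * (1.0 - pi) * mu * mu
    return mean, variance
```

Two predictions with the same mean can carry very different variances depending on `theta` and `pi`, which is the information a point prediction alone discards.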
Affiliation(s)
- Furkan Özden
- Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK
- Peter Minary
- Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK
4. Tiong KL, Luzhbin D, Yeang CH. Assessing transcriptomic heterogeneity of single-cell RNASeq data by bulk-level gene expression data. BMC Bioinformatics 2024; 25:209. [PMID: 38867193 PMCID: PMC11167951 DOI: 10.1186/s12859-024-05825-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Accepted: 06/03/2024] [Indexed: 06/14/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing (sc-RNASeq) data illuminate transcriptomic heterogeneity but also possess a high level of noise, abundant missing entries and sometimes inadequate or no cell type annotations at all. Bulk-level gene expression data lack direct information of cell population composition but are more robust and complete and often better annotated. We propose a modeling framework to integrate bulk-level and single-cell RNASeq data to address the deficiencies and leverage the mutual strengths of each type of data and enable a more comprehensive inference of their transcriptomic heterogeneity. Contrary to the standard approaches of factorizing the bulk-level data with one algorithm and (for some methods) treating single-cell RNASeq data as references to decompose bulk-level data, we employed multiple deconvolution algorithms to factorize the bulk-level data, constructed the probabilistic graphical models of cell-level gene expressions from the decomposition outcomes, and compared the log-likelihood scores of these models in single-cell data. We term this framework backward deconvolution as inference operates from coarse-grained bulk-level data to fine-grained single-cell data. As the abundant missing entries in sc-RNASeq data have a significant effect on log-likelihood scores, we also developed a criterion for inclusion or exclusion of zero entries in log-likelihood score computation. RESULTS We selected nine deconvolution algorithms and validated backward deconvolution in five datasets. In the in-silico mixtures of mouse sc-RNASeq data, the log-likelihood scores of the deconvolution algorithms were strongly anticorrelated with their errors of mixture coefficients and cell type specific gene expression signatures. In the true bulk-level mouse data, the sample mixture coefficients were unknown but the log-likelihood scores were strongly correlated with accuracy rates of inferred cell types. 
In the data of autism spectrum disorder (ASD) and normal controls, we found that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGN-expressing neurons than normal controls. In datasets of breast cancer and low-grade gliomas (LGG), we compared the log-likelihood scores of three simple hypotheses about the gene expression patterns of the cell types underlying the tumor subtypes. The model that tumors of each subtype were dominated by one cell type persistently outperformed an alternative model that each cell type had elevated expression in one gene group and tumors were mixtures of those cell types. Superiority of the former model is also supported by comparing the real breast cancer sc-RNASeq clusters with those generated by simulated sc-RNASeq data. CONCLUSIONS The results indicate that backward deconvolution serves as a sensible model selection tool for deconvolution algorithms and facilitates discerning hypotheses about cell type compositions underlying heterogeneous specimens such as tumors.
Affiliation(s)
- Khong-Loon Tiong
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Dmytro Luzhbin
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
5. Wang L, Hong C, Song J, Yao J. CTEC: a cross-tabulation ensemble clustering approach for single-cell RNA sequencing data analysis. Bioinformatics 2024; 40:btae130. [PMID: 38552307 PMCID: PMC10985676 DOI: 10.1093/bioinformatics/btae130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 02/11/2024] [Indexed: 04/04/2024] Open
Abstract
MOTIVATION Cell-type clustering is a crucial first step for single-cell RNA-seq data analysis. However, existing clustering methods often produce different cluster assignments depending on their data pre-processing, choice of distance metrics, and feature extraction strategies, thereby limiting their practical applications. RESULTS We propose the Cross-Tabulation Ensemble Clustering (CTEC) method, which formulates two re-clustering strategies (distribution- and outlier-based) via cross-tabulation. Benchmarking experiments on five scRNA-Seq datasets illustrate that the proposed CTEC method offers significant improvements over the individual clustering methods. Moreover, CTEC-DB outperforms the state-of-the-art ensemble methods for single-cell data clustering, with 45.4% and 17.1% improvement over the single-cell aggregated from ensemble clustering method (SAFE) and the single-cell aggregated clustering via Mixture model ensemble method (SAME), respectively, on the two-method ensemble test. AVAILABILITY AND IMPLEMENTATION The source code of the benchmark in this work is available at the GitHub repository https://github.com/LWCHN/CTEC.git.
Affiliation(s)
- Chenyang Hong
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, 999077, China
- Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
6. Wusiman D, Li W, Guo L, Huang Z, Zhang Y, Zhang X, Zhao X, Li L, An Z, Li Z, Ying J, An C. Comprehensive analysis of single-cell and bulk RNA-sequencing data identifies B cell marker genes signature that predicts prognosis and analysis of immune checkpoints expression in head and neck squamous cell carcinoma. Heliyon 2023; 9:e22656. [PMID: 38125461 PMCID: PMC10731009 DOI: 10.1016/j.heliyon.2023.e22656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 11/13/2023] [Accepted: 11/16/2023] [Indexed: 12/23/2023] Open
Abstract
Recent studies have shown that B cells and the associated tertiary lymphoid structures (TLS) correlate with the response of patients to immune checkpoint inhibitors (ICIs) and predict overall survival (OS) in cancer patients. We screened 145 B cell marker genes (BCMG) by a comprehensive analysis of single-cell RNA-sequencing (scRNA-seq) data of head and neck squamous cell carcinoma (HNSC) from the Gene Expression Omnibus (GEO) database. The BCMG signature (BCMGS) was established using The Cancer Genome Atlas (TCGA) dataset of HNSC and verified in four independent datasets. Multivariate Cox regression analysis identified the signature as an independent prognostic factor. A prognostic nomogram was constructed with independent prognostic factors using the TCGA dataset. GO and KEGG analyses revealed the underlying signaling pathways related to this signature. Study of immune profiles showed that patients in the low-risk group presented distinctive immune-cell infiltrations. Furthermore, the low-risk group was characterized by higher TCR and BCR diversity, which suggested that low-risk patients may be more sensitive to ICIs. Immunohistochemistry was performed, and we found that high expression of FTH1 was significantly correlated with poor OS (P = 0.025). The expression of TIM-3, LAG-3 and PD-1 was positively correlated and associated with better OS in HNSC. However, there was no statistically significant association between the expression of PD-L1, PD-L2, CTLA-4 or TIGIT and prognosis. The BCMGS was a promising prognostic biomarker in HNSC, which may help to interpret responses to immunotherapy and provide a new perspective for future research on the treatment of HNSC.
Affiliation(s)
- Dilinaer Wusiman
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Wenbin Li
- Department of Pathology, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Lei Guo
- Department of Pathology, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Zehao Huang
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Yi Zhang
- Department of Pathology, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Xiwei Zhang
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Xiaohui Zhao
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Lin Li
- Department of Pathology, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Zhaohong An
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Zhengjiang Li
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Jianming Ying
- Department of Pathology, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Changming An
- Department of Head and Neck Surgery, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
7. van den Oord EJCG, Aberg KA. Fine-grained cell-type specific association studies with human bulk brain data using a large single-nucleus RNA sequencing based reference panel. Sci Rep 2023; 13:13004. [PMID: 37563216 PMCID: PMC10415334 DOI: 10.1038/s41598-023-39864-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/25/2022] [Accepted: 08/01/2023] [Indexed: 08/12/2023] Open
Abstract
Brain disorders are leading causes of disability worldwide. Gene expression studies provide promising opportunities to better understand their etiology but it is critical that expression is studied on a cell-type level. Cell-type specific association studies can be performed with bulk expression data using statistical methods that capitalize on cell-type proportions estimated with the help of a reference panel. To create a fine-grained reference panel for the human prefrontal cortex, we performed an integrated analysis of the seven largest single nucleus RNA-seq studies. Our panel included 17 cell-types that were robustly detected across all studies, subregions of the prefrontal cortex, and sex and age groups. To estimate the cell-type proportions, we used an empirical Bayes estimator that substantially outperformed three estimators recommended previously after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data. This is important as being able to precisely estimate the cell-type proportions may avoid unreliable results in downstream analyses particularly for the multiple cell-types that had low abundances. Transcriptome-wide association studies performed with permuted bulk expression data showed that it is possible to perform transcriptome-wide association studies for even the rarest cell-types without an increased risk of false positives.
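The idea of estimating cell-type proportions from bulk expression with a reference panel can be shown with a deliberately small toy case. The sketch below is not the empirical Bayes estimator used in the paper: it assumes only two cell types, assumes bulk expression is an exact linear mixture of the two reference signatures, and uses plain constrained least squares. The function name `estimate_two_type_proportion` is hypothetical.

```python
def estimate_two_type_proportion(bulk, signature_a, signature_b):
    """Least-squares fraction of cell type A in a two-type mixture.

    Model (an assumption of this sketch):
        bulk[g] ~= p * signature_a[g] + (1 - p) * signature_b[g]
    Returns p clipped to the interval [0, 1].
    """
    if not (len(bulk) == len(signature_a) == len(signature_b)):
        raise ValueError("all inputs must cover the same genes")
    # Rewrite the model as (bulk - b) ~= p * (a - b) and solve for p.
    diff = [a - b for a, b in zip(signature_a, signature_b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("signatures are identical; p is not identifiable")
    numer = sum((x - b) * d for x, b, d in zip(bulk, signature_b, diff))
    # The objective is a one-dimensional convex quadratic, so clipping the
    # unconstrained optimum gives the constrained optimum.
    return min(1.0, max(0.0, numer / denom))
```

Genes on which the two signatures agree contribute nothing to the estimate, which hints at why rare or poorly distinguished cell types are hard to quantify and why estimator choice matters for a 17-type panel.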
Affiliation(s)
- Edwin J C G van den Oord
- Center for Biomarker Research and Precision Medicine, Virginia Commonwealth University, McGuire Hall, Room 216A, 1112 East Clay Street, P. O. Box 980533, Richmond, VA, 23298-0581, USA.
- Karolina A Aberg
- Center for Biomarker Research and Precision Medicine, Virginia Commonwealth University, McGuire Hall, Room 216A, 1112 East Clay Street, P. O. Box 980533, Richmond, VA, 23298-0581, USA
8. Pan Y, Landis JT, Moorad R, Wu D, Marron JS, Dittmer DP. The Poisson distribution model fits UMI-based single-cell RNA-sequencing data. BMC Bioinformatics 2023; 24:256. [PMID: 37330471 PMCID: PMC10276395 DOI: 10.1186/s12859-023-05349-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 05/24/2023] [Indexed: 06/19/2023] Open
Abstract
BACKGROUND Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. RESULTS We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. CONCLUSIONS This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
Affiliation(s)
- Yue Pan
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Justin T Landis
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Razia Moorad
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Di Wu
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Adam School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, USA
- J S Marron
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Dirk P Dittmer
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA.
- Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA.
9. Jee DJ, Kong Y, Chun H. Deep Nonnegative Matrix Factorization Using a Variational Autoencoder With Application to Single-Cell RNA Sequencing Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:883-893. [PMID: 35511832 DOI: 10.1109/tcbb.2022.3172723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Single-cell RNA sequencing is used to analyze the gene expression data of individual cells, thereby adding to existing knowledge of biological phenomena. Accordingly, this technology is widely used in numerous biomedical studies. Recently, the variational autoencoder has emerged and has been adopted for the analysis of single-cell data owing to its high capacity to manage large-scale data. Many different variants of the variational autoencoder have been applied, and have yielded superior results. However, because it is nonlinear, the model does not provide parameters that can be used to explain the underlying biological patterns. In this paper, we propose an interpretable nonnegative matrix factorization method that decomposes parameters into those shared across cells and those that are cell-specific. Effective nonlinear dimension reduction was achieved via a variational autoencoder applied to the cell-specific parameters. In addition to achieving nonlinear dimension reduction, our model could estimate the cell-type-specific gene expression. To improve the estimation accuracy, we introduced log-regularization, which reflects the single-cell property. Overall, our approach displayed excellent performance in a simulation study and in real data analyses, while maintaining good biological interpretability.
10. Karikomi M, Zhou P, Nie Q. DURIAN: an integrative deconvolution and imputation method for robust signaling analysis of single-cell transcriptomics data. Brief Bioinform 2022; 23:6609525. [PMID: 35709795 PMCID: PMC9294432 DOI: 10.1093/bib/bbac223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Revised: 04/29/2022] [Accepted: 05/11/2022] [Indexed: 01/31/2023] Open
Abstract
Single-cell RNA sequencing trades read-depth for dimensionality, often leading to loss of critical signaling gene information that is typically present in bulk data sets. We introduce DURIAN (Deconvolution and mUltitask-Regression-based ImputAtioN), an integrative method for recovery of gene expression in single-cell data. Through systematic benchmarking, we demonstrate the accuracy, robustness and empirical convergence of DURIAN using both synthetic and published data sets. We show that use of DURIAN improves single-cell clustering, low-dimensional embedding, and recovery of intercellular signaling networks. Our study resolves several inconsistent results of cell-cell communication analysis using single-cell or bulk data independently. The method has broad application in biomarker discovery and cell signaling analysis using single-cell transcriptomics data sets.
Affiliation(s)
- Peijie Zhou
- Corresponding authors: Peijie Zhou, 540P Rowland Hall, University of California Irvine, Irvine CA 92697, USA. Tel: 949-824-5530; Fax: 949-8247993. Qing Nie, 540F Rowland Hall, University of California Irvine, Irvine CA 92697, USA. Tel: 949-824-5530; Fax: 949-8247993.
- Qing Nie
- Corresponding authors: Peijie Zhou, 540P Rowland Hall, University of California Irvine, Irvine CA 92697, USA. Tel: 949-824-5530; Fax: 949-8247993. Qing Nie, 540F Rowland Hall, University of California Irvine, Irvine CA 92697, USA. Tel: 949-824-5530; Fax: 949-8247993.
11. Ni Z, Zheng X, Zheng X, Zou X. scLRTD: A Novel Low Rank Tensor Decomposition Method for Imputing Missing Values in Single-Cell Multi-Omics Sequencing Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1144-1153. [PMID: 32960767 DOI: 10.1109/tcbb.2020.3025804] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
With the successful application of single-cell sequencing technology, a large amount of single-cell multi-omics sequencing (scMO-seq) data has been generated, enabling researchers to study heterogeneity between individual cells. One prominent problem in single-cell data analysis is the prevalence of dropouts, caused by failures in amplification during the experiments. It is necessary to develop effective approaches for imputing the missing values. Unlike general methods that impute a single type of single-cell data, we propose an imputation method called scLRTD, which uses low-rank tensor decomposition based on the nuclear norm to impute scMO-seq data and single-cell RNA-sequencing (scRNA-seq) data from different stages, tissues or conditions. Four sets of simulated data and two sets of real scRNA-seq data, from mouse embryonic stem cells and hepatocellular carcinoma respectively, are used to carry out numerical experiments, and the results are compared with six other published methods. Error accuracy and clustering results demonstrate the effectiveness of the proposed method. Moreover, we clearly identify two cell subpopulations after imputing the real scMO-seq data from hepatocellular carcinoma. Further, Gene Ontology analysis identifies 7 genes in the bile secretion pathway, which is related to metabolism in hepatocellular carcinoma. Survival analysis using the TCGA database also shows that the two cell subpopulations identified after imputation have distinct survival rates.
12. Jiang R, Sun T, Song D, Li JJ. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol 2022; 23:31. [PMID: 35063006 PMCID: PMC8783472 DOI: 10.1186/s13059-022-02601-5] [Citation(s) in RCA: 178] [Impact Index Per Article: 59.3] [Received: 09/27/2021] [Accepted: 01/04/2022] [Indexed: 12/13/2022] Open
Abstract
Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.
Affiliation(s)
- Ruochen Jiang
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Tianyi Sun
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, 90095-7246, CA, USA
- Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA.
13. Bartlett TE, Jia P, Chandna S, Roy S. Inference of tissue relative proportions of the breast epithelial cell types luminal progenitor, basal, and luminal mature. Sci Rep 2021; 11:23702. [PMID: 34880407 PMCID: PMC8655091 DOI: 10.1038/s41598-021-03161-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/15/2021] [Accepted: 11/26/2021] [Indexed: 12/15/2022] Open
Abstract
Single-cell analysis has revolutionised genomic science in recent years. However, due to cost and other practical considerations, single-cell analyses are impossible for studies based on medium or large patient cohorts. For example, a single-cell analysis usually costs thousands of euros for one tissue sample from one volunteer, meaning that typical studies using single-cell analyses are based on very few individuals. While single-cell genomic data can be used to examine the phenotype of individual cells, cell-type deconvolution methods are required to track the quantities of these cells in bulk-tissue genomic data. Hormone receptor negative breast cancers are highly aggressive, and are thought to originate from a subtype of epithelial cells called the luminal progenitor. In this paper, we show how to quantify the number of luminal progenitor cells as well as other epithelial subtypes in breast tissue samples using DNA and RNA based measurements. We find elevated levels of cells which resemble these hormone receptor negative luminal progenitor cells in breast tumour biopsies of hormone receptor negative cancers, as well as in healthy breast tissue samples from BRCA1 (FANCS) mutation carriers. We also find that breast tumours from carriers of heterozygous mutations in non-BRCA Fanconi Anaemia pathway genes are much more likely to be hormone receptor negative. These findings have implications for understanding hormone receptor negative breast cancers, and for breast cancer screening in carriers of heterozygous mutations of Fanconi Anaemia pathway genes.
Affiliation(s)
- Thomas E Bartlett
- Department of Statistical Science, University College London, London, UK.
| | - Peiwen Jia
- Department of Statistical Science, University College London, London, UK
| | - Swati Chandna
- Department of Economics, Mathematics and Statistics, Birkbeck University of London, London, UK
| | - Sandipan Roy
- Department of Mathematical Sciences, University of Bath, Bath, UK
| |
|
14
|
Wang J, Roeder K, Devlin B. Bayesian estimation of cell type-specific gene expression with prior derived from single-cell data. Genome Res 2021; 31:1807-1818. [PMID: 33837133 PMCID: PMC8494232 DOI: 10.1101/gr.268722.120] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 08/05/2020] [Accepted: 03/31/2021] [Indexed: 11/25/2022]
Abstract
When assessed over a large number of samples, bulk RNA sequencing provides reliable data for gene expression at the tissue level. Single-cell RNA sequencing (scRNA-seq) deepens those analyses by evaluating gene expression at the cellular level. Both data types lend insights into disease etiology. With current technologies, scRNA-seq data are known to be noisy. Constrained by costs, scRNA-seq data are typically generated from a relatively small number of subjects, which limits their utility for some analyses, such as identification of gene expression quantitative trait loci (eQTLs). To address these issues while maintaining the unique advantages of each data type, we develop a Bayesian method (bMIND) to integrate bulk and scRNA-seq data. With a prior derived from scRNA-seq data, we propose to estimate sample-level cell type-specific (CTS) expression from bulk expression data. The CTS expression enables large-scale sample-level downstream analyses, such as detection of CTS differentially expressed genes (DEGs) and eQTLs. Through simulations, we show that bMIND improves the accuracy of sample-level CTS expression estimates and increases the power to discover CTS DEGs when compared to existing methods. To further our understanding of two complex phenotypes, autism spectrum disorder and Alzheimer's disease, we apply bMIND to gene expression data of relevant brain tissue to identify CTS DEGs. Our results complement findings for CTS DEGs obtained from snRNA-seq studies, replicating certain DEGs in specific cell types while nominating other novel genes for those cell types. Finally, we calculate CTS eQTLs for 11 brain regions by analyzing Genotype-Tissue Expression Project data, creating a new resource for biological insights.
Affiliation(s)
- Jiebiao Wang
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA
| | - Kathryn Roeder
- Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
| | - Bernie Devlin
- Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania 15213, USA
| |
|
15
|
Yang Y, Li G, Xie Y, Wang L, Lagler TM, Yang Y, Liu J, Qian L, Li Y. iSMNN: batch effect correction for single-cell RNA-seq data via iterative supervised mutual nearest neighbor refinement. Brief Bioinform 2021; 22:bbab122. [PMID: 33839756 PMCID: PMC8579191 DOI: 10.1093/bib/bbab122] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/17/2020] [Revised: 02/26/2021] [Accepted: 03/15/2021] [Indexed: 01/23/2023] Open
Abstract
Batch effect correction is an essential step in the integrative analysis of multiple single-cell RNA-sequencing (scRNA-seq) data. One state-of-the-art strategy for batch effect correction is via unsupervised or supervised detection of mutual nearest neighbors (MNNs). However, both types of methods only detect MNNs across batches of uncorrected data, where the large batch effects may affect the MNN search. To address this issue, we presented a batch effect correction approach via iterative supervised MNN (iSMNN) refinement across data after correction. Our benchmarking on both simulation and real datasets showed the advantages of the iterative refinement of MNNs on the performance of correction. Compared to popular alternative methods, our iSMNN is able to better mix the cells of the same cell type across batches. In addition, iSMNN can also facilitate the identification of differentially expressed genes (DEGs) that are relevant to the biological function of certain cell types. These results indicated that iSMNN will be a valuable method for integrating multiple scRNA-seq datasets that can facilitate biological and medical studies at single-cell level.
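As a rough illustration of the mutual nearest neighbor (MNN) detection step that this abstract builds on, the following Python sketch pairs cells across two batches. It is not the iSMNN implementation: the function names, the toy coordinates and the use of plain Euclidean distance are assumptions made for illustration, and the iterative, supervised refinement described above is not reproduced.

```python
import math

def k_nearest(query, reference, k):
    """For each query point, return the set of indices of its k nearest reference points."""
    neighbors = []
    for q in query:
        # Rank reference points by Euclidean distance to the query point.
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return sorted (i, j) pairs where cell i of batch A and cell j of batch B
    are each within the other's k nearest neighbors in the opposite batch."""
    a_to_b = k_nearest(batch_a, batch_b, k)
    b_to_a = k_nearest(batch_b, batch_a, k)
    return sorted(
        (i, j)
        for i in range(len(batch_a))
        for j in a_to_b[i]
        if i in b_to_a[j]
    )

# Hypothetical toy data: two cells in batch A, three in batch B (2-D embeddings).
batch_a = [(0.0, 0.0), (5.0, 5.0)]
batch_b = [(0.5, 0.2), (5.5, 5.1), (20.0, 20.0)]
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)
```

In this toy case the third cell of batch B has a nearest neighbor in batch A, but that neighbor prefers a closer cell, so the pair is not mutual and is left out. MNN-based corrections use only the mutual pairs as anchors.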
Affiliation(s)
- Yuchen Yang
- Department of Pathology and Laboratory Medicine and McAllister Heart Institute at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Gang Li
- Department of Statistics and Operations Research at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Yifang Xie
- Department of Pathology and Laboratory Medicine at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Li Wang
- Department of Pathology and Laboratory Medicine and McAllister Heart Institute at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Taylor M Lagler
- Department of Biostatistics at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Yingxi Yang
- Department of Statistics at the Sun Yat-sen University, NC 27599, USA
| | - Jiandong Liu
- Department of Pathology and Laboratory Medicine and McAllister Heart Institute at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Li Qian
- Department of Pathology and Laboratory Medicine and McAllister Heart Institute at the University of North Carolina at Chapel Hill, NC 27599, USA
| | - Yun Li
- Departments of Genetics, Biostatistics and Computer Science at the University of North Carolina at Chapel Hill, NC 27599, USA
| |
|
16
|
Patruno L, Maspero D, Craighero F, Angaroni F, Antoniotti M, Graudenzi A. A review of computational strategies for denoising and imputation of single-cell transcriptomic data. Brief Bioinform 2021; 22:bbaa222. [PMID: 33003202 DOI: 10.1093/bib/bbaa222] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 05/26/2020] [Revised: 08/07/2020] [Accepted: 08/19/2020] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION The advancements of single-cell sequencing methods have paved the way for the characterization of cellular states at unprecedented resolution, revolutionizing the investigation on complex biological systems. Yet, single-cell sequencing experiments are hindered by several technical issues, which cause output data to be noisy, impacting the reliability of downstream analyses. Therefore, a growing number of data science methods has been proposed to recover lost or corrupted information from single-cell sequencing data. To date, however, no quantitative benchmarks have been proposed to evaluate such methods. RESULTS We present a comprehensive analysis of the state-of-the-art computational approaches for denoising and imputation of single-cell transcriptomic data, comparing their performance in different experimental scenarios. In detail, we compared 19 denoising and imputation methods, on both simulated and real-world datasets, with respect to several performance metrics related to imputation of dropout events, recovery of true expression profiles, characterization of cell similarity, identification of differentially expressed genes and computation time. The effectiveness and scalability of all methods were assessed with regard to distinct sequencing protocols, sample size and different levels of biological variability and technical noise. As a result, we identify a subset of versatile approaches exhibiting solid performances on most tests and show that certain algorithmic families prove effective on specific tasks but inefficient on others. Finally, most methods appear to benefit from the introduction of appropriate assumptions on noise distribution of biological processes.
Affiliation(s)
- Lucrezia Patruno
- Department of Informatics, Systems and Communication of the University of Milan-Bicocca
| | - Davide Maspero
- Department of Informatics, Systems and Communication of the University of Milan-Bicocca
| | - Francesco Craighero
- Department of Informatics, Systems and Communication of the University of Milan-Bicocca
| | - Fabrizio Angaroni
- Department of Informatics, Systems and Communication of the University of Milan-Bicocca
| | - Marco Antoniotti
- Department of Informatics, Systems and Communication of the University of Milan-Bicocca
| | - Alex Graudenzi
- Department of Informatics, Systems and Communication of the University of Milan-Bicocca
| |
|
17
|
Cui Y, Zhang S, Liang Y, Wang X, Ferraro TN, Chen Y. Consensus clustering of single-cell RNA-seq data by enhancing network affinity. Brief Bioinform 2021; 22:6308199. [PMID: 34160582 PMCID: PMC8574980 DOI: 10.1093/bib/bbab236] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/04/2021] [Revised: 05/29/2021] [Accepted: 06/01/2021] [Indexed: 12/18/2022] Open
Abstract
Elucidation of cell subpopulations at high resolution is a key and challenging goal of single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) data analysis. Although unsupervised clustering methods have been proposed for de novo identification of cell populations, their performance and robustness suffer from the high variability, low capture efficiency and high dropout rates which are characteristic of scRNA-seq experiments. Here, we present a novel unsupervised method for Single-cell Clustering by Enhancing Network Affinity (SCENA), which mainly employed three strategies: selecting multiple gene sets, enhancing local affinity among cells and clustering of consensus matrices. Large-scale validations on 13 real scRNA-seq datasets show that SCENA has high accuracy in detecting cell populations and is robust against dropout noise. When we applied SCENA to large-scale scRNA-seq data of mouse brain cells, known cell types were successfully detected, and novel cell types of interneurons were identified with differential expression of gamma-aminobutyric acid receptor subunits and transporters. SCENA is equipped with CPU + GPU (Central Processing Units + Graphics Processing Units) heterogeneous parallel computing to achieve high running speed. The high performance and running speed of SCENA combine into a new and efficient platform for biological discoveries in clustering analysis of large and diverse scRNA-seq datasets.
Affiliation(s)
- Yaxuan Cui
- College of Computer and Information Engineering, Tianjin Normal University, China
| | - Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, China
| | - Ying Liang
- College of Computer and Information Engineering, Tianjin Normal University, China
| | - Xiangyun Wang
- College of Computer and Information Engineering, Tianjin Normal University, China
| | - Thomas N Ferraro
- Department of Biomedical Sciences at CMSRU, Rowan University, NJ 08028, USA
| | - Yong Chen
- Department of Molecular and Cellular Biosciences, Rowan University, NJ 08028, USA
| |
|
18
|
Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 2021; 53:770-777. [PMID: 34031584 PMCID: PMC8370014 DOI: 10.1038/s41588-021-00873-4] [Citation(s) in RCA: 109] [Impact Index Per Article: 27.3] [Received: 04/24/2020] [Accepted: 04/22/2021] [Indexed: 01/21/2023]
Abstract
The high proportion of zeros in typical single-cell RNA sequencing datasets has led to widespread but inconsistent use of terminology such as dropout and missing data. Here, we argue that much of this terminology is unhelpful and confusing, and outline simple ideas to help to reduce confusion. These include: (1) observed single-cell RNA sequencing counts reflect both true gene expression levels and measurement error, and carefully distinguishing between these contributions helps to clarify thinking; and (2) method development should start with a Poisson measurement model, rather than more complex models, because it is simple and generally consistent with existing data. We outline how several existing methods can be viewed within this framework and highlight how these methods differ in their assumptions about expression variation. We also illustrate how our perspective helps to address questions of biological interest, such as whether messenger RNA expression levels are multimodal among cells.
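The point about a Poisson measurement model can be made concrete with a small simulation. The sketch below assumes that the observed count for a gene in a cell is Poisson with mean equal to a per-cell size factor times the gene's relative expression. This parameterisation, the function names and the numbers are illustrative assumptions, not code from the paper. Under these assumptions a gene that is expressed at a low but nonzero level is still recorded as zero in most cells, without any separate "dropout" mechanism.

```python
import math
import random

def poisson_zero_fraction(expected_count):
    """Probability of observing a zero count when the count is Poisson(expected_count)."""
    return math.exp(-expected_count)

def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_zero_fraction(size_factor, relative_expression, n_cells, seed=0):
    """Fraction of simulated cells in which the gene is observed with zero molecules."""
    rng = random.Random(seed)
    mean = size_factor * relative_expression  # expected molecules observed per cell
    zeros = sum(sample_poisson(rng, mean) == 0 for _ in range(n_cells))
    return zeros / n_cells

# Hypothetical numbers: 1000 molecules sampled per cell, gene at 0.05% of the transcriptome.
theoretical = poisson_zero_fraction(1000 * 5e-4)
empirical = simulate_zero_fraction(1000, 5e-4, n_cells=20000)
print(round(theoretical, 3), round(empirical, 3))
```

Here the expected count is 0.5 molecules per cell, so roughly 60% of cells show a zero even though every simulated cell expresses the gene at the same level.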
Affiliation(s)
- Abhishek Sarkar
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
- Department of Statistics, University of Chicago, Chicago, IL, USA.
| |
|
19
|
Kong Y, Kozik A, Nakatsu CH, Jones-Hall YL, Chun H. A zero-inflated non-negative matrix factorization for the deconvolution of mixed signals of biological data. Int J Biostat 2021; 18:203-218. [PMID: 33783171 DOI: 10.1515/ijb-2020-0039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/30/2020] [Accepted: 02/23/2021] [Indexed: 12/18/2022]
Abstract
A latent factor model for count data is popularly applied in deconvoluting mixed signals in biological data as exemplified by sequencing data for transcriptome or microbiome studies. Due to the availability of pure samples such as single-cell transcriptome data, the accuracy of the estimates could be much improved. However, the advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression as well as fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R-package, iNMF.
Affiliation(s)
- Yixin Kong
- Department of Mathematics and Statistics, Boston University, Boston, MA 02215, USA
| | - Ariangela Kozik
- Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI 48104, USA
| | - Cindy H Nakatsu
- Department of Agronomy, Purdue University, West Lafayette, IN 47905, USA
| | - Yava L Jones-Hall
- College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, Texas 77843, USA
| | - Hyonho Chun
- Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
| |
|
20
|
Sánchez JA, Gil-Martinez AL, Cisterna A, García-Ruíz S, Gómez-Pascual A, Reynolds RH, Nalls M, Hardy J, Ryten M, Botía JA. Modeling multifunctionality of genes with secondary gene co-expression networks in human brain provides novel disease insights. Bioinformatics 2021; 37:2905-2911. [PMID: 33734320 PMCID: PMC8479669 DOI: 10.1093/bioinformatics/btab175] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/08/2020] [Revised: 02/14/2021] [Accepted: 03/16/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Co-expression networks are a powerful gene expression analysis method to study how genes co-express together in clusters with functional coherence that usually resemble specific cell type behavior for the genes involved. They can be applied to bulk-tissue gene expression profiling and assign function, and usually cell type specificity, to a high percentage of the gene pool used to construct the network. One of the limitations of this method is that each gene is predicted to play a role in a specific set of coherent functions in a single cell type (i.e. at most we get a single <gene, function, cell type> for each gene). We present here GMSCA (Gene Multifunctionality Secondary Co-expression Analysis), a software tool that exploits the co-expression paradigm to increase the number of functions and cell types ascribed to a gene in bulk-tissue co-expression networks. RESULTS We applied GMSCA to 27 co-expression networks derived from bulk-tissue gene expression profiling of a variety of brain tissues. Neurons and glial cells (microglia, astrocytes and oligodendrocytes) were considered the main cell types. Applying this approach, we increase the overall number of predicted triplets <gene, function, cell type> by 46.73%. Moreover, GMSCA predicts that the SNCA gene, traditionally associated to work mainly in neurons, also plays a relevant function in oligodendrocytes. AVAILABILITY AND IMPLEMENTATION The tool is available at GitHub, https://github.com/drlaguna/GMSCA as open-source software. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Juan A Sánchez
- Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia E-30100, Spain
| | - Ana L Gil-Martinez
- Department of Neurodegenerative Diseases, UCL Institute of Neurology, London WC1E 6BT, UK
| | - Alejandro Cisterna
- Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia E-30100, Spain
| | - Sonia García-Ruíz
- Department of Neurodegenerative Diseases, UCL Institute of Neurology, London WC1E 6BT, UK
| | - Alicia Gómez-Pascual
- Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia E-30100, Spain
| | - Regina H Reynolds
- Department of Neurodegenerative Diseases, UCL Institute of Neurology, London WC1E 6BT, UK
| | - Mike Nalls
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD 20892, USA; Data Tecnica International, Glen Echo, MD 20812, USA
| | - John Hardy
- Department of Neurodegenerative Diseases, UCL Institute of Neurology, London WC1E 6BT, UK
| | - Mina Ryten
- Department of Neurodegenerative Diseases, UCL Institute of Neurology, London WC1E 6BT, UK. To whom correspondence should be addressed.
| | - Juan A Botía
- Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia E-30100, Spain; Department of Neurodegenerative Diseases, UCL Institute of Neurology, London WC1E 6BT, UK. To whom correspondence should be addressed.
| |
|
21
|
Sokolowski DJ, Faykoo-Martinez M, Erdman L, Hou H, Chan C, Zhu H, Holmes MM, Goldenberg A, Wilson MD. Single-cell mapper (scMappR): using scRNA-seq to infer the cell-type specificities of differentially expressed genes. NAR Genom Bioinform 2021; 3:lqab011. [PMID: 33655208 PMCID: PMC7902236 DOI: 10.1093/nargab/lqab011] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/24/2020] [Revised: 12/23/2020] [Accepted: 02/04/2021] [Indexed: 12/11/2022] Open
Abstract
RNA sequencing (RNA-seq) is widely used to identify differentially expressed genes (DEGs) and reveal biological mechanisms underlying complex biological processes. RNA-seq is often performed on heterogeneous samples and the resulting DEGs do not necessarily indicate the cell-types where the differential expression occurred. While single-cell RNA-seq (scRNA-seq) methods solve this problem, technical and cost constraints currently limit its widespread use. Here we present single cell Mapper (scMappR), a method that assigns cell-type specificity scores to DEGs obtained from bulk RNA-seq by leveraging cell-type expression data generated by scRNA-seq and existing deconvolution methods. After evaluating scMappR with simulated RNA-seq data and benchmarking scMappR using RNA-seq data obtained from sorted blood cells, we asked if scMappR could reveal known cell-type specific changes that occur during kidney regeneration. scMappR appropriately assigned DEGs to cell-types involved in kidney regeneration, including a relatively small population of immune cells. While scMappR can work with user-supplied scRNA-seq data, we curated scRNA-seq expression matrices for ∼100 human and mouse tissues to facilitate its stand-alone use with bulk RNA-seq data from these species. Overall, scMappR is a user-friendly R package that complements traditional differential gene expression analysis of bulk RNA-seq data.
Affiliation(s)
- Dustin J Sokolowski
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
| | - Lauren Erdman
- Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, M5G 0A4, Canada
| | - Huayun Hou
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
| | - Cadia Chan
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
| | - Helen Zhu
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
| | - Melissa M Holmes
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Anna Goldenberg
- Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, M5G 0A4, Canada
| | - Michael D Wilson
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
| |
|
22
|
Dumitrascu B, Villar S, Mixon DG, Engelhardt BE. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat Commun 2021; 12:1186. [PMID: 33608535 PMCID: PMC7895823 DOI: 10.1038/s41467-021-21453-4] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 04/18/2019] [Accepted: 01/27/2021] [Indexed: 11/17/2022] Open
Abstract
Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization. The selection of a small set of cellular labels to distinguish a subpopulation of cells from a complex mixture is an important task in cell biology. Here the authors propose a method for supervised genetic marker selection using linear programming and provide a Python package scGeneFit that implements this approach.
Affiliation(s)
- Bianca Dumitrascu
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Soledad Villar
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA; Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, MD, USA
| | - Dustin G Mixon
- Department of Mathematics, The Ohio State University, Columbus, OH, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, NJ, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA
| |
|
23
|
Dong M, Thennavan A, Urrutia E, Li Y, Perou CM, Zou F, Jiang Y. SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Brief Bioinform 2021; 22:416-427. [PMID: 31925417 PMCID: PMC7820884 DOI: 10.1093/bib/bbz166] [Citation(s) in RCA: 147] [Impact Index Per Article: 36.8] [Received: 09/03/2019] [Revised: 11/04/2019] [Accepted: 12/02/2019] [Indexed: 12/14/2022] Open
Abstract
Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
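The core idea of reference-based deconvolution can be sketched in a few lines of Python: a bulk expression profile is treated as a proportion-weighted mixture of cell-type reference profiles, and the proportions are recovered by least squares. The example below is a deliberately minimal two-cell-type version with hypothetical profiles and function names. It is not the SCDC algorithm and does not reproduce its ENSEMBLE weighting across multiple references.

```python
def estimate_proportion(bulk, profile_a, profile_b):
    """Least-squares estimate of the fraction of cell type A in a two-type mixture.

    Assumes bulk ~= p * profile_a + (1 - p) * profile_b, gene by gene,
    and clips the estimate to the valid range [0, 1].
    """
    diff = [a - b for a, b in zip(profile_a, profile_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("reference profiles are identical; proportion is not identifiable")
    numer = sum((x - b) * d for x, b, d in zip(bulk, profile_b, diff))
    return min(1.0, max(0.0, numer / denom))

# Hypothetical reference profiles for three genes (e.g. averaged from scRNA-seq clusters).
profile_a = [10.0, 0.0, 5.0]
profile_b = [0.0, 8.0, 5.0]

# Pseudo-bulk sample built with a known ground truth of 30% type A, 70% type B.
true_fraction = 0.3
bulk = [true_fraction * a + (1 - true_fraction) * b for a, b in zip(profile_a, profile_b)]

estimated = estimate_proportion(bulk, profile_a, profile_b)
print(round(estimated, 3))
```

Building the pseudo-bulk sample from a known mixture mirrors the in silico benchmarking strategy mentioned in the abstract, where the known composition serves as ground truth.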
Affiliation(s)
| | - Fei Zou
- Corresponding authors: Fei Zou and Yuchao Jiang, Department of Biostatistics and Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
| | - Yuchao Jiang
- Corresponding authors: Fei Zou and Yuchao Jiang, Department of Biostatistics and Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
| |
|
24
|
Zeng P, Wangwu J, Lin Z. Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data. Brief Bioinform 2020; 22:6024740. [PMID: 33279962 DOI: 10.1093/bib/bbaa347] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/16/2020] [Revised: 10/29/2020] [Accepted: 10/30/2020] [Indexed: 12/11/2022] Open
Abstract
Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq) or sc-methylation data alone, and only a few are developed for the integrative analysis of multiple data types. The integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. In this paper, we propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information theoretic co-clustering framework. In co-clustering, both the cells and the genomic features are simultaneously clustered. Clustering similar genomic features reduces the noise in single-cell data and facilitates transfer of knowledge across single-cell datasets. We applied coupleCoC for the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. Our method coupleCoC is also computationally efficient and can scale up to large datasets. Availability: The software and datasets are available at https://github.com/cuhklinlab/coupleCoC.
Affiliation(s)
- Pengcheng Zeng
- Department of Statistics, The Chinese University of Hong Kong
| | - Jiaxuan Wangwu
- Department of Statistics, The Chinese University of Hong Kong
| | - Zhixiang Lin
- Department of Statistics, The Chinese University of Hong Kong
| |
|
25
|
Camerlenghi F, Dumitrascu B, Ferrari F, Engelhardt BE, Favaro S. Nonparametric Bayesian multiarmed bandits for single-cell experiment design. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1370] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
|
26
|
Xu J, Cai L, Liao B, Zhu W, Yang J. CMF-Impute: an accurate imputation tool for single-cell RNA-seq data. Bioinformatics 2020; 36:3139-3147. [PMID: 32073612 DOI: 10.1093/bioinformatics/btaa109] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 11/08/2019] [Revised: 01/16/2020] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Single-cell RNA-sequencing (scRNA-seq) technology provides a powerful tool for investigating cell heterogeneity and cell subpopulations by allowing the quantification of gene expression at single-cell level. However, scRNA-seq data analysis remains challenging because of various sources of technical noise such as dropout events (i.e. excessive zero counts in the expression matrix). RESULTS By taking into consideration the association among cells and genes, we propose a novel collaborative matrix factorization-based method called CMF-Impute to impute the dropout entries of a given scRNA-seq expression matrix. We test CMF-Impute and compare it with five other state-of-the-art methods on six popular real scRNA-seq datasets of various sizes and three simulated datasets. For simulated datasets, CMF-Impute outperforms other methods in imputing the closest dropouts to the original expression values as evaluated by both the sum of squared error and Pearson correlation coefficient. For real datasets, CMF-Impute achieves the most accurate cell classification results regardless of the choice of clustering method, such as SC3 or t-SNE followed by K-means, as evaluated by both adjusted Rand index and normalized mutual information. Finally, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation, and in inferring cell lineage trajectories. AVAILABILITY AND IMPLEMENTATION CMF-Impute is written as a Matlab package which is available at https://github.com/xujunlin123/CMFImpute.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Junlin Xu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, P.R. China
| | - Lijun Cai
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, P.R. China
| | - Bo Liao
- School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, P.R. China
| | - Wen Zhu
- School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, P.R. China
| | - JiaLiang Yang
- School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, P.R. China
- Geneis Beijing Co., Ltd, Beijing 100102, China
|
27
|
Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 2020; 18:2789-2798. [PMID: 33101615 PMCID: PMC7568192 DOI: 10.1016/j.csbj.2020.09.014] [Citation(s) in RCA: 81] [Impact Index Per Article: 16.2] [Received: 06/26/2020] [Revised: 09/09/2020] [Accepted: 09/10/2020] [Indexed: 12/21/2022]
Abstract
Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and show models can disagree substantially in terms of identifying the most differentially expressed sequences. Next, to rationally examine how different zero handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero handling models behave in the presence of these different zero generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero handling technique known as "zero-inflation" was only suitable under a zero generating process associated with an unlikely set of biological and experimental conditions. In concert, our work here suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
Affiliation(s)
- Justin D Silverman
- College of Information Science and Technology, Pennsylvania State University, State College, PA 16802, United States
- Institute for Computational and Data Science, Pennsylvania State University, State College, PA 16802, United States
- Department of Medicine, Pennsylvania State University, Hershey, PA 17033, United States
| | - Kimberly Roche
- Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
| | - Sayan Mukherjee
- Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
- Departments of Statistical Science, Mathematics, Computer Science, Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, United States
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States
| | - Lawrence A David
- Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States
- Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, United States
|
28
|
Sun B, Chen L. Quantile regression for challenging cases of eQTL mapping. Brief Bioinform 2020; 21:1756-1765. [PMID: 31688892 PMCID: PMC7673343 DOI: 10.1093/bib/bbz097] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/13/2019] [Revised: 06/24/2019] [Accepted: 07/06/2019] [Indexed: 11/13/2022]
Abstract
Mapping of expression quantitative trait loci (eQTLs) facilitates interpretation of the regulatory path from genetic variants to their associated diseases or traits. High-throughput sequencing of RNA (RNA-seq) has expedited the exploration of these regulatory variants. However, eQTL mapping is usually confronted with analysis challenges caused by overdispersion and excessive dropouts in RNA-seq. The heavy-tailed distribution of gene expression violates the assumption of Gaussian distributed errors in linear regression for eQTL detection, which results in increased Type I or Type II errors. Applying a rank-based inverse normal transformation (INT) can make the expression values more normally distributed. However, INT causes information loss and leads to uninterpretable effect size estimates. After a comprehensive examination of the impact of overdispersion and excessive dropouts, we propose to apply a robust model, quantile regression, to map eQTLs for genes with a high degree of overdispersion or a large number of dropouts. Simulation studies show that quantile regression has the desired robustness to outliers and dropouts, and it significantly improves eQTL mapping. In a real data analysis, the most significant eQTL discoveries differ between quantile regression and the conventional linear model. This discrepancy becomes more prominent when the dropout effect or the overdispersion effect is large. All the results suggest that quantile regression provides more reliable and accurate eQTL mapping than conventional linear models. It deserves more attention in large-scale eQTL mapping.
Affiliation(s)
- Bo Sun
- Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, USA
| | - Liang Chen
- Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, USA
|
29
|
Gan L, Vinci G, Allen GI. Correlation Imputation in Single cell RNA-seq using Auxiliary Information and Ensemble Learning. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2020; 2020. [PMID: 34278382 DOI: 10.1145/3388440.3412462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Abstract
Single cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high throughput fashion. However, because of sequencing inefficiency, the data are made unreliable by dropout events, technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet, effective imputation can be difficult and biased because the data are sparse and high-dimensional, resulting in major distortions in downstream analyses. In this paper, we propose a completely novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.
|
30
|
Tao Y, Lei H, Lee AV, Ma J, Schwartz R. Neural Network Deconvolution Method for Resolving Pathway-Level Progression of Tumor Clonal Expression Programs With Application to Breast Cancer Brain Metastases. Front Physiol 2020; 11:1055. [PMID: 33013452 PMCID: PMC7499245 DOI: 10.3389/fphys.2020.01055] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/31/2020] [Accepted: 07/31/2020] [Indexed: 02/03/2023]
Abstract
Metastasis is the primary mechanism by which cancer results in mortality and there are currently no reliable treatment options once it occurs, making the metastatic process a critical target for new diagnostics and therapeutics. Treating metastasis before it appears is challenging, however, in part because metastases may be quite distinct genomically from the primary tumors from which they presumably emerged. Phylogenetic studies of cancer development have suggested that changes in tumor genomics over stages of progression often result from shifts in the abundance of clonal cellular populations, as late stages of progression may derive from or select for clonal populations rare in the primary tumor. The present study develops computational methods to infer clonal heterogeneity and dynamics across progression stages via deconvolution and clonal phylogeny reconstruction of pathway-level expression signatures in order to reconstruct how these processes might influence average changes in genomic signatures over progression. We show, via application to a study of gene expression in a collection of matched breast primary tumor and metastatic samples, that the method can infer coarse-grained substructure and stromal infiltration across the metastatic transition. The results suggest that genomic changes observed in metastasis, such as gain of the ErbB signaling pathway, are likely caused by early events in clonal evolution followed by expansion of minor clonal populations in metastasis, a finding that may have translational implications for early detection or prevention of metastasis.
Affiliation(s)
- Yifeng Tao
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States
- Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA, United States
| | - Haoyun Lei
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States
- Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA, United States
| | - Adrian V Lee
- Department of Pharmacology and Chemical Biology, UPMC Hillman Cancer Center, Magee-Womens Research Institute, University of Pittsburgh, Pittsburgh, PA, United States
| | - Jian Ma
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States
| | - Russell Schwartz
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, United States
|
31
|
Zhang S, Yang L, Yang J, Lin Z, Ng MK. Dimensionality reduction for single cell RNA sequencing data using constrained robust non-negative matrix factorization. NAR Genom Bioinform 2020; 2:lqaa064. [PMID: 33575614 PMCID: PMC7671375 DOI: 10.1093/nargab/lqaa064] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 01/20/2020] [Revised: 08/10/2020] [Accepted: 08/19/2020] [Indexed: 12/22/2022]
Abstract
Single cell RNA-sequencing (scRNA-seq) technology, a powerful tool for analyzing the entire transcriptome at single cell level, is receiving increasing research attention. The presence of dropouts is an important characteristic of scRNA-seq data that may affect the performance of downstream analyses, such as dimensionality reduction and clustering. Cells sequenced to lower depths tend to have more dropouts than those sequenced to greater depths. In this study, we aimed to develop a dimensionality reduction method to address both dropouts and the non-negativity constraints in scRNA-seq data. The developed method simultaneously performs dimensionality reduction and dropout imputation under the non-negative matrix factorization (NMF) framework. The dropouts were modeled as a non-negative sparse matrix. Summation of the observed data matrix and dropout matrix was approximated by NMF. To ensure the sparsity pattern was maintained, a weighted ℓ1 penalty that took into account the dependency of dropouts on the sequencing depth in each cell was imposed. An efficient algorithm was developed to solve the proposed optimization problem. Experiments using both synthetic data and real data showed that dimensionality reduction via the proposed method afforded more robust clustering results compared with those obtained from the existing methods, and that dropout imputation improved the differential expression analysis.
Affiliation(s)
- Shuqin Zhang
- School of Mathematical Sciences, Fudan University, Shanghai 200433, China
| | - Liu Yang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Jinwen Yang
- School of Mathematical Sciences, Fudan University, Shanghai 200433, China
| | - Zhixiang Lin
- Department of Statistics, Chinese University of Hong Kong, Shatin Hong Kong, China
| | - Michael K Ng
- Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong, China
|
32
|
Li S, Crawford FW, Gerstein MB. Using sigLASSO to optimize cancer mutation signatures jointly with sampling likelihood. Nat Commun 2020; 11:3575. [PMID: 32681003 PMCID: PMC7368050 DOI: 10.1038/s41467-020-17388-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 01/07/2020] [Accepted: 06/22/2020] [Indexed: 11/08/2022]
Abstract
Multiple mutational processes drive carcinogenesis, leaving characteristic signatures in tumor genomes. Determining the active signatures from a full repertoire of potential ones helps elucidate mechanisms of cancer development. This involves optimally decomposing the counts of cancer mutations, tabulated according to their trinucleotide context, into a linear combination of known signatures. Here, we develop sigLASSO (a software tool at github.com/gersteinlab/siglasso) to carry out this optimization efficiently. sigLASSO has four key aspects: (1) It jointly optimizes the likelihood of sampling and signature fitting, by explicitly factoring multinomial sampling into the objective function. This is particularly important when mutation counts are low and sampling variance is high (e.g., in exome sequencing). (2) sigLASSO uses L1 regularization to parsimoniously assign signatures, leading to sparse and interpretable solutions. (3) It fine-tunes model complexity, informed by data scale and biological priors. (4) Consequently, sigLASSO can assess model uncertainty and abstain from making assignments in low-confidence contexts.
Affiliation(s)
- Shantao Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Forrest W Crawford
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Yale School of Management, New Haven, CT, USA
- Department of Statistics and Data Science, Yale University, New Haven, CT, USA
| | - Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA.
- Department of Statistics and Data Science, Yale University, New Haven, CT, USA.
- Department of Computer Science, Yale University, New Haven, CT, USA.
|
33
|
Chowdhury HA, Bhattacharyya DK, Kalita JK. (Differential) Co-Expression Analysis of Gene Expression: A Survey of Best Practices. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1154-1173. [PMID: 30668502 DOI: 10.1109/tcbb.2019.2893170] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 06/09/2023]
Abstract
Analysis of gene expression data is widely used in transcriptomic studies to understand the functions of molecules inside a cell and the interactions among molecules. Differential co-expression analysis studies diseases and phenotypic variations by finding modules of genes whose co-expression patterns vary across conditions. We review the best practices in gene expression data analysis in terms of analysis of (differential) co-expression, co-expression networks, differential networking, and differential connectivity, considering both microarray and RNA-seq data along with comparisons. We highlight hurdles in analyzing RNA-seq data using methods developed for microarrays. We include discussion of necessary tools for gene expression analysis throughout the paper. In addition, we shed light on scRNA-seq data analysis by including preprocessing and scRNA-seq in co-expression analysis along with useful tools specific to scRNA-seq. To aid interpretation, biological interpretation and functional profiling are included. Finally, we provide guidelines for the analyst, along with research issues and challenges which should be addressed.
|
34
|
Tao Y, Lei H, Fu X, Lee AV, Ma J, Schwartz R. Robust and accurate deconvolution of tumor populations uncovers evolutionary mechanisms of breast cancer metastasis. Bioinformatics 2020; 36:i407-i416. [PMID: 32657393 PMCID: PMC7355293 DOI: 10.1093/bioinformatics/btaa396] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023]
Abstract
MOTIVATION: Cancer develops and progresses through a clonal evolutionary process. Understanding progression to metastasis is of particular clinical importance, but is not easily analyzed by recent methods because it generally requires studying samples gathered years apart, for which modern single-cell sequencing is rarely an option. Revealing the clonal evolution mechanisms in the metastatic transition thus still depends on unmixing tumor subpopulations from bulk genomic data. METHODS: We develop a novel toolkit called robust and accurate deconvolution (RAD) to deconvolve biologically meaningful tumor populations from multiple transcriptomic samples spanning the two progression states. RAD uses gene module compression to mitigate considerable noise in RNA, and a hybrid optimizer to achieve a robust and accurate solution. Finally, we apply a phylogenetic algorithm to infer how associated cell populations adapt across the metastatic transition via changes in expression programs and cell-type composition. RESULTS: We validated the superior robustness and accuracy of RAD over alternative algorithms on a real dataset, and validated the effectiveness of gene module compression on both simulated and real bulk RNA data. We further applied the methods to a breast cancer metastasis dataset, and discovered common early events that promote tumor progression and migration to different metastatic sites, such as dysregulation of the ECM-receptor, focal adhesion and PI3K-Akt pathways. AVAILABILITY AND IMPLEMENTATION: The source code of the RAD package, models, experiments and technical details such as parameters are available at https://github.com/CMUSchwartzLab/RAD. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yifeng Tao
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15213, USA
| | - Haoyun Lei
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15213, USA
| | - Xuecong Fu
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Adrian V Lee
- Department of Pharmacology and Chemical Biology, UPMC Hillman Cancer Center, Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
| | - Jian Ma
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Russell Schwartz
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
35
|
Zhang L, Zhang S. Comparison of Computational Methods for Imputing Single-Cell RNA-Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:376-389. [PMID: 29994128 DOI: 10.1109/tcbb.2018.2848633] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Indexed: 06/08/2023]
Abstract
Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology, which paves the way for measuring RNA levels at single-cell resolution to study precise biological functions. One of the main challenges when analyzing scRNA-seq data is the presence of zeros or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian-based method was proposed in 2016. However, these methods have shown very diverse characteristics in terms of model hypothesis and imputation performance. Thus, a large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. Simulated datasets and case studies highlight that no single method performs best in all situations. Some shortcomings of these methods, such as limited scalability, limited robustness, and unavailability in some situations, need to be addressed in future studies.
|
36
|
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini CSO, Aparicio S, Baaijens J, Balvert M, Barbanson BD, Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo TH, Lelieveldt BP, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Rączkowska A, Reinders M, Ridder JD, Saliba AE, Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in single-cell data science. Genome Biol 2020; 21:31. [PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6] [Citation(s) in RCA: 690] [Impact Index Per Article: 138.0] [Received: 08/02/2019] [Accepted: 01/02/2020] [Indexed: 02/08/2023]
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Affiliation(s)
- David Lähnemann
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf, Germany
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Johannes Köster
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA
| | - Ewa Szczurek
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
| | - Davis J. McCarthy
- Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia
- Melbourne Integrative Genomics, School of BioSciences–School of Mathematics & Statistics, Faculty of Science, University of Melbourne, Melbourne, Australia
| | - Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD USA
| | - Mark D. Robinson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zürich, Zürich, Switzerland
| | - Catalina A. Vallejos
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK
- The Alan Turing Institute, British Library, London, UK
| | - Kieran R. Campbell
- Department of Statistics, University of British Columbia, Vancouver, Canada
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
- Data Science Institute, University of British Columbia, Vancouver, Canada
| | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Ahmed Mahfouz
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Luca Pinello
- Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA
- Department of Pathology, Harvard Medical School, Boston, USA
- Broad Institute of Harvard and MIT, Cambridge, MA USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, USA
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | | | - Samuel Aparicio
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
| | - Jasmijn Baaijens
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
| | - Marleen Balvert
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
| | - Buys de Barbanson
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
| | - Antonio Cappuccio
- Institute for Advanced Study, University of Amsterdam, Amsterdam, The Netherlands
| | - Giacomo Corleone
- Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, London, UK
| | - Bas E. Dutilh
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Maria Florescu
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
- Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands
| | - Victor Guryev
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Rens Holmer
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
| | - Katharina Jahn
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Thamar Jessurun Lobo
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Emma M. Keizer
- Biometris, Wageningen University & Research, Wageningen, The Netherlands
| | - Indu Khatri
- Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, Leiden, The Netherlands
| | - Szymon M. Kielbasa
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
| | - Jan O. Korbel
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Alexey M. Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Tzu-Hao Kuo
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Boudewijn P.F. Lelieveldt
- PRB lab, Delft University of Technology, Delft, The Netherlands
- Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
| | - Ion I. Mandoiu
- Computer Science & Engineering Department, University of Connecticut, Storrs, USA
| | - John C. Marioni
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- Max Planck Institute for Informatics, Saarbrücken, Germany
| | - Felix Mölder
- Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
| | - Amir Niknejad
- Computation molecular design, Zuse Institute Berlin, Berlin, Germany
- Mathematics Department, Mount Saint Vincent, New York, USA
| | - Alicja Rączkowska
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
| | - Marcel Reinders
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Jeroen de Ridder
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
| | - Antoine-Emmanuel Saliba
- Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany
| | - Antonios Somarakis
- Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
| | - Oliver Stegle
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center–DKFZ, Heidelberg, Germany
| | - Fabian J. Theis
- Institute of Computational Biology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany
| | - Huan Yang
- Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research–LACDR–Leiden University, Leiden, The Netherlands
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia
| | - Alice C. McHardy
- Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Sohrab P. Shah
- Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA
| | - Alexander Schönhuth
- Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, The Netherlands
|
37
|
Lin Z, Zamanighomi M, Daley T, Ma S, Wong WH. Model-Based Approach to the Joint Analysis of Single-Cell Data on Chromatin Accessibility and Gene Expression. Stat Sci 2020. [DOI: 10.1214/19-sts714] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/19/2022]
|
38
|
Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res 2020; 30:195-204. [PMID: 31992614 PMCID: PMC7050525 DOI: 10.1101/gr.251603.119] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 04/18/2019] [Accepted: 11/19/2019] [Indexed: 02/06/2023]
Abstract
Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be nearby each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene-gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
Affiliation(s)
- Rebecca Elyanow
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912, USA
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
| | - Bianca Dumitrascu
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08540, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey 08540, USA
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
|
39
|
Jambusaria A, Hong Z, Zhang L, Srivastava S, Jana A, Toth PT, Dai Y, Malik AB, Rehman J. Endothelial heterogeneity across distinct vascular beds during homeostasis and inflammation. eLife 2020; 9:51413. [PMID: 31944177 PMCID: PMC7002042 DOI: 10.7554/elife.51413] [Citation(s) in RCA: 205] [Impact Index Per Article: 41.0] [Received: 08/27/2019] [Accepted: 01/15/2020] [Indexed: 12/18/2022] Open
Abstract
Blood vessels are lined by endothelial cells engaged in distinct organ-specific functions but little is known about their characteristic gene expression profiles. RNA-Sequencing of the brain, lung, and heart endothelial translatome identified specific pathways, transporters and cell-surface markers expressed in the endothelium of each organ, which can be visualized at http://www.rehmanlab.org/ribo. We found that endothelial cells express genes typically found in the surrounding tissues such as synaptic vesicle genes in the brain endothelium and cardiac contractile genes in the heart endothelium. Complementary analysis of endothelial single cell RNA-Seq data identified the molecular signatures shared across the endothelial translatome and single cell transcriptomes. The tissue-specific heterogeneity of the endothelium is maintained during systemic in vivo inflammatory injury as evidenced by the distinct responses to inflammatory stimulation. Our study defines endothelial heterogeneity and plasticity and provides a molecular framework to understand organ-specific vascular disease mechanisms and therapeutic targeting of individual vascular beds.
Blood vessels supply nutrients, oxygen and other key molecules to all of the organs in the body. Cells lining the blood vessels, called endothelial cells, regulate which molecules pass from the blood to the organs they supply. For example, brain endothelial cells prevent toxic molecules from getting into the brain, and lung endothelial cells allow immune cells into the lungs to fight off bacteria or viruses. Determining which genes are switched on in the endothelial cells of major organs might allow scientists to determine what endothelial cells do in the brain, heart, and lung, and how they differ; or help scientists deliver drugs to a particular organ.
If endothelial cells from different organs switch on different groups of genes, each of these groups of genes can be thought of as a ‘genetic signature’ that identifies endothelial cells from a specific organ. Now, Jambusaria et al. show that brain, heart, and lung endothelial cells have distinct genetic signatures. The experiments used mice that had been genetically modified to have tags on their endothelial cells. These tags made it possible to isolate RNA – a molecule similar to DNA that contains the information about which genes are active – from endothelial cells without separating the cells from their tissue of origin. Next, RNA from endothelial cells in the heart, brain and lung was sequenced and analyzed. The results show that each endothelial cell type has a distinct genetic signature under normal conditions and infection-like conditions. Unexpectedly, the experiments also showed that genes that were thought to only be switched on in the cells of specific tissues are also on in the endothelial cells lining the blood vessels of the tissue. For example, genes switched on in brain cells are also active in brain endothelial cells, and genes allowing heart muscle cells to pump are also on in the endothelial cells of the heart blood vessels. The endothelial cell genetic signatures identified by Jambusaria et al. can be used as “postal codes” to target drugs to a specific organ via the endothelial cells that feed it. It might also be possible to use these genetic signatures to build organ-specific blood vessels from stem cells in the laboratory. Future work will try to answer why endothelial cells serving the heart and brain use genes from these organs.
Affiliation(s)
- Ankit Jambusaria
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States; Department of Bioengineering, The University of Illinois College of Engineering and College of Medicine, Chicago, United States
| | - Zhigang Hong
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States
| | - Lianghui Zhang
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States
| | - Shubhi Srivastava
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States
| | - Arundhati Jana
- Division of Cardiology, Department of Medicine, The University of Illinois College of Medicine, Chicago, United States
| | - Peter T Toth
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States; Research Resources Center, University of Illinois, Chicago, United States
| | - Yang Dai
- Department of Bioengineering, The University of Illinois College of Engineering and College of Medicine, Chicago, United States
| | - Asrar B Malik
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States
| | - Jalees Rehman
- Department of Pharmacology, The University of Illinois College of Medicine, Chicago, United States; Division of Cardiology, Department of Medicine, The University of Illinois College of Medicine, Chicago, United States
|
40
|
|
41
|
Mardis ER. The Impact of Next-Generation Sequencing on Cancer Genomics: From Discovery to Clinic. Cold Spring Harb Perspect Med 2019; 9:cshperspect.a036269. [PMID: 30397020 DOI: 10.1101/cshperspect.a036269] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Indexed: 11/24/2022]
Abstract
The application of next-generation sequencing (NGS) technology to the study of cancer genomes has been transformational. Not only has this technology revealed the genetic and epigenetic underpinnings of disease onset and progression, but also has redefined our clinical diagnosis and treatment paradigms. This rapid translation from discovery to clinical platform has occurred in the context of new pharmaceutical paradigms, enabling the use of NGS for the diagnosis and definition of therapeutic vulnerabilities of cancer. This review explores this transformation and identifies cutting-edge applications of NGS that will result in its additional utility in cancer care.
Affiliation(s)
- Elaine R Mardis
- The Ohio State University College of Medicine, Columbus, Ohio 43205
|
42
|
Mercatelli D, Ray F, Giorgi FM. Pan-Cancer and Single-Cell Modeling of Genomic Alterations Through Gene Expression. Front Genet 2019; 10:671. [PMID: 31379928 PMCID: PMC6657420 DOI: 10.3389/fgene.2019.00671] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 03/01/2019] [Accepted: 06/27/2019] [Indexed: 12/27/2022] Open
Abstract
Cancer is a disease often characterized by the presence of multiple genomic alterations, which trigger altered transcriptional patterns and gene expression, which in turn sustain the processes of tumorigenesis, tumor progression, and tumor maintenance. The links between genomic alterations and gene expression profiles can be utilized as the basis to build specific molecular tumorigenic relationships. In this study, we perform pan-cancer predictions of the presence of single somatic mutations and copy number variations using machine learning approaches on gene expression profiles. We show that gene expression can be used to predict genomic alterations in every tumor type, where some alterations are more predictable than others. We propose gene aggregation as a tool to improve the accuracy of alteration prediction models from gene expression profiles. Ultimately, we show how this principle can be beneficial in intrinsically noisy datasets, such as those based on single-cell sequencing.
Affiliation(s)
- Daniele Mercatelli
- Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Forest Ray
- Department of Systems Biology, Columbia University Medical Center, New York, NY, United States
| | - Federico M. Giorgi
- Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
|
43
|
Ye W, Ji G, Ye P, Long Y, Xiao X, Li S, Su Y, Wu X. scNPF: an integrative framework assisted by network propagation and network fusion for preprocessing of single-cell RNA-seq data. BMC Genomics 2019; 20:347. [PMID: 31068142 PMCID: PMC6505295 DOI: 10.1186/s12864-019-5747-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 12/21/2018] [Accepted: 04/29/2019] [Indexed: 12/15/2022] Open
Abstract
Background: Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful tool for profiling genome-scale transcriptomes of individual cells and capturing transcriptome-wide cell-to-cell variability. However, scRNA-seq technologies suffer from high levels of technical noise and variability, hindering reliable quantification of lowly and moderately expressed genes. Since most downstream analyses on scRNA-seq, such as cell type clustering and differential expression analysis, rely on the gene-cell expression matrix, preprocessing of scRNA-seq data is a critical preliminary step in the analysis of scRNA-seq data.
Results: We presented scNPF, an integrative scRNA-seq preprocessing framework assisted by network propagation and network fusion, for recovering gene expression loss, correcting gene expression measurements, and learning similarities between cells. scNPF leverages the context-specific topology inherent in the given data and the priori knowledge derived from publicly available molecular gene-gene interaction networks to augment gene-gene relationships in a data driven manner. We have demonstrated the great potential of scNPF in scRNA-seq preprocessing for accurately recovering gene expression values and learning cell similarity networks. Comprehensive evaluation of scNPF across a wide spectrum of scRNA-seq data sets showed that scNPF achieved comparable or higher performance than the competing approaches according to various metrics of internal validation and clustering accuracy. We have made scNPF an easy-to-use R package, which can be used as a versatile preprocessing plug-in for most existing scRNA-seq analysis pipelines or tools.
Conclusions: scNPF is a universal tool for preprocessing of scRNA-seq data, which jointly incorporates the global topology of priori interaction networks and the context-specific information encapsulated in the scRNA-seq data to capture both shared and complementary knowledge from diverse data sources. scNPF could be used to recover gene signatures and learn cell-to-cell similarities from emerging scRNA-seq data to facilitate downstream analyses such as dimension reduction, cell type clustering, and visualization.
Affiliation(s)
- Wenbin Ye
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Guoli Ji
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
| | - Pengchao Ye
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Yuqi Long
- Software Quality Testing Engineering Research Center, China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou, 510610, China
| | - Xuesong Xiao
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Shuchao Li
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China
| | - Yaru Su
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350116, China
| | - Xiaohui Wu
- Department of Automation, Xiamen University, Xiamen, 361005, China; Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
|
44
|
Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2018; 19:562-578. [PMID: 29121214 PMCID: PMC6215955 DOI: 10.1093/biostatistics/kxx053] [Citation(s) in RCA: 324] [Impact Index Per Article: 46.3] [Received: 05/03/2017] [Accepted: 09/13/2017] [Indexed: 12/26/2022] Open
Abstract
Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq) required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically-driven, genes not expressing RNA at the time of measurement, or technically-driven, genes expressing RNA, but not at a sufficient level to be detected by sequencing technology. Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem.
Affiliation(s)
- Stephanie C Hicks
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - F William Townes
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Mingxiang Teng
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Rafael A Irizarry
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
|