1
Nguyen H, Nguyen H, Tran D, Draghici S, Nguyen T. Fourteen years of cellular deconvolution: methodology, applications, technical evaluation and outstanding challenges. Nucleic Acids Res 2024; 52:4761-4783. [PMID: 38619038; PMCID: PMC11109966; DOI: 10.1093/nar/gkae267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 03/01/2024] [Accepted: 04/02/2024] [Indexed: 04/16/2024] [Open Access]
Abstract
Single-cell RNA sequencing (scRNA-Seq) is a recent technology that allows for the measurement of the expression of all genes in each individual cell contained in a sample. Information at the single-cell level has been shown to be extremely useful in many areas. However, performing single-cell experiments is expensive. Although cellular deconvolution cannot provide the same comprehensive information as single-cell experiments, it can extract cell-type information from bulk RNA data, and therefore it allows researchers to conduct studies at cell-type resolution from existing bulk datasets. For these reasons, a great effort has been made to develop such methods for cellular deconvolution. The large number of methods available, the requirement of coding skills, inadequate documentation, and lack of performance assessment all make it extremely difficult for life scientists to choose a suitable method for their experiment. This paper aims to fill this gap by providing a comprehensive review of 53 deconvolution methods regarding their methodology, applications, performance, and outstanding challenges. More importantly, the article presents a benchmarking of all these 53 methods using 283 cell types from 30 tissues of 63 individuals. We also provide an R package named DeconBenchmark that allows readers to execute and benchmark the reviewed methods (https://github.com/tinnlab/DeconBenchmark).
Affiliation(s)
- Hung Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Ha Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Duc Tran
  - Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
  - Advaita Bioinformatics, Ann Arbor, MI, USA
- Tin Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
2
Detection of Cell Separation-Induced Gene Expression Through a Penalized Deconvolution Approach. Statistics in Biosciences 2022. [DOI: 10.1007/s12561-022-09344-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
3
Tai AS, Tseng GC, Hsieh WP. BayICE: A Bayesian hierarchical model for semireference-based deconvolution of bulk transcriptomic data. Ann Appl Stat 2021. [DOI: 10.1214/20-aoas1376] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- An-Shun Tai
  - Institute of Statistics, National Tsing Hua University
4
Jaakkola MK, Elo LL. Computational deconvolution to estimate cell type-specific gene expression from bulk data. NAR Genom Bioinform 2021; 3:lqaa110. [PMID: 33575652; PMCID: PMC7803005; DOI: 10.1093/nargab/lqaa110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/20/2020] [Revised: 12/14/2020] [Accepted: 12/17/2020] [Indexed: 12/24/2022] [Open Access]
Abstract
Computational deconvolution is a time- and cost-efficient approach to obtaining cell type-specific information from bulk gene expression of heterogeneous tissues such as blood. Deconvolution can aim to estimate cell type proportions or abundances in samples, to estimate how strongly each cell type present expresses different genes, or to do both simultaneously. Of the two goals, the estimation of cell type proportions/abundances is widely studied, but less attention has been paid to defining the cell type-specific expression profiles. Here, we address this gap by introducing a novel method, Rodeo, and empirically evaluating it and the other available tools from multiple perspectives using diverse datasets.
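The proportion-estimation goal named in this abstract can be illustrated with a minimal sketch. It assumes a known signature matrix (genes by cell types) and models a bulk profile as a weighted sum of its columns. The function name, the toy matrix, and the use of ordinary least squares followed by clipping and renormalization are illustrative assumptions; this is not the Rodeo method or any tool evaluated in the paper.

```python
import numpy as np


def estimate_proportions(bulk, signature):
    """Estimate cell type proportions for one bulk sample.

    bulk:      shape (genes,), bulk expression of one sample
    signature: shape (genes, cell_types), reference expression per cell type
    """
    # Solve bulk ~= signature @ coef in the least-squares sense.
    coef, *_ = np.linalg.lstsq(signature, bulk, rcond=None)
    # Proportions cannot be negative; clip, then rescale to sum to 1.
    coef = np.clip(coef, 0.0, None)
    total = coef.sum()
    if total == 0.0:
        raise ValueError("all estimated coefficients are zero")
    return coef / total


# Toy signature: 4 genes x 3 cell types (made-up numbers).
signature = np.array([
    [10.0, 1.0, 0.0],
    [0.0, 8.0, 1.0],
    [1.0, 0.0, 12.0],
    [5.0, 5.0, 5.0],
])
true_props = np.array([0.6, 0.3, 0.1])

# Noise-free bulk profile built from the assumed linear mixing model.
bulk = signature @ true_props
estimated = estimate_proportions(bulk, signature)
```

On noise-free toy data the estimate matches the mixing weights used to build the bulk profile. With real data, a constrained or robust regression would be a more natural choice than clipping after an unconstrained fit.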
Affiliation(s)
- Maria K Jaakkola
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, FI-20520 Turku, Finland
- Laura L Elo
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, FI-20520 Turku, Finland
5
Chen Z, Wu A. Progress and challenge for computational quantification of tissue immune cells. Brief Bioinform 2021; 22:6065002. [PMID: 33401306; DOI: 10.1093/bib/bbaa358] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/28/2020] [Revised: 10/23/2020] [Accepted: 11/07/2020] [Indexed: 12/28/2022] [Open Access]
Abstract
Tissue immune cells have long been recognized as important regulators for maintaining balance in the body system. Quantifying the abundance of different immune cells will improve understanding of the correlation between immune cells and normal or abnormal conditions. Many computational methods have now been developed to predict tissue immune cell composition from bulk transcriptomes, so a summary of their advantages and disadvantages is appropriate. In addition, an examination of the challenges and possible solutions for these computational models will assist the development of this field. The common hypothesis of these models is that the expression of signature genes for immune cell types might represent the proportion of immune cells that contribute to the tissue transcriptome. We group all reported tools into three groups: reference-free, reference-based scoring, and reference-based deconvolution methods. This review summarizes all currently reported computational immune cell quantification tools together with their applications, limitations, and perspectives. Furthermore, it identifies several critical problems that have limited the performance and application of these models, including inadequate coverage of immune cell types, the collinearity problem, the impact of the tissue environment on immune cell expression levels, and the lack of standard datasets for model validation. To address these issues, tissue-specific training datasets that include all known immune cells, a hierarchical computational framework, and benchmark datasets that include both tissue expression profiles and the abundances of all immune cells are proposed to further promote the development of this field.
Affiliation(s)
- Ziyi Chen
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
- Aiping Wu
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
6
Saxena A, Ravutla S, Upadhyay V, Jana S, Murhammer D, Giri L. Statistical modeling of cell-to-cell variability in viral infection during passaging in suspension cell culture: Application in Monte-Carlo simulation. Biotechnol Bioeng 2020; 117:1483-1501. [PMID: 32017023; DOI: 10.1002/bit.27295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2019] [Revised: 12/13/2019] [Accepted: 02/03/2020] [Indexed: 11/09/2022]
Abstract
Packaging during the passaging of viruses in cell cultures yields various phenotypes and is regulated by viral protein expression in infected cells. Although such a packaging mechanism has a profound effect in controlling the virus yield, little is known about the underlying statistical models followed by virus packaging and protein expression among cells infected with the virus. A predictive framework is developed that combines identification of the probability density function (PDF) based on log-likelihood with use of the PDF for Monte-Carlo simulations. The Birnbaum-Saunders distribution was found to be consistent with all three virus packaging levels, including nucleocapsids/occlusion-derived virus (ODV), ODVs/polyhedra, and polyhedra/cell, for both wild-type and genetically modified AcMNPV. Next, it was demonstrated that PDF fitting could be used to compare two viruses having distinctly different genetic configurations. Finally, the identified PDF can be incorporated in RNA synthesis parameters for baculovirus infection to predict the cell-to-cell variability in protein expression using Monte-Carlo simulations. The proposed tool can be used for the estimation of uncertainty in the kinetic parameter and prediction of cell-to-cell variability for other biological systems.
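The Monte-Carlo step mentioned in this abstract can be sketched as drawing per-cell values from a Birnbaum-Saunders distribution. The sketch below uses the standard representation of that distribution as a transform of a standard normal variable. The function name is hypothetical, and the shape and scale values are arbitrary placeholders, not parameters fitted in the paper.

```python
import numpy as np


def sample_birnbaum_saunders(alpha, beta, size, rng):
    """Draw samples from a Birnbaum-Saunders(alpha, beta) distribution.

    Uses T = beta * (a + sqrt(a**2 + 1))**2 with a = alpha * Z / 2,
    where Z is standard normal; alpha is the shape, beta the scale.
    """
    z = rng.standard_normal(size)
    half = 0.5 * alpha * z
    return beta * (half + np.sqrt(half**2 + 1.0)) ** 2


# Placeholder parameters chosen only for illustration.
rng = np.random.default_rng(0)
samples = sample_birnbaum_saunders(alpha=0.5, beta=2.0, size=100_000, rng=rng)

# The median of the distribution equals beta, a quick sanity check.
sample_median = float(np.median(samples))
```

In a simulation of cell-to-cell variability, each sampled value would stand in for one cell's quantity of interest, and downstream model outputs would be summarized over the whole sample.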
Affiliation(s)
- Abha Saxena
  - Chemical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India
- Suryateja Ravutla
  - Chemical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India
- Vikas Upadhyay
  - Chemical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India
- Soumya Jana
  - Electrical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India
- David Murhammer
  - Department of Chemical and Biochemical Engineering, The University of Iowa, Iowa City, Iowa
- Lopamudra Giri
  - Chemical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India
7
Way GP, Greene CS. Discovering Pathway and Cell Type Signatures in Transcriptomic Compendia with Machine Learning. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021348] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
Pathway and cell type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell type and pathway expression but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools for signature discovery through their ability to provide accurate and interpretable results. In this review, we discuss various machine learning applications to extract pathway and cell type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single-cell RNA data. As data and computational resources increase, there will be more opportunities for machine learning to aid in revealing biological signatures.
Affiliation(s)
- Gregory P. Way
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
8
Ogundijo OE, Zhu K, Wang X, Anastassiou D. A sequential Monte Carlo algorithm for inference of subclonal structure in cancer. PLoS One 2019; 14:e0211213. [PMID: 30682127; PMCID: PMC6347199; DOI: 10.1371/journal.pone.0211213] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/02/2018] [Accepted: 01/03/2019] [Indexed: 11/19/2022] [Open Access]
Abstract
Tumors are heterogeneous in the sense that they consist of multiple subpopulations of cells, referred to as subclones, each of which is characterized by a distinct profile of genomic variations such as somatic mutations. Inferring the underlying clonal landscape has become an important topic in that it can help in understanding cancer development and progression, and thereby help in improving treatment. We describe a novel state-space model, based on the feature allocation framework and an efficient sequential Monte Carlo (SMC) algorithm, using the somatic mutation data obtained from tumor samples to estimate the number of subclones, as well as their characterization. Our approach, by design, is capable of handling any number of mutations. Via extensive simulations, our method exhibits high accuracy, in most cases, and compares favorably with existing methods. Moreover, we demonstrated the validity of our method through analyzing real tumor samples from patients from multiple cancer types (breast, prostate, and lung). Our results reveal driver mutation events specific to cancer types, and indicate clonal expansion by manual phylogenetic analysis. MATLAB code and datasets are available to download at: https://github.com/moyanre/tumor_clones.
Affiliation(s)
- Oyetunji E. Ogundijo
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Kaiyi Zhu
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
  - Department of Systems Biology, Columbia University, New York, NY, United States of America
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Dimitris Anastassiou
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
  - Department of Systems Biology, Columbia University, New York, NY, United States of America
9
Ogundijo OE, Wang X. SeqClone: sequential Monte Carlo based inference of tumor subclones. BMC Bioinformatics 2019; 20:6. [PMID: 30611189; PMCID: PMC6320595; DOI: 10.1186/s12859-018-2562-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/23/2018] [Accepted: 12/06/2018] [Indexed: 11/13/2022] [Open Access]
Abstract
Background: Tumor samples are heterogeneous. They consist of varying cell populations or subclones, and each subclone is characterized by a distinct single nucleotide variant (SNV) profile. This explains the source of genetic heterogeneity observed in tumor sequencing data. To make a precise prognosis and design effective therapy for cancer, ascertaining the subclonal composition of a tumor is of great importance. Results: In this paper, we propose a state-space formulation of the feature allocation model. This model is interpreted as the blind deconvolution of the expected variant allele fractions (VAFs). VAFs are deconvolved into a binary matrix of genotypes and a matrix of genotype proportions in the samples. Specifically, we consider a sequential construction of the genotype matrix, which we model by an Indian buffet process (IBP). We describe an efficient sequential Monte Carlo (SMC) algorithm, SeqClone, that jointly estimates the genotypes of subclones and their proportions in the samples. When compared to other methods for resolving tumor heterogeneity, SeqClone provides comparable and sometimes better estimates of model parameters. By design, SeqClone conveniently handles any number of probed SNVs in the samples. In particular, we can analyze VAFs from newly probed SNVs to improve existing estimates, an attribute not present in existing solutions. Conclusions: We show that the SMC algorithm for deconvolving VAFs from tumor sequencing data is a robust and promising alternative for explaining the observed genetic heterogeneity in tumor samples.
Affiliation(s)
- Oyetunji E Ogundijo
  - Department of Electrical Engineering, Columbia University, New York, NY 10027, USA
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, NY 10027, USA
10
Ogundijo OE, Wang X. Characterization of tumor heterogeneity by latent haplotypes: a sequential Monte Carlo approach. PeerJ 2018; 6:e4838. [PMID: 29868266; PMCID: PMC5984585; DOI: 10.7717/peerj.4838] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/12/2018] [Accepted: 05/02/2018] [Indexed: 12/16/2022] [Open Access]
Abstract
Tumor samples obtained from a single cancer patient spatially or temporally often consist of varying cell populations, each harboring distinct mutations that uniquely characterize its genome. Thus, observing more than two haplotypes in any given tumor sample, where a haplotype is defined as a scaffold of single nucleotide variants (SNVs) on the same homologous genome, is evidence of heterogeneity: humans are diploid, so we would observe at most two haplotypes if all cells in a tumor sample were genetically homogeneous. We characterize tumor heterogeneity by latent haplotypes and present a state-space formulation of the feature allocation model for estimating the haplotypes and their proportions in the tumor samples. We develop an efficient sequential Monte Carlo (SMC) algorithm that estimates the states and the parameters of our proposed state-space model, which are equivalently the haplotypes and their proportions in the tumor samples. The sequential algorithm produces more accurate estimates of the model parameters when compared with existing methods. Also, because our algorithm processes the variant allele frequency (VAF) of a locus as the observation at a single time-step, VAFs from newly sequenced candidate SNVs from next-generation sequencing (NGS) can be analyzed to improve existing estimates without re-analyzing the previous datasets, a feature that existing solutions do not possess.
Affiliation(s)
- Oyetunji E Ogundijo
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
11
Ogundijo OE, Wang X. Bayesian estimation of scaled mutation rate under the coalescent: a sequential Monte Carlo approach. BMC Bioinformatics 2017; 18:541. [PMID: 29216822; PMCID: PMC5721689; DOI: 10.1186/s12859-017-1948-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2017] [Accepted: 11/21/2017] [Indexed: 11/10/2022] [Open Access]
Abstract
Background: Samples of molecular sequence data of a locus obtained from random individuals in a population are often related by an unknown genealogy. More importantly, population genetics parameters, for instance the scaled population mutation rate $\Theta = 4N_e\mu$ for diploids or $\Theta = 2N_e\mu$ for haploids (where $N_e$ is the effective population size and $\mu$ is the mutation rate per site per generation), which explain some of the evolutionary history and past qualities of the population that the samples are obtained from, are of significant interest. Results: In this paper, we present the evolution of sequence data in a Bayesian framework and the approximation of the posterior distributions of the unknown parameters of the model, which include $\Theta$, via the sequential Monte Carlo (SMC) samplers for static models. Specifically, we approximate the posterior distributions of the unknown parameters with a set of weighted samples, i.e., the set of highly probable genealogies out of the infinite set of possible genealogies that describe the sampled sequences. The proposed SMC algorithm is evaluated on simulated DNA sequence datasets under different mutational models and on real biological sequences. In terms of the accuracy of the estimates, the proposed SMC method shows comparable and sometimes better performance than the state-of-the-art MCMC algorithms. Conclusions: We showed that the SMC algorithm for static models is a promising alternative to the state-of-the-art approach for simulating from the posterior distributions of population genetics parameters.
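The scaled mutation rate defined in this abstract is a simple product, which the following sketch evaluates for both ploidy cases. The function name and the numeric inputs are arbitrary illustrations and are not values from the paper; the paper estimates this quantity from sequence data rather than computing it from known inputs.

```python
def scaled_mutation_rate(effective_size, mutation_rate, diploid=True):
    """Return 4 * N_e * mu for diploids or 2 * N_e * mu for haploids.

    effective_size: effective population size N_e
    mutation_rate:  mutation rate per site per generation, mu
    """
    factor = 4 if diploid else 2
    return factor * effective_size * mutation_rate


# Illustrative inputs only: N_e = 10,000 and mu = 1e-8.
theta_diploid = scaled_mutation_rate(10_000, 1e-8)
theta_haploid = scaled_mutation_rate(10_000, 1e-8, diploid=False)
```

For the same effective size and per-site rate, the diploid value is twice the haploid value, reflecting the doubled number of gene copies in the population.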
Affiliation(s)
- Oyetunji E Ogundijo
  - Department of Electrical Engineering, Columbia University, New York, 10027, USA
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, 10027, USA